How best to read only selected columns in virtual

Good day
I need to read only a few, user-selected, columns of data from a spreadsheet.
Since the original spreadsheet may be very large it makes sense to do this

in virtual mode so as to save memory and time. However, I am wondering what the best way is to achieve this. I could try to iterate through the drawing grid to extract appropriate cell values (but don't think this is possible because as far as I know the values are not located by cell once imported). For a large spreadsheet it also makes sense to only show the first 50 so the grid would not have these in any case. The alternative seems to be to identify the cells after opening a spreadsheet and partially copying to the drawgrid and then to repeat the whole read process for only the columns of interest. I would transfer values to my own storage system (a local clientdataset) so as not to read all cell values into the drawgrid.

Any suggestions for a better way to do this?



I am not really sure I fully understand the situation, but some remarks:
1)Virtual mode reads one cell at a time from the file, tells you the value, and drops it. So you can't get the cell values back from the XlsFile object once you loaded it.

In virtual mode you would load the file into say a TList<> and then use that TList to read the values and fill the grid. (You can use a TGrid in virtual mode, so the data won't be there twice, it will only be in the TList, the "virtual mode" demo uses that).  Of course, if you load only the first 50 rows, then you won't have the rest and you will have to load the file again.

2)Make sure that you need virtual mode at all. FlexCel 6 is very optimized for loading huge files, we've tested it here with 1 million rows x 100 columns, and while of course it is a little slow, it doesn't use that much memory. It will for sure use less memory than a TClientDataSet.
So it might make sense to first measure times and memory, and see if normal mode isn't enough. You load the full file, then show some values on the grid, say the first 50 rows (by using the grid in virtual mode, you don't have to have those values twice in memory), and then when you need to process all the values, just use the values in the TXlsFile. This might be faster than the alternatives, so it might be worth measuring.

3)Another solution is, as you say, to load the data into a client dataset, or a database, or a TList, and use that to feed the grid. Also as you say, you can do it 2 ways:
3a)Load the first 50 rows, fil the grid, display it to the user. If the user presses Ok, then reload all the rows.

3b)Load all the rows, display only the first 50, and if the user presses Ok, process the already loaded data.

Whether 3a) or 3b) are better in your case depends in how often the user presses "cancel":  If most of the time he presses ok, then it might be faster to load everything at once, since laoding it twice will be a little slower. But I thing 3a): loading it twice is the best option if you want to go with virtual mode. Reading the first 50 rows will be instant, so you can just do it twice. The first pass you load 50 rows and show them to the user, if he presses ok, you load all the file again. The extra time because you loaded the first 50 rows twice won't be noticeable, but for big files the time to show the file to the user will be better.

You might even do the first load normal, and if the user presses ok, load the second time from a thread, so the user can keep doing things (probably selecting the next xls file) while you load the data in the background. This will make the app more responsive.

As said, I am not 100% sure I undestood the idea, if I answered the wrong question, just let me know :)


  thanks for the response. I am intrigued that the Flexcel implementation for accessing a spreadsheet is so fast. I had also not anticipated that it is so much faster than populating and using a clientdataset in memory.
Your suggestion to read twice with a first read to show a user the basic layout of the spreadsheet followed by importation of the relevant columns sounds like the best way to go.
I don't need to read multiple spreadsheets rapidly one after another but rather to be able to read occasional very large spreadsheets for instance one published data compilation contains almost 200 000 rows. It would be nice to be able to do all the calculations using just the spreadsheet if access is so fast but I do need to select only a few columns (typically only 3 for this purpose) and to check for and screen out certain values. I also need to then iterate through the accepted values several times as I do Gaussian summations across multiple records so as to produce probability density plots and histograms. Clientdatasets are the most convenient for these latter operations although I might need to start thinking of doing most of the calculations with dynamic data arrays.
Whatever the case, thanks for the explanations and suggestions. I will be able to come up with a better working solution now.