Fastest transfer of xlsx to database

I need to transfer data to SQLite as quickly as possible.
I assume two code passes.
First to determine the type of data
The second one for data filling.
The settings are:
     xls.VirtualMode: = True;
     xls.IgnoreFormulaText: = True;
The code "OnCellRead" is as follows.


procedure TCellReader.OnCellRead(const sender: TObject; const e: TVirtualCellReadEventArgs);
var
  iRow, iCol : Integer;
  Clr        : TUIColor;
  s          : String;
  b          : boolean;
  dt2        : TDateTime;
  d2         : TDate;
  t2         : TTime;
  realN      : Extended;
  fmt        : TFlxFormat;
begin
  if e.Cell.Sheet <> fSheetToRead then
    begin
    e.NextSheet           := ''; //Stop reading all sheets.
    exit;
    end;
  iRow                    := e.Cell.Row;
  iCol                    := e.Cell.Col;
  case e.Cell.Value.ValueType of
    TCellValueType.Empty       :
        s                 := '';
    TCellValueType.Number      : 
        begin
        fmt               := excelFile.GetFormat( e.Cell.XF );
        if TFlxNumberFormat.HasDateOrTime( fmt.Format ) then
          dt2             :=e.Cell.Value.ToDateTime( false )
         else
           if TFlxNumberFormat.HasDate(fmt.Format) then
             d2           := TDate( e.Cell.Value.ToDateTime( false ) )
           else
             if TFlxNumberFormat.HasDate(fmt.Format) then
               t2         := TTime( e.Cell.Value.ToDateTime( false ) )
             else
               realN      := e.Cell.Value.ToNumberInvariant;
        end;
    TCellValueType.DateTime    :
        dt2               := e.Cell.Value.ToDateTime( false ); // Spravny  'dd/mm/yyyy hh:mm'. Inak sa neda poznat. Je to cislo dt:=e.Cell.Value.ToDateTime( True);    // ZLY
    TCellValueType.StringValue : 
        s                 := e.Cell.Value.ToSimpleString;// .ToString;
    TCellValueType.Boolean     : 
        b                 := e.Cell.Value.AsBoolean;
    TCellValueType.Error       : 
        s                 := TCellValue( e.Cell.value.AsFormula.FormulaResult ).ToSimpleString;
    TCellValueType.Formula     : 
        case e.Cell.value.AsFormula.formulaResult.ValueType of
          TFormulaValueType.Empty       : 
              s           := '';
          TFormulaValueType.Number      : 
              begin
              fmt         := excelFile.GetFormat( e.Cell.XF );
              if TFlxNumberFormat.HasDateOrTime( fmt.Format ) then
                dt2       :=  TCellValue( e.Cell.value.AsFormula.FormulaResult ).ToDateTime( false )
              else
                if TFlxNumberFormat.HasDate(fmt.Format) then
                  d2      := TDate( TCellValue( e.Cell.value.AsFormula.FormulaResult ).ToDateTime( false ) )
                else
                  if TFlxNumberFormat.HasDate(fmt.Format) then
                    t2    := TTime( TCellValue( e.Cell.value.AsFormula.FormulaResult ).ToDateTime( false ) )
                  else
                    realN := e.Cell.value.AsFormula.FormulaResult.AsNumber;
                  end;
          TFormulaValueType.StringValue :
              s           := TCellValue( e.Cell.value.AsFormula.FormulaResult ).ToSimpleString;
          TFormulaValueType.Boolean     : 
              b           := e.Cell.value.AsFormula.FormulaResult.AsBoolean;
          TFormulaValueType.Error       : 
              s           := TCellValue( e.Cell.value.AsFormula.FormulaResult ).ToSimpleString;
          end;
    end;
end;

It is not final, but represents the ability to load all types of cells(I hope)
Please confirm that this is the right solution..?
Or something can be improved.?
I believe that this topic will also benefit other users

If you understand Czech miroB here: http://forum.delphi.cz/index.php/board,10.0.html
Or address him in private English

Hi,

The solution seems fine at first sight, but let me say some things that might be interesting:

1. Before, a warning: Most of what I will say here makes no difference in practice. While approach a might be in theory faster than b, normally the differences can be so little that they won't matter. You normally need to profile to see where the bottlenecks are.

2. Virtual mode is normally not faster than normal mode, in fact it might be a little slower. This is because we need to call an external event and that is expensive. We also can do some optimizations in normal mode that we can't if we have to call an event for every cell. But, I refer to point 1: Difference is normally not noticeable. And while virtual mode isn't faster, it does use much less memory.

3. The main issue I see here is with the datetime handling. Knowing if a cell has a date is expensive, as TFlxNumberFormat.HasDateOrTime  has to check the format string, something like "hh:mm:ss" and see if the format is a date or a time, or a number. And you are checking the format string 3 times: HasDate, HasDateTime and HasTime.

While FlexCel can't improve much the performance of those calls, because if the cell is a date or not is not stored in the file, what you can do is call them as little as possible. And this normally implies caching.

The first idea would be to have an enum:
TFormatType = (undefined, date, time, datetime), and then a Dictionary<string, TFormatType> where you cache what a format string is. So for example the first time you read a format 'dd/mm/yyyy', you check it is not in the dictionary, so you call IsDate/DateTime/Time, and store the result as a TFormatType in the dictionary. The next time you find the same format string, you can read the value directly from the dictionary.

But that is not as fast as it could be. Strings are kind of slow in Delphi (because of the referencing counting) and a dictionary needs to has the format string every time, which isn't cheap either.

So a better idea would be to have an array of XF:
  Formats: Array[0..4000] of TFormatType;

And then you check:
if XF is < 0 then it is a number (general format)
if XF is >= Length(Formats) you have to check manually. 
else if Formats[XF] = undefined then you check the value and store there, 
else has date/time/etc = Formats[XF]

Note: I used 4000 here because your file is not likely to have more than 4000 different formats. We could have used a list instead of a fixed array to 4000, but fixed arrays are faster. But as we used a fixed array, you need to check you don't have more than 4000 XF formats. If some XF is > 4000, you can't use the cache for that value.

4. In my experience, the bottleneck here will be likely inserting the data to sqlite, more than the read part. It's been a long time since I last used SQLite in my projects so I won't comment more here, but make sure that you pass the data in an efficient way. For example, it is likely faster to read many rows and pass them all together than one by one.

5. The biggest "silent" performance killer in Delphi is when you have strings and exceptions in the same method. Then the method can get order of magnitude slower. So make sure your code is not throwing exceptions directly (if you need to throw an exception, throw it in a separate method).  Basically you have to profile the method and see that it doesn't have a big initialization/cleanup block. I always recommend this really good article about it: https://www.delphitools.info/2009/05/06/code-optimization-go-for-the-jugular/

Hi,

I personally don't read Czech, but if I understood google translate right, the alternatives evaluated in that post are:

1. Using OLE automation with late binding  - This is the slowest solution.
2. Using OLE automation with early binding - A little faster, but still really slow
3. Using ADO. Faster than OLE, but still much slower than a native Delphi solution like FlexCel.

So yes, ADO is much faster than OLE, but if you are looking for the fastest way to transfer an xlsx file to a database, native delphi components like FlexCel or similars are faster. There are no multiple layers between your code and the xls file (database driver, dispatch, etc): a native component is as fast as it gets (and you get more flexibility in the import/export too).

Of course many times it doesn't matter: I mean if a solution takes 1 second to export a file and the other 2 seconds, one is 100% faster than the other, but your users are not likely to see a difference. The most important advantages of FlexCel over OLE or ADO are not that much raw speed, but that you have much more power to customize, and that you don't need to install Excel or ADO drivers in the customer's machines. But when looking for raw speed, FlexCel is faster anyway.

And just to be clear: there are cases where using ADO or OLE make sense, but fastest way to create or read files isn't one of those.

Hi Adrian, first of all thank you very much for the quick and comprehensive response. 

  • Of course I accept comments on array of XF:  Formats
  • I will use Flexcel's Virtual mode in case of big data transfer
  • For common data reading, I use the Flexcel's normal mode. .
  • I also have my own tool that is extremely fast but only handles the correct xlsx format. Neither xls nor marginal excel (xlsx) formats
  • OLE late binding (LB) is slow and unsuitable for my purposes
  • I only use OLE early binding (EB) to work with Excel, which is currently open. If the data is not already on the disk..
  • SQLite with array DML is very fast. See :http://docwiki.embarcadero.com/RADStudio/Rio/en/Array_DML_(FireDAC)
  • ADO is problematic in terms of installation and other things. I have been using it for cca 20 years
  • ADO -  It has to be said that it was pretty fast. But the Access engine has different limits. On the other hand, nice compatibility with Excel functions. About 100 VBA functions
  • Finally, I would remind that I work with local data. I do not use client / server
  • Thanks to FireDAC I have SQLite database very well compatible with Excel
  • And.. SQLite allows me to replace VBA functions with my own

Thank you again for your great approach

Miro