Problem opening text file

Hi,


when I try to open a certain csv file (below a download link)  the xls object indicates that RowCount is 0. When I open more or less the same file RowCount is as expected.

 SetLength(colTypes, 2000);
 for i := 0 to 1999 do  colTypes := TColumnImportType.Text;
xls.Open(filename, TFileFormats.Text, #9, 1, 1, colTypes)

What is wrong with the file?


Regards,
Stephan

Hi,

Nothing is wrong with the file, it is just that it is not in UTF8 encoding, and FlexCel by default tries to open it as if it was UTF8-encoded.

This is actually one of the biggest problems with using CSV: You need to know which encoding was used to create the file. In this case I am not really sure on what encoding the file has, but it is not UTF8, since it has sequences of characters which are invalid in UTF8. When you try to open a file with encoding UTF8 and it has characters that are invalid in a UTF8 string, Delphi will just reject the full thing and return an empty string. And this is why rowcount is 0: FlexCel is looking at an empty string.

Normally, if the file was created in a "western" locale, the encoding is Win1252:https://en.wikipedia.org/wiki/Windows-1252
But it might be 1251 for example if it is a russian locale.

If you try with win1252 encoding, the number of rows is correct:



program Project59;


{$APPTYPE CONSOLE}


{$R *.res}


uses
  System.SysUtils,
  VCL.FlexCel.Core,
  FlexCel.XlsAdapter;


var
  xls: TXlsFile;
  i: integer;
  colTypes: Tarray<TColumnImportType>;
  Encoding: TEncoding;
begin
  xls := TXlsFile.Create(true);
  try
    SetLength(colTypes, 2000);
    for i := 0 to 1999 do  colTypes := TColumnImportType.Text;
    Encoding := TEncoding.GetEncoding(1252);
    try
      xls.Open('r:\Elp_95.csv', TFileFormats.Text, #9, 1, 1, colTypes, nil, Encoding, true);
    finally
      Encoding.Free;
    end;
    WriteLn(xls.RowCount);
    ReadLn;
  finally
    xls.Free;
  end;


end.




But I am not really sure the text will be the correct one. You will only get the correct text if you specify the correct encoding which was used to create the file.

Thanks for super fast answer and the solution. In Notepad++ the encoding is indicated as ANSI, so not UTF-8. However, many other csv files I process have the same ANSI encoding and they can be imported by TXlsFile without any problem. Anyway, I am happy to know what I can do to get it working.

Stephan

Glad to know you got it working, but let me clarify a little more about encodings. 


First of all, there is no such thing as "ANSI" encoding: "Ansi" means any Windows encoding. It could be win1232, win1231, whatever, they are all "ANSI". Normally what people refers with ANSI is (and what you get if you use TEncoding.ANSI) is the default encoding in your machine. But if your machine is in Win1232 and the machine that generated the file is in Win1231, then you might have problems. This is why it is important to know the locale of the machine that generated the file. (and if possible, convince them to use utf8...)

Second, you can't really know which encoding the file has if you don't know which encoding was used to create it, so you can't get to know the encoding by opening it in Notepad++. Notepad++ can try to guess the encoding, but it can't really know it.

What we have is:
ASCII: This is a 128 character encoding (from char 0 to 127), and those first 128 characters are the same in UTF8 or any ANSI encoding. So as long as your file doesn't have special characters like a ñ, you can use either UTF9, ASCII or any ANSI encoding and it will work. That's why you can import those other files without any problem: They don't have characters outside the ASCII range.

ANSI: This is a group of encodings whose first 128 characters are the same as in ASCII, but the characters from 128 to 255 are different depending on the locale/country.

So imagine that I export a file containing the text AÑO in my Win1252 locale. As you can see from here:
https://en.wikipedia.org/wiki/Windows-1252
The A is encoded as 65, the Ñ as 209 (outside the ASCII range from 0 to 127) and the O as 79.

So I send you this file with a 65, 209 and 79 to you.

Now, let's imagine your locale is 1251 (cyrillic) https://en.wikipedia.org/wiki/Windows-1251
You get my file, and you decode it as:
65: A
209: С (note that this is not a normal "C" but a different character)
79: O

So you read AСO instead of AÑO

As said, there is no way to know what is the encoding used to create the file, if you only have the final file. It might be a file that says AСO or AÑO, and you can't know. If you open it in NotePad++ it will likely use the encoding in your machine, so if you are in a cyrillic machine it will show you AСO and if you are in a western machine it will show  AÑO (and always show the encoding as ANSI, because both are ANSI encodings)

UTF8 on the other hand is a mutibyte encoding, where again, the first 127 characters are the same as ascii. But after that, you can have multiple bytes to represent the rest of the characters. The nice thing about UTF8 is that it doesn't change with the locale: There is only one UTF8 and it can represent every unicode character. (While there are many ANSI encodings and every on can represents the characters of a group of languages).

So if possible, it is good to tell the people producing CSV to use UTF8. If you can't, you should really know what encoding they used (unless it is all ASCII characters), or the characters from 128 to 255 are going to be wrong. If you can't know which ANSI encoding was used, normally Win1252 is the best choice, since it is used in most western locales. But it is a guess.