What happened to the signature?

I just found out that a file created by FlexCel is now sending the standard HEX signature, so I can recognize it as an XLSX document.



This is the standard HEX signature to recognize a Microsoft Office 2007 document:



        ' If this is a Microsoft Office 2007 document
        If Mid(lcString, 1, 14) = Chr(80) + Chr(75) + Chr(3) + Chr(4) + Chr(20) + Chr(0) + Chr(6) + Chr(0) + _
        Chr(8) + Chr(0) + Chr(0) + Chr(0) + Chr(33) + Chr(0) Then



Can you explain?

...I meant "NOT" sending the standard HEX signature

If you look at this page, you will see that the bytes from the 7th onward are not what they should be:



http://www.garykessler.net/library/file_sigs.html



For the 7th byte, you put 0 instead of 6.

For the 8th byte, you put 8 instead of 0.

Etc.



Is this a new protocol?

Hi,


As the page you pointed says, an xlsx file is just a zip file. You can rename a .xlsx file to .zip and extract it with any zip tool.

Now, zip files don't have a header at the top; the header is at the bottom. This allows people to create files that are an image when you name them with a .jpg extension and a compressed file when you rename them to .zip (the headers of jpg are at the top, the headers of zip are at the bottom).

You can't assume anything about the start of a zip; the only reliable data is at the end (and it isn't in a fixed position either, as the zip file could have a comment after the end-of-file header).

What normally happens is that, even though the start of a zip file might be anything, you start by putting the first compressed file there. Every compressed file has this header:
http://en.wikipedia.org/wiki/Zip_(file_format)#File_headers

0x04034b50 in little-endian notation is:
50 4b 03 04
After that comes the minimum version needed to extract: 14 00 (the value 0x0014, stored little-endian).
After that comes a general purpose bit flag, which is the field you are seeing differently.
In that page, it is 06 00 (the value 0x0006).
Going by the bytes you report, we write 00 08 (the value 0x0800).

If you search the zip specification: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
for
4.4.4 general purpose bit flag: (2 bytes)
you'll see the meaning of every bit. (Beware, this is in little-endian order.)
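
To make the field layout concrete, here is a minimal Python sketch (Python rather than VB, purely for brevity) that unpacks the first 10 bytes of a local file header. The function name `parse_local_header_start` and both sample byte strings are illustrative assumptions: the first follows the signature from the page you linked, and the second uses the flag bytes reported in this thread (0 then 8) with the compression method bytes assumed to be unchanged. It is not a dump of a real file.

```python
import struct


def parse_local_header_start(data: bytes) -> dict:
    """Decode signature, version needed, flags and method from the first 10 bytes."""
    # "<IHHH" = little-endian: 4-byte signature, then three 2-byte fields.
    signature, version, flags, method = struct.unpack_from("<IHHH", data, 0)
    return {
        "signature": signature,
        "version_needed": version,
        "flags": flags,
        "method": method,
        # Bits 1-2: compression option hints for deflate (APPNOTE 4.4.4).
        "deflate_option_bits": (flags >> 1) & 0b11,
        # Bit 11: file names and comments are UTF-8 (APPNOTE 4.4.4).
        "utf8_names": bool(flags & (1 << 11)),
    }


# Sample 1: the 10 bytes listed for Office documents on the signature page.
office_style = bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00, 0x08, 0x00])
# Sample 2 (assumption): same header with the flag bytes reported in this thread.
reported_style = bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x00, 0x08, 0x08, 0x00])

office_fields = parse_local_header_start(office_style)
reported_fields = parse_local_header_start(reported_style)
```

Both samples decode to the same signature and version; only the flag field differs, which is why a byte-for-byte comparison past the first four bytes is fragile.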

What happens is simply that we use different compression options than Excel. In general, FlexCel compresses the file more, creating smaller xlsx files.

But you just can't use that header to identify an xlsx file. A still bad but better solution would be to use only the file header "50 4b 03 04". Even the next 2 bytes (the version) might change if the minimum version needed to extract changes, so you can't trust that they will always be the same.

But even using only the file header is wrong: while it would probably detect most files correctly, a zip file doesn't really need to start with a file header. You could put random data there: what matters is what you write at the end of the file (a directory list where you have the offset of every file header in the file).

And it gets worse. All of this will allow you to detect that the file is zip, not xlsx. There are many other file formats that are also zip (including zip files themselves) which might pass this check.

Sadly the only real way to know if a file is xlsx is to unzip it first, then look for the file [Content_Types].xml
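
As a rough illustration of that check, here is a minimal Python sketch using the standard `zipfile` module. The function name `looks_like_ooxml` is an illustrative assumption. Note that finding `[Content_Types].xml` only tells you the file is an Office Open XML package (xlsx, docx, pptx and so on all have it); telling those apart still means reading the XML inside.

```python
import io
import zipfile


def looks_like_ooxml(source) -> bool:
    """Return True if `source` (a path or a seekable binary file object) is a
    zip archive containing a [Content_Types].xml entry."""
    if not zipfile.is_zipfile(source):
        return False
    try:
        with zipfile.ZipFile(source) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        # The end record was found but the directory itself is damaged.
        return False


def make_zip(names) -> io.BytesIO:
    """Helper for the demo: build a small in-memory zip with the given entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w") as archive:
        for name in names:
            archive.writestr(name, "<placeholder/>")
    buffer.seek(0)
    return buffer


package_like = looks_like_ooxml(make_zip(["[Content_Types].xml", "xl/workbook.xml"]))
plain_zip = looks_like_ooxml(make_zip(["readme.txt"]))
not_a_zip = looks_like_ooxml(io.BytesIO(b"just some text"))
```

The library locates the archive from the end of the file, so this does not depend on what the first bytes are.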


Yes, this is exactly what my class does. I mentioned that header as being for Microsoft Office 2007, but in reality I should have said it is only the first condition (PKZip) that lets the class go on to unzip the file, open the XML and so on, in order to detect XLSX, DOCX, PPTX, etc.



Now, if I understand correctly, it has worked so far in my class because all my tests, as well as all files processed so far in production, had a PKZip header I was able to recognize. You mentioned that checking the start of the file might not work in exceptional cases.



From that, to make it work, it seems I would need to check only the first 6 bytes of the file to detect whether it is a zip. Then it would enter my condition code, unzip, open the XML, etc. The goal is simply to have the first condition pick it up. So, instead of verifying the first 14 bytes, I should verify the first 6 only. If I do that, it would recognize your file as a zip as well, and the final detection would eventually identify it as an XLSX. Is this what you recommend?

The zip file works like this:


<data>
Central directory
   File1: It is at offset n in data
   File2: It is at offset m in data
...
Zip header
Offset of central directory
comment.

So, in theory, files can be anywhere in the zip file. You read a zip file by starting at the bottom, finding the central directory, and from there the offset of every file.

Now, in most cases, the offset of the first file will be 0. It makes sense, what else would you write but a list of files? 
So, you'll find a local file header there: 50 4b 03 04

So looking for that file header will work in most cases. But I could have a completely valid xlsx file which is:
100 bytes of random data, which might be some copyright text.
50 4b 03 04
...
Central file directory
  File1: offset 100

And it would work, but not be detected by your routine.
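
To see this in practice, here is a small Python sketch that writes 100 bytes of arbitrary text and then appends a zip archive after it. It relies on the documented behaviour of `zipfile.ZipFile` in mode `"a"`, which appends a new archive when the target is not already a zip; the helper name `build_prefixed_zip` and the sample contents are illustrative assumptions.

```python
import io
import zipfile

LOCAL_FILE_HEADER = b"PK\x03\x04"  # 50 4b 03 04


def build_prefixed_zip(prefix: bytes) -> bytes:
    """Write `prefix` first, then append a one-file zip archive after it."""
    buffer = io.BytesIO()
    buffer.write(prefix)
    # Mode "a" on something that is not a zip appends a brand-new archive,
    # so the recorded offsets already account for the prefix.
    with zipfile.ZipFile(buffer, mode="a") as archive:
        archive.writestr("hello.txt", "hello")
    return buffer.getvalue()


data = build_prefixed_zip(b"(c) example copyright text".ljust(100, b" "))

# The naive check on the first bytes fails for this file...
starts_with_header = data.startswith(LOCAL_FILE_HEADER)

# ...but a reader that starts from the central directory opens it fine.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
    first_offset = archive.infolist()[0].header_offset
```

Here the local file header sits at the offset recorded in the central directory rather than at the start of the file, so a prefix comparison misses an archive that any zip tool would still read.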
But the problem is that the "correct" way to detect a zip file isn't straightforward either, because the header isn't at the very end. At the very end you have a comment (which, again, is empty in 99.99% of files, but you could have some data there).

So to find a zip file, you need to scan roughly the last 64 KB of the file, from the bottom up, looking for the signature (0x06054b50). The signature can be anywhere in there.
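
A minimal Python sketch of that backward scan, assuming the whole file is already in memory as `bytes`. The function name `find_end_of_central_directory` is an illustrative assumption. The search window is the 22-byte fixed record plus the largest possible comment (65535 bytes, since the comment length is a 2-byte field), and a candidate is accepted only when its comment length runs exactly to the end of the file.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # 0x06054b50 stored little-endian
EOCD_FIXED_SIZE = 22            # size of the record without the comment
MAX_COMMENT = 0xFFFF            # comment length is a 2-byte field


def find_end_of_central_directory(data: bytes):
    """Scan backwards for the end record; return its fields or None."""
    tail_start = max(len(data) - (EOCD_FIXED_SIZE + MAX_COMMENT), 0)
    pos = data.rfind(EOCD_SIGNATURE, tail_start)
    while pos != -1:
        if pos + EOCD_FIXED_SIZE <= len(data):
            # signature, 4 x uint16 (disk and entry counts), 2 x uint32, uint16
            fields = struct.unpack_from("<4s4H2LH", data, pos)
            comment_len = fields[7]
            # Accept only if the comment ends exactly at the end of the file.
            if pos + EOCD_FIXED_SIZE + comment_len == len(data):
                return {
                    "offset": pos,
                    "entries": fields[4],
                    "cd_size": fields[5],
                    "cd_offset": fields[6],
                    "comment_len": comment_len,
                }
        # Otherwise keep looking for an earlier occurrence in the window.
        pos = data.rfind(EOCD_SIGNATURE, tail_start, pos)
    return None


# Demo data: a one-file archive with a non-empty comment after the end record.
demo = io.BytesIO()
with zipfile.ZipFile(demo, mode="w") as archive:
    archive.writestr("hello.txt", "hello")
    archive.comment = b"hello"
demo_bytes = demo.getvalue()

record = find_end_of_central_directory(demo_bytes)
missing = find_end_of_central_directory(b"not a zip file at all")
```

This sketch ignores Zip64 archives and multi-disk archives, so treat it as an outline of the idea rather than a complete detector.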

If you are doing this in .NET or Delphi, we have a Zip class in FlexCel itself which you could use to detect whether the file is a valid zip the "correct" way. There is a "TryOpen" method which returns true if the file is a valid zip.

Thanks, I will try to get more info on the 64 KB detection for that header. For now, the routine works with the FlexCel-created XLSX file.