Performance of Blake2B in TMS Cryptography Pack – request for DLL alternative or optimization

Dear TMS Support,

While testing the new version of TMS Cryptography Pack (5.1.0), I encountered a significant performance difference compared to the older version (prior to 5.0.9.5), which used RandomDLL.dll.

  • With the older DLL implementation, computing 1,000,000 Blake2B hashes on ~1 KB input took about 3–4 seconds (both tests were Win64 Release builds).
  • With the new Delphi-only implementation, the same test takes ~56 seconds on the same machine (with 64 GB RAM).
  • In my application I process around 200,000 photographs (up to 20 MB each, ~700 GB total). This performance gap makes the new implementation impractical for large-scale workloads.
    I would kindly ask you to verify this comparison yourselves. If needed, I can provide my own source/test code (loop of 1,000,000 iterations, input size ~1 KB).

Could you please consider:

  • keeping support for the legacy DLL backend (RandomDLL.dll) as an optional alternative for performance‑critical scenarios,
  • or exploring optimization of the Blake2B core in C/C++ with SIMD instructions (e.g. SSE2, AVX2, AVX‑512). These instruction sets are widely used in cryptographic libraries to accelerate hashing of large data blocks, and they can deliver throughput in the hundreds of MB/s range, which is crucial when hashing multi‑megabyte files.

For my projects, the hash output format (base64url, 44 bytes) must remain unchanged, but the performance difference between the old and new implementations is critical. I appreciate the improvements in other TMS components, but for Blake2B I would greatly value either a faster backend or the option to continue using the DLL approach.

Thank you for considering this request, and I look forward to your response.

Best regards,
Miro

Hi Miro,
Thanks for spotting this performance issue.
The difference is not due to RandomDLL, which only calls Windows random functions. It is due to the port from C to Delphi of the Blake2b functions that were in the old HashObj.pas file.
Note that it is a general issue, not specific to Blake2.
I will look at possible options to speed up this hash function (possibly with different data structures), but it will never match the C implementation.

I will also consider generating a DLL with the old (revised) C code, but that's not gonna be an immediate endeavor.

Regards,
bernard

Miro, I tidied up the code and even found a buffer overflow, but I don't think this will speed up the execution. The major issue is with the internals, which can be coded efficiently in C but not so much in Pascal. Note that the 64-bit implementation is slightly faster.

Hi Bernard,

Performance issue with Blake2B in CP v5.1.0.2

Problem Description

There is a significant performance degradation when computing Blake2B hashes using Cryptography Pack (CP) v5.1.0.2 compared to CP v4.3.3.0. Under my own testing conditions, the newer version is almost 20 times slower.

Test Conditions (my environment)

  • testPC: running CP v5.1.0.2
  • mainPC: running CP v4.3.3.0
  • Both are my own computers, with comparable hardware (64 GB RAM each).
  • Compilation done with Delphi 12 Pro, Win64/Release build.
  • Test program uses TMemoryStream (not File).

Steps to Reproduce

  1. Compile a simple test program in Delphi 12 Pro.
  2. Run a cycle of 1,000,000 calls.
  3. Hash source text size: 10 kB.
  4. Parameters:
    • Key := sKey
    • HashSizeBytes := 44
    • OutputFormat := base64url
    • Unicode := noUni

Expected Result

Hash computation in CP v5+ should perform at least comparably to CP v4.3.3.0, ideally faster due to improvements and removal of DLL dependency.

Actual Result

  • CP v4.3.3.0: 5443 ms
  • CP v5.1.0.2: 105731 ms
    → CP v5+ is ~20× slower.

With smaller input (~70 B), CP v5+ is still ~7× slower.
With larger input (~15 MB, typical for photo management), computation time increases drastically (hours in CP v4.3.3.0, expected to be even worse in CP v5+).
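The figures above can be converted into throughput to make the gap easier to compare across input sizes. The short Python sketch below does that arithmetic under two assumptions that are not stated in the thread: "10 kB" is taken as 10,000 bytes, and hashing time is assumed to scale linearly with input size when extrapolating to 1,000,000 hashes of 15 MB inputs. The function name `throughput_mb_per_s` is hypothetical.

```python
def throughput_mb_per_s(iterations: int, input_bytes: int, elapsed_ms: float) -> float:
    """Total bytes hashed divided by elapsed time, in MB/s (1 MB = 1e6 bytes)."""
    total_bytes = iterations * input_bytes
    return total_bytes / (elapsed_ms / 1000.0) / 1e6


ITERATIONS = 1_000_000
INPUT_BYTES = 10_000          # assumption: "10 kB" = 10,000 bytes
V4_MS, V5_MS = 5443, 105731   # measured times reported above

v4_rate = throughput_mb_per_s(ITERATIONS, INPUT_BYTES, V4_MS)
v5_rate = throughput_mb_per_s(ITERATIONS, INPUT_BYTES, V5_MS)
slowdown = V5_MS / V4_MS

# Linear extrapolation (assumption): 1,000,000 hashes of 15 MB inputs.
LARGE_BYTES = 15_000_000
v4_hours = ITERATIONS * LARGE_BYTES / (v4_rate * 1e6) / 3600
v5_hours = ITERATIONS * LARGE_BYTES / (v5_rate * 1e6) / 3600

if __name__ == "__main__":
    print(f"v4.3.3.0: {v4_rate:.0f} MB/s, v5.1.0.2: {v5_rate:.0f} MB/s")
    print(f"slowdown: {slowdown:.1f}x")
    print(f"15 MB x 1M hashes: {v4_hours:.1f} h (v4) vs {v5_hours:.1f} h (v5)")
```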

Notes

  • These results reflect my own testing conditions (testPC and mainPC).
  • For certainty, I tested the performance of the sample program (exe) in both CP versions on both PCs. The difference between the two machines was only about 2% (insignificant). All reported numbers are from the faster PC.
  • CP v5+ is appreciated for being pure Delphi (no DLLs).
  • However, the performance regression is concerning.
  • A resolution or compromise would be highly valued.

Best regards,
Miro

P.S. For now, I must keep the old TMS All Access version on my main PC.

Hi Miro, thanks for the benchmark of the two versions.
Did you notice an improvement between 5.1.0.2 and 5.1.0.1?
As I said before, there is no magical trick to speed up the internals of Blake2b in Delphi. However, I will investigate possible options.
Regards,
bernard

Hi Bernard,
I don’t have the original text as input. I generated another one, slightly smaller (up to 10 kB), and used it for all the tests. Altogether, I only have these 4 possible scenarios.

Today I upgraded the newer PC to v5.1.0.2, but I still have the earlier debug version v5.1.0.1. Here are the complete results:

  • For v5.1.0.2 Win64/Release: 65140 ms
  • For v5.1.0.1 Win64/Debug: 91100 ms (Debug-based exe from yesterday before the upgrade)
  • For v4.3.3.0 Win64/Release: 4221 ms
  • For v4.3.3.0 Win64/Debug: 4334 ms

Regards,
Miro

Hi,
Here is my own test, that unfortunately confirms yours:
10,000 bytes, 64 byte hash, 1M iterations, Delphi: 431,985 ms
10,000 bytes, 64 byte hash, 1M iterations, C: 32,021 ms
1,000 bytes, 64 byte hash, 1M iterations, Delphi: 29,266 ms
1,000 bytes, 64 byte hash, 1M iterations, C: 3,319 ms

An option to get a better throughput is to call a Blake2b executable from Delphi.
Another option would be to mix C++ and Delphi code. I have tried this but gave up due to the complexity of the approach. I am also skeptical about the portability.

Hi,
With a rewrite of CompressRound and the disabling of overflow checks (otherwise it crashes), the test result is:
10,000 bytes, 64 byte hash, 1M iterations, Delphi (Release mode, all): ~255,000 ms
1,000 bytes, 64 byte hash, 1M iterations, Delphi: ~28,000 ms

Hi.
Okay, thank you. However, for practical calculations, I still have to use the CP 4+ version due to speed.

OK. I shaved off a few percent with another overflow check removal, but I am close to the maximum gain with Delphi.
Do you have C++ or Delphi only?

Hi, Delphi only

Bad luck, can't mix Delphi and C++ then.
I released 5.1.0.3 yesterday and it is another 10% faster on 1K blocks and about 5% faster on 10K blocks. However, we are still ~8 times slower than C.
An option is to use parallelisation for different digest computations, but the gain depends very much on the CPU.
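The idea of computing independent digests in parallel can be illustrated as follows. This is only a sketch in Python with the standard library, not Delphi and not the TMS API; the names `digest_one` and `digest_many` and the worker count are hypothetical. It relies on CPython's `hashlib` releasing the global interpreter lock for larger inputs, so the actual speedup depends on the interpreter, the input sizes and the number of cores.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest_one(data: bytes) -> str:
    """BLAKE2b digest (64 bytes) of a single buffer, as hex text."""
    return hashlib.blake2b(data, digest_size=64).hexdigest()


def digest_many(buffers, workers: int = 4):
    """Hash independent buffers on a thread pool.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest_one, buffers))


if __name__ == "__main__":
    # Eight distinct 1 MB buffers stand in for separate files.
    buffers = [bytes([i]) * 1_000_000 for i in range(8)]
    for hex_digest in digest_many(buffers):
        print(hex_digest[:16], "...")
```

Each digest is still computed sequentially by one worker, so this helps only when many separate inputs are hashed (for example, one per file); it does not speed up a single hash.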