[Issue 21560] md5 poor performance out of the box

Jan 20, 2021

Witold Baryluk

Jan 21, 2021

Dec 17, 2022

https://issues.dlang.org/show_bug.cgi?id=21560

Witold Baryluk <witold.baryluk+d@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|md5                         |md5 poor performance out of
                   |                            |the box

--- Comment #1 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
100GB sparse file in tmpfs.

phobos:
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data

real 3m23.589s

md5sum from Debian testing:
009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data

real 2m20.709s

32KiB buffers were used in both cases (confirmed by strace).

Code compiled in release mode, with optimisations enabled.

AMD ThreadRipper 2950X, water cooled. 128GB, quad channel DDR4-2933 memory.
Linux 5.10.4

--

https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #2 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
void main(string[] args) {
    import std.digest.md : MD5, toHexString;
    import std.digest : LetterCase;
    import std.stdio : File, writefln;

    foreach (filename; args[1..$]) {
        ubyte[32768] buffer_ = void;
        MD5 md5;
        md5.start();
        foreach (ubyte[] buffer; File(filename).byChunk(buffer_)) {
            md5.put(buffer);
        }
        auto hash = md5.finish();
        writefln!("%s %s")(toHexString!(LetterCase.lower)(hash), filename);
    }
}

--

https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #3 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
Also openssl 1.1.1i-2 from Debian testing (uses 8KiB buffers):

MD5(/usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data)= 009f07e9b8fb09a820dd180441502d46

real 2m18.517s

Similar to md5sum (coreutils 8.32). Both are faster than phobos.

--

https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #4 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
FYI. Using File(filename).byChunk(32*1024), which allocates the buffer once on
the heap instead of on the stack (where it could be unaligned and need big
stack offsets, leading to slightly worse instruction encodings), gives the same
performance results.

--

https://issues.dlang.org/show_bug.cgi?id=21560

Basile-z <b2.temp@gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |b2.temp@gmx.com

--- Comment #5 from Basile-z <b2.temp@gmx.com> ---
using ldc2 too ? there are option to enable best vectorization and bit op

--

https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #6 from Chloé <chloekek@use.startmail.com> ---
Since you have 128 GB memory you could load the entire file into a byte array
and compute the hash from there. Start the timer after loading the entire file.
This should eliminate any potential difference in I/O from the tests. Of
course, you will need to do the same with md5sum and OpenSSL.

--
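
The suggestion in comment #6 could be sketched roughly as below. This is only an illustration, not code from the thread: the helper name hashThroughputMBps is hypothetical, and it assumes the whole file fits in RAM and that "MB" means 10^6 bytes. The file is read completely before the stopwatch starts, so only the MD5 computation is timed.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.digest : LetterCase, toHexString;
import std.digest.md : md5Of;
import std.file : read;
import std.stdio : writefln;

/// Hypothetical helper: hashes an in-memory buffer and returns the
/// throughput in MB/s (10^6 bytes per second). The digest is returned
/// through `hash`.
double hashThroughputMBps(const(ubyte)[] data, out ubyte[16] hash)
{
    auto sw = StopWatch(AutoStart.yes);
    hash = md5Of(data);
    sw.stop();
    immutable elapsed = sw.peek.total!"nsecs" / 1e9; // seconds
    return data.length / 1e6 / elapsed;
}

void main(string[] args)
{
    foreach (filename; args[1 .. $])
    {
        // Load everything first so that I/O stays outside the timed region.
        auto data = cast(const(ubyte)[]) read(filename);
        ubyte[16] hash;
        immutable rate = hashThroughputMBps(data, hash);
        writefln!"%s %s (%.0f MB/s)"(
            toHexString!(LetterCase.lower)(hash), filename, rate);
    }
}
```

For a fair comparison, md5sum and OpenSSL would need an equivalent in-memory harness, as the comment notes.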

https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #7 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
(In reply to Basile-z from comment #5)
> using ldc2 too ? there are option to enable best vectorization and bit op

Quite a bit better with ldc2 (1.24.0 with LLVM 11.0.0, -release -mcpu=native
-O3):

009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data

real 2m47.200s

But that is with the precompiled ldc Phobos from Debian testing (-O -inline
-release, aka -O2), so vectorization and const propagation are a bit limited.

--

https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #8 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
BTW. When using the precompiled dmd and phobos from dmd 2.095 from dlang.org,
it is really, really slow:

dmd -O -inline -release -mcpu=avx2 -boundscheck=off md5.d

009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data

real 15m14.536s

It uses the precompiled 64-bit phobos and links statically to it. It almost
feels like it is in debug mode, or maybe built with bounds checks on. I haven't
checked what it uses by default in some time.

--
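
To compare the wall-clock times in comments #1 to #8 with MB/s figures, a small conversion like the following may help. It is a sketch under an assumption the thread does not state: that the "100GB" file is 100 GiB (100 * 2^30 bytes) and that MB means 10^6 bytes. Under that assumption the md5sum run (2m20.709s) works out to roughly 763 MB/s. The function name throughputMBps is made up for illustration.

```d
import std.stdio : writefln;

/// Hypothetical helper: throughput in MB/s (10^6 bytes per second) for
/// `bytes` processed in `minutes` plus `seconds` of wall-clock time.
double throughputMBps(ulong bytes, uint minutes, double seconds)
{
    return bytes / 1e6 / (minutes * 60 + seconds);
}

void main()
{
    // Assumption: "100GB" in comment #1 means 100 GiB.
    enum ulong fileSize = 100UL * 1024 * 1024 * 1024;

    // "real" times as reported in the comments above.
    writefln!"phobos (comment #1):      %.0f MB/s"(throughputMBps(fileSize, 3, 23.589));
    writefln!"md5sum (comment #1):      %.0f MB/s"(throughputMBps(fileSize, 2, 20.709));
    writefln!"openssl (comment #3):     %.0f MB/s"(throughputMBps(fileSize, 2, 18.517));
    writefln!"ldc2 (comment #7):        %.0f MB/s"(throughputMBps(fileSize, 2, 47.200));
    writefln!"dmd, stock (comment #8):  %.0f MB/s"(throughputMBps(fileSize, 15, 14.536));
}
```

These numbers include read() syscall overhead, so they are not directly comparable with pure in-memory measurements.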

January 21, 2021

https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #9 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
In-memory tests, on 16384 byte blocks (should fit nicely in caches).

OpenSSL 1.1.1i (gcc-10.2.1 -fPIC -O2 -fstack-protector-strong ... -DOPENSSL_PIC
-DMD5_ASM ...):
833MB/s

standard/optimized (non-asm) C version with clang-11.0.0 -O3 -march=native
-flto -fomit-frame-pointer:
707MB/s

standard/optimized (non-asm) C version with gcc-10.2.1 -O3 -march=native -flto
-fomit-frame-pointer:
594MB/s

hand optimized x86-64 assembly with gcc-10.2.1 ....: 716MB/s

hand optimized x86-64 assembly with clang-11.0.0 ....: 717MB/s

md5sum-coreutils-8.32-4 on big files in tmpfs (uses 32KiB buffers, but also
doing syscalls), C + gcc-10:
763MB/s

md5sum-busybox-static-1.30.1-6 on big files in tmpfs (uses 32KiB buffers, but
also doing syscalls), C + gcc-10:
565MB/s

D / phobos:

gdc-10.2.1 -O3 -march=native -frelease -fno-weak  (using shared Phobos, which
uses -fPIC, from Debian testing)
569MB/s

dmd-2.095 -O -inline -release (precompiled Phobos from dlang.org binary
release, statically linked)
120MB/s

ldc2-1.24.0 -O3 -release (precompiled Phobos from Debian testing, dynamically
linked)
677MB/s

dmd-2.095 -O -inline -release -mcpu=avx2 -boundscheck=off + hand-compiled Phobos
with the same options, statically linked.
544MB/s

"performance" cpu frequency governor, no other load on the system. Reruns were 10s+ each, with a few MB/s of variation between reruns.



So, ldc2 actually does very well, approaching the performance of the pure-C version compiled with clang.

gdc, despite poor codegen with -fPIC in MD5.transform (it missed a lot of inlining opportunities for 1–2-instruction functions), is close to the pure-C version compiled with gcc.

dmd: it apparently depends on how you compile Phobos. The version distributed on dlang.org, and as built by default, does poorly. Properly compiled, it actually doesn't do too badly.

The pre-compiled version performs horribly, though.

--
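
For reference, an in-memory block benchmark of the kind described in comment #9 might look roughly like this for the D / phobos case. This is a guess at the shape of such a harness, not the code actually used for the numbers above: the function name blockThroughputMBps, the batch size of 1024 puts per clock read, and the zero-filled buffer are all assumptions.

```d
import core.time : Duration, seconds;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.digest.md : MD5;
import std.stdio : writefln;

/// Hypothetical helper: feeds the same block to MD5 repeatedly for at
/// least `minTime` and returns the throughput in MB/s (10^6 bytes/s).
double blockThroughputMBps(const(ubyte)[] block, Duration minTime,
                           out ubyte[16] hash)
{
    MD5 md5;
    md5.start();
    ulong bytes = 0;
    auto sw = StopWatch(AutoStart.yes);
    do
    {
        // Batch many puts per clock read so timer overhead stays small.
        foreach (_; 0 .. 1024)
            md5.put(block);
        bytes += 1024UL * block.length;
    }
    while (sw.peek < minTime);
    sw.stop();
    // Finish the digest and hand it back so the hashing work is observable.
    hash = md5.finish();
    return bytes / 1e6 / (sw.peek.total!"nsecs" / 1e9);
}

void main()
{
    ubyte[16384] block; // zero-filled 16384-byte block, small enough for cache
    ubyte[16] hash;
    immutable rate = blockThroughputMBps(block[], 10.seconds, hash);
    writefln!"%.0f MB/s"(rate);
}
```

The result will depend heavily on how both this program and Phobos itself were compiled, which is the point of the comparison above.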

https://issues.dlang.org/show_bug.cgi?id=21560

Iain Buclaw <ibuclaw@gdcproject.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P4

--
