[Issue 21560] md5 poor performance out of the box
January 20, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

Witold Baryluk <witold.baryluk+d@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|md5                         |md5 poor performance out of
                   |                            |the box

--- Comment #1 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
100GB sparse file in tmpfs.

phobos:

009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real    3m23.589s


md5sum from Debian testing:

009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real    2m20.709s


32KiB buffers were used in both cases (confirmed by strace).

Code compiled in release mode, with optimisations enabled.

AMD Threadripper 2950X, water cooled. 128GB of quad-channel DDR4-2933 memory. Linux 5.10.4.

--
January 20, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #2 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
void main(string[] args) {
  import std.digest.md : MD5, toHexString;
  import std.digest : LetterCase;
  import std.stdio : File, writefln;

  foreach (filename; args[1..$]) {
    ubyte[32768] buffer_ = void;
    MD5 md5;
    md5.start();
    foreach (ubyte[] buffer; File(filename).byChunk(buffer_)) {
        md5.put(buffer);
    }
    auto hash = md5.finish();
    writefln!("%s  %s")(toHexString!(LetterCase.lower)(hash), filename);
  }
}

--
January 20, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #3 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
Also openssl 1.1.1i-2 from Debian testing (uses 8KiB buffers):

MD5(/usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data)=
009f07e9b8fb09a820dd180441502d46
real    2m18.517s

Similar to md5sum (coreutils 8.32).

Both faster than phobos.

--
January 20, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #4 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
FYI: using File(filename).byChunk(32*1024) to allocate the buffer once on the
heap, instead of on the stack (where it could be unaligned and need big stack
offsets, leading to slightly worse instruction encodings), gives the same
performance results.
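
A minimal sketch of that heap-buffer variant, assuming the same 32KiB chunk size as the original test program. The helper name md5OfFile is hypothetical and only used for illustration; it is not taken from the report.

```d
import std.digest : LetterCase, toHexString;
import std.digest.md : MD5;
import std.stdio : File, writefln;

/// Hypothetical helper (not part of the original report): hashes one file
/// through a chunk buffer that byChunk(size_t) allocates on the GC heap.
ubyte[16] md5OfFile(string filename, size_t chunkSize = 32 * 1024)
{
    MD5 md5;
    md5.start();
    // byChunk(size_t) allocates its buffer once and reuses it for every
    // chunk, so each slice is only valid until the next iteration.
    foreach (ubyte[] chunk; File(filename).byChunk(chunkSize))
        md5.put(chunk);
    return md5.finish();
}

void main(string[] args)
{
    // Prints one "<hex digest>  <filename>" line per argument.
    foreach (filename; args[1 .. $])
        writefln!"%s  %s"(toHexString!(LetterCase.lower)(md5OfFile(filename)), filename);
}
```

The only difference from the stack-buffer program is who owns the buffer; the hashing loop is unchanged, which fits the observation that the timings come out the same.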

--
January 20, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

Basile-z <b2.temp@gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |b2.temp@gmx.com

--- Comment #5 from Basile-z <b2.temp@gmx.com> ---
Are you using ldc2 too? There are options to enable the best vectorization and bit ops.

--
January 20, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #6 from Chloé <chloekek@use.startmail.com> ---
Since you have 128 GB memory you could load the entire file into a byte array and compute the hash from there. Start the timer after loading the entire file. This should eliminate any potential difference in I/O from the tests. Of course, you will need to do the same with md5sum and OpenSSL.
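
A rough sketch of such an I/O-free measurement for the Phobos side, assuming the file fits in memory and that std.datetime.stopwatch is precise enough for this purpose. The helper name md5Throughput and the MB/s unit (1 MB = 10^6 bytes) are illustrative choices, not something taken from the report.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.digest : LetterCase, toHexString;
import std.digest.md : MD5;
import std.range : chunks;
import std.stdio : writefln;

/// Hypothetical helper: hashes `data` in `blockSize`-byte pieces and returns
/// the throughput in MB/s (1 MB = 10^6 bytes). Only the hashing is timed.
double md5Throughput(const(ubyte)[] data, size_t blockSize, out ubyte[16] digest)
{
    auto sw = StopWatch(AutoStart.yes);
    MD5 md5;
    md5.start();
    foreach (block; data.chunks(blockSize))
        md5.put(block);
    digest = md5.finish();
    sw.stop();
    const seconds = sw.peek.total!"nsecs" / 1e9;
    return data.length / 1e6 / seconds;
}

void main(string[] args)
{
    import std.file : read;

    foreach (filename; args[1 .. $])
    {
        // The whole file is loaded before the timer starts, so file I/O
        // is excluded from the measurement.
        auto data = cast(const(ubyte)[]) read(filename);
        ubyte[16] digest;
        const mbPerSec = md5Throughput(data, 32 * 1024, digest);
        writefln!"%s  %s  %.0f MB/s"(toHexString!(LetterCase.lower)(digest), filename, mbPerSec);
    }
}
```

md5sum and OpenSSL would need an equivalent harness around their own hashing routines for the numbers to be comparable.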

--
January 20, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #7 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
(In reply to Basile-z from comment #5)
> Are you using ldc2 too? There are options to enable the best vectorization and bit ops.

Quite a bit better with ldc2 (1.24.0 with LLVM 11.0.0, -release -mcpu=native
-O3):

009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real    2m47.200s


But that is with ldc's precompiled Phobos from Debian testing (-O -inline -release, aka -O2), so vectorization and constant propagation are somewhat limited.

--
January 20, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #8 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
BTW, when using the precompiled dmd and Phobos from the dmd 2.095 release on dlang.org, it is really, really slow:

dmd -O -inline -release -mcpu=avx2 -boundscheck=off md5.d

009f07e9b8fb09a820dd180441502d46 /usr/lib/live/mount/overlay/rw/var/lib/docker/devicemapper/devicemapper/data
real    15m14.536s

It uses the precompiled 64-bit Phobos and links statically to it. It almost feels like it is in debug mode, or maybe has bounds checks on. I haven't checked what it uses by default in some time.

--
January 21, 2021
https://issues.dlang.org/show_bug.cgi?id=21560

--- Comment #9 from Witold Baryluk <witold.baryluk+d@gmail.com> ---
In-memory tests, on 16384 byte blocks (should fit nicely in caches).

OpenSSL 1.1.1i (gcc-10.2.1 -fPIC -O2 -fstack-protector-strong ... -DOPENSSL_PIC
-DMD5_ASM ...):
833MB/s

standard/optimized (non-asm) C version with clang-11.0.0 -O3 -march=native
-flto -fomit-frame-pointer:
707MB/s

standard/optimized (non-asm) C version with gcc-10.2.1 -O3 -march=native -flto
-fomit-frame-pointer:
594MB/s

hand optimized x86-64 assembly with gcc-10.2.1 ....: 716MB/s

hand optimized x86-64 assembly with clang-11.0.0 ....: 717MB/s

md5sum-coreutils-8.32-4 on big files in tmpfs (uses 32KiB buffers, but also
doing syscalls), C + gcc-10:
763MB/s

md5sum-busybox-static-1.30.1-6 on big files in tmpfs (uses 32KiB buffers, but
also doing syscalls), C + gcc-10:
565MB/s

D / phobos:

gdc-10.2.1 -O3 -march=native -frelease -fno-weak  (using shared Phobos, which
uses -fPIC, from Debian testing)
569MB/s

dmd-2.095 -O -inline -release (precompiled Phobos from dlang.org binary
release, statically linked)
120MB/s

ldc2-1.24.0 -O3 -release (precompiled Phobos from Debian testing, dynamically
linked)
677MB/s

dmd-2.095 -O -inline -release -mcpu=avx2 -boundscheck=off + hand-compiled Phobos
with the same options, statically linked.
544MB/s



"performance" CPU frequency governor, no other load on the system. Reruns were 10s+ each, with a few MB/s of variation between reruns.



So, ldc2 actually does very well, approaching the performance of the pure-C version compiled with clang.

gdc, despite poor codegen with -fPIC in MD5.transform (it missed a lot of inlining opportunities for 1–2-instruction functions), is close to the pure-C version compiled with gcc.

dmd: it apparently depends on how you compile Phobos. The version distributed on dlang.org, and as built by default, does poorly. Properly compiled, it actually doesn't do too badly.

The precompiled version works horribly, though.
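
To put rough numbers on these comparisons, the ratios can be worked out directly from the MB/s figures listed above. This is only a small arithmetic sketch; the helper name ratio is hypothetical.

```d
import std.stdio : writefln;

/// Hypothetical helper: ratio of two throughputs given in the same unit.
double ratio(double a, double b)
{
    return a / b;
}

void main()
{
    // All MB/s values are copied from the in-memory results in this comment.
    writefln!"ldc2 vs. C with clang:          %.1f%%"(100 * ratio(677, 707));
    writefln!"gdc vs. C with gcc:             %.1f%%"(100 * ratio(569, 594));
    writefln!"dmd rebuilt vs. precompiled:    %.2fx"(ratio(544, 120));
    writefln!"OpenSSL asm vs. dmd precompiled: %.2fx"(ratio(833, 120));
}
```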

--
December 17, 2022
https://issues.dlang.org/show_bug.cgi?id=21560

Iain Buclaw <ibuclaw@gdcproject.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P4

--