December 10

On Thursday, 7 December 2023 at 16:32:32 UTC, Witold Baryluk wrote:

>

Reference code:

#!/usr/bin/env -S dmd -run

This doesn't work nicely from a read-only directory:

$ pwd
/tmp/read_only_directory

$ ./test.d
Error: error writing file 'test.o'

$ rdmd test.d
John is 45 years old
Kate is 30 years old
3
9
15
xor
btc
mov
cli
afoo

But rdmd is much slower than `dmd -run`. As another, possibly better alternative, changing the boilerplate to the following lines allows using dub as a launcher:

#!/usr/bin/env dub
/+dub.sdl:+/
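For reference, a complete minimal single-file script using this launcher might look as follows (a sketch; the empty `dub.sdl` recipe comment is enough when there are no dependencies):

```d
#!/usr/bin/env dub
/+dub.sdl:+/

// Run directly after `chmod +x`, or via `dub run --single script.d`;
// dub caches the build, so subsequent runs are fast.
void main()
{
    import std.stdio : writeln;
    writeln("hello from a dub single-file script");
}
```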

The existence of the slow rdmd tool is a liability, because some outsiders may judge D compilation speed by rdmd performance when comparing different programming languages.

December 15

On Friday, 8 December 2023 at 04:15:45 UTC, kinke wrote:

> Not just a little; the default bfd linker is terrible. My timings with various linkers (mold built myself) on Ubuntu 22, using a writeln variant, best of 5:
>
>                     bfd v2.38   gold v1.16   lld v14   mold v2.4
> DMD v2.106.0        0.34        0.22         0.18      fails to link
> LDC v1.36.0-beta1   0.47        0.24         0.22      0.18
>
> Bench cmdline: `dmd -Xcc=-fuse-ld=<bfd,gold,lld,mold> -run bench.d`

Regarding the "fails to link" table entry for the dmd+mold combo: I tried to search a bit and found:

It would be great if dmd could resolve the mold compatibility problems. Compilation speed is dmd's primary differentiating feature, justifying its very existence, so maybe this issue deserves much more attention?

December 16

It appears that linking Phobos as a static or a shared library makes a gigantic difference for this test. The shared library variant is much faster to compile. Changing dmd.conf in the unpacked tarball of the binary DMD compiler release to append `-L-rpath=%@P%/../lib64 -defaultlib=phobos2` makes `dmd -run` almost twice as fast for me.

[Environment32]
DFLAGS=-I%@P%/../../src/phobos -I%@P%/../../src/druntime/import -L-L%@P%/../lib32 -L--export-dynamic -fPIC -L-rpath=%@P%/../lib32 -defaultlib=phobos2

[Environment64]
DFLAGS=-I%@P%/../../src/phobos -I%@P%/../../src/druntime/import -L-L%@P%/../lib64 -L--export-dynamic -fPIC -L-rpath=%@P%/../lib64 -defaultlib=phobos2

And it's interesting that using the mold linker (by adding the `-Xcc=-fuse-ld=mold` option) appears to fail only when linking static 64-bit Phobos. Linking static 32-bit Phobos works, and both shared 64-bit and 32-bit Phobos appear to work too.

Creating binaries that depend on the shared Phobos library isn't a reasonable default configuration. However, it seems to be perfectly fine when used specifically for the "-run" option. Would adding an extra section in the dmd.conf file for the "-run" configuration be justified?

Oh, and I already mentioned rdmd before. Burn this thing with fire!

December 15
On 12/7/2023 8:32 AM, Witold Baryluk wrote:
> I do not use `dmd -run` too often, but recently was checking out Nim language, and they also do have run option, plus pretty fast compiler, so checked it out.

If you use `rdmd`, it caches the generated executable, so should be mucho faster for the 1+nth iterations.

December 15
On 12/7/2023 12:39 PM, Witold Baryluk wrote:
> Inspecting output of `dmd -v` shows that a lot of time is spent on various helpers of `writefln`. Changing `writefln` to `writeln` (and adjusting things so the output is still the same) speeds things up a lot:
> 
> 729.5 ms -> 431.8 ms (dmd 2.106.0)
> 896.6 ms -> 638.7 ms (dmd 2.098.1)
> 
> Considering that most script-like programs will need to do some IO, `writefln` looks a little bloated (slow to compile). Having string interpolation would probably help a little, but even then 431 ms is not that great.

It would be illuminating to compare using printf rather than writefln.

December 15
On 12/7/2023 2:33 PM, Adam D Ruppe wrote:
> On Thursday, 7 December 2023 at 20:39:03 UTC, Witold Baryluk wrote:
>> Inspecting output of `dmd -v` shows that a lot of time is spent on various helpers of `writefln`. Changing `writefln` to `writeln` (and adjusting things so the output is still the same) speeds things up a lot:
> 
> Most of Phobos is slow to import. Some of it is *brutally* slow to import.
> 
> My D2 programs come in at about 300ms mostly by avoiding it... but even that is slow compared to the 100ms that was common back in the old days.

One of our objectives for the next Phobos is to reduce the "every module imports every other module" design of current Phobos.
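In the meantime, small scripts can sidestep most of the Phobos import cost by using the C bindings that ship with druntime instead of std.stdio. A minimal sketch of the approach (the printf variant measured later in this thread does essentially the same thing):

```d
// Use druntime's C bindings instead of importing std.stdio;
// core.stdc.stdio is tiny compared to the std.stdio import graph.
import core.stdc.stdio : printf;

void main()
{
    string name = "John";
    // %.*s expects an int length plus a pointer when printing a D string
    printf("%.*s is %d years old\n", cast(int) name.length, name.ptr, 45);
}
```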

December 16

On Saturday, 16 December 2023 at 03:26:16 UTC, Walter Bright wrote:

> On 12/7/2023 8:32 AM, Witold Baryluk wrote:
>> I do not use `dmd -run` too often, but recently was checking out Nim language, and they also do have run option, plus pretty fast compiler, so checked it out.
>
> If you use `rdmd`, it caches the generated executable, so should be mucho faster for the 1+nth iterations.

Running a program from source in a script-like fashion is a very common standard feature for compilers, supported by at least Nim, Go and Crystal. And because it is a standard feature, fair apples-to-apples comparisons can be done between different programming languages.

The out-of-the-box experience with the latest version of DMD on a Linux system typically involves downloading a pre-built binary release tarball such as https://downloads.dlang.org/releases/2.x/2.106.0/dmd.2.106.0.linux.tar.xz. See the https://forum.dlang.org/thread/dqaiyvgncpuwmmicsjyk@forum.dlang.org thread for feedback from one of the newly onboarded users.

The rdmd tool is bundled there, and many new users will notice it and start using it. The documentation at https://dlang.org/dmd-linux.html describes it as a "D build tool for script-like D code execution". And it even has its own page here: https://dlang.org/rdmd.html

I guess you see what is coming next. DMD is widely advertised as a fast compiler, and users will judge its performance using the rdmd tool, because, again, this is pretty standard functionality provided by many programming languages in one way or another. A comparison of DMD 2.106.0, Crystal 1.10.1, Nim 1.6.14 and Go 1.21.4 on my computer (x86-64 Linux), using the test program from the first post in this thread:

test                                                         normal   cached
echo " " >> bench.cr && time crystal bench.cr                1.81s    1.81s
echo " " >> bench_writefln.d && time rdmd bench_writefln.d   1.22s    0.01s
echo " " >> bench.nim && time nim --hints:off r bench.nim    1.05s    0.18s
echo " " >> bench.cr && time crystal i bench.cr              0.91s    0.91s
echo " " >> bench_writefln.d && time dub bench_writefln.d    0.84s    0.03s
echo " " >> bench_writeln.d && time rdmd bench_writeln.d     0.59s    0.01s
echo " " >> bench_writeln.d && time dub bench_writeln.d      0.49s    0.03s
echo " " >> bench.go && time go run bench.go                 0.15s    0.13s

The cached column shows the time without the echo part (Crystal apparently doesn't implement caching). That's what any normal user would see when comparing compilers out of the box in the most straightforward manner, without tweaking anything. I also added results for dub (with the `/+dub.sdl:+/` boilerplate line added) to compare against rdmd.

December 17

On Saturday, 16 December 2023 at 03:31:05 UTC, Walter Bright wrote:

> On 12/7/2023 12:39 PM, Witold Baryluk wrote:
>> Inspecting output of `dmd -v` shows that a lot of time is spent on various helpers of `writefln`. Changing `writefln` to `writeln` (and adjusting things so the output is still the same) speeds things up a lot:
>>
>> 729.5 ms -> 431.8 ms (dmd 2.106.0)
>> 896.6 ms -> 638.7 ms (dmd 2.098.1)
>>
>> Considering that most script-like programs will need to do some IO, `writefln` looks a little bloated (slow to compile). Having string interpolation would probably help a little, but even then 431 ms is not that great.
>
> It would be illuminating to compare using printf rather than writefln.

Hi Walter.

Ok, I added `extern(C) int printf(const char *format, ...);` and measured various variants.

(No `version` conditions were used, just direct editing of the code for the various variants.)

Minimums of about 250 runs (in batches of 60, in various orders). As before, time is compile time + run time. The run time is essentially the same for all variants: about 2.5 ms minimum (6 ms average).

  • DMD 2.106.0
  • gdc 13.2.0
  • ldc2 1.35.0 (DMD v2.105.2, LLVM 16.0.6)
  • ld 2.41.50.20231202
  • mold 2.3.3 and mold 2.4.0 (segfault when used with dmd or ldc2)
  • gold 1.16 (binutils 2.41.50.20231202)
variant                        min time [ms]   notes
3×writefln+split               723
3×writeln+split                432             (from previous tests)
2×writefln+1×printf+split      722
1×writefln+2×printf+split      715
0×writefln+3×printf+split      396             with unused import std.stdio : writefln
0×writefln+3×printf+split      389             without import std.stdio
3×printf                       278             without import std.array either
3×printf using gdc             158             ditto, gdmd -run
3×printf using ldc2            129             ditto, ldmd2 -run
3×printf + gold                153
3×printf using gdc + gold      146
3×printf using ldc2 + gold     132
3×printf using gdc + mold      125

(I also tried -frelease, -O0, etc.; no huge difference.)

"with unused import std.stdio : writefln" - imports std.stdio (and all of its unconditional transitive imports), but no function from std.stdio is actually compiled, which is good.

"without import std.stdio, but still with std.array" - for split; doesn't import std.stdio transitively, but still imports a lot of cruft, like core.time, std.format.*, core.stdc.config, and a dozen other unnecessary things. Most of it is not compiled, but parsing it still isn't free. As for the weird stuff that gets compiled due to `split`, there are things like core.checkedint.mulu, std.algorithm.comparison.max, std.exception.enforce, std.utf._utfException!Flag.no, and a few more.

"without import std.array either" - is clean, but the compiler still decides to do a few things that are a little dubious, e.g. core.internal.array.equality.__equals (which is the only thing that I cannot map to anything in the source code, but it could be something implicit about the bool[string] associative array).
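For comparison, the lookup table from the original `split`-based version can be built without importing std.array at all. This is a sketch (not code from the thread) of a hand-rolled splitter that stays usable at compile time, so it still works in the `enum opcodes = ...` position:

```d
// Build a bool[string] lookup table from ';'-separated data
// without pulling in std.array; plain slicing is CTFE-friendly.
bool[string] toLookupTable(string data)
{
    bool[string] result;
    size_t start = 0;
    foreach (i; 0 .. data.length + 1)
    {
        // a ';' or the end of the string closes the current word
        if (i == data.length || data[i] == ';')
        {
            if (i > start)
                result[data[start .. i]] = true;
            start = i + 1;
        }
    }
    return result;
}

unittest
{
    enum table = toLookupTable("mov;btc;cli;xor;afoo");
    assert("mov" in table && "afoo" in table);
    assert(table.length == 5);
}
```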

Another observation: it looks like quite a bit of the overhead is due to work done in the kernel. I see about 50% of the time in user space and 50% in kernel space (i.e. reading directories, files, etc.).

For fastest run: User: 70.5 ms, System: 46.0 ms

For slowest run: User: 392.5 ms, System: 363.2 ms

strace-ing dmd itself shows nothing too crazy, but some sequences could be optimized:

stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.di", 0x7ffecfc2a0f0) = -1 ENOENT (No such file or directory)
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.d", {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.d", {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
openat(AT_FDCWD, "/usr/include/dmd/druntime/import/core/internal/array/comparison.d", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
read(3, "/**\n * This module contains comp"..., 7333) = 7333
close(3)

Each of these syscalls takes about 15 μs on my system (when strace-ing; probably a little less in a real run without the strace overhead).

There should be a way to cut this roughly in half with smarter sequencing (i.e. do the open first instead of stat + open + fstat).
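To illustrate the idea (a sketch using druntime's POSIX bindings, not dmd's actual code): just attempt the `open` and let a failure play the role of the negative `stat` probe, then `fstat` the already-open descriptor.

```d
// Probe-and-read a source file in 2 syscalls (open + fstat)
// instead of 4 (stat + stat + open + fstat): a failed open
// carries the same information as a failed stat probe.
import core.sys.posix.fcntl : open, O_RDONLY;
import core.sys.posix.sys.stat : fstat, stat_t;
import core.sys.posix.unistd : close, read;

bool tryReadFile(const(char)* path, ref char[] buf)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return false; // ENOENT etc.; no separate stat() needed
    scope (exit) close(fd);

    stat_t st;
    if (fstat(fd, &st) != 0)
        return false;

    buf = new char[cast(size_t) st.st_size];
    return read(fd, buf.ptr, buf.length) == cast(long) buf.length;
}
```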

But cc (used by dmd to do the final linking) does a lot more stupid things in this respect, including repeating exactly the same syscall on a file again and again (e.g. seeking to the end). In total, cc and its children (but without running the final executable) make about 49729 syscalls (so easily 250-400 ms) with the command line and object file produced by dmd, and "only" 8400 syscalls with the command line and object file produced by ldc2.

Using `gdmd -pipe -run` is minimally faster than without (155.2 ms vs 157.8 ms), but that is close to the measurement noise.

For reference, the last variant:

#!/usr/bin/env -S dmd -run

extern(C) int printf(const char *format, ...);

struct Person {
  string name;
  int age;
}

auto people = [
  Person("John", 45),
  Person("Kate", 30),
];

void main() {
  // import std.stdio : writefln;
  foreach (person; people) {
    printf("%.*s is %d years old\n", cast(int) person.name.length, person.name.ptr, person.age);
    // writefln("%s is %d years old", person.name, person.age);
  }

  static auto oddNumbers(T)(T[] a) {
    return delegate int(int delegate(ref T) dg) {
      foreach (x; a) {
        if (x % 2 == 0) continue;
        auto ret = dg(x);
        if (ret != 0) return ret;
      }
      return 0;
    };
  }

  foreach (odd; oddNumbers([3, 6, 9, 12, 15, 18])) {
    printf("%d\n", odd);
    // writefln("%d", odd);
  }

  static auto toLookupTable(string data) {
    // import std.array : split;
    bool[string] result;
    // foreach (w; data.split(';')) {
    //   result[w] = true;
    // }
    return result;
  }

  enum data = "mov;btc;cli;xor;afoo";
  enum opcodes = toLookupTable(data);

  foreach (o, k; opcodes) {
    printf("%.*s\n", cast(int) o.length, o.ptr);
    // writefln("%s", o);
  }
}

December 17

On Friday, 8 December 2023 at 04:15:45 UTC, kinke wrote:

> On Thursday, 7 December 2023 at 22:19:43 UTC, Witold Baryluk wrote:
>> Maybe switching to something like the gold or mold linker could help a little. This should help with dmd too a little.
>
> Not just a little; the default bfd linker is terrible. My timings with various linkers (mold built myself) on Ubuntu 22, using a writeln variant, best of 5:
>
>                     bfd v2.38   gold v1.16   lld v14   mold v2.4
> DMD v2.106.0        0.34        0.22         0.18      fails to link
> LDC v1.36.0-beta1   0.47        0.24         0.22      0.18
>
> Bench cmdline: `dmd -Xcc=-fuse-ld=<bfd,gold,lld,mold> -run bench.d`

Hi Martin.

mold 2.3.0 and 2.4.0 unfortunately fail for me with ldc version 1.35.0 (DMD v2.105.2, LLVM 16.0.6):

Starting program: /usr/bin/ld.mold -v -plugin /usr/libexec/gcc/x86_64-linux-gnu/13/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/x86_64-linux-gnu/13/lto-wrapper -plugin-opt=-fresolution=/tmp/ccYlnSJL.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie -o a /usr/lib/gcc/x86_64-linux-gnu/13/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/13/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/13/crtbeginS.o -L/usr/lib -L/usr/lib/gcc/x86_64-linux-gnu/13 -L/usr/lib/gcc/x86_64-linux-gnu/13/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/13/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/13/../../.. a.o /usr/lib/ldc_rt.dso.o -lphobos2-ldc-shared -ldruntime-ldc-shared --gc-sections -lrt -ldl -lpthread -lm -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib/gcc/x86_64-linux-gnu/13/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/13/../../../x86_64-linux-gnu/crtn.o
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
mold 2.4.0 (compatible with GNU ld)
[Detaching after fork from child process 515412]

Program received signal SIGSEGV, Segmentation fault.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007ffff7aa815f in __pthread_kill_internal (signo=11, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007ffff7a5a472 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x000055555663bf5d in mold::elf::fork_child () at ./elf/subprocess.cc:47
#4  0x0000555555d75500 in mold::elf::elf_main<mold::elf::X86_64> (argc=<optimized out>, argv=<optimized out>) at ./elf/main.cc:365
#5  0x00007ffff7a456ca in __libc_start_call_main (main=main@entry=0x555555666f90 <main(int, char**)>, argc=argc@entry=56, argv=argv@entry=0x7fffffffd7e8)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#6  0x00007ffff7a45785 in __libc_start_main_impl (main=0x555555666f90 <main(int, char**)>, argc=56, argv=0x7fffffffd7e8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffffffd7d8) at ../csu/libc-start.c:360
#7  0x0000555555667e51 in _start ()
(gdb)

December 17

On Saturday, 16 December 2023 at 00:38:37 UTC, Siarhei Siamashka wrote:

> Appears that linking Phobos as a static or a shared library makes a gigantic difference for this test. The shared library variant is much faster to compile. Changing dmd.conf in the unpacked tarball of the binary DMD compiler release to append -L-rpath=%@P%/../lib64 -defaultlib=phobos2 makes dmd -run almost twice faster for me.
>
> [Environment32]
> DFLAGS=-I%@P%/../../src/phobos -I%@P%/../../src/druntime/import -L-L%@P%/../lib32 -L--export-dynamic -fPIC -L-rpath=%@P%/../lib32 -defaultlib=phobos2
>
> [Environment64]
> DFLAGS=-I%@P%/../../src/phobos -I%@P%/../../src/druntime/import -L-L%@P%/../lib64 -L--export-dynamic -fPIC -L-rpath=%@P%/../lib64 -defaultlib=phobos2
>
> And it's interesting that using the mold linker (via adding the -Xcc=-fuse-ld=mold option) appears to fail only when linking static 64-bit Phobos. Linking static 32-bit Phobos works, both shared 64-bit and 32-bit Phobos appear to work too.

Wow. That is an enormous difference.

I was able to go from 278 ms (no Phobos imports, just some printf) down to 88.3 ms, and with the gold linker down to 80.2 ms.

For the full original version (std.stdio.writefln and std.array.split), it went from 723 ms down to 499 ms, and with the gold linker down to 477.5 ms.

So, 190 ms shaved off. This is huge.

(This is with the standard ld.bfd linker.)

> Creating binaries that depend on the shared Phobos library isn't a reasonable default configuration. However it seems to be perfectly fine if used specifically for the "-run" option. Would adding an extra section in the dmd.conf file for the "-run" configuration be justified?

What?! I always (for a decade) thought that dmd links Phobos dynamically by default. I think it should definitely link dynamically by default, just like gcc, gdc, ldc and clang do. Linking Phobos statically by default does not fully solve versioning anyway, as one still has dependencies on glibc and such.

Also, for fast edit + compile + run cycles, as well as for running unittests frequently, it definitely makes sense to do dynamic linking.

> Oh, and I already mentioned rdmd before. Burn this thing with fire!

That is "cheating". :) Yes, it is useful, but not for making sure the compiler is fast.