On Saturday, 16 December 2023 at 03:31:05 UTC, Walter Bright wrote:
> On 12/7/2023 12:39 PM, Witold Baryluk wrote:
> Inspecting output of dmd -v
, shows that a lot of time is spend on various helpers of writefln
. Changing writefln
to writeln
(and adjusting things so the output is still the same), speeds things a lot:
729.5 ms -> 431.8 ms (dmd 2.106.0)
896.6 ms -> 638.7 (dmd 2.098.1)
Considering that most script like programs will need to do some IO, writefln
looks a little bloated (slow to compile). Having string interpolation would probably help a little, but even then 431 ms is not that great.
It would be illuminating to compare using printf rather than writefln.
Hi Walter.
Ok, added extern(C) int printf(const char *format, ...);
, and measured various variants.
(no version
was used, just direct editing of the code for various variants).
Minimums of about 250 runs (in batches of 60, and various orders). As before time is compile time + run time
. Run is essentially same for all variants, and about 2.5 ms minimum (with 6ms average).
- DMD 2.106.0
- gdc 13.2.0
- ldc2 1.35.0 (DMD v2.105.2, LLVM 16.0.6)
- ld 2.41.50.20231202
- mold 2.3.3 and mold 2.4.0, segfault, when using with dmd or ldc2.
- gold 1.16 (binutils 2.41.50.20231202)
variant |
min time [ms] |
Notes |
3×writefln+split |
723 ms |
|
3×writeln+split |
432 ms |
(from previous tests) |
2×writefln+1×printf+split |
722 ms |
|
1×writefln+2×printf+split |
715 ms |
|
0×writefln+3×printf+split |
396 ms |
with unused import std.stdio : writefln |
0×writefln+3×printf+split |
389 ms |
without import std.stdio |
3×printf |
278 ms |
without import std.array either |
3×printf using gdc |
158 ms |
ditto, gdmd -run |
3×printf using ldc2 |
129 ms |
ditto, ldmd2 -run |
3×printf + gold |
153 ms |
|
3×printf using gdc + gold |
146 ms |
|
3×printf using ldc2 + gold |
132 ms |
|
3×printf using gdc + mold |
125 ms |
|
(I also tried, -frelease
, -O0
, etc, no huge difference)
"with unused import std.stdio : writefln
" - imports std.stdio
(and all unconditional transitive imports), but no function from std.stdio
is actually compiled, which is good.
"without import std.stdio
, but still with std.array
" - for split
, doesn't import std.stdio
transitively, but still imports a lot of crap, like core.time
, std.format.*
, core.stdc.config
, and dozen other unnecessary things. Most of it is not compiled but still, parsing this is not free. As of the weird stuff being compiled due to split
, are things like core.checkedint.mulu
, std.algorithm.comparison.max
, std.exception.enforce
, std.utf._utfException!Flag.no
, and few more.
"without import std.array
either" - is clean, but compiler still decides to do few things that are dubious a little, i.e. core.internal.array.equality.__equals
(which is the only things that I cannot find to map to anything in the source code, but could be something implicit about [string]bool
associative array).
Another observation, it looks quite a bit of overhead is due to work spent in kernel. I see about 50% in user space, and 50% in kernel space (i.e. reading directories, files, etc).
For fastest run: User: 70.5 ms, System: 46.0 ms
For slowest run: User: 392.5 ms, System: 363.2 ms
strace
ing the dmd itself, it is not too crazy, but some sequences can be optimized:
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.di", 0x7ffecfc2a0f0) = -1 ENOENT (No such file or directory)
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.d", {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.d", {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
openat(AT_FDCWD, "/usr/include/dmd/druntime/import/core/internal/array/comparison.d", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
read(3, "/**\n * This module contains comp"..., 7333) = 7333
close(3)
Each of these syscalls is about 15μs on my system (when stracing, probably little less in real run without strace
overheads)
There should be a way to reduce this in half with smarter sequencing (i.e. do open
first instead of stat + open + fstat
).
But the cc
(used by dmd to do final linking), does a lot more stupid things in this respect, including repeating exactly same syscall on a file again and again (i.e. like seek to the end). In total cc
and childs (but without final executable running), does about 49729 syscalls (so easily 250-400ms), with command line and object file produced by dmd. "only" 8400 syscalls, with command line and object file produced by ldc2.
Using gdmd -pipe -run
, is minimally faster than without (155.2 ms vs 157.8 ms), but it is close to noise in measurement.
For reference, the last variant:
#!/usr/bin/env -S dmd -run
extern(C) int printf(const char *format, ...);
struct Person {
string name;
int age;
}
auto people = [
Person("John", 45),
Person("Kate", 30),
];
void main() {
// import std.stdio : writefln;
foreach (person; people) {
printf("%.*s is %d years old\n", person.name.length, person.name.ptr, person.age);
// writefln("%s is %d years old", person.name, person.age);
}
static auto oddNumbers(T)(T[] a) {
return delegate int(int delegate(ref T) dg) {
foreach (x; a) {
if (x % 2 == 0) continue;
auto ret = dg(x);
if (ret != 0) return ret;
}
return 0;
};
}
foreach (odd; oddNumbers([3, 6, 9, 12, 15, 18])) {
printf("%d\n", odd);
// writefln("%d", odd);
}
static auto toLookupTable(string data) {
// import std.array : split;
bool[string] result;
// foreach (w; data.split(';')) {
// result[w] = true;
// }
return result;
}
enum data = "mov;btc;cli;xor;afoo";
enum opcodes = toLookupTable(data);
foreach (o, k; opcodes) {
printf("%.*s\n", o.length, o.ptr);
// writefln("%s", o);
}
}