On Monday, 30 May 2022 at 23:19:09 UTC, Iain Buclaw wrote:
> > On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
> > $ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
> > 55836809328
> >
> > real 0m6.520s
> > user 0m6.519s
> > sys 0m0.000s
> >
> > What do you think about all of this?
>
> Out of curiosity, are you linking in phobos statically or dynamically? You can force either with -static-libphobos or -shared-libphobos.
It's statically linked with libphobos. Both GDC and LDC can inline everything here. One major difference is that LDC is also able to eliminate the GC allocation: https://d.godbolt.org/z/x1jK1M149
Another major difference is that LDC uses an extra "cheat": it emits a 32-bit division instruction when the dividend and divisor are small enough. But it only does this trick for '-mcpu=x86-64' (the default) and stops doing it for '-mcpu=native' (which in my case is nehalem): https://d.godbolt.org/z/8ExEqqE41
If everything is manually inlined into the main function, then the benchmarks look like this:
$ ldc2 -O -g -release -mcpu=x86-64 test2.d && perf stat ./test2
55836809328
Performance counter stats for './test2':
1,920.80 msec task-clock:u # 0.986 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
245 page-faults:u # 0.128 K/sec
5,378,786,526 cycles:u # 2.800 GHz
1,745,747,872 stalled-cycles-frontend:u # 32.46% frontend cycles idle
636,941,001 stalled-cycles-backend:u # 11.84% backend cycles idle
7,218,615,757 instructions:u # 1.34 insn per cycle
# 0.24 stalled cycles per insn
1,371,563,853 branches:u # 714.057 M/sec
45,272,029 branch-misses:u # 3.30% of all branches
1.947334248 seconds time elapsed
1.921595000 seconds user
0.000000000 seconds sys
$ ldc2 -O -g -release -mcpu=nehalem test2.d && perf stat ./test2
55836809328
Performance counter stats for './test2':
4,599.54 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
235 page-faults:u # 0.051 K/sec
12,892,558,448 cycles:u # 2.803 GHz
4,894,073,820 stalled-cycles-frontend:u # 37.96% frontend cycles idle
1,550,118,424 stalled-cycles-backend:u # 12.02% backend cycles idle
4,995,853,241 instructions:u # 0.39 insn per cycle
# 0.98 stalled cycles per insn
804,146,718 branches:u # 174.832 M/sec
44,490,815 branch-misses:u # 5.53% of all branches
4.600090630 seconds time elapsed
4.599885000 seconds user
0.000000000 seconds sys
$ gdc-11.2.0 -O3 -g -frelease -flto test2.d && perf stat ./a.out
55836809328
Performance counter stats for './a.out':
4,604.69 msec task-clock:u # 0.995 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
172 page-faults:u # 0.037 K/sec
12,909,554,223 cycles:u # 2.804 GHz
4,693,546,651 stalled-cycles-frontend:u # 36.36% frontend cycles idle
1,132,407,891 stalled-cycles-backend:u # 8.77% backend cycles idle
5,313,064,245 instructions:u # 0.41 insn per cycle
# 0.88 stalled cycles per insn
1,042,903,599 branches:u # 226.487 M/sec
41,603,467 branch-misses:u # 3.99% of all branches
4.626366827 seconds time elapsed
4.605163000 seconds user
0.000000000 seconds sys
GDC and LDC become equally fast if the closure allocation overhead is negligible and LDC does not use a 32-bit division instruction instead of a 64-bit one.