DMD and GDC are unnecessarily using heap allocations for closures (page 2)

May 31, 2022

Re: DMD and GDC are unnecessarily using heap allocations for closures

Posted by Siarhei Siamashka
in reply to Iain Buclaw

Permalink

Siarhei Siamashka

Posted in reply to Iain Buclaw

Permalink

On Monday, 30 May 2022 at 23:19:09 UTC, Iain Buclaw wrote:

On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:

$ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
55836809328

real 0m6.520s
user 0m6.519s
sys 0m0.000s

What do you think about all of this?

Out of curiosity, are you linking in phobos statically or dynamically? You can force either with -static-libphobos or -shared-libphobos.

It's statically linked with libphobos. Both GDC and LDC can inline everything here. And one major difference is that LDC is also able to eliminate GC allocation: https://d.godbolt.org/z/x1jK1M149

Another major difference is that LDC does some extra "cheat" to use a 32-bit division instruction if dividend and divisor are small enough. But it only does this trick for '-mcpu=x86-64' (default) and stops doing it for '-mcpu=native' (which in my case is nehalem): https://d.godbolt.org/z/8ExEqqE41

If everything is manually inlined into main function, then benchmarks look like this:

$ ldc2 -O -g -release -mcpu=x86-64 test2.d && perf stat ./test2
55836809328

 Performance counter stats for './test2':

          1,920.80 msec task-clock:u              #    0.986 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               245      page-faults:u             #    0.128 K/sec
     5,378,786,526      cycles:u                  #    2.800 GHz
     1,745,747,872      stalled-cycles-frontend:u #   32.46% frontend cycles idle
       636,941,001      stalled-cycles-backend:u  #   11.84% backend cycles idle
     7,218,615,757      instructions:u            #    1.34  insn per cycle
                                                  #    0.24  stalled cycles per insn
     1,371,563,853      branches:u                #  714.057 M/sec
        45,272,029      branch-misses:u           #    3.30% of all branches

       1.947334248 seconds time elapsed

       1.921595000 seconds user
       0.000000000 seconds sys

$ ldc2 -O -g -release -mcpu=nehalem test2.d && perf stat ./test2
55836809328

 Performance counter stats for './test2':

          4,599.54 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               235      page-faults:u             #    0.051 K/sec
    12,892,558,448      cycles:u                  #    2.803 GHz
     4,894,073,820      stalled-cycles-frontend:u #   37.96% frontend cycles idle
     1,550,118,424      stalled-cycles-backend:u  #   12.02% backend cycles idle
     4,995,853,241      instructions:u            #    0.39  insn per cycle
                                                  #    0.98  stalled cycles per insn
       804,146,718      branches:u                #  174.832 M/sec
        44,490,815      branch-misses:u           #    5.53% of all branches

       4.600090630 seconds time elapsed

       4.599885000 seconds user
       0.000000000 seconds sys

$ gdc-11.2.0 -O3 -g -frelease -flto test2.d && perf stat ./a.out
55836809328

 Performance counter stats for './a.out':

          4,604.69 msec task-clock:u              #    0.995 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               172      page-faults:u             #    0.037 K/sec
    12,909,554,223      cycles:u                  #    2.804 GHz
     4,693,546,651      stalled-cycles-frontend:u #   36.36% frontend cycles idle
     1,132,407,891      stalled-cycles-backend:u  #    8.77% backend cycles idle
     5,313,064,245      instructions:u            #    0.41  insn per cycle
                                                  #    0.88  stalled cycles per insn
     1,042,903,599      branches:u                #  226.487 M/sec
        41,603,467      branch-misses:u           #    3.99% of all branches

       4.626366827 seconds time elapsed

       4.605163000 seconds user
       0.000000000 seconds sys

GDC and LDC become equally fast if closure allocations overhead is negligible and if LDC does not use 32-bit division instead of 64-bit one.

On Monday, 30 May 2022 at 10:41:10 UTC, Johan wrote:

The @nogc attribute is a language-level guarantee that no GC allocations will happen in that function.

Then maybe another attribute @makebestefforttohavenocginoptimizedbuild (with a different shorter name) could be introduced? So that the GC allocations originating from such functions would be additionally marked as 'please optimize me out' by the frontend. The GC2Stack IR optimization pass would do its normal job. And if some of such marked GC allocations still remain after GC2Stack, then the compiler could spit out a helpful performance warning message. The non-optimized debug builds can do GC allocations without making any noise. And its okay for any compiler to just ignore this attribute (similar to how 'inline' is handled in C language).

I think that there are two somewhat different use cases for @nogc:

we can't tolerate any GC allocations, because the runtime has no garbage collector at all
we can do GC allocations, but strongly desire to avoid them for performance reasons

The current @nogc is apparently designed for the use case 1 and is overly conservative. But I'm primarily interested in the use case 2.

This guarantee can not depend on an optimization, unless any compiler implementation is required to implement that optimization also for unoptimized code: effectively making it no longer an "optimization", but just normal language behavior. (example: C++17 requires copy elision when a function returns a temporary object+ (unnamed object), but does not require it when a function returns a named object.)

Would it be difficult/practical to implement that optimization (or some simplified form of it) also for unoptimized code in the frontend and make it the normal language behavior?

long binary_search(long n, long val, long i) @nogc { struct C { long n, val; } return zip(repeat(C(n, val)), iota(1, i + 1)).map!(x => x[0].n / x[1] == x[0].val) .assumeSorted.upperBound(false).length; }

Forums