Performance test of short-circuiting AliasSeq (page 2)

June 03, 2020

Re: Performance test of short-circuiting AliasSeq

Posted by Stefan Koch
in reply to Stefan Koch

On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
> Hi,
>
> So I've asked myself whether the PR https://github.com/dlang/dmd/pull/11057, which complicated the compiler internals and got pulled on the basis that it would increase performance, actually did increase performance.
> So I ran it on a staticMap benchmark which it should speed up.
>
> I am going to use the same benchmark as I used for
> https://forum.dlang.org/post/kktulflpozdrsxeinfbg@forum.dlang.org
>
> Firstly let me post the results of the benchmark without Walter's patch applied.
>
> ---
> Benchmark #1: ./dmd.sh sm.d
>   Time (mean ± σ):     436.0 ms ±  14.6 ms    [User: 384.0 ms, System: 51.8 ms]
>   Range (min … max):   413.5 ms … 476.7 ms    40 runs
>
> Benchmark #2: ./dmd.sh sm.d -version=DotDotDot
>   Time (mean ± σ):     219.9 ms ±   5.6 ms    [User: 210.0 ms, System: 10.0 ms]
>   Range (min … max):   208.6 ms … 235.9 ms    40 runs
>
> Benchmark #3: ./dmd.sh sm.d -version=Walter
>   Time (mean ± σ):     330.4 ms ±   7.1 ms    [User: 290.8 ms, System: 39.5 ms]
>   Range (min … max):   316.6 ms … 345.7 ms    40 runs
>
> Summary
>   './dmd.sh sm.d -version=DotDotDot' ran
>     1.50 ± 0.05 times faster than './dmd.sh sm.d -version=Walter'
>     1.98 ± 0.08 times faster than './dmd.sh sm.d'
>
> ---
>
> If you care to compare the timings at the bottom they pretty much match the results I've measured in the benchmark previously and posted in the thread above.
>
> Now let's see how the test performs with the patches applied.
>
> Benchmark #1: ./dmd.sh sm.d
>   Time (mean ± σ):     423.5 ms ±   8.9 ms    [User: 377.6 ms, System: 45.7 ms]
>   Range (min … max):   411.0 ms … 444.0 ms    40 runs
>
> Benchmark #2: ./dmd.sh sm.d -version=DotDotDot
>   Time (mean ± σ):     231.0 ms ±   4.3 ms    [User: 220.3 ms, System: 10.9 ms]
>   Range (min … max):   223.3 ms … 243.9 ms    40 runs
>
> Benchmark #3: ./dmd.sh sm.d -version=Walter
>   Time (mean ± σ):     342.4 ms ±   8.1 ms    [User: 306.4 ms, System: 36.0 ms]
>   Range (min … max):   331.0 ms … 375.1 ms    40 runs
>
> Summary
>   './dmd.sh sm.d -version=DotDotDot' ran
>     1.48 ± 0.04 times faster than './dmd.sh sm.d -version=Walter'
>     1.83 ± 0.05 times faster than './dmd.sh sm.d'
>
> We see the difference between `...` and Walter's unrolled staticMap shrink.
> And we do see a decrease for the divide-and-conquer version of staticMap.
>
> However, we do see the mean times of `...` and Walter's unrolled staticMap actually increase.
>
> That made me curious and I repeated the measurements with a higher repetition count. Just to make sure that this is not a spur of the moment thing.
>
> ---- "short-circuit" patch applied.
> uplink@uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine "./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d -version=Walter" -r 90
> Benchmark #1: ./dmd.sh sm.d
>   Time (mean ± σ):     425.9 ms ±  13.4 ms    [User: 373.7 ms, System: 51.9 ms]
>   Range (min … max):   409.6 ms … 468.8 ms    90 runs
>
> Benchmark #2: ./dmd.sh sm.d -version=DotDotDot
>   Time (mean ± σ):     234.3 ms ±   9.5 ms    [User: 224.1 ms, System: 10.2 ms]
>   Range (min … max):   220.0 ms … 272.1 ms    90 runs
>
> Benchmark #3: ./dmd.sh sm.d -version=Walter
>   Time (mean ± σ):     340.6 ms ±   7.1 ms    [User: 299.7 ms, System: 40.9 ms]
>   Range (min … max):   328.9 ms … 359.3 ms    90 runs
>
> Summary
>   './dmd.sh sm.d -version=DotDotDot' ran
>     1.45 ± 0.07 times faster than './dmd.sh sm.d -version=Walter'
>     1.82 ± 0.09 times faster than './dmd.sh sm.d'
>
> This is consistent with what we got before.
> For good measure (pun intended), I tested the DMD version without the patch with an increased repetition count as well.
>
> ---- without "short-circuit" patch:
> uplink@uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine "./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d -version=Walter" -r 90
> Benchmark #1: ./dmd.sh sm.d
>   Time (mean ± σ):     428.9 ms ±  11.3 ms    [User: 376.2 ms, System: 52.3 ms]
>   Range (min … max):   412.8 ms … 464.5 ms    90 runs
>
> Benchmark #2: ./dmd.sh sm.d -version=DotDotDot
>   Time (mean ± σ):     217.8 ms ±   5.2 ms    [User: 208.9 ms, System: 9.0 ms]
>   Range (min … max):   209.0 ms … 241.6 ms    90 runs
>
> Benchmark #3: ./dmd.sh sm.d -version=Walter
>   Time (mean ± σ):     329.9 ms ±   9.4 ms    [User: 287.8 ms, System: 41.9 ms]
>   Range (min … max):   318.7 ms … 364.6 ms    90 runs
>
> Summary
>   './dmd.sh sm.d -version=DotDotDot' ran
>     1.51 ± 0.06 times faster than './dmd.sh sm.d -version=Walter'
>     1.97 ± 0.07 times faster than './dmd.sh sm.d'
>
> The results seem quite solid.
> At least on the benchmark I have used, "short-circuiting" AliasSeq leads to a 4% slowdown for Walter's unrolled staticMap and a 7% slowdown for `...`.
>
> I think we should not assume performance improvements before measuring.
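
As a rough cross-check, the slowdown percentages can be recomputed from the mean times of the two 90-run measurements quoted above. This is only a minimal sketch, assuming a POSIX shell with awk; the helper name `slowdown` is made up for illustration and is not part of hyperfine or dmd.

```shell
#!/bin/sh
# Relative slowdown in percent, given the mean time without and with the patch.
# `slowdown` is a hypothetical helper name, not part of hyperfine or dmd.
slowdown() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f%%\n", (after - before) / before * 100 }'
}

# Means (ms) taken from the two 90-run measurements quoted above.
slowdown 329.9 340.6   # -version=Walter
slowdown 217.8 234.3   # -version=DotDotDot
```

With those means this works out to about 3.2% for the unrolled staticMap and about 7.6% for `...`, which is in the same ballpark as the rounded figures in the quoted post.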


The reason the old version of staticMap did not see the slowdown is that I didn't disable codegen.

Codegen inefficiencies that occur when emitting an unreasonable number of symbols tend to hide other problems.

Here is a benchmark which does not rely on our branch
but uses the released dmd 2.092.0.

Enjoy!

uplink@uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o- -version=Walter" "./dmd_with_patch sm.d -c -o- -version=Walter" -m 90
Benchmark #1: ./dmd_without_patch sm.d -c -o- -version=Walter
  Time (mean ± σ):     452.8 ms ±   7.5 ms    [User: 415.5 ms, System: 37.4 ms]
  Range (min … max):   442.2 ms … 483.9 ms    90 runs

Benchmark #2: ./dmd_with_patch sm.d -c -o- -version=Walter
  Time (mean ± σ):     455.1 ms ±  10.4 ms    [User: 417.3 ms, System: 37.7 ms]
  Range (min … max):   441.5 ms … 489.2 ms    90 runs

Summary
  './dmd_without_patch sm.d -c -o- -version=Walter' ran
    1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o- -version=Walter'
uplink@uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 90
Benchmark #1: ./dmd_without_patch sm.d -c -o-
  Time (mean ± σ):     583.2 ms ±  11.0 ms    [User: 529.9 ms, System: 53.1 ms]
  Range (min … max):   570.0 ms … 631.0 ms    90 runs

Benchmark #2: ./dmd_with_patch sm.d -c -o-
  Time (mean ± σ):     584.3 ms ±  14.3 ms    [User: 533.1 ms, System: 51.0 ms]
  Range (min … max):   566.5 ms … 657.9 ms    90 runs

Summary
  './dmd_without_patch sm.d -c -o-' ran
    1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o-'
uplink@uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 90
Benchmark #1: ./dmd_without_patch sm.d -c -o-
  Time (mean ± σ):     583.4 ms ±  10.5 ms    [User: 529.2 ms, System: 54.0 ms]
  Range (min … max):   566.9 ms … 624.0 ms    90 runs

Benchmark #2: ./dmd_with_patch sm.d -c -o-
  Time (mean ± σ):     585.9 ms ±  13.9 ms    [User: 530.5 ms, System: 55.2 ms]
  Range (min … max):   565.0 ms … 631.7 ms    90 runs

Summary
  './dmd_without_patch sm.d -c -o-' ran
    1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o-'

On Wednesday, 3 June 2020 at 14:52:09 UTC, Stefan Koch wrote:
> The reason the old version of staticMap did not see the slowdown is that I didn't disable codegen.
>
> Codegen inefficiencies that occur when emitting an unreasonable number of symbols tend to hide other problems.
>
> Here is a benchmark which does not rely on our branch
> but uses the released dmd 2.092.0.
>
> Enjoy!
>
> uplink@uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o- -version=Walter" "./dmd_with_patch sm.d -c -o- -version=Walter" -m 90
> Benchmark #1: ./dmd_without_patch sm.d -c -o- -version=Walter
>   Time (mean ± σ):     452.8 ms ±   7.5 ms    [User: 415.5 ms, System: 37.4 ms]
>   Range (min … max):   442.2 ms … 483.9 ms    90 runs
>
> Benchmark #2: ./dmd_with_patch sm.d -c -o- -version=Walter
>   Time (mean ± σ):     455.1 ms ±  10.4 ms    [User: 417.3 ms, System: 37.7 ms]
>   Range (min … max):   441.5 ms … 489.2 ms    90 runs
>
> Summary
>   './dmd_without_patch sm.d -c -o- -version=Walter' ran
>     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o- -version=Walter'
> uplink@uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 90
> Benchmark #1: ./dmd_without_patch sm.d -c -o-
>   Time (mean ± σ):     583.2 ms ±  11.0 ms    [User: 529.9 ms, System: 53.1 ms]
>   Range (min … max):   570.0 ms … 631.0 ms    90 runs
>
> Benchmark #2: ./dmd_with_patch sm.d -c -o-
>   Time (mean ± σ):     584.3 ms ±  14.3 ms    [User: 533.1 ms, System: 51.0 ms]
>   Range (min … max):   566.5 ms … 657.9 ms    90 runs
>
> Summary
>   './dmd_without_patch sm.d -c -o-' ran
>     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o-'
> uplink@uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 90
> Benchmark #1: ./dmd_without_patch sm.d -c -o-
>   Time (mean ± σ):     583.4 ms ±  10.5 ms    [User: 529.2 ms, System: 54.0 ms]
>   Range (min … max):   566.9 ms … 624.0 ms    90 runs
>
> Benchmark #2: ./dmd_with_patch sm.d -c -o-
>   Time (mean ± σ):     585.9 ms ±  13.9 ms    [User: 530.5 ms, System: 55.2 ms]
>   Range (min … max):   565.0 ms … 631.7 ms    90 runs
>
> Summary
>   './dmd_without_patch sm.d -c -o-' ran
>     1.00 ± 0.03 times faster than './dmd_with_patch sm.d -c -o-'

Disregard this one.
I had AliasSeq defined as `template AliasSeq(seq...) { enum AliasSeq = seq; }`,
which does not trigger the optimization.

However, when I define AliasSeq as `template AliasSeq(seq...) { alias AliasSeq = seq; }`,
the optimization triggers and you get:

uplink@uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o- -version=Walter" "./dmd_with_patch sm.d -c -o- -version=Walter" -m 50
Benchmark #1: ./dmd_without_patch sm.d -c -o- -version=Walter
  Time (mean ± σ):     296.2 ms ±   6.8 ms    [User: 263.6 ms, System: 32.5 ms]
  Range (min … max):   285.7 ms … 330.8 ms    50 runs

Benchmark #2: ./dmd_with_patch sm.d -c -o- -version=Walter
  Time (mean ± σ):     301.4 ms ±  11.7 ms    [User: 270.6 ms, System: 30.8 ms]
  Range (min … max):   285.6 ms … 333.3 ms    50 runs

Summary
  './dmd_without_patch sm.d -c -o- -version=Walter' ran
    1.02 ± 0.05 times faster than './dmd_with_patch sm.d -c -o- -version=Walter'
uplink@uplink-black:~/d/dmd-master/dmd(stable)$ hyperfine "./dmd_without_patch sm.d -c -o-" "./dmd_with_patch sm.d -c -o-" -m 50
Benchmark #1: ./dmd_without_patch sm.d -c -o-
  Time (mean ± σ):     388.6 ms ±   8.6 ms    [User: 346.5 ms, System: 42.2 ms]
  Range (min … max):   378.5 ms … 419.3 ms    50 runs

Benchmark #2: ./dmd_with_patch sm.d -c -o-
  Time (mean ± σ):     375.7 ms ±   9.9 ms    [User: 332.8 ms, System: 42.8 ms]
  Range (min … max):   362.2 ms … 396.3 ms    50 runs

Summary
  './dmd_with_patch sm.d -c -o-' ran
    1.03 ± 0.04 times faster than './dmd_without_patch sm.d -c -o-'

This is somewhat consistent with the previous results.
The test that I did, which does not trigger the optimization, shows no measurable difference.
That means that if the optimization does not trigger, no performance penalty is incurred FOR THIS TEST.
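
To see how small these differences are relative to the noise, the summary lines can be reproduced by hand from the means and standard deviations. The sketch below assumes the usual first-order error propagation for a ratio of two means; whether hyperfine computes its "±" exactly this way is an assumption here, and `rel_speed` is a made-up helper name.

```shell
#!/bin/sh
# Ratio of two mean times with a propagated uncertainty:
#   r = m_slow / m_fast,  err = r * sqrt((s_slow/m_slow)^2 + (s_fast/m_fast)^2)
# `rel_speed` is a hypothetical helper; the error formula is an assumption
# about how the "±" in the summary is derived.
rel_speed() {
  awk -v m1="$1" -v s1="$2" -v m2="$3" -v s2="$4" 'BEGIN {
    r = m1 / m2
    e = r * sqrt((s1 / m1) * (s1 / m1) + (s2 / m2) * (s2 / m2))
    printf "%.2f +/- %.2f\n", r, e
  }'
}

# Arguments: slower mean, slower sigma, faster mean, faster sigma (all in ms).
rel_speed 301.4 11.7 296.2 6.8   # -version=Walter: with patch vs. without
rel_speed 388.6 8.6 375.7 9.9    # divide and conquer: without patch vs. with
```

For the numbers above this gives 1.02 +/- 0.05 and 1.03 +/- 0.04, matching the two summaries: in both cases the measured speed ratio is within roughly one uncertainty interval of 1.00.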

Another surprising thing is that applying the patch somehow reduces the size of the binary, which just goes to show that you really cannot tell right from wrong anymore with modern optimizers.

-rwxrwxr-x 1 uplink uplink 19281504 Jun  3 16:30 dmd_without_patch
-rwxrwxr-x 1 uplink uplink 19279120 Jun  3 16:31 dmd_with_patch
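
For scale, the size difference between the two binaries can be worked out from the listing above. A minimal sketch, again assuming a POSIX shell with awk; `size_delta` is a made-up helper name.

```shell
#!/bin/sh
# Absolute and relative size difference between two binaries, in bytes.
# `size_delta` is a hypothetical helper; the sizes are copied from the listing.
size_delta() {
  awk -v a="$1" -v b="$2" \
    'BEGIN { printf "%d bytes (%.3f%%)\n", a - b, (a - b) / a * 100 }'
}

size_delta 19281504 19279120   # dmd_without_patch vs. dmd_with_patch
```

That is a reduction of 2384 bytes, or about 0.012% of the unpatched binary.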

My guess is that LLVM's inliner went less crazy because of an unpredictable branch in there.
