Thread overview
Performance test of short-circuiting AliasSeq
Jun 01, 2020 - Stefan Koch
Jun 01, 2020 - Stefan Koch
Jun 02, 2020 - Walter Bright
Jun 02, 2020 - FeepingCreature
Jun 02, 2020 - Stefan Koch
Jun 02, 2020 - FeepingCreature
Jun 02, 2020 - Stefan Koch
Jun 02, 2020 - FeepingCreature
Jun 02, 2020 - Stefan Koch
Jun 02, 2020 - Stefan Koch
Jun 02, 2020 - Stefan Koch
Jun 02, 2020 - Stefan Koch
Jun 03, 2020 - Stefan Koch
Jun 03, 2020 - Stefan Koch
June 01, 2020
Hi,

So I've asked myself if the PR https://github.com/dlang/dmd/pull/11057

which complicated the compiler internals and got pulled on the basis that it would increase performance, actually did increase performance.
So I ran a staticMap benchmark which the patch is supposed to speed up.

I am going to use the same benchmark as I used for
https://forum.dlang.org/post/kktulflpozdrsxeinfbg@forum.dlang.org
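For context, here is a rough sketch of what such a benchmark file could look like. The actual sm.d from the linked thread is not reproduced here; the helper names, the mapped type sequence, and the exact shape of each staticMap variant are illustrative assumptions, and the `...` variant is stubbed out because its syntax only exists on the experimental compiler branch.

---
// sm.d -- hypothetical sketch of the benchmark harness; the real file from
// the linked thread may differ. The interesting number is the compile time.
import std.meta : AliasSeq;

version (DotDotDot)
{
    // The real benchmark would define staticMap via the experimental `...`
    // tuple-expansion syntax here; that syntax only exists on the
    // experimental branch, so this sketch leaves it out.
    static assert(0, "requires the experimental `...` branch");
}
else version (Walter)
{
    // Unrolled staticMap in the spirit of Walter's version: handle small
    // argument counts explicitly, split larger ones in half.
    template staticMap(alias F, Args...)
    {
        static if (Args.length == 0)
            alias staticMap = AliasSeq!();
        else static if (Args.length == 1)
            alias staticMap = AliasSeq!(F!(Args[0]));
        else static if (Args.length == 2)
            alias staticMap = AliasSeq!(F!(Args[0]), F!(Args[1]));
        else static if (Args.length == 3)
            alias staticMap = AliasSeq!(F!(Args[0]), F!(Args[1]), F!(Args[2]));
        else
            alias staticMap = AliasSeq!(
                staticMap!(F, Args[0 .. $ / 2]),
                staticMap!(F, Args[$ / 2 .. $]));
    }
}
else
{
    // Plain divide-and-conquer staticMap (the "older" implementation).
    template staticMap(alias F, Args...)
    {
        static if (Args.length == 0)
            alias staticMap = AliasSeq!();
        else static if (Args.length == 1)
            alias staticMap = AliasSeq!(F!(Args[0]));
        else
            alias staticMap = AliasSeq!(
                staticMap!(F, Args[0 .. $ / 2]),
                staticMap!(F, Args[$ / 2 .. $]));
    }
}

// Something cheap to map over each element.
alias addPointer(T) = T*;

// A reasonably large type sequence so staticMap instantiation dominates
// the compile time.
alias Base   = AliasSeq!(int, uint, long, ulong, short, ushort, byte, ubyte);
alias Big    = AliasSeq!(Base, Base, Base, Base, Base, Base, Base, Base);
alias Mapped = staticMap!(addPointer, Big);

void main() {}
---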

First, let me post the results of the benchmark without Walter's patch applied.

---
Benchmark #1: ./dmd.sh sm.d
  Time (mean ± σ):     436.0 ms ±  14.6 ms    [User: 384.0 ms, System: 51.8 ms]
  Range (min … max):   413.5 ms … 476.7 ms    40 runs

Benchmark #2: ./dmd.sh sm.d -version=DotDotDot
  Time (mean ± σ):     219.9 ms ±   5.6 ms    [User: 210.0 ms, System: 10.0 ms]
  Range (min … max):   208.6 ms … 235.9 ms    40 runs

Benchmark #3: ./dmd.sh sm.d -version=Walter
  Time (mean ± σ):     330.4 ms ±   7.1 ms    [User: 290.8 ms, System: 39.5 ms]
  Range (min … max):   316.6 ms … 345.7 ms    40 runs

Summary
  './dmd.sh sm.d -version=DotDotDot' ran
    1.50 ± 0.05 times faster than './dmd.sh sm.d -version=Walter'
    1.98 ± 0.08 times faster than './dmd.sh sm.d'

---

If you compare the timings at the bottom, they pretty much match the results I measured previously and posted in the thread linked above.

Now let's see how the test performs with the patch applied.

Benchmark #1: ./dmd.sh sm.d
  Time (mean ± σ):     423.5 ms ±   8.9 ms    [User: 377.6 ms, System: 45.7 ms]
  Range (min … max):   411.0 ms … 444.0 ms    40 runs

Benchmark #2: ./dmd.sh sm.d -version=DotDotDot
  Time (mean ± σ):     231.0 ms ±   4.3 ms    [User: 220.3 ms, System: 10.9 ms]
  Range (min … max):   223.3 ms … 243.9 ms    40 runs

Benchmark #3: ./dmd.sh sm.d -version=Walter
  Time (mean ± σ):     342.4 ms ±   8.1 ms    [User: 306.4 ms, System: 36.0 ms]
  Range (min … max):   331.0 ms … 375.1 ms    40 runs

Summary
  './dmd.sh sm.d -version=DotDotDot' ran
    1.48 ± 0.04 times faster than './dmd.sh sm.d -version=Walter'
    1.83 ± 0.05 times faster than './dmd.sh sm.d'

We see the gap between `...` and Walter's unrolled staticMap shrink,
and we see a decrease in the mean time of the divide-and-conquer version of staticMap.

However, we also see the mean times of `...` and Walter's unrolled staticMap actually increase.

That made me curious, so I repeated the measurements with a higher repetition count, just to make sure this was not a one-off.

---- "short-circuit" patch applied.
uplink@uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine "./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d -version=Walter" -r 90
Benchmark #1: ./dmd.sh sm.d
  Time (mean ± σ):     425.9 ms ±  13.4 ms    [User: 373.7 ms, System: 51.9 ms]
  Range (min … max):   409.6 ms … 468.8 ms    90 runs

Benchmark #2: ./dmd.sh sm.d -version=DotDotDot
  Time (mean ± σ):     234.3 ms ±   9.5 ms    [User: 224.1 ms, System: 10.2 ms]
  Range (min … max):   220.0 ms … 272.1 ms    90 runs

Benchmark #3: ./dmd.sh sm.d -version=Walter
  Time (mean ± σ):     340.6 ms ±   7.1 ms    [User: 299.7 ms, System: 40.9 ms]
  Range (min … max):   328.9 ms … 359.3 ms    90 runs

Summary
  './dmd.sh sm.d -version=DotDotDot' ran
    1.45 ± 0.07 times faster than './dmd.sh sm.d -version=Walter'
    1.82 ± 0.09 times faster than './dmd.sh sm.d'

This is consistent with what we got before.
For good measure (pun intended), I tested the DMD version without the patch with an increased repetition count as well.

---- without "short-circuit" patch:
uplink@uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine "./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d -version=Walter" -r 90
Benchmark #1: ./dmd.sh sm.d
  Time (mean ± σ):     428.9 ms ±  11.3 ms    [User: 376.2 ms, System: 52.3 ms]
  Range (min … max):   412.8 ms … 464.5 ms    90 runs

Benchmark #2: ./dmd.sh sm.d -version=DotDotDot
  Time (mean ± σ):     217.8 ms ±   5.2 ms    [User: 208.9 ms, System: 9.0 ms]
  Range (min … max):   209.0 ms … 241.6 ms    90 runs

Benchmark #3: ./dmd.sh sm.d -version=Walter
  Time (mean ± σ):     329.9 ms ±   9.4 ms    [User: 287.8 ms, System: 41.9 ms]
  Range (min … max):   318.7 ms … 364.6 ms    90 runs

Summary
  './dmd.sh sm.d -version=DotDotDot' ran
    1.51 ± 0.06 times faster than './dmd.sh sm.d -version=Walter'
    1.97 ± 0.07 times faster than './dmd.sh sm.d'

The results seem quite solid.
At least on the benchmark I used, "short-circuiting" AliasSeq leads to a 4% slowdown for Walter's unrolled staticMap and a 7% slowdown for `...`.

I think we should not include assumed performance improvements without measuring them.


June 01, 2020
On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
> Hi,
>
> So I've asked myself if the PR https://github.com/dlang/dmd/pull/11057
>
> [...]

TL;DR: the performance patch caused a slowdown.
Why? Because checking for the case it wants to optimize takes more time than you save by optimizing it.
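To illustrate the general pattern outside the compiler: a fast path whose guard has to inspect the input costs something on every call, and if the guarded case almost never occurs, that guard is pure overhead. The following is a generic D sketch of that failure mode, not the code from the PR.

---
// Illustrative only: a "fast path" whose guard can cost more than it saves
// when the case it checks for almost never occurs.
import std.algorithm.sorting : isSorted, sort;

void sortWithShortCircuit(int[] data)
{
    // The guard is an O(n) scan. If the input is practically never already
    // sorted, every call pays for the scan and almost none benefit from it.
    if (data.isSorted)
        return;          // the "optimized" case
    data.sort();         // the common case, now with extra overhead
}

void main()
{
    auto a = [3, 1, 2];
    sortWithShortCircuit(a);
    assert(a == [1, 2, 3]);
}
---
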
June 02, 2020
On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
> Hi,
>
> So I've asked myself if the PR https://github.com/dlang/dmd/pull/11057
>
> [...]

To state this more fairly:

The short-circuiting patch did IMPROVE the average performance of the older staticMap implementation by 0.7%.

It only showed a performance decrease with the new staticMap and `...`,
by 4% and 7% respectively.
June 01, 2020
On 6/1/2020 1:23 PM, Stefan Koch wrote:
> TL;DR: the performance patch caused a slowdown.
> Why? Because checking for the case it wants to optimize takes more time than you save by optimizing it.

You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
June 02, 2020
On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
> On 6/1/2020 1:23 PM, Stefan Koch wrote:
>> TL;DR: the performance patch caused a slowdown.
>> Why? Because checking for the case it wants to optimize takes more time than you save by optimizing it.
>
> You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.

Oh hey, Vtune is free again for noncommercial use! Nice, there was a phase where it was only available commercially.

That said, for multi-run timings I'm very fond of multitime ( https://tratt.net/laurie/src/multitime/ ) and perf-tools can also give you good per-instruction analytics, though its presentation is not nearly as nice as vtune's.
June 02, 2020
On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
> On 6/1/2020 1:23 PM, Stefan Koch wrote:
>> TL;DR: the performance patch caused a slowdown.
>> Why? Because checking for the case it wants to optimize takes more time than you save by optimizing it.
>
> You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.

Vtune doesn't do intrusive profiling as far as I know.
Therefore its measurements _can_ miss functions which are called often,
but are very short-running.
A profile obtained by periodic sampling has the potential of misrepresenting the situation.

In this case, finer-grained analysis is of course needed to make a 100% definite statement; however, the measurement itself is quite telling.
Your short-circuiting patch was the only thing that changed between the versions.

I do have a fully instrumented version of dmd.
I know exactly which function spends how much time processing which symbol.

I will post those results soon.
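For readers who have not seen intrusive profiling: the instrumented functions themselves record how often they are entered and how long they run, instead of being observed from the outside by a sampler. Below is a minimal, illustrative sketch of such an instrumentation helper; it is not the actual tooling built into the instrumented dmd mentioned above.

---
// Illustrative sketch of intrusive profiling: each instrumented function
// records its own call count and elapsed time. Not the actual dmd tooling.
import core.time : MonoTime;
import std.stdio : writefln;

struct PhaseStats
{
    long absTime; // accumulated time in hecto-nanoseconds
    long freq;    // how often the phase was entered
}

PhaseStats[string] phaseStats;

// RAII helper: times one entry into a named phase, from construction
// until destruction at scope exit.
struct PhaseTimer
{
    string name;
    MonoTime start;

    this(string name)
    {
        this.name  = name;
        this.start = MonoTime.currTime;
    }

    ~this()
    {
        const elapsed = (MonoTime.currTime - start).total!"hnsecs";
        if (auto p = name in phaseStats)
        {
            p.absTime += elapsed;
            p.freq    += 1;
        }
        else
        {
            phaseStats[name] = PhaseStats(elapsed, 1);
        }
    }
}

// Stand-in for a compiler pass such as templateInstanceSemantic.
void someCompilerPass()
{
    auto timer = PhaseTimer("someCompilerPass");
    // ... the pass's real work would happen here ...
}

void main()
{
    foreach (i; 0 .. 1_000)
        someCompilerPass();

    writefln("%-20s %12s %12s %8s", "phase", "avgTime", "absTime", "freq");
    foreach (name, s; phaseStats)
        writefln("%-20s %12.2f %12d %8d",
                 name, cast(double) s.absTime / s.freq, s.absTime, s.freq);
}
---
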
June 02, 2020
On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
> Vtune doesn't do intrusive profiling as far as I know.
> Therefore its measurements _can_ miss functions which are called often,
> but are very short-running.
> A profile obtained by periodic sampling has the potential of misrepresenting the situation.
>

I don't see how this can be.

A sampling profiler doesn't rely on any sort of program breakpointing. Be the function long-running or short-running, if it's active 1% of the time it will (on net) show up in 1% of sampled backtraces, and so on. To my best understanding, the only way a function can fail to be visible is if it's (depending on sample density, more or less) irrelevant to the total runtime. (Or it's invisibly inlined.)
June 02, 2020
On Tuesday, 2 June 2020 at 09:28:48 UTC, FeepingCreature wrote:
> On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
>> Vtune doesn't do intrusive profiling as far as I know.
>> Therefore its measurements _can_ miss functions which are called often,
>> but are very short-running.
>> A profile obtained by periodic sampling has the potential of misrepresenting the situation.
>>
>
> I don't see how this can be.
>
> A sampling profiler doesn't rely on any sort of program breakpointing. Be the function long-running or short-running, if it's active 1% of the time it will (on net) show up in 1% of sampled backtraces, and so on. To my best understanding, the only way a function can fail to be visible is if it's (depending on sample density, more or less) irrelevant to the total runtime. (Or it's invisibly inlined.)

You are correct.
Given uniform program execution and very fine-grained sampling, i.e. sampling breaks at sub-microsecond intervals, you get a quite detailed understanding of where the program spends its time overall.

However, because you are frequently interrupting execution at short intervals, you are also changing the execution profile,
making your results potentially inaccurate.
June 02, 2020
On Tuesday, 2 June 2020 at 09:48:47 UTC, Stefan Koch wrote:
> You are correct.
> Given uniform program execution and very fine-grained sampling, i.e. sampling breaks at sub-microsecond intervals, you get a quite detailed understanding of where the program spends its time overall.
>
> However, because you are frequently interrupting execution at short intervals, you are also changing the execution profile,
> making your results potentially inaccurate.

Ah, that's true...
June 02, 2020
On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
> On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
>> On 6/1/2020 1:23 PM, Stefan Koch wrote:
>>> TL;DR: the performance patch caused a slowdown.
>>> Why? Because checking for the case it wants to optimize takes more time than you save by optimizing it.
>>
>> You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
>
> Vtune doesn't do intrusive profiling as far as I know.
> Therefore its measurements _can_ miss functions which are called often,
> but are very short-running.
> A profile obtained by periodic sampling has the potential of misrepresenting the situation.
>
> In this case, finer-grained analysis is of course needed to make a 100% definite statement; however, the measurement itself is quite telling.
> Your short-circuiting patch was the only thing that changed between the versions.
>
> I do have a fully instrumented version of dmd.
> I know exactly which function spends how much time processing which symbol.
>
> I will post those results soon.

So I have a little profiling tool in DMD which tells me how often certain functions are being called.
This is the output for the test with the patch applied, followed by the corresponding output without it:

 === Phase Time Distribution : ===
 phase                                                                                      avgTime    absTime       freq

 dmd.dsymbolsem.dsymbolSemantic                                                             127270.86  37145399296   291861
 dmd.dsymbolsem.templateInstanceSemantic                                                    355060.25  17450856448   49149
 dmd.expressionsem.symbolToExp                                                              993.92     113991488     114689
 dmd.dtemplate.TemplateInstance.syntaxCopy                                                  1327.78    92444120      69623
 dmd.dmangle.Mangler.mangleType                                                             293.39     60637312      206680
 dmd.dtemplate.TemplateDeclaration.findExistingInstance                                     950.36     19462416      20479
 dmd.mtype.TypeIdentifier.syntaxCopy                                                        203.80     12570752      61681
 Type::mangleToBuffer                                                                       83.38      2107624       25277
 dmd.dmangle.Mangler.mangleSymbol                                                           573.35     108936        190
 dmd.func.FuncDeclaration.functionSemantic                                                  374.00     374           1

 === Phase Time Distribution : ===
 phase                                                                                      avgTime    absTime       freq
 dmd.dsymbolsem.dsymbolSemantic                                                             124654.58  36400631808   292012
 dmd.dsymbolsem.templateInstanceSemantic                                                    348265.00  17116876800   49149
 dmd.expressionsem.symbolToExp                                                              1013.76    116267384     114689
 dmd.dtemplate.TemplateInstance.syntaxCopy                                                  1317.11    91701088      69623
 dmd.dmangle.Mangler.mangleType                                                             288.24     59609648      206809
 dmd.dtemplate.TemplateDeclaration.findExistingInstance                                     954.04     19537828      20479
 dmd.mtype.TypeIdentifier.syntaxCopy                                                        205.71     12690568      61693
 Type::mangleToBuffer                                                                       91.08      2307294       25332
 dmd.dmangle.Mangler.mangleSymbol                                                           832.58     169014        203
 dmd.func.FuncDeclaration.functionSemantic                                                  680.00     680           1

As you can see, templateInstanceSemantic is called the same number of times in both runs.
That would imply the optimization never actually gets triggered.