Thread overview | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
June 01, 2020 Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Hi, So I've asked myself if the PR https://github.com/dlang/dmd/pull/11057 Which complicated the compiler internals and got pulled on the basis that it would increase performance did actually increase performance. So I ran it on a staticMap benchmark which it should speed up. I am going to use the same benchmark as I used for https://forum.dlang.org/post/kktulflpozdrsxeinfbg@forum.dlang.org Firstly let me post the results of the benchmark without walters patch applied. --- Benchmark #1: ./dmd.sh sm.d Time (mean ± σ): 436.0 ms ± 14.6 ms [User: 384.0 ms, System: 51.8 ms] Range (min … max): 413.5 ms … 476.7 ms 40 runs Benchmark #2: ./dmd.sh sm.d -version=DotDotDot Time (mean ± σ): 219.9 ms ± 5.6 ms [User: 210.0 ms, System: 10.0 ms] Range (min … max): 208.6 ms … 235.9 ms 40 runs Benchmark #3: ./dmd.sh sm.d -version=Walter Time (mean ± σ): 330.4 ms ± 7.1 ms [User: 290.8 ms, System: 39.5 ms] Range (min … max): 316.6 ms … 345.7 ms 40 runs Summary './dmd.sh sm.d -version=DotDotDot' ran 1.50 ± 0.05 times faster than './dmd.sh sm.d -version=Walter' 1.98 ± 0.08 times faster than './dmd.sh sm.d' --- If you care to compare the timings at the bottom they pretty much match the results I've measured in the benchmark previously and posted in the thread above. Now let's see how the test performs with the patches applied. Benchmark #1: ./dmd.sh sm.d Time (mean ± σ): 423.5 ms ± 8.9 ms [User: 377.6 ms, System: 45.7 ms] Range (min … max): 411.0 ms … 444.0 ms 40 runs Benchmark #2: ./dmd.sh sm.d -version=DotDotDot Time (mean ± σ): 231.0 ms ± 4.3 ms [User: 220.3 ms, System: 10.9 ms] Range (min … max): 223.3 ms … 243.9 ms 40 runs Benchmark #3: ./dmd.sh sm.d -version=Walter Time (mean ± σ): 342.4 ms ± 8.1 ms [User: 306.4 ms, System: 36.0 ms] Range (min … max): 331.0 ms … 375.1 ms 40 runs Summary './dmd.sh sm.d -version=DotDotDot' ran 1.48 ± 0.04 times faster than './dmd.sh sm.d -version=Walter' 1.83 ± 0.05 times faster than './dmd.sh sm.d' We see the difference between `...` and Walters unrolled staticMap shrink. And we go see a decrease of the divide and conquer version of staticMap. However We do see the mean times of the `...` and Walters unrolled staticMap actually increase. That made me curious and I repeated the measurements with a higher repetition count. Just to make sure that this is not a spur of the moment thing. ---- "short-circuit" patch applied. uplink@uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine "./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d -version=Walter" -r 90 Benchmark #1: ./dmd.sh sm.d Time (mean ± σ): 425.9 ms ± 13.4 ms [User: 373.7 ms, System: 51.9 ms] Range (min … max): 409.6 ms … 468.8 ms 90 runs Benchmark #2: ./dmd.sh sm.d -version=DotDotDot Time (mean ± σ): 234.3 ms ± 9.5 ms [User: 224.1 ms, System: 10.2 ms] Range (min … max): 220.0 ms … 272.1 ms 90 runs Benchmark #3: ./dmd.sh sm.d -version=Walter Time (mean ± σ): 340.6 ms ± 7.1 ms [User: 299.7 ms, System: 40.9 ms] Range (min … max): 328.9 ms … 359.3 ms 90 runs Summary './dmd.sh sm.d -version=DotDotDot' ran 1.45 ± 0.07 times faster than './dmd.sh sm.d -version=Walter' 1.82 ± 0.09 times faster than './dmd.sh sm.d' This is consistent with what we got before. For good measure (pun intended), I tested the DMD version without the patch with an increased repetition count as well. ---- without "short-circuit" patch: uplink@uplink-black:~/d/dmd-master/dmd(manudotdotdot)$ hyperfine "./dmd.sh sm.d" "./dmd.sh sm.d -version=DotDotDot" "./dmd.sh sm.d -version=Walter" -r 90 Benchmark #1: ./dmd.sh sm.d Time (mean ± σ): 428.9 ms ± 11.3 ms [User: 376.2 ms, System: 52.3 ms] Range (min … max): 412.8 ms … 464.5 ms 90 runs Benchmark #2: ./dmd.sh sm.d -version=DotDotDot Time (mean ± σ): 217.8 ms ± 5.2 ms [User: 208.9 ms, System: 9.0 ms] Range (min … max): 209.0 ms … 241.6 ms 90 runs Benchmark #3: ./dmd.sh sm.d -version=Walter Time (mean ± σ): 329.9 ms ± 9.4 ms [User: 287.8 ms, System: 41.9 ms] Range (min … max): 318.7 ms … 364.6 ms 90 runs Summary './dmd.sh sm.d -version=DotDotDot' ran 1.51 ± 0.06 times faster than './dmd.sh sm.d -version=Walter' 1.97 ± 0.07 times faster than './dmd.sh sm.d' The results seem quite solid. At leasr on benchmark I have used for "short-circuiting" AliasSeq leads to a 4% slowdown for walters unrolled staticMap. and a 7% slowdown for `...` I think we should not assumed include performance improvements before measuring. |
June 01, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stefan Koch | On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
> Hi,
>
> So I've asked myself if the PR https://github.com/dlang/dmd/pull/11057
>
> [...]
TLDR; Performance patch caused a slowdown.
Why? Because the checking for the case which it wants to optimize takes more time than you safe by optimizing it.
|
June 02, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stefan Koch | On Monday, 1 June 2020 at 20:16:55 UTC, Stefan Koch wrote:
> Hi,
>
> So I've asked myself if the PR https://github.com/dlang/dmd/pull/11057
>
> [...]
To state this more fairly:
The short-circuiting patch did IMPROVE the average performance of older staticMap implementation by 0.7%
And only showed a decrease of performance with the new staticMap and ...
The amount of which if 4% and 7% respectively.
|
June 01, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stefan Koch | On 6/1/2020 1:23 PM, Stefan Koch wrote:
> TLDR; Performance patch caused a slowdown.
> Why? Because the checking for the case which it wants to optimize takes more time than you safe by optimizing it.
You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
|
June 02, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote: > On 6/1/2020 1:23 PM, Stefan Koch wrote: >> TLDR; Performance patch caused a slowdown. >> Why? Because the checking for the case which it wants to optimize takes more time than you safe by optimizing it. > > You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune. Oh hey, Vtune is free again for noncommercial use! Nice, there was a phase where it was only available commercially. That said, for multi-run timings I'm very fond of multitime ( https://tratt.net/laurie/src/multitime/ ) and perf-tools can also give you good per-instruction analytics, though its presentation is not nearly as nice as vtune's. |
June 02, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote:
> On 6/1/2020 1:23 PM, Stefan Koch wrote:
>> TLDR; Performance patch caused a slowdown.
>> Why? Because the checking for the case which it wants to optimize takes more time than you safe by optimizing it.
>
> You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune.
Vtune doesn't do intrusive profiling as far as I know.
Therefore it's measurements _can_ miss functions which are called often,
but are very short running.
A profile obtained by periodic sampling is has the potential of misrepresenting the situation.
In this case finer grained analysis is of course needed to do a 100% definite statement, however the measurement itself is quite telling.
Your short-circuiting patch was the only thing that changed between the versions.
I do have a fully instrumented version of dmd.
I know exactly which functions spends how much time processing which symbol.
I will post those results soon.
|
June 02, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stefan Koch | On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
> Vtune doesn't do intrusive profiling as far as I know.
> Therefore it's measurements _can_ miss functions which are called often,
> but are very short running.
> A profile obtained by periodic sampling is has the potential of misrepresenting the situation.
>
I don't see how this can be.
A sampling profiler doesn't rely on any sort of program breakpointing. Be the function long-running or short-running, if it's active 1% of the time it will (on net) show up on 1% of sampled backtraces, and so on. To my best understanding, the only way a function can fail to be visible if it's (depending on sample density, more or less) irrelevant to the total runtime. (Or it's invisibly inlined.)
|
June 02, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to FeepingCreature | On Tuesday, 2 June 2020 at 09:28:48 UTC, FeepingCreature wrote:
> On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote:
>> Vtune doesn't do intrusive profiling as far as I know.
>> Therefore it's measurements _can_ miss functions which are called often,
>> but are very short running.
>> A profile obtained by periodic sampling is has the potential of misrepresenting the situation.
>>
>
> I don't see how this can be.
>
> A sampling profiler doesn't rely on any sort of program breakpointing. Be the function long-running or short-running, if it's active 1% of the time it will (on net) show up on 1% of sampled backtraces, and so on. To my best understanding, the only way a function can fail to be visible if it's (depending on sample density, more or less) irrelevant to the total runtime. (Or it's invisibly inlined.)
You are correct.
Given a uniform program execution and very fine grained sampling i.e. sampling breaks in a sub microsecond interval, you get an understanding of where the program spends it's time overall. In a quite detailed matter.
However because you frequently interrupting execution in short intervals you are also changing the execution profile.
Making your results potentially inaccurate.
|
June 02, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stefan Koch | On Tuesday, 2 June 2020 at 09:48:47 UTC, Stefan Koch wrote:
> You are correct.
> Given a uniform program execution and very fine grained sampling i.e. sampling breaks in a sub microsecond interval, you get an understanding of where the program spends it's time overall. In a quite detailed matter.
>
> However because you frequently interrupting execution in short intervals you are also changing the execution profile.
> Making your results potentially inaccurate.
Ah, that's true...
|
June 02, 2020 Re: Performance test of short-circuiting AliasSeq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stefan Koch | On Tuesday, 2 June 2020 at 09:19:57 UTC, Stefan Koch wrote: > On Tuesday, 2 June 2020 at 06:52:00 UTC, Walter Bright wrote: >> On 6/1/2020 1:23 PM, Stefan Koch wrote: >>> TLDR; Performance patch caused a slowdown. >>> Why? Because the checking for the case which it wants to optimize takes more time than you safe by optimizing it. >> >> You'll need a finer grained profile to reach this conclusion than the gross measurement of the compiler runtime. I suggest using Vtune. > > Vtune doesn't do intrusive profiling as far as I know. > Therefore it's measurements _can_ miss functions which are called often, > but are very short running. > A profile obtained by periodic sampling is has the potential of misrepresenting the situation. > > In this case finer grained analysis is of course needed to do a 100% definite statement, however the measurement itself is quite telling. > Your short-circuiting patch was the only thing that changed between the versions. > > I do have a fully instrumented version of dmd. > I know exactly which functions spends how much time processing which symbol. > > I will post those results soon. So I have a little profiling tool in DMD which tells me how often certain functions are being called. This is the output for the test with the patch: === Phase Time Distribution : === phase avgTime absTime freq dmd.dsymbolsem.dsymbolSemantic 127270.86 37145399296 291861 dmd.dsymbolsem.templateInstanceSemantic 355060.25 17450856448 49149 dmd.expressionsem.symbolToExp 993.92 113991488 114689 dmd.dtemplate.TemplateInstance.syntaxCopy 1327.78 92444120 69623 dmd.dmangle.Mangler.mangleType 293.39 60637312 206680 dmd.dtemplate.TemplateDeclaration.findExistingInstance 950.36 19462416 20479 dmd.mtype.TypeIdentifier.syntaxCopy 203.80 12570752 61681 Type::mangleToBuffer 83.38 2107624 25277 dmd.dmangle.Mangler.mangleSymbol 573.35 108936 190 dmd.func.FuncDeclaration.functionSemantic 374.00 374 1 === Phase Time Distribution : === phase avgTime absTime freq dmd.dsymbolsem.dsymbolSemantic 124654.58 36400631808 292012 dmd.dsymbolsem.templateInstanceSemantic 348265.00 17116876800 49149 dmd.expressionsem.symbolToExp 1013.76 116267384 114689 dmd.dtemplate.TemplateInstance.syntaxCopy 1317.11 91701088 69623 dmd.dmangle.Mangler.mangleType 288.24 59609648 206809 dmd.dtemplate.TemplateDeclaration.findExistingInstance 954.04 19537828 20479 dmd.mtype.TypeIdentifier.syntaxCopy 205.71 12690568 61693 Type::mangleToBuffer 91.08 2307294 25332 dmd.dmangle.Mangler.mangleSymbol 832.58 169014 203 dmd.func.FuncDeclaration.functionSemantic 680.00 680 1 As you can see the templateInstanceSemantic is called the same number of times. That would imply the optimization does never actually get triggered. |
Copyright © 1999-2021 by the D Language Foundation