Thread overview
Optimization fun
  Nov 06, 2014  H. S. Teoh
  Nov 06, 2014  Walter Bright
  Nov 06, 2014  Sean Kelly
  Nov 07, 2014  thedeemon
  Nov 07, 2014  Ary Borenszweig
November 06, 2014
So today, I was playing around with profiling and optimizing my sliding block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). The profiler indicated that one of the hotspots is the GC. So I added an option to the program to call GC.disable() if a command line option is given. The result is a 40%-50% reduction in running time.
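
In case anyone wants to try the same thing, here's a minimal sketch of the flag handling (the `--no-gc` flag name and the solver call are illustrative placeholders, not my actual code):

```d
// Sketch: disable collections for an allocate-only workload.
// The flag name and the solver call are made-up placeholders.
import core.memory : GC;
import std.algorithm : canFind;

void main(string[] args)
{
    if (args.canFind("--no-gc"))
        GC.disable();   // stops collection cycles; allocations still succeed

    // ... run the allocate-only solver here ...

    GC.enable();        // restore normal collection before any freeing phase
}
```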

2) Auto-decoding is evil: the solver currently uses a naïve char[] representation for the puzzle board, and uses countUntil() extensively to locate things on the board. The original code had:

	auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually no reason to do so: ch is a char, and therefore countUntil could've used strchr instead). Simply changing that to:

	auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.
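
To see why the two lines differ (a toy board, not the solver's real data): .representation reinterprets the char[] as ubyte[], so countUntil compares raw bytes instead of decoding code points, and for ASCII data both give the same offset:

```d
// Toy illustration with made-up board contents.
import std.algorithm : countUntil;
import std.string : representation;

void main()
{
    char[] board = "..#..".dup;
    // Decoding path: the char[] is iterated as dchar code points.
    auto slow = board.countUntil('#');
    // Byte path: .representation yields ubyte[], no decoding at all.
    auto fast = board.representation.countUntil('#');
    assert(slow == 2 && fast == 2);  // same answer for ASCII data
}
```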


3) However, countUntil remains at the top of the profiler's list of hotspots. DMD's optimizer was rather disappointing here, so I looked at gdc's output instead. Digging into the disassembly, I saw that it was using a foreach loop over the char[], but for some reason even gdc -O3 failed to simplify that loop significantly (I'm not sure why -- maybe what threw off the optimizer is the array indexing + length check operation, where in C one would normally just bump the pointer instead?). Eventually, I tried writing a manual loop:

            auto ptr = current.ptr;
            auto c = current.length;
            while (c > 0 && *ptr != m.id)
                ptr++;
            cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hasn't gotten around to implementing optimizations based on assert yet??  Anyway, with this optimization, I managed to shave off another 3%-5% of the total running time.
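
For reference, here's a bounds-safe variant of that loop as a standalone function (a sketch; the parameters stand in for `current` and `m.id`). One observation: in the loop above `c` is never decremented, so `c > 0` is loop-invariant and only needs testing once -- which by itself could explain the missing per-iteration check, no assert-based optimization required.

```d
// Bounds-safe sketch of the scan above; parameter names are stand-ins
// for the poster's `current` and `m.id`.
size_t scan(const(char)[] current, char id)
{
    auto ptr = current.ptr;
    const end = current.ptr + current.length;
    while (ptr != end && *ptr != id)
        ptr++;
    return ptr - current.ptr;  // == current.length if id was not found
}

void main()
{
    assert(scan("ab#cd", '#') == 2);
    assert(scan("abcd", '#') == 4);  // not found: offset equals length
}
```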

On that note, at one point I was wondering why gdc -O3 didn't generate a "rep scasb" for the inner loop instead of manually incrementing the loop pointer; but a little research revealed that I'm about 6-7 years out of date: since about 2008 gcc's optimizer has not used "rep scasb" because on modern hardware it has atrocious performance -- much worse than a manually-written C loop! So much for "built-in" string processing that used to be touted in the old days. Yet another proof of the rule that overly-complex CPU instruction sets rarely actually perform better.


Anyway, after everything is said and done, a puzzle that used to take about 20 seconds to solve now only takes about 6 seconds. W00t!


T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
November 06, 2014
On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
> 2) Auto-decoding is evil: the solver currently uses a naïve char[]
> representation for the puzzle board, and uses countUntil() extensively
> to locate things on the board. The original code had:
>
> 	auto offset = currentBoard.countUntil(ch);
>
> which, of course, triggers autodecoding (even though there is actually
> no reason to do so: ch is a char, and therefore countUntil could've used
> strchr instead). Simply changing that to:
>
> 	auto offset = currentBoard.representation.countUntil(ch);
>
> gave an additional 20%-30% performance boost.

I've argued this point for over a year, apparently without convincing anyone :-(

BTW, use .byChar. instead of .representation. because the former leaves it typed as 'char' and the latter converts it to 'ubyte'.
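
A sketch of what that looks like (toy data; assumes a std.utf recent enough to have byChar):

```d
// Sketch of the .byChar approach; the board contents are made up.
import std.algorithm : countUntil;
import std.utf : byChar;

void main()
{
    char[] currentBoard = "..#..".dup;
    char ch = '#';
    // byChar iterates the array char-by-char, skipping decoding,
    // and keeps the element type as char rather than ubyte.
    auto offset = currentBoard.byChar.countUntil(ch);
    assert(offset == 2);
}
```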


> 3) However, countUntil remains at the top of the profiler's list of
> hotspots. DMD's optimizer was rather disappointing here, so I looked at
> gdc's output instead. Digging into the disassembly, I saw that it was
> using a foreach loop over the char[], but for some reason even gdc -O3
> failed to simplify that loop significantly (I'm not sure why -- maybe
> what threw off the optimizer is the array indexing + length check
> operation, where in C one would normally just bump the pointer
> instead?). Eventually, I tried writing a manual loop:
>
>              auto ptr = current.ptr;
>              auto c = current.length;
>              while (c > 0 && *ptr != m.id)
>                  ptr++;
>              cert.headOffset = ptr - current.ptr;
>
> This pushed gdc far enough in the right direction that it finally
> produced a 3-instruction inner loop. Strangely enough, the assembly had
> no check for (c > 0)... could it be because there is an assert following
> the loop claiming that that will never happen, therefore gdc elided the
> check? I thought Walter hasn't gotten around to implementing
> optimizations based on assert yet??  Anyway, with this optimization, I
> managed to shave off another 3%-5% of the total running time.

Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-)

November 06, 2014
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via
Digitalmars-d wrote:
> So today, I was playing around with profiling and optimizing my sliding
> block puzzle solver, and found some interesting things:
>
> 1) The GC could use some serious improvement: it just so happens that
> the solver's algorithm only ever needs to allocate memory, never release
> it (it keeps a hash of visited states that only grows, never shrinks).
> The profiler indicated that one of the hotspots is the GC. So I added an
> option to the program to call GC.disable() if a command line option is
> given. The result is a 40%-50% reduction in running time.

We've talked before about keeping statistics of how much memory
has been reclaimed by collections, and use that to influence
whether a collection should be run before going to the OS for
more.  Something like this should definitely get a bugzilla
ticket.

For now, you might want to consider guessing at how much memory
you'll need and calling GC.reserve() as well.  This should
improve the performance of your app even further.
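
For example (the 256 MB figure is just a guess at what the solver might need, not something from the post):

```d
// Sketch: pre-reserve a pool so the GC goes to the OS less often
// during an allocation-heavy run. The size is an illustrative guess.
import core.memory : GC;

void main()
{
    auto reserved = GC.reserve(256 * 1024 * 1024);
    // GC.reserve returns how many bytes were actually obtained,
    // which may be less than requested.

    // ... run the solver here ...
}
```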


> 2) Auto-decoding is evil

Yep, it's the worst.  Most string operations don't even need
auto-decoding to work properly.  I honestly have no idea why we
have this feature.
November 07, 2014
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d wrote:
> 1) The GC could use some serious improvement: it just so happens that
> the solver's algorithm only ever needs to allocate memory, never release
> it (it keeps a hash of visited states that only grows, never shrinks).

When a hash table grows, it periodically reallocates the bucket arrays it uses internally, so garbage to collect appears anyway, even if you only ever add elements.
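
The same effect is easy to see with a plain dynamic array (a made-up demo, not the solver's hash): append-only growth still reallocates the backing store, and every outgrown block becomes garbage:

```d
// Demo: even append-only growth produces garbage, because the backing
// store is periodically reallocated and the old block is abandoned.
void main()
{
    int[] a;
    const(int)* prev = null;
    size_t moves = 0;
    foreach (i; 0 .. 100_000)
    {
        a ~= i;
        if (a.ptr !is prev)  // the array was moved to a new block
        {
            moves++;
            prev = a.ptr;
        }
    }
    assert(moves > 1);  // relocated repeatedly while only ever appending
}
```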

November 07, 2014
On 11/6/14, 7:58 PM, H. S. Teoh via Digitalmars-d wrote:
> [snip]

Hi,

Is the code public? I'd like to port it to other languages and see how they behave, see if this is a general problem or just something specific to D.

Thanks!