November 06, 2014
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
>             auto ptr = current.ptr;
>             auto c = current.length;
>             while (c > 0 && *ptr != m.id)
>                 ptr++;
>             cert.headOffset = ptr - current.ptr;
> 
> This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hasn't gotten around to implementing optimizations based on assert yet??
[...]

Argh... the reason there was no check for c>0 is that the code is wrong: c is never decremented. :-(  Replacing (c > 0) with (c-- > 0) reduced the performance improvement to about 3% instead of 4%-5%. Oh well. :-/
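
For the record, here is a compilable sketch of the corrected scan. The board contents and search character are made up; they stand in for `current` and `m.id` from the snippet above:

	void main() {
		char[] current = "abcXdef".dup; // hypothetical board contents
		char id = 'X';                  // stands in for m.id

		auto ptr = current.ptr;
		auto c = current.length;
		while (c-- > 0 && *ptr != id)   // c is now actually decremented
			ptr++;

		auto headOffset = ptr - current.ptr;
		assert(headOffset == 3);
	}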

	mouth.open(); mouth.insert(foot);


T

-- 
I am a consultant. My job is to make your job redundant. -- Mr Tom
November 06, 2014
On 11/6/2014 3:13 PM, H. S. Teoh via Digitalmars-d wrote:
> 	mouth.open(); mouth.insert(foot);

Now you've done it. You must pull the Stone of Shame!

https://pbs.twimg.com/media/BrENFq3CMAEqFUL.jpg

November 07, 2014
On Thu, Nov 06, 2014 at 03:15:04PM -0800, Walter Bright via Digitalmars-d wrote:
> On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
> >2) Auto-decoding is evil: the solver currently uses a naïve char[] representation for the puzzle board, and uses countUntil() extensively to locate things on the board. The original code had:
> >
> >	auto offset = currentBoard.countUntil(ch);
> >
> >which, of course, triggers autodecoding (even though there is actually no reason to do so: ch is a char, and therefore countUntil could've used strchr instead). Simply changing that to:
> >
> >	auto offset = currentBoard.representation.countUntil(ch);
> >
> >gave an additional 20%-30% performance boost.
> 
> I've argued this point for over a year, apparently without convincing anyone :-(

I dunno, my impression is that some people agree with it, but consider the cost of changing it now prohibitive due to the extensive code breakage it would cause. But the primary obstacle is that Andrei is not convinced, so he's going to veto any change in this direction regardless. You two will have to tussle it out to resolve this. ;-)


> BTW, use .byChar instead of .representation, because the former leaves it typed as 'char' and the latter converts it to 'ubyte'.

I was going to use .byChar, but since I'm doing performance testing on gdc, and (my version of) gdc hasn't caught up to the latest Phobos yet, I have to settle for .representation for now. For my purposes it's harmless, since I'm not actually doing anything with the ubyte representation once countUntil is done; the return value is used to index the original char[].
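
For concreteness, a minimal compilable version of the workaround, using a made-up five-character board:

	import std.algorithm : countUntil;
	import std.string : representation;

	void main() {
		char[] currentBoard = "ab.cd".dup; // hypothetical board
		char ch = '.';

		// .representation reinterprets the char[] as ubyte[], so
		// countUntil compares raw bytes instead of decoding UTF-8
		// code points.
		auto offset = currentBoard.representation.countUntil(ch);
		assert(offset == 2);
		assert(currentBoard[offset] == ch); // index the original char[]
	}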


> >3) However, countUntil remains at the top of the profiler's list of hotspots. DMD's optimizer was rather disappointing here, so I looked at gdc's output instead. Digging into the disassembly, I saw that it was using a foreach loop over the char[], but for some reason even gdc -O3 failed to simplify that loop significantly (I'm not sure why -- maybe what threw off the optimizer is the array indexing + length check operation, where in C one would normally just bump the pointer instead?). Eventually, I tried writing a manual loop:
> >
> >             auto ptr = current.ptr;
> >             auto c = current.length;
> >             while (c > 0 && *ptr != m.id)
> >                 ptr++;
> >             cert.headOffset = ptr - current.ptr;
> >
> >This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hasn't gotten around to implementing optimizations based on assert yet??  Anyway, with this optimization, I managed to shave off another 3%-5% of the total running time.
> 
> Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-)

LOL... I wasn't sure whether said optimizations were going to be in the front end or in the back end only. It might not be a bad idea to put them in the front end where applicable; then all compilers would benefit from them.


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
November 07, 2014
On Thursday, 6 November 2014 at 23:30:56 UTC, Walter Bright wrote:
> On 11/6/2014 3:13 PM, H. S. Teoh via Digitalmars-d wrote:
>> 	mouth.open(); mouth.insert(foot);
>
> Now you've done it. You must pull the Stone of Shame!
>
> https://pbs.twimg.com/media/BrENFq3CMAEqFUL.jpg

The Shame Cube is far more efficient at inflicting shame.

http://stream1.gifsoup.com/view3/2323130/shame-cube-o.gif
November 07, 2014
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote:
> So today, I was playing around with profiling and optimizing my sliding block puzzle solver, and found some interesting things:
[...]

Oh, I forgot to mention something else that might be of interest to people: in the course of optimizing the performance of canFind(), one of the techniques I used was to reduce the frequency with which it's called. One such situation was the duplication between the canMove() method, which checks whether a prospective move is valid, and move(), which performs the actual move (updating the puzzle board).

In more general terms, the same pattern applies to data structures that have one method for checking whether a particular operation is legal, and another method for actually performing the operation. For example, when updating a hash table or binary tree you might do a "find" to see if an entry already exists before you insert a new entry. While check() followed by insert() makes sense logically (first you check, then you proceed if the check is OK), it performs poorly, because quite often there is a fair amount of overlap between the work done by the two methods. The check, for example, may have found the node in which the insert will take place, but since it has to return to the caller first, this information is lost and insert() has to recompute something it should already have known.

There are a few common solutions to this problem:

1) Have a checkAndInsert() method that both checks and performs the insertion if the check passes. While this solves the problem, it's ugly because it conflates two logically distinct operations into one, thus breaking the principle of single responsibility.

A variation on (1) is to pass a flag (ugh) to insert() to do a check
before inserting. It's essentially the same solution with the same
drawbacks.

2) check() returns a node (or iterator, or range, whatever) instead of a bool, which is then passed to insert() so it doesn't have to traverse the structure again. This is a slightly better solution, but it exposes the implementation details of the structure, and besides, why should a method named "check" return an iterator? It also fails to capture other repeated computations; e.g., check() may have computed things like the balance factor of the tree at that point, which insert() could profitably reuse, but an iterator/range doesn't (and shouldn't!) carry that kind of information.

3) check() takes an additional ref parameter in which it stores extra information, which the caller then passes along when it calls insert(). This is also ugly, because it pollutes user code with implementation details. The parameter can be restricted to private use by other methods, so that we don't break encapsulation, but then user code won't be able to benefit from it.

In the end, it occurred to me that the reason insert() can reuse certain things check() has already computed is that those things were computed in the process of *verifying* the validity of the requested operation. Much like verifying a piece of data with a checksum, the checksum computed by the verification function is a proof, or certificate, that the data is valid.

So check() should return not a bool, but an opaque *certificate* object that captures the information check() computed while validating the user-proposed operation. The caller can inspect this certificate to see whether the check passed (a default, "invalid" certificate is returned if check() fails), and the certificate can also carry additional information that insert() can use afterwards.

Then insert() itself gets two overloads: one that takes the normal parameters and (re)computes whatever information is necessary to perform the operation, and another that takes a certificate "certifying" that the given operation has previously been validated, so that insert() can skip recomputing the things check() already computed -- they are captured in the certificate object. Furthermore, being a certificate, it also records the original request; otherwise you could illegally pair an invalid request with the certificate of a different, valid request and cause insert() to malfunction.

So my eventual solution was to change the canMove() method's signature
from:

	bool canMove(Move m);

to:

	MoveCert canMove(Move m);

And then the overloads of move() are:

	void move(Move m) {
		// Unvalidated overload: computes the values needed by
		// the second overload of move(), since they haven't
		// been computed yet.
		MoveCert cert = ... ;

		// The actual implementation is in the second overload;
		// now that we have computed the necessary info to
		// proceed, we share the rest of the implementation with
		// it. Thus there is no needless code duplication.
		move(cert);
	}

	void move(MoveCert cert) {
		// Here, we can use the info stored in cert that has
		// been precomputed by canMove(), thereby avoiding the
		// unnecessary overhead of computing it all over again.
	}

In the specific case of my solver, MoveCert looks like this:

	private struct MoveCert {
		private Move m;			// original "request"
		private ptrdiff_t offset;	// cached result of countUntil()

		// This makes this change transparent to existing
		// callers that expect a bool:
		bool isValid() { ... }
		alias isValid this;
	}

MoveCert.offset caches the value of countUntil() that canMove() has
already computed, so move() can simply look it up instead of recomputing
it.

This particular optimization significantly reduced the number of calls to countUntil(), and resulted in about a 20-30% performance improvement. But the best part is that it did so while arguably making the code *prettier* and without breaking encapsulation -- quite the opposite of typical optimization techniques!

(Having said that, this technique does have to be used with care... while working on this, I noticed that another value was being recomputed by move() that canMove() already computes. However, adding that value to MoveCert actually *degraded* performance -- because it made MoveCert too large, and the overhead of passing it back to the caller from canMove() and then passing it to move() afterwards outweighed the cost of just recomputing the value in move(). So the things worth caching in MoveCert are those that are expensive to recompute, not those that are trivially recomputed. In this case, the value in question was calculated by multiplying a few integers. Evidently, performing this multiplication the second time is cheaper than the cost of making MoveCert too large and needing more overhead to pass around. Profiling is a must in deciding just how much to put in the certificate object.)
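
To illustrate the idiom outside the solver, here is a self-contained sketch of the same check/commit pairing on a toy sorted-array set. All names here are hypothetical; this is not the solver's actual code:

	import std.range : assumeSorted;

	struct InsertCert {
		private int value;  // the original request
		private size_t pos; // insertion point found by check()
		bool isValid;       // default-constructed cert is invalid
		alias isValid this;
	}

	struct SortedSet {
		int[] data;

		// check(): validates the request and captures the search
		// work in a certificate.
		InsertCert check(int value) {
			auto pos = assumeSorted(data).lowerBound(value).length;
			if (pos < data.length && data[pos] == value)
				return InsertCert.init; // duplicate: invalid
			return InsertCert(value, pos, true);
		}

		// Validated overload: reuses the position found by
		// check() instead of searching again.
		void insert(InsertCert cert) {
			assert(cert.isValid);
			data = data[0 .. cert.pos] ~ cert.value
				~ data[cert.pos .. $];
		}
	}

	void main() {
		SortedSet s;
		foreach (v; [3, 1, 2, 2]) {
			auto cert = s.check(v); // check once...
			if (cert.isValid)
				s.insert(cert); // ...commit without re-searching
		}
		assert(s.data == [1, 2, 3]);
	}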


T

-- 
If I were two-faced, would I be wearing this one? -- Abraham Lincoln
November 07, 2014
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
> 1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks).  The profiler indicated that one of the hotspots is the GC. So I added an option to the program to call GC.disable() if a command line option is given. The result is a 40%-50% reduction in running time.
[...]

Yet more GC drama unfolded tonight: while testing my new GC-disabling option on several other puzzles, one instance ate up my PC's RAM at an astonishing rate, causing the system to almost freeze up from the ensuing thrashing. A little investigation revealed that there are several instances of excessive GC load caused by unnecessary (re)allocations.

One case was a function that constructs and returns a new array each call -- unnecessary because after the resulting array is traversed, it is discarded. Perfect candidate for a reusable buffer. After implementing this, the rate of memory consumption is far less frightening, but still far too rapid compared to when the GC is enabled, so clearly, *something* else is still doing way more allocations than necessary.
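
The buffer-reuse idiom itself is simple; here's a sketch with hypothetical names (the real function belongs to the solver):

	// Instead of returning a freshly allocated array on every call,
	// fill a buffer that the caller keeps alive across calls.
	int[] collectCandidates(int seed, ref int[] buf) {
		buf.length = 0;         // drop contents, keep the allocation
		buf.assumeSafeAppend(); // let ~= overwrite the old tail
		foreach (i; 0 .. 10)
			buf ~= seed + i;
		return buf;             // a slice of the reused storage
	}

	void main() {
		int[] buf;
		foreach (seed; 0 .. 100_000) {
			auto r = collectCandidates(seed, buf);
			assert(r.length == 10); // after warm-up, no new allocations
		}
	}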

A little more investigation revealed the culprit: a queue object implemented with subarrays, where the subarrays were being used as stacks by appending with ~= and popping by decrementing .length -- a dangerous combination that causes excessive reallocations! After inserting assumeSafeAppend() in the right place, the memory consumption rate dropped to more reasonable levels. Memory is still being consumed visibly faster than when the GC is enabled, so probably there are other places with excessive allocations, but this particular change stood out because, after adding assumeSafeAppend() to the code, the solver's performance *even with GC enabled* improved by about 40%-50%. So this explains why calling GC.disable() had such a pronounced effect before: the GC was too busy cleaning up after unnecessarily allocation-hungry code!

The moral of the story is that code like the following should raise big red warning flags:

	struct MyData(T) {
		T[] data;
		T pop() {
			...
			data.length--; // <--- yikes
			...
		}
		void push(T t) {
			...
			data ~= t; // <--- yikes
			...
		}
	}

The easy fix (though not a complete one) is to insert a call to assumeSafeAppend before `data ~= t`. That suppresses the majority of the unnecessary reallocations, while leaving the necessary ones intact (e.g. when the GC block runs out of space to store the array).
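
Concretely, push() in the sketch above becomes:

	void push(T t) {
		...
		data.assumeSafeAppend(); // reclaim the tail vacated by pop()
		data ~= t;               // extends in place when capacity allows
		...
	}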

A more complete fix, of course, is to use a linked-list of presized blocks, so that the arrays don't keep getting moved around in memory needlessly.
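
One possible shape for that more complete fix -- a sketch only, with an arbitrary block size and no recycling of emptied blocks:

	struct BlockStack(T, size_t blockSize = 1024) {
		static struct Block {
			T[blockSize] items;
			Block* prev;
		}
		private Block* top;
		private size_t used; // slots used in the top block

		void push(T t) {
			if (top is null || used == blockSize) {
				auto b = new Block; // presized; never moves
				b.prev = top;
				top = b;
				used = 0;
			}
			top.items[used++] = t;
		}

		T pop() { // precondition: not empty (sketch only)
			if (used == 0) { // top block empty; step back
				top = top.prev;
				used = blockSize;
			}
			return top.items[--used];
		}
	}

	void main() {
		BlockStack!int s;
		foreach (i; 0 .. 3000)
			s.push(i);
		foreach_reverse (i; 0 .. 3000)
			assert(s.pop() == i);
	}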

So while the GC does need to be improved, in this particular case a big part of the problem is poorly-written user code (by yours truly :-P), not so much the GC itself. ;-)


T

-- 
Skill without imagination is craftsmanship and gives us many useful objects such as wickerwork picnic baskets.  Imagination without skill gives us modern art. -- Tom Stoppard
November 07, 2014
On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>
> A little more investigation revealed the culprit: a queue object
> implemented with subarrays, and subarrays were being used as stacks by
> appending with ~= and popping by decrementing .length. A dangerous
> combination that causes excessive reallocations! After inserting
> assumeSafeAppend() in the right place, the memory consumption rate
> dropped to more reasonable levels. It's still visibly faster than when
> GC is enabled, so probably there are other places with excessive
> allocations, but this particular change stood out because after adding
> assumeSafeAppend() to the code, the solver's performance *even with GC
> enabled* improved by about 40%-50%. So this explains why calling
> GC.disable() had such a pronounced effect before: the GC is too busy
> cleaning up after unnecessarily allocation-hungry code!

In D1 this would have worked fine. A "queue array" wrapper that works just like a normal array, but calls assumeSafeAppend() whenever its length is decremented, may be a good candidate for Phobos.
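
Something like this, perhaps (an untested sketch, not actual Phobos code):

	struct QueueArray(T) {
		private T[] data;

		void push(T t) { data ~= t; }

		T pop() {
			auto t = data[$ - 1];
			data.length--;           // shrink...
			data.assumeSafeAppend(); // ...and reassert appendability
			return t;
		}

		@property size_t length() const { return data.length; }
	}

	unittest {
		QueueArray!int q;
		q.push(1);
		q.push(2);
		assert(q.pop() == 2 && q.length == 1);
	}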

-Steve

November 07, 2014
On 11/7/14 8:43 AM, Steven Schveighoffer wrote:
> On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>>
>> A little more investigation revealed the culprit: a queue object
>> implemented with subarrays, and subarrays were being used as stacks by
>> appending with ~= and popping by decrementing .length. A dangerous
>> combination that causes excessive reallocations! After inserting
>> assumeSafeAppend() in the right place, the memory consumption rate
>> dropped to more reasonable levels. It's still visibly faster than when
>> GC is enabled, so probably there are other places with excessive
>> allocations, but this particular change stood out because after adding
>> assumeSafeAppend() to the code, the solver's performance *even with GC
>> enabled* improved by about 40%-50%. So this explains why calling
>> GC.disable() had such a pronounced effect before: the GC is too busy
>> cleaning up after unnecessarily allocation-hungry code!
>
> In D1 this would have worked fine. A "queue array" wrapper may be a good
> candidate for Phobos that works just like a normal array but calls
> assume safe append whenever decrementing length.

Also to note -- even with this, you will create garbage. As the array grows, it may have to relocate to a new block. The old block would be garbage.

Once you get large enough, there is a chance that it will just extend into new pages. But if it ever has to reallocate a very large array, then you just created a very large piece of garbage that will not be cleaned up.
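
The relocation is easy to observe directly (a toy demo, not solver code):

	void main() {
		int[] a = [1, 2, 3];
		auto oldPtr = a.ptr;
		foreach (i; 0 .. 100_000)
			a ~= i;
		// The array outgrew its original block at least once along
		// the way; every abandoned block is garbage.
		assert(a.ptr !is oldPtr);
	}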

I'm wondering if said "queue array" should have an option to forcefully delete the old array; in that case, we'd have to make sure the data cannot be accessed directly, or else mark it as unsafe.

-Steve

November 07, 2014
On Fri, Nov 07, 2014 at 09:28:56AM +0000, thedeemon via Digitalmars-d wrote:
> On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d wrote:
> >1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks).
> 
> When you grow a hash table it periodically reallocates bucket arrays it uses internally, so some garbage to collect appears anyway even if you only add elements.

I realize that; it's just that even after accounting for it, the memory consumption rate was much higher than expected, and sure enough, further investigation revealed unnecessary GC load caused by reallocating arrays where (re)using an existing one would suffice.

Take the function that used to return a newly allocated array on every call (and there are hundreds of thousands of calls in this particular test case): after implementing buffer reuse, it reallocates the buffer only when the array runs out of space -- just 3-4 times for the entire run.

I'm not sure what the overhead for the hash table would be, but I'm expecting it shouldn't be more than, say, 50-100 reallocations of the bucket array (I didn't measure this). So it shouldn't add up to *that* much.


T

-- 
There is no gravity. The earth sucks.
November 07, 2014
On Fri, Nov 07, 2014 at 08:46:29AM -0500, Steven Schveighoffer via Digitalmars-d wrote:
> On 11/7/14 8:43 AM, Steven Schveighoffer wrote:
> >On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> >>
> >>A little more investigation revealed the culprit: a queue object implemented with subarrays, and subarrays were being used as stacks by appending with ~= and popping by decrementing .length. A dangerous combination that causes excessive reallocations! After inserting assumeSafeAppend() in the right place, the memory consumption rate dropped to more reasonable levels. It's still visibly faster than when GC is enabled, so probably there are other places with excessive allocations, but this particular change stood out because after adding assumeSafeAppend() to the code, the solver's performance *even with GC enabled* improved by about 40%-50%. So this explains why calling GC.disable() had such a pronounced effect before: the GC is too busy cleaning up after unnecessarily allocation-hungry code!
> >
> >In D1 this would have worked fine. A "queue array" wrapper may be a good candidate for Phobos that works just like a normal array but calls assume safe append whenever decrementing length.
> 
> Also to note -- even with this, you will create garbage. As the array grows, it may have to relocate to a new block. The old block would be garbage.
> 
> Once you get large enough, there is a chance that it will just extend into new pages. But if it ever has to reallocate a very large array, then you just created a very large piece of garbage that will not be cleaned up.

Good point. Maybe that's where the remaining garbage is coming from: remnants of reallocated arrays. Some of the subarrays I'm using do grow pretty big, so things could add up.


> I'm wondering if said "queue array" should have an option to forcefully delete the old array. In which case, we'd have to make sure the data cannot be accessed directly, or mark it as unsafe.
[...]

Well, if we're going to manually manage the array's memory, then we're back in malloc/free territory along with all of their pitfalls. :-) So I wouldn't sweat it too much -- I'd make the queue a non-copyable reference type (so there would be no aliasing of the queue array) and leave it at that.

On another note, I wonder if reducing the frequency of GC collections might help keep memory under control while minimizing the performance hit. As a first stab, I could manually trigger a collection every, say, 100,000 iterations of the main loop -- about 10 or so collections per run -- which should keep memory usage bounded while avoiding the excessively frequent collections that degrade performance.
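
Something like this, maybe (the allocation in the loop is just a stand-in for the solver's real work, and the 100,000 figure is the one from above, to be tuned by profiling):

	import core.memory : GC;

	void main() {
		GC.disable();             // no automatic collections
		scope (exit) GC.enable();

		foreach (iter; 0 .. 1_000_000) {
			auto tmp = new int[](16); // per-iteration garbage
			tmp[0] = cast(int) iter;

			if ((iter + 1) % 100_000 == 0)
				GC.collect(); // manual, infrequent collection
		}
	}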


T

-- 
"I speak better English than this villain Bush" -- Mohammed Saeed al-Sahaf, Iraqi Minister of Information