November 07, 2014
On Fri, Nov 07, 2014 at 09:30:47AM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
> On another note, I wonder if reducing the frequency of GC collections might help keep memory under control while minimizing the performance hits. As a first stab, I could try to manually trigger a collection every, say, 100,000 iterations of the main loop -- about 10 or so collections per run -- which should keep memory usage under control while avoiding excessively frequent collections that degrade performance.
[...]

OK, I did a little test where I disabled the GC at the beginning of the program, then triggered a collection every 250,000 iterations. As a rough visual aid, I printed a message before and after each collection so that I could gauge how long each collection was taking. It appears that as the program progresses, collections take longer -- expected, because the hash table is growing rapidly, so the marking phase probably takes longer (I didn't measure this in detail, though).
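
In code, the arrangement looks roughly like this (a sketch only; solverStep is a hypothetical stand-in for the real per-iteration work):

	import core.memory : GC;
	import std.stdio : stderr;

	bool solverStep() { return false; }  // stub standing in for the real solver

	void main() {
	    GC.disable();               // no automatic collections from here on
	    size_t iter = 0;
	    while (solverStep()) {
	        if (++iter % 250_000 == 0) {
	            stderr.writeln("collecting...");
	            GC.collect();       // explicit collection at a known point
	            stderr.writeln("collection done");
	        }
	    }
	}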

Overall, a puzzle that took ~7.0 seconds to solve now takes ~7.6 seconds, which is about 8% performance degradation. That's pretty high, considering that we already know that most of the new allocations between collections are live. Is there a way to tell the GC to ignore the hash table buckets for collection purposes? It would seem that most of the work is wasted there, since we already know they will be live until the end of the program.

Another alternative is to manually call GC.free on reallocated arrays in the two places I've identified, and turn off collections completely. This could be a better approach in this case, since we already know where the reallocations happen.
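
Roughly, at each reallocation site the idea would be something like this (a sketch only -- it's safe only if nothing else still references the old block, and only if the slice begins at the start of its allocation, since GC.free wants the base pointer):

	import core.memory : GC;

	void append(T)(ref T[] data, T t) {
	    auto oldPtr = data.ptr;
	    data ~= t;               // the append may move the array
	    if (data.ptr !is oldPtr)
	        GC.free(oldPtr);     // eagerly release the abandoned old block
	}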

Another GC-related question: if the heap grows too large to fit in RAM, and a GC collection cycle is triggered, will this cause thrashing? The GC will have to scan through all the blocks, causing *all* of them to be paged in and paged out again.

Of course, this little side-exercise is more of a curiosity than anything else, since I'm planning to replace the hash table with something else that manually manages memory (and page to disk as necessary) anyway, so any optimizations along these lines will be moot. But it's interesting to explore how far one can go with the current system.


T

-- 
It is not the employer who pays the wages. Employers only handle the money. It is the customer who pays the wages. -- Henry Ford
November 07, 2014
On Fri, Nov 07, 2014 at 10:01:41AM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
> OK, I did a little test where I disabled the GC at the beginning of the program, then triggered a collection every 250,000 iterations. As a rough visual aid, I printed a message before and after each collection so that I could gauge how long each collection was taking. It appears that as the program progresses, collections take longer -- expected, because the hash table is growing rapidly, so the marking phase probably takes longer (I didn't measure this in detail, though).
> 
> Overall, a puzzle that took ~7.0 seconds to solve now takes ~7.6 seconds, which is about 8% performance degradation. That's pretty high, considering that we already know that most of the new allocations between collections are live.
[...]

P.S. Note, however, that with the GC fully enabled, the average running time is about 9.8 seconds (the 7.0s figure is with the GC completely disabled and no collections run, so memory is never reclaimed). That's still a ~22% improvement over the normal GC behaviour, while preventing memory leaks, even if 8% is a little high for a scenario where we already know most of the memory is still live. (One could see an argument for a generational collector here!)

So arguably, the frequency of the GC's collections could be tweaked, as Steven mentioned. We could be getting a ~22% performance improvement for free! (But this is of course highly dependent on system-specific and program-specific measurements, so I don't know whether it would actually improve performance in general. Still, it's an interesting area to explore in terms of improving the performance of the current GC. Where's Andrei with his nice shiny yet-to-be-materialized new GC when we need him??! :-P)


T

-- 
"640K ought to be enough" -- Bill G. (allegedly), 1984. "The Internet is not a primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on Microsoft's strategy" -- Bill G., 1999.
November 07, 2014
On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d wrote: [...]
> Is the code public? I'd like to port it to other languages and see how they behave, see if this is a general problem or just something specific to D.
[...]

I haven't posted the code anywhere yet, but I could certainly send you a copy if you want.

However, the test input I'm using is not public, so you'll have to craft your own input files for testing purposes. The public input files I do have are too simple to pose a significant challenge to the solver (plus, they default to BFS instead of A*, so you might get different results from what I described here).


T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day and hour soon are coming / When all the IT folks say "Gosh!" / It isn't from a clever lawsuit / That Windowsland will finally fall, / But thousands writing open source code / Like mice who nibble through a wall. -- The Linux-nationale by Greg Baker
November 07, 2014
On 11/7/14, 4:16 PM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d wrote:
> [...]
>> Is the code public? I'd like to port it to other languages and see how
>> they behave, see if this is a general problem or just something
>> specific to D.
> [...]
>
> I haven't posted the code anywhere yet, but I could certainly send you a
> copy if you want.
>
> However, the test input I'm using is not public, so you'll have to craft
> your own input files for testing purposes. The public input files I do
> have are too simple to pose a significant challenge to the solver (plus,
> they default to BFS instead of A*, so you might get different results
> from what I described here).
>
>
> T
>

Then I don't think I'll be able to do any useful benchmark/profile with that. Thanks anyway :-)
November 07, 2014
07-Nov-2014 09:05, H. S. Teoh via Digitalmars-d wrote:
> On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote:
> [...]
>> 1) The GC could use some serious improvement: it just so happens that
>> the solver's algorithm only ever needs to allocate memory, never
>> release it (it keeps a hash of visited states that only grows, never
>> shrinks).  The profiler indicated that one of the hotspots is the GC.
>> So I added an option to the program to call GC.disable() if a command
>> line option is given. The result is a 40%-50% reduction in running
>> time.
> [...]
>
> Yet more GC drama unfolded tonight: while testing my new GC-disabling
> option on several other puzzles, one instance ate up my PC's RAM at an
> astonishing rate, causing the system to almost freeze up from the
> ensuing thrashing. A little investigation revealed that there are
> several instances of excessive GC load caused by unnecessary
> (re)allocations.


That's the problem with profilers:
	they say what takes time but not why :)

Often I find myself looking at memcpy at the top of the list, so the obvious "textbook" answer is to optimize memcpy ;) Instead, it should be read as "you seem to be doing excessive copying of data".

> The moral of the story, is that code like the following should raise big
> red warning flags:
>
> 	struct MyData(T) {
> 		T[] data;
> 		T pop() {
> 			...
> 			data.length--; // <--- yikes
> 			...
> 		}
> 		void push(T t) {
> 			...
> 			data ~= t; // <--- yikes
> 			...
> 		}
> 	}

A performance bug well known in the inner circles. Sadly, most D programmers are bound to hit it once in a while.
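
The canonical workaround on the D side, assuming the array uniquely owns its buffer, is assumeSafeAppend -- a sketch of the pop case above:

	struct MyData(T) {
	    T[] data;
	    T pop() {
	        auto t = data[$ - 1];
	        data.length--;            // shrink in place
	        data.assumeSafeAppend();  // promise the runtime the old tail
	                                  // is dead, so the next append can
	                                  // reuse the space instead of
	                                  // reallocating
	        return t;
	    }
	}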

Memory safety "by default" has its costs.
Though I'd just call mmap: far more predictable than these pesky memory allocators, plus there's no need to reallocate if the reserved range is large enough.
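
Something like this on POSIX (a sketch: assumes a 64-bit system and MAP_ANON availability, with error handling elided):

	import core.sys.posix.sys.mman;

	void main() {
	    // Reserve a large range of address space up front; the OS
	    // commits physical pages lazily, on first touch.
	    enum maxBytes = 1UL << 33;  // 8 GiB of virtual space (arbitrary)
	    void* base = mmap(null, maxBytes, PROT_READ | PROT_WRITE,
	                      MAP_PRIVATE | MAP_ANON, -1, 0);
	    assert(base != MAP_FAILED);
	    // Grow the container by bumping an end pointer inside the
	    // reservation; nothing ever has to be reallocated or copied.
	}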


-- 
Dmitry Olshansky
November 07, 2014
On Sat, Nov 08, 2014 at 01:58:31AM +0300, Dmitry Olshansky via Digitalmars-d wrote:
> 07-Nov-2014 09:05, H. S. Teoh via Digitalmars-d wrote:
[...]
> >Yet more GC drama unfolded tonight: while testing my new GC-disabling option on several other puzzles, one instance ate up my PC's RAM at an astonishing rate, causing the system to almost freeze up from the ensuing thrashing. A little investigation revealed that there are several instances of excessive GC load caused by unnecessary (re)allocations.
> 
> That's the problem with profilers:
> 	they say what takes time but not why :)

True! But it's better than the stereotypical C veteran's approach of "I know from 50 years' experience (aka personal bias) that X is faster than Y because it's more terse, so I'm gonna optimize all my code into unreadable form from the get-go even though 99% of it isn't the real hotspot anyway."


> Often I find myself looking at memcpy at the top of the list, so the obvious "textbook" answer is to optimize memcpy ;) Instead, it should be read as "you seem to be doing excessive copying of data".
[...]

Well, when I saw countUntil at the top of the list, it wasn't immediately obvious *why* it was at the top of the list. In the grand scheme of things, it's just a "small cog" in a large machinery filled with many other things that could easily be considered far better candidates for performance bottlenecks. And while optimizing countUntil itself did yield significant improvement, a deeper investigation led to the realization that the underlying logic of the code was causing it to call countUntil more often than it needs to. Fixing this issue drastically reduced the number of calls to countUntil, which yielded no small overall improvement.
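
To illustrate the kind of thing I mean (hypothetical names only, not my actual code -- board, marker, Move, and apply are stand-ins):

	import std.algorithm : countUntil;

	alias Move = int;                        // stand-in type
	void apply(Move move, ptrdiff_t idx) {}  // stub

	void slowVersion(int[] board, int marker, Move[] moves) {
	    foreach (move; moves) {
	        // The search inputs never change inside the loop, yet the
	        // scan is redone on every iteration.
	        auto idx = board.countUntil(marker);
	        apply(move, idx);
	    }
	}

	void fastVersion(int[] board, int marker, Move[] moves) {
	    auto idx = board.countUntil(marker); // hoisted out of the loop
	    foreach (move; moves)
	        apply(move, idx);
	}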

In any case, the old optimization adage applies: you should optimize the underlying algorithms (i.e., the big-O's) before you spend time optimizing the low-level primitives (the scalar factors). No matter how much you optimize the heck out of the scalar factor, an O(N^2) algorithm is still an O(N^2) algorithm. As long as the algorithm remains O(N^2), you aren't going to get much further.
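
To make the adage concrete in D terms (a contrived sketch, not the solver code -- keys, queries, and process are hypothetical stand-ins): replacing repeated linear scans with a one-time associative-array index changes the complexity class, which no amount of scalar tuning can do:

	import std.algorithm : countUntil;

	void process(string key, size_t idx) {}  // stub

	// O(N^2) overall: one linear scan per query.
	void linearScans(string[] keys, string[] queries) {
	    foreach (q; queries) {
	        auto idx = keys.countUntil(q);
	        if (idx >= 0)
	            process(q, idx);
	    }
	}

	// Roughly O(N) overall: build the index once, then O(1) lookups.
	void indexedLookups(string[] keys, string[] queries) {
	    size_t[string] indexOf;
	    foreach (i, key; keys)
	        indexOf[key] = i;
	    foreach (q; queries)
	        if (auto p = q in indexOf)
	            process(q, *p);
	}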


T

-- 
Life is unfair. Ask too much from it, and it may decide you don't deserve what you have now either.
November 08, 2014
On 11/7/2014 2:58 PM, Dmitry Olshansky wrote:
> That's the problem with profilers:
>      they say what takes time but not why :)
>
> Often I find myself looking at memcpy at the top of the list, so the obvious
> "textbook" answer is to optimize memcpy ;) Instead, it should be read as "you
> seem to be doing excessive copying of data".

dmd's profiler will give you a tree showing who called each function how many times. This helps a lot with the "why" question.

November 08, 2014
On Fri, Nov 07, 2014 at 04:22:49PM -0800, Walter Bright via Digitalmars-d wrote:
> On 11/7/2014 2:58 PM, Dmitry Olshansky wrote:
> >That's the problem with profilers:
> >     they say what takes time but not why :)
> >
> >Often I find myself looking at memcpy at the top of the list, so the obvious "textbook" answer is to optimize memcpy ;) Instead, it should be read as "you seem to be doing excessive copying of data".
> 
> dmd's profiler will give you a tree showing who called each function how many times. This helps a lot with the "why" question.

gprof also shows that in the section section of its output.

But speaking of which, I found dmd -profile's output in trace.log a little difficult to understand because of the lack of self-documenting headers in the first section. gprof, for example, produces nicely-labelled output that describes what the data means. Would you accept a PR along these lines? (Or is there a post-processing step that I'm missing?)

Also, in the second section, I'm getting some negative numbers for the call times, which seem to indicate integer overflow? This basically destroys the usefulness of dmd -profile for my test cases, since most of the interesting cases for me are those with long running times, which for sure will run into integer overflow at the sample rate that dmd -profile is using, causing the second section to be essentially randomly sorted.


T

-- 
He who does not appreciate the beauty of language is not worthy to bemoan its flaws.
November 08, 2014
On Fri, Nov 07, 2014 at 04:41:27PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
> gprof also shows that in the section section of its output.
[...]
                               ^^^^^^^
Argh, I mean *second* section (which, incidentally, corresponds with the
first section of trace.log in the output generated by dmd -profile).
Dyslexic today. :-(


T

-- 
Study gravitation, it's a field with a lot of potential.
November 08, 2014
On 11/7/2014 4:41 PM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Nov 07, 2014 at 04:22:49PM -0800, Walter Bright via Digitalmars-d wrote:
>> On 11/7/2014 2:58 PM, Dmitry Olshansky wrote:
>>> That's the problem with profilers:
>>>      they say what takes time but not why :)
>>>
>>> Often I find myself looking at memcpy at the top of the list, so the
>>> obvious "textbook" answer is to optimize memcpy ;) Instead, it should
>>> be read as "you seem to be doing excessive copying of data".
>>
>> dmd's profiler will give you a tree showing who called each function
>> how many times. This helps a lot with the "why" question.
>
> gprof also shows that in the section section of its output.
>
> But speaking of which, I found dmd -profile's output in trace.log a
> little difficult to understand because of the lack of self-documenting
> headers in the first section. gprof, for example, produces
> nicely-labelled output that describes what the data means. Would you
> accept a PR along these lines?

I don't really know what you have in mind, so I'll have to say "it depends". Keep in mind that subsequent runs of the profiler parse the previous output file and add it to the results.


> Also, in the second section, I'm getting some negative numbers for the
> call times, which seem to indicate integer overflow? This basically
> destroys the usefulness of dmd -profile for my test cases, since most of
> the interesting cases for me are those with long running times, which
> for sure will run into integer overflow at the sample rate that dmd
> -profile is using, causing the second section to be essentially randomly
> sorted.

Yeah, a negative number is likely an overflow. Reduce the test case!