Reserving capacity in associative arrays
February 15, 2016
Is there a way to reserve capacity in associative arrays? In some programs I've been writing I've been getting reasonable performance up to about 10 million entries, but beyond that performance is impacted considerably (say, 30 million or 50 million entries). GC stats (via the "--DRT-gcopt=profile:1" option) indicate dramatic increases in gc time, which I'm assuming comes from resizing the underlying hash table. I'm guessing that by preallocating a large size the performance degradation would not be quite so dramatic. The underlying implementation of associative arrays appears to take an initial number of buckets, and there's a private resize() method, but it's not clear if there's a public way to use these.

--Jon
February 15, 2016
On Monday, 15 February 2016 at 03:22:44 UTC, Jon D wrote:
> Is there a way to reserve capacity in associative arrays? In some programs I've been writing I've been getting reasonable performance up to about 10 million entries, but beyond that performance is impacted considerably (say, 30 million or 50 million entries). GC stats (via the "--DRT-gcopt=profile:1" option) indicate dramatic increases in gc time, which I'm assuming comes from resizing the underlying hash table. I'm guessing that by preallocating a large size the performance degradation would not be quite so dramatic. The underlying implementation of associative arrays appears to take an initial number of buckets, and there's a private resize() method, but it's not clear if there's a public way to use these.
>
> --Jon

Maybe try using this: http://code.dlang.org/packages/aammm
February 15, 2016
On Monday, 15 February 2016 at 05:29:23 UTC, sigod wrote:
> On Monday, 15 February 2016 at 03:22:44 UTC, Jon D wrote:
>> Is there a way to reserve capacity in associative arrays?
>> [snip]
>
> Maybe try using this: http://code.dlang.org/packages/aammm

Thanks, I wasn't aware of this package. I'll give it a try.

--Jon
February 16, 2016
On 2/14/16 10:22 PM, Jon D wrote:
> Is there a way to reserve capacity in associative arrays? In some
> programs I've been writing I've been getting reasonable performance up
> to about 10 million entries, but beyond that performance is impacted
> considerably (say, 30 million or 50 million entries). GC stats (via the
> "--DRT-gcopt=profile:1" option) indicate dramatic increases in gc time,
> which I'm assuming comes from resizing the underlying hash table. I'm
> guessing that by preallocating a large size the performance degradation
> would not be quite so dramatic. The underlying implementation of
> associative arrays appears to take an initial number of buckets, and
> there's a private resize() method, but it's not clear if there's a
> public way to use these.

There is not a public way to access these methods unfortunately.

It would be a good addition to druntime I believe.

Recently, I added a clear method to the AA, which does not reduce capacity. So if you frequently build large AAs, and then throw them away, you could instead reuse the memory.

I would caution to be sure of this cause, however, before thinking it would solve the problem. The AA not only uses an array for buckets, but allocates a memory location for each element as well. I'm often wrong when I assume what the problem is when it comes to GC issues...
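A minimal sketch of that reuse pattern (sizes are hypothetical; assumes a druntime recent enough to have the built-in clear):

```d
import std.conv : to;

void main()
{
    size_t[string] counts;

    foreach (batch; 0 .. 3)
    {
        foreach (i; 0 .. 100_000)
            counts[i.to!string]++;

        // ... use counts ...

        counts.clear();  // removes all entries but keeps the bucket
                         // array, so the next batch reuses the memory
    }
    assert(counts.length == 0);
}
```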

-Steve
February 16, 2016
On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
> There is not a public way to access these methods unfortunately.
>
> It would be a good addition to druntime I believe.
>
> Recently, I added a clear method to the AA, which does not reduce capacity. So if you frequently build large AAs, and then throw them away, you could instead reuse the memory.
>
> I would caution to be sure of this cause, however, before thinking it would solve the problem. The AA not only uses an array for buckets, but allocates a memory location for each element as well. I'm often wrong when I assume what the problem is when it comes to GC issues...
>
> -Steve

After reading the topic I've added this enhancement proposal, though I'm not quite sure if it's possible:

https://issues.dlang.org/show_bug.cgi?id=15682

The idea is to concatenate smaller AAs into the destination.
February 16, 2016
On Tuesday, 16 February 2016 at 17:05:11 UTC, Basile B. wrote:
> On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
>> There is not a public way to access these methods unfortunately.
>>
>> It would be a good addition to druntime I believe.
>>
>> -Steve
>
> After reading the topic I've added this enhancement proposal, though I'm not quite sure if it's possible:
>
> https://issues.dlang.org/show_bug.cgi?id=15682
>
> The idea is to concatenate smaller AAs into the destination.

There is also this: https://issues.dlang.org/show_bug.cgi?id=2504
February 16, 2016
On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
> On 2/14/16 10:22 PM, Jon D wrote:
>> Is there a way to reserve capacity in associative arrays?
>> [snip]
>> The underlying implementation of associative arrays
>> appears to take an initial number of buckets, and there's
>> a private resize() method, but it's not clear if there's a
>> public way to use these.
>
> There is not a public way to access these methods unfortunately.
>
> It would be a good addition to druntime I believe.
>
> Recently, I added a clear method to the AA, which does not reduce capacity. So if you frequently build large AAs, and then throw them away, you could instead reuse the memory.
>
My programs build AAs lasting the lifetime of the program.
>
> I would caution to be sure of this cause, however, before thinking it would solve the problem. The AA not only uses an array for buckets, but allocates a memory location for each element as well. I'm often wrong when I assume what the problem is when it comes to GC issues...
>
Completely agree. After posting I decided to take a more methodical look. Not finished yet, but I can share part of it. The key finding so far is a noticeable step function in GC costs as AA size grows (likely not a surprise).

My programs work with large data sets. Size is open-ended; what I'm trying to do is get an idea of the data set sizes they will handle reasonably. For purposes of illustration, word count is a reasonable proxy for what I'm doing. It was in this context that I saw a significant performance drop-off after 'size_t[string]' AAs reached about 10 million entries.

I've started measuring with a simple program. Basically:

    import std.datetime : StopWatch;
    import std.random : uniform;

    StopWatch sw;
    sw.start;
    size_t[size_t] counts;
    foreach (i; 0 .. iterations)
        counts[uniform(0, uniqMax)]++;  // random keys in [0, uniqMax)
    sw.stop;

The same measurements are made with string keys ('size_t[string]'). 'iterations' and 'uniqMax' are varied between runs. GC stats are printed (via "--DRT-gcopt=profile:1"), plus timing and AA size. (Runs use LDC 17, release-mode compiles, on a fast 16GB MacBook.)
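For concreteness, the string-key variant might look something like this (a sketch with assumed values for 'iterations' and 'uniqMax'; the to!string call is what creates a short-lived temporary string key on each iteration):

```d
import std.conv : to;
import std.random : uniform;

void main()
{
    enum iterations = 100_000;  // assumed values for illustration
    enum uniqMax = 10_000;

    size_t[string] counts;
    foreach (i; 0 .. iterations)
        counts[uniform(0, uniqMax).to!string]++;  // temporary string key

    assert(counts.length <= uniqMax);
}
```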

For the integer-as-key case ('size_t[size_t]'), there are notable jumps in GC total time and GC max pause time as AA size crosses specific thresholds. This makes sense, as the AA needs to grow. Approximate steps:

    | entries | gc_total (ms) | gc_max_pause (ms) |
    |---------+---------------+-------------------|
    | 2M      |            30 |                60 |
    | 4M      |           200 |               100 |
    | 12M     |           650 |               330 |
    | 22M     |          1650 |               750 |
    | 44M     |          5300 |              3200 |

The iteration count didn't matter, and GC total time and GC max pause time were largely flat between these jumps.

This suggests that AA resizing is the likely driver, and that preallocating a large size might address it.

To the point about being sure about the cause: my programs use strings as keys, not integers. The performance drop-off with strings was quite a bit more significant than with integers. That analysis seems trickier, and I'm not done with it yet. There's different memory allocation, and perhaps effects from creating short-lived temporary strings to test AA membership. It could easily be that string use, or the combination of AAs with string keys, is the larger effect.

The other thing that jumps out from the table is that the GC max pause time reaches multiple seconds. That's not an issue for my tools, which aren't interactive at those points, but it would be a significant issue for many interactive apps.

--Jon
February 16, 2016
On Tue, Feb 16, 2016 at 07:34:07PM +0000, Jon D via Digitalmars-d-learn wrote:
> On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
> >On 2/14/16 10:22 PM, Jon D wrote:
> >>Is there a way to reserve capacity in associative arrays?
> >>[snip]
> >>The underlying implementation of associative arrays appears to take
> >>an initial number of buckets, and there's a private resize() method,
> >>but it's not clear if there's a public way to use these.

Rehashing (aa.rehash) would resize the number of buckets, but if you don't already have the requisite number of keys, it wouldn't help.
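For reference, a rehash call on a built-in AA looks like this (a toy sketch; rehash reorganizes the buckets for the keys already present, so it helps after bulk insertion rather than reserving capacity up front):

```d
void main()
{
    int[string] aa = ["one": 1, "two": 2, "three": 3];
    aa.rehash;  // reorganize the bucket array for the current key set
    assert(aa["two"] == 2);
}
```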


[...]
> >I would caution to be sure of this cause, however, before thinking it would solve the problem. The AA not only uses an array for buckets, but allocates a memory location for each element as well. I'm often wrong when I assume what the problem is when it comes to GC issues...
> >
> Completely agree. After posting I decided to take a more methodical look.  Not finished yet, but I can share part of it. Key thing so far is noticeable step function related to GC costs related to AA size (likely not a surprise).
> 
> My programs work with large data sets. Size is open-ended, what I'm trying to do is get an idea of the data set sizes they will handle reasonably. For purposes of illustration, word-count is a reasonable proxy for what I'm doing. It was in this context that I saw significant performance drop-off after 'size_t[string]' AAs reached about 10 million entries.

I also have a program that builds very large AA's (also up to millions of keys, sometimes more) that essentially last for the entire duration of the program, and I've found that a lot of the performance degradation can be attributed to overly-frequent GC collection runs.  As a workaround, I did something like this:

	import core.memory : GC;

	size_t counter = 0;

	void main() {
		GC.disable();	// suspend automatic collections
		doWork();
		GC.enable();	// optional
	}

	void doWork() {
		Data[Key] aa = ...;

		mainloop: while(...) {
			... // lots of stuff, including adding things to the AA

			// Manually schedule GC collections.
			// The 10_000 is just an example number, tweak
			// as you see fit.
			if (++counter % 10_000 == 0)
				GC.collect();
		}
	}

Basically, I run GC collections on my own schedule, with the frequency controlled by tweaking the counter check. (Obviously, how you implement this depends on whether there are regular or semi-regular units of work that your program does, and whether they are a good basis for scheduling collections.)

By tweaking the frequency of the manual GC collections, I've been able to obtain 30-40% speedups in my program (in some cases, it could get as high as 50%). YMMV.

Of course, as with all performance-related questions, the only way to be sure that the GC (or anything else) is the cause of your problem, is to profile.  More often than not, I've found that the profiler has proven me wrong about where the real bottleneck is, so I try not to assume anything until I see the profile data.


T

-- 
If Java had true garbage collection, most programs would delete themselves upon execution. -- Robert Sewell
February 16, 2016
On Tuesday, 16 February 2016 at 19:49:55 UTC, H. S. Teoh wrote:
> On Tue, Feb 16, 2016 at 07:34:07PM +0000, Jon D via Digitalmars-d-learn wrote:
>> On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
>> >On 2/14/16 10:22 PM, Jon D wrote:
>> >>Is there a way to reserve capacity in associative arrays?
>> >>[snip]
>> >>The underlying implementation of associative arrays appears to take
>> >>an initial number of buckets, and there's a private resize() method,
>> >>but it's not clear if there's a public way to use these.
>
> Rehashing (aa.rehash) would resize the number of buckets, but if you don't already have the requisite number of keys, it wouldn't help.
>
Thanks for the reply and the detailed example for manually controlling GC. I haven't experimented with taking control over GC that way.

Regarding reserving capacity, the relevant method is aa.resize(), not aa.rehash(). See: https://github.com/D-Programming-Language/druntime/blob/master/src/rt/aaA.d#L141. This allocates space for the buckets; it doesn't matter whether the keys are known. Note that every time the bucket array is resized, the old bucket array is walked and the elements are reinserted. Preallocating a large bucket array would avoid this. See also the private constructor in the same file (line 51); it takes an initial size.

--Jon