https://issues.dlang.org/show_bug.cgi?id=2504: reserve for associative arrays

Our student Alex has been looking at this and I wanted to open the floor for a bit of discussion before we embark on an implementation. Last time I looked our associative arrays were arrays of singly-linked lists. It follows that in order to implement reserve() we'd need a freelist allocator approach, which preallocates a bunch of nodes in a single contiguous block of memory. Each hashtable would have its own freelist, or alternatively all hashtables of the same types share the same freelist. So should we get into this? Pros: * It provides a nice primitive to use with built-in hashtables * Accelerates without much effort a variety of codes using hashtables Cons: * Built-in hashtables are a convenience facility more than a high-performance one, so providing an efficiency-oriented primitive seems a bit odd. * We're considering a number of improvements and refactorings to built-in hashtables, which may obviate the freelist implementation or need extra effort to implement it. Basically defining reserve() limits our future implementation freedom. * Implementation effort is nontrivial. * The time needed to initialize the freelist in one contiguous block of memory is still O(n) (although the constant is small and special implementation techniques may make it O(1)). * The implementation cannot use Phobos' allocator because it's in druntime. Please share your thoughts. Thanks, Andrei

November 01, 2016

Re: https://issues.dlang.org/show_bug.cgi?id=2504: reserve for associative arrays

Posted by Jonathan M Davis
in reply to Andrei Alexandrescu

Permalink

Jonathan M Davis

Posted in reply to Andrei Alexandrescu

Permalink

On Tuesday, November 01, 2016 23:36:42 Andrei Alexandrescu via Digitalmars-d wrote:
> Please share your thoughts.

Hasn't Martin been working on a new, templatized AA implementation to replace the current one? As I recall, he said something about postponing it by a release to focus on something else (helping Walter with @safe issues IIRC), but it sounded like it was close to ready. And if that's the case, I really don't think that it makes sense to be doing anything major with the current implementation.

But regardless of that, I think that the focus of the built-in AAs should be to have something simple that works. It should be reasonably performant, but anything involving performance tuning really should be in library type. We've had quite a few problems with the built-in AAs over the years (especially with stuff like user-defined key types - in part because it uses void* internally IIRC), and while the new implementation will hopefully be much better in that regard, I think that based on the issues with had with the built-in AAs historically, doing anything that would make them more complicated should be approached with extreme caution.

And anyone who really wants to do performance tuning is going to need better facilities than we're going to be able to reasonably provide in the built-in AAs anyway.

The only real downside to user-defined hash tables at this point is that you can't use AA literals with them (which we should probably create a DIP for at some point). Aside from that, there's no reason that a library AA can't work as well or better than the built-in AAs, and it would be much easier to have facilities for tuning performance with a library type than with something built into the language.

I think that effort put towards containers would be much better served working towards the std.container rewrite (though you would know where that stands better than the rest of us as well as whether it makes any sense for anyone to help you right now). Or maybe Martin could use some help in finishing whatever he's done with the new, built-in AA implementation. I don't know. But considering that adding reserve to the built-in AAs is not a small task and that it'is debatable that it should be done at all, there pretty much has to be something better for Alex to work on.

- Jonathan M Davis

On Wednesday, 2 November 2016 at 03:36:42 UTC, Andrei Alexandrescu wrote: > Cons: > > * Built-in hashtables are a convenience facility more than a high-performance one, so providing an efficiency-oriented primitive seems a bit odd. I don't see why it would be odd. If I know my code is going to be working with 10,000,000 values, then telling/hinting so it could prepare appropriately up front rather than taking hits later seems useful. It's not like it couldn't work without it right? I'd hope for a few other features, like the AA could be pre-initialized and ready to go, for fixed/immutable lists.

On Wednesday, 2 November 2016 at 03:36:42 UTC, Andrei Alexandrescu wrote: > > Last time I looked our associative arrays were arrays of singly-linked lists. F.Y.I. It now appears to use quadratic probing since druntime PR #1229. > > Each hashtable would have its own freelist, or alternatively all hashtables of the same types share the same freelist. How do you return memory to the freelist when GC is expected to managed memory of the entries? We no longer GC.free since your PR (#1143.) Without belaboring the point, I think we'd better off with good library AA offerings as well.

On Wednesday, 2 November 2016 at 03:36:42 UTC, Andrei Alexandrescu wrote: > > Last time I looked our associative arrays were arrays of singly-linked lists. It follows that in order to implement reserve() we'd need a freelist allocator approach, which preallocates a bunch of nodes in a single contiguous block of memory. > Aren't associative arrays using open addressing for collision resolution? I'm looking at routines findSlotInsert() and findSlotLookup() here: https://github.com/dlang/druntime/blob/master/src/rt/aaA.d#L102 There is already a resize() routine: https://github.com/dlang/druntime/blob/master/src/rt/aaA.d#L141 The resize() routine takes the new size of the bucket array as an argument. Could a 'reserve' routine call this? Scanning through the code, it appears policy on size of the bucket array is established by callers to resize(), e.g. see the _aaRehash() function, which calls resize().

On 11/1/16 11:36 PM, Andrei Alexandrescu wrote: > Our student Alex has been looking at this and I wanted to open the floor > for a bit of discussion before we embark on an implementation. > > Last time I looked our associative arrays were arrays of singly-linked > lists. It follows that in order to implement reserve() we'd need a > freelist allocator approach, which preallocates a bunch of nodes in a > single contiguous block of memory. They have changed. Collisions are now handled via finding the next available slot in the array. However, elements of the array are still pointers to entries. So technically, the freelist is still needed. But I'm not sure preallocating into a contiguous block is the best approach. People don't expect a hashtable to hold onto all the memory it has used in the past when you remove most of the elements. In dcollections, I let the allocator take care of the details of allocation. I think the best thing to do is to preallocate the array at least (so no rehashing occurs during additions of nodes), and then move towards a templated solution that can use a custom allocator for tuning the performance/memory usage tradeoff. -Steve

On Wednesday, 2 November 2016 at 03:36:42 UTC, Andrei Alexandrescu wrote: > [snip] > > * The implementation cannot use Phobos' allocator because it's in druntime. How about let the user provide the memory block when they calls reserve()? > > Please share your thoughts. > > > Thanks, > > Andrei

On Thursday, 3 November 2016 at 13:19:17 UTC, Steven Schveighoffer wrote: > > So technically, the freelist is still needed. In case I wasn't clear in my previous post: We can't use a freelist because it breaks safety.

On 11/3/16 11:48 PM, safety0ff wrote: > On Thursday, 3 November 2016 at 13:19:17 UTC, Steven Schveighoffer wrote: >> >> So technically, the freelist is still needed. > > In case I wasn't clear in my previous post: > We can't use a freelist because it breaks safety. Yes, you can't use a free-list when removing nodes. AA used to actually free the memory, but that was decided against long ago. There are so many possibilities if we just move to a templated type. -Steve

Forums