Thread overview
[phobos] Sealed Hash Tables
Oct 22, 2010
David Simcha
Oct 22, 2010
David Simcha
Oct 22, 2010
David Simcha
Oct 23, 2010
David Simcha
October 22, 2010
  I've come to the realization that one of the most annoying library
level problems in D2 is lack of a good hash table implementation for
large tables.  The builtin AAs aren't bad for small to medium sized
tables, but are absolutely terrible for large tables due to the way they
manage memory.  Is anyone already working on a sealed container
implementation of hash tables?  If not, would others be interested in my
RandAA (http://dsource.org/projects/aa) after some cleanup, etc.?  I
believe that this design is extremely well suited to the sealed
container paradigm, and it interacts much better with the GC than the
builtin impl. when dealing with large arrays.
October 22, 2010
If your hash table implementation is interface-compatible with the built-in one and does better than it, the logical path to go is to replace the built-in table with yours. Would you be interested in that?

We'd need a battery of tests for such, and I see you already got one from PyDict. Great!


Andrei

On 10/22/10 8:59 CDT, David Simcha wrote:
> I've come to the realization that one of the most annoying library level
> problems in D2 is lack of a good hash table implementation for large
> tables. The builtin AAs aren't bad for small to medium sized tables, but
> are absolutely terrible for large tables due to the way they manage
> memory. Is anyone already working on a sealed container implementation
> of hash tables? If not, would others be interested in my RandAA
> (http://dsource.org/projects/aa) after some cleanup, etc.? I believe
> that this design is extremely well suited to the sealed container
> paradigm, and it interacts much better with the GC than the builtin
> impl. when dealing with large arrays.
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
October 22, 2010
The main problem with replacing the builtin hash table is that the builtin isn't sealed (and as far as I can tell isn't supposed to be).  RandAA is intended to be sealed and has no clean way of allowing ref access to contents.  This is the tradeoff that allows very efficient memory management in the case of large tables.

On Fri, Oct 22, 2010 at 10:06 AM, Andrei Alexandrescu <andrei at erdani.com>wrote:

> If your hash table implementation is interface-compatible with the built-in one and does better than it, the logical path to go is to replace the built-in table with yours. Would you be interested in that?
>
> We'd need a battery of tests for such, and I see you already got one from PyDict. Great!
>
>
> Andrei
>
>
> On 10/22/10 8:59 CDT, David Simcha wrote:
>
>> I've come to the realization that one of the most annoying library level
>> problems in D2 is lack of a good hash table implementation for large
>> tables. The builtin AAs aren't bad for small to medium sized tables, but
>> are absolutely terrible for large tables due to the way they manage
>> memory. Is anyone already working on a sealed container implementation
>> of hash tables? If not, would others be interested in my RandAA
>> (http://dsource.org/projects/aa) after some cleanup, etc.? I believe
>> that this design is extremely well suited to the sealed container
>> paradigm, and it interacts much better with the GC than the builtin
>> impl. when dealing with large arrays.
>> _______________________________________________
>> phobos mailing list
>> phobos at puremagic.com
>> http://lists.puremagic.com/mailman/listinfo/phobos
>>
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20101022/589d1279/attachment.html>
October 22, 2010
P.S.  The builtin hash table is pathetically crufty, buggy, etc. and could use a rewrite, but even if/when this happens, I still think a separate sealed container implementation would be a Good Thing.

On Fri, Oct 22, 2010 at 10:22 AM, David Simcha <dsimcha at gmail.com> wrote:

> The main problem with replacing the builtin hash table is that the builtin isn't sealed (and as far as I can tell isn't supposed to be).  RandAA is intended to be sealed and has no clean way of allowing ref access to contents.  This is the tradeoff that allows very efficient memory management in the case of large tables.
>
>
> On Fri, Oct 22, 2010 at 10:06 AM, Andrei Alexandrescu <andrei at erdani.com>wrote:
>
>> If your hash table implementation is interface-compatible with the built-in one and does better than it, the logical path to go is to replace the built-in table with yours. Would you be interested in that?
>>
>> We'd need a battery of tests for such, and I see you already got one from PyDict. Great!
>>
>>
>> Andrei
>>
>>
>> On 10/22/10 8:59 CDT, David Simcha wrote:
>>
>>> I've come to the realization that one of the most annoying library level
>>> problems in D2 is lack of a good hash table implementation for large
>>> tables. The builtin AAs aren't bad for small to medium sized tables, but
>>> are absolutely terrible for large tables due to the way they manage
>>> memory. Is anyone already working on a sealed container implementation
>>> of hash tables? If not, would others be interested in my RandAA
>>> (http://dsource.org/projects/aa) after some cleanup, etc.? I believe
>>> that this design is extremely well suited to the sealed container
>>> paradigm, and it interacts much better with the GC than the builtin
>>> impl. when dealing with large arrays.
>>> _______________________________________________
>>> phobos mailing list
>>> phobos at puremagic.com
>>> http://lists.puremagic.com/mailman/listinfo/phobos
>>>
>> _______________________________________________
>> phobos mailing list
>> phobos at puremagic.com
>> http://lists.puremagic.com/mailman/listinfo/phobos
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20101022/6ecee585/attachment.html>
October 23, 2010
Then I think that's a good addition to std.container. Before that, we should discuss in more depth allocators. This is because some use cases will prefer malloc and others will prefer the GC heap.

For example, a class that has a sealed container as member wouldn't want the container to hold on to some mallocated chunks until the next GC cycle.


Andrei

On 10/22/10 9:22 CDT, David Simcha wrote:
> The main problem with replacing the builtin hash table is that the builtin isn't sealed (and as far as I can tell isn't supposed to be). RandAA is intended to be sealed and has no clean way of allowing ref access to contents.  This is the tradeoff that allows very efficient memory management in the case of large tables.
>
> On Fri, Oct 22, 2010 at 10:06 AM, Andrei Alexandrescu <andrei at erdani.com <mailto:andrei at erdani.com>> wrote:
>
>     If your hash table implementation is interface-compatible with the
>     built-in one and does better than it, the logical path to go is to
>     replace the built-in table with yours. Would you be interested in that?
>
>     We'd need a battery of tests for such, and I see you already got one
>     from PyDict. Great!
>
>
>     Andrei
>
>
>     On 10/22/10 8:59 CDT, David Simcha wrote:
>
>         I've come to the realization that one of the most annoying
>         library level
>         problems in D2 is lack of a good hash table implementation for large
>         tables. The builtin AAs aren't bad for small to medium sized
>         tables, but
>         are absolutely terrible for large tables due to the way they manage
>         memory. Is anyone already working on a sealed container
>         implementation
>         of hash tables? If not, would others be interested in my RandAA
>         (http://dsource.org/projects/aa) after some cleanup, etc.? I believe
>         that this design is extremely well suited to the sealed container
>         paradigm, and it interacts much better with the GC than the builtin
>         impl. when dealing with large arrays.
>         _______________________________________________
>         phobos mailing list
>         phobos at puremagic.com <mailto:phobos at puremagic.com>
>         http://lists.puremagic.com/mailman/listinfo/phobos
>
>     _______________________________________________
>     phobos mailing list
>     phobos at puremagic.com <mailto:phobos at puremagic.com>
>     http://lists.puremagic.com/mailman/listinfo/phobos
>
>
>
>
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
October 23, 2010
  On 10/23/2010 11:48 AM, Andrei Alexandrescu wrote:
> Then I think that's a good addition to std.container. Before that, we should discuss in more depth allocators. This is because some use cases will prefer malloc and others will prefer the GC heap.
>
> For example, a class that has a sealed container as member wouldn't want the container to hold on to some mallocated chunks until the next GC cycle.
>
>
> Andrei

Ok.  First, I'll start off by stating that I don't have much knowledge of how custom allocators work in C++.  I know that they're template parameters to STL containers, but that's about it.  I don't know many details.

I guess the first question on the list is, do we want to make allocator choice parametric (i.e. any allocator that meets some compile-time interface can be used) or ad-hoc (i.e. we support only a small, limited set of allocators, such as malloc and core.memory)?  This is basically a tradeoff between simplicity and flexibility.

Secondly, assuming we go with parametric (which is my vote), what would be the API?  The most simple and obvious answer is an alias parameter for a malloc() function and another alias paramter for a free() function.  The free() function may be a dummy/no-op function if malloc'd blocks are GC'd.  On the other hand, while I know C++ makes the allocator a compile-time/template parameter, I don't see what's gained in exchange for losing runtime flexibility in this case.  It makes a lot more sense to me that it should be changeable at runtime and be either an object or a pair of function pointers/delegates (one for malloc(), one for free()).

Third, how about things like stack allocators that might only be able to free memory in LIFO order or mark-release allocators that might only be able to free whole chunks of memory at once?  I'd assume we can't support these, at least not well.  In terms of my personal experience, I actually wrote a hash table implementation for a stack allocator at one point as a performance optimization.  It worked well for its intended use cases, but required a substantial amount of awareness of the underlying allocator in the implementation and some in the API.  For example, you had to specify in advance the approximate size of the table, and removed entries were kept on an internal free list since otherwise that memory would go to waste.
October 25, 2010
On 10/23/10 18:38 CDT, David Simcha wrote:
> On 10/23/2010 11:48 AM, Andrei Alexandrescu wrote:
>> Then I think that's a good addition to std.container. Before that, we should discuss in more depth allocators. This is because some use cases will prefer malloc and others will prefer the GC heap.
>>
>> For example, a class that has a sealed container as member wouldn't want the container to hold on to some mallocated chunks until the next GC cycle.
>>
>>
>> Andrei
>
> Ok. First, I'll start off by stating that I don't have much knowledge of how custom allocators work in C++. I know that they're template parameters to STL containers, but that's about it. I don't know many details.

Probably you're better off making good judgments than I am; I know about STL allocators, I also know they are wrong, but I can't figure why.

> I guess the first question on the list is, do we want to make allocator choice parametric (i.e. any allocator that meets some compile-time interface can be used) or ad-hoc (i.e. we support only a small, limited set of allocators, such as malloc and core.memory)? This is basically a tradeoff between simplicity and flexibility.

Well it also imposes certain constraints. If parametric, you can implement swap(). If not, you can't, or swap() could fail. (You can't swap two objects leaving memory for one being freed by another allocator.)

> Secondly, assuming we go with parametric (which is my vote), what would
> be the API? The most simple and obvious answer is an alias parameter for
> a malloc() function and another alias paramter for a free() function.
> The free() function may be a dummy/no-op function if malloc'd blocks are
> GC'd. On the other hand, while I know C++ makes the allocator a
> compile-time/template parameter, I don't see what's gained in exchange
> for losing runtime flexibility in this case. It makes a lot more sense
> to me that it should be changeable at runtime and be either an object or
> a pair of function pointers/delegates (one for malloc(), one for free()).

Often an object would need to figure out statically if they need to explicitly free everything; depending on that, the object may opt for different designs (e.g. refcounted vs. not).

> Third, how about things like stack allocators that might only be able to free memory in LIFO order or mark-release allocators that might only be able to free whole chunks of memory at once? I'd assume we can't support these, at least not well. In terms of my personal experience, I actually wrote a hash table implementation for a stack allocator at one point as a performance optimization. It worked well for its intended use cases, but required a substantial amount of awareness of the underlying allocator in the implementation and some in the API. For example, you had to specify in advance the approximate size of the table, and removed entries were kept on an internal free list since otherwise that memory would go to waste.

We don't have the type system chops to support region-based allocation statically. The most we could is to define an interface for region allocation (e.g. allocate() and freeAll() but not free() for individual elements) that allows containers to statically figure out they're dealing with regions and tailor code appropriately.


Andrei
October 25, 2010
I am not an expert in allocators.  However, feel free to use my dcollections chunk allocator and related API.

I just implemented what I thought would be good.  I handle the free individual elements question by a freeNeeded enum.  The freeAll I figure can be a noop for allocators that don't care, since it's not called frequently.

freeNeeded could potentially be an external template instead of an internal enum.  I wrote that code long before I was using d2 and had experience with Andrei's excellent type checking code.

http://www.dsource.org/projects/dcollections/browser/branches/d2/dcollections/DefaultAllocator.d


Anyways, use what you want.  It's boost 1.0 license.

-Steve



----- Original Message ----
> From: Andrei Alexandrescu <andrei at erdani.com>
> 
> On 10/23/10 18:38 CDT, David Simcha wrote:
> > 
> > Ok. First,  I'll start off by stating that I don't have much knowledge of
> > how custom  allocators work in C++. I know that they're template
> > parameters to STL  containers, but that's about it. I don't know many
> >  details.
> 
> Probably you're better off making good judgments than I am; I  know about STL
>allocators, I also know they are wrong, but I can't figure  why.