November 20, 2005
Walter Bright wrote:
> "Manfred Nowak" <svv1999@hotmail.com> wrote in message
> news:Xns9713199CF924Esvv1999hotmailcom@63.105.9.61...
>> Although the
>> unbalanced binary tree Walter has chosen for the resolution of
>> collisions is bad as hell in theory, it performs very well under the
>> constraints given.
> 
> I've been using variations of that for 25 years. Every once in a while, I
> try out some new table method that is theoretically supposed to be faster.
> They always turn out to be slower. So back to the binary tree!
> 
> I've also found it scales very well, from a few entries to tens of thousands
> of entries.

It does.  And the memory cost of an additional pointer per node isn't a big deal, though it may be an issue for some extreme cases.  Out of curiosity, is the binary tree implementation balanced in any way, or is it a plain old binary tree?  Given the relatively small expected chain size I'd expect the latter, but I figured I'd ask anyway.


Sean
November 20, 2005
Munchgreeble wrote:
> 
> Take a look at this performance comparison, for example; it demonstrates that
> even for hashmaps, the performance varies quite widely among implementations:
> http://www.tommti-systems.de/go.html?http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html
> 
> Clearly they're all O(1) algorithms, but surprisingly the C++ implementation is
> the slowest even though it has the potential to be fastest.

Actually, the C++ map is implemented as a balanced binary tree, which has O(logN) find/insert performance.  The hash-based associative container, unordered_map, is in the C++ Technical Report #1 but won't be an official part of the standard until around 2009 when the next C++ standard is published.  For what it's worth, I've written an implementation of std::tr1::unordered_map which beats the Dinkumware implementation of std::map quite soundly in memory usage, cache misses, and test execution speed given some average to large-sized datasets.  I can't remember the exact proportion, but I think it was about twice as good on all 3 metrics, which would put it on par with the C# dictionary in that test you linked to.  There are a few other C++ TR1 implementations floating around and I expect them all to perform approximately as well as mine, given that I didn't do anything too terribly fancy (std::vector on top of a single std::list).


Sean
November 20, 2005
"Sean Kelly" <sean@f4.ca> wrote in message news:dloh13$1glb$1@digitaldaemon.com...
> Walter Bright wrote:
> > I've been using variations of that for 25 years. Every once in a while, I
> > try out some new table method that is theoretically supposed to be faster.
> > They always turn out to be slower. So back to the binary tree!
> >
> > I've also found it scales very well, from a few entries to tens of
> > thousands of entries.
>
> It does.  And the memory cost of an additional pointer per node isn't a big deal, though it may be an issue for some extreme cases.

D AAs are general purpose, which means they are meant to do a decent job for a wide range of applications. For very specific or extreme cases, it's likely you can do better with a custom implementation.

> Out of
> curiosity, is the binary tree implementation balanced in any way, or is
> it a plain old binary tree?  Given the relatively small expected chain size
> I'd expect the latter, but I figured I'd ask anyway.

It gets rebalanced when a rehash happens. Rebalancing it on the fly is one of those theoretical improvements that in practice makes things much worse.
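
For anyone curious what that scheme looks like, here's a rough sketch (this is not the code in internal/aaA.d; the hash function, growth factor, and rehash trigger below are invented purely for illustration):

// Rough sketch only -- not the actual internal/aaA.d code.  Each bucket is
// a plain, unbalanced binary tree ordered by hash (then by key), and a
// rehash simply walks the old trees and reinserts every node into a larger
// bucket array.
import std.conv  : to;
import std.stdio : writefln;

struct Node
{
    Node*  left, right;
    size_t hash;
    string key;
    int    value;
}

struct TreeAA
{
    Node*[] buckets;
    size_t  count;

    this(size_t n) { buckets = new Node*[n]; }

    static size_t hashKey(string key)
    {
        size_t h = 0;
        foreach (c; key) h = h * 31 + c;     // toy hash, for illustration
        return h;
    }

    void insert(string key, int value)
    {
        auto h  = hashKey(key);
        auto pp = &buckets[h % buckets.length];
        while (*pp)                          // plain, unbalanced insertion
        {
            Node* n = *pp;
            if (h == n.hash && key == n.key) { n.value = value; return; }
            pp = (h < n.hash || (h == n.hash && key < n.key)) ? &n.left : &n.right;
        }
        auto n = new Node;
        n.hash = h; n.key = key; n.value = value;
        *pp = n;
        if (++count > buckets.length * 4)    // load factor 4: grow and rehash
            rehash(buckets.length * 2 + 1);
    }

    // The only "rebalancing": reinsert everything into the new bucket array,
    // in whatever order the old trees happen to be walked.
    void rehash(size_t newLen)
    {
        auto old = buckets;
        buckets = new Node*[newLen];
        count = 0;
        foreach (root; old)
            reinsert(root);
    }

    private void reinsert(Node* n)
    {
        if (n is null) return;
        reinsert(n.left);
        reinsert(n.right);
        insert(n.key, n.value);              // old nodes are simply abandoned
    }
}

void main()
{
    auto aa = TreeAA(4);
    foreach (i; 0 .. 100)
        aa.insert("key" ~ to!string(i), i);  // hypothetical keys
    writefln("%s entries in %s buckets", aa.count, aa.buckets.length);
}

With few collisions per bucket the trees stay tiny, which is presumably why the lack of balancing doesn't hurt in practice.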


November 20, 2005
Sean Kelly wrote:

[...]
> Amortizing the cost of these rehashes over all N insertions seems like it should yield an amortized constant time insert
[...]

Agreed, because I did not take into account that the size of the table changed at every rehash.

Of course the sum over 2**-i (from i=1 to infinity) is bounded by a constant.
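
Spelled out, assuming the table at least doubles at each rehash (the exact growth factor in the D runtime may differ, but any constant factor greater than one gives the same kind of bound):

\[
  \sum_{i=1}^{\infty} 2^{-i} = \tfrac12 + \tfrac14 + \tfrac18 + \cdots = 1,
  \qquad
  \text{total rehash work} \;\le\; N\Bigl(1 + \tfrac12 + \tfrac14 + \cdots\Bigr) = 2N = O(N),
\]

so the rehashing cost over all N insertions is O(N), i.e. O(1) amortized per insert.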

Sometimes facts generalized without thought turn out to be plain preconception :-(

-manfred
November 20, 2005
Walter Bright wrote:

[...]
> It gets rebalanced when a rehash happens.
[...]

I do not see any balancing in the implementation. It's plain inserting.

-manfred
November 20, 2005
Manfred Nowak wrote:
> Sean Kelly wrote:
> 
> [...]
>> Amortizing the cost of these rehashes over all N insertions seems
>> like it should yield an amortized constant time insert
> [...]
> 
> Agreed, because I did not take into account that the size of the table changed at every rehash.
> 
> Of course the sum over 2**-i (from i=1 to infinity) is bounded by a constant.

I don't think so, though my math is a bit rusty in this area.  My conclusions regarding all this were more from a general sense of the concepts than because I worked it out on paper.

> Sometimes facts generalized without thought turn out to be plain preconception :-(  

Agreed.


Sean
November 20, 2005
Manfred Nowak wrote:
> Walter Bright wrote:
> 
> [...]
>> It gets rebalanced when a rehash happens.
> [...]
> 
> I do not see any balancing in the implementation. It's plain inserting.

I think that can be considered a rebalancing so long as the insertions don't occur in the same order as they did originally.  Also, the contents of each tree may change on a rehash, since some elements may end up in different buckets, correct?


Sean
November 20, 2005
Sean Kelly wrote:

[...]
>  Also, the contents of each tree may change on a rehash, since
> some elements may end up in different buckets, correct?

Correct. But this leads only to a randomly balanced tree, and I doubt that the factor of 3 in space requirement for the tree versus a 16-element array is justified by the time gain.

However, I assume that Walter has already tested that.

-manfred
November 21, 2005
> An automatic rehash occurs whenever the number of elements exceeds the table size by a factor of four. Currently an address space of 2**32 is assumed, which results in at most 14 automatic rehashes.
>
> -manfred

OR when a bucket is (too) full? Is this true? I have no idea about the inner workings of D's hash-table.

If so, what happens if after a rehash the bucket is still (too) full? Does it rehash again? Where can I find more info?

In my hash-table I use a single array of slots (always 2^n) and keep track of a "distance" which is the highest distance from an element to its expected slot (if an element wants to be on [0] but [0..3] is full, it will end up on [3] and the distance is 3). Isn't this safer and more efficient* than a fixed bucket size of 16?

* space efficient: I don't need to allocate buckets but simply use empty
slots in the array
* time efficient: I have direct control over "distance" and can force a
rehash if it gets too high.

(Interestingly, it happens that after a rehash, the "distance" turns out to be higher than before! Since the hash function makes no guarantees, I'd guess it's just a case of bad luck, as opposed to buggy code.)
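
To make that concrete, here's a minimal sketch of the kind of table I mean (the hash function, the distance limit of 8, and the growth policy below are placeholder choices for illustration, not my actual code):

// Minimal sketch of the single-array ("open addressing") scheme described
// above: keys probe linearly from their expected slot, and the table tracks
// the worst probe distance seen so far.
import std.stdio : writefln;

struct Slot
{
    bool   used;
    size_t key;
    int    value;
}

struct ProbeTable
{
    Slot[] slots;      // length is always a power of two
    size_t count;
    size_t maxDist;    // the "distance": worst offset from the expected slot

    this(size_t n) { slots = new Slot[n]; }

    size_t home(size_t key) { return (key * 2654435761u) & (slots.length - 1); }

    void insert(size_t key, int value)
    {
        size_t i = home(key), dist = 0;
        while (slots[i].used && slots[i].key != key)
        {
            i = (i + 1) & (slots.length - 1);    // probe the next slot, wrapping
            ++dist;
        }
        if (!slots[i].used) ++count;
        slots[i] = Slot(true, key, value);
        if (dist > maxDist) maxDist = dist;
        // force a grow-and-rehash when the distance (or the load) gets too high
        if (maxDist > 8 || count * 4 > slots.length * 3)
            grow();
    }

    void grow()
    {
        auto old = slots;
        slots   = new Slot[old.length * 2];
        count   = 0;
        maxDist = 0;                             // may end up higher again!
        foreach (s; old)
            if (s.used) insert(s.key, s.value);
    }
}

void main()
{
    auto t = ProbeTable(16);
    foreach (k; 0 .. 1000)
        t.insert(k * 97, k);                     // arbitrary test keys
    writefln("%s entries, %s slots, max distance %s",
             t.count, t.slots.length, t.maxDist);
}

No per-element pointers or separately allocated buckets; a lookup just probes at most maxDist+1 slots starting from home(key).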

I probably got a wrong idea about D's AA. I'll wait for more info : )

Lionello.


November 21, 2005
Hi,

In article <dlske4$2k2d$1@digitaldaemon.com>, Lionello Lunesu says...
>
>> An automatic rehash occurs whenever the number of elements exceeds the table size by a factor of four. Currently an address space of 2**32 is assumed, which results in at most 14 automatic rehashes.
>>
>> -manfred
>
>OR when a bucket is (too) full? Is this true? I have no idea about the inner workings of D's hash-table.

Looking at the source (internal/aaA.d), the only place where an automatic rehash gets done is when the average bucket contains more than 4 elements.

>If so, what happens if after a rehash the bucket is still (too) full? Does it rehash again? Where can I find more info?
>
>In my hash-table I use a single array of slots (always 2^n) and keep track of a "distance" which is the highest distance from an element to its expected slot (if an element wants to be on [0] but [0..3] is full, it will end up on [3] and the distance is 3). Isn't this safer and more efficient* than a fixed bucket size of 16?

What do you mean by a fixed bucket size of 16?

>* space efficient: I don't need to allocate buckets but simply use empty slots in the array

Yes, this obviously gives more space-efficient storage if your highest allowed distance is high enough*. No additional pointers or other (non-constant) overhead, if I understand you correctly.

*) this is a time/space trade-off. In order to be as fast as a traditional hash table, you may need to make the table so much larger that the memory requirements are even higher than for the traditional implementation. You are also much more reliant on a good hash function. It would be very interesting to see time/space comparisons of such a hash table to a more traditional one for a number of real life cases.

>* time efficient: I have direct control over "distance" and can force a rehash if it gets too high.

I'm hardly an expert on hash tables, but in my experience, the only way to tell if it is faster for a particular application is to compare it to the alternative implementation. Some points though:

- You are more sensitive to collisions, since you have not only the collisions
within a bucket, but also collisions with nearby buckets.
- You have no guarantee that a rehashing (and resizing) will improve the
maximum distance. To get around this, you will need an additional condition
that limits rehashing to, for instance, when the hash table size < a*n. But
adding such a condition will remove your guarantee of a maximum highest
distance.
- If m is your maximum distance, lookup is O(m) vs O(log m) for a hash table
that keeps balanced trees for each bucket.
- Deletes are more expensive and need to move more data.

In conclusion, I think (guess) you are more sensitive to the performance of the hash function. I would be very interested in seeing a comparison with other methods on realistic data.

I think D's hash table implementation is safer for general-purpose use, because it gracefully degrades to O(log n) when the hash function is poor. It also minimizes the number of calls to comparison and hash functions, which may, depending on the implementations, be a good thing.

>(Interestingly, it happens that after a rehash, the "distance" turns out to be higher than before! Since the hash function makes no guarantees, I'd guess it's just a case of bad luck, as opposed to buggy code.)

Yes, I would assume such cases could happen.

Regards,

/Oskar