November 22, 2005
Hi,

I found this: http://en.wikipedia.org/wiki/Hashtable#Chaining which describes a hash-table with balanced trees, like the one used by D. For everyone's information.

> What do you mean by a fixed bucket size of 16?

Sorry, I got the impression from some other post that the bucket size is fixed. But indeed, it makes no sense to fix it if it's a tree.

>>* space efficient: I don't need to allocate buckets but simply use empty slots in the array
>
> Yes, this is obviously a more space efficient storage if your highest allowed distance is high enough*. No additional pointers or other (non constant) overhead if I understand you correctly.

Exactly. The implementation is extremely simple. That's what I love about it. But yes, it's O(n) if you have many collisions.
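To make the idea concrete, here is a rough sketch of how I picture the insert and lookup (just an illustration of the scheme as described, not the actual code; the names and the fixed maxDistance of 8 are made up):

enum maxDistance = 8;

struct Slot
{
    bool used;
    string key;
    int value;
}

// Try to place the entry within maxDistance slots of its hash position.
// Returns false if the whole window is occupied (the caller would then
// grow the table and rehash).
bool tryInsert(Slot[] table, string key, int value, size_t hash)
{
    foreach (d; 0 .. maxDistance)
    {
        size_t i = (hash + d) % table.length;
        if (!table[i].used || table[i].key == key)
        {
            table[i] = Slot(true, key, value);
            return true;
        }
    }
    return false;
}

// Lookup scans the whole window, so a miss always costs maxDistance probes.
int* lookup(Slot[] table, string key, size_t hash)
{
    foreach (d; 0 .. maxDistance)
    {
        size_t i = (hash + d) % table.length;
        if (table[i].used && table[i].key == key)
            return &table[i].value;
    }
    return null;
}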

> - You are more sensitive to collisions, since you have not only the collisions within a bucket, but also collisions with nearby buckets.

This is true. I've tested a couple of different hash functions, and the better hash function nearly always wins (even if it's slower). For example, for a string-map I've compared CRC32 against Java's string-hash (hash=7; foreach(c) hash=hash*31+c;) and the CRC32 won, even though it's much slower. The entropy of Java's string-hash was improved by multiplying with some magic number (sqrt(5)-1, read about it somewhere) and that's the one I'm using ATM.
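For reference, the hash I mean looks roughly like this (the exact mixing constant is from memory; 2654435761 is the usual 32-bit golden-ratio multiplier, so treat it as an approximation of what I actually use):

// Java-style accumulation followed by a golden-ratio style final mix.
// The constant is approximately 2^32 * (sqrt(5)-1)/2.
uint stringHash(const(char)[] s)
{
    uint h = 7;
    foreach (c; s)
        h = h * 31 + c;
    return h * 2_654_435_761u;
}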

> - You have no guarantee that a rehashing (and resizing) will improve the maximum distance. To get around this, you will need an additional condition that limits rehashing to, for instance, when the hash table size < a*n. But adding such a condition will remove your guarantee of a maximum highest distance.

This is true. However, in practice the distance never exceeds 8 (for <50% occupation of the hash-table), so the impact of the linear search is kept to a minimum.
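For what it's worth, the growth condition I have in mind is something along these lines (a guess at a reasonable rule, not a quote of actual code):

// Grow and rehash when the table gets more than half full, or when an
// insert could not find a free slot within the allowed distance.
bool shouldGrow(size_t used, size_t capacity, bool insertFailed)
{
    return insertFailed || used * 2 > capacity;
}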

> - If m is your maximum distance, lookup is O(m) vs O(log m) for a hash table that keeps balanced trees for each bucket.

Yeah, you're right. Indeed, only a concrete test case can really compare two hash implementations.

> - Deletes are more expensive and need to move more data.

Why exactly are deletes more expensive? The operation is the same as with insertion: jump to the (hash) position, walk forward a bit (up to 'distance'), and delete the (key,value) pair (if any). The downside is that a look-up for a key not in the table always walks up to the max 'distance'.
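In the sketch above, deletion would simply be (again just an illustration, reusing the made-up Slot and maxDistance names):

// Because lookup always scans the full window and never stops early at an
// empty slot, clearing the slot is enough; no tombstones and no shifting
// of other entries.
bool removeKey(Slot[] table, string key, size_t hash)
{
    foreach (d; 0 .. maxDistance)
    {
        size_t i = (hash + d) % table.length;
        if (table[i].used && table[i].key == key)
        {
            table[i] = Slot.init;
            return true;
        }
    }
    return false;
}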

> In conclusion, I think (guess) you are more sensitive to the performance of the hash function. I would be very interested in seeing a comparison with other methods on realistic data.

That sounds like a nice D exercise!

> I think D's hash table implementation is safer for general purpose, because it gracefully degrades to O(log n) when the hash function is poor. It also minimizes the number of calls to comparison and hash functions, which may, depending on the implementations, be a good thing.

Indeed, I can imagine it's faster for general cases, but it's certainly more complex (the code, I mean). Good thing it's built-in then ; )

L.


November 23, 2005
"Manfred Nowak" <svv1999@hotmail.com> wrote in message news:Xns9714624647748svv1999hotmailcom@63.105.9.61...
> Sean Kelly wrote:
>
> [...]
> >  Also, the contents of each tree may change on a rehash, since some elements may end up in different buckets, correct?
>
> Correct. But this leads only to a randomly balanced tree

Yes, that's true.

> and I doubt that the factor of 3 for the space requirement of the tree against the space requirement of a 16 element array is justified by the time gain.

AA's are usually used when performance matters more than memory consumption. Regular arrays are used where memory consumption is paramount. Thus, it is appropriate to trade memory for speed in AA algorithm design.

> However, I assume, that Walter has already tested that.

It's easy to contrive test cases designed to make any particular scheme look good and any other particular scheme look bad. The current AA scheme has worked well for me for many years on all kinds of data. For example, it's used in DMDScript, and the performance of DMDScript is heavily dependent on the AA algorithm, as all objects in Javascript are AA's. DMDScript is, by a factor of 2, the fastest Javascript interpreter available.

The AA algorithm is all contained in internal\aaA.d. None of the rest of the language or library has any dependency on it. This means anyone can plug in another algorithm and give it a try.


November 23, 2005
Walter Bright wrote:
> 
> It's easy to contrive test cases designed to make any particular scheme look good and any other particular scheme look bad. The current AA scheme has worked well for me for many years on all kinds of data. For example, it's used in DMDScript, and the performance of DMDScript is heavily dependent on the AA algorithm, as all objects in Javascript are AA's. DMDScript is, by a factor of 2, the fastest Javascript interpreter available.

Word occurrence tests seem to be a pretty reasonable way to measure AA performance.  Out of curiosity, I think I'll extend my C++ test to the D AA and see how it compares.
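Something like this minimal word-count loop over the built-in AA is probably where I'd start (reading from stdin here just to keep the sketch self-contained; the details of the actual test will differ):

import std.array : split;
import std.stdio;

void main()
{
    uint[string] counts;
    foreach (line; stdin.byLine)
        foreach (word; line.idup.split)
            ++counts[word];   // missing keys start at 0

    foreach (word, n; counts)
        writefln("%s\t%s", word, n);
}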


Sean