| Thread overview | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
May 02, 2009 Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
I've been playing around with associative array implementations in D, mostly trying to create ones that are more GC-friendly. I had written one that seemed much faster than the builtin for int and float keys but much slower for strings. I've found the reason: The builtin AAs use trees for collision resolution, while mine were using probing or chaining, and with strings there were a heck of a lot of collisions because the default string hashing algorithm is absolutely horrible. Here's a small test program to demonstrate this:
import std.stdio, std.random, std.algorithm, std.range;
void main() {
uint[hash_t] hashFrq;
bool[string] alreadyUsed;
// Generate 100k random strings, compute their hashes and count the
// frequency of each hash.
foreach(i; 0..100_000) {
string myRandString;
// Make sure that no two strings are equal.
do {
myRandString = randString();
} while(myRandString in alreadyUsed);
alreadyUsed[myRandString] = true;
hash_t hash = typeid(string).getHash(cast(void*) &myRandString);
hashFrq[hash]++;
}
auto hashes = hashFrq.keys;
auto counts = hashFrq.values;
sort!((a, b) { return a.at!0 < b.at!0; })(zip(counts, hashes));
foreach(i; 0..hashes.length) {
writeln(hashes[i], "\t", counts[i]);
}
writeln(hashes.length, " unique hashes used.");
}
// Generates a random string of geometrically distributed length composed
// of uniformly distributed characters in [A-Z, a-z]. Expected length is 20.
string randString() {
bool upperCase;
bool keepGoing = true;
string ret;
uint upperA = cast(uint) 'A';
uint upperZ = cast(uint) 'Z' + 1;
uint lowerA = cast(uint) 'a';
uint lowerZ = cast(uint) 'z' + 1;
while(keepGoing) {
upperCase = cast(bool) uniform!"[]"(0, 1);
ret ~= cast(char)
(upperCase ? uniform(upperA, upperZ) : uniform(lowerA, lowerZ));
keepGoing = uniform(0.0, 1.0) > 0.05;
}
return ret;
}
The result is that only about 8400 unique hashes are used, meaning 90+ % of keys cannot be stored directly in the position corresponding to their hash. Note that this is in full hash_t space, not in the modulus space typically used for AAs, so the real results are probably even worse. If the hashes were uniformly distributed across all possible values of hash_t, one only would expect about 30 collisions, meaning about 99970 unique hashes. (This is based on monte carlo, not exact combinatorics, so it's only a rough figure, but it doesn't really matter for the take-home message anyhow.)
Using trees instead of some simpler O(N) mechanism to resolve collisions is very costly, in that the resulting array cannot be iterated over in O(1) auxiliary space, and that the trees require extra space overhead to maintain, meaning more GC pressure, etc. If we want to fix D's AAs, we first need to fix D's string hashing.
| ||||
May 02, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | dsimcha: > we first need to fix D's string hashing.< Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile | |||
May 02, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | == Quote from bearophile (bearophileHUGS@lycos.com)'s article > dsimcha: > > we first need to fix D's string hashing.< > Ages ago I did suggest this one in the main D newsgroup: > http://www.azillionmonkeys.com/qed/hash.html > Bye, > bearophile Yeah, upon further reverse engineering, the way string hashing "works" is that it simply adds up all the character code values of the characters and returns this sum. The implementation is in dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d , and I've verified this both by reading the code and by implementing the same thing myself and seeing if the results are the same. If anyone has a better way to do it (which would be almost anything; I've written a few myself, although they are extremely simplistic and I'm sure anyone who actually knows what they're doing could do even better), please submit a patch. I've filed this as bug 2922, because it's a severe performance issue, so please submit there. | |||
May 02, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | On Sat, May 2, 2009 at 1:00 PM, dsimcha <dsimcha@yahoo.com> wrote:
> The result is that only about 8400 unique hashes are used, meaning 90+ % of keys cannot be stored directly in the position corresponding to their hash. Note that this is in full hash_t space, not in the modulus space typically used for AAs, so the real results are probably even worse. If the hashes were uniformly distributed across all possible values of hash_t, one only would expect about 30 collisions, meaning about 99970 unique hashes. (This is based on monte carlo, not exact combinatorics, so it's only a rough figure, but it doesn't really matter for the take-home message anyhow.)
Here's something bizarre: porting this example to Tango yields 99997 unique hashes, which is right in line with your estimate. But I think it's bizarre since, well, shouldn't druntime use _the same typeinfo code?_
| |||
May 02, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to bearophile | == Quote from bearophile (bearophileHUGS@lycos.com)'s article
> dsimcha:
> > we first need to fix D's string hashing.<
> Ages ago I did suggest this one in the main D newsgroup:
> http://www.azillionmonkeys.com/qed/hash.html
> Bye,
> bearophile
On second thought...The real problem is just a strange bug in D's RTTI.
(cast(void*) typeid(immutable(char)[])) != (cast(void*) typeid(char[])). If one
uses the typeinfo for char[] instead of that for immutable(char)[], then the hash
performance is actually good, around 96k unique hashes.
| |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | Hello dsimcha,
> == Quote from bearophile (bearophileHUGS@lycos.com)'s article
>
>> dsimcha:
>>
>>> we first need to fix D's string hashing.<
>>>
>> Ages ago I did suggest this one in the main D newsgroup:
>> http://www.azillionmonkeys.com/qed/hash.html
>> Bye,
>> bearophile
> Yeah, upon further reverse engineering, the way string hashing "works"
> is that it simply adds up all the character code values of the
> characters and returns this sum. The implementation is in
>
> dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d ,
>
> and I've verified this both by reading the code and by implementing
> the same thing myself and seeing if the results are the same. If
> anyone has a better way to do it (which would be almost anything; I've
> written a few myself, although they are extremely simplistic and I'm
> sure anyone who actually knows what they're doing could do even
> better), please submit a patch. I've filed this as bug 2922, because
> it's a severe performance issue, so please submit there.
>
IIRC something like this is common:
hash_t ret = 0;
foreach(c;str) { ret *= SomePrime; ret += c; }
| |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | Jarrett Billingsley wrote:
> Here's something bizarre: porting this example to Tango yields 99997
> unique hashes, which is right in line with your estimate. But I think
> it's bizarre since, well, shouldn't druntime use _the same typeinfo
> code?_
No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
| |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to BCS | On Sun, 03 May 2009 04:59:26 +0300, BCS <none@anon.com> wrote: > > IIRC something like this is common: > > hash_t ret = 0; > foreach(c;str) { ret *= SomePrime; ret += c; } > I think that's the basis for the best general string hashing funcs around. IIRC from the university, it doesn't matter, in practice, if the multiplier is a prime number or not. So, the multiplication can be replaced with bit shifting (that's as fast as the addition operation). E.g. foreach(c; str) { ret = (ret << 4) + c; } I quickly googled around and found the algorithm djb2 which uses the multiplier 33: http://www.cse.yorku.ca/~oz/hash.html djb2 is obviously a little bit slower but it *may* give slightly better distribution (I'm not gonna run tests now to see if there is a real world difference... ;) ). | |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kristian Kilpi Attachments: | Kristian Kilpi wrote:
| On Sun, 03 May 2009 04:59:26 +0300, BCS <none@anon.com> wrote:
|>
|> IIRC something like this is common:
|>
|> hash_t ret = 0;
|> foreach(c;str) { ret *= SomePrime; ret += c; }
|>
|
| I think that's the basis for the best general string hashing funcs
| around. IIRC from the university, it doesn't matter, in practice,
if the
| multiplier is a prime number or not. So, the multiplication can be
| replaced with bit shifting (that's as fast as the addition operation).
| E.g.
|
| foreach(c; str)
| {
| ret = (ret << 4) + c;
| }
|
That one is very bad because it only takes into account the last
few characters of the string (how many depends on the size of the
hash). However, several hashing algorithms use 2^n+1 as their
multiplier, which is very fast even on old/small/embedded hardware
because it can be implemented as a shift and an addition.
| I quickly googled around and found the algorithm djb2 which uses the
| multiplier 33:
| http://www.cse.yorku.ca/~oz/hash.html
|
| djb2 is obviously a little bit slower but it *may* give slightly
better
| distribution (I'm not gonna run tests now to see if there is a real
| world difference... ;) ).
Jerome
- --
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr
| |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kristian Kilpi | This guy looks to have a pretty good hash function: http://www.azillionmonkeys.com/qed/hash.html He matched Bob Jenkins' One-at-a-time hash on a variety of tests and beat it on speed. --bb On Sun, May 3, 2009 at 5:19 AM, Kristian Kilpi <kjkilpi@gmail.com> wrote: > On Sun, 03 May 2009 04:59:26 +0300, BCS <none@anon.com> wrote: >> >> IIRC something like this is common: >> >> hash_t ret = 0; >> foreach(c;str) { ret *= SomePrime; ret += c; } >> > > I think that's the basis for the best general string hashing funcs around. > IIRC from the university, it doesn't matter, in practice, if the multiplier > is a prime number or not. So, the multiplication can be replaced with bit > shifting (that's as fast as the addition operation). > E.g. > > foreach(c; str) > { > ret = (ret << 4) + c; > } > > I quickly googled around and found the algorithm djb2 which uses the > multiplier 33: > http://www.cse.yorku.ca/~oz/hash.html > > djb2 is obviously a little bit slower but it *may* give slightly better distribution (I'm not gonna run tests now to see if there is a real world difference... ;) ). > | |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply