Absolutely horrible default string hashing
May 02, 2009
dsimcha
I've been playing around with associative array implementations in D, mostly trying to create ones that are more GC-friendly.  I had written one that seemed much faster than the built-in for int and float keys but much slower for strings.  I've found the reason:  the built-in AAs use trees for collision resolution, while mine were using probing or chaining, and with strings there were a heck of a lot of collisions, because the default string hashing algorithm is absolutely horrible.  Here's a small test program to demonstrate this:

import std.stdio, std.random, std.algorithm, std.range;

void main() {
    uint[hash_t] hashFrq;
    bool[string] alreadyUsed;

    // Generate 100k random strings, compute their hashes and count the
    // frequency of each hash.
    foreach(i; 0..100_000) {
        string myRandString;

        // Make sure that no two strings are equal.
        do {
            myRandString = randString();
        } while(myRandString in alreadyUsed);
        alreadyUsed[myRandString] = true;

        hash_t hash = typeid(string).getHash(cast(void*) &myRandString);
        hashFrq[hash]++;
    }

    auto hashes = hashFrq.keys;
    auto counts = hashFrq.values;
    sort!((a, b) { return a.at!0 < b.at!0; })(zip(counts, hashes));
    foreach(i; 0..hashes.length) {
        writeln(hashes[i], "\t", counts[i]);
    }

    writeln(hashes.length, " unique hashes used.");
}


// Generates a random string of geometrically distributed length composed
// of uniformly distributed characters in [A-Z, a-z].  Expected length is 20.
string randString() {
    bool upperCase;
    bool keepGoing = true;
    string ret;

    uint upperA = cast(uint) 'A';
    uint upperZ = cast(uint) 'Z' + 1;
    uint lowerA = cast(uint) 'a';
    uint lowerZ = cast(uint) 'z' + 1;

    while(keepGoing) {
        upperCase = cast(bool) uniform!"[]"(0, 1);
        ret ~= cast(char)
            (upperCase ? uniform(upperA, upperZ) : uniform(lowerA, lowerZ));
        keepGoing = uniform(0.0, 1.0) > 0.05;
    }
    return ret;
}

The result is that only about 8400 unique hashes are used, meaning that 90+% of keys cannot be stored directly in the position corresponding to their hash.  Note that this is in full hash_t space, not in the modulus space typically used for AAs, so the real results are probably even worse.  If the hashes were uniformly distributed across all possible values of hash_t, one would expect only about 30 collisions, meaning about 99970 unique hashes.  (This is based on Monte Carlo simulation, not exact combinatorics, so it's only a rough figure, but it doesn't really matter for the take-home message anyhow.)

Using trees instead of some simpler O(N) mechanism to resolve collisions is costly: the resulting array cannot be iterated over in O(1) auxiliary space, and the trees require extra space overhead to maintain, meaning more GC pressure, etc.  If we want to fix D's AAs, we first need to fix D's string hashing.
May 02, 2009
dsimcha:
> we first need to fix D's string hashing.<

Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html

Bye,
bearophile
May 02, 2009
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> dsimcha:
> > we first need to fix D's string hashing.<
> Ages ago I did suggest this one in the main D newsgroup:
> http://www.azillionmonkeys.com/qed/hash.html
> Bye,
> bearophile

Yeah, upon further reverse engineering, the way string hashing "works" is that it simply adds up the code values of all the characters and returns that sum.  The implementation is in

dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d ,

and I've verified this both by reading the code and by implementing the same thing myself and checking that the results match.  If anyone has a better way to do it (which would be almost anything; I've written a few myself, although they are extremely simplistic and I'm sure anyone who actually knows what they're doing could do even better), please submit a patch.  I've filed this as bug 2922, since it's a severe performance issue, so please attach patches there.
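
For concreteness, a sketch of the summing behavior described above (not the literal ti_AC.d source, just the same idea):

hash_t naiveSumHash(string str)
{
    hash_t sum = 0;
    foreach(c; str)
        sum += c;   // just add up the character values; anagrams all collide
    return sum;
}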
May 02, 2009
On Sat, May 2, 2009 at 1:00 PM, dsimcha <dsimcha@yahoo.com> wrote:

> The result is that only about 8400 unique hashes are used, meaning 90+ % of keys cannot be stored directly in the position corresponding to their hash. Note that this is in full hash_t space, not in the modulus space typically used for AAs, so the real results are probably even worse.  If the hashes were uniformly distributed across all possible values of hash_t, one only would expect about 30 collisions, meaning about 99970 unique hashes.  (This is based on monte carlo, not exact combinatorics, so it's only a rough figure, but it doesn't really matter for the take-home message anyhow.)

Here's something bizarre: porting this example to Tango yields 99997 unique hashes, which is right in line with your estimate.  It's bizarre because, well, shouldn't druntime be using _the same typeinfo code?_
May 02, 2009
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> dsimcha:
> > we first need to fix D's string hashing.<
> Ages ago I did suggest this one in the main D newsgroup:
> http://www.azillionmonkeys.com/qed/hash.html
> Bye,
> bearophile

On second thought... the real problem is just a strange bug in D's RTTI: (cast(void*) typeid(immutable(char)[])) != (cast(void*) typeid(char[])).  If one uses the typeinfo for char[] instead of the one for immutable(char)[], the hash performance is actually good, around 96k unique hashes.
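
As a sketch, the workaround in the test program above amounts to changing one line (hypothetical substitution, same caveats as the original test):

// Hash through the TypeInfo for char[] rather than immutable(char)[]:
hash_t hash = typeid(char[]).getHash(cast(void*) &myRandString);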
May 03, 2009
Hello dsimcha,

> == Quote from bearophile (bearophileHUGS@lycos.com)'s article
> 
>> dsimcha:
>> 
>>> we first need to fix D's string hashing.<
>>> 
>> Ages ago I did suggest this one in the main D newsgroup:
>> http://www.azillionmonkeys.com/qed/hash.html
>> Bye,
>> bearophile
> Yeah, upon further reverse engineering, the way string hashing "works"
> is that it simply adds up all the character code values of the
> characters and returns this sum.  The implementation is in
> 
> dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d ,
> 
> and I've verified this both by reading the code and by implementing
> the same thing myself and seeing if the results are the same.  If
> anyone has a better way to do it (which would be almost anything; I've
> written a few myself, although they are extremely simplistic and I'm
> sure anyone who actually knows what they're doing could do even
> better), please submit a patch.  I've filed this as bug 2922, because
> it's a severe performance issue, so please submit there.
> 

IIRC something like this is common:

hash_t ret = 0;
foreach(c;str) { ret *= SomePrime; ret += c; }
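
Written out as a complete function, with the prime picked purely for illustration (31 is a common choice):

hash_t multiplyAddHash(string str)
{
    enum hash_t SomePrime = 31;   // illustrative constant, not a recommendation
    hash_t ret = 0;
    foreach(c; str)
    {
        ret *= SomePrime;
        ret += c;
    }
    return ret;
}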


May 03, 2009
Jarrett Billingsley wrote:
> Here's something bizarre: porting this example to Tango yields 99997
> unique hashes, which is right in line with your estimate.  But I think
> it's bizarre since, well, shouldn't druntime use _the same typeinfo
> code?_

No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
May 03, 2009
On Sun, 03 May 2009 04:59:26 +0300, BCS <none@anon.com> wrote:
>
> IIRC something like this is common:
>
> hash_t ret = 0;
> foreach(c;str) { ret *= SomePrime; ret += c; }
>

I think that's the basis for the best general string hashing funcs around. IIRC from the university, it doesn't matter, in practice, if the multiplier is a prime number or not. So, the multiplication can be replaced with bit shifting (that's as fast as the addition operation).
E.g.

foreach(c; str)
{
    ret = (ret << 4) + c;
}

I quickly googled around and found the algorithm djb2 which uses the multiplier 33:
http://www.cse.yorku.ca/~oz/hash.html

djb2 is obviously a little bit slower but it *may* give slightly better distribution (I'm not gonna run tests now to see if there is a real world difference... ;) ).
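
For reference, djb2 as given on that page, transcribed to D (the 5381 seed and the *33 step are from the linked description):

hash_t djb2(string str)
{
    hash_t hash = 5381;
    foreach(c; str)
        hash = (hash << 5) + hash + c;   // hash * 33 + c
    return hash;
}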
May 03, 2009
Kristian Kilpi wrote:
| On Sun, 03 May 2009 04:59:26 +0300, BCS <none@anon.com> wrote:
|>
|> IIRC something like this is common:
|>
|> hash_t ret = 0;
|> foreach(c;str) { ret *= SomePrime; ret += c; }
|>
|
| I think that's the basis for the best general string hashing funcs
| around. IIRC from the university, it doesn't matter, in practice, if the
| multiplier is a prime number or not. So, the multiplication can be
| replaced with bit shifting (that's as fast as the addition operation).
| E.g.
|
| foreach(c; str)
| {
|     ret = (ret << 4) + c;
| }
|
	That one is very bad because it only takes into account the last few characters of the string (how many depends on the size of the hash). However, several hashing algorithms use 2^n+1 as their multiplier, which is very fast even on old/small/embedded hardware because it can be implemented as a shift and an addition.
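
For example, with 17 (2^4 + 1) as the multiplier, the multiply turns into a shift plus an add (sketch, multiplier chosen only for illustration):

foreach(c; str)
{
    ret = (ret << 4) + ret + c;   // ret * 17 + c
}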

| I quickly googled around and found the algorithm djb2 which uses the
| multiplier 33:
| http://www.cse.yorku.ca/~oz/hash.html
|
| djb2 is obviously a little bit slower but it *may* give slightly better
| distribution (I'm not gonna run tests now to see if there is a real
| world difference... ;) ).

		Jerome
--
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr
May 03, 2009
This guy looks to have a pretty good hash function:
http://www.azillionmonkeys.com/qed/hash.html
He matched Bob Jenkins' One-at-a-time hash on a variety of tests and
beat it on speed.

--bb

On Sun, May 3, 2009 at 5:19 AM, Kristian Kilpi <kjkilpi@gmail.com> wrote:
> On Sun, 03 May 2009 04:59:26 +0300, BCS <none@anon.com> wrote:
>>
>> IIRC something like this is common:
>>
>> hash_t ret = 0;
>> foreach(c;str) { ret *= SomePrime; ret += c; }
>>
>
> I think that's the basis for the best general string hashing funcs around.
> IIRC from the university, it doesn't matter, in practice, if the multiplier
> is a prime number or not. So, the multiplication can be replaced with bit
> shifting (that's as fast as the addition operation).
> E.g.
>
> foreach(c; str)
> {
>    ret = (ret << 4) + c;
> }
>
> I quickly googled around and found the algorithm djb2 which uses the
> multiplier 33:
> http://www.cse.yorku.ca/~oz/hash.html
>
> djb2 is obviously a little bit slower but it *may* give slightly better distribution (I'm not gonna run tests now to see if there is a real world difference... ;) ).
>