Thread overview
[Issue 2922] New: Egregiously bad hashing performance with strings
May 02, 2009
d-bugmail
May 02, 2009
d-bugmail
May 13, 2009
David Simcha
Oct 23, 2009
David Simcha
Nov 06, 2009
David Simcha
Nov 06, 2009
David Simcha
May 02, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=2922

           Summary: Egregiously bad hashing performance with strings
           Product: D
           Version: 2.029
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Keywords: performance
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla@digitalmars.com
        ReportedBy: dsimcha@yahoo.com


The current hashing scheme for strings causes massive hash collisions unnecessarily.  Here's a small program that generates random strings, hashes them, and tracks how frequently each key is used, to demonstrate this issue.

import std.stdio, std.random, std.algorithm, std.range;

void main() {
    uint[hash_t] hashFrq;
    bool[string] alreadyUsed;

    // Generate 100k random strings, compute their hashes and count the
    // frequency of each hash.
    foreach(i; 0..100_000) {
        string myRandString;

        // Make sure that no two strings are equal.
        do {
            myRandString = randString();
        } while(myRandString in alreadyUsed);
        alreadyUsed[myRandString] = true;

        hash_t hash = typeid(string).getHash(cast(void*) &myRandString);
        hashFrq[hash]++;
    }

    auto hashes = hashFrq.keys;
    auto counts = hashFrq.values;
    sort!((a, b) { return a.at!0 < b.at!0; })(zip(counts, hashes));
    foreach(i; 0..hashes.length) {
        writeln(hashes[i], "\t", counts[i]);
    }

    writeln(hashes.length, " unique hashes used.");
}


// Generates a random string of geometrically distributed length composed
// of uniformly distributed characters in [A-Z, a-z].  Expected length is 20.
string randString() {
    bool upperCase;
    bool keepGoing = true;
    string ret;

    uint upperA = cast(uint) 'A';
    uint upperZ = cast(uint) 'Z' + 1;
    uint lowerA = cast(uint) 'a';
    uint lowerZ = cast(uint) 'z' + 1;

    while(keepGoing) {
        upperCase = cast(bool) uniform!"[]"(0, 1);
        ret ~= cast(char)
            (upperCase ? uniform(upperA, upperZ) : uniform(lowerA, lowerZ));
        keepGoing = uniform(0.0, 1.0) > 0.05;
    }
    return ret;
}

The result is that only about 8400 unique hashes are used, meaning 90+ % of keys cannot be stored directly in the position corresponding to their hash. Note that this is in full hash_t space, not in the modulus space typically used for AAs, so the real results are probably even worse.  If the hashes were uniformly distributed across all possible values of hash_t, one only would expect about 30 collisions, meaning about 99970 unique hashes.  (This is based on monte carlo, not exact combinatorics, so it's only a rough figure, but it doesn't really matter for the take-home message anyhow.)

It appears that the current hashing scheme is to simply add up the integer character codes for each byte in the array, and return this as the hash.  This implementation is the default for an array of objects.  Even something simple like rotating the hash value before each add would be a significant improvement.


-- 

May 02, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=2922





------- Comment #1 from dsimcha@yahoo.com  2009-05-02 13:55 -------
Upon further investigation, the problem is that typeid(string) returns the typeinfo for a generic array, not for a char[]:

import std.stdio;

void main() {
    writeln(typeid(immutable(char)[]) is typeid(char[]));  // False.
}

Using typeid(char[]) instead of typeid(string) in the above program fixes the
problem.  I guess noone remembered to change this when strings were changed to
default immutable.


-- 

May 13, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=2922


David Simcha <dsimcha@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




--- Comment #2 from David Simcha <dsimcha@yahoo.com>  2009-05-13 09:11:43 PDT ---
Fixed 2.030.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
October 23, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=2922


David Simcha <dsimcha@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |
           Severity|normal                      |major


--- Comment #3 from David Simcha <dsimcha@yahoo.com> 2009-10-23 08:59:22 PDT ---
I just realized that this was only fixed for string, i.e. immutable(char)[]. It still happens on const(char)[].  To demonstrate this, change the line

hash_t hash = typeid(string).getHash(cast(void*) &myRandString);

to

hash_t hash = typeid(const(char)[]).getHash(cast(void*) &myRandString);

in my test program.

When the typeid is changed to const(char)[], the results are exactly the same as they were for string before this bug was fixed.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
November 06, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=2922



--- Comment #4 from David Simcha <dsimcha@yahoo.com> 2009-11-06 08:22:04 PST ---
Fixed again (I think) 2.036.  Not sure why it's not in the changelog, but running the test program in 2.036 gives good results.  I guess this bug is closable, but I'll have to look at the druntime checkins in more detail b/c it appears that, while string hashing is as good as it ever was, const(char)[] hashing has improved so much that it's now better than string hashing.  I have no idea why they don't just use the same function.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
November 06, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=2922


David Simcha <dsimcha@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED


-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------