November 07, 2012
Performance of hashes and associative arrays
Hello everybody,

We had a conversation with Ali Cehreli about the right way of 
hashing values / objects. This came up about associative arrays 
in the D language, but it applies to any language as well.

One of the points was the following. We have this class:

==
class Student // class representing a student
{
    string firstName; // the first name of the student
    string lastName; // the last name of the student

    this(string firstName, string lastName) // constructor, used below
    {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}
==

Let s be a Student:

==
auto s = new Student("Jean", "Martin");
==

We want to be able to get the hash of s, so we override the 
toHash method of the Student class:

==
class Student
{
    string firstName;
    string lastName;

    override size_t toHash() const
    {
        //...
    }
}
==

The original solution proposed was:

==
    override size_t toHash() const
    {
        return (typeid(firstName).getHash(&firstName) +
                typeid(lastName).getHash(&lastName));
    }
==

However, with this solution, we get the same hash for new 
Student("Jean", "Martin") and new Student("Martin", "Jean"). We 
ran an experiment on the performance of associative arrays in D 
when the hash function is badly designed (returning the same 
hash for too many values): in that case, performance degrades 
dramatically (see the end of the post for more information).

So here is the second solution that was proposed, which avoids 
giving the same hash to two different names:

==
    override size_t toHash() const
    {
        // we suppose that ":" is not a valid character in a name
        auto h = firstName ~ ":" ~ lastName;
        return typeid(h).getHash(&h);
    }
==

A ":" was added so new Student("Martin", "Jean") gets a different 
hash than new Student("Mar", "tinJean")).
Let's call the following solution "second bis solution" :

==
    override size_t toHash() const
    {
        auto h = firstName ~ lastName;
        return typeid(h).getHash(&h);
    }
==

This variant supposes that being named "Mar tinJean" instead of 
"Martin Jean" is really strange and therefore quite unusual.
However, both of these solutions imply a string concatenation 
each time we need the hash.
Ali agreed that concatenating strings each time would indeed be 
inefficient. He thought we might cache the value (third solution):

==
class Student
{
    string firstName;
    string lastName;
    size_t cachedHash; // 0 means "not computed yet"

    override size_t toHash() const
    {
        if (cachedHash == 0)
        {
            auto h = firstName ~ ":" ~ lastName;
            cachedHash = typeid(h).getHash(&h);
        }
        return cachedHash;
    }
}
==

This solution makes no compromise on the "quality" of the hash 
and is fast after the first use, but it needs to keep the hash 
as part of the class. The hash is still calculated once for each 
new object used as an index in an associative array, even if the 
hash for this specific name has already been calculated in 
another object.

The questions are:
 - What is the most efficient solution, and in which case?
 - Is it preferable to try to make the hash unique (second bis 
solution), or to make the hash easier to evaluate and risk, in 
rare cases, getting the same hash for two different objects 
(second solution)?
 - Is solution 3 good practice? Is it preferable to the other 
solutions? When?
 - Is solution 1 preferable to the other solutions, given that 
first name / last name inversions are quite "rare"?
 - How does solution 1 compare to solution 2 bis?

It would also be interesting to have ideas for the general case, 
i.e. arbitrary objects rather than names, which have 
particularities like "first names are rarely last names and vice 
versa", or for other specific situations.
The questions are oriented toward associative arrays in D, but 
they still apply to other use cases of hashes.


For those who are interested, here is Ali's experiment on the 
performance of associative arrays in D depending on the design of 
the hash function (I took the liberty of modifying it slightly).

==
import std.stdio;
import std.random;
import std.string;

class Clock
{
    int hour;
    int minute;
    int second;

    static size_t opCmpCount = 0;
    static size_t opEqualsCount = 0;
    static size_t toHashCount = 0;

    override equals_t opEquals(Object o) const
    {
        ++opEqualsCount;

        auto rhs = cast(const Clock)o;

        return (rhs &&
                (hour == rhs.hour) &&
                (minute == rhs.minute) &&
                (second == rhs.second));
    }

    override int opCmp(Object o) const
    {
        ++opCmpCount;

        /* Taking advantage of the automatically-maintained
         * order of the types. */
        if (typeid(this) != typeid(o)) {
            return typeid(this).opCmp(typeid(o));
        }

        auto rhs = cast(const Clock)o;
        /* No need to check whether rhs is null, because it is
         * known at this line that o has the same type as this. */

        return (hour != rhs.hour
                ? hour - rhs.hour
                : (minute != rhs.minute
                   ? minute - rhs.minute
                   : second - rhs.second));
    }

    override string toString() const
    {
        return format("%02s:%02s:%02s", hour, minute, second);
    }

    this(int hour, int minute, int second)
    {
        this.hour = hour;
        this.minute = minute;
        this.second = second;
    }

    override hash_t toHash() const
    {
        ++toHashCount;
        return hour + minute + second; // method 1
        // return hour + 60 * minute + 3600 * second; // method 2
    }

    static void printCounts()
    {
        writefln("opEquals: %8s", opEqualsCount);
        writefln("opCmp   : %8s", opCmpCount);
        writefln("toHash  : %8s", toHashCount);
    }
}

Clock randomClock()
{
    return new Clock(uniform(0, 24), uniform(0, 60), uniform(0, 60));
}

void main()
{
    enum count = 10_000;
    Clock[Clock] aa;

    foreach (i; 0 .. count) {
        auto entry = randomClock();
        aa[entry] = entry;
    }

    foreach (i; 0 .. count) {
        auto result = (randomClock() in aa);
    }

    Clock.printCounts();
}
==

We both get the following results for method 1:

opEquals:        0
opCmp   :  1438438
toHash  :    20000


and the following results for method 2:

opEquals:        0
opCmp   :     1681
toHash  :    20000

Apart from the fact that the associative arrays of gdc and dmd 
don't use the opEquals method, we see that method 1 implies 
approx. 856 times more comparisons than method 2. I calculated 
that method 1 gives 139 different hashes for 80063 different 
elements, which means each hash represents an average of approx. 
576 elements. Method 2 gives exactly one hash per element.
This is an extreme case; results might be completely different 
for a function whose hashes only "sometimes" represent 2 elements.
November 07, 2012
Re: Performance of hashes and associative arrays
On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse 
wrote:

> ==
>     override size_t toHash() const
>     {
>         return (typeid(firstName).getHash(&firstName) +
>                 typeid(lastName).getHash(&lastName));
>     }
> ==
>

Isn't the real problem the addition? You want to mix the bits 
together in a consistent way that is not commutative, so that 
swapping the fields changes the result. I've seen things like:

result = 37 * first + 17 * second
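
Applied to the Student example, that mixing might look like this 
(just a sketch; the constants 37 and 17 are the arbitrary ones 
above, nothing canonical):

    override size_t toHash() const
    {
        // order-sensitive: ("Jean", "Martin") and ("Martin", "Jean")
        // now hash differently, and nothing is concatenated
        return 37 * typeid(firstName).getHash(&firstName)
             + 17 * typeid(lastName).getHash(&lastName);
    }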

I've incorporated this functionality in a toHash that can be 
mixed in. Any feedback on it would be great. See:

http://forum.dlang.org/post/tbjrqaogxegtyexnfags@forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d

It only supports structs, but the hashing could be made to 
support classes. It allows the following code for your example, 
providing toHash, opEquals and opCmp. Below, the hash codes for 
s1 and s2 are different, the instances are not equal, and s1 is 
less than s2, which is what you want.

import std.stdio;
import opmix.mix;

struct Student // class representing a student
{
  mixin(HashSupport);
  string firstName; // the first name of the student
  string lastName; // the last name of the student
}

void main() {
  auto s1 = Student("Jean", "Martin");
  auto s2 = Student("Martin", "Jean");
  writeln(s1.toHash(), " vs ", s2.toHash());
  writeln(s1 == s2);
  writeln(s1 < s2);
}


> However, with this solution, we get the same hash for new 
> Student("Jean", "Martin") and new Student("Martin", "Jean"). We 
> ran an experiment on the performance of associative arrays in D 
> when the hash function is badly designed (returning the same 
> hash for too many values): in that case, performance degrades 
> dramatically (see the end of the post for more information).

This makes sense and is not particular to D.

> Ali agreed that concatenating strings each time would indeed be 
> inefficient. He thought we might cache the value (third 
> solution):

Caching the hash code is interesting, and on large classes it 
could save you a lot. But isn't the signature shown const? Would 
that compile? Also, what if you change the strings? You get the 
wrong hash. I suppose you could limit writes to properties and 
reset the cached hash code on any write.

> The questions are:
>  - What is the most efficient solution, and in which case?

No string concatenation is good. I think a single pass on all 
important data (in most cases that is all the data) is the goal.

> This is an extreme case; results might be completely different 
> for a function whose hashes only "sometimes" represent 2 
> elements.

Very true. If I increase count to 10_000_000 I see opCmp: 1913600 
for the smart method and (several minutes later) opCmp: 907216764 
for the simple addition (method 1).
In this case you know something special about the data and can 
take advantage of it. If I run the example using 
mixin(HashSupport) it does opCmp: 7793499, which is about 4 times 
as many compares as the smart one. The simple addition does 474 
times as many compares as the smart one, so it is clearly very 
bad. So, if you know something special about the data, like the 
fact that it can easily be converted into a single number such as 
seconds, by all means use that. But remember: the next time you 
add something to the class, if you don't have some "automated" 
way of pulling in info from all fields, you need to revisit the 
hash (and opCmp and opEquals). A sketch of such an automated 
approach follows.
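
Something like this, for instance (a hypothetical sketch, not 
part of opmix; it derives the hash from all fields via .tupleof 
so a newly added field cannot be forgotten):

import std.stdio;

// hypothetical helper: mix the hashes of every field, in order
size_t fieldsHash(T)(ref T t)
{
    size_t h = 0;
    foreach (ref field; t.tupleof)
    {
        h = 37 * h + typeid(field).getHash(&field); // order-sensitive mix
    }
    return h;
}

struct Student
{
    string firstName;
    string lastName;
}

void main()
{
    auto s1 = Student("Jean", "Martin");
    auto s2 = Student("Martin", "Jean");
    writeln(fieldsHash(s1), " vs ", fieldsHash(s2)); // different hashes
}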

Thanks
Dan
November 10, 2012
Re: Performance of hashes and associative arrays
Hello,

Thanks for this complete answer. I will take a look at your code.
Additionally, Ali gave me a really interesting link about hashing, good 
practices, what is efficient, etc. If you haven't read it, it might 
interest you. Here it is:

http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

On 07/11/2012 13:10, Dan wrote:
> On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:
> [...]
>> Ali agreed that concatenating strings each time would indeed be
>> inefficient. He thought we might cache the value (third solution):
>
> Caching the hash code is interesting, and on large classes it could
> save you a lot. But isn't the signature shown const? Would that compile?

Indeed, I didn't test this code; I just wanted to make explicit what I 
meant. Thank you for pointing it out.

> Also, what if you change the strings? You get the wrong hash. I
> suppose you could limit writes to properties and reset the cached
> hash code on any write.

Yes, good question and good idea; I think I would do it this way. I 
hadn't thought about it.

>
>> The questions are:
>>  - What is the most efficient solution, and in which case?
>
> No string concatenation is good. I think a single pass on all important
> data (in most cases that is all the data) is the goal.

I'm not sure I understood correctly. Did you mean that string 
concatenations are good?

I was thinking about a hash function that would take several arguments 
and hash them together. That would allow taking more than one string 
into account in the hash while avoiding concatenation.
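
For instance, something like this (a hypothetical, untested sketch):

==
// hash several values together without concatenating them
size_t combinedHash(Args...)(Args args)
{
    size_t h = 0;
    foreach (ref arg; args)
    {
        h = 37 * h + typeid(arg).getHash(&arg); // order-sensitive mix
    }
    return h;
}
==

In Student.toHash this would be: return combinedHash(firstName, lastName);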

>[...]
>
> Thanks
> Dan
>
November 10, 2012
Re: Performance of hashes and associative arrays
On Saturday, 10 November 2012 at 07:55:18 UTC, Raphaël Jakse 
wrote:
> Hello,
>
> Thanks for this complete answer. I will take a look to your 
> code.

Ok - good. I've been using 2.061 which I just realized allows 
"dup" on an associative array, a feature which was not available 
in 2.060. So mixin(Dup) and the unit tests require 2.061.

> If you haven't read it, it might interest you. Here it is:
>

I had not seen it and will read - thanks.

>>
>>> The questions are:
>>> - What is the most efficient solution, and in which case?
>>
>> No string concatenation is good. I think a single pass on all
>> important data (in most cases that is all the data) is the goal.
>
> I'm not sure I understood correctly. Did you mean that string 
> concatenations are good?
>

I just meant string concatenation is likely unnecessary. Imagine 
two one megabyte strings. It is easy to concat them and call the 
string hash function on the result but you have to create a 2Mb 
string first. Alternatively, you could hash each and combine the 
hash codes in some consistent way.

> I was thinking about a hash function that would take several 
> arguments and hash them together. That would allow taking more 
> than one string into account in the hash while avoiding 
> concatenation.
>
Yes.

Thanks
Dan
November 10, 2012
Re: Performance of hashes and associative arrays
On 11/10/2012 05:17 AM, Dan wrote:

> I just meant string concatenation is likely unnecessary. Imagine two one
> megabyte strings. It is easy to concat them and call the string hash
> function on the result but you have to create a 2Mb string first.

Just a random thought: std.range.chain can obviate that copy:

  http://dlang.org/phobos/std_range.html#chain
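
For example (a sketch, hashing the chained characters directly with a 
simple multiplicative hash; no intermediate string is built):

import std.range : chain;

size_t hashNoCopy(string a, string b)
{
    size_t h = 0;
    foreach (c; chain(a, ":", b)) // ":" separator, no temporary string
    {
        h = 37 * h + c;
    }
    return h;
}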

Ali
November 10, 2012
Re: Performance of hashes and associative arrays
On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
> We want to be able to get the hash of s, so we override the toHash
> method of the Student class:

OK, now I'm curious.  Assuming I don't write a custom re-implementation, how 
would a custom struct or class be hashed?  (What about a tuple?)

I ask because I'm considering associative arrays where the key is a custom class 
or tuple as part of a current coding project.
November 10, 2012
Re: Performance of hashes and associative arrays
On Saturday, 10 November 2012 at 18:18:07 UTC, Joseph Rushton 
Wakeling wrote:
> On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
>> We want to be able to get the hash of s, so we override the
>> toHash method of the Student class:
>
> OK, now I'm curious.  Assuming I don't write a custom 
> re-implementation, how would a custom struct or class be 
> hashed?  (What about a tuple?)
>
> I ask because I'm considering associative arrays where the key 
> is a custom class or tuple as part of a current coding project.

Not sure I understand the question, but here is how I'm doing it. 
No guarantees, but the output looks promising. The code follows 
the output.

Thanks
Dan

-----------------------------
true
false
false
true
7F053B2DCFC0
20883845 vs 20883845
-----------------------------
import std.stdio;
import std.traits;
import std.typecons;
import opmix.mix;

struct S {
  mixin(HashSupport);
  alias Tuple!(int, char, string) X;
  X x;
  char[] mutable;
}

void main() {
  S s1 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
  S s2 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
  writeln(s1==s2);
  s1.x[0]++;
  writeln(s1==s2);
  writeln(s1<s2);
  writeln(s2<s1);
  writeln(s1 in [ s1: 3 ]);
  s2.x[0]++;
  writeln(s1.toHash(), " vs ", s2.toHash());
}
November 11, 2012
Re: Performance of hashes and associative arrays
On 11/10/2012 05:37 AM, Joseph Rushton Wakeling wrote:
> On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
>> We want to be able to get the hash of s, so we override the toHash
>> method of the Student class:
>
> OK, now I'm curious. Assuming I don't write a custom re-implementation,
> how would a custom struct or class be hashed? (What about a tuple?)
>
> I ask because I'm considering associative arrays where the key is a
> custom class or tuple as part of a current coding project.

For classes, because they are reference types, it is the object identity 
that determines the hash value (probably the pointer to the actual 
object). As a result, even when the values of the members are the same, 
two objects have different hash values.

For structs, because they are value types, it is the bytes that make up 
the object that determine the hash value.

But beware: the associative array members of structs are not hashed by 
the values of their elements. (There was a long discussion recently on 
the main newsgroup on this topic.)
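
A quick way to see the difference (a sketch, relying on the default 
hashing described above):

import std.stdio;

class C { int x; this(int x) { this.x = x; } }
struct S { int x; }

void main()
{
    auto c1 = new C(1);
    auto c2 = new C(1);
    // identity-based: almost certainly false despite equal members
    writeln(typeid(c1).getHash(&c1) == typeid(c2).getHash(&c2));

    auto s1 = S(1);
    auto s2 = S(1);
    // value-based: true, the bytes are the same
    writeln(typeid(s1).getHash(&s1) == typeid(s2).getHash(&s2));
}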

Ali
November 11, 2012
Re: Performance of hashes and associative arrays
On 11/11/2012 02:35 AM, Ali Çehreli wrote:
> For classes, because they are reference types, it is the object identity that
> determines the hash value (probably the pointer to the actual object). As a
> result, even when the values of the members are the same, two objects have
> different hash values.
>
> For structs, because they are value types, it is the bytes that make up the
> object that determine the hash value.

Ahh, clear.  So where classes are concerned there's a definite need to override 
toHash if you want the _values_ to be the basis of the association, rather than 
the object identity ... this would also have potential implications for memory 
blow-up where a class is used as the key type, no?

> But beware: the associative array members of structs are not hashed by the
> values of their elements. (There was a long discussion recently on the main
> newsgroup on this topic.)

I do recall that discussion, and I'll bear that in mind.

So far the only practical consideration I've had was for a Tuple!() to be the 
key, rather than an actual struct or class; as it happens I've avoided that so 
far, and I think I may continue to do so while there are still issues around 
hashing.

The more pressing reason for avoiding associative arrays, as it happens, was 
simply that I was getting unacceptably large memory usage. I had some data 
stored as byte[size_t] (really this should have been a set of size_t's, but it 
seemed a theoretically acceptable stopgap while there's no set class in 
Phobos), and for some reason that generated twice the memory consumption of a 
regular size_t[] array.  I guess perhaps because, although a byte should mean 
just that, in practice each value in the byte[size_t] associative array was 
taking up a whole word?

It's a shame, because functionally it was useful to be able to do things of the 
form if(i in list), but memory-wise it just didn't work for the size of data I'm 
dealing with.
November 11, 2012
Re: Performance of hashes and associative arrays
Raphaël Jakse wrote:

> It would also be interesting to have ideas for the general
> case 

Yes, ideas might be interesting. :-)

A root of "good" hashing is incorporated in method 2 of your 
posting:

spread the values produced by the hash function uniformly over 
the interval size_t.min .. size_t.max.

Therefore:

Compute the hash value for a given key by multiplying each byte 
of the key by a factor that increases from the byte with lowest 
significance to the byte with highest significance; then, of 
course, add the results.
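
A minimal sketch of that computation (hypothetical; a real factor 
list would use growing primes per remarks b and c below, here a 
simple 257^i progression stands in):

size_t spreadHash(const(ubyte)[] key)
{
    size_t h = 0;
    size_t factor = 1;
    foreach (b; key) // lowest-significance byte first
    {
        h += b * factor;
        factor *= 257; // quotient of consecutive factors exceeds 256
    }
    return h;
}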

Remarks:
a)
Because of the general case: assume that the original key 
contains much redundancy. Therefore do not use the bytes of the 
original key directly, but compute an (at least close to) 
lossless compression of the original key.

b)
The factors in the factor list should be primes, for spreading 
the hash value uniformly over the intended range.

c)
The quotient of two consecutive primes in the factor list should 
be greater than 256, because the used key might not contain any 
redundancy.

-manfred