November 07, 2012
Hello everybody,

We had a conversation with Ali Cehreli about the right way to hash values / objects. It came up in the context of D's associative arrays, but it applies to any language as well.

One of the points was about the following class:

==
class Student // class representing a student
{
    string firstName; // the first name of the student
    string lastName;  // the last name of the student

    this(string firstName, string lastName)
    {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}
==

Let s be a Student:

==
auto s = new Student("Jean", "Martin");
==

We want to be able to get the hash of s. Therefore, we re-implement the toHash method of the Student class:

==
class Student
{
    string firstName;
    string lastName;

    override size_t toHash() const
    {
        //...
    }
}
==

The original solution proposed was:

==
    override size_t toHash() const
    {
        return (typeid(firstName).getHash(&firstName) +
                typeid(lastName).getHash(&lastName));
    }
==

However, with this solution, we get the same hash for new Student("Jean", "Martin") and new Student("Martin", "Jean"). We ran an experiment on the performance of associative arrays in D when the hash function is badly designed (returning the same hash for too many values). When the hash function returns the same hash for too many values, performance degrades dramatically (see the end of the post for more information).

So here is the second solution that was proposed, which avoids giving the same hash to two different names:

==
    override size_t toHash() const
    {
        auto h = firstName ~ ":" ~ lastName; // we assume that ":" is not a valid character in a name
        return typeid(h).getHash(&h);
    }
==

A ":" was added so new Student("Martin", "Jean") gets a different hash than new Student("Mar", "tinJean")).
Let's call the following solution "second bis solution" :

==
    override size_t toHash() const
    {
        auto h = firstName ~ lastName;
        return typeid(h).getHash(&h);
    }
==

This solution assumes that being named "Mar tinJean" instead of "Martin Jean" is really strange and therefore quite unusual.
However, both of these solutions imply a string concatenation each time the hash is needed.
Ali agreed that concatenating strings each time would indeed be inefficient. He thought we might cache the value (third solution):

==
class Student
{
    string firstName;
    string lastName;
    size_t cachedHash; // 0 means "not computed yet"

    override size_t toHash() const
    {
        if (cachedHash == 0)
        {
            auto h = firstName ~ ":" ~ lastName;
            cachedHash = typeid(h).getHash(&h);
        }
        return cachedHash;
    }
}
==

This solution doesn't compromise the "quality" of the hash and is fast after the first use, but it has to keep the hash as part of the class. The hash is computed for each new object used as a key in an associative array, even if the hash for this specific name has already been computed in another object.

The questions are:
 - Which solution is the most efficient, and in which cases?
 - Is it preferable to try to make the hash unique (second bis solution), or to make the hash cheaper to evaluate and risk, in rare cases, getting the same hash for two different objects (second solution)?
 - Is solution 3 good practice? Is it preferable to the other solutions? When?
 - Is solution 1 preferable to the other solutions, given that first name / last name inversions are quite "rare"?
 - How does solution 1 compare to solution 2 bis?

It would also be interesting to have ideas for the general case, i.e. arbitrary objects, not just names with particularities like "first names are not often last names and vice versa", or for other specific situations.
The questions are oriented toward associative arrays in D, but they still apply to other use cases of hashes.


For those who are interested, here is Ali's experiment on the performance of associative arrays in D depending on the design of the hash function (I took the liberty of modifying it slightly).

==
import std.stdio;
import std.random;
import std.string;

class Clock
{
    int hour;
    int minute;
    int second;

    static size_t opCmpCount = 0;
    static size_t opEqualsCount = 0;
    static size_t toHashCount = 0;

    override equals_t opEquals(Object o) const
    {
        ++opEqualsCount;

        auto rhs = cast(const Clock)o;

        return (rhs &&
                (hour == rhs.hour) &&
                (minute == rhs.minute) &&
                (second == rhs.second));
    }

    override int opCmp(Object o) const
    {
        ++opCmpCount;

        /* Taking advantage of the automatically-maintained
         * order of the types. */
        if (typeid(this) != typeid(o)) {
            return typeid(this).opCmp(typeid(o));
        }

        auto rhs = cast(const Clock)o;
        /* No need to check whether rhs is null, because it is
         * known on this line that it has the same type as o. */

        return (hour != rhs.hour
                ? hour - rhs.hour
                : (minute != rhs.minute
                   ? minute - rhs.minute
                   : second - rhs.second));
    }

    override string toString() const
    {
        return format("%02s:%02s:%02s", hour, minute, second);
    }

    this(int hour, int minute, int second)
    {
        this.hour = hour;
        this.minute = minute;
        this.second = second;
    }

    override hash_t toHash() const
    {
        ++toHashCount;
        return hour + minute + second; // method 1
        // return hour + 60 * minute + 3600 * second; // method 2
    }

    static void printCounts()
    {
        writefln("opEquals: %8s", opEqualsCount);
        writefln("opCmp   : %8s", opCmpCount);
        writefln("toHash  : %8s", toHashCount);
    }
}

Clock randomClock()
{
    return new Clock(uniform(0, 24), uniform(0, 60), uniform(0, 60));
}

void main()
{
    enum count = 10_000;
    Clock[Clock] aa;

    foreach (i; 0 .. count) {
        auto entry = randomClock();
        aa[entry] = entry;
    }

    foreach (i; 0 .. count) {
        auto result = (randomClock() in aa);
    }

    Clock.printCounts();
}
==

We both get the following results for method 1:

opEquals:        0
opCmp   :  1438438
toHash  :    20000


and the following results for method 2:

opEquals:        0
opCmp   :     1681
toHash  :    20000

Apart from the fact that the associative arrays in gdc and dmd don't use the opEquals method, we see that using method 1 implies approx. 856 times more comparisons than method 2. I calculated that method 1 gives 139 different hashes for 80063 different elements, which means each hash represents an average of approx. 576 elements. Method 2 gives exactly one hash per element.
This may seem like an extreme case; we might think the results would be completely different with a function giving hashes that only "sometimes" represent 2 elements.
November 07, 2012
On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:

> ==
>     override size_t toHash() const
>     {
>         return (typeid(firstName).getHash(&firstName) +
>                 typeid(lastName).getHash(&lastName));
>     }
> ==
>

Isn't the real problem the addition? You want to mix the bits together in a consistent way that is not commutative (plain addition is, which is why swapping the names gives the same sum). I've seen things like:

result = 37 * first + 17 * second
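
Applied to your Student class, a minimal sketch of that idea might look like this (the multipliers are arbitrary small primes, nothing special):

    override size_t toHash() const
    {
        // hash each field separately, then combine with distinct
        // multipliers so that swapping the names changes the result
        size_t h1 = typeid(firstName).getHash(&firstName);
        size_t h2 = typeid(lastName).getHash(&lastName);
        return 37 * h1 + 17 * h2;
    }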

I've incorporated this functionality in a toHash that can be mixed in. Any feedback on it would be great. See:

http://forum.dlang.org/post/tbjrqaogxegtyexnfags@forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d

It only supports structs, but the hashing could be made to support classes. It allows the following code for your example. It provides toHash, opEquals and opCmp. Below, the hash codes for s1 and s2 are different, the instances are not equal, and s1 is less than s2, which is what you want.

import std.stdio;
import opmix.mix;

struct Student // class representing a student
{
  mixin(HashSupport);
  string firstName; // the first name of the student
  string lastName; // the last name of the student
}

void main() {
  auto s1 = Student("Jean", "Martin");
  auto s2 = Student("Martin", "Jean");
  writeln(s1.toHash(), " vs ", s2.toHash());
  writeln(s1 == s2);
  writeln(s1 < s2);
}


> However, with this solution, we get the same hash for new Student("Jean", "Martin") and new Student("Martin", "Jean"). We ran an experiment on the performance of associative arrays in D when the hash function is badly designed (returning the same hash for too many values). When the hash function returns the same hash for too many values, performance degrades dramatically (see the end of the post for more information).

This makes sense and is not particular to D.

> Ali agreed that concatenating strings each time would indeed be inefficient. He thought we might cache the value (third solution):

Interesting idea, caching the hash code; on large classes it could save you. But isn't the signature shown const? Would that compile? Also, what if you change the strings? You would get the wrong hash. I suppose you could limit writes to properties and null the hash code on any write.

> The questions are:
>  - Which solution is the most efficient, and in which cases?

No string concatenation is good. I think a single pass on all important data (which in most cases is all the data) is the goal.

> This may seem like an extreme case; we might think the results would be completely different with a function giving hashes that only "sometimes" represent 2 elements.

Very true. If I increase the count to 10_000_000, I see opCmp:1913600 for the smart method and (several minutes later) opCmp:907216764 for the simple addition (method 1).
In this case you know something special about the data and can take advantage of it. If I run the example using mixin(HashSupport), it does opCmp:7793499, which is about 4 times as many compares as the smart method. The simple addition does 474 times as many compares as the smart method, so it is clearly very bad. So, if you know something special about the data, for example that it can easily be converted into a single number such as seconds, by all means use that. But remember: the next time you add something to the class, if you don't have some "automated" way of pulling in info from all fields, you need to revisit the hash (and opCmp and opEquals).

Thanks
Dan


November 10, 2012
Hello,

Thanks for this complete answer. I will take a look at your code.
Additionally, Ali gave me a really interesting link about hashing, good practices, what is efficient, etc. If you haven't read it, it might interest you. Here it is:

http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

On 07/11/2012 13:10, Dan wrote:
> On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:
> [...]
>> Ali agreed that concatenating strings each time would indeed be
>> inefficient. He thought we might cache the value (third solution):
>
> Interesting idea, caching the hash code; on large classes it could save
> you. But isn't the signature shown const? Would that compile?

Indeed, I didn't test this code; I just wanted to make explicit what I meant. Thank you for pointing it out.

> Also, what if you change the strings? You would get the wrong hash. I suppose
> you could limit writes to properties and null the hash code on any write.

Yes, good question and good idea; I think I would do it this way. I hadn't thought about it.
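
Something like this, maybe (an untested sketch; I leave toHash non-const here so it can write the cache, which sidesteps the const problem you pointed out):

==
class Student
{
    private string firstName_;
    private string lastName_;
    private size_t cachedHash; // 0 means "not computed yet"

    @property string firstName() const { return firstName_; }
    @property void firstName(string value)
    {
        firstName_ = value;
        cachedHash = 0; // invalidate the cached hash on any write
    }

    // ... the same pattern for lastName ...

    override size_t toHash()
    {
        if (cachedHash == 0)
        {
            auto h = firstName_ ~ ":" ~ lastName_;
            cachedHash = typeid(h).getHash(&h);
        }
        return cachedHash;
    }
}
==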

>
>> The questions are:
>>  - Which solution is the most efficient, and in which cases?
>
> No string concatenation is good. I think a single pass on all important
> data (which in most cases is all the data) is the goal.

I'm not sure I understood correctly. You meant that string concatenations are good, right?

I was thinking about a hash function that would take several arguments and hash them together. That would take into account more than one string in the hash while avoiding concatenation.
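
For instance, something along these lines (a hypothetical helper, untested; 31 is just an arbitrary small prime):

==
size_t combinedHash(Args...)(Args args)
{
    size_t result = 0;
    // hash each argument on its own and fold it into the result;
    // the multiplication makes the combination order-dependent
    foreach (arg; args)
    {
        result = result * 31 + typeid(arg).getHash(&arg);
    }
    return result;
}

// usage inside Student.toHash: return combinedHash(firstName, lastName);
==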

>[...]
>
> Thanks
> Dan
>

November 10, 2012
On Saturday, 10 November 2012 at 07:55:18 UTC, Raphaël Jakse wrote:
> Hello,
>
> Thanks for this complete answer. I will take a look to your code.

OK, good. I've been using 2.061, which I just realized allows "dup" on an associative array, a feature that was not available in 2.060. So mixin(Dup) and the unit tests require 2.061.

> If you didn't read it, it might interest you. Here it is:
>

I had not seen it and will read - thanks.

>>
>>> The questions are:
>>> - Which solution is the most efficient, and in which cases?
>>
>> No string concatenation is good. I think a single pass on all important
>> data (which in most cases is all the data) is the goal.
>
> I'm not sure I understood well. You wanted to say that string contatenations are good, right ?
>

I just meant that string concatenation is likely unnecessary. Imagine two one-megabyte strings. It is easy to concatenate them and call the string hash function on the result, but you have to create a 2 MB string first. Alternatively, you could hash each one and combine the hash codes in some consistent way.

> I was thinking about a hash function that would take several arguments and hash them together. That would take into account more than one string in the hash while avoiding concatenation.
>
Yes.

Thanks
Dan


November 10, 2012
On 11/10/2012 05:17 AM, Dan wrote:

> I just meant that string concatenation is likely unnecessary. Imagine two
> one-megabyte strings. It is easy to concatenate them and call the string hash
> function on the result, but you have to create a 2 MB string first.

Just a random thought: std.range.chain can obviate that copy:

  http://dlang.org/phobos/std_range.html#chain
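
For example, one could accumulate a hash over the chained characters directly, without building an intermediate string. A rough sketch with FNV-1a constants (assuming 64-bit size_t; this is not the hash the runtime itself uses):

import std.range;

size_t hashNoConcat(string a, string b)
{
    enum size_t offsetBasis = 14_695_981_039_346_656_037UL;
    enum size_t fnvPrime = 1_099_511_628_211UL;
    size_t h = offsetBasis;
    // walk both strings (and the separator) as a single range
    foreach (c; chain(a, ":", b))
    {
        h ^= c;
        h *= fnvPrime;
    }
    return h;
}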

Ali

November 10, 2012
On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
> We want to be able to get the hash of s. Therefore, we re-implement the toHash
> method of the Student class:

OK, now I'm curious.  Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed?  (What about a tuple?)

I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.
November 10, 2012
On Saturday, 10 November 2012 at 18:18:07 UTC, Joseph Rushton Wakeling wrote:
> On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
>> We want to be able to get the hash of s. Therefore, we re-implement the toHash
>> method of the Student class:
>
> OK, now I'm curious.  Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed?  (What about a tuple?)
>
> I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.

Not sure I understand the question, but here is how I'm doing it. No guarantees, but the output looks promising. The code follows the output.

Thanks
Dan

-----------------------------
true
false
false
true
7F053B2DCFC0
20883845 vs 20883845
-----------------------------
import std.stdio;
import std.traits;
import std.typecons;
import opmix.mix;

struct S {
  mixin(HashSupport);
  alias Tuple!(int, char, string) X;
  X x;
  char[] mutable;
}

void main() {
  S s1 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
  S s2 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
  writeln(s1==s2);
  s1.x[0]++;
  writeln(s1==s2);
  writeln(s1<s2);
  writeln(s2<s1);
  writeln(s1 in [ s1: 3 ]);
  s2.x[0]++;
  writeln(s1.toHash(), " vs ", s2.toHash());
}

November 11, 2012
On 11/10/2012 05:37 AM, Joseph Rushton Wakeling wrote:
> On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
>> We want to be able to get the hash of s. Therefore, we re-implement the
>> toHash method of the Student class:
>
> OK, now I'm curious. Assuming I don't write a custom re-implementation,
> how would a custom struct or class be hashed? (What about a tuple?)
>
> I ask because I'm considering associative arrays where the key is a
> custom class or tuple as part of a current coding project.

For classes, because they are reference types, it is the object identity that determines the hash value (probably the pointer to the actual object). As a result, even when the values of the members are the same, two objects have different hash values.

For structs, because they are value types, it is the bytes that make up the object that determine the hash value.
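
For example, this little test should print false for the classes and true for the structs (a sketch, relying on the default hash implementations):

import std.stdio;

class C { int x; this(int x) { this.x = x; } }
struct S { int x; }

void main()
{
    auto c1 = new C(42);
    auto c2 = new C(42);
    writeln(typeid(c1).getHash(&c1) == typeid(c2).getHash(&c2)); // false: identity

    auto s1 = S(42);
    auto s2 = S(42);
    writeln(typeid(s1).getHash(&s1) == typeid(s2).getHash(&s2)); // true: same bytes
}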

But beware: the associative array members of structs are not hashed by the values of their elements. (There was a long discussion recently on the main newsgroup on this topic.)

Ali

November 11, 2012
On 11/11/2012 02:35 AM, Ali Çehreli wrote:
> For classes, because they are reference types, it is the object identity that
> determines the hash value (probably the pointer to the actual object). As a
> result, even when the values of the members are the same, two objects have
> different hash values.
>
> For structs, because they are value types, it is the bytes that make up the
> object that determine the hash value.

Ahh, clear.  So where classes are concerned there's a definite need to override toHash if you want the _values_ to be the basis of the association, rather than the object itself ... this would also have potential implications for memory blow-up where a class is used as the key type, no?

> But beware: the associative array members of structs are not hashed by the
> values of their elements. (There was a long discussion recently on the main
> newsgroup on this topic.)

I do recall that discussion, and I'll bear that in mind.

So far the only practical consideration I've had was for a Tuple!() to be the key, rather than an actual struct or class; as it happens I've avoided that so far, and I think I may continue to do so while there are still issues around hashing.

The more pressing reason for avoiding associative arrays, as it happens, was simply that I was getting unacceptably large memory usage as a result: I'd got some data stored in the form of byte[size_t] (really this should have been a set of size_t's, but that seemed a theoretically acceptable stopgap while there's no set class in Phobos), and for some reason that was generating memory consumption twice that of a regular size_t[] array.  I guess perhaps because, although a byte should mean just that, in practice each value in the byte[size_t] associative array was taking up a whole word?

It's a shame, because functionally it was useful to be able to do things of the form if(i in list), but memory-wise it just didn't work for the size of data I'm dealing with.
November 11, 2012
Jakse wrote:

> It would also be interesting to have ideas for the general case

Yes, ideas might be interesting. :-)

A root of "good" hashing is incorporated in method 2 of your posting:

spread the values produced by the hash function uniformly over the interval size_t.min .. size_t.max.

Therefore:

Compute the hash value for a given key by multiplying each byte of the key by a factor that increases from the byte with the lowest significance to the byte with the highest significance; then, of course, add the results.

Remarks:
a)
Because of the general case: assume that the original key
contains much redundancy. Therefore do not use the bytes of the
original key but compute an (at least close to) lossless
compression of the original key.

b)
The factors in the factor list should be primes, for spreading
the hash value uniformly over the intended range.

c)
The quotient of two consecutive primes in the factor list should
be greater than 256, because the used key might not contain any
redundancy.
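
A rough sketch (illustrative only; for simplicity the factor list here is
powers of a single prime rather than a list of distinct primes):

size_t byteHash(const(ubyte)[] key)
{
    size_t h = 0;
    size_t factor = 1;
    // multiply each byte by a growing factor, from the byte with the
    // lowest significance to the byte with the highest significance
    foreach (b; key)
    {
        h += b * factor;
        factor *= 257; // prime; the ratio of consecutive factors exceeds 256
    }
    return h;
}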

-manfred