Thread overview
Strings hash value memoized
Jul 12, 2008
bearophile
Jul 12, 2008
bearophile
Jul 12, 2008
Koroskin Denis
Jul 12, 2008
JAnderson
July 12, 2008
I presume AAs are used often enough in D. If D 2.x strings are immutable, then their hash function doesn't change, so it can be useful to store the hash value inside the data structure of the string (Python works like this), by default it can be initialized to a value like -1, so it's computed only the first time it's actually needed, avoiding wasting time to compute the hash value if it's not used.

Having the hash value inside strings and other immutable data structures allows quite faster AAs (and sets too, even if they are inside an external library).
Currently strings aren't special structures, they are just immutable dynamical arrays of chars, so the has value (32 or 64 bit according to the word size of the CPU, I presume) can't be stored in the dynamic array record, so it may be stored in the -4/-8 bytes of the data block, but this can't work for string slices. A possible solution is to create a special string type that's just like the dynamic array struct, plus the hash value, so this struct becomes bigger (12 instead of 8 bytes on 32 bit CPUs), slowing down some operations a little.

Bye,
bearophile
July 12, 2008
"bearophile" <bearophileHUGS@lycos.com> wrote in message news:g5a8h5$gmk$1@digitalmars.com...
>I presume AAs are used often enough in D. If D 2.x strings are immutable, then their hash function doesn't change, so it can be useful to store the hash value inside the data structure of the string (Python works like this), by default it can be initialized to a value like -1, so it's computed only the first time it's actually needed, avoiding wasting time to compute the hash value if it's not used.
>
> Having the hash value inside strings and other immutable data structures
> allows quite faster AAs (and sets too, even if they are inside an external
> library).
> Currently strings aren't special structures, they are just immutable
> dynamical arrays of chars, so the has value (32 or 64 bit according to the
> word size of the CPU, I presume) can't be stored in the dynamic array
> record, so it may be stored in the -4/-8 bytes of the data block, but this
> can't work for string slices. A possible solution is to create a special
> string type that's just like the dynamic array struct, plus the hash
> value, so this struct becomes bigger (12 instead of 8 bytes on 32 bit
> CPUs), slowing down some operations a little.

Or, you store the precomputed hash in the structure (i.e. AA) that uses them.  Which is exactly what the default AA implementation does, actually.


July 12, 2008
Jarrett Billingsley:
> Or, you store the precomputed hash in the structure (i.e. AA) that uses them.  Which is exactly what the default AA implementation does, actually.

Having the hash value in the string has some advantages, if have two strings and both of them have their hash value is already computed, you have a way to test quickly if they are different. You can compute intersection and union between two string sets quickly. If the same string is in more AAs/sets you compute and store its hash value only once for its life.

Bye,
bearophile
July 12, 2008
On Sat, 12 Jul 2008 16:36:53 +0400, bearophile <bearophileHUGS@lycos.com> wrote:

> I presume AAs are used often enough in D. If D 2.x strings are immutable, then their hash function doesn't change, so it can be useful to store the hash value inside the data structure of the string (Python works like this), by default it can be initialized to a value like -1, so it's computed only the first time it's actually needed, avoiding wasting time to compute the hash value if it's not used.
>
> Having the hash value inside strings and other immutable data structures allows quite faster AAs (and sets too, even if they are inside an external library).
> Currently strings aren't special structures, they are just immutable dynamical arrays of chars, so the has value (32 or 64 bit according to the word size of the CPU, I presume) can't be stored in the dynamic array record, so it may be stored in the -4/-8 bytes of the data block, but this can't work for string slices. A possible solution is to create a special string type that's just like the dynamic array struct, plus the hash value, so this struct becomes bigger (12 instead of 8 bytes on 32 bit CPUs), slowing down some operations a little.
>
> Bye,
> bearophile

Why distiguish between strings and other data type? Just wrap any invariant data type into a special structure that lazily computes and stores hash, and use it in a AA instead.
July 12, 2008
bearophile wrote:
> I presume AAs are used often enough in D. If D 2.x strings are immutable, then their hash function doesn't change, so it can be useful to store the hash value inside the data structure of the string (Python works like this), by default it can be initialized to a value like -1, so it's computed only the first time it's actually needed, avoiding wasting time to compute the hash value if it's not used.
> 
> Having the hash value inside strings and other immutable data structures allows quite faster AAs (and sets too, even if they are inside an external library).
> Currently strings aren't special structures, they are just immutable dynamical arrays of chars, so the has value (32 or 64 bit according to the word size of the CPU, I presume) can't be stored in the dynamic array record, so it may be stored in the -4/-8 bytes of the data block, but this can't work for string slices. A possible solution is to create a special string type that's just like the dynamic array struct, plus the hash value, so this struct becomes bigger (12 instead of 8 bytes on 32 bit CPUs), slowing down some operations a little.
> 
> Bye,
> bearophile

Sounds like a good case for a standard lib function.  Potentially the hashing part could be a generic class for any type with a particular spealization for strings.  It could also be lazily computed (ie compute it the first time its needed then cache it).

Maybe something like:

alias hashed(string) hashed_string;
alias hashed(int[])  hashed_array;

I guess one downside is that compile time string hashes may have to be figured out at run-time with that approach.

-Joel