View mode: basic / threaded / horizontal-split · Log in · Help
July 12, 2008
Strings hash value memoized
I presume AAs are used often enough in D. If D 2.x strings are immutable, then their hash function doesn't change, so it can be useful to store the hash value inside the data structure of the string (Python works like this), by default it can be initialized to a value like -1, so it's computed only the first time it's actually needed, avoiding wasting time to compute the hash value if it's not used.

Having the hash value inside strings and other immutable data structures allows quite faster AAs (and sets too, even if they are inside an external library).
Currently strings aren't special structures, they are just immutable dynamical arrays of chars, so the has value (32 or 64 bit according to the word size of the CPU, I presume) can't be stored in the dynamic array record, so it may be stored in the -4/-8 bytes of the data block, but this can't work for string slices. A possible solution is to create a special string type that's just like the dynamic array struct, plus the hash value, so this struct becomes bigger (12 instead of 8 bytes on 32 bit CPUs), slowing down some operations a little.

Bye,
bearophile
July 12, 2008
Re: Strings hash value memoized
"bearophile" <bearophileHUGS@lycos.com> wrote in message 
news:g5a8h5$gmk$1@digitalmars.com...
>I presume AAs are used often enough in D. If D 2.x strings are immutable, 
>then their hash function doesn't change, so it can be useful to store the 
>hash value inside the data structure of the string (Python works like 
>this), by default it can be initialized to a value like -1, so it's 
>computed only the first time it's actually needed, avoiding wasting time to 
>compute the hash value if it's not used.
>
> Having the hash value inside strings and other immutable data structures 
> allows quite faster AAs (and sets too, even if they are inside an external 
> library).
> Currently strings aren't special structures, they are just immutable 
> dynamical arrays of chars, so the has value (32 or 64 bit according to the 
> word size of the CPU, I presume) can't be stored in the dynamic array 
> record, so it may be stored in the -4/-8 bytes of the data block, but this 
> can't work for string slices. A possible solution is to create a special 
> string type that's just like the dynamic array struct, plus the hash 
> value, so this struct becomes bigger (12 instead of 8 bytes on 32 bit 
> CPUs), slowing down some operations a little.

Or, you store the precomputed hash in the structure (i.e. AA) that uses 
them.  Which is exactly what the default AA implementation does, actually.
July 12, 2008
Re: Strings hash value memoized
Jarrett Billingsley:
> Or, you store the precomputed hash in the structure (i.e. AA) that uses 
> them.  Which is exactly what the default AA implementation does, actually.

Having the hash value in the string has some advantages, if have two strings and both of them have their hash value is already computed, you have a way to test quickly if they are different. You can compute intersection and union between two string sets quickly. If the same string is in more AAs/sets you compute and store its hash value only once for its life.

Bye,
bearophile
July 12, 2008
Re: Strings hash value memoized
On Sat, 12 Jul 2008 16:36:53 +0400, bearophile <bearophileHUGS@lycos.com>  
wrote:

> I presume AAs are used often enough in D. If D 2.x strings are  
> immutable, then their hash function doesn't change, so it can be useful  
> to store the hash value inside the data structure of the string (Python  
> works like this), by default it can be initialized to a value like -1,  
> so it's computed only the first time it's actually needed, avoiding  
> wasting time to compute the hash value if it's not used.
>
> Having the hash value inside strings and other immutable data structures  
> allows quite faster AAs (and sets too, even if they are inside an  
> external library).
> Currently strings aren't special structures, they are just immutable  
> dynamical arrays of chars, so the has value (32 or 64 bit according to  
> the word size of the CPU, I presume) can't be stored in the dynamic  
> array record, so it may be stored in the -4/-8 bytes of the data block,  
> but this can't work for string slices. A possible solution is to create  
> a special string type that's just like the dynamic array struct, plus  
> the hash value, so this struct becomes bigger (12 instead of 8 bytes on  
> 32 bit CPUs), slowing down some operations a little.
>
> Bye,
> bearophile

Why distiguish between strings and other data type? Just wrap any  
invariant data type into a special structure that lazily computes and  
stores hash, and use it in a AA instead.
July 12, 2008
Re: Strings hash value memoized
bearophile wrote:
> I presume AAs are used often enough in D. If D 2.x strings are immutable, then their hash function doesn't change, so it can be useful to store the hash value inside the data structure of the string (Python works like this), by default it can be initialized to a value like -1, so it's computed only the first time it's actually needed, avoiding wasting time to compute the hash value if it's not used.
> 
> Having the hash value inside strings and other immutable data structures allows quite faster AAs (and sets too, even if they are inside an external library).
> Currently strings aren't special structures, they are just immutable dynamical arrays of chars, so the has value (32 or 64 bit according to the word size of the CPU, I presume) can't be stored in the dynamic array record, so it may be stored in the -4/-8 bytes of the data block, but this can't work for string slices. A possible solution is to create a special string type that's just like the dynamic array struct, plus the hash value, so this struct becomes bigger (12 instead of 8 bytes on 32 bit CPUs), slowing down some operations a little.
> 
> Bye,
> bearophile

Sounds like a good case for a standard lib function.  Potentially the 
hashing part could be a generic class for any type with a particular 
spealization for strings.  It could also be lazily computed (ie compute 
it the first time its needed then cache it).

Maybe something like:

alias hashed(string) hashed_string;
alias hashed(int[])  hashed_array;

I guess one downside is that compile time string hashes may have to be 
figured out at run-time with that approach.

-Joel
Top | Discussion index | About this forum | D home