Thread overview
Fast hashtable: Andrei Alexandrescu, Feb 28, 2017
- bauss, Feb 28, 2017
- deadalnix, Feb 28, 2017
- H. S. Teoh, Feb 28, 2017
- Cecil Ward, Mar 01, 2017
- Daniel Kozak, Mar 01, 2017
- Cecil Ward, Mar 02, 2017
- deadalnix, Mar 01, 2017
- Cecil Ward, Mar 02, 2017
February 28, 2017 (Andrei Alexandrescu)
This is of possible interest: https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ -- Andrei
February 28, 2017 (bauss)
On Tuesday, 28 February 2017 at 17:57:14 UTC, Andrei Alexandrescu wrote:
> This is of possible interest: https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ -- Andrei

That's really interesting.
February 28, 2017 (deadalnix)
On Tuesday, 28 February 2017 at 17:57:14 UTC, Andrei Alexandrescu wrote:
> This is of possible interest: https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ -- Andrei

> But let’s say you know that your hash function returns numbers that are well distributed and that you’re rarely going to get hash collisions even if you use powers of two.

In which case you don't need powers of 2 either.

ucent h = hash64(key);              // 128-bit intermediate (ucent is reserved in D, not yet implemented)
ulong slot = (h * slotCount) >> 64; // high 64 bits of the product, always in [0, slotCount)

And you get what you want, with no constraint on the number of slots. Note that on most architectures, this actually lowers to one mulhi operation, which is typically 3 cycles.
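
For a concrete picture, here is a minimal runnable D sketch of this multiply-shift reduction. hash64 here is just a stand-in splitmix64-style mixer, and since ucent is reserved but not implemented, mulhi64 computes the high half of the 128-bit product from 32-bit limbs.

    // Stand-in 64-bit hash (splitmix64-style finalizer), assumed well mixed.
    ulong hash64(ulong key)
    {
        key ^= key >> 30; key *= 0xbf58476d1ce4e5b9;
        key ^= key >> 27; key *= 0x94d049bb133111eb;
        return key ^ (key >> 31);
    }

    // High 64 bits of the 128-bit product a * b, built from 32-bit limbs.
    ulong mulhi64(ulong a, ulong b)
    {
        ulong aLo = a & 0xFFFF_FFFF, aHi = a >> 32;
        ulong bLo = b & 0xFFFF_FFFF, bHi = b >> 32;
        ulong lo = aLo * bLo;
        ulong t  = aHi * bLo + (lo >> 32);
        ulong u  = aLo * bHi + (t & 0xFFFF_FFFF);
        return aHi * bHi + (t >> 32) + (u >> 32);
    }

    void main()
    {
        import std.stdio : writeln;
        ulong slotCount = 1000;                      // any count, no power-of-2 constraint
        ulong slot = mulhi64(hash64(42), slotCount); // lands in [0, slotCount)
        assert(slot < slotCount);
        writeln(slot);
    }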
February 28, 2017 (H. S. Teoh)
On Tue, Feb 28, 2017 at 12:57:14PM -0500, Andrei Alexandrescu via Digitalmars-d wrote:
> This is of possible interest: https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ --

Related to this, recently I found some interesting papers for large external (i.e., on-disk) hash tables:

	http://www.itu.dk/people/pagh/papers/linear.pdf
	http://www.weizhewei.com/papers/pods020-wei.pdf

AIUI, blocked probing (1st paper) is similar to Robin Hood hashing, in that inserting an entry may cause an existing entry to be moved out of its occupied slot to a different one, but blocked probing also has additional interesting characteristics:

- Scanning is done in blocks of size 2^i with starting slots of index
  2^i for incrementing i. This structure makes it cache-oblivious, and
  thus able to take advantage of the modern cache hierarchy without
  needing cache-specific tuning. (A rough sketch of this probe order
  follows the list.)

- Something special about the 2^i block sizes makes the math just work
  out (see 2nd paper), so that lookups have expected average 1 +
  1/2^Ω(b) I/O operations, where b is the block size (provided the load
  factor is bounded away from 1), which is a marked improvement over
  plain linear probing which has expected 1 + O(α/b) I/O operations,
  where α is the load factor.
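
To make that scanning pattern concrete, here is a rough D sketch of the probe order as described in the first point above. It is a hedged illustration of the doubling aligned blocks only; it does not reproduce the papers' insertion and displacement rules, and it assumes a power-of-two table size.

    // Probe order sketch: for i = 0, 1, 2, ... scan the 2^i-aligned block
    // of 2^i slots containing the home slot. Any cache line or disk block
    // of size 2^k coincides with a level-k block, hence cache-obliviousness.
    size_t[] blockedProbeOrder(size_t home, size_t tableSize)
    {
        size_t[] order = [home];                    // level 0: the home slot itself
        for (size_t width = 2; width <= tableSize; width *= 2)
        {
            size_t start = home & ~(width - 1);     // 2^i-aligned block containing home
            size_t prev  = home & ~(width / 2 - 1); // half already scanned at level i-1
            // scan only the half of the block not covered by the previous level
            size_t half = (prev == start) ? start + width / 2 : start;
            foreach (slot; half .. half + width / 2)
                order ~= slot;
        }
        return order;
    }

    void main()
    {
        import std.stdio : writeln;
        writeln(blockedProbeOrder(5, 8)); // [5, 4, 6, 7, 0, 1, 2, 3]
    }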

I didn't look closely enough at the analysis to know for sure, but since the analysis is cache-oblivious, it seems the 1 + 1/2^Ω(b) bound on I/O operations should generalize to cache misses as well (treating secondary storage as the lowest level of the memory hierarchy). So I'm expecting this might be even faster than the Robin Hood hashing in your linked blog.


T

-- 
Sometimes the best solution to morale problems is just to fire all of the unhappy people. -- despair.com
March 01, 2017 (Cecil Ward)
On Tuesday, 28 February 2017 at 21:00:05 UTC, H. S. Teoh wrote:
> On Tue, Feb 28, 2017 at 12:57:14PM -0500, Andrei Alexandrescu via Digitalmars-d wrote:
>> This is of possible interest: https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ --
>
> Related to this, recently I found some interesting papers for large external (i.e., on-disk) hash tables:
>
> 	http://www.itu.dk/people/pagh/papers/linear.pdf
> 	http://www.weizhewei.com/papers/pods020-wei.pdf
>
> AIUI, blocked probing (1st paper) is similar to Robin Hood hashing, in that inserting an entry may cause an existing entry to be moved out of its occupied slot to a different one, but blocked probing also has additional interesting characteristics:
>
> - Scanning is done in blocks of size 2^i with starting slots of index
>   2^i for incrementing i. This structure makes it cache-oblivious, and
>   thus able to take advantage of the modern cache hierarchy without
>   needing cache-specific tuning. (A rough sketch of this probe order
>   follows the list.)
>
> - Something special about the 2^i block sizes makes the math just work
>   out (see 2nd paper), so that lookups have expected average 1 +
>   1/2^Ω(b) I/O operations, where b is the block size (provided the load
>   factor is bounded away from 1), which is a marked improvement over
>   plain linear probing which has expected 1 + O(α/b) I/O operations,
>   where α is the load factor.
>
> I didn't look closely enough at the analysis to know for sure, but since the analysis is cache-oblivious, it seems the 1 + 1/2^Ω(b) bound on I/O operations should generalize to cache misses as well (treating secondary storage as the lowest level of the memory hierarchy). So I'm expecting this might be even faster than the Robin Hood hashing in your linked blog.
>
>
> T

I liked that article. I didn't really understand the point about the implementation of modulo primes; maybe I missed something. Given that our man is doing modulo by a 'known' value (he had a switch statement to get to them), why not do something rather cheaper than a compiler-expanded constant div/mod made up of multiplies and shifts? Something like:

    const uint power2 = 512; // say, some 1 << n anyway
    const uint prime = 509; // some prime just below the power, some prime > power2/2

    static assert( power2 - 1 - prime < prime );

    x = x & ( power2 - 1 );
    x = ( x >= prime ) ? x - prime : x;

This is good news on my x86 with GDC -O3 (only 3 operations: and, sub, cmovx); all is well provided you make sure that you are getting CMOVx and not branches. I could work out the power from the prime using CTFE given a bit of thought. Maybe CTFE could even do the reverse?

Have I finally gone mad?
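
As a hedged sketch of that CTFE idea (nextPow2 and reduce are hypothetical names, not anything from the thread), working the power out from the prime at compile time might look like:

    // Derive the enclosing power of two from the prime at compile time,
    // then do the masked conditional-subtract reduction from above.
    uint nextPow2(uint n) // hypothetical helper, evaluated via CTFE below
    {
        uint p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    enum uint prime  = 509;
    enum uint power2 = nextPow2(prime); // CTFE: 512

    // One conditional subtraction suffices only when the masked value can
    // exceed the prime by less than the prime, as the assert above checks.
    static assert(power2 - 1 - prime < prime);

    uint reduce(uint x)
    {
        x &= power2 - 1;                   // and: cheap "mod power2"
        return x >= prime ? x - prime : x; // cmp/sub/cmov, no division
    }

    unittest
    {
        assert(reduce(510) == 1 && reduce(508) == 508);
    }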
March 01, 2017 (Daniel Kozak)
On Wednesday, 1 March 2017 at 06:44:34 UTC, Cecil Ward wrote:
>
> I liked that article. I didn't really understand the point about the implementation of modulo primes; maybe I missed something. Given that our man is doing modulo by a 'known' value (he had a switch statement to get to them), why not do something rather cheaper than a compiler-expanded constant div/mod made up of multiplies and shifts? Something like:
>
>     const uint power2 = 512; // say, some 1 << n anyway
>     const uint prime = 509; // some prime just below the power, some prime > power2/2
>
>     static assert( power2 - 1 - prime < prime );
>
>     x = x & ( power2 - 1 );
>     x = ( x >= prime ) ? x - prime : x;
>
> This is good news on my x86 with GDC -O3 (only 3 operations: and, sub, cmovx); all is well provided you make sure that you are getting CMOVx and not branches. I could work out the power from the prime using CTFE given a bit of thought. Maybe CTFE could even do the reverse?
>
> Have I finally gone mad?

Yes :D, this is something the compiler should do.

btw: https://github.com/dlang/phobos/pull/1452


March 01, 2017 (deadalnix)
On Wednesday, 1 March 2017 at 06:44:34 UTC, Cecil Ward wrote:
>     const uint power2 = 512; // say, some 1 << n anyway
>     const uint prime = 509; // some prime just below the power, some prime > power2/2
>
>     static assert( power2 - 1 - prime < prime );
>
>     x = x & ( power2 - 1 );
>     x = ( x >= prime ) ? x - prime : x;
>
> This is good news on my x86 with GDC -O3 (only 3 operations: and, sub, cmovx); all is well provided you make sure that you are getting CMOVx and not branches. I could work out the power from the prime using CTFE given a bit of thought. Maybe CTFE could even do the reverse?
>
> Have I finally gone mad?

The lower slots will be twice as crowded as the higher ones.
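
A quick empirical check of that bias (a throwaway sketch, not deadalnix's): with power2 = 512 and prime = 509, the masked values 509..511 wrap onto slots 0..2, so those slots receive two inputs each.

    // Count where each of the 512 masked values lands under the
    // masked-then-subtract reduction from the post above.
    void main()
    {
        import std.stdio : writeln;
        enum uint power2 = 512, prime = 509;
        auto counts = new uint[prime];
        foreach (x; 0 .. power2)
        {
            uint r = x & (power2 - 1);
            counts[r >= prime ? r - prime : r]++;
        }
        writeln(counts[0 .. 4]); // [2, 2, 2, 1]: slots 0..2 are doubly loaded
    }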
March 02, 2017 (Cecil Ward)
On Wednesday, 1 March 2017 at 12:59:30 UTC, deadalnix wrote:
> On Wednesday, 1 March 2017 at 06:44:34 UTC, Cecil Ward wrote:
>>     const uint power2 = 512; // say, some 1 << n anyway
>>     const uint prime = 509; // some prime just below the power, some prime > power2/2
>>
>>     static assert( power2 - 1 - prime < prime );
>>
>>     x = x & ( power2 - 1 );
>>     x = ( x >= prime ) ? x - prime : x;
>>
>> This is good news on my x86 with GDC -O3 (only 3 operations: and, sub, cmovx); all is well provided you make sure that you are getting CMOVx and not branches. I could work out the power from the prime using CTFE given a bit of thought. Maybe CTFE could even do the reverse?
>>
>> Have I finally gone mad?
>
> The lower slots will be twice as crowded as the higher ones.

Sorry, I think I was unclear: I was suggesting the author should use modulo the prime. The power of two is irrelevant; it's just a quick(er?) way of computing the modulo. Are we on the same wavelength?
March 02, 2017 (Cecil Ward)
On Wednesday, 1 March 2017 at 09:45:23 UTC, Daniel Kozak wrote:
> On Wednesday, 1 March 2017 at 06:44:34 UTC, Cecil Ward wrote:
>>
>> I liked that article. I didn't really understand the point about the implementation of modulo primes; maybe I missed something. Given that our man is doing modulo by a 'known' value (he had a switch statement to get to them), why not do something rather cheaper than a compiler-expanded constant div/mod made up of multiplies and shifts? Something like:
>>
>>     const uint power2 = 512; // say, some 1 << n anyway
>>     const uint prime = 509; // some prime just below the power, some prime > power2/2
>>
>>     static assert( power2 - 1 - prime < prime );
>>
>>     x = x & ( power2 - 1 );
>>     x = ( x >= prime ) ? x - prime : x;
>>
>> This is good news on my x86 with GDC -O3 (only 3 operations: and, sub, cmovx); all is well provided you make sure that you are getting CMOVx and not branches. I could work out the power from the prime using CTFE given a bit of thought. Maybe CTFE could even do the reverse?
>>
>> Have I finally gone mad?
>
> Yes :D, this is something the compiler should do.
>
> btw: https://github.com/dlang/phobos/pull/1452

Does anyone know if the compilers could use this for code generation? I did later CTFE the prime from the power; one can do the reverse more easily, which is the way compilers would need to do it for division by a known x.
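
For what it's worth, a hedged sketch of that prime-from-power computation in CTFE (isPrime and largestPrimeBelow are hypothetical helpers, not Cecil's code):

    // Largest prime below a power of two, computed entirely at compile time.
    // (Compilers would go the other way: derive the power from a known prime.)
    bool isPrime(uint n)
    {
        if (n < 2) return false;
        for (uint d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    uint largestPrimeBelow(uint power2)
    {
        uint p = power2 - 1;
        while (!isPrime(p))
            --p;
        return p;
    }

    enum uint prime = largestPrimeBelow(512);
    static assert(prime == 509); // evaluated during compilation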