April 01, 2015
On Tue, 31 Mar 2015 23:45:28 +0200, Martin Nowak wrote:

> On 03/31/2015 11:12 PM, Martin Nowak wrote:
>> Anyone benchmarked that SipHash?
> 
> No way we use this for druntime. https://github.com/rurban/smhasher#readme

hay, can we switch to `donothing128`? it's speed is terrifying! ;-)

April 01, 2015
On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:

> Robin Hood sounds like a good idea, but it really isn't. Keep your load factor reasonable and distribute values evenly, then you don't need a LRU lookup.

Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds.
I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.


April 01, 2015
Am Tue, 31 Mar 2015 23:45:28 +0200
schrieb Martin Nowak <code+news.digitalmars@dawg.eu>:

> On 03/31/2015 11:12 PM, Martin Nowak wrote:
> > Anyone benchmarked that SipHash?
> 
> No way we use this for druntime. https://github.com/rurban/smhasher#readme
> 
> I think we should have a flexible Hash in std.container though, that should allow to use a customized hash function.

It's probably more a problem for vibe-d or other
server-like applications. Those should make sure to use DOS-safe hash
tables. For most applications there's no possibility for DOS attacks
using hash tables and we indeed shouldn't make these applications
slower.
April 02, 2015
On Wednesday, 1 April 2015 at 12:05:37 UTC, Johannes Pfau wrote:
> It's probably more a problem for vibe-d or other
> server-like applications. Those should make sure to use DOS-safe hash
> tables. For most applications there's no possibility for DOS attacks
> using hash tables and we indeed shouldn't make these applications
> slower.

The vulnerability presentation suggests perl solution (random hash seed) is good enough, it doesn't slow down anything. The seed can be left zero and initialized by an application as needed. One can also use a longer key and add more its bits every, say, 10 bytes of hashed data, not sure if it will make any difference.
April 03, 2015
On Wednesday, 1 April 2015 at 06:31:28 UTC, thedeemon wrote:
> On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:
>
>> Robin Hood sounds like a good idea, but it really isn't. Keep your load factor reasonable and distribute values evenly, then you don't need a LRU lookup.
>
> Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds.
> I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.

Now there is. I just changed the hashmap in my container library to use open addressing and quadratic probing. There's a benchmark program in the library for testing it in comparison to the standard associative array. I tested using 2.066 with optimisations on, and it seems to be about the same.

https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d

Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.
April 03, 2015
On Friday, 3 April 2015 at 16:12:00 UTC, w0rp wrote:
> On Wednesday, 1 April 2015 at 06:31:28 UTC, thedeemon wrote:
>> On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:
>>
>>> Robin Hood sounds like a good idea, but it really isn't. Keep your load factor reasonable and distribute values evenly, then you don't need a LRU lookup.
>>
>> Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds.
>> I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.
>
> Now there is. I just changed the hashmap in my container library to use open addressing and quadratic probing. There's a benchmark program in the library for testing it in comparison to the standard associative array. I tested using 2.066 with optimisations on, and it seems to be about the same.
>
> https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d
>
> Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.

Also, I changed the API slightly.

1. I removed setDefault for the standard associative array type. Now it's only there for my type. The module now only deals with the added library type.
2. I changed my range functions to match the names for range functions in 2.067, so they are now named byKey, byValue, and byKeyValue. I left out 'keys' and 'values', which I think are mistakes, when you can do byKey.array yourself, etc.
3. I renamed ItemRange to KeyValueRange. (My ranges for the hashmaps are all named, which is supposed to make it easier to build higher order ranges from them.)
4. I added a constructor for the hashmaps which lets you maybe avoid some allocations by providing a minimum size, where the hashmap will probably use a size larger than the one you provide.
5. I started using prime sizes for the hashmaps, because quadratic probing doesn't seem to work without primes.

I think that's the bigger changes at least.
April 03, 2015
On Friday, 3 April 2015 at 16:12:00 UTC, w0rp wrote:
>> Is there a D version of a hash table with open addressing and quadratic probing?

> Now there is. https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d

Great! I'll experiment with it and do some comparisons.
April 03, 2015
On 04/03/2015 06:11 PM, w0rp wrote:
> 
> Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.

You should use triangular numbers and power of 2 bucket sizes instead of quadratic numbers and prime sized buckets, because that guarantees full utilisation of the buckets and a minimal period of the numbers.

The necessary loop is really trivial.

https://github.com/D-Programming-Language/dmd/blob/f234c39a0e633fc9a0b5474fe2def76be9a373ef/src/root/stringtable.c#L162

If you replace

    i = (i + j) & (tabledim - 1);

with this

    i = (i + 1) & (tabledim - 1);

you get linear probing btw.
April 03, 2015
On 04/02/2015 02:10 PM, Kagamin wrote:
> 
> The vulnerability presentation suggests perl solution (random hash seed) is good enough, it doesn't slow down anything. The seed can be left zero and initialized by an application as needed. One can also use a longer key and add more its bits every, say, 10 bytes of hashed data, not sure if it will make any difference.

A global random hash seed would work, but it needs to be accessible for reproducing test cases (druntime DRT option or in core.runtime).
April 04, 2015
On Friday, 3 April 2015 at 20:35:40 UTC, Martin Nowak wrote:
> On 04/03/2015 06:11 PM, w0rp wrote:
>> 
>> Now, I am not the most fantastic Computer Science guy in the world, and
>> I probably got a few things wrong. If anyone would like to look at my
>> code and point out mistakes, please do. I will add any improvements
>> suggested.
>
> You should use triangular numbers and power of 2 bucket sizes instead of
> quadratic numbers and prime sized buckets, because that guarantees full
> utilisation of the buckets and a minimal period of the numbers.
>
> The necessary loop is really trivial.
>
> https://github.com/D-Programming-Language/dmd/blob/f234c39a0e633fc9a0b5474fe2def76be9a373ef/src/root/stringtable.c#L162
>
> If you replace
>
>     i = (i + j) & (tabledim - 1);
>
> with this
>
>     i = (i + 1) & (tabledim - 1);
>
> you get linear probing btw.

I have pushed again, and now the hashmap uses powers of two and triangular numbers, per your suggestion. I noticed a small improvement in speed, probably coming from the & for modulo trick. If you would like to test the maps with the benchmark program you can run the following:

dub run -q --build=release --config=benchmark_hashmap

You can use the --compiler switch for dub to try different compilers.