Thread overview
Memory Efficient HashSet
Mar 08, 2016
Nordlöw
Mar 09, 2016
Nordlöw
Mar 10, 2016
Nordlöw
Mar 10, 2016
Nordlöw
Mar 10, 2016
thedeemon
Mar 08, 2016
Edwin van Leeuwen
March 08, 2016
Has anybody put together a memory-efficient D-implementation of a HashSet

Something like

sparse_hash_set<> contained in

https://github.com/sparsehash/sparsehash

but in D.
March 08, 2016
On Tuesday, 8 March 2016 at 08:12:04 UTC, Nordlöw wrote:
> sparse_hash_set<> contained in
>
> https://github.com/sparsehash/sparsehash

It appears to be very slow?
What do you need it for?

March 08, 2016
On Tuesday, 8 March 2016 at 08:12:04 UTC, Nordlöw wrote:
> Has anybody put together a memory-efficient D-implementation of a HashSet
>
> Something like
>
> sparse_hash_set<> contained in
>
> https://github.com/sparsehash/sparsehash
>
> but in D.

There is an implementation in:
code.dlang.org/packages/emsi_containers

But to be honest I got stuck trying to use it (copy constructor disabled), so I used this very minimal wrapper around associative array:

private struct HashSet(E) {
    // TODO switch to proper implementation (not using AA)
    bool put( E el )
    {
        if ( el in set )
            return false;
        set[el] = set.length;
        return true;
    }
    size_t[E] set;
}

(I only needed put, since I used it to check whether we already encountered a value before in a lazy/non sorted implementation of uniq)
March 09, 2016
On Tuesday, 8 March 2016 at 12:25:36 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 8 March 2016 at 08:12:04 UTC, Nordlöw wrote:
>> sparse_hash_set<> contained in
>>
>> https://github.com/sparsehash/sparsehash
>
> It appears to be very slow?
> What do you need it for?

My knowledge database engine I'm building cannot afford the memory overhead of D's builtin associative arrays.

For instance the following program

void main(string[] args)
{
    alias T = uint;
    bool[T] x; // emulates a HashSet with significant memory overhead
    import std.range : iota;
    const n = 10_000_000;
    foreach (const i; iota(0, n))
    {
        x[i] = true; // indicate that it's stored
    }

    import std.stdio : writeln, readln;
    writeln("Press return: ");
    readln;
}

consumes 842.m MiB on my Ubuntu.

The consumption with Google's sparsehash unordered_set is about 50 MiB.
March 10, 2016
On Wednesday, 9 March 2016 at 22:31:50 UTC, Nordlöw wrote:
[...]
>     foreach (const i; iota(0, n))
>     {
>         x[i] = true; // indicate that it's stored
>     }
>
>     import std.stdio : writeln, readln;
>     writeln("Press return: ");
>     readln;
> }
>
> consumes 842.m MiB on my Ubuntu.
>
> The consumption with Google's sparsehash unordered_set is about 50 MiB.
May be the discussion associated with this post, can give help???
https://forum.dlang.org/post/mailman.217.1457590242.26339.digitalmars-d-bugs@puremagic.com


I tried your little program and played around. I tried x.rehash, but this resulted in even more memory consumption. 80 bytes for 4 bytes key and one bit storage seams a lot.
With smaller n it comes down to app. 60 Bytes.

With how many data sets at the end, you will have to deal?
March 10, 2016
On Wednesday, 9 March 2016 at 22:31:50 UTC, Nordlöw wrote:
> The consumption with Google's sparsehash unordered_set is about 50 MiB.

See also: http://smerity.com/articles/2015/google_sparsehash.html
March 10, 2016
On Wednesday, 9 March 2016 at 22:31:50 UTC, Nordlöw wrote:
> My knowledge database engine I'm building cannot afford the memory overhead of D's builtin associative arrays.

Sounds like a cool project! You could also look into using a trie.

March 10, 2016
On Thursday, 10 March 2016 at 10:15:10 UTC, Martin Tschierschke wrote:
> With how many data sets at the end, you will have to deal?

I need something close to the memory overhead of a sorted array of values. That is why I'm considering converting SparseHash's sparse_hash_set to D.
March 10, 2016
On Wednesday, 9 March 2016 at 22:31:50 UTC, Nordlöw wrote:
> consumes 842.m MiB on my Ubuntu.

Here's mine:
https://bitbucket.org/infognition/robinhood/src
(you just need one file rbhash.d to use in your apps)

The following test takes ~130 MB and can take less with some tweaks in the settings:

import std.stdio, rbhash;

alias Set(T) = RHHash!(T, void);

void main() {
    auto set = new Set!uint;
    foreach(i; 0..10_000_000)
        set.add(i);

    writeln(set.contains(444));
    readln;
}

It's a GC-free Robin Hood hash table implementation with special treatment for void as value type, i.e. sets.
March 10, 2016
On 3/8/16 3:12 AM, Nordlöw wrote:
> Has anybody put together a memory-efficient D-implementation of a HashSet
>
> Something like
>
> sparse_hash_set<> contained in
>
> https://github.com/sparsehash/sparsehash
>
> but in D.

D needs a way to use allocators for hash nodes.

In Dcollections, both the performance and memory usage I was able to optimize by using a specialized allocator that allocates in pages and then divides up the pages.

-Steve