Memory Efficient HashSet

Mar 08, 2016

Nordlöw

Mar 08, 2016

Ola Fosheim Grøstad

Mar 09, 2016

Mar 10, 2016

Mar 10, 2016

Mar 10, 2016

Mar 10, 2016

Mar 10, 2016

Mar 08, 2016

Mar 10, 2016

On Tuesday, 8 March 2016 at 08:12:04 UTC, Nordlöw wrote: > Has anybody put together a memory-efficient D-implementation of a HashSet > > Something like > > sparse_hash_set<> contained in > > https://github.com/sparsehash/sparsehash > > but in D. There is an implementation in: code.dlang.org/packages/emsi_containers But to be honest I got stuck trying to use it (copy constructor disabled), so I used this very minimal wrapper around associative array: private struct HashSet(E) { // TODO switch to proper implementation (not using AA) bool put( E el ) { if ( el in set ) return false; set[el] = set.length; return true; } size_t[E] set; } (I only needed put, since I used it to check whether we already encountered a value before in a lazy/non sorted implementation of uniq)

March 09, 2016

Re: Memory Efficient HashSet

Posted by Nordlöw
in reply to Ola Fosheim Grøstad

Permalink

Nordlöw

Posted in reply to Ola Fosheim Grøstad

Permalink

On Tuesday, 8 March 2016 at 12:25:36 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 8 March 2016 at 08:12:04 UTC, Nordlöw wrote:
>> sparse_hash_set<> contained in
>>
>> https://github.com/sparsehash/sparsehash
>
> It appears to be very slow?
> What do you need it for?

My knowledge database engine I'm building cannot afford the memory overhead of D's builtin associative arrays.

For instance the following program

void main(string[] args)
{
    alias T = uint;
    bool[T] x; // emulates a HashSet with significant memory overhead
    import std.range : iota;
    const n = 10_000_000;
    foreach (const i; iota(0, n))
    {
        x[i] = true; // indicate that it's stored
    }

    import std.stdio : writeln, readln;
    writeln("Press return: ");
    readln;
}

consumes 842.m MiB on my Ubuntu.

The consumption with Google's sparsehash unordered_set is about 50 MiB.

On Wednesday, 9 March 2016 at 22:31:50 UTC, Nordlöw wrote: [...] > foreach (const i; iota(0, n)) > { > x[i] = true; // indicate that it's stored > } > > import std.stdio : writeln, readln; > writeln("Press return: "); > readln; > } > > consumes 842.m MiB on my Ubuntu. > > The consumption with Google's sparsehash unordered_set is about 50 MiB. May be the discussion associated with this post, can give help??? https://forum.dlang.org/post/mailman.217.1457590242.26339.digitalmars-d-bugs@puremagic.com I tried your little program and played around. I tried x.rehash, but this resulted in even more memory consumption. 80 bytes for 4 bytes key and one bit storage seams a lot. With smaller n it comes down to app. 60 Bytes. With how many data sets at the end, you will have to deal?

On Wednesday, 9 March 2016 at 22:31:50 UTC, Nordlöw wrote: > The consumption with Google's sparsehash unordered_set is about 50 MiB. See also: http://smerity.com/articles/2015/google_sparsehash.html

On Wednesday, 9 March 2016 at 22:31:50 UTC, Nordlöw wrote: > My knowledge database engine I'm building cannot afford the memory overhead of D's builtin associative arrays. Sounds like a cool project! You could also look into using a trie.

On Thursday, 10 March 2016 at 10:15:10 UTC, Martin Tschierschke wrote: > With how many data sets at the end, you will have to deal? I need something close to the memory overhead of a sorted array of values. That is why I'm considering converting SparseHash's sparse_hash_set to D.

On Wednesday, 9 March 2016 at 22:31:50 UTC, Nordlöw wrote: > consumes 842.m MiB on my Ubuntu. Here's mine: https://bitbucket.org/infognition/robinhood/src (you just need one file rbhash.d to use in your apps) The following test takes ~130 MB and can take less with some tweaks in the settings: import std.stdio, rbhash; alias Set(T) = RHHash!(T, void); void main() { auto set = new Set!uint; foreach(i; 0..10_000_000) set.add(i); writeln(set.contains(444)); readln; } It's a GC-free Robin Hood hash table implementation with special treatment for void as value type, i.e. sets.

On 3/8/16 3:12 AM, Nordlöw wrote: > Has anybody put together a memory-efficient D-implementation of a HashSet > > Something like > > sparse_hash_set<> contained in > > https://github.com/sparsehash/sparsehash > > but in D. D needs a way to use allocators for hash nodes. In Dcollections, both the performance and memory usage I was able to optimize by using a specialized allocator that allocates in pages and then divides up the pages. -Steve

Forums