November 11, 2012
Re: Performance of hashes and associative arrays
On 11/11/2012 01:11 AM, Joseph Rushton Wakeling wrote:

> So where classes are concerned there's a definite need to
> override toHash if you want the _values_ to be the basis of the
> association, rather than the object itself ... this would also have
> potential implications for memory blow-up where a class is used as the
> key type, no?

I am not sure that I understand.

The associative array would still contain just class variables, not 
class objects. Every time the hash needed to be calculated, the AA would
call toHash() on the variable, which in turn would use the values of the 
members of the object.

Put another way, a single class object is owned by the runtime, but many
class variables referring to it can exist in multiple AAs.
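
For concreteness, here is a minimal sketch of what I mean. The class and
its hash function are made up for illustration; both toHash() and
opEquals() look only at the member values, so two distinct objects with
equal members act as the same key:

import std.stdio;

class Point
{
    int x, y;

    this(int x, int y) { this.x = x; this.y = y; }

    // hash is computed from the member values, not the identity
    override size_t toHash() @safe nothrow
    {
        return cast(size_t)(x * 31 + y);
    }

    // equality must agree with toHash for AA lookups to work
    override bool opEquals(Object o)
    {
        auto rhs = cast(Point) o;
        return rhs !is null && x == rhs.x && y == rhs.y;
    }
}

void main()
{
    string[Point] names;
    names[new Point(1, 2)] = "first";

    // a different object with the same member values finds the entry
    writeln(names[new Point(1, 2)]);    // prints "first"
}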

> The more pressing reason for avoiding associative arrays as it happens
> was simply that I was getting unacceptably large memory usage as a
> result: I'd got some data stored in the form of byte[size_t] (really
> this should have been a set of size_t's, but this seemed theoretically
an acceptable stopgap while there's no set class in Phobos),

I think std.container.RedBlackTree can take that responsibility, but if
you don't need the elements to be in any particular order, it may be
seen as overkill as well.
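
For example, a quick sketch of using it as a set (untested):

import std.container;
import std.stdio;

void main()
{
    // RedBlackTree used purely as a set of size_t
    auto set = redBlackTree!size_t(1, 5, 42);
    set.insert(7);

    if (42 in set)              // membership test, O(log n)
        writeln("42 is present");

    writeln(set[]);             // sorted: [1, 5, 7, 42]
}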

> and for
> some reason that was generating memory consumption twice that of having
> instead a size_t[] regular array. I guess perhaps because although a
> byte should mean just that, in practice each value in the byte[size_t]
> associative array was taking up a whole word?

I don't know the actual implementation of AAs by dmd, but the hash 
buckets are probably singly-linked lists. Even if your data had no 
collisions, each byte would have to live in a singly-linked list.
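
Just as an illustration, a hypothetical entry layout like the following
would explain the blow-up; this is not dmd's actual layout, only a guess
at the general shape:

// NOT dmd's actual layout -- just a plausible shape for one
// chained-bucket entry
struct Entry
{
    Entry* next;    // chain link within the bucket
    size_t hash;    // cached hash of the key
    size_t key;     // the size_t key itself
    byte   value;   // the 1-byte payload
}
// Entry.sizeof is 32 on a 64-bit target (25 bytes of fields rounded
// up to 8-byte alignment), versus just 8 bytes per element in a
// size_t[].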

> It's a shame, because functionally it was useful to be able to do things
> of the form if(i in list), but memory-wise it just didn't work for the
> size of data I'm dealing with.

If all you need is 'if(i in list)', then a HashSet may help. I haven't
used one myself, but such data structures apparently exist in other
languages like Java and C#.
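
If no ready-made one is available, here is a minimal hand-rolled sketch
of the idea (linear probing, one word per slot, so none of the
bucket-node overhead discussed above; the names and details are mine and
untested):

import std.stdio;

// Open-addressing set of size_t. 0 serves as the "empty slot"
// sentinel, so the key 0 is tracked by a separate flag.
struct SizeTSet
{
    private size_t[] slots;   // length is always a power of two
    private size_t used;
    private bool hasZero;

    void insert(size_t key)
    {
        if (key == 0) { hasZero = true; return; }
        // keep the load factor at or below 1/2
        if (slots.length == 0 || (used + 1) * 2 > slots.length)
            grow();
        size_t i = key & (slots.length - 1);
        while (slots[i] != 0 && slots[i] != key)
            i = (i + 1) & (slots.length - 1);
        if (slots[i] == 0) { slots[i] = key; ++used; }
    }

    bool contains(size_t key) const
    {
        if (key == 0) return hasZero;
        if (slots.length == 0) return false;
        size_t i = key & (slots.length - 1);
        while (slots[i] != 0)
        {
            if (slots[i] == key) return true;
            i = (i + 1) & (slots.length - 1);
        }
        return false;
    }

    private void grow()
    {
        auto old = slots;
        slots = new size_t[old.length ? old.length * 2 : 16];
        used = 0;
        foreach (k; old)
            if (k != 0) insert(k);   // cannot trigger another grow
    }
}

void main()
{
    SizeTSet set;
    foreach (i; 0 .. 1000)
        set.insert(i * 3);
    writeln(set.contains(42));   // true
    writeln(set.contains(43));   // false
}

Each element costs exactly one word plus the empty slots (at most half
the table), which should stay well below the AA's per-entry overhead.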

Ali
November 11, 2012
Re: Performance of hashes and associative arrays
On 11/11/2012 07:40 PM, Ali Çehreli wrote:
> I think std.container.RedBlackTree can take that responsibility, but if you
> don't need the elements to be in any particular order, it may be seen as
> overkill as well.

I know, and I probably should give that a go, but as I've got working code right 
now I'm moving on to something else and will come back to it later.

> If all you need is 'if(i in list)', then a HashSet may help. I haven't used one
> myself but apparently there are such data structures in other languages like
> Java and C#.

Indeed, and I was wondering whether it might be worth writing an implementation 
for std.container.  As it stands I don't really feel familiar enough with the 
underlying concepts, but I may revisit that idea once I'm done with a few other 
things.
November 11, 2012
Re: Performance of hashes and associative arrays
Ali Çehreli wrote:

> but the hash buckets are probably singly-linked lists

They are unbalanced binary search trees, which indeed degenerate into
singly-linked lists on sorted input.

-manfred
November 17, 2012
Re: Performance of hashes and associative arrays
On 11/11/2012 12:38, Manfred Nowak wrote:
> Jakse wrote:
>
>> It would also be interesting to have ideas for the general
>> case
>
> Yes, ideas might be interesting. :-)
>
> A root of "good" hashing is incorporated in algorithm2 of your
> posting:
>
> spread the values produced by the hash function uniformly over
> the interval  size_t.min .. size_t.max.
>
> Therefore:
>
> Compute the hash value for a given key by multiplying each byte
> of the key by a factor that increases from the byte with the
> lowest significance to the byte with the highest significance;
> then add the results.
>
> Remarks:
> a)
> Because of the general case: assume that the original key
> contains much redundancy. Therefore do not use the bytes of the
> original key but compute an (at least close to) lossless
> compression of the original key.
>
> b)
> The factors in the factor list should be primes, for spreading
> the hash value uniformly over the intended range.
>
> c)
> The quotient of two consecutive primes in the factor list should
> be greater than 256, because the key used might not contain any
> redundancy.
>
> -manfred
>

Thanks for this interesting answer.
Is compressing for performance reasons? Is it more interesting to
compress and then hash than to just hash?
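
To check my understanding of the hashing part, here is a rough sketch of
how I read the scheme. I have simplified the factor list to successive
powers of a single prime base greater than 256, which is my own
shortcut; the arithmetic simply wraps modulo 2^64:

// each byte is multiplied by a successively larger factor and the
// products are summed; here factor i is 257^i
size_t factorHash(const(ubyte)[] key)
{
    enum size_t base = 257;     // smallest prime greater than 256
    size_t result = 0;
    size_t factor = 1;
    foreach (b; key)
    {
        result += b * factor;
        factor *= base;
    }
    return result;
}

Is that what you had in mind?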
November 17, 2012
Re: Performance of hashes and associative arrays
Raphaël Jakse wrote:

> Is compressing for performance reasons?
Hopefully. Because in the general case the distribution of the 
keys is unknown, no function used for computing the hash value 
can be guaranteed to spread the hash values uniformly 
over the hash interval.

Compressing would have a negative effect on performance if
compressing and decompressing took more time than would otherwise
be lost resolving conflicting hash values.

> Is it more interesting to compress and then hash than to just
> hash?
The only purpose of compressing is to create the special case of
close to no redundancy in the keys used for hashing. If this
special case is already confirmed, or if it is known that no
compression will diminish the computing time: just hash.

-manfred