May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Jérôme M. Berger | On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger@free.fr> wrote:
> |
> | foreach(c; str)
> | {
> | ret = (ret << 4) + c;
> | }
> |
> That one is very bad because it only takes into account the last
> few characters of the string (how many depends on the size of the
> hash). However, several hashing algorithms use 2^n+1 as their
> multiplier, which is very fast even on old/small/embedded hardware
> because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
| |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kristian Kilpi | Kristian Kilpi Wrote:
> On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger@free.fr> wrote:
> > |
> > | foreach(c; str)
> > | {
> > | ret = (ret << 4) + c;
> > | }
> > |
> > That one is very bad because it only takes into account the last
> > few characters of the string (how many depends on the size of the
> > hash). However, several hashing algorithms use 2^n+1 as their
> > multiplier, which is very fast even on old/small/embedded hardware
> > because it can be implemented as a shift and an addition.
>
> You are of course right; thanks for the correction. Lets scrap that algorithm! :)
Would something like
foreach(i, c; str)
{
ret ^= cast(uint) c;
if (i & 1)
ret <<= 1;
}
do a better job. That could cover the first 64 chars reasonably well, and I suspect that could cover a lot of use cases.
| |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Steve Teale | == Quote from Steve Teale (steve.teale@britseyeview.com)'s article > Kristian Kilpi Wrote: > > On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger@free.fr> wrote: > > > | > > > | foreach(c; str) > > > | { > > > | ret = (ret << 4) + c; > > > | } > > > | > > > That one is very bad because it only takes into account the last > > > few characters of the string (how many depends on the size of the > > > hash). However, several hashing algorithms use 2^n+1 as their > > > multiplier, which is very fast even on old/small/embedded hardware > > > because it can be implemented as a shift and an addition. > > > > You are of course right; thanks for the correction. Lets scrap that algorithm! :) > Would something like > foreach(i, c; str) > { > ret ^= cast(uint) c; > if (i & 1) > ret <<= 1; > } > do a better job. That could cover the first 64 chars reasonably well, and I suspect that could cover a lot of use cases. I guess a lot of people missed my follow-up post on this. This is largely a typeinfo bug, not a string hashing bug. typeid(string) returns a TypeInfo reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo reference for which the hashing works well. What is needed is for typeid(typemodifier(char)[]) to return the same thing as typeid(char[]). Then again, I guess that, given how bad the generic array hashing is, it wouldn't hurt to improve that, too. If someone implements a decent meta-hashing function (i.e. take a bunch of hashes of elements of some object, like an array, a tuple, or a struct, and produce a single good hash out of these), there would be no shortage of use cases for it. | |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | dsimcha wrote: > == Quote from Steve Teale (steve.teale@britseyeview.com)'s article >> Kristian Kilpi Wrote: >>> On Sun, 03 May 2009 15:33:12 +0300, J�r�me M. Berger <jeberger@free.fr> >>> wrote: >>>> | >>>> | foreach(c; str) >>>> | { >>>> | ret = (ret << 4) + c; >>>> | } >>>> | >>>> That one is very bad because it only takes into account the last >>>> few characters of the string (how many depends on the size of the >>>> hash). However, several hashing algorithms use 2^n+1 as their >>>> multiplier, which is very fast even on old/small/embedded hardware >>>> because it can be implemented as a shift and an addition. >>> You are of course right; thanks for the correction. Lets scrap that >>> algorithm! :) >> Would something like >> foreach(i, c; str) >> { >> ret ^= cast(uint) c; >> if (i & 1) >> ret <<= 1; >> } >> do a better job. That could cover the first 64 chars reasonably well, and I > suspect that could cover a lot of use cases. > > I guess a lot of people missed my follow-up post on this. Yes. Such a minor oversight in a face-to-face conversation takes five seconds to fix. In a newsgroup, it becomes a long thread :o). > This is largely a > typeinfo bug, not a string hashing bug. typeid(string) returns a TypeInfo > reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo > reference for which the hashing works well. What is needed is for > typeid(typemodifier(char)[]) to return the same thing as typeid(char[]). > > Then again, I guess that, given how bad the generic array hashing is, it wouldn't > hurt to improve that, too. If someone implements a decent meta-hashing function > (i.e. take a bunch of hashes of elements of some object, like an array, a tuple, > or a struct, and produce a single good hash out of these), there would be no > shortage of use cases for it. I have exchanged email with Paul Hsieh about using his hash function in Phobos (azillionmonkeys.com) and he said his license is very permissive. I seem to recall I have introduced his hash function, with credit, in Phobos' old runtime but I am not sure whether Sean took that over. I suggest we do that in druntime for string, be they mutable or not. Andrei | |||
May 03, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to dsimcha | dsimcha Wrote:
> == Quote from Steve Teale (steve.teale@britseyeview.com)'s article
> > Kristian Kilpi Wrote:
> > > On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger@free.fr> wrote:
> > > > |
> > > > | foreach(c; str)
> > > > | {
> > > > | ret = (ret << 4) + c;
> > > > | }
> > > > |
> > > > That one is very bad because it only takes into account the last
> > > > few characters of the string (how many depends on the size of the
> > > > hash). However, several hashing algorithms use 2^n+1 as their
> > > > multiplier, which is very fast even on old/small/embedded hardware
> > > > because it can be implemented as a shift and an addition.
> > >
> > > You are of course right; thanks for the correction. Lets scrap that algorithm! :)
> > Would something like
> > foreach(i, c; str)
> > {
> > ret ^= cast(uint) c;
> > if (i & 1)
> > ret <<= 1;
> > }
> > do a better job. That could cover the first 64 chars reasonably well, and I
> suspect that could cover a lot of use cases.
>
> I guess a lot of people missed my follow-up post on this. This is largely a
> typeinfo bug, not a string hashing bug. typeid(string) returns a TypeInfo
> reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo
> reference for which the hashing works well. What is needed is for
> typeid(typemodifier(char)[]) to return the same thing as typeid(char[]).
>
> Then again, I guess that, given how bad the generic array hashing is, it wouldn't hurt to improve that, too. If someone implements a decent meta-hashing function (i.e. take a bunch of hashes of elements of some object, like an array, a tuple, or a struct, and produce a single good hash out of these), there would be no shortage of use cases for it.
Sorry, I missed that. The topic listing does get a bit fragmented at times.
Thanks Steve
| |||
May 04, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Frits van Bommel | On Sun, 03 May 2009 08:10:43 -0400, Frits van Bommel <fvbommel@remwovexcapss.nl> wrote:
> Jarrett Billingsley wrote:
>> Here's something bizarre: porting this example to Tango yields 99997
>> unique hashes, which is right in line with your estimate. But I think
>> it's bizarre since, well, shouldn't druntime use _the same typeinfo
>> code?_
>
> No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
Reading dsimcha's followup, it's probably because tango is D1 only, which doesn't use immutable(char)[]. (man, I gotta get back to working on Tango4D2)
-Steve
| |||
May 05, 2009 Re: Absolutely horrible default string hashing | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
>
> I have exchanged email with Paul Hsieh about using his hash function in Phobos (azillionmonkeys.com) and he said his license is very permissive. I seem to recall I have introduced his hash function, with credit, in Phobos' old runtime but I am not sure whether Sean took that over. I suggest we do that in druntime for string, be they mutable or not.
Unfortunately, that request is languishing in the "oops, I forgot" portion of my to-do list.
| |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply