August 09, 2012
Am Thu, 09 Aug 2012 11:32:34 +0200
schrieb "Vladimir Panteleev" <vladimir@thecybershadow.net>:

> On Tuesday, 7 August 2012 at 17:39:50 UTC, Dmitry Olshansky wrote:
> > std.hash.hash is a new module for Phobos defining a uniform interface for hashes and checksums. It also provides some useful helper functions to deal with this new API.
> 
> Is it too late to ask to include MurmurHash 2 and/or 3? It's public domain, and great for things like hash tables.
> 
> You can steal some code from here: https://github.com/CyberShadow/ae/blob/master/utils/digest.d https://github.com/CyberShadow/ae/blob/master/utils/digest_murmurhash3.d
> 

To be honest, I didn't even know that MurmurHash could be used
incrementally. I could port that code soon, but I think it's
best to do it after the review.
After we have formalized a common API, adding new hashes probably won't
require a full review. It should be possible to do that as a pull
request.
August 09, 2012
On Wednesday, 8 August 2012 at 19:27:54 UTC, Walter Bright wrote:
> The idea is to have hash act like a component - not with special added code the user has to write.

Sorry, but I think this is a meaningless statement without specifying what kind of interface the component should adhere to. In my opinion, the proposed std.hash design would be a perfectly valid interface for »accumulate stuff and at some point get a result«-type components.

> In this case, it needs to work like a reduce algorithm, because it is a reduce algorithm. Need to find a way to make this work.

Hash functions are _not_ analogous to reduce(), because the operation performed by reduce() is stateless, whereas hash functions generally have some internal state.

»Continuing« a reduce() operation by repeatedly calling it with the last partial result as the starting value is only possible because there is no additional state to carry over. To make this work with hashes, you'd have to return something encapsulating the internal state from your hash function. But then, you again need to obtain the actual result from that return value somehow, which defeats the original intent of making it work like reduce – and incidentally is what finish() does.
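
To illustrate: resuming a reduce is trivial precisely because the partial result *is* the whole state. A quick sketch using plain summation (nothing hash-specific here):

import std.algorithm : reduce;

void main()
{
    // The seed of the second call is just the result of the first;
    // no hidden state needs to be carried over.
    auto part = reduce!((a, b) => a + b)(0, [1, 2, 3]);    // 6
    auto full = reduce!((a, b) => a + b)(part, [4, 5, 6]);
    assert(full == 21);
}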

David
August 09, 2012
On Thu, 09 Aug 2012 10:59:47 +0100, David Nadlinger <see@klickverbot.at> wrote:

> On Wednesday, 8 August 2012 at 19:27:54 UTC, Walter Bright wrote:
>> The idea is to have hash act like a component - not with special added code the user has to write.
>
> Sorry, but I think this is a meaningless statement without specifying what kind of interface the component should adhere to. In my opinion, the proposed std.hash design would be a perfectly valid interface for »accumulate stuff and at some point get a result«-type components.
>
>> In this case, it needs to work like a reduce algorithm, because it is a reduce algorithm. Need to find a way to make this work.
>
> Hash functions are _not_ analogous to reduce(), because the operation performed by reduce() is stateless, whereas hash functions generally have some internal state.
>
> »Continuing« a reduce() operation by repeatedly calling it with the last partial result as the starting value is only possible because there is no additional state to carry over. To make this work with hashes, you'd have to return something encapsulating the internal state from your hash function.

This isn't necessarily a problem.

> But then, you again need to obtain the actual result from that return value somehow, which defeats the original intent of making it work like reduce – and incidentally is what finish() does.

But, this is a problem.  finish() in most cases pads the remaining data out to the algorithm's block size, then completes one more iteration of the algorithm to produce the final result.

So, as David has just said, you can have one or the other.  Either you can chain hashreduce operations together but have to perform a manual finish step to get the actual result, or you cannot chain hashreduce operations together and finish is done automatically when the input range is consumed.

Wild thought... and I have to admit I've not followed the proposed or suggested API closely, nor have I used ranges extensively, so this may not be possible...

If the range/hash object stores the current state and returns it as the result of hashreduce, it would be chainable.  If it also had a "Digest" property/method which performed finish on a /temporary copy/ of the state, it would be almost as automatic as reduce.  There would still be a manual step to get the result, but it would be analogous to calling toString on any range object to output its "value".  The Digest property/method would not modify the internal state, and could be called at any time between (not sure there is a point to this) or after chained hashreduce operations.
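
Rough sketch of what I mean, assuming some hypothetical MD5-like Context type with put/finish (illustrative names only, not the proposed API):

struct ChainableHash(Context)
{
    private Context ctx;

    // Feeding data returns the object itself, so calls chain.
    ref ChainableHash put(const(ubyte)[] chunk)
    {
        ctx.put(chunk);
        return this;
    }

    // Finish a /temporary copy/; the live state is untouched,
    // so hashing can simply continue afterwards.
    auto digest()
    {
        auto tmp = ctx;
        return tmp.finish();
    }
}

Usage would be something like: h.put(a).put(b); auto partial = h.digest(); h.put(c); auto whole = h.digest();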

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
August 09, 2012
On Thu, 09 Aug 2012 10:58:10 +0100, Johannes Pfau <nospam@example.com> wrote:

> Am Thu, 09 Aug 2012 11:32:34 +0200
> schrieb "Vladimir Panteleev" <vladimir@thecybershadow.net>:
>
>> On Tuesday, 7 August 2012 at 17:39:50 UTC, Dmitry Olshansky wrote:
>> > std.hash.hash is a new module for Phobos defining a uniform
>> > interface for hashes and checksums. It also provides some
>> > useful helper functions to deal with this new API.
>>
>> Is it too late to ask to include MurmurHash 2 and/or 3? It's
>> public domain, and great for things like hash tables.
>>
>> You can steal some code from here:
>> https://github.com/CyberShadow/ae/blob/master/utils/digest.d
>> https://github.com/CyberShadow/ae/blob/master/utils/digest_murmurhash3.d
>>
>
> To be honest, I didn't even know that MurmurHash could be used
> incrementally. I could port that code soon, but I think it's
> best to do it after the review.
> After we have formalized a common API, adding new hashes probably won't
> require a full review. It should be possible to do that as a pull
> request.

Once the API is formalised I can also contribute the hashes I have :)

R


-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
August 09, 2012
On 8/9/2012 2:59 AM, David Nadlinger wrote:
> On Wednesday, 8 August 2012 at 19:27:54 UTC, Walter Bright wrote:
>> The idea is to have hash act like a component - not with special added code
>> the user has to write.
>
> Sorry, but I think this is a meaningless statement without specifying what kind
> of interface the component should adhere to. In my opinion, the proposed
> std.hash design would be a perfectly valid interface for »accumulate stuff and
> at some point get a result«-type components.

It is not a meaningless statement in that components have a predictable set of methods and properties. That's all a range is. Requiring extra methods means there's either an error in the component interface design or an error in the component instance design.

What I'm trying to get away from is the C library style, where every library lives in its own world, and when the user tries to connect them he's got a fair amount of work to do building scaffolding between them.

With component programming, the interfaces between disparate things are standardized. A component does not get unique methods for different instances: one with a finish() method, another with a getResult() method, and a third with no such method at all. That is the situation I wish to avoid.


>> In this case, it needs to work like a reduce algorithm, because it is a reduce
>> algorithm. Need to find a way to make this work.
>
> Hash functions are _not_ analogous to reduce(), because the operation performed
> by reduce() is stateless, whereas hash functions generally have some internal
> state.
>
> »Continuing« a reduce() operation by repeatedly calling it with the last partial
> result as the starting value is only possible because there is no additional
> state to carry over. To make this work with hashes, you'd have to return
> something encapsulating the internal state from your hash function.

Wouldn't that be simply the handle to the hash?

> But then,
> you again need to obtain the actual result from that return value
> somehow, which defeats the original intent of making it work like reduce
> – and incidentally is what finish() does.

I understand what finish() does. The interesting part is trying to figure out a way to avoid needing that method. Or perhaps the reduce component design is incorrect.

August 09, 2012
On 8/9/2012 2:48 AM, Johannes Pfau wrote:
> Am Wed, 08 Aug 2012 12:31:29 -0700
> schrieb Walter Bright <newshound2@digitalmars.com>:
>
>> On 8/8/2012 12:14 PM, Martin Nowak wrote:
>>> That hardly works for event based programming without using
>>> coroutines. It's the classical inversion-of-control dilemma of
>>> event based programming that forces you to save/restore your state
>>> with every event.
>>
>> See the discussion on using reduce().
>>
>
> I just don't understand it. Let's take the example by Martin Nowak and
> port it to reduce: (The code added as comments is the same code for
> hashes, working with the current API)
>
> int state; //Hash state;
>
> void onData(ubyte[] data)
> {
>       state = reduce!((a, b) => a + b)(state, data); //copy(data, state);
>       //state = copy(data, state); //also valid, but not necessary
>       //state.put(data); //simple way, doesn't work for ranges
> }
>
> void main()
> {
>       state = 0; //state.start();
>       auto stream = new EventTcpStream("localhost", 80);
>       stream.onData = &onData;
>       //auto result = hash.finish();
> }
>
> There are only 2 differences:
>
> 1:
> the order of the arguments passed to copy and reduce is swapped. This
> kinda makes sense (if copy is interpreted as copyTo). Solution: Provide
> a method copyInto with swapped arguments if consistency is really so
> important.

Consistency is important so that disparate components can fit together.

>
> 2:
> We need an additional call to finish. I can't say it often enough: I
> don't see a sane way to avoid it. Hashes work on blocks; if you didn't
> pass enough data, finish will have to pad the rest of the block
> before you can get the hash value. This operation can't be
> undone. To get a valid result with every call to copy, you'd have to
> always call finish. This is
> * inefficient, you calculate intermediate values you don't need at all
> * you have to copy the hash's state, as you can't continue hashing
>    after finish has been called
>
> and both the state and the result would have to fit into the one value
> (called seed for reduce). But then it's still not 100% consistent, as
> reduce returns a single value, not some struct including internal
> state.

That isn't a problem: the internal state can be private data for the struct, and the "finish value" can be the result of overloading operator() on that struct. I'm not sure if that would work, but it's worth investigating.
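
Something along these lines, with D's opCall playing operator() (a sketch only, assuming a hypothetical MD5-like Context type; essentially the finish-on-a-copy idea from earlier in the thread):

struct HashSeed(Context)
{
    private Context ctx;   // internal state is private data

    void put(const(ubyte)[] chunk) { ctx.put(chunk); }

    // The "finish value" falls out of operator():
    auto opCall()
    {
        auto tmp = ctx;    // finish a copy, so accumulation can continue
        return tmp.finish();
    }
}

// HashSeed!MD5 seed;
// seed.put(data);
// auto digest = seed();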


August 09, 2012
If a hash is a range, it's an output range, because it's something you feed data to. Output ranges have only one method: put. Johannes used this method, but it's not sufficient: you need something to start and to finish the hash.

To bring consistency to the library, we should not remove these start and finish methods. We should make all output ranges in the library use the same functions.
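
For example, a hash under the proposed API already composes with std.algorithm.copy the way any output range does (assuming its put() accepts single bytes as well as slices; MD5 here stands for the proposal's context type):

import std.algorithm : copy;
// import std.hash.md5 : MD5;   // the module under review

void main()
{
    MD5 ctx;
    ctx.start();
    const(ubyte)[] data = [1, 2, 3];
    copy(data, &ctx);            // works purely through put()
    auto result = ctx.finish();  // the extra step under discussion
}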

In the library, we have really few output ranges. We have writable input ranges and we have Appender. Is there more? There should be files, sockets, maybe even signals, but IIRC these don't implement the output range interface at the moment. What did I miss?

Appender doesn't use a finish method, but we have to 'get the result' of the appender, and for this we use appender.data. This name is not appropriate for generically getting a result or terminating an output range.
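
Compare getting a result out of Appender, which is non-destructive:

import std.array : appender;

void main()
{
    auto app = appender!string();
    app.put("hello");
    auto partial = app.data;   // just reads the result so far
    app.put(" world");         // appending simply continues
    assert(app.data == "hello world");
}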

So we need a name that fits most output range uses. start/finish doesn't sound bad. open/close fits files and sockets, but maybe not all output ranges. Relying solely on constructors, opCall or alias this seems dangerous to me.

-- 
Christophe
August 09, 2012
On 8/9/12 5:05 AM, Johannes Pfau wrote:
> Well that's possible, but I don't like the template bloat it causes.

What have you measured, and what is your dislike based upon?

The library function must be generic. Then users worried about bloating may use it with a limited number of types.

A digest function dealing only with void[][] is unacceptable.


Andrei

August 09, 2012
Am Thu, 09 Aug 2012 08:48:37 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:

> On 8/9/12 5:05 AM, Johannes Pfau wrote:
> > Well that's possible, but I don't like the template bloat it causes.
> 
> What have you measured, and what is your dislike based upon?

What annoys me is that as long as the function only supported arrays, it didn't need templates _at all_, so the template bloat for arrays was zero. But adding range support means the version dealing with arrays now has to be a template as well (which is probably a bug; you can't overload a template with a non-template function) and will produce extra code for every array type. I just think adding range support shouldn't cause the array code to change in any way.
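
A toy illustration of the point (toyDigest stands in for the real digest function):

import std.range : isInputRange;

// Once ranges are accepted, even the plain-array call goes through
// the template, one instance per array type:
uint toyDigest(Range)(Range data)
    if (isInputRange!Range)
{
    uint sum;
    foreach (ubyte b; data)
        sum = sum * 31 + b;
    return sum;
}

// ubyte[] arr = [1, 2, 3];
// auto h = toyDigest(arr);   // instantiates toyDigest!(ubyte[])
//
// Before range support, this could have been a single non-template
// function taking const(void)[], with zero per-array-type code.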

But that's why I said I don't like it, not that it's a show stopper. The overhead is probably negligible, but in theory it shouldn't be there at all.

> 
> The library function must be generic. Then users worried about bloating may use it with a limited number of types.
> 
> A digest function dealing only with void[][] is unacceptable.

Sure, I agree that range support is necessary; I just forgot to implement it initially. I'm not against range support, ranges in general, or the template instances needed for ranges. I just dislike that it affects the array implementation in this specific case.
August 09, 2012
Am Thu, 09 Aug 2012 11:16:36 +0100
schrieb "Regan Heath" <regan@netmail.co.nz>:

> 
> Once the API is formalised I can also contribute the hashes I have :)
> 
Great! With all those contributions we'll probably have a rather complete set of digests soon.