The review of std.hash package (page 2) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » The review of std.hash package (page 2)

August 07, 2012

Re: The review of std.hash package

Posted by David
in reply to Johannes Pfau

David

Posted in reply to Johannes Pfau

Am 07.08.2012 21:53, schrieb Johannes Pfau:
> Am Tue, 07 Aug 2012 20:07:07 +0200
> schrieb David <d@dav1d.de>:
>
>> Is the API already set in stone?
> No, as Jonathan already mentioned reviewing the API is an important
> part of the review. (Actually the most important part for this review,
> as we define one API for many implementations)
>
>> Using .start and .finish, feels like the use of OpenSSL. Why not
>> making a Hash object not "restartable" (since e.g. MD5.start() sets
>> `this` just to MD5.init)
> I'm not an expert in the hash/digest area. We had incompatible APIs for
> CRC32, MD5 and SHA1 and I proposed we should have a unique interface.
> As I had some free time and nobody else stepped up I implemented it,
> asking for expert help in the newsgroup. This probably also explains
> why this API isn't revolutionary, it's inspired by other designs
> (tango, .net, ...).
>
> BTW: making it possible to wrap OpenSSL and other C hash libraries was
> also a goal. This is one reason why a start() function is necessary.
> OK, that was the intro ;-)
>
> start is here as structs can't have default constructors. For all
> current hashes it's really mostly useless (except it can be used as a
> simple way to reset a hash, which is also why the hash structs don't
> have a reset member), but I just don't know if there are hashes which
> would require a more complex (runtime) initialization.

Ok this point with the one above makes sense (I implemented my OpenSSL hashing wrapper as a class, initilaization is done in the constructor), it still doesn't feel right, if you have to call .start first. What about a struct-constructor which calls .start internally, so you get a bit more of a modern API (imo) and you're still able to implement the same interface for any kind of wrapper/own implementation/whatever.

> Classes don't have a start method as they can do that initialization in
> the default constructor. But the default constructor can't be used as a
> cheap way to reset a hash, therefore the reset function was added to
> the OOP interface.

Ok. (more below)

>> and making finish private and implementing a
>> digest and hexdigest property
> property is not a good idea, as finish must reset the internal state
> for some objects, so you can't access the property repeatedly. It'd
> have to be a function.

Well, you could store the result internally, property generates on the first call the digest, stores it (that's not too much, 16 byte for a md5) and returns the already stored value on the 2nd call.

>> which calls finish and returns an
>> ubyte[] array (or string for hexdigest)
>
> This could be done and is probably a matter of taste. I prefer the free
> function in this case, it makes clear that digest() really is a generic
> function which can be used with any hash. It's also one function less
> which must be implemented by hash writers. I'd like to keep the number
> of members required to implement a hash as low as possible.

With digest you mean: http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html#digest ?

You normally always want the hexdigest (you barely need "real" digest), so it's a matter of convenience. Well, thanks to UFCS, you can call it like a method.

> (This is however unrelated to the actual naming of the functions. If
> you think 'digest' is not a good name, please let me know)

I think that's a fitting name, but I am not a hashing expert.

>> , this would also eliminate
>> the need of the `digest` wrapper (e.g. you can mixin these properties
>> with template mixins, so no code duplication)
>
> there's actually no code duplication. 'digest' is the only
> implementation, sha1Of, md5Of are just aliases.

Yeah, I meant, you would have to implement these properties for each hash-type and in the end, all properties do the same (calling finish and maybe setting some internal flags, buffers), so you can put them into a template mixin and mixin them for each hash-type.

>> Also I am not sure if we want to use `hash.put` instead of
>> `hash.update` (which is used by python and openssl). But that's just
>> a minor point (e.g. perl uses .add)
>
> I initially called it update, I agree it's a better name. But phobos is
> all about ranges, so we need the put member anyway. Providing an
> 'update' alias just for the name isn't worth it, imho (see above,
> keeping the number of hash members low).

I have to agree, an alias would produce more confusion (what two methods which do the same?)

>> Furthermore, I don't like the `sha1Of` and `md5Of` etc. functions,
>> why not having a static method? e.g. MD5.hexdigest("input"), which
>> returns the hexdigest and not a MD5 object.
>
> Same reason as above, sha1Of, md5Of are just convenience aliases, it
> could be argued you don't need those at all. You can use 'digest'
> directly and it will provide exactly the same result. So hash writers
> don't have to implement anything to support digest, providing an alias
> is optional and just 1 line of code. A static method would have to be
> implemented by every hash.

Yes, you would have to implement it for every hash, but that's 3 lines:

static string hexdigest(void[] data) {
	return toHexString(digest!(typeof(this))("yay"));
}

> Yes you could use mixins to make that
> painless, but I think it's still to much work. (And mixins can't be
> documented in DDOC, but that's an unrelated compiler issue)

> BTW: it is implemented as a member for the OOP API, as we can just
> implement it in the base interface, so the actual implementations don't
> have to care about it.

I've seen that, but I am wondering why it's not a static method, the creation of the Hash Object could be hidden inside. You might say, this would allocate a class instance just for a single use, without the possibility to reuse it. Correct, but I think that's the use of such a function, if you just need a Hash, nothing else, otherwise you could use the class "directly", with all its other functionality.

>> One more thing, `.reset` does the same as `start`? If so, why do both
>> exist?
> See above. It's because start is used in the template/struct API and
> structs can't have default constructors, so we need start.
>
> The OOP/class api can use a default constructor. But unlike the start
> function it can't work as a reset function, so there's an explicit
> reset function.

I am not sure what do think about that. On the one side it's useful if you really need that speed, but then I think, is it really worth it resetting the state hidden in a function. If you really want to avoid another allocation you could use emplace (if I didn't misunderstand the use of emplace). To quote from the Python Zen:
"Explicit is better than implicit."

>> (I also find the examples quite confusing, why do you want to
>> reset the hash onject?
> My approach was to document how to use the functions, not as much when
> it makes sense to use them. Maybe I should care more about that?

When I saw the documentation/examples, I was confused, why do you want to reset a hash object, so does it really reset the state? So I had to take a look into the code before I was sure, it surely does reset the whole thingy (is it called context?).

> It's more efficient than allocating a new hash with 'new' (or do you
> mean why use reset if finish resets anyway? It's useful if you put some
> data into a hash, then decide you don't want the hash (because the
> user aborts the operation, network connection is lost,...) but you still
> need the hash object later on.) You could just call finish and
> disregard the result, but the reset implementation is faster. I agree
> there's probably no _strong_ need for reset, but I think it doesn't do
> any harm.

"Explicit is better than implicit."

>> Well, that's it from my side :)
>
> Thanks for your review!

Thanks for writing std.hash, I think lots of people need it.

August 07, 2012

Re: The review of std.hash package

Posted by Piotr Szturmaj
in reply to Johannes Pfau

Piotr Szturmaj

Posted in reply to Johannes Pfau

Johannes Pfau wrote:
> As I had some free time and nobody else stepped up I implemented it,

I tried it before but I wanted to create whole crypto package at once, and I guess it was a huge mistake. I know you have used my code in your uuid module, you can use it to add it to std.hash if you want. There are all SHA implementations, MD5 and a Tiger (both versions). They all support bit hashing and work with CTFE.

I made a quick look at your proposal, but currently don't have time to do a deep investigation :). One note though. I don't think it's neccessary to maintain two different API sets. Sure, classes need heap allocation but usually that's only one allocation, it's more important to not allocate during algorithm computations. Fortunately crypto algorithms don't do this. So, struct allocation efficency is negligible here. I think separate struct based API set is not worth it (or I'm missing something not related to allocation cost).

August 08, 2012

Re: The review of std.hash package

Posted by Walter Bright
in reply to Dmitry Olshansky

Walter Bright

Posted in reply to Dmitry Olshansky

On 8/7/2012 10:39 AM, Dmitry Olshansky wrote:
> std.hash.hash is a new module for Phobos defining an uniform interface for
> hashes and checksums. It also provides some useful helper functions to deal with
> this new API.

The hash functions must use a Range interface, not a file interface.

This is extremely important.

August 08, 2012

Re: The review of std.hash package

Posted by Denis Shelomovskij
in reply to Dmitry Olshansky

Denis Shelomovskij

Posted in reply to Dmitry Olshansky

The question about module names.

Is it supposed that e.g. std.hash.crc module will contain many CRC implementations, not only one CRC-32? If not, it will be better to call it std.hash.crc32 because other CRC variants are also in use. Or even std.hash.crc.crc32.

Same with std.hash.sha and std.hash.md modules.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij

August 08, 2012

Re: The review of std.hash package

Posted by Johannes Pfau
in reply to David

Johannes Pfau

Posted in reply to David

Am Wed, 08 Aug 2012 00:55:47 +0200
schrieb David <d@dav1d.de>:

> Am 07.08.2012 21:53, schrieb Johannes Pfau:
> > Am Tue, 07 Aug 2012 20:07:07 +0200
> > schrieb David <d@dav1d.de>:
> >
> > start is here as structs can't have default constructors. For all current hashes it's really mostly useless (except it can be used as a simple way to reset a hash, which is also why the hash structs don't have a reset member), but I just don't know if there are hashes which would require a more complex (runtime) initialization.
> 
> Ok this point with the one above makes sense (I implemented my OpenSSL hashing wrapper as a class, initilaization is done in the constructor), it still doesn't feel right, if you have to call .start first. What about a struct-constructor which calls .start internally, so you get a bit more of a modern API (imo) and you're still able to implement the same interface for any kind of wrapper/own implementation/whatever.

You mean an external function which constructs the digest context and
calls start? That's similar to the appender function in std.array and
probably a good idea. We'd need a good name for it though.
* initContext
* initializeContext
* context
* startContext
* startHash / startDigest

> > property is not a good idea, as finish must reset the internal state for some objects, so you can't access the property repeatedly. It'd have to be a function.
> 
> Well, you could store the result internally, property generates on the first call the digest, stores it (that's not too much, 16 byte for a md5) and returns the already stored value on the 2nd call.
> 

yes, that could be done but to be honest I don't think it's worth it.

> >> which calls finish and returns an
> >> ubyte[] array (or string for hexdigest)
> >
> > This could be done and is probably a matter of taste. I prefer the free function in this case, it makes clear that digest() really is a generic function which can be used with any hash. It's also one function less which must be implemented by hash writers. I'd like to keep the number of members required to implement a hash as low as possible.
> 
> With digest you mean: http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html#digest ?
> 
> You normally always want the hexdigest (you barely need "real"
> digest),

really? I though it's the other way round and comparing/verifying hashes is the common case?

> so it's a matter of convenience. Well, thanks to UFCS, you can call it like a method.
> 
a hexDigest which works like digest could probably be useful, although
it's very easy to implement this in user code as well:
string hexDigest(Hash, Order order = Order.increasing)(scope
const(void[])[] data...) if(isDigest!Hash)
{
    return digest!Hash(data).toHexString!order();
}

> > A static
> > method would have to be implemented by every hash.
> 
> Yes, you would have to implement it for every hash, but that's 3 lines:
> 
> static string hexdigest(void[] data) {
> 	return toHexString(digest!(typeof(this))("yay"));
> }

The implementation is indeed quite simple, so it probably is a matter of taste whether hexDigest is implemented as a member or as a free function.

> > BTW: it is implemented as a member for the OOP API, as we can just implement it in the base interface, so the actual implementations don't have to care about it.
> 
> I've seen that, but I am wondering why it's not a static method, the creation of the Hash Object could be hidden inside. You might say, this would allocate a class instance just for a single use, without the possibility to reuse it. Correct, but I think that's the use of such a function, if you just need a Hash, nothing else, otherwise you could use the class "directly", with all its other functionality.

I have to think about this. It's too bad we can't have static & member functions with the same name in D.

I'd probably argue if you want the behavior of a static function, use
the struct API: digest!MD5(data);
Doesn't this do everything you need, without the GC allocation the OOP
api would cause? I mean the OOP API is mostly about polymorphism (and
about reference semantics). If you use a static method as proposed,
you can't use polymorphism and reference semantics are not useful
either, as you never get a reference to the context?

> > The OOP/class api can use a default constructor. But unlike the start function it can't work as a reset function, so there's an explicit reset function.
> 
> I am not sure what do think about that. On the one side it's useful if you really need that speed, but then I think, is it really worth it resetting the state hidden in a function. If you really want to avoid another allocation you could use emplace (if I didn't misunderstand the use of emplace). To quote from the Python Zen: "Explicit is better than implicit."

Yes, it should be possible to use emplace for that. Emplace is not well-known though and I'm not sure how well it works now. Also I'm not sure if emplace is enough in this case, I guess you'd have to destroy the old instance as well (so destructors are run)?

But I'm not really sure about the importance of reset either. As long as the main implementation is in the struct API and WrapperDigest is used to wrap it into the OOP API there's no extra work to implement reset. If the main implementation is in the OOP interface, the reset function probably looks a lot like the constructor, so simply calling reset from the constructor should avoid code duplication.
> 
> >> (I also find the examples quite confusing, why do you want to reset the hash onject?
> > My approach was to document how to use the functions, not as much when it makes sense to use them. Maybe I should care more about that?
> 
> When I saw the documentation/examples, I was confused, why do you want to reset a hash object, so does it really reset the state? So I had to take a look into the code before I was sure, it surely does reset the whole thingy (is it called context?).

Yes, the whole terminology is probably a little confusing. The old API called the contexts like this: MD5_CTX, but that doesn't fit the phobos naming conventions. MD5CTX, MD5Ctx, Md5Ctx and Md5CTX all look ugly (and only the first fits our naming conventions). So MD5Context is the next best choice, but it's a little long. It'd more precise though.

> 
> > It's more efficient than allocating a new hash with 'new' (or do you mean why use reset if finish resets anyway? It's useful if you put some data into a hash, then decide you don't want the hash (because the user aborts the operation, network connection is lost,...) but you still need the hash object later on.) You could just call finish and disregard the result, but the reset implementation is faster. I agree there's probably no _strong_ need for reset, but I think it doesn't do any harm.
> 
> "Explicit is better than implicit."
> 

I'm just not sure if reset can be implemented in a reasonable way outside of the class in all cases. But then it's probably not important enough to justify an extra member...

August 08, 2012

Re: The review of std.hash package

Posted by Johannes Pfau
in reply to Walter Bright

Johannes Pfau

Posted in reply to Walter Bright

Am Tue, 07 Aug 2012 17:39:15 -0700
schrieb Walter Bright <newshound2@digitalmars.com>:

> On 8/7/2012 10:39 AM, Dmitry Olshansky wrote:
> > std.hash.hash is a new module for Phobos defining an uniform interface for hashes and checksums. It also provides some useful helper functions to deal with this new API.
> 
> The hash functions must use a Range interface, not a file interface.
> 
> This is extremely important.

I guess this is meant as a general statement and not specifically targeted at my std.hash proposal?

I'm a little confused as all hashes already are OutputRanges in my proposal. It's probably not explicit enough in the documentation, but it's mentioned in one example and in the documentation for 'put';

August 08, 2012

Re: The review of std.hash package

Posted by Johannes Pfau
in reply to Piotr Szturmaj

Johannes Pfau

Posted in reply to Piotr Szturmaj

Am Wed, 08 Aug 2012 01:27:45 +0200
schrieb Piotr Szturmaj <bncrbme@jadamspam.pl>:

> Johannes Pfau wrote:
> > As I had some free time and nobody else stepped up I implemented it,
> 
> I tried it before but I wanted to create whole crypto package at once,
I'm sorry, I didn't want to conceal your work. What I meant with
'nobody stepped up' is this: We had a pull request which moved the crc
code into std.crc and it was already merged in for 2.060. I
complained that it had yet another API and that we should have a common
API. In the end that pull request was reverted but I felt kinda guilty
as that left us without a public crc module. In that discussion, nobody
else had time to implement that API, so I thought we'd need a
solution soon and started implementing it.

> and I guess it was a huge mistake.
We still need your crypto work ;-)
I'd personally like to see a bcrypt implementation. There's no
widespread bcrypt C library which could be used, so there's no painless
way to use bcrypt in D. (Although there _is_ public domain C source
code)

> I know you have used my code
> in your uuid module, you can use it to add it to std.hash if you
> want. There are all SHA implementations, MD5 and a Tiger (both
> versions). They all support bit hashing and work with CTFE.
Great, I'll have a look if this review works out (or as soon as we have a standardized hash API in phobos).

BTW: How does it work in CTFE? Don't you have to do endianness conversions at some time? According to Don that's not really supported.

Another problem with prevents CTFE for my proposal would be that the internal state is currently implemented as an array of uints, but the API uses ubyte[] as a return type. That sort of reinterpret cast is not supposed to work in CTFE though. I wonder how you avoided that issue?

And another problem is that void[][] (as used in the 'digest' function)
doesn't work in CTFE (and it isn't supposed to work). But that's a
problem specific to this API.

> I made a quick look at your proposal, but currently don't have time to do a deep investigation :). One note though. I don't think it's neccessary to maintain two different API sets. Sure, classes need heap allocation but usually that's only one allocation, it's more important to not allocate during algorithm computations. Fortunately crypto algorithms don't do this. So, struct allocation efficency is negligible here. I think separate struct based API set is not worth it (or I'm missing something not related to allocation cost).

I'm not sure, I guess some people would disagree. I initially asked whether we should use a struct based API or a class based API (+emplace where necessary). At that time the struct API was favored.

It's about the allocation cost (and the additional GC work) although people could probably also complain about the indirection classes bring along.

There's one important example where you really don't want any
allocation to happen (and especially no GC allocation): You have a
function which always calculates the same type of hash (e.g. md5) and
you don't care about the implementation. You only care about the
final hash, the hash context is never used outside of that function.
This is an perfect fit for stack allocation, especially if the
function is used a lot (GC). With classes you'd have to do
some caching to do it efficiently (which then gives problems with
const).
It would be possible to use a class API + emplace, but that was deemed
to be too cumbersome.

August 08, 2012

Re: The review of std.hash package

Posted by Johannes Pfau
in reply to Denis Shelomovskij

Johannes Pfau

Posted in reply to Denis Shelomovskij

Am Wed, 08 Aug 2012 11:48:54 +0400
schrieb Denis Shelomovskij <verylonglogin.reg@gmail.com>:

> The question about module names.
> 
> Is it supposed that e.g. std.hash.crc module will contain many CRC implementations, not only one CRC-32? If not, it will be better to call it std.hash.crc32 because other CRC variants are also in use. Or even std.hash.crc.crc32.
> 
> Same with std.hash.sha and std.hash.md modules.
> 

They're supposed to contain more implementations in the future. I basically just wrote the new API, then ported what we already had in phobos (and in a pull request by redstar) to the new API. The rest of the SHA functions could probably follow soon as Piotr Szturmaj already has a properly licensed implementation. The Boost project has a templated CRC implementation, we could port that as well.

One problem is that the current MD5 and SHA implementations have been manually optimized for dmds inliner though, so those probably won't be replaced by a more generic version until we have benchmarks in phobos which ensure that there won't be performance regressions.

August 08, 2012

Re: The review of std.hash package

Posted by Piotr Szturmaj
in reply to Johannes Pfau

Piotr Szturmaj

Posted in reply to Johannes Pfau

Johannes Pfau wrote:
> Am Wed, 08 Aug 2012 01:27:45 +0200
> schrieb Piotr Szturmaj <bncrbme@jadamspam.pl>:
>
>> Johannes Pfau wrote:
>>> As I had some free time and nobody else stepped up I implemented it,
>>
>> I tried it before but I wanted to create whole crypto package at
>> once,
> I'm sorry, I didn't want to conceal your work. What I meant with
> 'nobody stepped up' is this: We had a pull request which moved the crc
> code into std.crc and it was already merged in for 2.060. I
> complained that it had yet another API and that we should have a common
> API. In the end that pull request was reverted but I felt kinda guilty
> as that left us without a public crc module. In that discussion, nobody
> else had time to implement that API, so I thought we'd need a
> solution soon and started implementing it.

Hey, I didn't say you did :-) I'm fine with your work :-)

>> and I guess it was a huge mistake.
> We still need your crypto work ;-)
> I'd personally like to see a bcrypt implementation. There's no
> widespread bcrypt C library which could be used, so there's no painless
> way to use bcrypt in D. (Although there _is_ public domain C source
> code)

Yes, there should be bcrypt, scrypt and PBKDF2.

>> I know you have used my code
>> in your uuid module, you can use it to add it to std.hash if you
>> want. There are all SHA implementations, MD5 and a Tiger (both
>> versions). They all support bit hashing and work with CTFE.
> Great, I'll have a look if this review works out (or as soon as we have
> a standardized hash API in phobos).
>
> BTW: How does it work in CTFE? Don't you have to do endianness
> conversions at some time? According to Don that's not really supported.

std.bitmanip.swapEndian() works for me

> Another problem with prevents CTFE for my proposal would be that the
> internal state is currently implemented as an array of uints, but the
> API uses ubyte[] as a return type. That sort of reinterpret cast is not
> supposed to work in CTFE though. I wonder how you avoided that issue?

There is set of functions that abstract some operations to work with CTFE and at runtime: https://github.com/pszturmaj/phobos/blob/master/std/crypto/hash/base.d#L66. Particularly memCopy().

> And another problem is that void[][] (as used in the 'digest' function)
> doesn't work in CTFE (and it isn't supposed to work). But that's a
> problem specific to this API.

Yes, that's why I use ubyte[].

I think if it's possible to do it all with CTFE by using hacks, it should be rather implemented in the compiler, assuming the endiannes of the CPU it's running on.

[...cut...]
> There's one important example where you really don't want any
> allocation to happen (and especially no GC allocation): You have a
> function which always calculates the same type of hash (e.g. md5) and
> you don't care about the implementation. You only care about the
> final hash, the hash context is never used outside of that function.
> This is an perfect fit for stack allocation, especially if the
> function is used a lot (GC). With classes you'd have to do
> some caching to do it efficiently (which then gives problems with
> const).
> It would be possible to use a class API + emplace, but that was deemed
> to be too cumbersome.

I don't think std.typecons.scoped is cumbersome:

auto sha = scoped!SHA1(); // allocates on the stack
auto digest = sha.digest("test");

Why I think classes should be supported is the need of polymorphism. One should be able to accept Digest base class as a function parameter or a field/property. Without polymorphism we need to resort to C-like switches, ifs and so on. Templates are no go, because hash function may be determined at runtime (it may depend on security protocol handshake).

August 08, 2012

Re: The review of std.hash package

Posted by Walter Bright
in reply to Johannes Pfau

Walter Bright

Posted in reply to Johannes Pfau

On 8/8/2012 1:44 AM, Johannes Pfau wrote:
> Am Tue, 07 Aug 2012 17:39:15 -0700
> schrieb Walter Bright <newshound2@digitalmars.com>:
>
>> On 8/7/2012 10:39 AM, Dmitry Olshansky wrote:
>>> std.hash.hash is a new module for Phobos defining an uniform
>>> interface for hashes and checksums. It also provides some useful
>>> helper functions to deal with this new API.
>>
>> The hash functions must use a Range interface, not a file interface.
>>
>> This is extremely important.
>
> I guess this is meant as a general statement and not specifically
> targeted at my std.hash proposal?

Both.

> I'm a little confused as all hashes already are OutputRanges in my
> proposal. It's probably not explicit enough in the documentation, but
> it's mentioned in one example and in the documentation for 'put';

It should accept an input range. But using an Output Range confuses me. A hash function is a reduce algorithm - it accepts a sequence of input values, and produces a single value. You should be able to write code like:

  ubyte[] data;
  ...
  auto crc = data.crc32();

For example, the hash example given is:

  foreach (buffer; file.byChunk(4096 * 1024))
      hash.put(buffer);
  auto result = hash.finish();

Instead it should be something like:

  auto result = file.byChunk(4096 * 1025).joiner.hash();

The magic is that any input range that produces bytes could be used, and that byte producing input range can be hooked up to the input of any reducing function.

The use of a member finish() is not what any other reduce algorithm has, and so the interface is not a general component interface.

I know the documentation on ranges in Phobos is incomplete and confusing.

I appreciate the effort and care you're putting into this.

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation