August 08, 2012
"Regan Heath" wrote in message (digitalmars.D:174462):
> "Message-Digest Algorithm" is the proper term, "hash" is another, correct, more general term.
> 
> "hash" has other meanings, "Message-Digest Algorithm" does not.

I think the question is: is std.hash going to contain only message-digest algorithms, or could it also contain other hash functions? I think there is enough room in a package to have both message-digest algorithms and other kinds of hash functions.

August 08, 2012
On 8/8/12 8:54 AM, Regan Heath wrote:
> "Hash" has too many meanings, we should avoid it.

Yes please.

Andrei

August 08, 2012
On Wednesday, 8 August 2012 at 13:38:26 UTC, travert@phare.normalesup.org (Christophe Travert) wrote:
> I think the question is: is std.hash going to contain only
> message-digest algorithm, or could it also contain other hash functions?
> I think there is enough room in a package to have both message-digest
> algorithm and other kinds of hash functions.

Even if that were the case, I'd say they should be kept separate. Cryptographic hash functions serve extremely different purposes from regular hash functions. There is no reason they should be categorized the same.
August 08, 2012
On Wed, 08 Aug 2012 14:50:22 +0100, Chris Cain <clcain@uncg.edu> wrote:

> On Wednesday, 8 August 2012 at 13:38:26 UTC, travert@phare.normalesup.org (Christophe Travert) wrote:
>> I think the question is: is std.hash going to contain only
>> message-digest algorithm, or could it also contain other hash functions?
>> I think there is enough room in a package to have both message-digest
>> algorithm and other kinds of hash functions.
>
> Even if that were the case, I'd say they should be kept separate. Cryptographic hash functions serve extremely different purposes from regular hash functions. There is no reason they should be categorized the same.

I don't think there is any reason to separate them.  People should know which digest algorithm they want, they're not going to pick one at random and assume it's "super secure!"(tm).  And if they do, well tough, they deserve what they get.

"std.digest" can encompass all message digest algorithms, whether secure or not.

We could create a 2nd level below "secure" or "crypto" or similar if we really want, but I don't see much point TBH.

R

August 08, 2012
"Chris Cain" wrote in message (digitalmars.D:174466):
> On Wednesday, 8 August 2012 at 13:38:26 UTC, travert@phare.normalesup.org (Christophe Travert) wrote:
>> I think the question is: is std.hash going to contain only
>> message-digest algorithm, or could it also contain other hash
>> functions?
>> I think there is enough room in a package to have both
>> message-digest
>> algorithm and other kinds of hash functions.
> 
> Even if that were the case, I'd say they should be kept separate. Cryptographic hash functions serve extremely different purposes from regular hash functions. There is no reason they should be categorized the same.

They should not be categorized the same. I don't expect a regular hash function to pass the isDigest predicate. But they have many similarities, which explains why they are all called hash functions. There is enough room in a package to put several related concepts!

Here, we have a package of 4 files, with a total number of lines that is about one third of the single std.algorithm file (which is probably too big, I concede). There aren't hundreds of message-digest functions to add here.

If it were me, I would have the presently reviewed module std.hash.hash be called std.hash.digest, and leave room here for regular hash functions. In any case, I think regular hash functions HAVE to be in a std.hash module or package, because people looking for a regular hash function will look here first.


August 08, 2012
On Wed, 08 Aug 2012 02:49:00 -0700,
Walter Bright <newshound2@digitalmars.com> wrote:

> 
> It should accept an input range. But using an Output Range confuses me. A hash function is a reduce algorithm - it accepts a sequence of input values, and produces a single value. You should be able to write code like:
> 
>    ubyte[] data;
>    ...
>    auto crc = data.crc32();

auto crc = crc32Of(data);
auto crc = data.crc32Of(); //ufcs

This doesn't work with every InputRange yet, and that needs to be fixed.
It's quite a simple fix (max 10 lines of code, one new overload) and
not an inherent problem of the API (see below for more).
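For readers who want to try the two call styles, here is a minimal sketch. It assumes a current Phobos in which this functionality is available as `std.digest.crc` (the module name `std.hash` was still under discussion in this thread); `crcBothWays` is a hypothetical helper name used only for illustration.

```d
import std.digest.crc : crc32Of, crcHexString;
import std.string : representation;

/// Hypothetical helper: computes the CRC32 with both call styles.
ubyte[4] crcBothWays(const(ubyte)[] data)
{
    ubyte[4] viaCall = crc32Of(data);   // plain function call
    ubyte[4] viaUfcs = data.crc32Of();  // the same call written with UFCS
    assert(viaCall == viaUfcs);
    return viaCall;
}

void main()
{
    auto crc = crcBothWays("The quick brown fox jumps over the lazy dog".representation);
    // crcHexString prints the bytes in the order CRCs are usually written.
    assert(crcHexString(crc) == "414FA339");
}
```

Both spellings resolve to the same function; UFCS only changes how the call reads.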

> 
> For example, the hash example given is:
> 
>    foreach (buffer; file.byChunk(4096 * 1024))
>        hash.put(buffer);
>    auto result = hash.finish();
> 
> Instead it should be something like:
> 
>    auto result = file.byChunk(4096 * 1025).joiner.hash();

But it also says this:
//As digests implement OutputRange, we could use std.algorithm.copy
//Let's do it manually for now

You can basically do this with a range interface in 1 line:
----
import std.algorithm : copy;

auto result = copy(file.byChunk(4096 * 1024), hash).finish();
----
or with ufcs:
----
auto result = file.byChunk(4096 * 1024).copy(hash).finish();
----

OK, you have to initialize hash and you have to call finish. With a new overload for digest it's as simple as this:
----
auto result = file.byChunk(4096 * 1024).digest!CRC32();
auto result = file.byChunk(4096 * 1024).crc32Of(); //with alias
----
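The pipelines above can be tried without a file by chunking an in-memory buffer. This is a sketch under assumptions: it uses the `std.digest` names found in current Phobos (`makeDigest`, `digest`, `CRC32`), and `crcChunked` is a hypothetical helper. The digest is passed to `copy` by pointer because `copy` takes its target by value.

```d
import std.algorithm.mutation : copy;
import std.digest : digest, makeDigest;
import std.digest.crc : CRC32, crc32Of;
import std.range : chunks;
import std.string : representation;

/// Hypothetical helper: hashes `data` in `chunkSize`-byte pieces via copy.
ubyte[4] crcChunked(const(ubyte)[] data, size_t chunkSize)
{
    auto hash = makeDigest!CRC32();       // returns an already started digest
    copy(data.chunks(chunkSize), &hash);  // pointer keeps the state in `hash`
    return hash.finish();
}

void main()
{
    auto data = "hello, digest".representation;
    // Chunk boundaries do not change the result.
    assert(crcChunked(data, 4) == crc32Of(data));
    // One-line form: the range-taking convenience function.
    assert(digest!CRC32(data.chunks(4)) == crc32Of(data));
}
```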

The digests are OutputRanges, you can write data to them. There's absolutely no need to make them InputRanges as they only produce 1 value, and the hash sum is produced at once, so there's no way to receive the result in a partial way. A digest is very similar to Appender and its .data property in this regard.

The put function could accept an InputRange, but I think there was a thread recently which said this is evil for OutputRanges as the same feature can be achieved with copy.

There's also no big benefit in doing it that way. If your InputRange is
really unbuffered you could avoid double buffering. But then you
transfer data byte by byte which will be horribly slow.
If your InputRange has an internal buffer, copy should just copy from
that internal buffer to the 64 byte buffer used inside the digest
implementation.
This double buffering could only be avoided if the put function
accepted an InputRange and could supply a buffer for that InputRange so
the InputRange could write directly into the 64 byte buffer. But
there's nothing like that in phobos, so this is all speculation.

(Also the internal buffer is only used for the first 64 bytes (or less) of the supplied data. The rest is processed without copying. It could probably be optimized so that there's absolutely no copying as long as the input buffer length is a multiple of 64)

> 
> The magic is that any input range that produces bytes could be used, and that byte producing input range can be hooked up to the input of any reducing function.
See above. Every InputRange with byte element type does work. You just have to use copy.

> 
> The use of a member finish() is not what any other reduce algorithm has, and so the interface is not a general component interface.

It's a struct with state, not a simple reduce function, so it needs that finish member. It works that way in every other language (and this is not because those languages don't have ranges; streams and iterators (as in C#) work exactly the same in this case).
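The stateful lifecycle can be sketched as follows, assuming the `MD5` struct from current Phobos (`std.digest.md`); `md5OfParts` is a hypothetical helper name.

```d
import std.digest : toHexString;
import std.digest.md : MD5, md5Of;
import std.string : representation;

/// Hypothetical helper: feeds several pieces into one digest.
ubyte[16] md5OfParts(scope const(ubyte[])[] parts)
{
    MD5 hash;
    hash.start();            // initialise the internal state
    foreach (part; parts)
        hash.put(part);      // state accumulates across calls
    return hash.finish();    // pad, finalise and return the digest
}

void main()
{
    auto pieces = ["a".representation, "bc".representation];
    // Feeding "a" then "bc" gives the same digest as hashing "abc" at once.
    assert(md5OfParts(pieces) == md5Of("abc"));
    assert(toHexString(md5OfParts(pieces)) == "900150983CD24FB0D6963F7D28E17F72");
}
```

The final padding step is why `finish` has to be a separate call: the digest cannot know that the input has ended until it is told.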

Let's take a real world example: You want to download a huge file with std.net.curl and hash it on the fly. Completely reading into a buffer is not possible (large file!). Now std.net.curl has a callback interface (which is forced on us by libcurl). How would you map that into an InputRange? (The byLine range in std.net.curl is eager, byLineAsync needs an additional thread). A newbie trying to do that will despair as it would work just fine in every other language, but D forces that InputRange interface.

Implementing it as an OutputRange is much better. The described scenario works fine and hashing an InputRange also works fine - just use copy. OutputRange is much more universal for this use case.
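The callback scenario can be sketched without any network code. Everything named `fakeDownload` and `sha1OfDownload` here is hypothetical: `fakeDownload` merely stands in for a push-style source such as a libcurl receive callback, and the digest type is `SHA1` from current Phobos.

```d
import std.algorithm.comparison : min;
import std.digest.sha : SHA1, sha1Of;
import std.string : representation;

/// Hypothetical push-style source: hands chunks to a callback
/// instead of exposing an InputRange.
void fakeDownload(const(ubyte)[] payload, size_t chunkSize,
                  scope void delegate(const(ubyte)[] chunk) onReceive)
{
    assert(chunkSize > 0);
    while (payload.length)
    {
        const n = min(payload.length, chunkSize);
        onReceive(payload[0 .. n]);
        payload = payload[n .. $];
    }
}

/// Hashes the data on the fly, never holding more than one chunk.
ubyte[20] sha1OfDownload(const(ubyte)[] payload, size_t chunkSize)
{
    SHA1 hash;
    hash.start();
    // The callback just forwards each chunk to the digest's put().
    fakeDownload(payload, chunkSize, (const(ubyte)[] chunk) { hash.put(chunk); });
    return hash.finish();
}

void main()
{
    auto data = "a large file, delivered piece by piece".representation;
    assert(sha1OfDownload(data, 7) == sha1Of(data));
}
```

Because the digest only needs `put`, the callback adapts to it in one line; no thread or intermediate buffer is required.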

However, I do agree digest!Hash, md5Of, sha1Of should have an additional overload which takes an InputRange. It would be implemented with copy and be a nice convenience function.

> 
> I know the documentation on ranges in Phobos is incomplete and confusing.

Especially for copy, as the documentation doesn't indicate the line I posted could work in any way ;-)


August 08, 2012
On Wed, 08 Aug 2012 11:27:49 +0200,
Piotr Szturmaj <bncrbme@jadamspam.pl> wrote:

> > BTW: How does it work in CTFE? Don't you have to do endianness conversions at some time? According to Don that's not really supported.
> 
> std.bitmanip.swapEndian() works for me

Great! I always tried the *endianToNative and nativeTo*Endian functions. So I didn't expect swapEndian to work.
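A small compile-time check of that claim, as a sketch. It assumes a current compiler and Phobos; whether this worked identically on the 2012 toolchain is not something the thread establishes beyond the statement above.

```d
import std.bitmanip : swapEndian;

// Initialising an enum forces the call to be evaluated at compile time.
enum uint swapped = swapEndian(0x11223344u);
static assert(swapped == 0x44332211u);

void main()
{
    // The same function is also usable at run time.
    uint value = 0x11223344u;
    assert(swapEndian(value) == swapped);
}
```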
> 
> > Another problem which prevents CTFE for my proposal would be that the internal state is currently implemented as an array of uints, but the API uses ubyte[] as a return type. That sort of reinterpret cast is not supposed to work in CTFE though. I wonder how you avoided that issue?
> 
> > There is a set of functions that abstract some operations to work with CTFE and at runtime: https://github.com/pszturmaj/phobos/blob/master/std/crypto/hash/base.d#L66. Particularly memCopy().

I should definitely look at this later. Would be great if hashes worked in CTFE.

> > And another problem is that void[][] (as used in the 'digest'
> > function) doesn't work in CTFE (and it isn't supposed to work). But
> > that's a problem specific to this API.
> 
> Yes, that's why I use ubyte[].
But then you can't even hash a string in CTFE. I wanted to special case strings, but for various reasons it didn't work out in the end.
> 
> I don't think std.typecons.scoped is cumbersome:
> 
> auto sha = scoped!SHA1(); // allocates on the stack
> auto digest = sha.digest("test");

Yes, I'm not sure about this. But a class-only interface probably doesn't have a high chance of being accepted into Phobos. And I think the struct interface + wrappers approach isn't bad.

> 
> Why I think classes should be supported is the need of polymorphism.
And ABI compatibility and switching the backend (OpenSSL, native D, windows crypto) at runtime. I know it's very useful, this is why we have the OOP api. It's very easy to wrap the OOP api onto the struct api. These are the implementations of MD5Digest, CRC32Digest and SHA1Digest:

alias WrapperDigest!CRC32 CRC32Digest;
alias WrapperDigest!MD5 MD5Digest;
alias WrapperDigest!SHA1 SHA1Digest;

With the support code in std.hash.hash, 1 LOC is enough to implement the OOP interface if a struct interface is available, so I don't think maintaining two APIs is a problem.

A bigger problem is that the real implementation must be the struct interface, so you can't use polymorphism there. I hope alias this is enough.
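To illustrate what the OOP wrappers buy, here is a sketch of picking an implementation at run time. It assumes the `Digest` interface and the `CRC32Digest`, `MD5Digest` and `SHA1Digest` classes as they exist in current Phobos; `makeDigestByName`, `hexDigestOf` and the string keys are hypothetical.

```d
import std.digest : Digest, toHexString;
import std.digest.crc : CRC32Digest;
import std.digest.md : MD5Digest;
import std.digest.sha : SHA1Digest;
import std.string : representation;

/// Hypothetical factory: chooses an implementation at run time.
Digest makeDigestByName(string name)
{
    switch (name)
    {
        case "crc32": return new CRC32Digest();
        case "md5":   return new MD5Digest();
        case "sha1":  return new SHA1Digest();
        default:      throw new Exception("unknown digest: " ~ name);
    }
}

/// The calling code is identical for every backend.
string hexDigestOf(string algorithm, const(ubyte)[] data)
{
    Digest d = makeDigestByName(algorithm);
    d.put(data);
    return toHexString(d.finish());   // the interface returns a ubyte[]
}

void main()
{
    assert(hexDigestOf("md5", "abc".representation)
        == "900150983CD24FB0D6963F7D28E17F72");
    assert(hexDigestOf("sha1", "abc".representation)
        == "A9993E364706816ABA3E25717850C26C9CD0D89D");
}
```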


August 08, 2012
On Wed, 08 Aug 2012 11:27:49 +0200,
Piotr Szturmaj <bncrbme@jadamspam.pl> wrote:

> 
> Yes, there should be bcrypt, scrypt and PBKDF2.

Wow, I didn't know about scrypt. Seems to be pretty cool.
August 08, 2012
On Wednesday, 8 August 2012 at 14:14:29 UTC, Regan Heath wrote:
> I don't think there is any reason to separate them.  People should know which digest algorithm they want, they're not going to pick one at random and assume it's "super secure!"(tm).  And if they do, well tough, they deserve what they get.

In this case, I'm not suggesting keeping them separate to avoid confusing those who don't know better. They're simply disparate in actual use.

What do you use a traditional hash function for? Usually to turn a large multibyte stream into some finite size so that you can use a lookup table or maybe to decrease wasted time in comparisons.

What do you use a cryptographic hash function for? Almost always it's to verify the integrity of some data (usually files) or protect the original form from prying eyes (passwords ... though, there are better approaches for that now).

You'd _never_ use a cryptographic hash function in place of a traditional hash function and vice versa because they are designed for completely different purposes. At a cursory glance, they bear only one similarity, and that's the fact that they turn a big chunk of data into a smaller form that has a fixed size.
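The two uses can be put side by side in a short sketch. FNV-1a is chosen here only as a well-known example of a non-cryptographic hash; it is not mentioned in this thread. `fnv1a`, `bucketFor` and `isIntact` are hypothetical names, and `sha1Of` is taken from current Phobos.

```d
import std.digest.sha : sha1Of;
import std.string : representation;

/// 32-bit FNV-1a: tiny and fast, with no resistance to deliberate collisions.
uint fnv1a(const(ubyte)[] data) pure nothrow @nogc @safe
{
    uint h = 2166136261u;      // FNV offset basis
    foreach (b; data)
    {
        h ^= b;
        h *= 16777619u;        // FNV prime; multiplication wraps modulo 2^32
    }
    return h;
}

/// Lookup-table use: reduce a key to a bucket index.
size_t bucketFor(string key, size_t bucketCount)
{
    return fnv1a(key.representation) % bucketCount;
}

/// Integrity use: compare a payload against a previously recorded digest.
bool isIntact(const(ubyte)[] payload, const ubyte[20] expected)
{
    return sha1Of(payload) == expected;
}

void main()
{
    assert(fnv1a("a".representation) == 0xe40c292c);
    assert(bucketFor("key", 16) < 16);

    auto payload = "release.tar".representation;
    const recorded = sha1Of(payload);
    assert(isIntact(payload, recorded));
    assert(!isIntact("tampered".representation, recorded));
}
```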

On Wednesday, 8 August 2012 at 14:16:40 UTC, travert@phare.normalesup.org (Christophe Travert) wrote:
> function to pass the isDigest predicate. But they have many
> similarities, which explains they are all called hash functions. There
> is enough room in a package to put several related concepts!

Cryptographic hash functions are also known as "one-way compression functions." They also have similarities to file compression algorithms. After all, both of them turn large files into smaller data. However, the actual use of them is completely different and you wouldn't use one in place of the other. I wouldn't put the Burrows-Wheeler transform in the same package.



It's just my opinion of course, but I just feel it wouldn't be right to intermingle normal hash functions and cryptographic hash functions in the same package. If we had to make a compromise and group them with something else, I'd really like to see cryptographic hash functions put in the same place we'd put other cryptography (such as AES) ... in a std.crypto package. But std.digest is good if they can exist in their own package.


It also occurs to me that a lot of people are confounding cryptographic hash functions and normal hash functions enough that they think that a normal hash function has a "digest" ... I'm 99% sure that's exclusive to the cryptographic hash functions (at least, I've never heard of a normal hash function producing a digest).
August 08, 2012
On Wed, 8 Aug 2012 17:50:33 +0200,
Johannes Pfau <nospam@example.com> wrote:

> However, I do agree digest!Hash, md5Of, sha1Of should have an additional overload which takes an InputRange. It would be implemented with copy and be a nice convenience function.

I implemented the function, it's actually quite simple:
----
digestType!Hash digestRange(Hash, Range)(Range data) if(isDigest!Hash &&
    isInputRange!Range && __traits(compiles,
    digest!Hash(ElementType!(Range).init)))
{
    Hash hash;
    hash.start();
    // copy takes its target by value: pass a pointer so hash itself is updated
    copy(data, &hash);
    return hash.finish();
}
----

but I don't know how to make it an overload. See thread "overloading a function taking a void[][]" in D.learn for details.
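A self-contained variant of the same idea, as a sketch only. It uses the names from current Phobos (`isDigest`, `DigestType`) rather than the proposal's `digestType`. The `!isArray!Range` constraint is one possible way to keep it from competing with an array-taking overload; that is an assumption, not something settled in this thread.

```d
import std.algorithm.mutation : copy;
import std.digest : DigestType, isDigest;
import std.digest.md : MD5, md5Of;
import std.range : repeat;
import std.range.primitives : ElementType, isInputRange, isOutputRange;
import std.traits : isArray;

/// Hashes every element of a non-array input range.
DigestType!Hash digestRange(Hash, Range)(Range data)
if (isDigest!Hash && isInputRange!Range && !isArray!Range
    && isOutputRange!(Hash, ElementType!Range))
{
    Hash hash;
    hash.start();
    copy(data, &hash);   // pointer, so the fed state ends up in `hash`
    return hash.finish();
}

void main()
{
    // A lazy range of three 'a' bytes hashes like the string "aaa".
    auto bytes = repeat(cast(ubyte) 'a', 3);
    assert(digestRange!MD5(bytes) == md5Of("aaa"));
}
```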