Thread overview | ||||||
---|---|---|---|---|---|---|
|
May 01, 2017 "Rolling Hash computation" or "Content Defined Chunking" | ||||
---|---|---|---|---|
| ||||
Hi Dlander's. Found some interesting reads ([1] [2] [3]) about the $SUBJECT and wonder if there is anything available in the Dland?! If yes, pls. share. If not, how could it be done (D'ish) [1] - https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/ - https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c [2] - https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc [3] - http://www.infoarena.ro/blog/rolling-hash Thanks & regards |
May 06, 2017 Re: "Rolling Hash computation" or "Content Defined Chunking" | ||||
---|---|---|---|---|
| ||||
Posted in reply to notna | Am Mon, 01 May 2017 21:01:43 +0000 schrieb notna <notna.remove.this@ist-einmalig.de>: > Hi Dlander's. > > Found some interesting reads ([1] [2] [3]) about the $SUBJECT and wonder if there is anything available in the Dland?! > > If yes, pls. share. > If not, how could it be done (D'ish) > > [1] - > https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/ > - > https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c > > [2] - https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc > > [3] - http://www.infoarena.ro/blog/rolling-hash > > Thanks & regards Interesting concept. I'm not aware of any D implementation but it shouldn't be difficult to implement this in D: https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial There's a BSD licensed haskell implementation, so a BSD licensed port would be very easy to implement: https://hackage.haskell.org/package/optimal-blocks-0.1.0 https://hackage.haskell.org/package/optimal-blocks-0.1.0/docs/src/Algorithm-OptimalBlocks-BuzzHash.html To make an implementation D'ish it could integrate with either std.digest or process input ranges. If you want to use it exclusively for chunking your code can be more efficient (process InputRange until a boundary condition is met). When using input ranges, prefer some kind of buffered approach, Range!ubyte[] instead of Range!ubyte for better performance. If you really want the rolling hash value for each byte in a sequence this will be less efficient as you'll have to enter data byte-by-byte. In this case it's extremely important for performance that your function can be inlined, so use templates: ubyte[] data; foreach(b; data) { // This needs to be inlined for performance reasons rollinghash.put(b); } -- Johannes |
May 09, 2017 Re: "Rolling Hash computation" or "Content Defined Chunking" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Johannes Pfau | On Saturday, 6 May 2017 at 07:21:51 UTC, Johannes Pfau wrote:
> Am Mon, 01 May 2017 21:01:43 +0000
> schrieb notna <notna.remove.this@ist-einmalig.de>:
>
>> Hi Dlander's.
>>
>> Found some interesting reads ([1] [2] [3]) about the $SUBJECT and wonder if there is anything available in the Dland?!
>>
>> If yes, pls. share.
>> If not, how could it be done (D'ish)
>>
>> [1] -
>> https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
>> -
>> https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c
>>
>> [2] - https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc
>>
>> [3] - http://www.infoarena.ro/blog/rolling-hash
>>
>> Thanks & regards
>
> Interesting concept. I'm not aware of any D implementation but it shouldn't be difficult to implement this in D: https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial
>
> There's a BSD licensed haskell implementation, so a BSD licensed port
> would be very easy to implement:
> https://hackage.haskell.org/package/optimal-blocks-0.1.0
> https://hackage.haskell.org/package/optimal-blocks-0.1.0/docs/src/Algorithm-OptimalBlocks-BuzzHash.html
>
> To make an implementation D'ish it could integrate with either std.digest or process input ranges. If you want to use it exclusively for chunking your code can be more efficient (process InputRange until a boundary condition is met). When using input ranges, prefer some kind of buffered approach, Range!ubyte[] instead of Range!ubyte for better performance.
>
> If you really want the rolling hash value for each byte in a sequence this will be less efficient as you'll have to enter data byte-by-byte. In this case it's extremely important for performance that your function can be inlined, so use templates:
>
> ubyte[] data;
> foreach(b; data)
> {
> // This needs to be inlined for performance reasons
> rollinghash.put(b);
> }
>
> -- Johannes
Thanks for the feedback, Johannes.
I hoped there may already be something in Mir or Weka.io or somewhere else... Will read the Golang, C and C++ source and see if my Dlang is good enough for ranges and the like magic...
|
May 10, 2017 Re: "Rolling Hash computation" or "Content Defined Chunking" | ||||
---|---|---|---|---|
| ||||
Posted in reply to notna | On Tuesday, 9 May 2017 at 18:17:45 UTC, notna wrote: > > I hoped there may already be something in Mir or Weka.io or somewhere else... Will read the Golang, C and C++ source and see if my Dlang is good enough for ranges and the like magic... Hello notha, You may want to open a PR to mir-algorithm. I will help to make code idiomatic. https://github.com/libmir/mir-algorithm Thanks, Ilya |
Copyright © 1999-2021 by the D Language Foundation