Thread overview
Stream-Based Processing of Range Chunks in D
Dec 10, 2013
Nordlöw
Dec 10, 2013
qznc
Dec 10, 2013
Philippe Sigaud
December 10, 2013
I'm looking for an elegant way to do chunk-based stream processing of arrays/ranges. I'm building a file indexing/search engine in D that calculates various statistics on files, such as histograms and SHA1 digests. For data-access locality, I want all of these calculations performed in a single pass over the data.

Here is an excerpt from the engine:

    /** Process File in Cache Friendly Chunks. */
    void calculateCStatInChunks(immutable (ubyte[]) src,
                                size_t chunkSize, bool doSHA1, bool doBHist8) {
        if (!_cstat.contentsDigest[].allZeros) { doSHA1 = false; }
        if (!_cstat.bhist8.allZeros) { doBHist8 = false; }

        import std.digest.sha;
        SHA1 sha1;
        if (doSHA1) { sha1.start(); }

        import std.range: chunks;
        foreach (chunk; src.chunks(chunkSize)) {
            if (doSHA1) { sha1.put(chunk); }
            if (doBHist8) { /*...*/ }
        }

        if (doSHA1) {
            _cstat.contentsDigest = sha1.finish();
        }
    }

This does not seem very elegant (or functional): the logic for each statistic (reducer) is spread across three different places in the code, namely `start`, `put` and `finish`.

Does anybody have suggestions or references for Haskell-monad-like, stream-based APIs that could make this code more component-based, in D style?
December 10, 2013
On Tuesday, 10 December 2013 at 09:57:44 UTC, Nordlöw wrote:
> Seemingly this is not a very elegant (functional) approach as I have to spread logic for each statistics (reducer) across three different places in the code, namely `start`, `put` and `finish`.
>
> Does anybody have suggestions/references on Haskell-monad-like stream based APIs that can make this code more D-style component-based?

You could make a range step for each kind of statistic, which outputs the input range unchanged and does its job as a side effect.

  SHA1 sha1;
  src.chunks(chunkSize)
     .add_sha1(doSHA1, &sha1)
     .add_bhist(doBHist8)
     .strict_consuming();
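
A minimal sketch of what such pass-through steps could look like, assuming a Phobos version that provides `std.range.tee`. The names `CStat` and `statsInOnePass` are hypothetical and only serve the illustration; `add_sha1`, `add_bhist` and `strict_consuming` above are not existing library functions, so `tee` plus an empty `foreach` stand in for them here.

```d
import std.digest.sha : SHA1;
import std.range : chunks, tee;

/// Hypothetical result type for this sketch (not the engine's own _cstat).
struct CStat
{
    ubyte[20] sha1Digest;
    size_t[256] bhist8;
}

/// Hypothetical helper: computes both statistics in one pass over `src`.
CStat statsInOnePass(const(ubyte)[] src, size_t chunkSize)
{
    CStat result;
    SHA1 sha1;
    sha1.start();

    // Each tee step hands the chunk on unchanged and does its work
    // as a side effect while the range is being consumed.
    auto pipeline = src.chunks(chunkSize)
        .tee!(chunk => sha1.put(chunk))
        .tee!((chunk) { foreach (b; chunk) ++result.bhist8[b]; });

    // Ranges are lazy: nothing happens until the pipeline is walked.
    foreach (chunk; pipeline) {}

    result.sha1Digest = sha1.finish();
    return result;
}
```

Note that `start` and `finish` still sit outside the pipeline in this sketch; only the per-chunk `put` logic moves into the steps.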

You could try to use constructor/destructor mechanisms for sha1.start and sha1.finish. Or at least scope guards:

  SHA1 sha1;
  if (doSHA1) { sha1.start(); }
  scope(exit) if (doSHA1) { _cstat.contentsDigest = sha1.finish(); }
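
For context, here is a sketch of how the scope guard could sit inside a complete function. The function name `sha1InChunks` and the `digest` parameter (standing in for `_cstat.contentsDigest`) are assumptions made for the example, not part of the original engine.

```d
import std.digest.sha : SHA1;
import std.range : chunks;

/// Hypothetical free function; `digest` stands in for _cstat.contentsDigest.
void sha1InChunks(const(ubyte)[] src, size_t chunkSize, bool doSHA1,
                  ref ubyte[20] digest)
{
    SHA1 sha1;
    if (doSHA1) { sha1.start(); }
    // Registered right next to start(); runs when the function scope
    // is left, whether normally or through an exception.
    scope(exit) if (doSHA1) { digest = sha1.finish(); }

    foreach (chunk; src.chunks(chunkSize))
    {
        if (doSHA1) { sha1.put(chunk); }
    }
}
```

This keeps `start` and `finish` on adjacent lines, so only `put` remains separate inside the loop.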
December 10, 2013
On Tue, Dec 10, 2013 at 10:57 AM, "Nordlöw" <per.nordlow@gmail.com> wrote:
> I'm looking for an elegant way to perform chunk-stream-based processing of arrays/ranges. I'm building a file indexing/search engine in D that calculates various kinds of statistics on files such as histograms and SHA1-digests. I want these calculations to be performed in a single pass with regards to data-access locality.

> Seemingly this is not a very elegant (functional) approach as I have to
> spread logic for each statistics (reducer) across three different places in
> the code, namely `start`, `put` and `finish`.

As for `put`, you could build an auxiliary function that is defined only once:

// Chunks of the immutable(ubyte[]) source convert implicitly to const(ubyte)[].
void delegate(const(ubyte)[] chunk) worker, sha, bhist8;

if (doSHA1)
    sha = (chunk) { sha1.put(chunk); };
else
    sha = (chunk) {};

if (doBHist8)
    bhist8 = (chunk) { /* some BHist8 work */ };
else
    bhist8 = (chunk) {};

// Defined once: dispatches each chunk to both steps.
worker = (chunk) { sha(chunk); bhist8(chunk); };

...

foreach (chunk; src.chunks(chunkSize))
    worker(chunk);
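
A possible variation on the same idea, sketched under some assumptions: instead of assigning no-op delegates for disabled steps, collect only the enabled steps in an array and call each of them per chunk. `ChunkSink`, `Stats` and `statsViaSinks` are hypothetical names introduced for this example.

```d
import std.digest.sha : SHA1;
import std.range : chunks;

/// A reducer step: anything that can consume one chunk.
alias ChunkSink = void delegate(const(ubyte)[] chunk);

/// Hypothetical result type for this sketch.
struct Stats
{
    ubyte[20] sha1Digest;  // stays all zeros when SHA1 is disabled
    size_t[256] bhist8;    // stays all zeros when the histogram is disabled
}

Stats statsViaSinks(const(ubyte)[] src, size_t chunkSize,
                    bool doSHA1, bool doBHist8)
{
    Stats result;
    SHA1 sha1;
    ChunkSink[] sinks;

    // Register only the steps that are switched on.
    if (doSHA1)
    {
        sha1.start();
        sinks ~= delegate(const(ubyte)[] chunk) { sha1.put(chunk); };
    }
    if (doBHist8)
    {
        sinks ~= delegate(const(ubyte)[] chunk) {
            foreach (b; chunk) ++result.bhist8[b];
        };
    }

    // Single pass: every chunk visits each enabled step while it is hot.
    foreach (chunk; src.chunks(chunkSize))
        foreach (sink; sinks)
            sink(chunk);

    if (doSHA1) { result.sha1Digest = sha1.finish(); }
    return result;
}
```

The trade-off is one indirect call per step and chunk, which should be negligible next to the hashing itself when chunks are reasonably large.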