task parallelize dirEntries
August 11, 2017
I've modified the sample from tour.dlang.org to calculate the md5 digest of the files in a directory using std.parallelism.

When I run this on a directory with a huge number of files, I get:

core.exception.OutOfMemoryError@src/core/exception.d(696): Memory allocation failed

Since dirEntries returns a range, I thought std.parallelism.parallel could make use of it without loading the entire file list into memory.

What am I doing wrong here? Is there a way to achieve what I'm expecting?

```
import std.digest.md;
import std.stdio: writeln;
import std.file;
import std.algorithm;
import std.parallelism;

void printUsage()
{
    writeln("Loops through a given directory and calculates the md5 digest of each file encountered.");
    writeln("Usage: md <dirname>");
}

void safePrint(T...)(T args)
{
    synchronized
    {
        import std.stdio : writeln;
        writeln(args);
    }
}

void main(string[] args)
{
    if (args.length != 2)
        return printUsage;

    foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
    {
        auto md5 = new MD5Digest(); // a fresh digest object is allocated on every iteration
        md5.reset();
        auto data = cast(const(ubyte)[]) read(d.name);
        md5.put(data);
        auto hash = md5.finish();
        import std.array;
        string[] t = split(d.name, '/');
        safePrint(toHexString!(LetterCase.lower)(hash), "  ", t[$-1]);
    }
}
```
August 11, 2017
On Friday, 11 August 2017 at 21:33:51 UTC, Arun Chandrasekaran wrote:
> I've modified the sample from tour.dlang.org to calculate the
>
> [...]

RHEL 7.2 64 bit
dmd v2.075.0
ldc 1.1.0
August 11, 2017
On Friday, 11 August 2017 at 21:33:51 UTC, Arun Chandrasekaran wrote:
> I've modified the sample from tour.dlang.org to calculate the md5 digest of the files in a directory using std.parallelism.
>
> [...]

Just a thought: maybe the GC isn't cleaning up quickly enough? You are allocating an MD5 digest each iteration.

A possible optimization is to use a collection of MD5 digests and reuse them, e.g., pre-allocate 100 (you probably only need as many as the number of parallel workers) and then attempt to reuse them. If all are in use, wait for a free one. This might require some synchronization.
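
Something like std.parallelism's workerLocalStorage might do it. Here is a rough, untested sketch along those lines (the loop is your original one, with the digest hoisted into per-worker storage):

```
import std.algorithm : filter;
import std.digest.md;
import std.file : dirEntries, read, SpanMode;
import std.parallelism;
import std.stdio : writeln;

void main(string[] args)
{
    // One MD5Digest per worker thread, created once and then reused
    // for every file that worker hashes.
    auto digests = taskPool.workerLocalStorage(new MD5Digest());

    foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
    {
        auto md5 = digests.get; // this thread's digest; no per-file allocation
        md5.reset();
        md5.put(cast(const(ubyte)[]) read(d.name));
        // Plain writeln may interleave across threads; the original
        // safePrint could be used here instead.
        writeln(toHexString!(LetterCase.lower)(md5.finish()), "  ", d.name);
    }
}
```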

August 12, 2017
On Friday, 11 August 2017 at 21:58:20 UTC, Johnson wrote:
> Just a thought, maybe the GC isn't cleaning up quick enough? You are allocating and md5 digest each iteration.
>
> Possibly, an opitimization is use use a collection of md5 hashes and reuse them. e.g., pre-allocate 100(you probably only need as many as the number of parallel loops going) and then attempt to resuse them. If all are in use, wait for a free one. Might require some synchronization.

John, thanks, that was it. std.digest.md has a nifty convenience function, md5Of, that is more straightforward than the OOP version.

```
import std.algorithm : filter;
import std.digest.md;
import std.file : dirEntries, read, SpanMode;
import std.parallelism : parallel;
import std.stdio : writeln;

void main(string[] args)
{
    foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
    {
        auto data = cast(const(ubyte)[]) read(d.name);
        auto hash = md5Of(data); // returns the digest by value; no object allocation
        import std.array : split;
        string[] t = split(d.name, '/');
        writeln(toHexString(hash), "  ", t[$-1]);
    }
}
```
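
(As an aside, std.path.baseName could replace the manual split in the loop body:)

```
import std.path : baseName;
writeln(toHexString(hash), "  ", baseName(d.name));
```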

Also, I expected the performance to handily beat `md5sum`. That was only half true: as the timings below show, the wall-clock time is lower, but the total CPU time (user + sys) is almost twice as high. Is there any way to optimize this further?

```
11-08-2017 17:22:54 vaalaham ~/code/d/d-mpmc-sample
$ time find /home/arun/downloads/boost_1_64_0/ -type f | xargs md5sum >/dev/null 2>&1

real    0m1.124s
user    0m0.952s
sys     0m0.208s
11-08-2017 17:23:16 vaalaham ~/code/d/d-mpmc-sample
$ ldc2 pmd.d -O3
11-08-2017 17:23:31 vaalaham ~/code/d/d-mpmc-sample
$ time ./pmd ~/downloads/boost_1_64_0 > /dev/null

real    0m0.499s
user    0m1.596s
sys     0m0.580s
11-08-2017 17:23:37 vaalaham ~/code/d/d-mpmc-sample
$
```

strace showed lots of futex calls. Why would that be?
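
My guess (unverified): the GC. `read` allocates a whole-file buffer for every file, and GC allocations take a global lock, which would show up as futex waits under contention; the task pool's own queue synchronization adds more. A rough, untested sketch that avoids the per-file allocation by hashing in fixed-size chunks through a reusable per-worker buffer:

```
import std.algorithm : filter;
import std.digest.md;
import std.file : dirEntries, SpanMode;
import std.parallelism;
import std.stdio;

void main(string[] args)
{
    // A reusable 64 KiB read buffer per worker thread, so the hot loop
    // performs no GC allocation (and takes no GC lock).
    auto buffers = taskPool.workerLocalStorage(new ubyte[64 * 1024]);

    foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
    {
        MD5 md5; // struct digest, lives on the stack
        md5.start();
        auto f = File(d.name, "rb");
        foreach (chunk; f.byChunk(buffers.get)) // reads into the reusable buffer
            md5.put(chunk);
        writeln(toHexString(md5.finish()), "  ", d.name);
    }
}
```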