August 11, 2017 task parallelize dirEntries

I've modified the sample from tour.dlang.org to calculate the md5 digest of the files in a directory using std.parallelism.

When I run this on a dir with huge number of files, I get:

core.exception.OutOfMemoryError@src/core/exception.d(696): Memory allocation failed

Since dirEntries returns a range, I thought std.parallelism.parallel can make use of that without loading the entire file list into the memory.

What am I doing wrong here? Is there a way to achieve what I'm expecting?

```
import std.digest.md;
import std.stdio: writeln;
import std.file;
import std.algorithm;
import std.parallelism;

void printUsage()
{
    writeln("Loops through a given directory and calculates the md5 digest of each file encountered.");
    writeln("Usage: md <dirname>");
}

void safePrint(T...)(T args)
{
    synchronized
    {
        import std.stdio : writeln;
        writeln(args);
    }
}

void main(string[] args)
{
    if (args.length != 2)
        return printUsage;

    foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
    {
        auto md5 = new MD5Digest();
        md5.reset();
        auto data = cast(const(ubyte)[]) read(d.name);
        md5.put(data);
        auto hash = md5.finish();
        import std.array;
        string[] t = split(d.name, '/');
        safePrint(toHexString!(LetterCase.lower)(hash), " ", t[$-1]);
    }
}
```

August 11, 2017 Re: task parallelize dirEntries

Posted in reply to Arun Chandrasekaran

On Friday, 11 August 2017 at 21:33:51 UTC, Arun Chandrasekaran wrote:
> I've modified the sample from tour.dlang.org to calculate the
>
> [...]
RHEL 7.2 64 bit
dmd v2.075.0
ldc 1.1.0

August 11, 2017 Re: task parallelize dirEntries

Posted in reply to Arun Chandrasekaran

On Friday, 11 August 2017 at 21:33:51 UTC, Arun Chandrasekaran wrote:
> I've modified the sample from tour.dlang.org to calculate the md5 digest of the files in a directory using std.parallelism.
>
> When I run this on a dir with huge number of files, I get:
>
> core.exception.OutOfMemoryError@src/core/exception.d(696): Memory allocation failed
>
> Since dirEntries returns a range, I thought std.parallelism.parallel can make use of that without loading the entire file list into the memory.
>
> What am I doing wrong here? Is there a way to achieve what I'm expecting?
>
> ```
> import std.digest.md;
> import std.stdio: writeln;
> import std.file;
> import std.algorithm;
> import std.parallelism;
>
> void printUsage()
> {
> writeln("Loops through a given directory and calculates the md5 digest of each file encountered.");
> writeln("Usage: md <dirname>");
> }
>
> void safePrint(T...)(T args)
> {
> synchronized
> {
> import std.stdio : writeln;
> writeln(args);
> }
> }
>
> void main(string[] args)
> {
> if (args.length != 2)
> return printUsage;
>
> foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
> {
> auto md5 = new MD5Digest();
> md5.reset();
> auto data = cast(const(ubyte)[]) read(d.name);
> md5.put(data);
> auto hash = md5.finish();
> import std.array;
> string[] t = split(d.name, '/');
> safePrint(toHexString!(LetterCase.lower)(hash), " ", t[$-1]);
> }
> }
> ```
Just a thought, maybe the GC isn't cleaning up quick enough? You are allocating an MD5 digest each iteration.
Possibly, an optimization is to use a collection of MD5 hashes and reuse them, e.g., pre-allocate 100 (you probably only need as many as the number of parallel loops going) and then attempt to reuse them. If all are in use, wait for a free one. Might require some synchronization.
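
A minimal sketch of a related way to cut per-iteration allocations, rather than the pool described above: it swaps the class-based `MD5Digest` for the `MD5` struct from `std.digest.md`, which lives on the stack, and reads each file in chunks instead of loading it whole. The helper name `md5OfFile` and the 64 KiB chunk size are assumptions made for illustration; neither comes from the thread, and whether this resolves the out-of-memory error on a given directory is untested here.

```d
import std.algorithm : filter;
import std.digest.md : MD5, toHexString, LetterCase;
import std.file : dirEntries, SpanMode;
import std.parallelism : parallel;
import std.path : baseName;
import std.stdio : File, writeln;

// Hypothetical helper (not from the thread): hash one file chunk by chunk.
// chunkSize is an illustrative value, not a tuned one.
ubyte[16] md5OfFile(string path, size_t chunkSize = 64 * 1024)
{
    MD5 md5;      // struct-based digest, no `new` needed
    md5.start();  // initialize the digest state
    foreach (chunk; File(path).byChunk(chunkSize))
        md5.put(chunk);  // feed the current chunk into the digest
    return md5.finish();
}

void main(string[] args)
{
    if (args.length != 2)
    {
        writeln("Usage: md <dirname>");
        return;
    }

    auto files = dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile);
    foreach (d; parallel(files, 1))
    {
        auto hash = md5OfFile(d.name);
        writeln(toHexString!(LetterCase.lower)(hash), " ", baseName(d.name));
    }
}
```

With this shape, each loop iteration holds at most one read buffer of `chunkSize` bytes rather than the full file contents, so peak memory depends on the number of worker threads and the chunk size instead of on file sizes.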

August 12, 2017 Re: task parallelize dirEntries

Posted in reply to Johnson

On Friday, 11 August 2017 at 21:58:20 UTC, Johnson wrote:
> Just a thought, maybe the GC isn't cleaning up quick enough? You are allocating an MD5 digest each iteration.
>
> Possibly, an optimization is to use a collection of MD5 hashes and reuse them, e.g., pre-allocate 100 (you probably only need as many as the number of parallel loops going) and then attempt to reuse them. If all are in use, wait for a free one. Might require some synchronization.

John, thanks. That was it. md.d has a nifty function that is more straightforward than the OOP version.

```
// Same imports as in the original post.
import std.digest.md;
import std.stdio: writeln;
import std.file;
import std.algorithm;
import std.parallelism;

void main(string[] args)
{
    foreach (d; parallel(dirEntries(args[1], SpanMode.depth).filter!(f => f.isFile), 1))
    {
        auto data = cast(const(ubyte)[]) read(d.name);
        auto hash = md5Of(data);
        import std.array;
        string[] t = split(d.name, '/');
        writeln(toHexString(hash), " ", t[$-1]);
    }
}
```

Also I expected the performance to be faster than `md5sum`. However, that was not the case. Please see below. Is there any way to optimize this further?

```
11-08-2017 17:22:54 vaalaham ~/code/d/d-mpmc-sample
$ time find /home/arun/downloads/boost_1_64_0/ -type f | xargs md5sum >/dev/null 2>&1

real    0m1.124s
user    0m0.952s
sys     0m0.208s
11-08-2017 17:23:16 vaalaham ~/code/d/d-mpmc-sample
$ ldc2 pmd.d -O3
11-08-2017 17:23:31 vaalaham ~/code/d/d-mpmc-sample
$ time ./pmd ~/downloads/boost_1_64_0 > /dev/null

real    0m0.499s
user    0m1.596s
sys     0m0.580s
11-08-2017 17:23:37 vaalaham ~/code/d/d-mpmc-sample
$
```

strace showed lots of futex exchanges. Why would that be?

Copyright © 1999-2021 by the D Language Foundation