May 12, 2015
On Tuesday, 12 May 2015 at 16:46:42 UTC, thedeemon wrote:
> On Tuesday, 12 May 2015 at 14:59:38 UTC, Gerald Jansen wrote:
>
>> The output of /usr/bin/time is as follows:
>>
>> Lang Jobs    User  System  Elapsed %CPU
>> Py      2   79.24    2.16  0:48.90  166
>> D       2   19.41   10.14  0:17.96  164
>>
>> Py     30 1255.17   58.38  2:39.54  823 * Pool(12)
>> D      30  421.61 4565.97  6:33.73 1241
>
> The fact that most of the time is spent in the System column is quite telling. I suspect there are too many system calls from line-wise reading and writing of the files. How many lines are read and written there?

About 3.5 million lines read by main(), 0.5 to 2 million lines read and 3.5 million lines written by runTraits (aka runJob).

I have smaller datasets that I test on my laptop with a single quad-core i7. These sometimes show little increase in System time and other times a marked increase, but nothing nearly as exaggerated as with the large datasets on the server.

Gerald
May 12, 2015
On Tuesday, 12 May 2015 at 17:02:19 UTC, Gerald Jansen wrote:

> About 3.5 million lines read by main(), 0.5 to 2 million lines read and 3.5 million lines written by runTraits (aka runJob).

Each GC allocation in D is a locking operation (and disabling the GC doesn't help here at all), and probably each writeln is too, so when multiple threads try to write millions of lines this kind of contention is easy to run into. I would look for a way to write those lines without allocations and locking, and also to reduce the total number of system calls by buffering the data and doing fewer f.writef's.
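
For example, something along these lines (a sketch with a made-up writeResults helper, not code from pedupg.d) formats a whole batch in memory and hands it to the OS in one call:

import std.array : appender;
import std.format : formattedWrite;
import std.stdio : File;

// Hypothetical helper: format all lines into one in-memory buffer,
// then write the whole batch with a single rawWrite instead of one
// f.writef (and one potential lock + system call) per line.
void writeResults(File f, double[] values)
{
    auto buf = appender!(char[])();
    foreach (i, v; values)
        buf.formattedWrite("%s %s\n", i, v); // no I/O here
    f.rawWrite(buf.data); // one system call for the whole batch
}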
May 12, 2015
On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>> At the risk of great embarrassment ... here's my program:
>> http://dekoppel.eu/tmp/pedupg.d
>
> Would it be possible to give us some example data?
> I might give it a go to try rewriting it tomorrow.

http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 MB)

It contains two largish datasets in the directory structure expected by the program.
May 12, 2015
On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>> At the risk of great embarrassment ... here's my program:
>>> http://dekoppel.eu/tmp/pedupg.d
>>
>> Would it be possible to give us some example data?
>> I might give it a go to try rewriting it tomorrow.
>
> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 MB)
>
> It contains two largish datasets in the directory structure expected by the program.

I haven't had time to read the code closely. But if you disable the logging, does that change things? If so, how about having the logging done asynchronously in another thread?

And are you using optimization with gdc?
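
If it is the logging, here's a rough sketch of the asynchronous approach using std.concurrency (the file name and message protocol are invented for the example): one thread owns the log file, and the workers just send it strings, so they never block on I/O.

import std.concurrency : receive, send, spawn, Tid;
import std.stdio : File;

void loggerLoop(string path)
{
    auto f = File(path, "w");
    bool done = false;
    while (!done)
        receive(
            (string line) { f.writeln(line); }, // all I/O happens here
            (bool stop) { done = stop; }        // shutdown signal
        );
}

void main()
{
    Tid logger = spawn(&loggerLoop, "run.log");
    logger.send("job started"); // workers send instead of writing
    logger.send(true);          // tell the logger to finish
}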
May 12, 2015
On Tuesday, 12 May 2015 at 19:10:13 UTC, Laeeth Isharc wrote:
> On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
>> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
>>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>>> At the risk of great embarrassment ... here's my program:
>>>> http://dekoppel.eu/tmp/pedupg.d
>>>
>>> Would it be possible to give us some example data?
>>> I might give it a go to try rewriting it tomorrow.
>>
>> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 MB)
>>
>> It contains two largish datasets in the directory structure expected by the program.
>
> I haven't had time to read the code closely. But if you disable the logging, does that change things? If so, how about having the logging done asynchronously in another thread?
>
> And are you using optimization with gdc?

Also try byLineFast, e.g.
http://forum.dlang.org/thread/umkcjntsxchskljygcbs@forum.dlang.org#post-20130516144627.000050da:40unknown

I don't know if std.csv's CSVReader would be faster than parsing yourself, but it's worth trying.

Some tricks here, also:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
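
For what it's worth, std.csv usage looks roughly like this (a sketch; the Row fields are invented and are not the actual columns of the pedupgLarge data):

import std.algorithm : joiner;
import std.csv : csvReader;
import std.stdio : File;

struct Row { int animal; int sire; double ebv; } // hypothetical layout

void main()
{
    auto rows = File("data.csv", "r")
        .byLine         // lazy range of lines
        .joiner("\n")   // re-join into one character range
        .csvReader!Row; // parse each record into a typed struct
    foreach (row; rows)
    {
        // use row.animal, row.sire, row.ebv ...
    }
}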

May 12, 2015
On Tuesday, 12 May 2015 at 19:14:23 UTC, Laeeth Isharc wrote:
>> But if you disable the logging, does that change things?

There is only a tiny bit of logging happening.

>> And are you using optimization with gdc?

gdc -Ofast -march=native -frelease

>
> Also try byLineFast, e.g.
> http://forum.dlang.org/thread/umkcjntsxchskljygcbs@forum.dlang.org#post-20130516144627.000050da:40unknown

Thanks, I'll have a look. Performance is good for a single dataset, so I thought regular byLine would be okay.

> I don't know if std.csv's CSVReader would be faster than parsing yourself, but it's worth trying.

No, my initial experience with CSVReader was that it was not very fast:
http://forum.dlang.org/post/wklmolsqcmnagluidphu@forum.dlang.org .

> Some tricks here, also:
> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

Thanks again. I am having doubts about "d-is-for-data-science". The learning curve is very steep compared to my experience with R/Python/(Julia). But I'm trying...
May 12, 2015
On Tuesday, 12 May 2015 at 17:45:54 UTC, thedeemon wrote:
> On Tuesday, 12 May 2015 at 17:02:19 UTC, Gerald Jansen wrote:
>
>> About 3.5 million lines read by main(), 0.5 to 2 million lines read and 3.5 million lines written by runTraits (aka runJob).
>
> Each GC allocation in D is a locking operation (and disabling the GC doesn't help here at all), and probably each writeln is too, so when multiple threads try to write millions of lines this kind of contention is easy to run into. I would look for a way to write those lines without allocations and locking, and also to reduce the total number of system calls by buffering the data and doing fewer f.writef's.

Your advice is appreciated but quite disheartening. I was hoping for something (nearly) as easy to use as Python's parallel.Pool() map(), given that this is essentially an "embarrassingly parallel" problem. Avoiding GC allocation and writing my own buffered I/O functions seems a bit much to ask of a newcomer to a language.
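
For reference, the pattern I'm using is already about that terse; here's a minimal sketch of std.parallelism's parallel foreach (runJob here is a stub, not my real function):

import std.parallelism : parallel;
import std.stdio : writeln;

void runJob(string trait) { writeln("running ", trait); } // stub job

void main()
{
    auto traits = ["t1", "t2", "t3"];  // placeholder job list
    foreach (trait; parallel(traits))  // as terse as Pool().map() ...
        runJob(trait);                 // ... but threads, not processes
}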
May 12, 2015
On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>> At the risk of great embarrassment ... here's my program:
>>> http://dekoppel.eu/tmp/pedupg.d
>>
>> Would it be possible to give us some example data?
>> I might give it a go to try rewriting it tomorrow.
>
> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 MB)
>
> It contains two largish datasets in the directory structure expected by the program.

Profiling shows that your program spends most of the time reading the data.

I see a considerable speed boost with the following one-line patch (plus imports):

- foreach (line; File(pednum, "r").byLine()) {
+ foreach (line; (cast(string)read(pednum)).split('\n')) {
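
Spelled out with the imports it needs, and wrapped in a hypothetical loadFile function to make it self-contained (pednum is the path variable from pedupg.d):

import std.array : split; // split the buffer in memory
import std.file : read;   // slurp the whole file in one call

void loadFile(string pednum)
{
    // Each line is now a slice of one big buffer, instead of a
    // char[] that byLine reuses on every iteration.
    foreach (line; (cast(string) read(pednum)).split('\n'))
    {
        // ... parse line as before ...
    }
}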
May 12, 2015
On Tuesday, 12 May 2015 at 20:58:16 UTC, Vladimir Panteleev wrote:
> On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
>> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
>>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>>> At the risk of great embarrassment ... here's my program:
>>>> http://dekoppel.eu/tmp/pedupg.d
>>>
>>> Would it be possible to give us some example data?
>>> I might give it a go to try rewriting it tomorrow.
>>
>> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 MB)
>>
>> It contains two largish datasets in the directory structure expected by the program.
>
> Profiling shows that your program spends most of the time reading the data.
>
> I see a considerable speed boost with the following one-line patch (plus imports):
>
> - foreach (line; File(pednum, "r").byLine()) {
> + foreach (line; (cast(string)read(pednum)).split('\n')) {

Nice, thanks. Making that replacement at three points in the program resulted in roughly a 30% speedup, at the cost of about 30% more memory in this specific case. Unfortunately it didn't help with the performance deterioration problem with parallel foreach.
May 13, 2015
On Tuesday, 12 May 2015 at 20:50:45 UTC, Gerald Jansen wrote:

> Your advice is appreciated but quite disheartening. I was hoping for something (nearly) as easy to use as Python's parallel.Pool() map(), given that this is essentially an "embarrassingly parallel" problem. Avoiding GC allocation and writing my own buffered I/O functions seems a bit much to ask of a newcomer to a language.

You're right, these are issues with D's standard library that are not easy for a newcomer to tackle. In the case of Python's parallel.Pool(), separate processes do the work without any synchronization issues. In the case of D's std.parallelism, it's just threads inside one process, and they do fight over some locks, hence this result.
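
If you want the same isolation Python gets from processes, one workaround is to launch each job as a separate process with std.process; a rough sketch (the pedupg command line and trait names are invented):

import std.process : Pid, spawnProcess, wait;

void main()
{
    Pid[] pids;
    foreach (trait; ["trait1", "trait2", "trait3"]) // made-up jobs
        pids ~= spawnProcess(["./pedupg", trait]);  // one process per job
    foreach (pid; pids)
        wait(pid); // each child has its own GC, so no shared locks

    // Note: unlike Pool(12), this launches everything at once;
    // a real version would throttle to the core count.
}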