May 13, 2015
On Wednesday, 13 May 2015 at 14:28:52 UTC, Gerald Jansen wrote:
> On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote:
>> On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote:
>>> On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
>>>> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
>>>>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>>>>> At the risk of great embarrassment ... here's my program:
>>>>>> http://dekoppel.eu/tmp/pedupg.d
>>>>>
>>>>> Would it be possible to give us some example data?
>>>>> I might give it a go to try rewriting it tomorrow.
>>>>
>>>> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb)
>>>>
>>>> Contains two largish datasets in a directory structure expected by the program.
>>>
>>> I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there can be at most 2 tasks.
>>
>> Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck.
>>
>
>> http://dpaste.dzfl.pl/80cd36fd6796
>>
>> The key thing was reducing the number of allocations (more std.algorithm.splitter copying to static arrays, less std.array.split) and avoiding File.byLine. Other people in this thread have mentioned alternatives to it that may be faster or use less memory; I just read the whole files into memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so I added if(line.empty) continue; in a few places. You might want to look more carefully at that, as it could be my mistake.
>>
>> The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here.
>
> Wow, I'm impressed with the effort you guys (John, Rikki, others) are making to teach me some efficiency tricks. I guess this is one of the strengths of D: its community. I'm studying your various contributions closely!
>
> The empty line comes from the very last line of the files, which also end with a newline (as per "normal" practice?).

Yup, that would be it.

I added a bit of buffered writing and it actually seems to scale quite well for me now.

http://dpaste.dzfl.pl/710afe8b6df5
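
Roughly, the pattern looks like the sketch below (a simplified illustration, not the actual dpaste code; the field count, separator and file/function names are made up):

import std.algorithm : splitter;
import std.array : appender;
import std.conv : to;
import std.file : readText, write;
import std.range : empty;

void processFile(string inPath, string outPath)
{
    // Read the whole file once; splitter then yields slices of this buffer,
    // so there are no per-line/per-field allocations as with File.byLine
    // and std.array.split.
    auto text = readText(inPath);

    auto outBuf = appender!string();          // buffer all output in memory

    foreach (line; text.splitter('\n'))
    {
        if (line.empty) continue;             // trailing newline -> empty slice

        double[3] fields;                     // fixed-size, no GC allocation
        size_t i = 0;
        foreach (field; line.splitter(' '))
        {
            if (i < fields.length)
                fields[i++] = field.to!double;
        }

        // ... real per-record work with fields goes here ...
        outBuf.put(line);
        outBuf.put('\n');
    }

    write(outPath, outBuf.data);              // single buffered write at the end
}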
May 13, 2015
On Wednesday, 13 May 2015 at 12:16:19 UTC, weaselcat wrote:
> On Wednesday, 13 May 2015 at 09:01:05 UTC, Gerald Jansen wrote:
>> On Wednesday, 13 May 2015 at 03:19:17 UTC, thedeemon wrote:
>>> In the case of Python's parallel.Pool(), separate processes do the work without any synchronization issues. In the case of D's std.parallelism it's just threads inside one process, and they do fight over some locks; hence this result.
>>
>> Okay, so to do something equivalent I would need to use std.process. My next question is how to pass the common data to the sub-processes. In the Python approach I guess this is handled automatically by pickle serialization. Is there something similar in D? Alternatively, would using std.mmfile to temporarily store the common data be a reasonable approach?
>
> Assuming you're on a POSIX-compliant platform, you would just take advantage of fork()'s shared-memory model and pipes - i.e., read the data, then fork in a loop to process it, then use pipes to communicate. It ran about 3x faster for me this way, and it obviously scales with the number of workloads you have (the provided data only seems to have 2). If you could provide a larger dataset and the Python implementation, that would be great.
>
> I'm actually surprised and disappointed that there isn't a fork()-backend to std.process OR std.parallel. You have to use stdc

Okay, more studying...

The python implementation is part of a larger package so it would be a fair bit of work to provide a working version. Anyway, the salient bits are like this:

from parallel import Pool

def run_job(args):
    (job, arr1, arr2) = args
    # ... do the work for each dataset

def main():
    # ... read common data into numpy arrays arr1, arr2 and build the jobs list ...
    pool = Pool()
    pool.map(run_job, [(job, arr1, arr2) for job in jobs])
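
For reference, a rough (untested) D sketch of the fork()-per-job idea described above: read the common data once in the parent, then fork one child per job so each child sees a copy-on-write view of that data and nothing has to be serialized. runJob and the job list are placeholders, error handling and the pipes for sending results back are omitted, and fork() from a program that has already started extra threads needs care:

import core.sys.posix.sys.types : pid_t;
import core.sys.posix.sys.wait : waitpid;
import core.sys.posix.unistd : fork, _exit;

void runJob(string job, const double[] arr1, const double[] arr2)
{
    // ... do the work for one dataset, e.g. writing results to its own file ...
}

void main()
{
    // ... read the common data once, in the parent ...
    double[] arr1, arr2;
    string[] jobs = ["trait1", "trait2"];   // placeholder job list

    pid_t[] children;
    foreach (job; jobs)
    {
        auto pid = fork();
        if (pid == 0)
        {
            // Child process: gets a copy-on-write copy of the parent's data.
            runJob(job, arr1, arr2);
            _exit(0);
        }
        children ~= pid;
    }

    foreach (pid; children)
    {
        int status;
        waitpid(pid, &status, 0);           // wait for every child to finish
    }
}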
May 13, 2015
On Wednesday, 13 May 2015 at 14:43:50 UTC, John Colvin wrote:
> On Wednesday, 13 May 2015 at 14:28:52 UTC, Gerald Jansen wrote:
>> On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote:
>>> On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote:
>>>> On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
>>>>> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
>>>>>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>>>>>> At the risk of great embarrassment ... here's my program:
>>>>>>> http://dekoppel.eu/tmp/pedupg.d
>>>>>>
>>>>>> Would it be possible to give us some example data?
>>>>>> I might give it a go to try rewriting it tomorrow.
>>>>>
>>>>> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb)
>>>>>
>>>>> Contains two largish datasets in a directory structure expected by the program.
>>>>
>>>> I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there can be at most 2 tasks.
>>>
>>> Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck.
>>>
>>
>>> http://dpaste.dzfl.pl/80cd36fd6796
>>>
>>> The key thing was reducing the number of allocations (more std.algorithm.splitter copying to static arrays, less std.array.split) and avoiding File.byLine. Other people in this thread have mentioned alternatives to it that may be faster or use less memory; I just read the whole files into memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so I added if(line.empty) continue; in a few places. You might want to look more carefully at that, as it could be my mistake.
>>>
>>> The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here.
>>
>> Wow, I'm impressed with the effort you guys (John, Rikki, others) are making to teach me some efficiency tricks. I guess this is one of the strengths of D: its community. I'm studying your various contributions closely!
>>
>> The empty line comes from the very last line of the files, which also end with a newline (as per "normal" practice?).
>
> Yup, that would be it.
>
> I added a bit of buffered writing and it actually seems to scale quite well for me now.
>
> http://dpaste.dzfl.pl/710afe8b6df5

Fixed the spare '\n' problem in the file reading and added some comments.

http://dpaste.dzfl.pl/114d5a6086b7
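
(In case it helps anyone reading along: one way to deal with the spare '\n', assuming the fix is along these lines rather than exactly what the dpaste does, is to strip the trailing newline before splitting.)

import std.file : readText;
import std.string : chomp;

// chomp removes a single trailing newline, so splitter('\n') no longer
// produces a final empty element.
string readWholeFile(string path)
{
    return readText(path).chomp;
}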
May 14, 2015
John Colvin's improvements to my D program seem to have resolved the problem.
(http://forum.dlang.org/post/ydgmzhlspvvvrbeemrqf@forum.dlang.org
and http://dpaste.dzfl.pl/114d5a6086b7).

I have rerun my tests and now the picture is a bit different (see tables below).

In the middle table I have used gnu parallel in combination with a slightly modified version of the D program which runs a single trait (specified in argv[1]). This approach runs the jobs as completely isolated processes, but at the extra cost of re-reading the common data for each trait. The elapsed time is very similar whether using the parallel foreach in the D program or gnu parallel (for this particular program and these data, run on this server...). I'm guessing the program is now essentially limited by disk I/O, so this is about as good as it gets.
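
For concreteness, the single-trait version amounts to roughly this (runTrait stands in for the existing per-trait work; the binary name and the gnu parallel command in the comment are only illustrative):

import std.stdio : writeln;

void runTrait(string trait)
{
    // ... read the common data and process just this one trait ...
}

void main(string[] args)
{
    // Run exactly one trait, named on the command line, e.g.
    //   parallel ./pedupg_single ::: trait1 trait2
    if (args.length < 2)
    {
        writeln("usage: pedupg_single <trait>");
        return;
    }
    runTrait(args[1]);
}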

So, just to wrap up:
- there is a nice speed improvement over the Python program :-)
- one needs to learn a fair bit to fully benefit from D's potential
- thanks for all the help!

Gerald Jansen


Jobs __ time for D parallel foreach w. JC mods____
1     4.71user  0.56system 0:05.28elapsed   99%CPU
2     6.59user  0.96system 0:05.48elapsed  137%CPU
4    11.45user  1.94system 0:07.24elapsed  184%CPU
8    20.30user  5.18system 0:13.16elapsed  193%CPU
16   68.48user 13.87system 0:27.21elapsed  302%CPU
27   99.66user 18.73system 0:42.34elapsed  279%CPU

Jobs __ gnu parallel + D program for single job __
1     4.71user  0.56system 0:05.28elapsed   99%CPU as above
2     9.66user  1.28system 0:05.76elapsed  189%CPU
4    18.86user  3.85system 0:08.15elapsed  278%CPU
8    40.76user  7.53system 0:14.69elapsed  328%CPU
16  135.76user 20.68system 0:31.06elapsed  503%CPU
27  189.43user 28.26system 0:47.75elapsed  455%CPU

Jobs _____ time for python version _____________
1    45.39user  1.52system 0:46.88elapsed  100%CPU
2    77.76user  2.42system 0:47.16elapsed  170%CPU
4   141.28user  4.37system 0:48.77elapsed  298%CPU
8   280.45user  8.80system 0:56.00elapsed  516%CPU
16  926.05user 20.48system 1:31.36elapsed 1036%CPU
27 1329.09user 27.18system 2:11.79elapsed 1029%CPU
May 14, 2015
On Thursday, 14 May 2015 at 10:46:53 UTC, Gerald Jansen wrote:
> John Colvin's improvements to my D program seem to have resolved the problem.
> (http://forum.dlang.org/post/ydgmzhlspvvvrbeemrqf@forum.dlang.org
> and http://dpaste.dzfl.pl/114d5a6086b7).
>
> I have rerun my tests and now the picture is a bit different (see tables below).
>
> In the middle table I have used gnu parallel in combination with a slightly modified version of the D program which runs a single trait (specified in argv[1]). This approach runs the jobs as completely isolated processes, but at the extra cost of re-reading the common data for each trait. The elapsed time is very similar whether using the parallel foreach in the D program or gnu parallel (for this particular program and these data, run on this server...). I'm guessing the program is now essentially limited by disk I/O, so this is about as good as it gets.
>
> So, just to wrap up:
> - there is a nice speed improvement over the Python program :-)
> - one needs to learn a fair bit to fully benefit from D's potential
> - thanks for all the help!
>
> Gerald Jansen
>
>
> Jobs __ time for D parallel foreach w. JC mods____
> 1     4.71user  0.56system 0:05.28elapsed   99%CPU
> 2     6.59user  0.96system 0:05.48elapsed  137%CPU
> 4    11.45user  1.94system 0:07.24elapsed  184%CPU
> 8    20.30user  5.18system 0:13.16elapsed  193%CPU
> 16   68.48user 13.87system 0:27.21elapsed  302%CPU
> 27   99.66user 18.73system 0:42.34elapsed  279%CPU
>
> Jobs __ gnu parallel + D program for single job __
> 1     4.71user  0.56system 0:05.28elapsed   99%CPU as above
> 2     9.66user  1.28system 0:05.76elapsed  189%CPU
> 4    18.86user  3.85system 0:08.15elapsed  278%CPU
> 8    40.76user  7.53system 0:14.69elapsed  328%CPU
> 16  135.76user 20.68system 0:31.06elapsed  503%CPU
> 27  189.43user 28.26system 0:47.75elapsed  455%CPU
>
> Jobs _____ time for python version _____________
> 1    45.39user  1.52system 0:46.88elapsed  100%CPU
> 2    77.76user  2.42system 0:47.16elapsed  170%CPU
> 4   141.28user  4.37system 0:48.77elapsed  298%CPU
> 8   280.45user  8.80system 0:56.00elapsed  516%CPU
> 16  926.05user 20.48system 1:31.36elapsed 1036%CPU
> 27 1329.09user 27.18system 2:11.79elapsed 1029%CPU

Would it be OK if I showed some parts of this code as examples in my DConf talk in 2 weeks?
May 15, 2015
On Thursday, 14 May 2015 at 17:12:07 UTC, John Colvin wrote:

> Would it be OK if I showed some parts of this code as examples in my DConf talk in 2 weeks?

Sure!!!