May 13, 2015 Re: problem with parallel foreach
Posted in reply to Gerald Jansen

On Wednesday, 13 May 2015 at 14:28:52 UTC, Gerald Jansen wrote:
> On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote:
>> On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote:
>>> On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
>>>> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
>>>>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>>>>> At the risk of great embarrassment ... here's my program:
>>>>>> http://dekoppel.eu/tmp/pedupg.d
>>>>>
>>>>> Would it be possible to give us some example data?
>>>>> I might give it a go to try rewriting it tomorrow.
>>>>
>>>> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb)
>>>>
>>>> Contains two largish datasets in a directory structure expected by the program.
>>>
>>> I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks.
>>
>> Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck.
>>
>> http://dpaste.dzfl.pl/80cd36fd6796
>>
>> The key thing was reducing the number of allocations (more std.algorithm.splitter copying to static arrays, less std.array.split) and avoiding File.byLine. Other people in this thread have mentioned alternatives to it that may be faster or have lower memory usage; I just read the whole files into memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so I added if(line.empty) continue; in a few places. You might want to look more carefully at that; it could be my mistake.
>>
>> The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here.
>
> Wow, I'm impressed with the effort you guys (John, Rikki, others) are making to teach me some efficiency tricks. I guess this is one of the strengths of D: its community. I'm studying your various contributions closely!
>
> The empty line comes from the very last line of the files, which also ends with a newline (as per "normal" practice?).

Yup, that would be it. I added a bit of buffered writing and it actually seems to scale quite well for me now.

http://dpaste.dzfl.pl/710afe8b6df5
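For readers following along who don't know D, the read-whole-file-then-lazily-split trick (with the if(line.empty) continue; guard for the trailing newline) can be sketched in Python. This is a hypothetical illustration of the same idea, not the posted D code; the function name `lines_lazy` is invented here.

```python
def lines_lazy(text):
    # Lazily yield newline-separated chunks of an in-memory string,
    # roughly like std.algorithm.splitter in the D version.
    # Empty chunks (blank lines, and the empty chunk produced by a
    # trailing newline) are skipped, mirroring if(line.empty) continue;
    start = 0
    while True:
        end = text.find("\n", start)
        if end == -1:
            if start < len(text):
                yield text[start:]
            return
        if end > start:          # skip empty lines
            yield text[start:end]
        start = end + 1

data = "a b\n1 2\n\nlast\n"      # note blank line and trailing newline
print(list(lines_lazy(data)))    # ['a b', '1 2', 'last']
```

The laziness matters for large files: nothing is allocated per line beyond the slice being consumed, which is the allocation reduction John describes.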
May 13, 2015 Re: problem with parallel foreach
Posted in reply to weaselcat

On Wednesday, 13 May 2015 at 12:16:19 UTC, weaselcat wrote:
> On Wednesday, 13 May 2015 at 09:01:05 UTC, Gerald Jansen wrote:
>> On Wednesday, 13 May 2015 at 03:19:17 UTC, thedeemon wrote:
>>> In case of Python's parallel.Pool() separate processes do the work without any synchronization issues. In case of D's std.parallelism it's just threads inside one process and they do fight for some locks, thus this result.
>>
>> Okay, so to do something equivalent I would need to use std.process. My next question is how to pass the common data to the sub-processes. In the Python approach I guess this is automatically looked after by pickling serialization. Is there something similar in D? Alternatively, would the use of std.mmfile to temporarily store the common data be a reasonable approach?
>
> Assuming you're on a POSIX-compliant platform, you would just take advantage of fork()'s shared memory model and pipes - i.e., read the data, then fork in a loop to process it, then use pipes to communicate. It ran about 3x faster for me by doing this, and obviously scales with the workloads you have (the provided data only seems to have 2). If you could provide a larger dataset and the python implementation, that would be great.
>
> I'm actually surprised and disappointed that there isn't a fork()-backend to std.process OR std.parallel. You have to use stdc
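The fork-then-pipe pattern weaselcat describes (read the common data once, fork a child per job so the data is inherited copy-on-write, send results back through a pipe) can be sketched in Python with the standard os module. This is POSIX-only and a hedged illustration of the approach, not anyone's actual code; `run_jobs_forked` and the toy workload are invented for the example, and it assumes each child's result is small enough to fit in the pipe buffer.

```python
import os

def run_jobs_forked(jobs, work):
    # Common data already loaded in the parent is visible to every child
    # via fork()'s copy-on-write memory; one pipe per child carries the
    # result back. No pickling of the shared data is needed.
    pipes = []
    for job in jobs:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                      # child: compute, write, exit
            os.close(r)
            with os.fdopen(w, "w") as out:
                out.write(str(work(job)))
            os._exit(0)
        os.close(w)                       # parent keeps only the read end
        pipes.append((pid, r))
    results = []
    for pid, r in pipes:
        with os.fdopen(r) as inp:         # EOF when the child exits
            results.append(int(inp.read()))
        os.waitpid(pid, 0)
    return results

shared = list(range(1000))                # "common data", read once pre-fork
print(run_jobs_forked([2, 3], lambda j: sum(x * j for x in shared)))
```

Because the children are separate processes, there is no contention on the parent's GC or locks, which is the same reason the Python Pool version scales.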
Okay, more studying...
The Python implementation is part of a larger package, so it would be a fair bit of work to provide a working version. Anyway, the salient bits are like this:
from parallel import Pool

def run_job(args):
    (job, arr1, arr2) = args
    # ... do the work for each dataset

def main():
    # ... read common data and store in numpy arrays ...
    pool = Pool()
    pool.map(run_job, [(job, arr1, arr2) for job in jobs])
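The snippet above is abbreviated. Assuming the `parallel` import is a thin wrapper around Python's standard multiprocessing module (an assumption; only the import name appears above), a self-contained runnable equivalent with stand-in data might look like the following. The arrays and workload here are invented for illustration.

```python
from multiprocessing import Pool

def run_job(args):
    # Each argument tuple is pickled and shipped to a worker process;
    # this is the serialization the thread mentions.
    job, arr1, arr2 = args
    return job * (sum(arr1) + sum(arr2))   # toy stand-in for the real work

def main():
    arr1, arr2 = [1, 2, 3], [4, 5, 6]      # stand-ins for the numpy arrays
    jobs = [1, 2, 3]
    with Pool() as pool:                   # one worker process per CPU by default
        return pool.map(run_job, [(job, arr1, arr2) for job in jobs])

if __name__ == "__main__":
    print(main())
```

Note that `run_job` must be defined at module level so the workers can find it, and on platforms using the spawn start method the Pool must be created under the `if __name__ == "__main__":` guard.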
May 13, 2015 Re: problem with parallel foreach
Posted in reply to John Colvin

On Wednesday, 13 May 2015 at 14:43:50 UTC, John Colvin wrote:
> On Wednesday, 13 May 2015 at 14:28:52 UTC, Gerald Jansen wrote:
>> On Wednesday, 13 May 2015 at 13:40:33 UTC, John Colvin wrote:
>>> On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote:
>>>> On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
>>>>> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole wrote:
>>>>>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>>>>>> At the risk of great embarrassment ... here's my program:
>>>>>>> http://dekoppel.eu/tmp/pedupg.d
>>>>>>
>>>>>> Would it be possible to give us some example data?
>>>>>> I might give it a go to try rewriting it tomorrow.
>>>>>
>>>>> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb)
>>>>>
>>>>> Contains two largish datasets in a directory structure expected by the program.
>>>>
>>>> I only see 2 traits in that example, so it's hard for anyone to explore your scaling problem, seeing as there are a maximum of 2 tasks.
>>>
>>> Either way, a few small changes were enough to cut the runtime by a factor of ~6 in the single-threaded case and improve the scaling a bit, although the printing to output files still looks like a bit of a bottleneck.
>>>
>>> http://dpaste.dzfl.pl/80cd36fd6796
>>>
>>> The key thing was reducing the number of allocations (more std.algorithm.splitter copying to static arrays, less std.array.split) and avoiding File.byLine. Other people in this thread have mentioned alternatives to it that may be faster or have lower memory usage; I just read the whole files into memory and then lazily split them with std.algorithm.splitter. I ended up with some blank lines coming through, so I added if(line.empty) continue; in a few places. You might want to look more carefully at that; it could be my mistake.
>>>
>>> The use of std.array.appender for `info` is just good practice, but it doesn't make much difference here.
>>
>> Wow, I'm impressed with the effort you guys (John, Rikki, others) are making to teach me some efficiency tricks. I guess this is one of the strengths of D: its community. I'm studying your various contributions closely!
>>
>> The empty line comes from the very last line of the files, which also ends with a newline (as per "normal" practice?).
>
> Yup, that would be it.
>
> I added a bit of buffered writing and it actually seems to scale quite well for me now.
>
> http://dpaste.dzfl.pl/710afe8b6df5

Fixed the file-reading spare '\n' problem, added some comments.

http://dpaste.dzfl.pl/114d5a6086b7
May 14, 2015 Re: problem with parallel foreach
Posted in reply to Gerald Jansen

John Colvin's improvements to my D program seem to have resolved the problem
(http://forum.dlang.org/post/ydgmzhlspvvvrbeemrqf@forum.dlang.org and http://dpaste.dzfl.pl/114d5a6086b7).

I have rerun my tests and now the picture is a bit different (see tables below).

In the middle table I have used gnu parallel in combination with a slightly modified version of the D program which runs a single trait (specified in argv[1]). This approach runs the jobs as completely isolated processes, but at the extra cost of re-reading the common data for each trait. The elapsed time is very similar with the parallel foreach in the D program or using gnu parallel (for this particular program and these data run on this server...). I'm guessing the program is now essentially limited by disk I/O, so this is about as good as it gets.

So, just to wrap up:
- there is a nice speed improvement over the Python program :-)
- one needs to learn a fair bit to fully benefit from D's potential
- thanks for all the help!

Gerald Jansen

Jobs __ time for D parallel foreach w. JC mods____
1     4.71user   0.56system 0:05.28elapsed   99%CPU
2     6.59user   0.96system 0:05.48elapsed  137%CPU
4    11.45user   1.94system 0:07.24elapsed  184%CPU
8    20.30user   5.18system 0:13.16elapsed  193%CPU
16   68.48user  13.87system 0:27.21elapsed  302%CPU
27   99.66user  18.73system 0:42.34elapsed  279%CPU

Jobs __ gnu parallel + D program for single job __
1     4.71user   0.56system 0:05.28elapsed   99%CPU  (as above)
2     9.66user   1.28system 0:05.76elapsed  189%CPU
4    18.86user   3.85system 0:08.15elapsed  278%CPU
8    40.76user   7.53system 0:14.69elapsed  328%CPU
16  135.76user  20.68system 0:31.06elapsed  503%CPU
27  189.43user  28.26system 0:47.75elapsed  455%CPU

Jobs _____ time for python version _____________
1    45.39user   1.52system 0:46.88elapsed  100%CPU
2    77.76user   2.42system 0:47.16elapsed  170%CPU
4   141.28user   4.37system 0:48.77elapsed  298%CPU
8   280.45user   8.80system 0:56.00elapsed  516%CPU
16  926.05user  20.48system 1:31.36elapsed 1036%CPU
27 1329.09user  27.18system 2:11.79elapsed 1029%CPU
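The scaling claim in these tables can be made concrete as an effective speedup: n jobs run serially would take roughly n times the one-job elapsed time, so speedup is that serial estimate divided by the measured elapsed time. A small sketch using the 16-job rows above (the helper name `speedup` is invented here):

```python
def speedup(n_jobs, t1_elapsed, tn_elapsed):
    # n jobs run one after another would take about n * t1_elapsed;
    # speedup = estimated serial time / measured parallel elapsed time
    return n_jobs * t1_elapsed / tn_elapsed

# elapsed times in seconds, taken from the 1-job and 16-job table rows
d_parallel = speedup(16, 5.28, 27.21)    # D parallel foreach w. JC mods
py_version = speedup(16, 46.88, 91.36)   # Python version
print(round(d_parallel, 2), round(py_version, 2))
```

The Python version shows a larger relative speedup simply because its single-job baseline is so much slower; in absolute elapsed time the D version wins at every job count, which is consistent with the D program now being limited by disk I/O rather than computation.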
May 14, 2015 Re: problem with parallel foreach
Posted in reply to Gerald Jansen

On Thursday, 14 May 2015 at 10:46:53 UTC, Gerald Jansen wrote:
> John Colvin's improvements to my D program seem to have resolved the problem.
> (http://forum.dlang.org/post/ydgmzhlspvvvrbeemrqf@forum.dlang.org
> and http://dpaste.dzfl.pl/114d5a6086b7).
>
> I have rerun my tests and now the picture is a bit different (see tables below).
>
> In the middle table I have used gnu parallel in combination with a slightly modified version of the D program which runs a single trait (specified in argv[1]). This approach runs the jobs as completely isolated processes, but at the extra cost of re-reading the common data for each trait. The elapsed time is very similar with the parallel foreach in the D program or using gnu parallel (for this particular program and these data run on this server...). I'm guessing the program is now essentially limited by disk I/O, so this is about as good as it gets.
>
> So, just to wrap up:
> - there is a nice speed improvement over Python program :-)
> - one needs to learn a fair bit to fully benefit from D's potential
> - thanks for all the help!
>
> Gerald Jansen
>
>
> Jobs __ time for D parallel foreach w. JC mods____
> 1 4.71user 0.56system 0:05.28elapsed 99%CPU
> 2 6.59user 0.96system 0:05.48elapsed 137%CPU
> 4 11.45user 1.94system 0:07.24elapsed 184%CPU
> 8 20.30user 5.18system 0:13.16elapsed 193%CPU
> 16 68.48user 13.87system 0:27.21elapsed 302%CPU
> 27 99.66user 18.73system 0:42.34elapsed 279%CPU
>
> Jobs __ gnu parallel + D program for single job __
> 1 4.71user 0.56system 0:05.28elapsed 99%CPU as above
> 2 9.66user 1.28system 0:05.76elapsed 189%CPU
> 4 18.86user 3.85system 0:08.15elapsed 278%CPU
> 8 40.76user 7.53system 0:14.69elapsed 328%CPU
> 16 135.76user 20.68system 0:31.06elapsed 503%CPU
> 27 189.43user 28.26system 0:47.75elapsed 455%CPU
>
> Jobs _____ time for python version _____________
> 1 45.39user 1.52system 0:46.88elapsed 100%CPU
> 2 77.76user 2.42system 0:47.16elapsed 170%CPU
> 4 141.28user 4.37system 0:48.77elapsed 298%CPU
> 8 280.45user 8.80system 0:56.00elapsed 516%CPU
> 16 926.05user 20.48system 1:31.36elapsed 1036%CPU
> 27 1329.09user 27.18system 2:11.79elapsed 1029%CPU
Would it be OK if I showed some parts of this code as examples in my DConf talk in 2 weeks?
May 15, 2015 Re: problem with parallel foreach
Posted in reply to John Colvin

On Thursday, 14 May 2015 at 17:12:07 UTC, John Colvin wrote:
> Would it be OK if I showed some parts of this code as examples in my DConf talk in 2 weeks?
Sure!!!
Copyright © 1999-2021 by the D Language Foundation