Thread overview
problem with parallel foreach
Gerald Jansen (May 12, 2015)
John Colvin (May 12, 2015)
John Colvin (May 12, 2015)
Ali Çehreli (May 12, 2015)
Gerald Jansen (May 12, 2015)
Ali Çehreli (May 12, 2015)
Gerald Jansen (May 12, 2015)
Rikki Cattermole (May 12, 2015)
Gerald Jansen (May 12, 2015)
Laeeth Isharc (May 12, 2015)
Laeeth Isharc (May 12, 2015)
Gerald Jansen (May 12, 2015)
Vladimir Panteleev (May 12, 2015)
Gerald Jansen (May 12, 2015)
John Colvin (May 13, 2015)
John Colvin (May 13, 2015)
Gerald Jansen (May 13, 2015)
John Colvin (May 13, 2015)
John Colvin (May 13, 2015)
Gerald Jansen (May 13, 2015)
Gerald Jansen (May 13, 2015)
Rikki Cattermole (May 12, 2015)
thedeemon (May 12, 2015)
Gerald Jansen (May 12, 2015)
thedeemon (May 12, 2015)
Gerald Jansen (May 12, 2015)
thedeemon (May 13, 2015)
Ali Çehreli (May 13, 2015)
thedeemon (May 13, 2015)
Gerald Jansen (May 13, 2015)
weaselcat (May 13, 2015)
Gerald Jansen (May 13, 2015)
Rikki Cattermole (May 13, 2015)
Gerald Jansen (May 14, 2015)
John Colvin (May 14, 2015)
Gerald Jansen (May 15, 2015)
Gerald Jansen (May 12, 2015)
I am a data analyst trying to learn enough D to decide whether to use D for a new project rather than Python + Fortran. I have recoded a non-trivial Python program to do some simple parallel data processing (using the map function in Python's multiprocessing module and parallel foreach in D). I was very happy that my D version ran considerably faster than the Python version when running a single job, but was soon dismayed to find that the performance of my D version deteriorates rapidly beyond a handful of jobs, whereas the time for the Python version increases linearly with the number of jobs per CPU core.

The server has 4 quad-core Xeons and abundant memory compared to my needs for this task even though there are several million records in each dataset. The basic structure of the D program is:

import std.parallelism; // and other modules
void main()
{
    // ...
    // read common data and store in arrays
    // ...
    foreach (job; parallel(jobs, 1)) {
        runJob(job, arr1, arr2.dup);
    }
}
void runJob(string job, in int[] arr1, int[] arr2)
{
    // read the job-specific input data file and modify the arr2 copy
    // write the job-specific output data file
}

The output of /usr/bin/time is as follows:

Lang Jobs    User  System  Elapsed %CPU
Py      1   45.17    1.44  0:46.65   99
D       1    8.44    1.17  0:09.24  104

Py      2   79.24    2.16  0:48.90  166
D       2   19.41   10.14  0:17.96  164

Py     30 1255.17   58.38  2:39.54  823 * Pool(12)
D      30  421.61 4565.97  6:33.73 1241

(Note that the Python program was somewhat optimized with NumPy vectorization and a bit of Numba JIT compilation.)

The system time varies widely between repetitions for D with multiple jobs (e.g. from 3.8 to 21.5 seconds for 2 jobs).

Clearly my simple approach with parallel foreach has some problem(s). Any suggestions?

Gerald Jansen
John Colvin (May 12, 2015)
On Tuesday, 12 May 2015 at 14:59:38 UTC, Gerald Jansen wrote:
> [...]

Have you tried adjusting the workUnitSize argument to parallel? It should probably be 1 for such large individual tasks.
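
For reference, a minimal sketch of where that argument goes (the job list here is made up for illustration):

import std.parallelism : parallel;
import std.stdio : writeln;

void main()
{
    auto jobs = ["job1", "job2", "job3"]; // hypothetical job names

    // The second argument is the work unit size: how many consecutive
    // elements each task takes at a time. For a few long-running jobs,
    // 1 is appropriate; larger values mainly help when there are very
    // many cheap iterations.
    foreach (job; parallel(jobs, 1))
    {
        writeln("processing ", job);
    }
}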
John Colvin (May 12, 2015)
On Tuesday, 12 May 2015 at 15:11:01 UTC, John Colvin wrote:
> On Tuesday, 12 May 2015 at 14:59:38 UTC, Gerald Jansen wrote:
>> [...]
>
> Have you tried adjusting the workUnitSize argument to parallel? It should probably be 1 for such large individual tasks.

Ignore me, I missed that you had already done that.
Ali Çehreli (May 12, 2015)
On 05/12/2015 07:59 AM, Gerald Jansen wrote:

> the performance of my D version deteriorates
> rapidly beyond a handful of jobs whereas the time for the Python version
> increases linearly with the number of jobs per cpu core.

It may be related to GC collections. If it hasn't been changed recently, every allocation from the GC triggers a collection cycle. D's current GC being a stop-the-world kind, you lose all the benefit of parallel processing when that happens.

Without seeing runJob, even arr2.dup may be having such an effect on the performance.
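
As a minimal sketch of how to rule the GC out (the job list and the runJob call are placeholders, not the actual program):

import core.memory : GC;
import std.parallelism : parallel;

void main()
{
    string[] jobs;            // placeholder: the list of job names
    int[] arr1, arr2;         // placeholder: common data read in main

    GC.disable();             // no collection cycles during the parallel loop
    scope (exit) GC.enable(); // re-enable once we are done

    foreach (job; parallel(jobs, 1))
    {
        // runJob(job, arr1, arr2.dup); // note: each .dup is still a GC allocation
    }

    GC.collect();             // one explicit collection after all tasks finish
}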

Ali

Gerald Jansen (May 12, 2015)
Thanks Ali. I have tried putting GC.disable() in both main and runJob, but the timing behaviour did not change. The Python version works in a similar fashion and also has automatic GC, so I tend to think that is not the (biggest) problem.

The program is long and newbie-ugly ... but I could put it somewhere if that would help.

Gerald

On Tuesday, 12 May 2015 at 15:24:45 UTC, Ali Çehreli wrote:
> [...]

Rikki Cattermole (May 12, 2015)
On 13/05/2015 2:59 a.m., Gerald Jansen wrote:
> [...]

A couple of things come to mind at the start of the main function.

---
import core.memory : GC;
GC.disable;
GC.reserve(1024 * 1024 * 1024);
---

That will reserve 1 GB of RAM for the GC to work with. It will also stop the GC from trying to collect.

I would HIGHLY recommend giving each worker thread a preallocated block of memory to put its data into. Basically, you are being sloppy with memory allocation.

Try your best to move your processing into @nogc-annotated functions, and then do as much of it as possible without allocations.

Remember, you can pass slices of arrays around freely without allocating, as long as they are not mutated!
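
As a minimal sketch of that idea, one option is a preallocated scratch buffer per worker thread instead of an arr2.dup per iteration (the use of taskPool.workerIndex here is a suggestion, not something from the original program):

import std.parallelism : parallel, taskPool;

void runJob(string job, in int[] arr1, int[] scratch) // ideally @nogc
{
    // read the job-specific data into scratch, compute, write the output
}

void main()
{
    string[] jobs;        // placeholder: filled elsewhere
    int[] arr1, arr2;     // placeholder: common data read once in main

    // One scratch buffer per thread, allocated up front.
    // workerIndex is 0 for the main thread and 1 .. size for pool workers.
    auto buffers = new int[][](taskPool.size + 1);
    foreach (ref buf; buffers)
        buf = arr2.dup;

    foreach (job; parallel(jobs, 1))
    {
        auto scratch = buffers[taskPool.workerIndex];
        scratch[] = arr2[];          // reset the copy without a new allocation
        runJob(job, arr1, scratch);
    }
}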
Ali Çehreli (May 12, 2015)
On 05/12/2015 08:35 AM, Gerald Jansen wrote:

> I could put it somewhere if that would help.

Please do so. We all want to learn to avoid such issues.

Thank you,
Ali

Gerald Jansen (May 12, 2015)
At the risk of great embarrassment ... here's my program:

http://dekoppel.eu/tmp/pedupg.d

As per Rikki's first suggestion (thanks) I added:

import core.memory : GC;
void main()
{
    GC.disable;
    GC.reserve(1024 * 1024 * 1024);
    // ...
}

... to no avail.

Thanks for all the help so far.
Gerald

PS: I am using GDC 4.9.2 and don't have DMD on that server.
Rikki Cattermole (May 12, 2015)
On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
> [...]

Well, at least we now know that it isn't caused by memory allocation/deallocation!

Would it be possible to give us some example data?
I might give it a go to try rewriting it tomorrow.
thedeemon (May 12, 2015)
On Tuesday, 12 May 2015 at 14:59:38 UTC, Gerald Jansen wrote:

> The output of /usr/bin/time is as follows:
>
> Lang Jobs    User  System  Elapsed %CPU
> Py      2   79.24    2.16  0:48.90  166
> D       2   19.41   10.14  0:17.96  164
>
> Py     30 1255.17   58.38  2:39.54  823 * Pool(12)
> D      30  421.61 4565.97  6:33.73 1241

The fact that most of the time is spent in the System column is quite important. I suspect there are too many system calls from the line-wise reading and writing of the files. How many lines are read and written there?
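
For illustration, a sketch of cutting those calls down: read each input file in one call, split lines in memory, and write the output in one call (the function name and paths are hypothetical, not taken from pedupg.d):

import std.array : appender;
import std.file : readText, write;
import std.string : lineSplitter;

void processJob(string inPath, string outPath)
{
    auto outBuf = appender!string();

    // One read for the whole input file instead of one per line.
    foreach (line; readText(inPath).lineSplitter)
    {
        // parse the line and compute here; this sketch just echoes it
        outBuf.put(line);
        outBuf.put('\n');
    }

    // One write for the whole output file instead of one per line.
    write(outPath, outBuf.data);
}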