December 01, 2012
How does ParallelForEach work internally
How does ParallelForeach work internally?
I read that it launches multiple tasks, and each task processes a chunk 
of the range. But is each task synchronized? Do the tasks have some kind 
of communication between them, are they using shared memory, or what?

In this example code:
import std.stdio;
import std.parallelism;
import std.math;

void main() {
  auto logs = new double[10_000_000];
  double total = 0;
  foreach(i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    total += elem;
  }

  writeln(total);
}

I understand that N tasks are launched, each working on a chunk of 100 
elements from the logs array. But what happens with "total"? Is there 
only one "total", with D using memory barriers / atomic operations to 
write to it? Or does each Task have its own "total" that is later 
joined into the outer "total"?
December 01, 2012
Re: How does ParallelForEach work internally
On Saturday, 1 December 2012 at 10:35:38 UTC, Zardoz wrote:
>   auto logs = new double[10_000_000];
>   double total = 0;
>   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
>     elem = log(i + 1.0);
>     total += elem;
>   }
>
>   writeln(total);
> }
>
> I understand that N tasks are launched, each working on a chunk 
> of 100 elements from the logs array. But what happens with 
> "total"? Is there only one "total", with D using memory barriers 
> / atomic operations to write to it? Or does each Task have its 
> own "total" that is later joined into the outer "total"?

taskPool.parallel is a library function; it doesn't make the compiler 
smarter and doesn't get much help from the compiler. That means your 
"total" variable will not get any special treatment: it's still a local 
variable referenced from the loop body, which foreach turns into a 
function. This function is run by .parallel in several threads, so 
you'll get a race condition and most probably an incorrect total value. 
You should avoid writing to the same memory in a parallel foreach. 
Processing different elements of one array (even a local one) is fine; 
writing to one shared variable is not.
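
To make the race visible, here is a small sketch (not from the original 
posts) that runs the same loop and then compares the racy parallel total 
against a serial reference sum; the two almost always disagree:

import std.stdio;
import std.parallelism;
import std.math;

void main() {
  auto logs = new double[10_000_000];
  double total = 0;
  foreach(i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    total += elem;  // unsynchronized read-modify-write from several worker threads
  }

  // Serial reference sum over the already-filled array.
  double check = 0;
  foreach(elem; logs)
    check += elem;

  // The two values usually differ because updates to "total" get lost.
  writeln(total, " vs ", check);
}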
December 01, 2012
Re: How does ParallelForEach work internally
On Saturday, 1 December 2012 at 10:58:55 UTC, thedeemon wrote:
>
> taskPool.parallel is a library function; it doesn't make the 
> compiler smarter and doesn't get much help from the compiler. 
> That means your "total" variable will not get any special 
> treatment: it's still a local variable referenced from the loop 
> body, which foreach turns into a function. This function is run 
> by .parallel in several threads, so you'll get a race condition 
> and most probably an incorrect total value. You should avoid 
> writing to the same memory in a parallel foreach. Processing 
> different elements of one array (even a local one) is fine; 
> writing to one shared variable is not.

Hmm... So ParallelForeach only launches N tasks, each working on a 
slice of the range, and nothing more.

Would the previous code work better if I made "total" shared and hoped 
that D shared variables now have the internal barriers working, or do 
I need to use semaphores manually?

import std.stdio;
import std.parallelism;
import std.math;

void main() {
  auto logs = new double[10_000_000];
  shared double total = 0;
  foreach(i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    total += elem;
  }

  writeln(total);
}

PS: I know that I can use a reduction to do the same thing much 
better...
December 01, 2012
Re: How does ParallelForEach work internally
On Saturday, 1 December 2012 at 11:36:16 UTC, Zardoz wrote:

> Would the previous code work better if I made "total" shared and 
> hoped that D shared variables now have the internal barriers 
> working, or do I need to use semaphores manually?

Probably core.atomic is the way to go. A semaphore is overkill.
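
For example (just a sketch of that suggestion, assuming core.atomic's 
atomicOp accepts a shared double; this is not code from the thread):

import std.stdio;
import std.parallelism;
import std.math;
import core.atomic;

void main() {
  auto logs = new double[10_000_000];
  shared double total = 0;
  foreach(i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    atomicOp!"+="(total, elem);  // atomic read-modify-write, no lost updates
  }

  writeln(total);
}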
December 01, 2012
Re: How does ParallelForEach work internally
On Saturday, 1 December 2012 at 12:51:27 UTC, thedeemon wrote:
> On Saturday, 1 December 2012 at 11:36:16 UTC, Zardoz wrote:
>
>> Would the previous code work better if I made "total" shared 
>> and hoped that D shared variables now have the internal 
>> barriers working, or do I need to use semaphores manually?
>
> Probably core.atomic is the way to go. A semaphore is overkill.

The easiest and fastest way is probably using taskPool.reduce, 
like this:

import std.algorithm, std.math, std.parallelism, std.range, std.stdio;

auto total = taskPool.reduce!"a+b"(
    iota(10_000_000).map!(a => log(a + 1.0)));

writeln(total);

Functions in core.atomic use instructions with a lock prefix, which 
according to http://www.agner.org/optimize/instruction_tables.pdf 
"typically costs more than a hundred clock cycles", so calling them 
for every element will probably slow things down significantly. It's 
best to just avoid accessing the same memory from multiple threads 
wherever possible.
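
If you want to keep the loop form without touching shared memory in the 
body, std.parallelism also offers taskPool.workerLocalStorage: each 
worker thread accumulates into its own private sum, and you combine the 
partial sums after the loop. A sketch of that approach (my own example, 
not posted in the thread):

import std.stdio;
import std.parallelism;
import std.math;

void main() {
  auto logs = new double[10_000_000];

  // One private accumulator per worker thread; nothing is shared in the loop body.
  auto partials = taskPool.workerLocalStorage(0.0);
  foreach(i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    partials.get += elem;
  }

  // Combine the per-thread partial sums once the parallel loop has finished.
  double total = 0;
  foreach(p; partials.toRange)
    total += p;

  writeln(total);
}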