Thread overview
How does ParallelForeach work internally?
Dec 01, 2012  Zardoz
Dec 01, 2012  thedeemon
Dec 01, 2012  Zardoz
Dec 01, 2012  thedeemon
Dec 01, 2012  jerro
December 01, 2012
How does ParallelForeach work internally?
I read that it launches multiple tasks, and each task processes a chunk of the range. But is each task synchronized? Do they have some kind of communication between them, do they use shared memory, or what?

In this example code:
import std.stdio;
import std.parallelism;
import std.math;

void main() {
  auto logs = new double[10_000_000];
  double total = 0;
  foreach(i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    total += elem;
  }

  writeln(total);
}

I understand that N tasks are launched, each working on a chunk of 100 elements from the logs array. But what happens with "total"? Is there only one "total", with D using memory barriers / atomic operations to write to it? Or does each Task have its own "total" that is later joined into the outer "total"?
December 01, 2012
On Saturday, 1 December 2012 at 10:35:38 UTC, Zardoz wrote:
>   auto logs = new double[10_000_000];
>   double total = 0;
>   foreach(i, ref elem; taskPool.parallel(logs, 100)) {
>     elem = log(i + 1.0);
>     total += elem;
>   }
>
>   writeln(total);
> }
>
> I understand that N tasks are launched, each working on a chunk of 100 elements from the logs array. But what happens with "total"? Is there only one "total", with D using memory barriers / atomic operations to write to it? Or does each Task have its own "total" that is later joined into the outer "total"?

taskPool.parallel is a library function; it doesn't make the compiler smarter and doesn't get much help from the compiler. That means your "total" variable gets no special treatment: it's still a local variable referenced from the loop body, which foreach turns into a function. That function is run by .parallel in several threads, so you get a race condition and most probably an incorrect total value. You should avoid mutating the same memory in a parallel foreach. Processing different elements of one array (even a local one) is fine; writing to one variable from all the threads is not.
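
A minimal sketch of that race, using two plain threads instead of std.parallelism's machinery (__gshared stands in for the captured local, giving one memory location that both threads write):

import std.stdio;
import core.thread;

__gshared double total = 0; // a single location visible to all threads

void main() {
  // unsynchronized read-modify-write: load total, add, store it back
  void worker() {
    foreach (i; 0 .. 1_000_000)
      total += 1; // updates from the other thread can be lost in between
  }
  auto t1 = new Thread(&worker); t1.start();
  auto t2 = new Thread(&worker); t2.start();
  t1.join(); t2.join();
  writeln(total); // almost always less than the expected 2000000
}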
December 01, 2012
On Saturday, 1 December 2012 at 10:58:55 UTC, thedeemon wrote:
>
> taskPool.parallel is a library function; it doesn't make the compiler smarter and doesn't get much help from the compiler. That means your "total" variable gets no special treatment: it's still a local variable referenced from the loop body, which foreach turns into a function. That function is run by .parallel in several threads, so you get a race condition and most probably an incorrect total value. You should avoid mutating the same memory in a parallel foreach. Processing different elements of one array (even a local one) is fine; writing to one variable from all the threads is not.

Hmm... So ParallelForeach just launches N tasks, each doing work over a slice of the range, and nothing more.

Should the previous code work better if I set "total" to be shared, hoping that D's shared variables now have working internal barriers, or do I need to manually use semaphores?

import std.stdio;
import std.parallelism;
import std.math;

void main() {
  auto logs = new double[10_000_000];
  shared double total = 0;
  foreach(i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    total += elem;
  }

  writeln(total);
}

PS: I know that I could use a reduction to do the same thing much better...


December 01, 2012
On Saturday, 1 December 2012 at 11:36:16 UTC, Zardoz wrote:

> Should the previous code work better if I set "total" to be shared, hoping that D's shared variables now have working internal barriers, or do I need to manually use semaphores?

Probably core.atomic is the way to go. A semaphore is overkill.
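
For instance, the original loop with the accumulation made atomic; a sketch assuming core.atomic's atomicOp!"+=" handles shared doubles (Phobos implements ops that lack a dedicated atomic instruction as a compare-and-swap loop):

import std.stdio;
import std.parallelism;
import std.math;
import core.atomic;

void main() {
  auto logs = new double[10_000_000];
  shared double total = 0;
  foreach (i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    atomicOp!"+="(total, elem); // atomic read-modify-write, no lost updates
  }
  writeln(total);
}

Note that floating-point addition is not associative, so the parallel sum can differ in the last few bits from the sequential one depending on how the threads interleave.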
December 01, 2012
On Saturday, 1 December 2012 at 12:51:27 UTC, thedeemon wrote:
> On Saturday, 1 December 2012 at 11:36:16 UTC, Zardoz wrote:
>
>> Should the previous code work better if I set "total" to be shared, hoping that D's shared variables now have working internal barriers, or do I need to manually use semaphores?
>
> Probably core.atomic is the way to go. Semaphore is an overkill.

The easiest and fastest way is probably to use taskPool.reduce, like this:

import std.algorithm, std.math, std.parallelism, std.range, std.stdio;

void main() {
  auto total = taskPool.reduce!"a+b"(
      iota(10_000_000).map!(a => log(a + 1.0)));
  writeln(total);
}

Functions in core.atomic use instructions with the lock prefix, which according to http://www.agner.org/optimize/instruction_tables.pdf "typically costs more than a hundred clock cycles", so calling them for every element will probably slow things down significantly. It's best to just avoid accessing the same memory from multiple threads wherever possible.
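
One way to keep the explicit loop while avoiding shared writes is to give each thread a private accumulator and combine them once at the end; a sketch using taskPool.workerIndex, which is 0 for the main thread and 1 through size for the pool's workers:

import std.stdio;
import std.parallelism;
import std.math;

void main() {
  auto logs = new double[10_000_000];
  // one slot per thread: index 0 for the main thread,
  // 1 .. taskPool.size for the workers
  auto partials = new double[taskPool.size + 1];
  partials[] = 0; // a new double[] starts out as NaN

  foreach (i, ref elem; taskPool.parallel(logs, 100)) {
    elem = log(i + 1.0);
    partials[taskPool.workerIndex] += elem; // each slot has exactly one writer
  }

  double total = 0;
  foreach (p; partials)
    total += p;
  writeln(total);
}

Adjacent slots can still land on the same cache line (false sharing), which costs speed but not correctness; padding each slot to a cache line would fix that too.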