December 15, 2012
I recently ran some benchmarks with the parallel version of reduce using the example code, and I got these times with these CPUs:

AMD FX(tm)-4100 Quad-Core Processor (Kubuntu 12.04 x64):
std.algorithm.reduce   = 70294 ms
std.parallelism.reduce = 18354 ms -> SpeedUp = ~3.83

2x AMD Opteron(tm) Processor 6128, i.e. 8 cores x 2 = 16 cores! (Rocks 6.0 x64):
std.algorithm.reduce   = 98323 ms
std.parallelism.reduce = 6592 ms  -> SpeedUp = ~14.91

My congrats to std.parallelism and the D language!
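The speedup tracking the core count is what you would expect, since the default taskPool sizes itself from the number of logical cores it detects. A minimal sketch for checking this on a given machine (using totalCPUs, taskPool.size, and defaultPoolThreads from std.parallelism):

import std.parallelism, std.stdio;

void main() {
  // totalCPUs is the number of logical cores the runtime detects.
  // The default taskPool uses totalCPUs - 1 worker threads, with the
  // main thread also doing work, so the available parallelism
  // roughly matches the core count.
  writeln("Logical cores: ", totalCPUs);
  writeln("Pool workers:  ", taskPool.size);

  // The pool size can be overridden before taskPool is first used,
  // e.g. to run the benchmark with a fixed number of threads:
  // defaultPoolThreads = 8;
}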

Source code, compiled with gdc 4.6.3 with the -O2 flag:
import std.algorithm, std.parallelism, std.range;
import std.stdio;
import std.datetime;

void main() {
  // Parallel reduce can be combined with std.algorithm.map to interesting
  // effect. The following example (thanks to Russel Winder) calculates
  // pi by quadrature using std.algorithm.map and TaskPool.reduce.
  // getTerm is evaluated in parallel as needed by TaskPool.reduce.
  //
  // Timings on an Athlon 64 X2 dual core machine:
  //   TaskPool.reduce:      12.170 s
  //   std.algorithm.reduce: 24.065 s

  immutable n = 1_000_000_000;
  immutable delta = 1.0 / n;
  real getTerm(int i) {
    immutable x = ( i - 0.5 ) * delta;
    return delta / ( 1.0 + x * x ) ;
  }

  StopWatch sw;
  sw.start(); // start/resume measuring.
  immutable pi = 4.0 * taskPool.reduce!"a + b"( std.algorithm.map!getTerm(iota(n)) );
  //immutable pi = 4.0 * std.algorithm.reduce!"a + b"( std.algorithm.map!getTerm(iota(n)) );
  sw.stop();

  writeln("PI = ", pi);
  writeln("Tiempo = ", sw.peek().msecs, "[ms]");
}