February 01, 2013 Not able to get scaled performance on increasing number of threads

I am parallelizing a program which follows this structure:

    for iter = 1 to MAX_ITERATION
    {
        myLocalBarrier = new Barrier(numberOfThreads+1);
        for i = 1 to numberOfThreads
        {
            spawn(&myFunc, args)
        }
    }

February 01, 2013 Not able to get scaled performance on increasing number of threads

Posted in reply to Sparsh Mittal

It got posted before I completed it! Sorry.

I am parallelizing a program which follows this structure:

    immutable int numberOfThreads = 2

    for iter = 1 to MAX_ITERATION
    {
        myLocalBarrier = new Barrier(numberOfThreads+1);
        for i = 1 to numberOfThreads
        {
            spawn(&myFunc, args)
        }
        myLocalBarrier.wait();
    }

    void myFunc(args)
    {
        //do the task
        myLocalBarrier.wait()
    }

When I run it and compare this parallel version with its serial version, I only get a speedup of less than 1.3 for 2 threads. When I write the same program in Go, scaling is nearly 2.

Also, in D, on doing "top", I see the usage as only 130% CPU and not nearly 200% or 180%. So I was wondering if I am doing it properly. Please help me.
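
To make it concrete, here is a minimal compilable sketch of the pattern I am using (the worker body is empty and the names are just placeholders, not my actual program):

    import std.concurrency;
    import core.sync.barrier;

    immutable int numberOfThreads = 2;
    immutable int MAX_ITERATION = 10;

    // Shared across threads on purpose; a new barrier is created each iteration.
    __gshared Barrier myLocalBarrier;

    void myFunc(int id)
    {
        // ... do this thread's share of the work ...
        myLocalBarrier.wait();   // signal that this worker is done
    }

    void main()
    {
        for (int iter = 1; iter <= MAX_ITERATION; iter++)
        {
            myLocalBarrier = new Barrier(numberOfThreads + 1);
            for (int i = 0; i < numberOfThreads; i++)
                spawn(&myFunc, i);
            myLocalBarrier.wait();   // main waits until all workers hit the barrier
        }
    }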

February 01, 2013 Re: Not able to get scaled performance on increasing number of threads

Posted in reply to Sparsh Mittal

01-Feb-2013 19:42, Sparsh Mittal writes:
> It got posted before I completed it! Sorry.
>
> I am parallelizing a program which follows this structure:
>
>     immutable int numberOfThreads = 2
>
>     for iter = 1 to MAX_ITERATION
>     {
>         myLocalBarrier = new Barrier(numberOfThreads+1);
>         for i = 1 to numberOfThreads
>         {
>             spawn(&myFunc, args)
>         }
>         myLocalBarrier.wait();
>     }
>
>     void myFunc(args)
>     {
>         //do the task
>         myLocalBarrier.wait()
>     }
>
> When I run it and compare this parallel version with its serial
> version, I only get a speedup of less than 1.3 for 2 threads. When I
> write the same program in Go, scaling is nearly 2.
>
> Also, in D, on doing "top", I see the usage as only 130% CPU and not
> nearly 200% or 180%. So I was wondering if I am doing it properly.
> Please help me.

Can't tell much without the whole source, or at least a compilable standalone piece. The "//do the task" part is critical to understanding, as is the declaration of myLocalBarrier.

Also, why not use std.parallelism?

--
Dmitry Olshansky

February 01, 2013 Re: Not able to get scaled performance on increasing number of threads

Posted in reply to Dmitry Olshansky

>
> Can't tell much without the whole source or at least compilable standalone piece.
Give me a moment. I will post.

February 01, 2013 Re: Not able to get scaled performance on increasing number of threads

Posted in reply to Sparsh Mittal

Here is the code:

    #!/usr/bin/env rdmd
    import std.stdio;
    import std.concurrency;
    import core.thread;
    import std.datetime;
    import std.conv;
    import core.sync.barrier;

    immutable int gridSize = 256;
    immutable int MAXSTEPS = 50000;      /* Maximum number of iterations */
    immutable double TOL_VAL = 0.00001;  /* Numerical Tolerance */
    immutable double omega = 0.376;
    immutable double one_minus_omega = 1.0 - 0.376;
    immutable int numberOfThreads = 2;

    double MAX_FUNC(double a, double b) { return a > b ? a : b; }
    double ABS_VAL(double a) { return a > 0 ? a : -a; }

    __gshared Barrier myLocalBarrier = null;

    shared double[gridSize+2][gridSize+2] gridInfo;
    shared double maxError = 0.0;

    void main(string args[])
    {
        for(int i=0; i<gridSize+2; i++)
        {
            for(int j=0; j<gridSize+2; j++)
            {
                if(i==0)
                    gridInfo[i][j] = 1.0;
                else
                    gridInfo[i][j] = 0.0;
            }
        }

        bool shouldCheck = false;
        bool isConverged = false;

        for(int iter = 1; iter <= MAXSTEPS; iter++)
        {
            shouldCheck = false;
            if(iter % 400 == 0)
            {
                shouldCheck = true;
                maxError = 0.0;
            }

            //This is Phase 1
            {
                myLocalBarrier = new Barrier(numberOfThreads+1);
                for (int cc=0; cc<numberOfThreads; cc++)
                {
                    spawn(&SolverSlave, thisTid, cc, 0, shouldCheck);
                }
                myLocalBarrier.wait();
            }

            //This is Phase 2
            {
                myLocalBarrier = new Barrier(numberOfThreads+1);
                for (int cc=0; cc<numberOfThreads; cc++)
                {
                    spawn(&SolverSlave, thisTid, cc, 1, shouldCheck);
                }
                myLocalBarrier.wait();
            }

            if( maxError < TOL_VAL)
            {
                isConverged = true;
                break;
            }
        }

        if(isConverged)
            writeln("It converged");
        else
            writeln("It did not converge");
    }

    void SolverSlave(Tid owner, int myNumber, int remainder, bool shouldCheckHere)
    {
        double sum = 0;

        //Divide task among threads
        int iStart = ((myNumber*gridSize)/numberOfThreads) + 1;
        int iEnd   = (((myNumber+1)*gridSize)/numberOfThreads);

        for(int i=iStart; i<= iEnd; i++)
        {
            for(int j=1; j< gridSize+1; j++)
            {
                if( ((i+j)%2 == remainder) ) //Phase 1 or 2
                {
                    sum = ( gridInfo[i  ][j+1] + gridInfo[i+1][j  ] +
                            gridInfo[i-1][j  ] + gridInfo[i  ][j-1] )*0.25;

                    //Should not check everytime to reduce synchronization overhead
                    if(shouldCheckHere)
                    {
                        maxError = MAX_FUNC(ABS_VAL(omega *(sum-gridInfo[i][j])), maxError);
                    }

                    gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum;
                }
            }
        }

        myLocalBarrier.wait();
    }

February 01, 2013 Re: Not able to get scaled performance on increasing number of threads

Posted in reply to Sparsh Mittal

On 2013-02-01 16:42, Sparsh Mittal wrote:
> When I run it, and compare this parallel version with its serial version, I only
> get speedup of nearly <1.3 for 2 threads. When I write same program in Go,
> scaling is nearly 2.
>
> Also, in D, on doing "top", I see the usage as only 130% CPU and not nearly 200%
> or 180%. So I was wondering, if I am doing it properly. Please help me.

Probably because SolverSlave doesn't have enough work to do to justify splitting it across threads, given the overhead of spawning them and of the barriers. Like Dmitry wrote, std.parallelism may be a better tool for the job.
I've tested your code on 2 cores (but put the whole main() in a loop).
It's taking about 82% of both cores.
After increasing gridSize to 1024 it was using 88% of the CPU.
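
To illustrate what that could look like, here is an untested sketch of one red-black phase done with a parallel foreach over the rows (the relaxPhase name, the fixed iteration count and the omitted convergence check are mine, not from your program):

    import std.parallelism : parallel;
    import std.range : iota;

    enum gridSize = 1024;
    immutable double omega = 0.376;
    immutable double one_minus_omega = 1.0 - omega;
    shared double[gridSize + 2][gridSize + 2] gridInfo;

    // One red-black phase; the task pool splits the rows across its worker
    // threads, so there is no per-iteration spawn and no hand-rolled barrier.
    void relaxPhase(int remainder)
    {
        foreach (i; parallel(iota(1, gridSize + 1)))
        {
            for (int j = 1; j < gridSize + 1; j++)
            {
                if ((i + j) % 2 == remainder)
                {
                    double sum = (gridInfo[i][j+1] + gridInfo[i+1][j] +
                                  gridInfo[i-1][j] + gridInfo[i][j-1]) * 0.25;
                    gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum;
                }
            }
        }
    }

    void main()
    {
        // Same initial condition as in your program: top row 1.0, rest 0.0.
        foreach (i; 0 .. gridSize + 2)
            foreach (j; 0 .. gridSize + 2)
                gridInfo[i][j] = (i == 0) ? 1.0 : 0.0;

        foreach (iter; 0 .. 100)
        {
            relaxPhase(0);   // phase 1
            relaxPhase(1);   // phase 2
        }
    }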

February 01, 2013 Re: Not able to get scaled performance on increasing number of threads

Posted in reply to Sparsh Mittal

01-Feb-2013 20:08, Sparsh Mittal writes:
> Here is the code:
>

My reiteration of it, with a bit of help from std.parallelism. std.parallelism uses a thread pool, thus it's somewhat faster than creating threads anew. Still, it's instantaneous for me, in the range of 30-40 ms, even with a grid size of 1024 and 5M iterations.

Have you enabled all of the optimizations? Correct switches are:

    dmd -inline -O -release optimize_me.d

or, to build and run in one go:

    rdmd -inline -O -release optimize_me.d

    import std.stdio;
    import std.parallelism;
    import std.datetime;
    import std.conv;

    immutable int gridSize = 1024;
    immutable int MAXSTEPS = 5000_000;   /* Maximum number of iterations */
    immutable double TOL_VAL = 0.00001;  /* Numerical Tolerance */
    immutable double omega = 0.376;
    immutable double one_minus_omega = 1.0 - 0.376;
    immutable int numberOfThreads = 2;

    double MAX_FUNC(double a, double b) { return a > b ? a : b; }
    double ABS_VAL(double a) { return a > 0 ? a : -a; }

    shared double[gridSize+2][gridSize+2] gridInfo;
    shared double maxError = 0.0;

    void main(string args[])
    {
        for(int i=0; i<gridSize+2; i++)
        {
            for(int j=0; j<gridSize+2; j++)
            {
                if(i==0)
                    gridInfo[i][j] = 1.0;
                else
                    gridInfo[i][j] = 0.0;
            }
        }

        bool shouldCheck = false;
        bool isConverged = false;

        for(int iter = 1; iter <= MAXSTEPS; iter++)
        {
            shouldCheck = false;
            if(iter % 400 == 0)
            {
                shouldCheck = true;
                maxError = 0.0;
            }

            alias MyTask = typeof(task!(SolverSlave)(0, 0, false));

            //This is Phase 1
            {
                MyTask[numberOfThreads] tasks;
                foreach(cc; 0..numberOfThreads)
                {
                    tasks[cc] = task!(SolverSlave)(cc, 0, shouldCheck);
                    taskPool.put(tasks[cc]);
                }
                foreach(cc; 0..numberOfThreads)
                    tasks[cc].yieldForce();
            }

            //This is Phase 2
            {
                MyTask[numberOfThreads] tasks;
                foreach(cc; 0..numberOfThreads)
                {
                    tasks[cc] = task!(SolverSlave)(cc, 1, shouldCheck);
                    taskPool.put(tasks[cc]);
                }
                foreach(cc; 0..numberOfThreads)
                    tasks[cc].yieldForce();
            }

            if( maxError < TOL_VAL)
            {
                isConverged = true;
                break;
            }
        }

        /*if(isConverged)
            writeln("It converged");
        else
            writeln("It did not converge");*/
    }

    void SolverSlave(int myNumber, int remainder, bool shouldCheckHere)
    {
        double sum = 0;

        //Divide task among threads
        int iStart = ((myNumber*gridSize)/numberOfThreads) + 1;
        int iEnd   = (((myNumber+1)*gridSize)/numberOfThreads);

        for(int i=iStart; i<= iEnd; i++)
        {
            for(int j=1; j< gridSize+1; j++)
            {
                if( ((i+j)%2 == remainder) ) //Phase 1 or 2
                {
                    sum = ( gridInfo[i  ][j+1] + gridInfo[i+1][j  ] +
                            gridInfo[i-1][j  ] + gridInfo[i  ][j-1] )*0.25;

                    //Should not check everytime to reduce synchronization overhead
                    if(shouldCheckHere)
                    {
                        maxError = MAX_FUNC(ABS_VAL(omega *(sum-gridInfo[i][j])), maxError);
                    }

                    gridInfo[i][j] = one_minus_omega*gridInfo[i][j] + omega*sum;
                }
            }
        }
    }

--
Dmitry Olshansky
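
P.S. If you want to double-check your timings once it's built with -O -release, a rough way (just a sketch, not the exact code I used) is to wrap the solver loop like this:

    import std.datetime;
    import std.stdio;

    void main()
    {
        auto started = Clock.currTime();

        // ... run the solver loop from the program above here ...

        auto elapsed = Clock.currTime() - started;
        writeln("elapsed: ", elapsed.total!"msecs", " ms");
    }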

February 01, 2013 Re: Not able to get scaled performance on increasing number of threads

Posted in reply to Dmitry Olshansky

On 2013-02-01 20:33, Dmitry Olshansky wrote:
> My reiteration of it, with a bit of help from std.parallelism.
> std.parallelism uses a thread pool, thus it's somewhat faster than
> creating threads anew.
Interestingly, threads+barrier here wasn't much slower than tasks:
14% slower for dmd32, only 5% for gdc64
(and taskpool in dmd 13% slower than in gdc).

February 01, 2013 Re: Not able to get scaled performance on increasing number of threads

Posted in reply to Dmitry Olshansky

Excellent. Thank you so much for your suggestion and code. It now produces near linear speedup.