Good demo for showing benefits of parallelism (page 2)

January 27, 2007

Re: Good demo for showing benefits of parallelism

Posted by Philipp Spöth
in reply to Kevin Bealer

Permalink

Philipp Spöth

Posted in reply to Kevin Bealer

Permalink

I did the good, old, elaborate "remove/insert everything until you
locate the problem" technique. And finaly speed improoved to about 9,6
seconds for the testscene with one thread (down from 39 seconds).
Also threading behaviour is reasonable now. Two threads do the job
in 5,4 seconds.

I had to change a couple of things. To many to list here, but when
the following lines are in - the strange threading behaviour is in as well:

Vec4d	Qo;
[snip]
Vec3d Qo3;
Qo3 = [Qo[0],Qo[1],Qo[2]];


When the last line is replaced with these - everything is fine:
Qo3[0] = Qo[0];
Qo3[1] = Qo[1];
Qo3[2] = Qo[2];


So the problem seems to be related to the vector assign operator which is defined like this:

alias Vector!(double,3u) Vec3d;
alias Vector!(double,4u) Vec4d;
struct Vector(T, uint dim)
{
	alias Vector!(T,dim) MyType;

	MyType opAssign(T[dim] d)
	{
		data[] = d;
		return *this;
	}
}

I tried some variations but didn't find any further hints to what causes the problem. Little curious side note - if the operator is declared directly instead of using 'MyType' the threading behaviour is still wrong but the program runs much faster.

This is on windows with dmd v1.004 Any ideas?



Kevin Bealer Wrote:

> 
> That's really wierd.  Unless you are actually straining your L1 or L2 cache (or there is as you say, really bad implicit synchonization), it's hard to see what could do this.
> 
> I think 'valgrind' has some skins that deal with this sort of analysis, specifically, cachegrind, callgrind and helgrind, but helgrind requires version 2.2 (or earlier) and it looks like valgrind is crashing for me with dmd-produced binaries -- maybe I need to set up gdc and try it that way.
> 
> Kevin
> 
> Philipp Spöth wrote:
> > Kevin is right with his optimization ideas of course. But there is something else going on there. The native approach already
> > should  gain way more speed than Jascha describes. And even worse on my Core
> Duo on windows the tracer drastically looses time
>  > with more threads. With one thread on a 640x480 image it takes 11
> seconds. With two threads it takes 86 seconds!
> > 
> > When looking at the process manager with 2 threads it shows that cpu usage is constantly only somehwere between 15% and 35%.
> > Increasing priority of the threads in D doen't seem to do anything. Raising Priority in the process manager lifts cpu usage
>  > to near 100%. However it doesn't make the progam finish any faster.
> > 
> > There must be some implizit synchronising involved though I have no idea where and why.
> > 
> >  Jascha Wetzel <[firstname]@mainia.de> Wrote:
> > 
> > 
> >> Kevin Bealer wrote:
> >>> [...]
> >>> The goal is for all the threads to finish at about the same time.  If
> >>> they don't you end up with some threads waiting idly at the end.
> >>>
> >>> This would require reintroducing a little synchronization. [...]
> >> that's all very reasonable, thx.
> >> i changed the job distribution by dividing the image into a grid,
> >> assigning (equally sized) cells to threads in an extending spiral
> >> starting in the middle of the image. this is due to the assumption that
> >> the hardest cells are around the middle (which is the case for the test
> >> scene).
> >> the distribution of those cells is of course synchronized.
> >> new sources: http://mainia.de/prt_grid.zip
> >>
> >> now everything is faster, even with one thread. but the speedup with more threads decreased (time stays about the same for 1-16 threads). still, i haven't quite "felt" the multicore performance.
> >>
> >> meanwhile, a friend tells me, that his speedups with image processing algorithms and naive block assignment are close to 200%...
> >>
> >> atm i suspect there's a more technical reason to that, something with the threading lib or the like...
> > 
>

Philipp Spöth wrote:
> I did the good, old, elaborate "remove/insert everything until you
> locate the problem" technique. And finaly speed improoved to about 9,6
> seconds for the testscene with one thread (down from 39 seconds).
> Also threading behaviour is reasonable now. Two threads do the job
> in 5,4 seconds.
> 
> I had to change a couple of things. To many to list here, but when
> the following lines are in - the strange threading behaviour is in as well:
> 
> Vec4d	Qo;
> [snip]
> Vec3d Qo3;
> Qo3 = [Qo[0],Qo[1],Qo[2]];
> 
> 
> When the last line is replaced with these - everything is fine:
> Qo3[0] = Qo[0];
> Qo3[1] = Qo[1];
> Qo3[2] = Qo[2];
> 
> 
> So the problem seems to be related to the vector assign operator

I don't think it is, at least not directly. It's probably related to the [Qo[0],Qo[1],Qo[2]] expression, which allocates a 3-element array _on the heap_. AFAIK to allocate on the heap a mutex needs to be locked, which might explain bad threading behavior.

Forums