Good demo for showing benefits of parallelism
January 26, 2007
I don't remember where it was, but somebody in some thread said something like gee it would be great if we had a demo that showed the benefits of having parallel threads.  (It was probably in either the Futures lib thread or in the discussion about varargs_reduce).

Anyway, a raytracer is a perfect example of "embarrassingly parallelizable" code.  In the simplest case you can state it simply as "for each pixel do trace_ray(ray_dir)".
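
As a rough illustration (not anyone's actual code; Color, trace_ray, width and height are made-up stand-ins), that whole top-level loop can be a parallel foreach in D using std.parallelism:

import std.parallelism : parallel;
import std.range : iota;

struct Color { float r, g, b; }

// stand-in for the real tracer
Color trace_ray(size_t x, size_t y) { return Color(0, 0, 0); }

void render(Color[] pixels, size_t width, size_t height)
{
    // every row is independent, so the outer loop can run on a thread pool
    foreach (y; parallel(iota(height)))
        foreach (x; 0 .. width)
            pixels[y * width + x] = trace_ray(x, y);
}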

Bradley Smith posted code for a little raytracer over on D.learn under the heading "Why is this code slower than C++".  If anyone is interested in turning this into a parallel raytracer, that would be a nice little demo.

Along the same lines, almost any image processing algorithm is in the same category where the top loop looks like "foreach pixel do ___". This is a big part of why GPUs just keep getting faster and faster, because the basic problem is just so inherently parallelizable.

--bb
January 26, 2007
Bill Baxter wrote:
> [...]
> Along the same lines, almost any image processing algorithm is in the same category where the top loop looks like "foreach pixel do ___". This is a big part of why GPUs just keep getting faster and faster, because the basic problem is just so inherently parallelizable.

Yeah, it's funny that people are so impressed with 2, 4, even 32 cores when the nVidia 8800 has 48 pixel pipelines and 128 shaders.  That's also why some people use GPUs as math coprocessors (and why some video cards require a dedicated power supply cord!).

Speaking of threading and parallelization, when is D going to get Transactional Memory?  Could it steal it from here: http://en.wikipedia.org/wiki/Transactional_memory#Implementations? There's http://libcmt.sourceforge.net/ and https://sourceforge.net/projects/libltx.  Surely something can be kiped?

Dave
January 27, 2007
i had a simple, unoptimized raytracer from an old CG assignment lying around that i ported to D and modified for parallel computation. download here: http://mainia.de/prt.zip

on a single core/cpu machine, the time stays about the same until 4 threads, then the overhead kicks in.
here are some values from an opteron dual core system running linux kernel 2.6 for different thread counts:
thrds seconds
1     32.123
2     32.182
3     29.329
4     28.556
8     21.661
16    20.186
24    20.423
32    21.410

these aren't quite what i expected. CPU usage shows that both cores get about 55-80% load with 2 threads, stabilizing at 65-75% with 16 threads. with a single thread it's clearly 100% on one core.

am i missing something about the pthread lib or memory usage/sync that prevents reasonable speedup? 160% is nice, but why does it take 16 threads to get there? and where exactly do the remaining 40% go?

Bill Baxter wrote:
> I don't remember where it was, but somebody in some thread said something like gee it would be great if we had a demo that showed the benefits of having parallel threads.  (It was probably in either the Futures lib thread or in the discussion about varargs_reduce).
> 
> Anyway, a raytracer is a perfect example of "embarrasingly parallelizeable" code.  In the simplest case you can state it simply as "for each pixel do trace_ray(ray_dir)".
> 
> Bradley Smith posted code for a little raytracer over on D.learn under
> the heading "Why is this code slower than C++".  If anyone is interested
> in turning this into a parallel raytracer that would be a nice little demo.
> 
> Along the same lines, almost any image processing algorithm is in the same category where the top loop looks like "foreach pixel do ___". This is a big part of why GPUs just keep getting faster and faster, because the basic problem is just so inherently parallelizable.
> 
> --bb
January 27, 2007
Jascha Wetzel wrote:
> i had a simple, unoptimized raytracer from an old CG assignment lying
> around that i ported to D and modified for parallel computation.
> download here: http://mainia.de/prt.zip
> 
> on a single core/cpu machine, the time stays about the same until 4
> threads, then the overhead kicks in.
> here are some values from an opteron dual core system running linux
> kernel 2.6 for different thread counts:
> thrds seconds
> 1     32.123
> 2     32.182
> 3     29.329
> 4     28.556
> 8     21.661
> 16    20.186
> 24    20.423
> 32    21.410
> 
> these aren't quite what i expected. CPU usage shows that both cores get
> about 55-80% load with 2 threads, stablilizing at 65-75% with 16
> threads. with a single thread it's clearly 100% on one core.
> 
> am i missing something about the pthread lib or memory usage/sync that
> prevents reasonable speedup? 160% is nice, but why does it take 16
> threads to get there? and where exactly do the remaining 40% go?
> 
> Bill Baxter wrote:
>> I don't remember where it was, but somebody in some thread said
>> something like gee it would be great if we had a demo that showed the
>> benefits of having parallel threads.  (It was probably in either the
>> Futures lib thread or in the discussion about varargs_reduce).
>>
>> Anyway, a raytracer is a perfect example of "embarrasingly
>> parallelizeable" code.  In the simplest case you can state it simply as
>> "for each pixel do trace_ray(ray_dir)".
>>
>> Bradley Smith posted code for a little raytracer over on D.learn under
>> the heading "Why is this code slower than C++".  If anyone is interested
>> in turning this into a parallel raytracer that would be a nice little demo.
>>
>> Along the same lines, almost any image processing algorithm is in the
>> same category where the top loop looks like "foreach pixel do ___". This
>> is a big part of why GPUs just keep getting faster and faster, because
>> the basic problem is just so inherently parallelizable.
>>
>> --bb

Do you know if there was much lock contention?  If there is a lot, then you can lose efficiency because the threads bump into each other like a traffic jam.  Failing to get a lock (pausing and waking up again when it frees up) is a lot slower than getting it successfully on the first try.

There was some code I had at work where I do something like this:

Mutex.lock();

    int length = (end-begin)*4;

    char * p = (finds a byte location in an mmapped file);

    int remainder = *p;  // A
    length += remainder; // A

Mutex.unlock();

I was able to improve performance measurably by moving the unlock() to a point before the lines marked "A".

It turns out the *p was (often) causing a soft page fault, which caused the thread to stall, which caused other threads to pile up on the mutex.

If you can minimize your critical sections, you can reduce contention. The ideal thing is to use a critical section to swap pointers or something else that can't take much time.  If a memory read can trigger a page fault, you can sometimes do it *before* the critical section, then get the lock and get the 'correct' value.  This way the memory page is "known to be warm" by the time you are in the lock, even if the actual value may change.
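
In code, that pattern looks roughly like this (a sketch only, not the original source; mtx, p, begin and end stand in for the real variables):

import core.sync.mutex : Mutex;

int computeLength(Mutex mtx, const(ubyte)* p, int begin, int end)
{
    int warm = *p;              // possible soft page fault happens before the lock

    mtx.lock();
    scope (exit) mtx.unlock();  // keep the critical section as small as possible

    int length = (end - begin) * 4;
    length += *p;               // page is warm now; read the current value
    return length;
}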

[ But this kind of fiddling around is pretty low level, so it won't have any effect most of the time -- make sure to have the right problem before fixing it. ]

Kevin
January 27, 2007
Bill Baxter wrote:
> I don't remember where it was, but somebody in some thread said something like gee it would be great if we had a demo that showed the benefits of having parallel threads.  (It was probably in either the Futures lib thread or in the discussion about varargs_reduce).
> 
> Anyway, a raytracer is a perfect example of "embarrasingly parallelizeable" code.  In the simplest case you can state it simply as "for each pixel do trace_ray(ray_dir)".
> 
> Bradley Smith posted code for a little raytracer over on D.learn under the heading "Why is this code slower than C++".  If anyone is interested in turning this into a parallel raytracer that would be a nice little demo.
> 
> Along the same lines, almost any image processing algorithm is in the same category where the top loop looks like "foreach pixel do ___". This is a big part of why GPUs just keep getting faster and faster, because the basic problem is just so inherently parallelizable.
> 
> --bb

I think it was me -- thanks for the pointer, I may have a look at this.

Kevin
January 27, 2007
Kevin Bealer wrote:
> Do you know if there was much lock contention?

there is no explicit synchronization at all. the nice thing about raytracing and similar algorithms is that you can use disjoint memory and let the threads work isolated from each other. except for the main thread that waits for the workers to terminate, there are no mutexes involved.
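
The overall shape of such a setup in D (not the actual prt.zip code; this reuses the made-up Color/trace_ray stand-ins from the sketch near the top of the thread) is just "start workers on disjoint row bands, then join them":

import core.thread : Thread;

// one worker renders the half-open row band [lo, hi) into its own part of the buffer
Thread makeWorker(Color[] pixels, size_t width, size_t lo, size_t hi)
{
    return new Thread({
        foreach (y; lo .. hi)
            foreach (x; 0 .. width)
                pixels[y * width + x] = trace_ray(x, y);
    });
}

void renderWith(size_t nThreads, Color[] pixels, size_t width, size_t height)
{
    Thread[] workers;
    immutable rows = (height + nThreads - 1) / nThreads;
    foreach (t; 0 .. nThreads)
    {
        immutable lo = t * rows;
        immutable hi = lo + rows < height ? lo + rows : height;
        if (lo < hi)
            workers ~= makeWorker(pixels, width, lo, hi);
    }
    foreach (w; workers) w.start();
    foreach (w; workers) w.join();  // the main thread only waits; no other synchronization
}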
January 27, 2007
Jascha Wetzel wrote:
> Kevin Bealer wrote:
>> Do you know if there was much lock contention?
> 
> there is no explicit synchronization at all. the nice thing about
> raytracing and similar algorithms is, that you can use disjoint memory
> and let the threads work isolated from each other. except for the main
> thread that waits for the workers to terminate, there are no muteces
> involved.

Hmmm... if the chunks you use are too small or too large, it can be an issue too.  If you divide the workload into small chunks like individual pixels, it costs overhead due to switching and loss of locality of reference.

On the other hand, if you have ten threads, I think it is a mistake to divide the pixel load into exactly ten parts -- some pixels are heavier than others.  The pixels that raytrace complex refractions will take longer and so some threads finish first.

What I would tend to do is divide the pixel count into blocks, then let each thread pull a block of (consecutive or adjacent if possible) pixels to work on.  Then some threads can do hard blocks and take a while at it, and others can do simple blocks and process more of them.

The goal is for all the threads to finish at about the same time.  If they don't you end up with some threads waiting idly at the end.

This would require reintroducing a little synchronization.  I might divide an image into blocks that are each about 2% of the total image.  Let's say you have 4 hardware threads and each block is 2% of the work load.  Then the average inefficiency is maybe something like 1/2 * .02 * 4 = 4 %.  This is 1/2 of the threads being idle as they finish at randomly uneven rates, times 2 % of the total work, times 4 hardware threads (because it doesn't hurt much to have idle *software* threads, only idle hardware ones.)
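
A sketch of that kind of block handout (names made up), where the only shared state is a single atomic counter, so the synchronization cost stays tiny:

import core.atomic : atomicOp;

shared size_t nextBlock;   // index of the next unclaimed block

// each worker keeps claiming blocks until none are left, so fast
// threads simply end up processing more blocks than slow ones
void worker(size_t totalBlocks, void delegate(size_t) renderBlock)
{
    while (true)
    {
        immutable mine = atomicOp!"+="(nextBlock, 1) - 1;  // claim one block
        if (mine >= totalBlocks)
            break;
        renderBlock(mine);
    }
}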

Other micro-considerations:

1. Handing out 'harder' sections first is a good idea if you have this info, because these will "stick out" more if they are the only ones running at the end (they represent a larger percentage than their block size.)

2. You can start by handing out large chunks and then switch to smaller chunks when the work is mostly done.  For example, for 4 hardware threads, you could always hand out 1/16th of the remaining work until you get down to handing out 100 pixels at a time.
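
Point 2 as a tiny sketch, using the 1/16 and 100-pixel numbers from above:

import std.algorithm : max, min;

// hand out 1/16th of what's left, but at least 100 pixels,
// and never more than actually remains
size_t nextChunk(size_t remainingPixels)
{
    return min(remainingPixels, max(remainingPixels / 16, 100));
}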

(I don't know how useful this is for raytracing specifically, but some of these issues come up where I work.)

Kevin

January 27, 2007
Kevin Bealer wrote:
> [...]
> The goal is for all the threads to finish at about the same time.  If
> they don't you end up with some threads waiting idly at the end.
>
> This would require reintroducing a little synchronization. [...]

that's all very reasonable, thx.
i changed the job distribution by dividing the image into a grid, assigning (equally sized) cells to threads in an extending spiral starting in the middle of the image. this is due to the assumption that the hardest cells are around the middle (which is the case for the test scene). the distribution of those cells is of course synchronized.
new sources: http://mainia.de/prt_grid.zip
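
Not the prt_grid.zip code, but one simple way to get a similar center-first ordering (a plain distance sort rather than an actual spiral; Cell and the grid sizes are made-up names):

import std.algorithm : sort;
import std.math : abs;

struct Cell { size_t cx, cy; }   // grid coordinates of one cell

// put the cells nearest the image center first, so the presumably
// hardest work gets handed out early
Cell[] centerFirst(Cell[] cells, size_t gridW, size_t gridH)
{
    immutable midX = cast(long) gridW / 2;
    immutable midY = cast(long) gridH / 2;
    long dist(Cell c) { return abs(cast(long) c.cx - midX) + abs(cast(long) c.cy - midY); }
    cells.sort!((a, b) => dist(a) < dist(b));
    return cells;
}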

now everything is faster, even with one thread. but the speedup with more threads decreased (time stays about the same for 1-16 threads). still, i haven't quite "felt" the multicore performance.

meanwhile, a friend tells me that his speedups with image processing algorithms and naive block assignment are close to 200%...

atm i suspect there's a more technical reason for that, something in the threading lib or the like...
January 27, 2007
Kevin is right with his optimization ideas of course. But there is something else going on there. The naive approach should already gain way more speed than Jascha describes. And even worse, on my Core Duo on Windows the tracer drastically loses time with more threads. With one thread on a 640x480 image it takes 11 seconds. With two threads it takes 86 seconds!

When looking at the process manager with 2 threads, it shows that cpu usage is constantly only somewhere between 15% and 35%. Increasing the priority of the threads in D doesn't seem to do anything. Raising the priority in the process manager lifts cpu usage to near 100%. However, it doesn't make the program finish any faster.

There must be some implicit synchronization involved, though I have no idea where and why.

 Jascha Wetzel <[firstname]@mainia.de> Wrote:


> Kevin Bealer wrote:
> > [...]
> > The goal is for all the threads to finish at about the same time.  If
> > they don't you end up with some threads waiting idly at the end.
> >
> > This would require reintroducing a little synchronization. [...]
> 
> that's all very reasonable, thx.
> i changed the job distribution by dividing the image into a grid,
> assigning (equally sized) cells to threads in an extending spiral
> starting in the middle of the image. this is due to the assumption that
> the hardest cells are around the middle (which is the case for the test
> scene).
> the distribution of those cells is of course synchronized.
> new sources: http://mainia.de/prt_grid.zip
> 
> now everything is faster, even with one thread. but the speedup with more threads decreased (time stays about the same for 1-16 threads). still, i haven't quite "felt" the multicore performance.
> 
> meanwhile, a friend tells me, that his speedups with image processing algorithms and naive block assignment are close to 200%...
> 
> atm i suspect there's a more technical reason to that, something with the threading lib or the like...

January 27, 2007
That's really weird.  Unless you are actually straining your L1 or L2 cache (or there is, as you say, really bad implicit synchronization), it's hard to see what could do this.

I think 'valgrind' has some skins that deal with this sort of analysis, specifically cachegrind, callgrind and helgrind, but helgrind requires version 2.2 (or earlier) and it looks like valgrind is crashing for me with dmd-produced binaries -- maybe I need to set up gdc and try it that way.

Kevin

Philipp Spöth wrote:
> Kevin is right with his optimization ideas of course. But there is something else going on there. The naive approach should already gain way more speed than Jascha describes. And even worse, on my Core Duo on Windows the tracer drastically loses time with more threads. With one thread on a 640x480 image it takes 11 seconds. With two threads it takes 86 seconds!
> 
> When looking at the process manager with 2 threads, it shows that cpu usage is constantly only somewhere between 15% and 35%. Increasing the priority of the threads in D doesn't seem to do anything. Raising the priority in the process manager lifts cpu usage to near 100%. However, it doesn't make the program finish any faster.
> 
> There must be some implicit synchronization involved, though I have no idea where and why.
> 
>  Jascha Wetzel <[firstname]@mainia.de> Wrote:
> 
> 
>> Kevin Bealer wrote:
>>> [...]
>>> The goal is for all the threads to finish at about the same time.  If
>>> they don't you end up with some threads waiting idly at the end.
>>>
>>> This would require reintroducing a little synchronization.
>>> [...]
>> that's all very reasonable, thx.
>> i changed the job distribution by dividing the image into a grid,
>> assigning (equally sized) cells to threads in an extending spiral
>> starting in the middle of the image. this is due to the assumption that
>> the hardest cells are around the middle (which is the case for the test
>> scene).
>> the distribution of those cells is of course synchronized.
>> new sources: http://mainia.de/prt_grid.zip
>>
>> now everything is faster, even with one thread. but the speedup with
>> more threads decreased (time stays about the same for 1-16 threads).
>> still, i haven't quite "felt" the multicore performance.
>>
>> meanwhile, a friend tells me, that his speedups with image processing
>> algorithms and naive block assignment are close to 200%...
>>
>> atm i suspect there's a more technical reason to that, something with
>> the threading lib or the like...
> 
