Advice on threading/fibers/?
June 15, 2011
Consider the following:

You have 10 million data points and you need to apply a multipass algorithm to them. Each pass is like a cellular automaton: it can read the previous pass's values but it doesn't know the "current" ones. This makes the actual processing of each value trivially parallelizable. The actual operation for each value is fairly simple and cheap (essentially a multidimensional ancestor-child averaging operation).

After each value has been operated on once, the pass is complete and the "current" and "old" buffers are switched (conceptually, the "current" buffer can only be written to, the "old" buffer can only be read--using __gshared here).

The number of passes is not fixed; in the course of each value operation, an error is computed. When the worst individual error falls below a certain threshold, the algorithm is finished. Generally this will take between one thousand and ten thousand passes.

How would you go about parallelizing this? My thought is to take the map/reduce approach within each pass: each thread/fiber takes a slice of the dataset, makes its modifications, then returns an error summary. These summaries are quickly combined and the algorithm loop decides whether to run again. Each pass shouldn't take more than a second or two, so I'm not sure whether the overhead of spawning, say, 10 threads per pass is worthwhile (times 5000 passes). On the other hand, I have plenty of CPUs to throw at it (at least 16 cores, each with hyperthreading) and am in a situation where "as fast as possible" is important (while individual datasets may not grow, the number of them is growing).
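
Roughly the shape I have in mind is sketched below. It's only a sketch: relax() is a stand-in for the real per-value operation, THRESHOLD for the real convergence test, and the initial data is just filler.

import std.algorithm : max, min, swap;
import std.concurrency : receiveOnly, send, spawn, thisTid, Tid;
import std.math : abs;

__gshared double[] oldBuf, curBuf;  // "old" is read-only, "current" is write-only per pass

enum THRESHOLD = 1e-6;  // placeholder convergence threshold
enum nWorkers  = 16;

// Stand-in for the real ancestor-child averaging operation.
double relax(const(double)[] prev, size_t i)
{
    immutable lo = i == 0 ? i : i - 1;
    immutable hi = i + 1 == prev.length ? i : i + 1;
    return (prev[lo] + prev[i] + prev[hi]) / 3.0;
}

// Each worker handles one slice of the dataset and reports its worst error.
void worker(Tid owner, size_t lo, size_t hi)
{
    double worst = 0;
    foreach (i; lo .. hi)
    {
        curBuf[i] = relax(oldBuf, i);
        worst = max(worst, abs(curBuf[i] - oldBuf[i]));
    }
    send(owner, worst);
}

void main()
{
    oldBuf = new double[](10_000_000);
    curBuf = new double[](oldBuf.length);
    oldBuf[] = 1.0;
    oldBuf[0] = 0.0;  // something to smooth out

    double worst;
    do
    {
        // Spawn a thread per slice; this is the per-pass overhead I'm unsure about.
        immutable chunk = (oldBuf.length + nWorkers - 1) / nWorkers;
        foreach (w; 0 .. nWorkers)
            spawn(&worker, thisTid, w * chunk, min((w + 1) * chunk, oldBuf.length));

        // Combine the per-slice error summaries.
        worst = 0;
        foreach (w; 0 .. nWorkers)
            worst = max(worst, receiveOnly!double());

        swap(oldBuf, curBuf);  // flip "current" and "old" for the next pass
    } while (worst > THRESHOLD);
}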

Any thoughts appreciated.

Justin
June 16, 2011
On Wed, 15 Jun 2011 23:57:25 +0000, Justin Whear wrote:

> [...]
> 
> Any thoughts appreciated.

I would recommend you take a look at the new std.parallelism module, which was introduced in the most recent DMD release (2.053):

http://www.d-programming-language.org/phobos-prerelease/std_parallelism.html

-Lars
June 16, 2011
On Thu, 16 Jun 2011 13:22:33 +0000, Lars T. Kyllingstad wrote:

> On Wed, 15 Jun 2011 23:57:25 +0000, Justin Whear wrote:
> 
>> Any thoughts appreciated.
> 
> I would recommend you take a look at the new std.parallelism module, which was introduced in the most recent DMD release (2.053):
> 
> http://www.d-programming-language.org/phobos-prerelease/std_parallelism.html

Specifically, check out TaskPool.parallel(), TaskPool.amap(), TaskPool.map() and TaskPool.reduce().  My guess is one or more of these will be exactly what you need.
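
For instance, a parallel foreach plus a parallel reduce per pass could look something like this. Untested sketch only: the neighbour average and the threshold are just stand-ins for your real operation and convergence test.

import std.algorithm : max, swap;
import std.math : abs;
import std.parallelism : taskPool;

// One pass: fill `cur` from `prev` and return the worst individual error.
double doPass(const(double)[] prev, double[] cur, double[] err)
{
    // Parallel foreach: each work unit gets a chunk of 10_000 indices.
    foreach (i, ref c; taskPool.parallel(cur, 10_000))
    {
        // Stand-in for the real ancestor-child averaging operation.
        immutable lo = i == 0 ? i : i - 1;
        immutable hi = i + 1 == cur.length ? i : i + 1;
        c      = (prev[lo] + prev[i] + prev[hi]) / 3.0;
        err[i] = abs(c - prev[i]);
    }
    // Parallel reduction of the per-value errors down to the worst one.
    return taskPool.reduce!max(err);
}

void main()
{
    auto oldBuf = new double[](10_000_000);
    auto curBuf = new double[](oldBuf.length);
    auto errBuf = new double[](oldBuf.length);
    oldBuf[] = 1.0;  // initial data goes here

    enum threshold = 1e-6;  // placeholder convergence criterion
    while (doPass(oldBuf, curBuf, errBuf) > threshold)
        swap(oldBuf, curBuf);
}

The nice part is that the default taskPool is created lazily, once, with totalCPUs - 1 worker threads and is reused across passes, so you don't pay thread-spawning costs 5000 times.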

-Lars
June 16, 2011
On 06/16/2011 06:22 AM, Lars T. Kyllingstad wrote:
> On Wed, 15 Jun 2011 23:57:25 +0000, Justin Whear wrote:
>
>> [...]
>
> I would recommend you take a look at the new std.parallelism module,
> which was introduced in the most recent DMD release (2.053):
>
> http://www.d-programming-language.org/phobos-prerelease/std_parallelism.html
>
> -Lars
I wrote an article on using std.parallelism for a bucket-sort algorithm: http://www.codestrokes.com/archives/116

I hope it helps.
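
In case the link goes stale, the general shape (a rough sketch, not necessarily exactly what the article does) is to scatter values into buckets and then let taskPool.parallel sort the independent buckets:

import std.algorithm : sort;
import std.parallelism : taskPool;

// Sorts values assumed to lie in [0, 1): scatter into buckets by range,
// sort the buckets in parallel, then concatenate.
double[] bucketSort(double[] data, size_t nBuckets = 64)
{
    auto buckets = new double[][](nBuckets);
    foreach (x; data)
    {
        auto b = cast(size_t)(x * nBuckets);
        if (b >= nBuckets) b = nBuckets - 1;
        buckets[b] ~= x;
    }

    // Buckets are independent, so sorting them is an easy parallel foreach.
    foreach (bucket; taskPool.parallel(buckets, 1))
        sort(bucket);

    double[] result;
    result.reserve(data.length);
    foreach (bucket; buckets)
        result ~= bucket;
    return result;
}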