September 26, 2015
On Saturday, 26 September 2015 at 17:20:34 UTC, Jay Norwood wrote:
> This is a work-around to get a ulong result without having the ulong as the range variable.
>
> ulong getTerm(int i)
> {
>    return i;
> }
> auto sum4 = taskPool.reduce!"a + b"(std.algorithm.map!getTerm(iota(1000000001)));

or

auto sum4 = taskPool.reduce!"a + b"(0UL, iota(1_000_000_001));

works for me
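
For reference, a complete program combining both variants might look like this (a minimal, untimed sketch; the 0UL seed is deliberately the identity for "+", since std.parallelism's reduce seeds each work unit with it):

import std.algorithm : map;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

// Widen each int term to ulong so the accumulator type becomes ulong.
ulong getTerm(int i)
{
    return i;
}

void main()
{
    // Variant 1: map the terms to ulong, then reduce in parallel.
    auto sum4 = taskPool.reduce!"a + b"(map!getTerm(iota(1_000_000_001)));

    // Variant 2: pass an explicit ulong seed; 0UL both fixes the
    // accumulator type and is the identity element for "+".
    auto sum5 = taskPool.reduce!"a + b"(0UL, iota(1_000_000_001));

    writeln(sum4); // 500000000500000000
    writeln(sum5); // ditto
}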
September 28, 2015
On Sat, 2015-09-26 at 12:32 +0000, Zoidberg via Digitalmars-d-learn wrote:
> > Here's a correct version:
> > 
> > import std.parallelism, std.range, std.stdio, core.atomic;
> > void main()
> > {
> >     shared ulong i = 0;
> >     foreach (f; parallel(iota(1, 1000000+1)))
> >     {
> >         i.atomicOp!"+="(f);
> >     }
> >     i.writeln;
> > }
> 
> Thanks! Works fine. So "shared" and "atomic" is a must?

Yes and no, but mostly no. If you have to write this as an explicit iteration (very 1970s), then yes: to avoid getting things wrong, you have to ensure that the update to the shared mutable state is atomic.

A more modern (1930s/1950s) way of doing things is to use implicit iteration – something Java, C++, etc. are all getting into more and more – in this case a reduce call. People have previously mentioned:

    taskPool.reduce!"a + b"(iota(1UL,1000001))

which I would suggest has to be seen as the best way of writing this algorithm.


-- 
Russel.



September 28, 2015
On Sat, 2015-09-26 at 14:33 +0200, anonymous via Digitalmars-d-learn wrote:
> […]
> I'm pretty sure atomicOp is faster, though.

Rough and ready anecdotal evidence would indicate that this is a reasonable statement, by quite a long way. However a proper benchmark is needed for statistical significance.

On the other hand std.parallelism.taskPool.reduce surely has to be the correct way of expressing the algorithm?

-- 
Russel.



September 28, 2015
On Sat, 2015-09-26 at 15:56 +0000, Jay Norwood via Digitalmars-d-learn wrote:
> std.parallelism.reduce documentation provides an example of a parallel sum.
> 
> This works:
> auto sum3 = taskPool.reduce!"a + b"(iota(1.0,1000001.0));
> 
> This results in a compile error:
> auto sum3 = taskPool.reduce!"a + b"(iota(1UL,1000001UL));
> 
> I believe there was discussion of this problem recently ...

Which may or may not already have been fixed, or…

On the other hand:

	taskPool.reduce!"a + b"(0UL, iota(1_000_001));

seems to work fine. (Note the 0UL seed rather than 1UL: std.parallelism's reduce uses the seed for every work unit, so it should be the identity element of the operation.)

-- 
Russel.



September 28, 2015
On Sat, 2015-09-26 at 17:20 +0000, Jay Norwood via Digitalmars-d-learn wrote:
> This is a work-around to get a ulong result without having the ulong as the range variable.
> 
> ulong getTerm(int i)
> {
>     return i;
> }
> auto sum4 = taskPool.reduce!"a + b"(std.algorithm.map!getTerm(iota(1000000001)));

Not needed, as reduce can take an initial value that sets the type of the template. See the previous email.

-- 
Russel.



September 28, 2015
On Monday, 28 September 2015 at 11:31:33 UTC, Russel Winder wrote:
> On Sat, 2015-09-26 at 14:33 +0200, anonymous via Digitalmars-d-learn wrote:
>> […]
>> I'm pretty sure atomicOp is faster, though.
>
> Rough and ready anecdotal evidence would indicate that this is a reasonable statement, by quite a long way. However a proper benchmark is needed for statistical significance.
>
> On the other hand std.parallelism.taskPool.reduce surely has to be the correct way of expressing the algorithm?

It would be really great if someone knowledgeable did a full review of std.parallelism to find out the answer, hint, hint... :)
September 28, 2015
On Mon, 2015-09-28 at 11:38 +0000, John Colvin via Digitalmars-d-learn wrote:
> […]
> 
> It would be really great if someone knowledgeable did a full review of std.parallelism to find out the answer, hint, hint... :)

Indeed, I would love to be able to do this. However, I don't have time in the next few months to do this on a volunteer basis, and no-one is paying money whereby this review could happen as a side effect. Sad, but…
-- 
Russel.



September 28, 2015
As a single data point:

======================  anonymous_fix.d ==========
500000500000

real	0m0.168s
user	0m0.200s
sys	0m0.380s
======================  colvin_fix.d ==========
500000500000

real	0m0.036s
user	0m0.124s
sys	0m0.000s
======================  norwood_reduce.d ==========
500000500000

real	0m0.009s
user	0m0.020s
sys	0m0.000s
======================  original.d ==========
218329750363

real	0m0.024s
user	0m0.076s
sys	0m0.000s


original.d is the original: not entirely slow, but broken :-). anonymous_fix.d is anonymous' synchronized-keyword version, slow. colvin_fix.d is John Colvin's use of atomicOp, correct but only OK-ish on speed. Jay Norwood first proposed the reduce answer on the list (norwood_reduce.d); I amended it a tiddly bit, but clearly it is a resounding speed winner.

I guess we need a benchmark framework that can run each of these 100 times, take processor times, and then do the statistics on them. Most people would assume a normal distribution of results and compute mean/standard deviation and the median.
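
A minimal sketch of such a harness (using std.datetime.stopwatch.benchmark from current Phobos; it only collects wall-clock seconds for the reduce version, and the run count and summed range are placeholders):

import std.algorithm : map, sort, sum;
import std.datetime.stopwatch : benchmark;
import std.math : sqrt;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    enum runs = 100;
    double[] times;     // elapsed seconds per run

    foreach (_; 0 .. runs)
    {
        // benchmark returns one Duration per tested callable;
        // here we time a single execution of the reduce version.
        auto r = benchmark!({
            auto s = taskPool.reduce!"a + b"(0UL, iota(1, 1_000_001));
        })(1);
        times ~= r[0].total!"usecs" / 1e6;
    }

    times.sort();
    immutable mean = times.sum / runs;
    immutable median = times[runs / 2];
    immutable stddev = sqrt(times.map!(t => (t - mean) ^^ 2).sum / runs);
    writefln("mean %.6fs  stddev %.6fs  median %.6fs", mean, stddev, median);
}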

-- 
Russel.



September 28, 2015
On Monday, 28 September 2015 at 12:18:28 UTC, Russel Winder wrote:
> As a single data point:
>
> ======================  anonymous_fix.d ==========
> 500000500000
>
> real	0m0.168s
> user	0m0.200s
> sys	0m0.380s
> ======================  colvin_fix.d ==========
> 500000500000
>
> real	0m0.036s
> user	0m0.124s
> sys	0m0.000s
> ======================  norwood_reduce.d ==========
> 500000500000
>
> real	0m0.009s
> user	0m0.020s
> sys	0m0.000s
> ======================  original.d ==========
> 218329750363
>
> real	0m0.024s
> user	0m0.076s
> sys	0m0.000s
>
>
> original.d is the original: not entirely slow, but broken :-). anonymous_fix.d is anonymous' synchronized-keyword version, slow. colvin_fix.d is John Colvin's use of atomicOp, correct but only OK-ish on speed. Jay Norwood first proposed the reduce answer on the list (norwood_reduce.d); I amended it a tiddly bit, but clearly it is a resounding speed winner.

Pretty much as expected. Locks are slow, shared accumulators suck, much better to write to thread local and then merge.
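
For what it's worth, a minimal sketch of that thread-local-then-merge pattern, using taskPool.workerLocalStorage along the lines of the pi example in the std.parallelism documentation:

import std.parallelism : parallel, taskPool;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // One ulong accumulator per worker thread, each seeded with 0.
    auto partial = taskPool.workerLocalStorage(0UL);

    foreach (i; parallel(iota(1, 1_000_001)))
    {
        partial.get += i;   // each thread updates only its own slot
    }

    // Merge the per-thread partial sums sequentially afterwards.
    ulong total = 0;
    foreach (p; partial.toRange)
        total += p;

    total.writeln;   // 500000500000
}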
September 28, 2015
On Mon, 2015-09-28 at 12:46 +0000, John Colvin via Digitalmars-d-learn wrote:
> […]
> 
> Pretty much as expected. Locks are slow, shared accumulators suck, much better to write to thread local and then merge.

Quite. Dataflow is where the parallel action is. (Except for those writing concurrency and parallelism libraries.) Anyone doing concurrency and parallelism with shared-memory multi-threading, locks, synchronized, mutexes, etc. is doing it wrong. This has been known since the 1970s, but the programming community got sidetracked by a lack of abstraction (*) for a couple of decades.


(*) I blame C, C++ and Java. And programmers who programmed before (or worse, without) thinking.

-- 
Russel.