Thread overview: std.parallelism curious results
  flamencofantasy     Oct 05, 2014
  Artem Tarasov       Oct 05, 2014
  Russel Winder       Oct 05, 2014
  Sativa              Oct 05, 2014
  Ali Çehreli         Oct 05, 2014
  Sativa              Oct 05, 2014
  Ali Çehreli         Oct 05, 2014
  Sativa              Oct 05, 2014
  flamencofantasy     Oct 06, 2014
October 05, 2014
Hello,

I am summing up the first 1 billion integers in parallel and in a single thread and I'm observing some curious results;

parallel sum : 499999999500000000, elapsed 102833 ms
single thread sum : 499999999500000000, elapsed 1667 ms

The parallel version is 60+ times slower on my i7-3770K CPU. I think that may be due to the CPU constantly flushing and reloading the caches in the parallel version, but I don't know for sure.

Here is the D code;

import std.stdio, std.range, std.parallelism, std.datetime, core.atomic;

void main()
{
	shared ulong sum = 0;
	ulong iter = 1_000_000_000UL;

	StopWatch sw;
	sw.start();

	// Parallel version: every iteration performs an atomic read-modify-write on the shared sum.
	foreach (i; parallel(iota(0, iter)))
	{
		atomicOp!"+="(sum, i);
	}

	sw.stop();
	writefln("parallel sum : %s, elapsed %s ms", sum, sw.peek().msecs);

	sum = 0;
	sw.reset();
	sw.start();

	// Single-threaded version: a plain loop with no synchronization.
	for (ulong i = 0; i < iter; ++i)
	{
		sum += i;
	}

	sw.stop();
	writefln("single thread sum : %s, elapsed %s ms", sum, sw.peek().msecs);
}

Out of curiosity I tried the equivalent code in C# and I got this;

parallel sum : 499999999500000000, elapsed 20320 ms
single thread sum : 499999999500000000, elapsed 1901 ms

The C# parallel is about 5 times faster than the D parallel, which is strange on the exact same CPU.

And here is the C# code;

long sum = 0;
long iter = 1000000000L;

var sw = Stopwatch.StartNew();

Parallel.For(0, iter, i =>
{
	Interlocked.Add(ref sum, i);
});

Console.WriteLine("parallel sum : {0}, elapsed {1} ms", sum, sw.ElapsedMilliseconds);

sum = 0;

sw = Stopwatch.StartNew();

for (long i = 0; i < iter; ++i)
{
	sum += i;
}

Console.WriteLine("single thread sum : {0}, elapsed {1} ms", sum, sw.ElapsedMilliseconds);

Thoughts?
October 05, 2014
Welcome to the world of multithreading.
You have just discovered that atomic operations are performance killers, congratulations on this.
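
For comparison, the same sum can be computed without any per-element atomics by letting std.parallelism combine per-worker partial sums, e.g. with taskPool.reduce. A minimal sketch (not benchmarked here, assuming the same iter as the original post):

import std.parallelism, std.range, std.stdio;

void main()
{
	ulong iter = 1_000_000_000UL;

	// reduce gives each worker thread its own private partial sum and
	// combines the partial sums only once at the end, so there is no
	// per-element synchronization.
	ulong sum = taskPool.reduce!"a + b"(0UL, iota(iter));

	writefln("parallel sum : %s", sum);
}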
October 05, 2014
On 05/10/14 15:27, flamencofantasy via Digitalmars-d-learn wrote:
> Hello,
> 
> I am summing up the first 1 billion integers in parallel and in a single thread and I'm observing some curious results;

I am fairly certain that your use of "parallel for" introduces quite a lot of threads other than your "master" one.

> parallel sum : 499999999500000000, elapsed 102833 ms
> single thread sum : 499999999500000000, elapsed 1667 ms
> 
> The parallel version is 60+ times slower on my i7-3770K CPU. I think that may be due to the CPU constantly flushing and reloading the caches in the parallel version, but I don't know for sure.

I would bet there are cache problems, but it is far more likely that the core problem is all the thread activity and, in particular, all the synchronization.

> Here is the D code;
> 
> shared ulong sum = 0; ulong iter = 1_000_000_000UL;
> 
> StopWatch sw;
> 
> sw.start();
> 
> foreach(i; parallel(iota(0, iter))) { atomicOp!"+="(sum, i); }

Well, that will be the problem then: lots and lots of synchronization with the billion tasks you have set up. I am highly surprised this is only 60 times slower than sequential!

> sw.stop();
> 
> writefln("parallel sum : %s, elapsed %s ms", sum,
> sw.peek().msecs);
> 
> sum = 0;
> 
> sw.reset();
> 
> sw.start();
> 
> for (ulong i = 0; i < iter; ++i) { sum += i; }
> 
> sw.stop();
> 
> writefln("single thread sum : %s, elapsed %s ms", sum, sw.peek().msecs);
> 
> Out of curiosity I tried the equivalent code in C# and I got this;
> 
> parallel sum : 499999999500000000, elapsed 20320 ms
> single thread sum : 499999999500000000, elapsed 1901 ms
> 
> The C# parallel is about 5 times faster than the D parallel, which is strange on the exact same CPU.
> 
> And here is the C# code;
> 
> long sum = 0; long iter = 1000000000L;
> 
> var sw = Stopwatch.StartNew();
> 
> Parallel.For(0, iter, i => { Interlocked.Add(ref sum, i); });

The useful moral of this story is that C#'s synchronization, in this (somewhat perverse) context, is much more efficient than D's.

There is almost certainly a useful benchmark test that can come of this for the std.parallelism implementation (if only I had a few cycles to get really stuck into a review and analysis of the module) :-(

> Console.WriteLine("parallel sum : {0}, elapsed {1} ms", sum, sw.ElapsedMilliseconds);
> 
> sum = 0;
> 
> sw = Stopwatch.StartNew();
> 
> for (long i = 0; i < iter; ++i) { sum += i; }
> 
> Console.WriteLine("single thread sum : {0}, elapsed {1} ms", sum, sw.ElapsedMilliseconds);
> 
> Thoughts?


--
Russel.
October 05, 2014
Two problems. One, you should create your threads outside the stopwatch timing; otherwise it is not generally a fair comparison to the real world, and it throws off the results for short tasks.

Second, you are creating one thread per integer, which is bad. Do you really want to create 1B threads when you probably only have 4 cores?

Below, 4 threads are used and each thread adds up 1/4 of the range locally, combining its partial sum with a single atomic add at the end. Comparing that time against a single thread adding up the whole range shows what the parallelism overhead costs.

import std.stdio, std.parallelism, std.datetime, std.range, core.atomic;

void main()
{
	StopWatch sw;
	shared ulong sum1 = 0;
	ulong sum2 = 0, time1, time2;

	auto numThreads = 4;
	ulong iter = numThreads * 100_000UL;

	// One chunk per thread: i takes the values 0, iter/4, 2*iter/4, 3*iter/4.
	auto thds = parallel(iota(0, iter, iter / numThreads));

	sw.start();
	foreach (i; thds)
	{
		// Sum the chunk locally, then publish the partial sum with a single atomic add.
		ulong s = 0;
		for (ulong k = 0; k < iter / numThreads; k++)
		{
			s += k;
		}
		s += i * iter / numThreads; // shift the local 0-based sum to the chunk starting at i
		atomicOp!"+="(sum1, s);
	}
	sw.stop();
	time1 = sw.peek().usecs;

	// Single-threaded reference: sum the whole range in one loop.
	sw.reset();
	sw.start();
	for (ulong i = 0; i < iter; ++i)
	{
		sum2 += i;
	}
	sw.stop();
	time2 = sw.peek().usecs;

	writefln("parallel sum : %s, elapsed %s us", sum1, time1);
	writefln("single thread sum : %s, elapsed %s us", sum2, time2);
	writefln("Efficiency : %s%%", 100 * time2 / time1);
}

http://dpaste.dzfl.pl/bfda7bb2e2b7

Some results:

parallel sum : 79999800000, elapsed 3356 us
single thread sum : 79999800000, elapsed 1984 us
Efficiency : 59%


(Not sure all the code is correct; the point is that you were creating 1B threads with 1B atomic operations, which is the worst possible comparison one can make between single- and multi-threaded tests.)


October 05, 2014
On 10/05/2014 07:27 AM, flamencofantasy wrote:

> I am summing up the first 1 billion integers in parallel and in a single
> thread and I'm observing some curious results;
>
> parallel sum : 499999999500000000, elapsed 102833 ms
> single thread sum : 499999999500000000, elapsed 1667 ms
>
> The parallel version is 60+ times slower

Reducing the number of threads is key. However, unlike what others said, parallel() does not use that many threads. By default, TaskPool objects are constructed with 'totalCPUs - 1' worker threads, and all of parallel()'s iterations are executed on those few threads.
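
A quick way to verify this on a given machine is to print the defaults; a small sketch:

    import std.parallelism, std.stdio;

    void main()
    {
        // totalCPUs is the number of logical cores detected at startup; the
        // default pool used by parallel() has totalCPUs - 1 worker threads.
        writefln("logical CPUs         : %s", totalCPUs);
        writefln("default pool workers : %s", taskPool.size);
    }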

The main problem here is the use of atomicOp, which necessarily synchronizes the whole process.

Something like the following takes advantage of parallelism and reduces the execution time by half on my machine (4 hyperthreaded cores, 2 actual ones).

    ulong adder(ulong beg, ulong end)
    {
        ulong localSum = 0;

        foreach (i; beg .. end) {
            localSum += i;
        }

        return localSum;
    }

    enum totalTasks = 10;

    foreach(i; parallel(iota(0, totalTasks)))
    {
        ulong beg = i * iter / totalTasks;
        ulong end = beg + iter / totalTasks;

        atomicOp!"+="(sum, adder(beg, end));
    }

Ali

October 05, 2014
On Sunday, 5 October 2014 at 21:25:39 UTC, Ali Çehreli wrote:
import std.stdio, std.cstream, std.parallelism, std.datetime, std.range, core.atomic;

void main()
{
	StopWatch sw;
	shared ulong sum1 = 0;
	ulong sum2 = 0, sum3 = 0, time1, time2, time3;

	enum numThreads = 4; // If numThreads is a variable then it significantly slows down the process
	ulong iter = 1000000L;
	iter = numThreads*cast(ulong)(iter/numThreads); // Force iter to be a multiple of the number of threads so we can partition uniformly

	auto thds = parallel(iota(0, cast(uint)iter, cast(uint)(iter/numThreads)));

	sw.reset();
	sw.start();
	foreach (i; thds)
	{
		ulong s = 0;
		for (ulong k = 0; k < iter/numThreads; k++) { s += k; }
		s += i*iter/numThreads;
		atomicOp!"+="(sum1, s);
	}
	sw.stop();
	time1 = sw.peek().usecs;

	sw.reset();
	sw.start();
	for (ulong i = 0; i < iter; ++i) { sum2 += i; }
	sw.stop();
	time2 = sw.peek().usecs;

	writefln("parallel sum : %s, elapsed %s us", sum1, time1);
	writefln("single thread sum : %s, elapsed %s us", sum2, time2);
	if (time1 > 0) writefln("Efficiency : %s%%", 100*time2/time1);
	din.getc();
}

Playing around with the code above, it seems that when numThreads is an enum, the execution time is significantly affected (efficiency goes from being < 100% to being > 100%).

Results on a 4-core laptop with release builds (numThreads as an enum):

parallel sum : 499999500000, elapsed 2469 us
single thread sum : 499999500000, elapsed 8054 us
Efficiency : 326%


When numThreads is an int (a run-time variable):

parallel sum : 499999500000, elapsed 21762 us
single thread sum : 499999500000, elapsed 8033 us
Efficiency : 36%
October 05, 2014
On 10/05/2014 02:40 PM, Sativa wrote:

>      foreach(i; thds) { ulong s = 0; for(ulong k = 0; k <
> iter/numThreads; k++)

The for loop condition is executed at every iteration, and division is an expensive operation. Apparently, the compiler does some optimization when the divisor is known at compile time.

Since it is 4, the division is just a shift of 2 bits. Try something like 5; it is slow even with enum.

This solves the problem:

        const end = iter/numThreads;

        for(ulong k = 0; k < end; k++) {

Ali

October 05, 2014
On Sunday, 5 October 2014 at 21:53:23 UTC, Ali Çehreli wrote:
> On 10/05/2014 02:40 PM, Sativa wrote:
>
> >      foreach(i; thds) { ulong s = 0; for(ulong k = 0; k <
> > iter/numThreads; k++)
>
> The for loop condition is executed at every iteration, and division is an expensive operation. Apparently, the compiler does some optimization when the divisor is known at compile time.
>
> Since it is 4, the division is just a shift of 2 bits. Try something like 5; it is slow even with enum.
>
> This solves the problem:
>
>         const end = iter/numThreads;
>
>         for(ulong k = 0; k < end; k++) {
>
> Ali

Yes, it is a common problem when the loop bound involves a computation. Most of the time the bound is constant for the whole loop, but the compiler recomputes it every iteration. When doing a simple sum (when the loop body does not do much), that becomes expensive, since it is comparable to the work happening inside the loop.

It's surprising just how slow it makes things, though. One can't really make numThreads a compile-time constant in the real world, as it wouldn't be optimal (unless one had a version for each possible number of threads).

Obviously one can just move the computation outside the loop. I would expect better results if the loops actually did some real work.
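
For example, the same hoisting works with a run-time thread count as well; a minimal sketch, using a hypothetical chunkSum helper:

	ulong chunkSum(ulong iter, uint numThreads)
	{
		// numThreads is a run-time value here, so the compiler cannot fold the
		// division; hoisting it into a const keeps it out of the loop condition.
		const perThread = iter / numThreads;

		ulong s = 0;
		for (ulong k = 0; k < perThread; ++k)
			s += k;
		return s;
	}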


October 06, 2014
Thanks everyone for the replies.

I wasn't sure how std.parallelism operated, but I thought it would launch at most a number of threads equal to the number of cores on the machine, just as Ali confirmed, and similar to what Windows' thread pool does.
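
(For the record, the worker count can also be controlled explicitly, either via the defaultPoolThreads property before the default pool is first used, or by creating a dedicated TaskPool; a minimal sketch:)

import std.parallelism, std.range, std.stdio;

void main()
{
	// Resize the lazily-created default pool; this only has an effect
	// before taskPool is used for the first time.
	defaultPoolThreads = 2;

	// Or create a dedicated pool with an explicit worker count.
	auto pool = new TaskPool(2);
	scope(exit) pool.finish(true); // block until the workers have shut down

	ulong sum = pool.reduce!"a + b"(0UL, iota(1_000_000UL));
	writefln("sum with 2 workers : %s", sum);
}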