Thread overview
multithread/concurrency/parallel methods and performance
Feb 18, 2018  SrMordred
Feb 18, 2018  Jordan Wilson
Feb 19, 2018  Nicholas Wilson
Feb 19, 2018  SrMordred
Feb 19, 2018  Dmitry Olshansky
Feb 19, 2018  SrMordred
Feb 20, 2018  Dmitry Olshansky
February 18, 2018 - SrMordred
I'm experimenting with threads and related topics recently.
(I've just started, so there may be some terrible mistakes here.)

With this base work:

foreach(i ; 0 .. SIZE)
{
    results[i] = values1[i] * values2[i];
}

and then with these 3 other methods: parallel, spawn and Thread.

These were my results:

_base : 456 ms and 479 us
_parallel : 331 ms, 324 us, and 4 hnsecs
_concurrency : 367 ms, 348 us, and 2 hnsecs
_thread : 369 ms, 565 us, and 3 hnsecs

(code here : https://run.dlang.io/is/2pdmmk )

All methods show only minor speedups. I was expecting a lot more.
Since I have 7 cores, I expected something below 100 ms.

I'm not seeing false sharing in this case, or am I wrong?

If someone can expand on this, I'll be grateful.

Thanks!
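
For reference, a minimal sketch of what the three threaded variants might look like (this is not the exact code behind the run.dlang.io link; the element type, fill values, and chunking are made up):

import core.thread : Thread, thread_joinAll;
import std.concurrency : spawn, ownerTid, send, receiveOnly;
import std.parallelism : parallel, totalCPUs;
import std.range : iota;

enum SIZE = 1024 * 1024;

// Module-level variables are thread-local by default in D;
// __gshared makes them visible to every thread.
__gshared float[] values1, values2, results;

void multiplyChunk(size_t lo, size_t hi)
{
    foreach (i; lo .. hi)
        results[i] = values1[i] * values2[i];
}

void spawnChunk(size_t lo, size_t hi)
{
    multiplyChunk(lo, hi);
    ownerTid.send(true);            // tell the owner this worker is done
}

auto chunkThread(size_t lo, size_t hi)
{
    // helper so each delegate captures its own lo/hi
    return new Thread(() => multiplyChunk(lo, hi));
}

void main()
{
    values1 = new float[SIZE];
    values2 = new float[SIZE];
    results = new float[SIZE];
    values1[] = 2.0f;
    values2[] = 3.0f;

    immutable nWorkers = totalCPUs;
    immutable chunk = SIZE / nWorkers;

    // _base: plain single-threaded loop
    multiplyChunk(0, SIZE);

    // _parallel: std.parallelism splits the index range over its thread pool
    foreach (i; parallel(iota(SIZE)))
        results[i] = values1[i] * values2[i];

    // _concurrency: std.concurrency.spawn, one "done" message per worker
    foreach (w; 0 .. nWorkers)
        spawn(&spawnChunk, w * chunk, w + 1 == nWorkers ? SIZE : (w + 1) * chunk);
    foreach (w; 0 .. nWorkers)
        receiveOnly!bool();

    // _thread: one core.thread.Thread per chunk, joined at the end
    foreach (w; 0 .. nWorkers)
        chunkThread(w * chunk, w + 1 == nWorkers ? SIZE : (w + 1) * chunk).start();
    thread_joinAll();
}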


February 18, 2018 - Jordan Wilson
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
> I'm experimenting with threads and related topics recently.
> (I've just started, so there may be some terrible mistakes here.)
>
> With this base work:
>
> foreach(i ; 0 .. SIZE)
> {
>     results[i] = values1[i] * values2[i];
> }
>
> and then with these 3 other methods: parallel, spawn and Thread.
>
> These were my results:
>
> _base : 456 ms and 479 us
> _parallel : 331 ms, 324 us, and 4 hnsecs
> _concurrency : 367 ms, 348 us, and 2 hnsecs
> _thread : 369 ms, 565 us, and 3 hnsecs
>
> (code here : https://run.dlang.io/is/2pdmmk )
>
> All methods show only minor speedups. I was expecting a lot more.
> Since I have 7 cores, I expected something below 100 ms.
>
> I'm not seeing false sharing in this case, or am I wrong?
>
> If someone can expand on this, I'll be grateful.
>
> Thanks!

It may be due to thread local storage:
https://tour.dlang.org/tour/en/multithreading/thread-local-storage
https://dlang.org/articles/migrate-to-shared.html
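
By default, module-level variables in D are thread-local, so each spawned thread sees its own copy unless the variable is marked otherwise. A tiny illustration (the names are made up):

int tlsCounter;              // thread-local: every thread gets its own copy
__gshared int rawGlobal;     // one instance shared by all threads, no compiler checks
shared int checkedGlobal;    // one instance; the type system restricts unsynchronized access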

I'm not sure though, as I don't know how your results array is initialised.

Jordan
February 19, 2018 - Nicholas Wilson
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
> I'm experimenting with threads and related topics recently.
> (I've just started, so there may be some terrible mistakes here.)
>
> With this base work:
>
> foreach(i ; 0 .. SIZE)
> {
>     results[i] = values1[i] * values2[i];
> }
>
> and then with these 3 other methods: parallel, spawn and Thread.
>
> These were my results:
>
> _base : 456 ms and 479 us
> _parallel : 331 ms, 324 us, and 4 hnsecs
> _concurrency : 367 ms, 348 us, and 2 hnsecs
> _thread : 369 ms, 565 us, and 3 hnsecs
>
> (code here : https://run.dlang.io/is/2pdmmk )
>
> All methods show only minor speedups. I was expecting a lot more.
> Since I have 7 cores, I expected something below 100 ms.
>
> I'm not seeing false sharing in this case, or am I wrong?
>
> If someone can expand on this, I'll be grateful.
>
> Thanks!

As SIZE = 1024*1024 (i.e. not much data, possibly well within L2 cache for 32-bit values), the concurrency overhead may account for a significant share of the total time. Also, the run.dlang.io link has no -O flag, and thus no optimisations.

Without -O I get:

_base : 323 ms, 92 μs, and 6 hnsecs
_parallel : 276 ms, 649 μs, and 3 hnsecs
_concurrency : 221 ms, 931 μs, and 7 hnsecs
_thread : 212 ms, 277 μs, and 3 hnsecs

With it I get:
_base : 150 ms, 728 μs, and 5 hnsecs
_parallel : 120 ms, 78 μs, and 5 hnsecs
_concurrency : 134 ms, 787 μs, and 4 hnsecs
_thread : 129 ms, 476 μs, and 2 hnsecs

With SIZE = 16*1024*1024, without -O, I get:
_base : 5 secs, 835 ms, 240 μs, and 9 hnsecs
_parallel : 4 secs, 802 ms, 279 μs, and 8 hnsecs
_concurrency : 2 secs, 133 ms, 685 μs, and 3 hnsecs
_thread : 2 secs, 108 ms, 860 μs, and 9 hnsecs

With SIZE = 16*1024*1024, with -O, I get:
_base : 2 secs, 502 ms, 523 μs, and 4 hnsecs
_parallel : 1 sec, 769 ms, 945 μs, and 3 hnsecs
_concurrency : 1 sec, 362 ms, 747 μs, and 1 hnsec
_thread : 1 sec, 335 ms, 720 μs, and 1 hnsec
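
For reference, the optimised runs correspond to command lines along these lines (a sketch; app.d is a placeholder name, and the flags besides -O are common additions rather than what was actually used):

dmd -O -release -inline app.d
ldc2 -O3 -release app.d

LDC often auto-vectorises loops like this more aggressively than DMD.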


February 19, 2018 - Dmitry Olshansky
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
> I'm experimenting with threads and related topics recently.
> (I've just started, so there may be some terrible mistakes here.)
>
> With this base work:
>
> foreach(i ; 0 .. SIZE)
> {
>     results[i] = values1[i] * values2[i];
> }
>
> and then with these 3 other methods: parallel, spawn and Thread.
>
> These were my results:
>
> _base : 456 ms and 479 us
> _parallel : 331 ms, 324 us, and 4 hnsecs
> _concurrency : 367 ms, 348 us, and 2 hnsecs
> _thread : 369 ms, 565 us, and 3 hnsecs
>
> (code here : https://run.dlang.io/is/2pdmmk )
>
> All methods show only minor speedups. I was expecting a lot more.
> Since I have 7 cores, I expected something below 100 ms.

The operation is trivial and the dataset is rather small. In such cases SIMD with e.g. array ops is the way to go:
results[] = values1[] * values2[];


Parallelism gets more interesting with more expensive operations. You may also try bigger sizes, or both.
>
> I'm not seeing false sharing in this case, or am I wrong?
>
> If someone can expand on this, I'll be grateful.
>
> Thanks!
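
As a rough way to time the array-op version suggested above against the scalar loop (the fill values and run count here are made up), std.datetime.stopwatch.benchmark can compare both:

import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

void main()
{
    enum SIZE = 1024 * 1024;
    auto values1 = new float[SIZE];
    auto values2 = new float[SIZE];
    auto results = new float[SIZE];
    values1[] = 2.0f;
    values2[] = 3.0f;

    // scalar loop vs. array op (the latter is free to use SIMD)
    void scalarLoop() { foreach (i; 0 .. SIZE) results[i] = values1[i] * values2[i]; }
    void arrayOp()    { results[] = values1[] * values2[]; }

    auto times = benchmark!(scalarLoop, arrayOp)(100);   // 100 runs each
    writeln("loop     : ", times[0]);
    writeln("array op : ", times[1]);
}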


February 19, 2018 - SrMordred
On Monday, 19 February 2018 at 05:49:54 UTC, Nicholas Wilson wrote:
> As SIZE = 1024*1024 (i.e. not much data, possibly well within L2 cache for 32-bit values), the concurrency overhead may account for a significant share of the total time.

That 'concurrency overhead' is what I'm not getting.
Since the array is big, dividing it across 6 or 7 cores will not thrash L1, since the chunks are very far from each other, right? Or is L2 cache thrashing also a problem in this case?

> _base : 150 ms, 728 μs, and 5 hnsecs
> _parallel : 120 ms, 78 μs, and 5 hnsecs
> _concurrency : 134 ms, 787 μs, and 4 hnsecs
> _thread : 129 ms, 476 μs, and 2 hnsecs
>

Yes, on my PC I was using -release.

Yet: 150 ms for 1 core, and 120-134 ms for X cores.
Shouldn't it be way faster? I'm trying to understand where the overhead is, and whether it's possible to get rid of it (perfect thread scaling).
February 19, 2018 - SrMordred
On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky wrote:
> The operation is trivial and the dataset is rather small. In such cases SIMD with e.g. array ops is the way to go:
> results[] = values1[] * values2[];

Yes, absolutely right :)

I made a simple example to understand why the threads are not scaling the way I thought they would.
I imagined that if one core's work is done in 200 ms, 4 cores' work would be done in 50 ms plus some overhead, since they are working on separate blocks of memory, with no need for synchronization, without false sharing, etc. (at least I think I don't have that problem here).
February 20, 2018 - Dmitry Olshansky
On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
> On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky wrote:
>> The operation is trivial and the dataset is rather small. In such cases SIMD with e.g. array ops is the way to go:
>> results[] = values1[] * values2[];
>
> Yes, absolutely right :)
>
> I made a simple example to understand why the threads are not scaling the way I thought they would.

Yeah, the world is an ugly place where trivial math sometimes doesn't work.

I suggest:
- running with different numbers of threads, from 1 to n
- varying the size from 100k to 10m
(a sketch of such a sweep follows this list)
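
One way to do that sweep (a sketch; the sizes, thread counts, and fill values below are illustrative), using an explicitly sized TaskPool for each worker count:

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.parallelism : TaskPool;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    foreach (size; [100_000, 1_000_000, 10_000_000])
    {
        auto a = new float[size], b = new float[size], r = new float[size];
        a[] = 2.0f;
        b[] = 3.0f;

        foreach (nThreads; 1 .. 9)
        {
            auto pool = new TaskPool(nThreads - 1);   // the calling thread also does work
            auto sw = StopWatch(AutoStart.yes);
            foreach (i; pool.parallel(iota(size)))
                r[i] = a[i] * b[i];
            sw.stop();
            pool.finish();
            writefln("size %9d, threads %d: %s", size, nThreads, sw.peek);
        }
    }
}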

For your numbers: 400 ms / 64 is ~6 ms; dividing by the number of cores gives 6/7 ≈ 0.86 ms, which is a good deal smaller than a CPU timeslice.

In essence, the single-core version runs fast because it doesn't wait for all the others to complete via join, easily burning its quota in one go. In the MT versions, I bet some of the overhead comes from not all threads finishing (and starting) at once, so the join blocks in the kernel.

You could run your MT code under strace to see if it hits the futex call or some such; if it does, that's where you are getting delays (that's assuming you are on Linux).
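
For example, assuming Linux and a binary called ./app (a made-up name):

strace -f -c -e trace=futex ./app

Here -f follows the worker threads, -c prints a per-syscall time summary, and the filter restricts the trace to futex calls, which is where joins and other waits end up.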

The std.parallelism version is a bit faster because, I think, it caches the created thread pool, so you don't start threads anew on each run.

> I imagined that if one core's work is done in 200 ms, 4 cores' work would be done in 50 ms plus some overhead, since they are working on separate blocks of memory, with no need for synchronization, without false sharing, etc. (at least I think I don't have that problem here).

If you had a long queue of small tasks like that, and you didn't join all the threads until absolutely required, you would get near-perfect scalability (unless you hit other bottlenecks like RAM).
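
A rough sketch of that idea with std.parallelism tasks (the chunk size and array setup are made up): enqueue all the small tasks up front and force them only once, at the very end:

import std.parallelism : task, taskPool;

__gshared float[] a, b, r;

void doChunk(size_t lo, size_t hi)
{
    foreach (i; lo .. hi)
        r[i] = a[i] * b[i];
}

void main()
{
    enum SIZE  = 16 * 1024 * 1024;
    enum CHUNK = 64 * 1024;

    a = new float[SIZE];
    b = new float[SIZE];
    r = new float[SIZE];
    a[] = 2.0f;
    b[] = 3.0f;

    alias ChunkTask = typeof(task!doChunk(size_t.init, size_t.init));
    ChunkTask[] pending;

    for (size_t lo = 0; lo < SIZE; lo += CHUNK)
    {
        auto t = task!doChunk(lo, lo + CHUNK > SIZE ? SIZE : lo + CHUNK);
        taskPool.put(t);        // enqueue only; nothing blocks here
        pending ~= t;
    }

    foreach (t; pending)        // join once, at the very end
        t.yieldForce;
}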