Thread overview
August 03, 2014 Threadpools, difference between DMD and LDC
I'm trying to grok message passing. This is my very first foray into it, so I'm probably making every mistake in the book :-) I wrote a small threadpool test; it's here: http://dpaste.dzfl.pl/3d3a65a00425

I'm playing with the number of threads and the number of tasks to get a feel for how message passing works. I must say I quite like it: it's a bit like suddenly being able to safely return different types from a function.

What I don't get is the difference between DMD (I'm using 2.065) and LDC (0.14-alpha1). For DMD, I compile with

-O -inline -noboundscheck

For LDC, I use

-O3 -inline

LDC gives me smaller executables than DMD (also, 3 to 5 times smaller than 0.13's, good job!), but above all else it's incredibly, astoundingly faster. I'm used to LDC producing 20-30% faster programs, but here it's 1000 times faster!

8 threads, 1000 tasks: DMD: 4000 ms, LDC: 3 ms (!)

So my current hypothesis is that a) I'm doing something wrong, or b) the tasks are optimized away or something. Can someone confirm the results and tell me what I'm doing wrong?
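For context, the message-passing pattern being benchmarked can be sketched with std.concurrency along these lines. This is a minimal, hypothetical reconstruction, not the actual code from the dpaste link; the function names and the 0-means-stop protocol are made up for illustration:

```d
import std.concurrency;
import std.stdio;

// The per-task work: sum of 0 .. goal-1.
long taskSum(int goal)
{
    long sum = 0;
    foreach (i; 0 .. goal)
        sum += i;
    return sum;
}

// Worker: receives int goals, sends each sum back; a goal of 0 means stop.
void worker(Tid owner)
{
    bool done = false;
    while (!done)
    {
        receive((int goal) {
            if (goal == 0)
                done = true;
            else
                owner.send(taskSum(goal));
        });
    }
}

void main()
{
    auto tid = spawn(&worker, thisTid);
    tid.send(10);
    writeln(receiveOnly!long()); // 45, the sum of 0 .. 9
    tid.send(0); // tell the worker to stop
}
```

The typed `receive` handler is what makes this feel like "safely returning different types": the worker can dispatch on whatever message type arrives.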
August 03, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to Philippe Sigaud

On Sunday, 3 August 2014 at 19:52:42 UTC, Philippe Sigaud wrote:
>
> Can someone confirm the results and tell me what I'm doing wrong?

LDC is likely optimizing the summation:

int sum = 0;
foreach(i; 0..task.goal)
    sum += i;

to something like:

int sum = cast(int)(cast(ulong)(task.goal-1)*task.goal/2);
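The two forms can be checked against each other directly. Here is a small sketch (with `task.goal` replaced by a plain parameter), including the overflow case, since both the wrapping loop and the closed form truncated to int agree modulo 2^32:

```d
// Naive loop, as in the benchmark task.
int sumLoop(int goal)
{
    int sum = 0;
    foreach (i; 0 .. goal)
        sum += i;
    return sum;
}

// The closed form LLVM substitutes: goal*(goal-1)/2, computed in ulong
// and then truncated to int, matching the loop's wrap-around behaviour.
int sumClosedForm(int goal)
{
    return cast(int)(cast(ulong)(goal - 1) * goal / 2);
}

void main()
{
    import std.stdio : writeln;
    foreach (goal; [0, 1, 10, 1000, 1_000_000])
        assert(sumLoop(goal) == sumClosedForm(goal));
    writeln("loop and closed form agree"); // includes the int-overflow case
}
```

With the loop replaced by a constant-time expression, the "work" per task becomes a handful of instructions, which explains the milliseconds-vs-seconds gap.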
August 03, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to safety0ff

On Sunday, 3 August 2014 at 22:24:22 UTC, safety0ff wrote:
> On Sunday, 3 August 2014 at 19:52:42 UTC, Philippe Sigaud wrote:
>>
>> Can someone confirm the results and tell me what I'm doing wrong?
>
> LDC is likely optimizing the summation:
>
> int sum = 0;
> foreach(i; 0..task.goal)
> sum += i;
>
> To something like:
>
> int sum = cast(int)(cast(ulong)(task.goal-1)*task.goal/2);
This is correct – the LLVM optimizer indeed gets rid of the loop completely.
Although I'd be more than happy to be able to claim a thousandfold speedup over DMD on real-world applications. ;)
Cheers,
David
August 04, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to David Nadlinger

> This is correct – the LLVM optimizer indeed gets rid of the loop completely.

OK, that's clever. But I get this even when I put a writeln("some msg") inside the task. I thought a write couldn't be optimized away like that, and that it's a slow operation?
Anyway, I discovered Thread.wait() in core in the meantime; I'll use that. I just wanted tasks that take a different amount of time on each run.
I have another question: it seems I can spawn hundreds of threads (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there a limit to the number of threads? I tried a threadpool because I feared my application would have to spawn ~100-200 threads, but if that's not a problem, I can drastically simplify my code. Is spawning a thread a slow operation in general?
August 04, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to Philippe Sigaud

On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via Digitalmars-d-learn wrote:
>
> I have another question: it seems I can spawn hundreds of threads
> (Heck, even 10_000 is accepted), even when I have 4-8 cores. Is there:
> is there a limit to the number of threads? I tried a threadpool
> because in my application I feared having to spawn ~100-200 threads
> but if that's not the case, I can drastically simplify my code.
> Is spawning a thread a slow operation in general?
Without going into much detail: threads are heavy, and creating a thread is an expensive operation (which is partly why virtually every standard library includes a thread pool). Along with the cost of creating the thread, you also get the overhead of additional context switches for each thread you have actively running. Context switches are expensive and a significant waste of time: your CPU sits there doing effectively nothing while the OS decides which thread runs next and restores its context. Even if 10,000 threads doesn't hit a hard limit on thread count, it will impose very significant overhead.
I haven't looked at your code in detail, but consider using the TaskPool if you just want to schedule some tasks to run among a few threads, or potentially using Fibers (which are fairly lightweight) instead of threads.
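To illustrate why fibers are lighter: they are cooperatively scheduled within one thread, so "switching" is just a user-space stack swap, with no OS scheduler involved. A minimal sketch using core.thread's Fiber follows; the round-robin loop here is a toy scheduler for demonstration, not a recommended design:

```d
import core.thread;

// Run n fibers that each increment a shared counter twice, yielding
// in between; returns the final counter value. Everything runs on the
// calling thread, so no synchronization is needed.
int runFibers(int n)
{
    int counter = 0;
    auto fibers = new Fiber[n];
    foreach (ref f; fibers)
        f = new Fiber({
            counter++;
            Fiber.yield();   // hand control back to our mini-scheduler
            counter++;
        });

    // Toy round-robin scheduler: keep resuming every unfinished fiber.
    bool anyRunning = true;
    while (anyRunning)
    {
        anyRunning = false;
        foreach (f; fibers)
            if (f.state != Fiber.State.TERM)
            {
                f.call();    // runs the fiber until it yields or returns
                anyRunning = true;
            }
    }
    return counter;
}

void main()
{
    import std.stdio : writeln;
    writeln(runFibers(4)); // 8: each of the 4 fibers increments twice
}
```

Because fibers only switch when they call yield(), tens of thousands of them are feasible where the same number of OS threads would not be.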
August 04, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to Philippe Sigaud

On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via Digitalmars-d-learn wrote:
>> This is correct – the LLVM optimizer indeed gets rid of the loop completely.
>
> OK,that's clever. But I get this even when put a writeln("some msg")
> inside the task. I thought a write couldn't be optimized away that way
> and that it's a slow operation?
You need the _result_ of the computation for the writeln. LLVM's optimizer recognizes what the loop tries to compute, though, and replaces it with an equivalent expression for the sum of the series, as Trass3r alluded to.
Cheers,
David
August 04, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to Kapps

> Without going into much detail: Threads are heavy, and creating a thread is an expensive operation (which is partially why virtually every standard library includes a ThreadPool).
> I haven't looked into detail your code, but consider using the TaskPool if you just want to schedule some tasks to run amongst a few threads, or potentially using Fibers (which are fairly light-weight) instead of Threads.

OK, I get it. Just to be sure, there is no ThreadPool in Phobos or in core, right? IIRC, there are fibers somewhere in core; I'll have a look. I also heard that vibe.d has them.
August 04, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to Philippe Sigaud

On Monday, 4 August 2014 at 12:05:31 UTC, Philippe Sigaud via Digitalmars-d-learn wrote:
> OK, I get it. Just to be sure, there is no ThreadPool in Phobos or in
> core, right?
> IIRC, there are fibers somewhere in core, I'll have a look. I also
> heard the vibe.d has them.

There is. It's called taskPool, though:

http://dlang.org/phobos/std_parallelism.html#.taskPool
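A minimal sketch of what using std.parallelism's taskPool looks like (the summing function and the argument values are illustrative, not from the benchmark code):

```d
import std.parallelism;
import std.stdio;

// Sum of 0 .. goal-1, standing in for the benchmark's per-task work.
long sumTo(int goal)
{
    long sum = 0;
    foreach (i; 0 .. goal)
        sum += i;
    return sum;
}

void main()
{
    // taskPool is a lazily initialised global pool with totalCPUs - 1
    // worker threads; tasks are queued rather than given a thread each.
    auto t = task!sumTo(1000);
    taskPool.put(t);          // schedule the task on the pool
    writeln(t.yieldForce);    // block until it finishes; prints 499500

    // Parallel foreach over a range, also backed by taskPool:
    auto results = new long[8];
    foreach (ref r; parallel(results))
        r = sumTo(1000);
    writeln(results[0]);      // 499500
}
```

This gives the thread-pool behaviour the original post was hand-rolling: a fixed set of workers sized to the machine, with however many tasks queued on top.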
August 04, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to Chris Cain

On Mon, Aug 4, 2014 at 2:13 PM, Chris Cain via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:
>> OK, I get it. Just to be sure, there is no ThreadPool in Phobos or in core, right?
>
> There is. It's called taskPool, though:
>
> http://dlang.org/phobos/std_parallelism.html#.taskPool

Ah, std.parallelism. I stoopidly searched in std.concurrency and core.* Thanks!
August 04, 2014 Re: Threadpools, difference between DMD and LDC
Posted in reply to Philippe Sigaud

On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via Digitalmars-d-learn wrote:
> I have another question: it seems I can spawn hundreds of threads
> (Heck, even 10_000 is accepted), even when I have 4-8 cores. Is there
> a limit to the number of threads? I tried a threadpool
> because in my application I feared having to spawn ~100-200 threads
> but if that's not the case, I can drastically simplify my code.
> Is spawning a thread a slow operation in general?

Most likely those threads either do nothing or are short-lived, so you don't actually get 10,000 threads running simultaneously. In general you should expect your operating system to start stalling at a few thousand concurrent threads competing for context switches and system resources.

Creating a new thread is a rather costly operation, though you may not spot it in synthetic snippets, only under actual load. The modern default approach is to have a number of "worker" threads equal or close to the number of CPU cores, and to handle internal scheduling manually via fibers or some similar solution.

If you are totally new to the topic of concurrent services, getting familiar with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
Copyright © 1999-2021 by the D Language Foundation