Thread overview
D and multicore
Nov 13, 2010
parallel noob
Nov 13, 2010
Gary Whatmore
Nov 14, 2010
Fawzi Mohamed
Nov 14, 2010
Fawzi Mohamed
Nov 13, 2010
Peter Alexander
Nov 13, 2010
Gary Whatmore
Nov 14, 2010
sybrandy
Nov 14, 2010
retard
Nov 14, 2010
Sean Kelly
November 13, 2010
Hello

Intro: people with pseudonyms are often considered trolls here, but this is a really honest question by a sw engineer now writing mostly sequential web applications.  (I write "parallel" web apps, but the logic goes that you write sequential applications for each http query, the frontend distributes queries among backend http processes, and the database "magically" ensures proper locking. There's hardly ever any locks in my code.)

D is touted as the next gen of multicore languages. I pondered between D and D.learn, where to ask this. It just strikes me odd that there isn't any kind of documentation explaining how I should write (parallel) code for multicore in D. If D is much different, the general guidelines for PHP web applications, Java, or Erlang might don't work. From what I've gathered from these discussions, there are:

 - array operations and auto-parallelization of loops
 - mmx/sse intrinsics via library
 - transactional memory (requires hardware support? doesn't work?)
 - "erlang style" concurrency? == process functions in Phobos 2?
 - threads, locks, and synchronization primitives

Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel, trass3r, dennis, and simen clearly have ideas how to work with parallel problems.

A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing and http://en.wikipedia.org/wiki/Parallel_programming_model

I fail to map these concepts discussed here with the things listed on those pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.

Sean mentioned:

"In the long term there may turn out to be better models, but I don't know of one today."

So he's basically saying that those others listed in the wikipedia pages are totally unsuitable for real world tasks? Only Erlang style message passing works?

The next machine I buy comes with 12 or 16 cores or even more -- this one has 4 cores. The typical applications I use take advantage of 1-2 threads. For example a cd ripper starts a new process for each mp3 encoder. The program runs at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more applications run in the browser. The browser actively uses one thread + one thread per applet. I can't even utilize more than 50% of the power of the current gen!

The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the core count doubles, the FPS rating in games almost doubles. They definitely are not running Erlang style processes (one for GUI, one for sounds, one for physics, one for network). That would leave 3150 cores unused.
November 13, 2010
parallel noob Wrote:

> Hello
> 
> Intro: people with pseudonyms are often considered trolls here, but this is a really honest question by a sw engineer now writing mostly sequential web applications.  (I write "parallel" web apps, but the logic goes that you write sequential applications for each http query, the frontend distributes queries among backend http processes, and the database "magically" ensures proper locking. There's hardly ever any locks in my code.)
> 
> D is touted as the next gen of multicore languages. I pondered between D and D.learn, where to ask this. It just strikes me odd that there isn't any kind of documentation explaining how I should write (parallel) code for multicore in D. If D is much different, the general guidelines for PHP web applications, Java, or Erlang might don't work. From what I've gathered from these discussions, there are:
> 
>  - array operations and auto-parallelization of loops
>  - mmx/sse intrinsics via library
>  - transactional memory (requires hardware support? doesn't work?)
>  - "erlang style" concurrency? == process functions in Phobos 2?
>  - threads, locks, and synchronization primitives
> 
> Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel, trass3r, dennis, and simen clearly have ideas how to work with parallel problems.
> 
> A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing and http://en.wikipedia.org/wiki/Parallel_programming_model
> 
> I fail to map these concepts discussed here with the things listed on those pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.
> 
> Sean mentioned:
> 
> "In the long term there may turn out to be better models, but I don't know of one today."
> 
> So he's basically saying that those others listed in the wikipedia pages are totally unsuitable for real world tasks? Only Erlang style message passing works?
> 
> The next machine I buy comes with 12 or 16 cores or even more -- this one has 4 cores. The typical applications I use take advantage of 1-2 threads. For example a cd ripper starts a new process for each mp3 encoder. The program runs at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more applications run in the browser. The browser actively uses one thread + one thread per applet. I can't even utilize more than 50% of the power of the current gen!
> 
> The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the core count doubles, the FPS rating in games almost doubles. They definitely are not running Erlang style processes (one for GUI, one for sounds, one for physics, one for network). That would leave 3150 cores unused.

This question would have been more appropriate in D.learn. I started my journey with http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601

The GPU is a different kind of beast. GPU uses CUDA/OpenCL.
November 13, 2010
On 13/11/10 7:04 PM, parallel noob wrote:
> The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the core count doubles, the FPS rating in games almost doubles. They definitely are not running Erlang style processes (one for GUI, one for sounds, one for physics, one for network). That would leave 3150 cores unused.

This is all to do with "what" you are actually parallelizing. The task of the GPU (typically transforming vertices and doing lighting calculations per pixel) is trivially parallelized.

If you did that task on CPU it would still be trivially parallelized (just spawn as many threads as you have cores and split up the screen's pixels between the threads).

Besides being optimized for this particular task, GPU's have no silver bullet for handling concurrency. Trying to get a GPU to parallelize a task that is not easily parallelizable (e.g. depth-first search on a graph) is just as hard as getting the CPU to do it. In this case, doubling the cores will not help.

GPU batching of tasks concurrency handles rendering well, but Erlang style message passing is a better model in the general case.
November 13, 2010
Peter Alexander Wrote:

> On 13/11/10 7:04 PM, parallel noob wrote:
> > The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the core count doubles, the FPS rating in games almost doubles. They definitely are not running Erlang style processes (one for GUI, one for sounds, one for physics, one for network). That would leave 3150 cores unused.
> 
> This is all to do with "what" you are actually parallelizing. The task of the GPU (typically transforming vertices and doing lighting calculations per pixel) is trivially parallelized.
> 
> If you did that task on CPU it would still be trivially parallelized (just spawn as many threads as you have cores and split up the screen's pixels between the threads).
> 
> Besides being optimized for this particular task, GPU's have no silver bullet for handling concurrency. Trying to get a GPU to parallelize a task that is not easily parallelizable (e.g. depth-first search on a graph) is just as hard as getting the CPU to do it. In this case, doubling the cores will not help.
> 
> GPU batching of tasks concurrency handles rendering well, but Erlang style message passing is a better model in the general case.

He/she also doesn't really get the idea of GPU. Writing programs for the GPU is really simple. Just use a C enriched with parallel arrays. Few changes here and there. The old C program now runs 200 times as fast if it's graphics processing (z-buffer filtering, blur, brightness correction). When you need something more complex, the control becomes hard and locks don't always work. You're left with academic non-concepts and Erlang style message passing.
November 14, 2010
> Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel, trass3r, dennis, and simen clearly have ideas how to work with parallel problems.

Ideas: Yes.
Expert: No.  I haven't exercised it much as I haven't been able to do it at work.  Others here have shown that they have more knowledge than I.

> Sean mentioned:
>
> "In the long term there may turn out to be better models, but I don't know of one today."
>
> So he's basically saying that those others listed in the wikipedia pages are totally unsuitable for real world tasks? Only Erlang style message passing works?

It's the easiest to understand for the most part as it's very similar to how the real world works.  Think about a coffee shop: two processes/people, let's call them Alice (customer) and Bob (barrista), are in the coffee shop.  Alice is ordering coffee and Bob is making coffee.  Alice tells Bob that she wants a super-duper-extra-foamy-mocha-hazel-choco-latte.  E.g. she sends a message to Bob.  Bob then does his magic and sends a message to Alice by handing her overpriced coffee.

So, from an understandability standpoint, it's very simple to understand.  There are also other benefits that are best expressed in Joe Armstrong's thesis, which you can easily find online.  He is one of the creators of Erlang and his thesis describes many of the benefits of the architecture.

Is it the best for all problems?  No.  For example, there are better tools for parallelizing loops.  The Actor Model, which is what Erlang uses, doesn't work very well for this.  There are other methods as well, but outside of auto parellelization and the actor model, they are harder to implement correctly, IIRC.  There may be a couple I don't really know about.

Now, will any of these provide you with instant benefits in your apps? Probably not.  Most software today was designed to run on one core and therefore doesn't use multiple cores very well.  Firefox (3.6 and later), Chrome, IE 9, and I think the latest version of Opera all use multiple threads now mainly to keep certain pages/plugins from killing the entire browser, though you may notice some speedups just from that.

However, this does not mean that multiple cores are not benefiting you much.  The OS is (should be) smart enough to distribute multiple apps across the cores, so you can have your browser and your word processor working hard and not have either one prevent the other one from getting any CPU time.  The problem is that most of us don't have more than 2-4 apps working hard at any time, so past 4 cores you probably won't be getting much benefit in this respect.  As threading becomes more popular and threaded apps are better designed, we should see a trend of better CPU usage on the desktop.

Casey
November 14, 2010
Not really an answer to your questions, but according to my google reader this appeared very few minutes ago

http://lambda-the-ultimate.org/node/4134

I don't think we've heard the last word on this domain of programming.
November 14, 2010
On 13-nov-10, at 22:23, Gary Whatmore wrote:

> parallel noob Wrote:
>
>> Hello
>>
>> Intro: people with pseudonyms are often considered trolls here, but this is a really honest question by a sw engineer now writing mostly sequential web applications.  (I write "parallel" web apps, but the logic goes that you write sequential applications for each http query, the frontend distributes queries among backend http processes, and the database "magically" ensures proper locking. There's hardly ever any locks in my code.)
>>
>> D is touted as the next gen of multicore languages. I pondered between D and D.learn, where to ask this. It just strikes me odd that there isn't any kind of documentation explaining how I should write (parallel) code for multicore in D. If D is much different, the general guidelines for PHP web applications, Java, or Erlang might don't work. From what I've gathered from these discussions, there are:
>>
>> - array operations and auto-parallelization of loops
>> - mmx/sse intrinsics via library
>> - transactional memory (requires hardware support? doesn't work?)
>> - "erlang style" concurrency? == process functions in Phobos 2?
>> - threads, locks, and synchronization primitives
>>
>> Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel, trass3r, dennis, and simen clearly have ideas how to work with parallel problems.
>>
>> A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing and http://en.wikipedia.org/wiki/Parallel_programming_model
>>
>> I fail to map these concepts discussed here with the things listed on those pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.
>>
>> Sean mentioned:
>>
>> "In the long term there may turn out to be better models, but I don't know of one today."
>>
>> So he's basically saying that those others listed in the wikipedia pages are totally unsuitable for real world tasks? Only Erlang style message passing works?
>>
>> The next machine I buy comes with 12 or 16 cores or even more -- this one has 4 cores. The typical applications I use take advantage of 1-2 threads. For example a cd ripper starts a new process for each mp3 encoder. The program runs at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more applications run in the browser. The browser actively uses one thread + one thread per applet. I can't even utilize more than 50% of the power of the current gen!
>>
>> The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the core count doubles, the FPS rating in games almost doubles. They definitely are not running Erlang style processes (one for GUI, one for sounds, one for physics, one for network). That would leave 3150 cores unused.

there are different kinds of parallel problems, some are trivially, or almost trivially parallel, other are less parallel.
Some tasks are very quick (one talks of micro parallelism), other are much more coarse.

typical code has a limited parallelization potential, out of order execution of modern processors tries to take advantage of this, but normally having a lot of execution hardware is not useful because there is a limited amount of instruction level parallelism (ILP) is limited.
There is an important exception: vector operations. So processor often have vector hardware to do them efficiently. Compiler to take advantage of them vectorize loops.
Array operations are a class of operations (that include vector operations) that are often very parallel.
If one for example wants to apply a pure operations on an array this is trivially parallel.
Data parallel languages are especially good to express this kind of parallelism.

GPU are optimized graphical operations which are mainly kind of vector and array operations and thus have a large amount of this kind of parallelization. This is also present in some scientific programs, and indeed GPU (with CUDA or openCL) are increasingly used for that.

The more coarse levels of parallelization use other means.
In my opinion shared memory parallelization can be done efficiently if on is able to treat independent recursive tasks.
Recursive task (that come form example from divide & conquer approaches, and for example can be used to perform array operations) can be evaluated efficiently evaluating subtasks first, and stealing supertasks keeping into account the locality of processors (cilk has a such an approach).

Independent tasks can be represented by threads, should be executed fairly, and work well to represent a single interacting object or different requests for a web server.
All OS give support for this, as threads (unlike processes) share memory one has to take care that changes from one thread and another are done in a meaningful way. To achieve this there are locks, atomic ops,...
Transactional memory works for changes that should be done atomically (the big problem is that if something fails one has to undo everything and retry again, something that becomes more and more likely the longer the transaction).

What I tried to achieve with blip.parallel.smp is to accommodate reasonably well both kinds of parallelism.
Several higher level abstractions can then be built using this framework.
I feel that it is important to handle this centrally because a processor is used optimally when each core has a thread.
Actually to hide the latency introduced by some operations that would make a processor waste cycles, some processors keep two active threads and switch quickly from one to the other when one stalls. This is called hyperthreading, and while it doesn't improve the performance of a single thread (actually it is worse) it optimizes the throughput and and usage of the execution units of a single core.

I think that almost all shared memory parallelization approaches can be implemented reasonably efficiently on the top of tasks, possibly using some locality hints for the initial distribution.

PGAS (Partitioned global address space) tries to allow one to have a global view, while still having local storages, each process can access both his local and remote just indexing. the conceptual advantage of this is that one can easily migrate to it starting from a local view. The hope is that one can optimize the layout and patterns used to access the memory later and at least partially independently from the algorithm used. At least in some cases it is possible.

In distributed memory approaches each process has its own memory space, and cannot access directly the memory of other processes.
This is potentially more complex than PGAS/shared memory approaches, because one has to explicitly transfer data between different processes to communicate. Normally one uses *messages* to do it.
The big advantage of doing this is that the programmer has to think more explicitly about the potential costly communication operations (if one has a latency of few ms this is still 10^6 cycles of a typical processor), and possibly optimize it better.

The distributed memory space scales very well, as one can create easily large clusters of computers.
MPI (message passing interface) uses this approach.
But actually mpi is more about having collective communication patterns efficiently implemented and being able to create optimal subsets to do subwork. MPI is the correct choice if you have a complex problem (also for the parallelization) and you want to efficiently use *all* resources available.

Sometime the problem you have is not so costly that you need to commit all resources to it, you just want to solve it efficiently and if possible taking advantage of the parallelization.
In this case a good model is the actor model where objects communicate with messages to each other.
One can have thread objects with mailboxes and pattern matching to select the message, or objects with an interface and a remote procedure call to invoke them.
You can organize the network of messages in several ways, you can have a central server, and clients that connect, you can have a central database to communicate, you can have a peer to peer structure, you can have producer/consumer relationships.
Normally given a problem one can see how to partition it optimally.

I use blip.parallel.rpc to give that kind of messaging between objects.
Note that one has to think about failure of one part in  this model, not necessarily a failure of one process should stop all processes (in some cases it might even be undetected).
At this level one could theoretically migrate processes/objects automatically, but given that the latency increase can be very large (~10^6) this automatic distribution is doable only for tasks that were considered by the programmer, a fully automatic redistribution of any object is not realistic.

I hope this overview helps a bit
Fawzi
November 14, 2010
parallel noob Wrote:
> 
> I fail to map these concepts discussed here with the things listed on those pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.
> 
> Sean mentioned:
> 
> "In the long term there may turn out to be better models, but I don't know of one today."
> 
> So he's basically saying that those others listed in the wikipedia pages are totally unsuitable for real world tasks? Only Erlang style message passing works?

I was saying that message-passing is the most easily understandable concurrency model.  But answering the rest of your question is somewhat tricky.  Data parallelism (ie. OpenMP) is the optimal approach for processing large datasets, but data parallelism can be achieved (at an API level) in a variety of ways, from POSIX Threads to message passing.  A specialized API is best though, and in fact there's a data parallelism package under consideration for Phobos right now.  I don't know if that clarifies or confuses things however.

> The next machine I buy comes with 12 or 16 cores or even more -- this one has 4 cores. The typical applications I use take advantage of 1-2 threads. For example a cd ripper starts a new process for each mp3 encoder. The program runs at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more applications run in the browser. The browser actively uses one thread + one thread per applet. I can't even utilize more than 50% of the power of the current gen!

Software takes a long time to change.  And the apps you picked aren't easily parallelizable anyway.  I think Russel is exactly right when he says that future systems will have a collection of heterogeneous cores... a few traditional ones for apps like the above (say 16) and clusters of others for different purposes.  Seems like the trick will be dispatching work to the appropriate one.

> The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the core count doubles, the FPS rating in games almost doubles. They definitely are not running Erlang style processes (one for GUI, one for sounds, one for physics, one for network). That would leave 3150 cores unused.

Graphics are an ideal application for data parallelism.  But those cores are all organizing their operations via messages passed on the system bus :-)  I know, that's kind of a lame out, even if it's true.
November 14, 2010
On 14-nov-10, at 02:44, Fawzi Mohamed wrote:

> [...]
> Sometime the problem you have is not so costly that you need to commit all resources to it, you just want to solve it efficiently and if possible taking advantage of the parallelization.

> In this case a good model is the actor model where objects communicate with messages to each other.
> One can have thread objects with mailboxes and pattern matching to select the message, or objects with an interface and a remote procedure call to invoke them.
> You can organize the network of messages in several ways, you can have a central server, and clients that connect, you can have a central database to communicate, you can have a peer to peer structure, you can have producer/consumer relationships.
> Normally given a problem one can see how to partition it optimally.
>
> I use blip.parallel.rpc to give that kind of messaging between objects.
> Note that one has to think about failure of one part in  this model, not necessarily a failure of one process should stop all processes (in some cases it might even be undetected).
> At this level one could theoretically migrate processes/objects automatically, but given that the latency increase can be very large (~10^6) this automatic distribution is doable only for tasks that were considered by the programmer, a fully automatic redistribution of any object is not realistic.

I forgot to say that if that part of your problem fits a simple parallelization pattern, for example costly basically independent tasks, for which you might (for example) use a client/server approach, or a data parallel approach with huge amount of distributed data on which you want to apply something like map/reduce.
In these examples the actor model might work well allowing one to be more dynamic about the work distribution than MPI (that has no nice partial failure), and still use all system resources (in fact I use it exactly for those reasons).