March 24, 2011
On 3/24/2011 3:23 AM, bearophile wrote:
> dsimcha:
>
>> and apologize for getting defensive at times.
>
> It happens to mammals, don't worry.
>
>
>> The new docs are at
>> http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .
>
>>     real getTerm(int i) {
>>         immutable x = (i - 0.5) * delta;
>>         return delta / (1.0 + x * x);
>>     }
>>     immutable pi = 4.0 * taskPool.reduce!"a + b"(
>>         std.algorithm.map!getTerm(iota(n))
>>     );
>
> For the examples I suggest using q{a + b} instead of "a + b".

I tried to keep it as consistent as possible with std.algorithm.
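For reference, the quoted example becomes self-contained once n and delta are filled in (the values below are assumptions for illustration; the midpoint rule needs delta = 1.0 / n):

```d
import std.algorithm, std.parallelism, std.range, std.stdio;

void main() {
    // Midpoint-rule approximation of pi; n and delta are assumed
    // values, not part of the quoted docs.
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;

    real getTerm(int i) {
        immutable x = (i - 0.5) * delta;
        return delta / (1.0 + x * x);
    }

    // std.algorithm.map is lazy; taskPool.reduce pulls terms from it
    // and sums them in parallel across the pool's worker threads.
    immutable pi = 4.0 * taskPool.reduce!"a + b"(
        std.algorithm.map!getTerm(iota(n))
    );
    writeln(pi);
}
```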
>
> When D gains a good implementation of conditional purity, I think taskPool.reduce and taskPool.map may accept only pure functions to map and pure iterables to work on.

Eventually, maybe.  Definitely not now, though, because in practice it would severely affect usability.

> In the module documentation I'd like to see a graph that shows how the parallel map/reduce/foreach scale as the number of cores goes from 1 to 2 to 4 to 8 (or more) :-)

Unfortunately I don't have access to this kind of hardware except at work.
March 24, 2011
On 3/24/2011 3:29 AM, Sönke Ludwig wrote:
> Hm, depending on the way the pool is used, it might be a better default
> to have the number of threads equal to the number of CPU cores. In my
> experience the control thread is mostly either waiting for tasks or
> processing messages and blocking in between, so it rarely uses a full
> core, wasting the available computation time in this case.

It's funny: the task parallelism stuff seems to be getting much more attention from the community than the data parallelism stuff.  I hardly ever use task parallelism and mostly use data parallelism. I'm inclined to leave this as-is because:

1.  It's definitely the right answer for data parallelism and the task parallelism case is much less obvious.

2.  The main thread is utilized in the situation you describe.  As I mentioned in a previous post, when a task that has not yet been started by a worker thread is forced, it is executed immediately in the thread that tried to force it, regardless of its position in the queue.  There are two reasons for this:

    a.  It guarantees that there won't be any deadlocks where a task waits for another task that's behind it in the queue.

    b.  If you're trying to force a task, then you obviously need the results ASAP, so it's an ad-hoc form of prioritization.
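In code, that behavior looks roughly like this (a sketch against the draft API; slowSquare is a placeholder for real work):

```d
import std.parallelism, std.stdio;

int slowSquare(int x) {
    // Placeholder for real work.
    return x * x;
}

void main() {
    auto t = task!slowSquare(21);
    taskPool.put(t);  // queue the task on the pool

    // If no worker has started t yet, forcing it runs it inline in
    // this thread, jumping the queue. That both rules out deadlocks
    // (a task blocking on a task behind it in the queue) and acts as
    // ad-hoc prioritization: forcing means you need the result now.
    writeln(t.yieldForce);  // 441
}
```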
>
> However, I'm not really sure if it is like this for the majority of all
> applications or if there are more cases where the control thread will
> continue to do computations in parallel. Maybe we could collect some
> opinions on this?
>
> On another note, I would like to see a rough description of what the
> default workUnitSize is, depending on the size of the input. Otherwise it
> feels rather uncomfortable to use this version of parallel().

Hmm, this was there in the old documentation.  Andrei recommended against documenting it for one of the cases because it might change.  I can tell you that, right now, it's:

1.  Whatever workUnitSize would create TaskPool.size * 4 work units, if the range has a length.

2.  512 if the range doesn't have a length.
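In code, the two cases look something like this (the defaults above are implementation details that may change; passing an explicit workUnitSize sidesteps the question entirely):

```d
import std.parallelism, std.range;

void main() {
    auto nums = iota(100_000);

    // nums has a length, so the default workUnitSize is chosen to
    // produce roughly TaskPool.size * 4 work units.
    foreach (i; taskPool.parallel(nums)) {
        // ... process i ...
    }

    // Explicit work unit size: each worker grabs 1000 consecutive
    // iterations at a time, regardless of the default heuristic.
    foreach (i; taskPool.parallel(nums, 1000)) {
        // ... process i ...
    }
}
```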
>
> Another small addition would be to state whether the object returned by
> asyncBuf is an InputRange and which useful methods it might have
> (some kind of progress counter could also be useful here).

I guess this could be a little clearer, but it's really just a plain vanilla input range that has a length iff the source range has a length.  There are no other public methods.
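For example (a sketch; input.txt is hypothetical, and byLine has no length, so neither does the buffered range):

```d
import std.algorithm, std.parallelism, std.stdio;

void main() {
    // byLine reuses its internal buffer, so dup each line before it
    // crosses into the background read-ahead thread.
    auto lines = std.algorithm.map!"a.idup"(File("input.txt").byLine());

    // asyncBuf reads ahead in a worker thread and returns a plain
    // input range: empty, front, popFront, and length iff the source
    // range has a length. No other public methods.
    auto buffered = taskPool.asyncBuf(lines);

    foreach (line; buffered) {
        // Consume in the main thread while the pool keeps reading.
        writeln(line);
    }
}
```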

March 24, 2011
On 3/24/2011 8:35 AM, spir wrote:
> On 03/24/2011 05:32 AM, dsimcha wrote:
>> [...]
>>
>> The new docs are at
>> http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html .
>
> About the doc: very good. I could understand most of it, while knowing
> nearly nothing about parallelism prior to reading.
> 2 details:
> * highlight key words only on first occurrence (bold online)
> * wrong doc for Task.isPure (gets a copy of Task.args' doc)
>
> Denis

Task.isPure isn't even supposed to be public.  I didn't notice that it had slipped into the docs.  WTF, DDoc?
March 24, 2011
On 3/24/2011 8:03 AM, Michel Fortin wrote:
> On 2011-03-24 03:29:52 -0400, Sönke Ludwig
> <ludwig@informatik.uni-luebeck.de> said:
>
>> Hm, depending on the way the pool is used, it might be a better default
>> to have the number of threads equal to the number of CPU cores. In my
>> experience the control thread is mostly either waiting for tasks or
>> processing messages and blocking in between, so it rarely uses a full
>> core, wasting the available computation time in this case.
>>
>> However, I'm not really sure if it is like this for the majority of
>> all applications or if there are more cases where the control thread
>> will continue to do computations in parallel. Maybe we could collect
>> some opinions on this?
>
> The current default is good for command line applications where the main
> thread generally blocks while you're doing your work. The default you're
> proposing is good when you're using the task pool to pile up tasks to
> perform in the background, which is generally what you do in an
> event-driven application. The current default keeps things simpler for
> the simpler programs, which are more linear in nature.
>
> My use case is like yours: an event-driven main thread which starts tasks
> to be performed in the background.
>

Please review the changes carefully, then, because this is a use case I know next to nothing about and didn't design for.  I added a few (previously discussed) things to help this use case, at your request. BTW, one hint that I'd like to mention (not sure how to work it into the docs) is that, if you **don't** want to execute a task in the current thread when it hasn't been started, but do want its return value once it's done (as may be the case when you need threading for responsiveness, not throughput), you can call Task.done to check whether it's done, and then Task.spinForce() to get the return value once you know it's already done.
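In code, that hint looks something like this (a sketch; render is a placeholder, and the polling loop stands in for a real event loop):

```d
import std.parallelism;

int render(int frame) {
    return frame * 2;  // placeholder work
}

void main() {
    auto t = task!render(1);
    taskPool.put(t);

    // Responsiveness-oriented pattern: never block this thread on the
    // task, and never execute the task here either.
    while (!t.done) {
        // ... dispatch UI or network events here ...
    }

    // t.done returned true, so spinForce just fetches the return
    // value; it won't execute the task or busy-wait.
    auto result = t.spinForce;
    assert(result == 2);
}
```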
March 24, 2011
On 2011-03-24 09:46:01 -0400, dsimcha <dsimcha@yahoo.com> said:

> Please review the changes carefully, then, because this is a use case I know next to nothing about and didn't design for.

Well, it's practically the same thing except you never want to execute a task in the main thread, because the main thread acts more like a coordinator for various things and the coordinator must stay responsive.

And since your main thread might be creating various kinds of tasks, you need a way to prioritize some tasks over others. I think creating a few task pools with various priorities and relying on OS thread scheduling would be adequate in most cases, but I haven't tried.
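A sketch of that idea, assuming the draft's TaskPool constructor and finish() (the pool sizes and the quick/heavy functions are arbitrary placeholders):

```d
import std.parallelism;

int quick(int x) { return x + 1; }   // placeholder urgent work
int heavy(int x) { return x * x; }   // placeholder background work

void main() {
    // One pool per priority class; the OS thread scheduler arbitrates
    // between the two sets of worker threads.
    auto urgent     = new TaskPool(2);
    auto background = new TaskPool(2);

    auto a = task!quick(1);
    auto b = task!heavy(7);
    urgent.put(a);
    background.put(b);

    assert(a.yieldForce == 2);
    assert(b.yieldForce == 49);

    // Shut the extra pools down once their work is done.
    urgent.finish();
    background.finish();
}
```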

One thing I'd want to be sure of, however, is that you can use a parallel foreach from within a task. So if you have one or two tasks that could benefit from data parallelism, it won't bring the whole system down. From the API I don't think it'll be a problem.


> I added a few (previously discussed) things to help this use case, at your request. BTW, one hint that I'd like to mention (not sure how to work it into the docs) is that, if you **don't** want to execute a task in the current thread if it's not started, but want its return value if it's done (as may be the case when you need threading for responsiveness, not throughput), you can call Task.done to check if it's done, and then Task.spinForce() to get the return value after you know it's already done.

The problem with using Task.done that way is that it requires polling. It might be appropriate in some cases, but in general you want to receive a message telling you when the task is done. That's not really complicated, however: all the task has to do is send back its result through std.concurrency's "send" or through some other event dispatch mechanism instead of through a return statement.

So I hope your tasks can accept "void" as a return type.
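A sketch of that pattern (assuming send works from a pool thread; computeAndSend is hypothetical):

```d
import std.concurrency, std.parallelism;

// The task's return type is void; the result travels back through
// std.concurrency instead of through a return statement.
void computeAndSend(Tid owner, int x) {
    send(owner, x * x);
}

void main() {
    auto t = task!computeAndSend(thisTid, 7);
    taskPool.put(t);

    // The event-driven thread blocks in its ordinary message loop and
    // receives the result as a message; no polling of Task.done.
    auto result = receiveOnly!int();
    assert(result == 49);
}
```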


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

March 24, 2011
On 3/24/2011 10:34 AM, Michel Fortin wrote:
>
> One thing I'd want to be sure of, however, is that you can use a parallel
> foreach from within a task. So if you have one or two tasks that could
> benefit from data parallelism, it won't bring the whole system down. From
> the API I don't think it'll be a problem.

Right.  This is completely do-able and I've done it before in practice.
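For example (a sketch; normalize is hypothetical):

```d
import std.parallelism;

// A task body that itself uses data parallelism: the inner parallel
// foreach spreads its iterations across the pool's worker threads.
void normalize(real[] data, real factor) {
    foreach (ref x; taskPool.parallel(data)) {
        x /= factor;
    }
}

void main() {
    auto data = new real[1000];
    data[] = 42.0;

    // Running normalize as a task doesn't serialize it: its parallel
    // foreach still uses the whole pool.
    auto t = task!normalize(data, 2.0);
    taskPool.put(t);
    t.yieldForce;
    assert(data[0] == 21.0);
}
```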

>
>
>> I added a few (previously discussed) things to help this use case, at
>> your request. BTW, one hint that I'd like to mention (not sure how to
>> work it into the docs) is that, if you **don't** want to execute a
>> task in the current thread if it's not started, but want its return
>> value if it's done (as may be the case when you need threading for
>> responsiveness, not throughput), you can call Task.done to check if
>> it's done, and then Task.spinForce() to get the return value after you
>> know it's already done.
>
> The problem with using Task.done that way is that it requires polling.
> It might be appropriate in some cases, but in general you want to
> receive a message telling you when the task is done. That's not really
> complicated however: all the task has to do is send back its result
> through std.concurrency's "send" or through some other event dispatch
> mechanism instead of through a return statement.

Sounds like a good plan.  In general, I've tried to keep the design of std.parallelism simple but composable.  I have no intention of re-implementing any kind of message system when std.concurrency already does this well.  If this is what you want to do, though, maybe you should just use std.concurrency.  I'm not sure what std.parallelism would add.

>
> So I hope your tasks can accept "void" as a return type.

Yes, this works.

March 24, 2011
On 2011-03-24 10:43:08 -0400, dsimcha <dsimcha@yahoo.com> said:

> Sounds like a good plan.  In general, I've tried to keep the design of std.parallelism simple but composable.  I have no intention of re-implementing any kind of message system when std.concurrency already does this well.  If this is what you want to do, though, maybe you should just use std.concurrency.  I'm not sure what std.parallelism would add.

What it adds is a task pool, where you have a fixed number of threads for an unlimited number of tasks. Spawning 10,000 threads because you have 10,000 parallelizable tasks generally isn't a good idea.

That said, perhaps std.concurrency's "spawn" should have the ability to create tasks instead of always creating new threads...

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

March 24, 2011
On 3/24/2011 11:00 AM, Michel Fortin wrote:
>>
>
> What it adds is a task pool, where you have a fixed number of threads
> for an unlimited number of tasks. Spawning 10,000 threads because you
> have 10,000 parallelizable tasks generally isn't a good idea.
>
> That said, perhaps std.concurrency's "spawn" should have the ability to
> create tasks instead of always creating new threads...

This is a great **long-term** todo.  Please file an enhancement request in Bugzilla w.r.t. std.concurrency pooling threads so it doesn't get lost.  TaskPool would probably be a good back end for this, but IMHO any use of TaskPool by std.concurrency should be regarded as an implementation detail.

On the other hand, this kind of inter-module cooperation requires lots of discussion about how it should be designed.  It is also well beyond the scope of what std.parallelism was designed to do.  This will take a long time to design and implement, have ripple effects into std.concurrency, etc.  I don't think it needs to be implemented **now** or should hold up the vote and inclusion of std.parallelism in Phobos.
March 24, 2011
dsimcha:

> I tried to keep it as consistent as possible with std.algorithm.

OK. Then the question is why std.algorithm uses normal strings instead of q{} ones.

> And regarding consistency with std.algorithm, a more important factor is that std.algorithm.map is lazy, while you have an eager map, and the lazy version has lazy in the name, so the names are kind of the opposite of std.algorithm's.


> Unfortunately I don't have access to this kind of hardware except at work.

> Maybe some other person in this group has access to an 8-core CPU and is willing to take some numbers, to create a few little graphs.

Bye,
bearophile
March 24, 2011
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> dsimcha:
> > I tried to keep it as consistent as possible with std.algorithm.
> OK. Then the question is why std.algorithm uses normal strings instead of q{} ones.

I personally think "" strings look nicer for simple cases like "a + b".  At any rate, this is a bikeshed issue.

> And regarding consistency with std.algorithm, a more important factor is that
> std.algorithm.map is lazy, while you have an eager map, and the lazy version has
> lazy in the name, so the names are kind of the opposite of std.algorithm's.

Hmm, you do have a point there.  Two reasons:

1.  map() was there first and at the time I didn't feel like renaming it.

2.  I think map() is much more frequently useful than lazyMap() and name verbosity
should be inversely proportional to usage frequency.  (Which is why I really need
help thinking of a better name than executeInNewThread().)

I'm not sure whether I want to change this.  Input from the rest of the community would be useful as long as it doesn't end up going off onto some wild tangent and starting Bikeshed War III.

> > Unfortunately I don't have access to this kind of hardware except at work.
> Maybe some other person in this group has access to an 8-core CPU and is willing
> to take some numbers, to create a few little graphs.

I can kinda see Andrei's point about adding a few benchmarks just to give a rough idea of the level of fine-grainedness this library is capable of and how to get it, but these requests are exactly what I was afraid of when I put them in.  I really don't think rigorous, detailed benchmarks have any place in API documentation.  The purpose of API documentation is for people to figure out what the library does and how to use it, not to sell a product or precisely quantify its performance.

Next people will be asking for standard deviations, error bars, P-values, confidence intervals, regression models of performance vs. cores and basically things that dstats is good for, or for me to port their favorite non-trivial benchmark from their favorite non-D language.