October 11, 2012
I haven't been able to get an idea of how many std.concurrency receivers is reasonable.  Is it a reasonable way to implement a cellular automaton (assume each cell has a float number of states)...it isn't exactly a cellular automaton, but it isn't exactly a neural network, either.  (I was considering Erlang, but each cell has variable state, which Erlang doesn't have a nice way to do.)

TDPL quotes the recommendation from an Erlang book "Have LOTS of threads!", but doesn't really say how to guess at an order of magnitude of what's reasonable for D std.concurrency.  People on Erlang say that 100's of thousands of threads is reasonable.  Is it the same for D?
October 11, 2012
On Thursday, 11 October 2012 at 02:21:01 UTC, Charles Hixson wrote:
> I haven't been able to get an idea of how many std.concurrency receivers is reasonable.

Currently in std.concurrency each "receiver" lives in its own OS thread, so they are very expensive, 4-10 is fine, 100 may be possible but expensive in terms of RAM and CPU cycles, 1000 is probably too much.
October 11, 2012
On Thu, 2012-10-11 at 09:09 +0200, thedeemon wrote:
> On Thursday, 11 October 2012 at 02:21:01 UTC, Charles Hixson wrote:
> > I haven't been able to get an idea of how many std.concurrency receivers is reasonable.
> 
> Currently in std.concurrency each "receiver" lives in its own OS thread, so they are very expensive, 4-10 is fine, 100 may be possible but expensive in terms of RAM and CPU cycles, 1000 is probably too much.

In the beginning processors weren't doing enough so processes were invented. Processes were too expensive so threads were invented. Now threads are too expensive. How the world goes round :-)

More constructively: there is an implication here that each receiver is bound explicitly and permanently to a thread. I would have thought the step would be de-couple receiver and thread, run a thread pool and then dynamically bind to receivers as needed.

I haven't read the earlier emails in the thread nor checked the code so I may be way off with the above. Apologies if so.

-- 
Russel. ============================================================================= Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net 41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder


October 11, 2012
On 10/11/2012 12:09 AM, thedeemon wrote:
> On Thursday, 11 October 2012 at 02:21:01 UTC, Charles Hixson wrote:
>> I haven't been able to get an idea of how many std.concurrency
>> receivers is reasonable.
>
> Currently in std.concurrency each "receiver" lives in its own OS thread,
> so they are very expensive, 4-10 is fine, 100 may be possible but
> expensive in terms of RAM and CPU cycles, 1000 is probably too much.

Hmmm...what I'm trying to build is basically a cross between a weighted directed graph and a neural net, with some features of each, but not much in common.  Very light-weight processes would be ideal.  The only communication should be via message-passing.  Each cell would spend most of it's time sitting on a count-down timer waiting to be rolled out to a database of inactive processes, but it needs to maintain local state (weights of links, activation level, etc.  nothing fancy.)

If I were doing this sequentially, I'd want to use structs for the cells, because class instances would be too heavy.  And I'd store them in a hash table keyed by cell-id#.

Unfortunately, I don't see any reasonable way of chunking the pieces, so that I can chunk them into 100 relatively independent sets.  Or even 1000.  10,000 is probably about the right size for active-at-one-time cells.  And if it would handle that, std.concurrency seemed ideal.

Do you have any suggestions as to what would be a reasonable better choice?  (Outside of going back to sequential.)
October 11, 2012
On Thursday, 11 October 2012 at 16:09:20 UTC, Charles Hixson wrote:

> Hmmm...what I'm trying to build is basically a cross between a weighted directed graph and a neural net, with some features of each, but not much in common.  Very light-weight processes would be ideal.  The only communication should be via message-passing.  Each cell would spend most of it's time sitting on a count-down timer waiting to be rolled out to a database of inactive processes, but it needs to maintain local state (weights of links, activation level, etc.  nothing fancy.)
>
> If I were doing this sequentially, I'd want to use structs for the cells, because class instances would be too heavy.  And I'd store them in a hash table keyed by cell-id#.
>
> Unfortunately, I don't see any reasonable way of chunking the pieces, so that I can chunk them into 100 relatively independent sets.  Or even 1000.  10,000 is probably about the right size for active-at-one-time cells.  And if it would handle that, std.concurrency seemed ideal.
>
> Do you have any suggestions as to what would be a reasonable better choice?  (Outside of going back to sequential.)

Here's how I would try to approach a task of having thousands of independent agents with current std.concurrency. Each agent (cell) is represented by some data structure and its main function which gets one message as input, reacts (possibly changing its state and sending other messages) and returns without blocking. Then I'd create say 16 threads (or 8, anyway a power of 2 which is close to actual number of cores), each of them will have its own message queue, that's given by std.concurrency. Let's say each cell has its own id. I would place cell with id N to the thread number N mod 16. Each thread will have an array of cells mapped to it. Then if some cell sends a message to cell X, it makes sure the message contains cell id of recipient and then sends it to thread X mod 16. Each worker thread runs a loop where it receives next message from its queue, finds the target cell by its id in this thread's array of cells (we can use X / 16 as index) and calls its reaction function. This way all agents are evenly distributed between threads, we're using just 16 threads and 16 queues which work in parallel, and it all acts as if thousands of agents work independently. However this approach does not guarantee even work distribution between cores.
October 11, 2012
On Thu, 2012-10-11 at 20:04 +0200, thedeemon wrote:
[…]
> Here's how I would try to approach a task of having thousands of independent agents with current std.concurrency. Each agent (cell) is represented by some data structure and its main function which gets one message as input, reacts (possibly changing its state and sending other messages) and returns without blocking. Then I'd create say 16 threads (or 8, anyway a power of 2 which is close to actual number of cores), each of them will have its own message queue, that's given by std.concurrency. Let's say each cell has its own id. I would place cell with id N to the thread number N mod 16. Each thread will have an array of cells mapped to it. Then if some cell sends a message to cell X, it makes sure the message contains cell id of recipient and then sends it to thread X mod 16. Each worker thread runs a loop where it receives next message from its queue, finds the target cell by its id in this thread's array of cells (we can use X / 16 as index) and calls its reaction function. This way all agents are evenly distributed between threads, we're using just 16 threads and 16 queues which work in parallel, and it all acts as if thousands of agents work independently. However this approach does not guarantee even work distribution between cores.

Can't this be done now using tasks and a threadpool from std.parallel?

And I believe (in that I can't point you at explicit data just now), that it is generally best to have 1 or 2 more threads than there are cores to get optimal performance.

-- 
Russel. ============================================================================= Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net 41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder


October 11, 2012
On Thursday, 11 October 2012 at 18:43:37 UTC, Russel Winder wrote:

> Can't this be done now using tasks and a threadpool from std.parallel?

As far as I understand that would essentially mean a single queue of tasks which is accessed concurrently by workers hungry of work (one point of locking), and if one cell receives two messages with little time interval inbetween then two different threads can pick up the tasks of reacting to those messages and run in parallel which means two threads may try to change cell's state simultaneously unless you add a lock to each cell or somehow organize pinning cells to particular threads. Doesn't look good to me, unless there is a very different design.

> And I believe (in that I can't point you at explicit data just now),
> that it is generally best to have 1 or 2 more threads than there are cores to get optimal performance.

I guess it depends very much on the tasks being executed. If they do some I/O or other blocking operations, additional threads may indeed help keep CPU cores busy.


October 11, 2012
My biggest concern here is with this number of agents communicating to each other via message passing it would mean huge number of memory allocations for the messages, but in current D runtime allocation is locking (and GC too), so it may kill all the parallelism if reactions to messages are short and simple. D is no Erlang in this regard.
October 11, 2012
On Thu, 2012-10-11 at 21:26 +0200, thedeemon wrote:
> On Thursday, 11 October 2012 at 18:43:37 UTC, Russel Winder wrote:
> 
> > Can't this be done now using tasks and a threadpool from std.parallel?
> 
> As far as I understand that would essentially mean a single queue of tasks which is accessed concurrently by workers hungry of work (one point of locking), and if one cell receives two messages with little time interval inbetween then two different threads can pick up the tasks of reacting to those messages and run in parallel which means two threads may try to change cell's state simultaneously unless you add a lock to each cell or somehow organize pinning cells to particular threads. Doesn't look good to me, unless there is a very different design.

Many actor systems that deal with very large numbers of messages per second are based on single threaded event-driven engines. JActor, PyActor, etc.

Alternatively use Concurrent Sequential Processes. The key here is concurrent, sequential, processes :-) Python-CSP has them. PyCSP has them. Go has them. C++CSP2 has them. JCSP has them. GroovyCSP has them. It's all about the sequential processes and the rendezvous semantics. And also the select operation.

> > And I believe (in that I can't point you at explicit data just
> > now),
> > that it is generally best to have 1 or 2 more threads than
> > there are cores to get optimal performance.
> 
> I guess it depends very much on the tasks being executed. If they do some I/O or other blocking operations, additional threads may indeed help keep CPU cores busy.

Cores being busy is not an important metric. Number of useful applications actions is far more important, even if this means most cores are idle most of the time.

The rational for more threads than cores is indeed blocking, be it I/O or otherwise. The serious problem is cache-line contention, which is where Threading Building Blocks makes a big win.

Sadly I seem to have used examples none of which relate to D :-(

-- 
Russel. ============================================================================= Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net 41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder


October 11, 2012
On 10/11/2012 11:04 AM, thedeemon wrote:
> On Thursday, 11 October 2012 at 16:09:20 UTC, Charles Hixson wrote:
>
>> Hmmm...what I'm trying to build is basically a cross between a
>> weighted directed graph and a neural net, with some features of each,
>> but not much in common. Very light-weight processes would be ideal.
>> The only communication should be via message-passing. Each cell would
>> spend most of it's time sitting on a count-down timer waiting to be
>> rolled out to a database of inactive processes, but it needs to
>> maintain local state (weights of links, activation level, etc. nothing
>> fancy.)
>>
>> If I were doing this sequentially, I'd want to use structs for the
>> cells, because class instances would be too heavy. And I'd store them
>> in a hash table keyed by cell-id#.
>>
>> Unfortunately, I don't see any reasonable way of chunking the pieces,
>> so that I can chunk them into 100 relatively independent sets. Or even
>> 1000. 10,000 is probably about the right size for active-at-one-time
>> cells. And if it would handle that, std.concurrency seemed ideal.
>>
>> Do you have any suggestions as to what would be a reasonable better
>> choice? (Outside of going back to sequential.)
>
> Here's how I would try to approach a task of having thousands of
> independent agents with current std.concurrency. Each agent (cell) is
> represented by some data structure and its main function which gets one
> message as input, reacts (possibly changing its state and sending other
> messages) and returns without blocking. Then I'd create say 16 threads
> (or 8, anyway a power of 2 which is close to actual number of cores),
> each of them will have its own message queue, that's given by
> std.concurrency. Let's say each cell has its own id. I would place cell
> with id N to the thread number N mod 16. Each thread will have an array
> of cells mapped to it. Then if some cell sends a message to cell X, it
> makes sure the message contains cell id of recipient and then sends it
> to thread X mod 16. Each worker thread runs a loop where it receives
> next message from its queue, finds the target cell by its id in this
> thread's array of cells (we can use X / 16 as index) and calls its
> reaction function. This way all agents are evenly distributed between
> threads, we're using just 16 threads and 16 queues which work in
> parallel, and it all acts as if thousands of agents work independently.
> However this approach does not guarantee even work distribution between
> cores.
=------------below is my second thoughts.

If I could do things that way, it would certainly be a faster design than what I'm considering now.  But I'm really concerned about everything fitting into RAM.  I'm going to need to think about this. I've got about 8GB of RAM, and I'm on a 64 bit system.  So maybe my concerns about things fitting into memory are out of date.  (I'm still used to thinking of a 64KB computer as being one with a lot of RAM.) And I notice my disk swap space is totally unused.  Hmmmm... Maybe I should even replace the database with a sequential file.

Unless D has some limits that I can't recall reading about, that looks like the right way to go, even if it feels wrong.  Probably because I learned programming way back when... but reasonably it looks like the right answer.

P.S.:  There's no way to guarantee that the cores will be used evenly, because the cells definitely AREN'T even in their use.  And while the distribution of use isn't random, it also isn't predictable...and varies over time.  So don't worry about this approach not guaranteeing equal distribution of work.

=------------below is my first impressions

That's a nice approach, though I can't use a vector of cells in each thread, because the cells roll in and out depending on their level of activity, and all active (i.e. ram-resident) cells will need to be accessed occasionally to age their activity, so that will need to be a hash table (i.e. associative array).  Also, I only have about 8-hyperthreads.  So I guess what I'll do is run all the cells in one thread (to simplify the logic) and in other threads do things like manage the database, etc.  Not what I was hoping for, but probably a much more reasonable match to the hardware.  (Also, I'll want to have a few extra threads available for things like background e-mail polling, etc.  Or even debuggers.)

I guess that a part of the problem (i.e., why I can't adopt your suggestion) is that there's no way all the cells would fit into RAM. (Or maybe I'm wrong.  There will probably be only a few million total. And each one will probably be less than a kilobyte in size.  [You'll note I don't have very precise estimates yet.  That will take months to years to develop.])

Still, if I adopt this serialized variation, it will be relatively easy to split it several ways in the future if I get fancier hardware, and if I decide that all the nodes WILL fit into RAM.  So I guess what I should do is build the serial version, but ensure that it remains feasible to convert it into the chunked-parallel version that you described. Certainly if I could replace the associative array by a simple vector that would speed up lots of parts of it, and so would eliminating the rolling in and out of cells.

« First   ‹ Prev
1 2
Top | Discussion index | About this forum | D home