August 27, 2010
On Aug 27, 2010, at 7:11 AM, David Simcha wrote:
> 
>> 
>> I see you have some CAS instructions. Sean, I think it's a good time to collaborate with David to put them into druntime or std.concurrency.
> 
> Yeah, D needs a real atomics library.  core.atomic is a good start, but I won't use it until it can efficiently do things like atomic increment.

How could it be made more efficient?
August 27, 2010
IIRC (maybe this has changed recently) atomic increments in core.atomic are based on CAS instructions in a while loop, which is how more generic lock free primitives are made.  Atomic increment should be special cased to directly use lock; inc [someRegister];.

On Fri, Aug 27, 2010 at 10:40 AM, Sean Kelly <sean at invisibleduck.org> wrote:

> On Aug 27, 2010, at 7:11 AM, David Simcha wrote:
> >
> >>
> >> I see you have some CAS instructions. Sean, I think it's a good time to
> collaborate with David to put them into druntime or std.concurrency.
> >
> > Yeah, D needs a real atomics library.  core.atomic is a good start, but I
> won't use it until it can efficiently do things like atomic increment.
>
> How could it be made more efficient?
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
>
August 27, 2010
I've heard that LOCK INC is actually slower than a CAS loop, but have no data to back this up.  What I really don't like about INC however is that it doesn't put the new value of the incremented variable in a register, so there's no way of returning the new value.

On Aug 27, 2010, at 7:45 AM, David Simcha wrote:

> IIRC (maybe this has changed recently) atomic increments in core.atomic are based on CAS instructions in a while loop, which is how more generic lock free primitives are made.  Atomic increment should be special cased to directly use lock; inc [someRegister];.
> 
> On Fri, Aug 27, 2010 at 10:40 AM, Sean Kelly <sean at invisibleduck.org> wrote: On Aug 27, 2010, at 7:11 AM, David Simcha wrote:
> >
> >>
> >> I see you have some CAS instructions. Sean, I think it's a good time to collaborate with David to put them into druntime or std.concurrency.
> >
> > Yeah, D needs a real atomics library.  core.atomic is a good start, but I won't use it until it can efficiently do things like atomic increment.
> 
> How could it be made more efficient?

August 27, 2010
This should be validated by benchmarking.

Andrei

On 8/27/10 7:45 PDT, David Simcha wrote:
> IIRC (maybe this has changed recently) atomic increments in core.atomic are based on CAS instructions in a while loop, which is how more generic lock free primitives are made.  Atomic increment should be special cased to directly use lock; inc [someRegister];.
>
> On Fri, Aug 27, 2010 at 10:40 AM, Sean Kelly <sean at invisibleduck.org <mailto:sean at invisibleduck.org>> wrote:
>
>     On Aug 27, 2010, at 7:11 AM, David Simcha wrote:
>      >
>      >>
>      >> I see you have some CAS instructions. Sean, I think it's a good
>     time to collaborate with David to put them into druntime or
>     std.concurrency.
>      >
>      > Yeah, D needs a real atomics library.  core.atomic is a good
>     start, but I won't use it until it can efficiently do things like
>     atomic increment.
>
>     How could it be made more efficient?
August 27, 2010
This should do the job:

import std.stdio, std.perf, core.atomic;

// Cut and pasted from ParallelFuture.
void atomicIncUint(ref uint num) {
    asm {
        mov EAX, num;
        lock;
        inc int ptr [EAX];
        mov EAX, [EAX];
    }
}

void main() {
    auto pc = new PerformanceCounter;
    pc.start;

    shared uint num = 0;
    while(num < 100_000_000) {
        atomicIncUint(num);
    }

    pc.stop;
    writeln(pc.milliseconds);  // 1772 ms

    pc.start;

    num = 0;
    while(num < 100_000_000) {
        atomicOp!"+="(num, 1U);  // 2539 ms
    }

    pc.stop;
    writeln(pc.milliseconds);
}

So the difference is there, but it's not huge.  In addition, the lack of anything like atomicOp!"++"(num); seems kind of weird.

Also, atomicOp shouldn't be so picky about types.  For example, atomically adding an int to a uint should work.

On Fri, Aug 27, 2010 at 11:43 AM, Andrei Alexandrescu <andrei at erdani.com> wrote:

> This should be validated by benchmarking.
>
> Andrei
>
>
> On 8/27/10 7:45 PDT, David Simcha wrote:
>
>> IIRC (maybe this has changed recently) atomic increments in core.atomic are based on CAS instructions in a while loop, which is how more generic lock free primitives are made.  Atomic increment should be special cased to directly use lock; inc [someRegister];.
>>
>> On Fri, Aug 27, 2010 at 10:40 AM, Sean Kelly <sean at invisibleduck.org <mailto:sean at invisibleduck.org>> wrote:
>>
>>    On Aug 27, 2010, at 7:11 AM, David Simcha wrote:
>>     >
>>     >>
>>     >> I see you have some CAS instructions. Sean, I think it's a good
>>    time to collaborate with David to put them into druntime or
>>    std.concurrency.
>>     >
>>     > Yeah, D needs a real atomics library.  core.atomic is a good
>>    start, but I won't use it until it can efficiently do things like
>>    atomic increment.
>>
>>    How could it be made more efficient?
August 27, 2010
On Aug 27, 2010, at 8:55 AM, David Simcha wrote:
> 
> Also, atomicOp shouldn't be so picky about types.  For example, atomically adding an int to a uint should work.

Agreed.  I've already had to hack the template params so shared vs. unshared concrete types would be accepted.  I'll have to try and fix the other overloading issues as well.
August 27, 2010
I'm not an asm guru in any way, so there might be hidden bugs in my code, but still here is my version:

import std.perf;
import std.stdio;

void atomicIncUint(ref shared(uint) num) {
    asm {
        // No need for `mov EAX, num`: the last argument is always passed in EAX,
        // and DMD doesn't inline functions that use asm or alloca.
        lock;
        inc int ptr [EAX];
        // mov EAX, [EAX]; // ???
    }
}

uint atomicIncUintViaCas(ref shared(uint) num) {
    asm {
        mov EDX, num;
repeat:
        mov EAX, [EDX];
        mov ECX, EAX;
        inc ECX;
        lock;
        cmpxchg [EDX], ECX;
        jnz repeat;
        // On success EAX still holds the old value, which becomes the return value.
    }
}

void main() {
    auto pc = new PerformanceCounter;
    pc.start;

    shared uint num = 111;
    for (uint i = num; i < 100_000_000; ++i) {
        //atomicIncUint(num);	// 2060
        atomicIncUintViaCas(num);	// 2199
    }

    pc.stop;
    writeln(pc.milliseconds);
    writeln(cast(uint)num); // why doesn't writeln allow printing shared stuff?
}

I think the difference is pretty much negligible, but with atomicIncUintViaCas you have a benefit of returning old variable value (which is a nice thing to have).
September 05, 2010
Continuing to catch up with older email...

David, the intent of shared is to prevent sharing of everything that isn't shared. I didn't get to review your parallelism library yet, but I think it's likely your library uses things that shouldn't actually work :o). If it does, we should work towards making the type system better to accept your code without inadvertent sharing.


Andrei

On 07/31/2010 11:35 PM, David Simcha wrote:
> I've started thinking about how to make ParallelFuture jibe with D's new threading model, since it was designed before shared and std.concurrency were implemented and is basically designed around default sharing. (core.thread takes a non-shared delegate, and allows you to completely bypass the shared system, and from what I remember of newsgroup discussions, this isn't going to change.)
>
> I've re-read the concurrency chapter in TDPL and I'm still trying to understand what the model actually is for shared data.  For example, the following compiles and, IIUC shouldn't:
>
> shared real foo;
>
> void main() {
>      foo++;
> }
>
> I guess my high-level question that I'm *still* not quite getting is "What is shared besides a piece of syntactic salt to make it harder to inadvertently share data across threads?"
>
> Secondly, my parallel foreach loop implementation relies on sharing the current stack frame and anything reachable from it across threads.  For example:
>
> void main() {
>      auto pool = new TaskPool;
>      uint[] nums = fillNums();
>      uint modBy = getSomeOtherNum();
>
>      foreach(num; pool.parallel(nums)) {
>          if(isPrime(num % modBy)) {
>               writeln("Found prime number:  ", num % modBy);
>         }
>      }
> }
>
> Allowing stuff like this is personally useful to me, but if the idea is that we have no implicit sharing across threads, then I don't see how something like this can be implemented.  When you call a parallel foreach loop like this, **everything** on the current stack frame is **transitively** shared.  Doing anything else would require a complete redesign of the library.  Is calling pool.parallel enough of an explicit asking for "here be dragons" that the delegate should simply be cast to shared?  If not, does anyone see any other reasonable way to do parallel foreach?
>
> On 7/31/2010 7:31 AM, Andrei Alexandrescu wrote:
>> Hello,
>>
>> Here's a belated answer to your question (hectic times prevented me from tending to non-urgent email).
>>
>> I think a parallel library would be great to have as indeed phobos is geared at general concurrency. Such a lib would also expose bugs and weaknesses in our model and its implementation.
>>
>> Andrei
>>
>> Sent by shouting through my showerhead.
>>
>> On May 30, 2010, at 12:54 PM, David Simcha <dsimcha at gmail.com <mailto:dsimcha at gmail.com>> wrote:
>>
>>> I have a few questions/comments about the possible inclusion of a library for parallelism in Phobos:
>>>
>>> 1.  What is the status of std.concurrency?  It's in the source tree, but it's not in the documentation or the changelogs.  It appears to have been checked in quietly ~3 months ago, and I just noticed now.
>>>
>>> 2.  From reading the description of std.concurrency in TDPL it seemed more geared toward concurrency (i.e. making stuff appear to be happening simultaneously, useful for things like GUIs and servers) rather than parallelism (i.e. the use of multiple CPU cores to increase throughput, useful for things like scientific computing and video encoding).  It seems fairly difficult (though I haven't tried yet) to write code that's designed for pull-out-all-stops maximal performance on a multicore machine, especially since immutability is somewhat of a straitjacket.  I find implicit sharing and the use of small synchronized blocks or atomic ops to be very useful in writing parallel programs.
>>>
>>> 3.  Most code where parallelism, as opposed to concurrency, is the goal (at least most that I write) is parallelized in one or two small, performance critical sections, and the rest is written serially.  Therefore, it's easy to reason about things and safety isn't as important as the case of concurrency-oriented multithreading over large sections of code.
>>>
>>> 4.  I've been eating my own dogfood for awhile on my ParallelFuture
>>> library.  (http://cis.jhu.edu/~dsimcha/parallelFuture.html;
>>> http://dsource.org/projects/scrapple/browser/trunk/parallelFuture/parallelFuture.d)
>>> It's geared toward throughput-oriented parallelism on multicore
>>> machines, not concurrency for GUIs, servers, etc. and is higher level
>>> than std.concurrency.  Is there any interest in including something
>>> like this in Phobos?  If so, would we try to make it fit into the
>>> explicit-sharing-only model, or treat it as an alternative method of
>>> multithreading geared towards pull-out-all-stops parallelism on
>>> multicore computers?
>>>
>>> One last note:  Walter claimed a while back on the NG that
>>> Parallelfuture doesn't compile.  I use it regularly and it compiles
>>> for me.  Walter, can you please point out what the issue was?
September 05, 2010
I think both are wrong for different reasons. Did you file them in Bugzilla already?

Andrei

On 07/31/2010 11:45 PM, David Simcha wrote:
> Also, the following doesn't compile:
>
> class Foo {
>
>      shared void bar() {}
> }
>
> void main() {
>      shared(void delegate()) d;
>      auto foo = new Foo;
>      d = &foo.bar;
> }
>
> Error: cannot implicitly convert expression (&foo.bar) of type void
> delegate() to shared(void delegate())
>
> But the following does:
>
> class Foo {
>
>      void bar() {}
> }
>
> void main() {
>      shared(void delegate()) d;
>      auto foo = new Foo;
>      d = &foo.bar;
> }
>
> If these are just plain bugs, let me know and I'll file them in Bugzilla, but right now I feel like they're more likely my lack of understanding of shared.
>
>
> On 7/31/2010 7:31 AM, Andrei Alexandrescu wrote:
>> Hello,
>>
>> Here's a belated answer to your question (hectic times prevented me from tending to non-urgent email).
>>
>> I think a parallel library would be great to have as indeed phobos is geared at general concurrency. Such a lib would also expose bugs and weaknesses in our model and its implementation.
>>
>> Andrei
>>
>> Sent by shouting through my showerhead.
>>
>> On May 30, 2010, at 12:54 PM, David Simcha <dsimcha at gmail.com <mailto:dsimcha at gmail.com>> wrote:
>>
>>> I have a few questions/comments about the possible inclusion of a library for parallelism in Phobos:
>>>
>>> 1.  What is the status of std.concurrency?  It's in the source tree, but it's not in the documentation or the changelogs.  It appears to have been checked in quietly ~3 months ago, and I just noticed now.
>>>
>>> 2.  From reading the description of std.concurrency in TDPL it seemed more geared toward concurrency (i.e. making stuff appear to be happening simultaneously, useful for things like GUIs and servers) rather than parallelism (i.e. the use of multiple CPU cores to increase throughput, useful for things like scientific computing and video encoding).  It seems fairly difficult (though I haven't tried yet) to write code that's designed for pull-out-all-stops maximal performance on a multicore machine, especially since immutability is somewhat of a straitjacket.  I find implicit sharing and the use of small synchronized blocks or atomic ops to be very useful in writing parallel programs.
>>>
>>> 3.  Most code where parallelism, as opposed to concurrency, is the goal (at least most that I write) is parallelized in one or two small, performance critical sections, and the rest is written serially.  Therefore, it's easy to reason about things and safety isn't as important as the case of concurrency-oriented multithreading over large sections of code.
>>>
>>> 4.  I've been eating my own dogfood for awhile on my ParallelFuture
>>> library.  (http://cis.jhu.edu/~dsimcha/parallelFuture.html;
>>> http://dsource.org/projects/scrapple/browser/trunk/parallelFuture/parallelFuture.d)
>>> It's geared toward throughput-oriented parallelism on multicore
>>> machines, not concurrency for GUIs, servers, etc. and is higher level
>>> than std.concurrency.  Is there any interest in including something
>>> like this in Phobos?  If so, would we try to make it fit into the
>>> explicit-sharing-only model, or treat it as an alternative method of
>>> multithreading geared towards pull-out-all-stops parallelism on
>>> multicore computers?
>>>
>>> One last note:  Walter claimed a while back on the NG that
>>> Parallelfuture doesn't compile.  I use it regularly and it compiles
>>> for me.  Walter, can you please point out what the issue was?