April 24, 2008
> The problems with locks are:
>
> 1) they are expensive, so people try to optimize them away (grep for "double checked locking")
> 2) people forget to use the locks
> 3) deadlocks
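
For reference, the "double checked locking" mentioned in item 1 is the classic broken lazy-init pattern, roughly this in D (just a sketch; Config is a stand-in type):

class Config { int value; }

Config instance;

Config getInstance()
{
    if (instance is null)                // first check, no lock taken
    {
        synchronized                     // lock only on the slow path
        {
            if (instance is null)        // second check, under the lock
                instance = new Config(); // construct and publish
        }
    }
    return instance;
}

The catch: without a barrier between the constructor's stores and the store to instance, another core can observe a non-null instance that still points at a half-initialized object, which is exactly why people keep trying, and failing, to optimize the lock away safely.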

Having had several run-ins with pthread_*(3) implementations earlier this year, I started digging around for alternatives and stashed two such nuggets away.  Both papers struck me as "not stupid" and rang high on my "I wish this would make its way into D" scale.

"Transactional Locking II"
http://research.sun.com/scalable/pubs/DISC2006.pdf

"Software Transactional Memory Should Not Be Obstruction-Free"
http://berkeley.intel-research.net/rennals/pubs/052RobEnnals.pdf

How you'd wrap these primitives into a low-level language is left as an exercise for the implementor and language designer, but getting low-level primitives in place that allow for efficient locks strikes me as highly keen.

-sc

--
Sean Chittenden
sean@chittenden.org
http://sean.chittenden.org/

April 24, 2008
Sean Kelly wrote:
> 1) The cost of acquiring or committing a lock is generally roughly equivalent to
>      a memory synchronization, and sometimes less than that (futexes, etc).  So
>      it's not insignificant, but also not as bad as people seem to think.  I suspect
>      that locked operations are often subject to premature optimization.

What exactly do you mean by "memory synchronization?"  Just a write barrier instruction, or something else?

If what you mean is a write barrier, then what you said isn't necessarily true, especially as we head toward more and more cores, and thus more and more caches.  Locks are almost always atomic read/modify/write operations, and those can cause terrible cache bouncing problems.  If you have N cores (each with its own cache) race for the same lock (even if they are trying to get shared locks), you can have up to N^2 bounces of the cache line around.
April 24, 2008
Sean Kelly Wrote:

> == Quote from Walter Bright (newshound1@digitalmars.com)'s article
> > Steven Schveighoffer wrote:
> > > You can sort of work around it by wrapping the previously volatile statement in a function, but it seems like having volatile doesn't really hurt anything.  I'm curious to know why it was so bad that it was worth removing...
> > Because after listening in while experts debated how to write multithreaded code safely, it became pretty clear that the chances of using volatile statements correctly are very small, even for experts. It's the wrong approach.
> 
> Every tool can be mis-used with insufficient understanding.  Look at shared-
> memory multiprogramming for instance.  It's quite easy and understandable
> to share a few data structures between threads (which I'd assert is the original
> intent anyway), but common practice among non-experts is to use mutexes
> to protect code rather than data, and to call across threads willy-nilly.  It's
> no wonder the commonly held belief is that multiprogramming is hard.
> 
> Regarding lock-free programming in particular, I think it's worth pointing
> out that leaving out support for lock-free programming in general excludes
> an entire realm of code being written--not only library code to be ultimately
> used by everyday programmers, but kernel code and such as well.  Look at
> the Linux source code, for example.  As for the C++0x discussions, I feel
> that some of the participants of the memory model discussion are experts
> in the field and understand quite well the issues involved.
> 
> 
> Sean

I've used a tiny amount of lock-free-like programming here and there, which is to say, code that uses the "compare and swap" idiom (on an IBM OS/390) for a few very limited purposes, and just the "atomic swap" (via a portable library).

I was trying to do this with D a week or two back.  I wrote some inline ASM code using  "xchg" and "cmpxchg8b".  I was able to get xchg working (as far as I can tell) on DMD.  The same inline ASM code on GDC (64 bit machine) just threw a BUS error for some reason.  I couldn't get cmpxchg8b to do what I expected on either platform, but my assembly skills are weak, and my inline assembly skills are even weaker.  (it was my first stab at inline ASM in D).

1. I have no idea if my code was reasonable or did what I thought but...

2. there might be a gdc / dmd ASM compatibility issue.

3. I think it would be cool if there were atomic swap and ideally, compare
    and swap type functions in D -- one more thing we could do portably that
    C++ has to do non-portably.

4. By the way these links contain public domain versions of the "swap pointers
    atomically" code from my current work location that might be useful:

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/ncbiatomic_8h-source.html
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/include/corelib/impl/ncbi_atomic_defs.h

Unfortunately, it looks like they don't define the _CONDITIONALLY versions for the x86 or x86_64 platforms.  One of my libraries at work uses the atomic pointer-swapping to implement a lightweight mutex for a library I wrote and it's a big win.

Any thoughts?  It would be neat to play with lock-free algorithms in D, especially since the papers I've read on the subject (Andrei's I think) say that it's much easier to get the simpler ones right in a garbage collected environment.
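
In case it helps as a starting point, a 32-bit compare-and-swap looks roughly like this with DMD-style inline asm (just a sketch, x86 only, and I haven't verified this exact code on GDC):

// Returns true if *dest equalled oldVal and was replaced by newVal.
bool cas32(int* dest, int oldVal, int newVal)
{
    asm
    {
        mov EAX, oldVal;      // cmpxchg compares *dest against EAX
        mov ECX, newVal;      // value to store on success
        mov EDX, dest;        // address of the target word
        lock;                 // make the read-modify-write atomic
        cmpxchg [EDX], ECX;   // if (*dest == EAX) *dest = ECX;
        setz AL;              // result: 1 on success, 0 otherwise
    }
}

The 64-bit case is fussier: cmpxchg8b wants the comparand in EDX:EAX and the new value in ECX:EBX, and EBX doubles as the PIC register on some x86 setups, so it's easy to run afoul of the compiler there.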

Kevin Bealer

April 24, 2008
== Quote from Russell Lewis (webmaster@villagersonline.com)'s article
> Sean Kelly wrote:
> > 1) The cost of acquiring or committing a lock is generally roughly equivalent to
> >      a memory synchronization, and sometimes less than that (futexes, etc).  So
> >      it's not insignificant, but also not as bad as people seem to think.  I suspect
> >      that locked operations are often subject to premature optimization.
> What exactly do you mean by "memory synchronization?"  Just a write
> barrier instruction, or something else?
> If what you mean is a write barrier, then what you said isn't
> necessarily true, especially as we head toward more and more cores, and
> thus more and more caches.  Locks are almost always atomic
> read/modify/write operations, and those can cause terrible cache
> bouncing problems.  If you have N cores (each with its own cache) race
> for the same lock (even if they are trying to get shared locks), you can
> have up to N^2 bounces of the cache line around.

Yeah I meant an atomic RMW, or at least a load barrier for the acquire.  Releasing
a mutex can often be done using a plain old store though, since write ops are
typically ordered anyway and moving loads up into the mutex doesn't break
anything.  My point, however, was simply that mutexes aren't terribly slower than
atomic operations, since a mutex acquire/release is really little more than an atomic
operation itself, at least in the simple case.
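
To make that concrete, a toy spinlock looks something like this (a sketch, x86 assumed): the acquire path is one locked RMW and the release path is just an ordinary store.

int lockWord;   // 0 = free, 1 = held

bool tryAcquire()
{
    int* p = &lockWord;
    asm
    {
        mov EDX, p;
        mov EAX, 1;
        xchg [EDX], EAX;   // atomic RMW; implicitly locked on x86
        xor EAX, 1;        // true if the old value was 0 (we got the lock)
    }
}

void acquire()
{
    while (!tryAcquire()) {}   // a real mutex would back off or block here
}

void release()
{
    lockWord = 0;   // plain store; x86 won't reorder it with the
                    // stores made inside the critical section
}

The uncontended acquire/release pair is basically one locked instruction plus one store, which is why a mutex in the simple case costs about the same as a bare atomic operation.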


Sean
April 24, 2008
== Quote from Kevin Bealer (kevinbealer@gmail.com)'s article
> Sean Kelly Wrote:
> > == Quote from Walter Bright (newshound1@digitalmars.com)'s article
> > > Steven Schveighoffer wrote:
> > > > You can sort of work around it by wrapping the previously volatile statement in a function, but it seems like having volatile doesn't really hurt anything.  I'm curious to know why it was so bad that it was worth removing...
> > > Because after listening in while experts debated how to write multithreaded code safely, it became pretty clear that the chances of using volatile statements correctly are very small, even for experts. It's the wrong approach.
> >
> > Every tool can be mis-used with insufficient understanding.  Look at shared-
> > memory multiprogramming for instance.  It's quite easy and understandable
> > to share a few data structures between threads (which I'd assert is the original
> > intent anyway), but common practice among non-experts is to use mutexes
> > to protect code rather than data, and to call across threads willy-nilly.  It's
> > no wonder the commonly held belief is that multiprogramming is hard.
> >
> > Regarding lock-free programming in particular, I think it's worth pointing
> > out that leaving out support for lock-free programming in general excludes
> > an entire realm of code being written--not only library code to be ultimately
> > used by everyday programmers, but kernel code and such as well.  Look at
> > the Linux source code, for example.  As for the C++0x discussions, I feel
> > that some of the participants of the memory model discussion are experts
> > in the field and understand quite well the issues involved.
> >
> I've used a tiny amount of lock-free-like programming here and there, which is to
> say, code that uses the "compare and swap" idiom (on an IBM OS/390) for a few
> very limited purposes, and just the "atomic swap" (via a portable library).
> I was trying to do this with D a week or two back.  I wrote some inline ASM code
> using  "xchg" and "cmpxchg8b".  I was able to get xchg working (as far as I can
> tell) on DMD.  The same inline ASM code on GDC (64 bit machine) just threw a
> BUS error for some reason.  I couldn't get cmpxchg8b to do what I expected on
> either platform, but my assembly skills are weak, and my inline assembly skills
> are even weaker.  (it was my first stab at inline ASM in D).
> 1. I have no idea if my code was reasonable or did what I thought but...
> 2. there might be a gdc / dmd ASM compatibility issue.
> 3. I think it would be cool if there were atomic swap and ideally, compare
>     and swap type functions in D -- one more thing we could do portably that
>     C++ has to do non-portably.
> 4. By the way these links contain public domain versions of the "swap pointers
>     atomically" code from my current work location that might be useful:
> http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/ncbiatomic_8h-source.html
> http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/include/corelib/impl/ncbi_atomic_defs.h
> Unfortunately, it looks like they don't define the _CONDITIONALLY versions for
> the x86 or x86_64 platforms.  One of my libraries at work uses the atomic
> pointer-swapping to implement a lightweight mutex for a library I wrote and
> it's a big win.
> Any thoughts?  It would be neat to play with lock-free algorithms in D, especially
> since the papers I've read on the subject (Andrei's I think) say that it's much easier
> to get the simpler ones right in a garbage collected environment.

Tango (and Ares before it) has support for atomic load, store, storeIf (CAS), increment, and decrement.  Currently, x86 is the only architecture with a truly atomic implementation, however; other platforms fall back to using synchronized (largely because D doesn't support inline ASM for other platforms and because no one has asked for other platforms to be supported).  The API docs are here:

http://www.dsource.org/projects/tango/docs/current/tango.core.Atomic.html

And this is the source file:

http://www.dsource.org/projects/tango/browser/trunk/tango/core/Atomic.d

The unit tests all pass with DMD and I /think/ they pass with GDC as well, but I haven't verified this personally.  Also, the docs above are a bit misleading in that increment and decrement operations are actually available for the Atomic struct if T is an integer or a pointer type.  The doc tool doesn't communicate that properly because it has issues with "static if".  Oh, and I'd just use the default msync option unless your needs are really specific.  The acquire/release options are a bit tricky to use properly in practice.
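
For anyone who just wants the flavor of it, usage looks roughly like this (a rough sketch from memory, so double-check the argument order against the docs above):

import tango.core.Atomic;

Atomic!(int) counter;

void hit()
{
    counter.increment();    // atomic increment
}

bool claim(int expected, int replacement)
{
    // storeIf is the CAS primitive: it stores 'replacement' only if the
    // current value equals 'expected', and returns true on success.
    return counter.storeIf(replacement, expected);
}

int current()
{
    return counter.load();  // default msync, per the advice above
}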


Sean
April 24, 2008
Sean Kelly wrote:
> == Quote from Russell Lewis (webmaster@villagersonline.com)'s article
>> Sean Kelly wrote:
>>> 1) The cost of acquiring or committing a lock is generally roughly equivalent to
>>>      a memory synchronization, and sometimes less than that (futexes, etc).  So
>>>      it's not insignificant, but also not as bad as people seem to think.  I suspect
>>>      that locked operations are often subject to premature optimization.
>> What exactly do you mean by "memory synchronization?"  Just a write
>> barrier instruction, or something else?
>> If what you mean is a write barrier, then what you said isn't
>> necessarily true, especially as we head toward more and more cores, and
>> thus more and more caches.  Locks are almost always atomic
>> read/modify/write operations, and those can cause terrible cache
>> bouncing problems.  If you have N cores (each with its own cache) race
>> for the same lock (even if they are trying to get shared locks), you can
>> have up to N^2 bounces of the cache line around.
> 
> Yeah I meant an atomic RMW, or at least a load barrier for the acquire.  Releasing
> a mutex can often be done using a plain old store though, since write ops are
> typically ordered anyway and moving loads up into the mutex doesn't break
> anything.  My point, however, was simply that mutexes aren't terribly slower than
> atomic operations, since a mutex acquire/release is really little more than an atomic
> operation itself, at least in the simple case.

Ah, now I get what you were saying.  Yes, I agree that atomic instructions are not likely to be much faster than mutexes.  (Ofc, pthread mutexes, when they sleep, are a whole 'nother beast.)

What I thought you were referring to were barriers, which are (in the many-cache case) *far* faster than atomic operations.  Which is why I disagreed, in my previous post.
April 24, 2008
Walter Bright wrote:

> http://ftp.digitalmars.com/dmd.1.029.zip
...
> http://ftp.digitalmars.com/dmd.2.013.zip

I wanted to install both dmd and dmd2,
but they both wanted to use /etc/dmd.conf

So I modified my setup, so that dmd2 would
instead read dmd2.conf which had phobos2...
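
The dmd2.conf itself is just the usual format, only pointing at the D2 phobos, something along these lines (the exact paths depend on where your packages put things):

[Environment]
DFLAGS=-I/usr/include/d2/phobos -L-L/usr/lib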


I moved /usr/bin/dmd2 over to another dir
(I called mine "dmd2"): /usr/libexec/dmd2/dmd
In that directory I created a symlink file:
/usr/libexec/dmd2/dmd.conf -> /etc/dmd2.conf

And then I set up a shell wrapper for "dmd2",
that would call the relocated binary instead:
#!/bin/sh
exec /usr/libexec/dmd2/dmd "$@"


So now I can have my old D1 configuration in
dmd.conf and my D2 configuration in dmd2.conf.

And have both RPM packages installed at once,
without the file conflicts on /etc/dmd.conf.


Maybe something for the compiler to do too ?
(at least look for dmd2.conf before dmd.conf)

--anders

PS.
wxD is now tested OK with DMD 1.029 and 2.013
At least on Linux, as usual Windows is left...
"make DC=dmd" and "make DC=dmd2", respectively.
April 24, 2008
Walter Bright wrote:
> Because after listening in while experts debated how to write multithreaded code safely, it became pretty clear that the chances of using volatile statements correctly are very small, even for experts. It's the wrong approach.

Just out of curiosity, which approach would you recommend to ensure
that a variable which is updated from an interrupt service routine
(and, implicitly, any other thread) will be read from common memory
and not cached in a register?
I know there are a few, but which would you recommend?
I think ensuring that memory access happens at every variable access
is a straightforward solution (and a good one, if access is atomic).
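
For instance, something like this D1-style sketch is what I have in
mind (the names are made up):

bool done;   // set from the interrupt service routine / other thread

void waitForDone()
{
    bool seen;
    do
    {
        volatile { seen = done; }   // force a real load each pass instead
                                    // of caching 'done' in a register
    } while (!seen);
}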

Regards, frank
April 25, 2008
"Walter Bright" wrote
> Sean Kelly wrote:
>> From my understanding, the problem with doing this via inline assembler
>> is
>> that some compilers can actually optimize inline assembler, leaving no
>> truly
>> portable way to do this in language.  This issue has come up on
>> comp.programming.threads in the past, but I don't remember whether there
>> was any resolution insofar as C++ is concerned.
>
> There's always a way to do it, even if you have to write an external function and link it in. I still don't believe memory mapped register access justifies adding complex language features.

Who's adding?  We already have it and it works.

If volatile was not already a solved problem, I'd say yeah, sure it might be more trouble than it's worth.  But to remove it from the language seems unnecessary to me.

I was just asking for justification for *removing* it, not justification for having it :)

-Steve


April 25, 2008
On Fri, 25 Apr 2008 00:45:58 +0200, Anders F Björklund wrote:

> Walter Bright wrote:
> 
>> http://ftp.digitalmars.com/dmd.1.029.zip
> ...
>> http://ftp.digitalmars.com/dmd.2.013.zip
> 
> I wanted to install both dmd and dmd2, but they both wanted to use /etc/dmd.conf
> 
> So I modified my setup, so that dmd2 would instead read dmd2.conf which had phobos2...
> 
> 
> I moved /usr/bin/dmd2 over to another dir, I called mine "dmd2": /usr/libexec/dmd2/dmd In that directory I created a symlink file: /usr/libexec/dmd2/dmd.conf -> /etc/dmd2.conf
> 
> And then I set up a shell wrapper for "dmd2", that would call the
> relocated binary instead: #!/bin/sh
> exec /usr/libexec/dmd2/dmd "$@"
> 
> 
> So now I can have my old D1 configuration in dmd.conf and my D2 configuration in dmd2.conf.
> 
> And have both RPM packages installed at once, without the file conflicts on /etc/dmd.conf.
> 
> 
> Maybe something for the compiler to do too ? (at least look for
> dmd2.conf before dmd.conf)
> 
> --anders
> 
> PS.
> wxD is now tested OK with DMD 1.029 and 2.013 At least on Linux, as
> usual Windows is left... "make DC=dmd" and "make DC=dmd2", respectively.

I agree here; I feel the compiler should make more of a distinction between v1 and v2.