February 02, 2010
On Tue, 02 Feb 2010 01:04:37 -0500, Sean Kelly <sean at invisibleduck.org> wrote:
>> A problem that can come up is that the variable is promoted to a
>> register in a loop and is not updated from memory.
>> To avoid this one can use volatile. This issue is a priori disconnected
>> from the presence of barriers (well, one would hope that a correct
>> compiler disables "upgrading to registers" for variables in a locked
>> section crossing the boundary of it, but a priori it doesn't have to be
>> like that, and especially using just memory barriers, I would not trust
>> current compilers to always do the correct thing without a "volatile").
>
> Sadly, volatile is deprecated in D 2.0.  You pretty much have to use inline asm and rely on the fact that Walter has said D compilers should never optimize in or across asm code.

Well, to the best of my knowledge volatile has never been implemented in D 1.0, which is one of the reasons it was dropped from D 2.0.
February 01, 2010
Fawzi Mohamed wrote:
>> I suggest we resume and start with describing what was unclear, then specifying the steps we need to take to clarify things better in the draft. Thanks!
> 
> pp. 26-27
> "All operations on _balance are now protected by acquiring _guard. It
> may seem there is no need to protect balance with _guard because a
> double can be read atomically, but protection must be there for subtle
> reasons that shall come forth later. In brief, due to today's
> aggressive optimizing compilers and relaxed memory models, all access
> to shared data must entail some handshake between threads; no
> handshake means a thread could call obj.balance repeatedly and always
> get a cached copy from some processor's cache that never, ever gets
> updated, in spite of other threads' frenetic use of deposit and
> withdraw. (This is one of the ways in which modern multithreading
> defies intuition and confuses programmers versed in classic
> multithreading.)"
> 
> From the previous discussion it should be clear that, as written, it
> is wrong.

I'm afraid it's your assessment of the situation that is wrong or at least in conflict with the texts that I'm perusing.

There are two entities that must be kept in the know with regard to proper synchronization: the compiler and the underlying machine. The text chose to conflate the two (at least for the time being) in order to simplify the discussion.

> Indeed it is better to have a barrier there (if the update would not be atomic, in the general case) but there is no need for it in this specific case, as access is atomic.

I'm not sure. Clearly the compiler must be kept informed that no code motion and no enregistering should happen. So no contest there.

The problem is whether a barrier should also be inserted. According to the four POSIX guarantees in Butenhof's book "Programming with POSIX Threads", page 89, memory visibility by a thread of whatever another thread just did is only ensured if you insert a barrier. (Acquiring and releasing a mutex both issue barriers.) Butenhof actually goes to lengths to insist on it: "This is not only to protect against multiple writes---even when a thread only reads data it must use a mutex to ensure that it sees the most recent value of the data written while the mutex was locked." Later he does mention that only a barrier is really needed, but he never says _nothing_ is needed.

At the end of the visibility section he says "If you want to write portable Pthreads code, you will always guarantee correct memory visibility by using the Pthreads memory visibility rules instead of relying on any assumptions regarding the hardware or compiler behavior."

If you or I don't know of any machine that would keep on reading a datum from its own cache instead of consulting the main memory, we're making an assumption. The right thing to do is to keep the compiler _AND_ the machine in the loop by using appropriate fencing.
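
In code, the discipline the book is driving at looks like this (a sketch only; a D2 synchronized class stands in for the book's explicit _guard mutex):

	synchronized class BankAccount
	{
		private double _balance;

		void deposit(double amount)  { _balance += amount; }
		void withdraw(double amount) { _balance -= amount; }

		// Even the read takes the lock: acquiring and releasing
		// the monitor is the handshake that keeps the compiler
		// _AND_ the machine in the loop.
		double balance() { return _balance; }
	}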

> It is wrong that a cache of a processor might never be updated; it will be updated at some point, and a barrier (in general) will not force the update to be performed immediately.

According to my reading of Butenhof's text, there is nothing stopping a machine from delaying reads indefinitely if you don't insert a barrier.

BTW I am not under the misconception that a barrier is a flush operation. In fact Butenhof's book, which I use for reference, warns exactly about that in large font on page 93.

If you have references to back up your viewpoint please let me know.

> A problem that can come up is that the variable is promoted to a register in a loop and is not updated from memory.

Yes.

> To avoid this one can use volatile.

Volatile is no longer in D, and at that point in the text the notion has not been introduced and is not worth introducing.

> This issue is a priori disconnected from the presence of barriers (well, one would hope that a correct compiler disables "upgrading to registers" for variables in a locked section crossing the boundary of it, but a priori it doesn't have to be like that, and especially using just memory barriers, I would not trust current compilers to always do the correct thing without a "volatile").

Agreed. There must be a special construct understood by the compiler to (a) prevent reordering and enregistering and (b) insert the appropriate barrier instruction. One without the other is useless.
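
For instance (a sketch against today's core.atomic, which postdates this discussion, so take the names as an assumption rather than a proposal):

	import core.atomic;

	shared int ready;

	void publish()
	{
		// (a) no reordering or enregistering across this call, and
		// (b) a release barrier: prior writes drain out first.
		atomicStore!(MemoryOrder.rel)(ready, 1);
	}

	bool poll()
	{
		// An acquire load: a real load from memory on every call,
		// never a cached register copy.
		return atomicLoad!(MemoryOrder.acq)(ready) != 0;
	}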


Andrei
February 01, 2010
On Feb 1, 2010, at 10:24 PM, Andrei Alexandrescu wrote:
> 
> If you or I don't know of any machine that would keep on reading a datum from its own cache instead of consulting the main memory, we're making an assumption. The right thing to do is to keep the compiler _AND_ the machine in the loop by using appropriate fencing.

I did some googling before my last reply to try and find a NUMA architecture that works this way, but I didn't have any luck.  Cache coherency is just too popular these days.  But it seems completely reasonable to include such an assertion in a standard as broadly applicable as POSIX.  I'm not sure I agree that fencing is always necessary though--we're really operating at the level of "assumptions regarding hardware behavior" here.

>> It is wrong that a cache of a processor might be never updated, it will be updated at some point, and a barrier (in general) will not force the update to be performed immediately.
> 
> According to my reading of Butenhof's text, there is nothing stopping a machine from delaying reads indefinitely if you don't insert a barrier.

This is certainly a useful way to think about concurrent code--it makes problems a lot easier to find.  But I think it's a bit difficult to apply at times.  x86, for example, still really only provides the LOCK prefix as an ad-hoc memory barrier--it isn't even explicitly for this purpose, and it's often more heavyweight than necessary.  The only way to perform a load-acquire or a store-release on x86 is to just do a plain old MOV.  It's really too bad that RC (release consistency) never caught on; the model is very programmer-oriented.
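
Concretely, the x86 mapping looks roughly like this (a sketch; the instruction choices are typical codegen for the usual orderings, not something any compiler is required to emit, and core.atomic here again postdates this thread):

	import core.atomic;

	shared int x;

	void storeRelease(int v)
	{
		// typically a plain MOV: x86 stores don't pass earlier stores
		atomicStore!(MemoryOrder.rel)(x, v);
	}

	void storeSeqCst(int v)
	{
		// needs the full barrier: typically XCHG or a LOCK'd op
		atomicStore!(MemoryOrder.seq)(x, v);
	}

	int loadAcquire()
	{
		// also a plain MOV: x86 loads don't pass later loads
		return atomicLoad!(MemoryOrder.acq)(x);
	}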

>> This issue is a priori disconnected from the presence of barriers (well, one would hope that a correct compiler disables "upgrading to registers" for variables in a locked section crossing the boundary of it, but a priori it doesn't have to be like that, and especially using just memory barriers, I would not trust current compilers to always do the correct thing without a "volatile").
> 
> Agreed. There must be a special construct understood by the compiler to (a) prevent reordering and enregistering and (b) insert the appropriate barrier instruction. One without the other is useless.

I think there's definite value in just being able to constrain compiler optimization, since it's necessary for anyone who wants to implement a concurrency library.  As long as the compiler is required to treat inline asm as effectively volatile, we're in decent shape on platforms the compiler supports inline asm for.  For others I guess we'll have to trust that no optimization would occur across opaque function calls?  I recall this sort of thing coming up in the C++0x memory model talks, and I wish I had a clearer recollection of what came out of it... or that Hans Boehm were a part of this conversation.
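
Something along these lines, perhaps (a sketch that leans on Walter's rule above; compilerBarrier constrains only the compiler and emits no hardware fence, and opaqueNoOp is hypothetical):

	void compilerBarrier()
	{
		version (D_InlineAsm_X86)
		{
			// The compiler may not optimize in or across an asm
			// block, so even an empty one pins loads and stores.
			asm { nop; }
		}
		else
		{
			// Hypothetical fallback: an opaque extern(C) call the
			// optimizer cannot see through.
			opaqueNoOp();
		}
	}

	// Assumed to live in a separately compiled, opaque object file.
	extern (C) void opaqueNoOp();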
February 01, 2010
Sean Kelly wrote:
> I think there's definite value in just being able to constrain compiler optimization, since it's necessary for anyone who wants to implement a concurrency library.  As long as the compiler is required to treat inline asm as effectively volatile, we're in decent shape on platforms the compiler supports inline asm for.  For others I guess we'll have to trust that no optimization would occur across opaque function calls?  I recall this sort of thing coming up in the C++0x memory model talks, and I wish I had a clearer recollection of what came out of it... or that Hans Boehm were a part of this conversation.

Isn't it enough that shared data is always handled as volatile?

Andrei
February 02, 2010
On 2-feb-10, at 07:19, Robert Jacques wrote:

> On Tue, 02 Feb 2010 01:04:37 -0500, Sean Kelly <sean at invisibleduck.org> wrote:
>>> A problem that can come up is that the variable is promoted to a
>>> register in a loop and is not updated from memory.
>>> To avoid this one can use volatile. [...]
>>
>> Sadly, volatile is deprecated in D 2.0.  You pretty much have to use inline asm and rely on the fact that Walter has said D compilers should never optimize in or across asm code.
>
> Well, to the best of my knowledge volatile has never been implemented in D 1.0, which is one of the reasons it was dropped from D 2.0.

That is very sad; I did find volatile very useful. I was interpreting
it as something like a way to switch off optimizations in a code
segment, so that one can have some kind of portable asm.
I find it very useful together with barriers...

February 02, 2010
On 2010-02-01 at 17:52, Fawzi Mohamed wrote:

> On 1-feb-10, at 23:43, Michel Fortin wrote:
> 
>> On 2010-02-01 at 17:26, Fawzi Mohamed wrote:
>> 
>>> you don't need a barrier, you need a volatile statement to avoid compiler optimizations
>> 
>> Is that something the compiler could do in an optimization pass, I mean determining when a barrier isn't necessary and volatile is enough? For instance when periodically checking a shared variable in a loop?
> 
> I think that there is some confusion about what barriers do; they just introduce a partial ordering. In general a compiler cannot decide automatically to remove them.

But the compiler could choose not to add them when they would have no effect. That's what I meant by optimization.

Of course if you're putting barriers yourself the compiler shouldn't optimize them away.

> The handshake Andrei was talking about really isn't the point here.
> The thing is that the compiler should avoid some optimizations, in the sense that it should always load from memory (and not store it in a register and read it from there).
> A barrier alone has nothing to do with that, but a read barrier can ensure that what is visible after the read reflects the state at a corresponding write barrier.

I was assuming that all accesses to a shared variable would be volatile in addition to having a barrier.

Also, if I understand well, with shared variables it's not the programmer who put barriers but the compiler. The compiler could be conservative and put barriers everywhere, or it could be smarter and put them only where required for sequential consistency among access to shared variables (which I called optimization).

So for instance, this:

	shared int flag;

	int wait() {
		int loopCount;
		while (flag == 0) {
			++loopCount;
		}
		return loopCount;
	}

doesn't warrant a barrier at each loop iteration; one barrier before the loop should be sufficient (even though all accesses should be volatile).

That's if I understand barriers correctly. It's not a subject I master very well.

What's important is that the compiler preserves sequential consistency and atomicity of reads/writes for shared variables. Whether it's by inserting barriers everywhere or doing so just where it's really needed is only a quality-of-implementation issue.

Perhaps the compiler should be allowed to reorder access to non-shared variables across the barriers it inserts for shared variables. Since those variables are not shared, it shouldn't have any effect, right?

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/



February 02, 2010
On 2-feb-10, at 13:20, Fawzi Mohamed wrote:

>
> On 2-feb-10, at 07:19, Robert Jacques wrote:
> [...]
>> Well, to the best of my knowledge volatile has never been implemented in D 1.0, which is one of the reasons it was dropped from D 2.0.
>
> That is very sad; I did find volatile very useful. I was
> interpreting it as something like a way to switch off
> optimizations in a code segment, so that one can have some kind of
> portable asm.
> I find it very useful together with barriers...
... does it mean that my code is incorrect in general (in D 1.0)?
February 02, 2010
On 2-feb-10, at 07:47, Sean Kelly wrote:

> On Feb 1, 2010, at 10:24 PM, Andrei Alexandrescu wrote:
>>
>> If you or I don't know of any machine that would keep on reading a datum from its own cache instead of consulting the main memory, we're making an assumption. The right thing to do is to keep the compiler _AND_ the machine in the loop by using appropriate fencing.
>
> I did some googling before my last reply to try and find a NUMA architecture that works this way, but I didn't have any luck.  Cache coherency is just too popular these days.  But it seems completely reasonable to include such an assertion in a standard as broadly applicable as POSIX.  I'm not sure I agree that fencing is always necessary though--we're really operating at the level of "assumptions regarding hardware behavior" here.

My imaginary hardware model is the following:

several processors, each with a separate cache.
Operations on the cache are kept in a kind of journal and communicated
to other processors.
A processor continuously updates its cache and sends its updates to
the other processors (for me it makes no sense to skip this; if you do,
then you don't have a shared memory system).
It might be that different processors see the work of other processors
delayed or out of order; actually it is most likely so (for
performance reasons).

* barriers
A write barrier ensures that all writes done on the cache of processor
X (where the barrier was issued) are communicated to the caches of
other processors *before* any subsequent write.
A read barrier ensures that all reads done on processor Y (where the
barrier was issued) before the barrier are completed before any read
after the barrier.
Using them one can ensure that if, before the read barrier Y, one sees
a change that was done after the write barrier X, then all changes done
before the write barrier X are visible after the read barrier Y.
Thus barriers can introduce a weak ordering.
That is all they do; they don't synchronize or ensure that a change
becomes visible. With time all changes become visible anyhow;
otherwise you are not working on a shared memory machine (NUMA or not).
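
The guarantee can be pictured like this (a sketch assuming core.atomic.atomicFence, which postdates this thread, as the barrier primitive; __gshared stands in for plainly shared memory):

	import core.atomic;

	__gshared int data;
	__gshared int flag;

	void writerX()
	{
		data = 42;
		atomicFence();      // write barrier: 'data' must reach the
		                    // other caches before 'flag' does
		flag = 1;
	}

	void readerY()
	{
		if (flag != 0)      // saw a change made after X's barrier...
		{
			atomicFence();  // read barrier: ...so everything written
			                // before X's barrier is visible below
			assert(data == 42);
		}
	}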

* atomic load/stores
Atomic loads or stores don't really change much, but they ensure that a
change is done at once. Their cost (if the hardware supports them) is
typically very small, and often, if supported for a given size, they
are always used (64-bit on 32-bit processors is an exception).

* atomic update
This is something much stronger: basically it ensures that an update
is immediately globally seen. To make its cost reasonable it regards
just one memory location (some work has to be done so that the update
is perceived as immediate).
An atomic update a priori does not imply any kind of barrier for other
memory; only the updated memory location is synchronized.
Atomic operations can be used to implement locks and unique counters
when used in combination with barriers.
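
Combined, the two give you a lock; a sketch (again assuming today's core.atomic as a stand-in for the tango/blip primitives):

	import core.atomic;

	struct SpinLock
	{
		shared int locked;

		void lock()
		{
			// Atomic update: exactly one thread wins the 0 -> 1
			// transition; cas defaults to sequentially consistent
			// ordering, which supplies the accompanying barrier.
			while (!cas(&locked, 0, 1)) { }
		}

		void unlock()
		{
			// The seq-cst store doubles as the release barrier.
			atomicStore(locked, 0);
		}
	}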

That is about it; you will not find this hardware definition anywhere, but that is my conceptual model, and I think it can be applied to all hardware that I know of and, I think, to future hardware too.

You can see that barriers don't imply immediate synchronization (unlike atomic updates) but synchronize *all* changes done. Itanium was toying with the idea of more local barriers that synchronize only part of the memory, but that is difficult to program for (the compiler could probably use them in some cases), and I have ignored it.

Anyway here is my understanding of the situation, and I think it is pretty good.

By the way, I found the atomic module in Tango difficult to use
correctly (maybe I had not understood it), and I rewrote it.
Due to the recent reorganization in Tango an older (buggy) version was
resurrected, and I don't know what will happen now; probably the new
one will be merged in, but if you are interested the latest version is
in blip:
	http://github.com/fawzi/blip/blob/master/blip/sync/Atomic.d
It is Apache 2.0, but I am willing to relicense it to whatever if
deemed useful.

As a note, I would like to say that atomic updates, while useful for implementing some operations, should in my opinion not be the method of choice for parallel programming (they are difficult to compose); still, in D they should definitely be usable by whoever wants them (i.e. I want a working volatile... at least in D 1.0 ;).

Fawzi
February 02, 2010
On 2-feb-10, at 15:04, Fawzi Mohamed wrote:

> [...]
> * atomic update
> This is something much stronger: basically it ensures that an update
> is immediately globally seen. [...]
> Atomic operations can be used to implement locks and unique counters
> when used in combination with barriers.
> [...]

Please note that the wording of some atomic operations is quite strange, but if you can use them for spinlocks or similar then the update has to be "effectively" instantaneous; an atomic operation local to a single processor is not really useful...
February 02, 2010
On 2-feb-10, at 14:19, Michel Fortin wrote:

> That's if I understand barriers correctly. It's not a subject I master very well.

I hope my previous post makes it a little more clear.

Fawzi