February 02, 2010
On Feb 1, 2010, at 10:50 PM, Andrei Alexandrescu wrote:

> Sean Kelly wrote:
>> I think there's definite value in just being able to constrain compiler optimization, since it's necessary for anyone who wants to implement a concurrency library.  As long as the compiler is required to treat inline asm as effectively volatile, then we're in decent shape on platforms where the compiler supports inline asm.  For others I guess we'll have to trust that no optimization would occur across opaque function calls?  I recall this sort of thing coming up in the C++0x memory model talks, and I wish I had a clearer recollection of what came out of it... or that Hans Boehm were a part of this conversation.
> 
> Isn't it enough that shared data is always handled as volatile?

Hm... yeah probably.  If a shared variable is passed into a function call as a mutable ref then the compiler would have to assume it could be modified, so it couldn't optimize across that call.  I guess that works then :-)
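A minimal sketch of that point (the names are mine, purely for illustration):

void step(ref shared bool done)
{
    // might set done = true at some point
}

void waitLoop(ref shared bool done)
{
    // step() takes the flag as a mutable shared ref, so the compiler has to
    // assume the call may have written it and must re-read 'done' on every
    // iteration rather than caching it in a register
    while (!done)
        step(done);
}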
February 02, 2010
On Feb 1, 2010, at 10:19 PM, Robert Jacques wrote:

> On Tue, 02 Feb 2010 01:04:37 -0500, Sean Kelly <sean at invisibleduck.org> wrote:
>>> A problem that can come up is that the variable is promoted to a register in a loop and is not updated from memory.
>>> To avoid this one can use volatile. This issue is a priori disconnected from the presence of barriers (well, one would hope that a correct compiler disables "upgrading to registers" for variables in a locked section crossing the boundary of it, but a priori it doesn't have to be like that, and especially when using just memory barriers, I would not trust current compilers to always do the correct thing without a "volatile").
>> 
>> Sadly, volatile is deprecated in D 2.0.  You pretty much have to use inline asm and rely on the fact that Walter has said D compilers should never optimize in or across asm code.
> 
> Well, to the best of my knowledge volatile has never been implemented in D 1.0, which is one of the reasons it was dropped from D 2.0.

I think it was semi-implemented.  The compiler didn't optimize functions containing a volatile statement, if I recall.
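For what it's worth, the asm workaround amounts to something like this (just a sketch, relying on the rule that D compilers never optimize in or across asm blocks):

void compilerBarrier()
{
    // an asm block is opaque to the optimizer, so loads and stores are not
    // cached or moved across it
    asm { nop; }
}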
February 02, 2010
On 2-feb-10, at 16:08, Sean Kelly wrote:

> On Feb 1, 2010, at 10:19 PM, Robert Jacques wrote:
>
>> On Tue, 02 Feb 2010 01:04:37 -0500, Sean Kelly <sean at invisibleduck.org> wrote:
>>> Sadly, volatile is deprecated in D 2.0.  You pretty much have to use inline asm and rely on the fact that Walter has said D compilers should never optimize in or across asm code.
>>
>> Well, to the best of my knowledge volatile has never been implemented in D 1.0, which is one of the reasons it was dropped from D 2.0.
>
> I think it was semi-implemented.  The compiler didn't optimize functions containing a volatile statement, if I recall.

Ok, that is enough for me... thanks, a confirmation would be greatly appreciated though :)

February 02, 2010
On Feb 2, 2010, at 6:04 AM, Fawzi Mohamed wrote:
> 
> My imaginary hardware model is the following:
> 
> several processors, each with a separate cache.
> Operations on the cache are kept in a kind of journal and communicated to other processors.
> A processor continuously updates its cache and sends its updates to the other processors (for me it makes no sense to skip this; if you do then you don't have a shared memory system).
> It might be that different processors see the work of other processors delayed or out of order; actually, most likely it is so (for performance reasons).

The weird part is that a processor may see the work of another processor out of order because of its own load reordering rather than because the stores were issued out of order.  I think this is why Bartosz has said that you need a barrier at both the read and write locations, and I guess this is the handshake Andrei mentioned.

> * barriers
> A write barrier ensures that all writes done on the cache of processor X (where the barrier was issued) are communicated to the caches of other processors *before* any subsequent write.
> A read barrier ensures that all reads done on processor Y (where the barrier was issued) before the barrier are completed before any read after the barrier.

I'd call these a hoist-store barrier and a hoist-load barrier (Alex Terekhov's terminology).  SPARC would call them a LoadStore and a LoadLoad barrier, I believe.  I can never keep the SPARC terminology straight because they use "load" to represent both an actual load from memory and something about what type of code movement the barrier prevents, so the first word is one and the second word is the other.  Saying that a load or store has acquire semantics is a stronger guarantee because it constrains the movement of both loads and stores.

> * atomic load/stores
> atomic loads or stores don't really change much, but they ensure that a change happens all at once. Their cost (if the hardware supports them) is typically very small, and often, if supported for a given size, they are always used (64-bit on 32-bit processors is an exception).

I've always considered atomic operations to only guarantee that the operation happens as an indivisible unit.  It may still be delayed, and perhaps longer than an op with a barrier since the CPU could reorder operations to occur before it in some cases.  For example, a MOV instruction on x86 is atomic but there's no barrier.
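For illustration, an x86-32 sketch in DMD's inline-assembler syntax (the function name is mine, and this is untested):

uint atomicLoadNoFence(ref shared uint val)
{
    asm
    {
        mov ECX, val;    // 'val' is a ref parameter, so this loads its address
        mov EAX, [ECX];  // plain aligned MOV: indivisible, but no ordering
    }
}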

> By the way I found the atomic module in tango difficult to use correctly (maybe I had not understood it), and I rewrote it.

What problems did you have?
February 02, 2010
On Feb 2, 2010, at 7:07 AM, Sean Kelly wrote:

> On Feb 1, 2010, at 10:50 PM, Andrei Alexandrescu wrote:
> 
>> Sean Kelly wrote:
>>> I think there's definite value in just being able to constrain compiler optimization, since it's necessary for anyone who wants to implement a concurrency library.  As long as the compiler is required to treat inline asm as effectively volatile, then we're in decent shape on platforms where the compiler supports inline asm.  For others I guess we'll have to trust that no optimization would occur across opaque function calls?  I recall this sort of thing coming up in the C++0x memory model talks, and I wish I had a clearer recollection of what came out of it... or that Hans Boehm were a part of this conversation.
>> 
>> Isn't it enough that shared data is always handled as volatile?
> 
> Hm... yeah probably.  If a shared variable is passed into a function call as a mutable ref then the compiler would have to assume it could be modified, so it couldn't optimize across that call.  I guess that works then :-)

Actually, even a const ref, since there may be a barrier on the load as well.
February 02, 2010
On 2-feb-10, at 18:50, Sean Kelly wrote:

> On Feb 2, 2010, at 6:04 AM, Fawzi Mohamed wrote:
>>
>> My imaginary hardware model is the following:
>>
>> several processors, each with a separate cache.
>> Operations on the cache are kept in a kind of journal and
>> communicated to other processors.
>> A processor continuously updates its cache and sends its updates to
>> the other processors (for me it makes no sense to skip this; if you
>> do then you don't have a shared memory system).
>> It might be that different processors see the work of other processors
>> delayed or out of order; actually, most likely it is so (for
>> performance reasons).
>
> The weird part is that a processor may see the work of another processor out of order because of its own load reordering rather than because the stores were issued out of order.  I think this is why Bartosz has said that you need a barrier at both the read and write locations, and I guess this is the handshake Andrei mentioned.

Yes, indeed the barriers you typically need are the ones I describe later, and I suppose that is what Andrei meant by handshake.

>> * barriers
>> A write barrier ensures that all writes done on the cache of
>> processor X (where the barrier was issued) are communicated to the
>> caches of other processors *before* any subsequent write.
>> A read barrier ensures that all reads done on processor Y (where
>> the barrier was issued) before the barrier are completed before any
>> read after the barrier.
>
> I'd call these a hoist-store barrier and a hoist-load barrier (Alex Terekhov's terminology).  SPARC would call them a LoadStore and a LoadLoad barrier, I believe.  I can never keep the SPARC terminology straight because they use "load" to represent both an actual load from memory and something about what type of code movement the barrier prevents, so the first word is one and the second word is the other.  Saying that a load or store has acquire semantics is a stronger guarantee because it constrains the movement of both loads and stores.


>> * atomic load/stores
>> atomic loads or stores don't really change much, but they ensure that a
>> change happens all at once. Their cost (if the hardware supports them)
>> is typically very small, and often, if supported for a given size, they
>> are always used (64-bit on 32-bit processors is an exception).
>
> I've always considered atomic operations to only guarantee that the operation happens as an indivisible unit.  It may still be delayed, and perhaps longer than an op with a barrier since the CPU could reorder operations to occur before it in some cases.  For example, a MOV instruction on x86 is atomic but there's no barrier.

indeed

>> By the way I found the atomic module in tango difficult to use correctly (maybe I had not understood it), and I rewrote it.
>
> What problems did you have?

Yes, I don't remember exactly, but I found the terminology unfamiliar
to me (hoist-store barrier and hoist-load), and I did not see the two
barriers I wanted (basically the ones you typically need are
store-store and load-load: order stores w.r.t. stores and order loads
w.r.t. loads, what I call write and read barriers).
Typically I want either a flag or something like that, and I like the
barriers to be "hidden" in the operation I call.
I don't want to have to think about whether a barrier has to go before
or after the operation each time I make a call; some seemed to be in
the wrong place to me.
When writing a flag, for example, the write barrier goes before writing
the flag, whereas when reading, the read barrier goes after reading the
flag.
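In code, the pattern I mean is roughly this (writeBarrier and readBarrier
are just placeholder names for a store-store and a load-load primitive,
not the actual API; the stubs are left empty here):

void writeBarrier() { }   // stands in for a store-store barrier
void readBarrier()  { }   // stands in for a load-load barrier

shared int  payload;
shared bool flag;

void producer()
{
    payload = 42;      // write the data
    writeBarrier();    // store-store: payload must be visible before the flag
    flag = true;       // publish
}

void consumer()
{
    while (!flag) { }  // spin until published
    readBarrier();     // load-load: don't read payload before seeing the flag
    assert(payload == 42);
}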
Whereas the old module had this (opaque to me) terminology and you
sort of had to pass it to each call; furthermore, the barrier
operations themselves looked wrong to me (lock is not enough).
It is entirely possible that I simply did not understand how it was
supposed to work; in any case I find (clearly ;) my version clearer,
and I can use it correctly.

Fawzi
February 02, 2010
On Feb 2, 2010, at 12:52 PM, Fawzi Mohamed wrote:

> 
> On 2-feb-10, at 18:50, Sean Kelly wrote:
> 
>> On Feb 2, 2010, at 6:04 AM, Fawzi Mohamed wrote:
> 
>>> By the way I found the atomic module in tango difficult to use correctly (maybe I had not understood it), and I rewrote it.
>> 
>> What problems did you have?
> 
> Yes, I don't remember exactly, but I found the terminology unfamiliar to me (hoist-store barrier and hoist-load), and I did not see the two barriers I wanted (basically the ones you typically need are store-store and load-load: order stores w.r.t. stores and order loads w.r.t. loads, what I call write and read barriers).
> Typically I want either a flag or something like that, and I like the barriers to be "hidden" in the operation I call.
> I don't want to have to think about whether a barrier has to go before or after the operation each time I make a call; some seemed to be in the wrong place to me.
> When writing a flag, for example, the write barrier goes before writing the flag, whereas when reading, the read barrier goes after reading the flag.
> Whereas the old module had this (opaque to me) terminology and you sort of had to pass it to each call; furthermore, the barrier operations themselves looked wrong to me (lock is not enough).

I think LOCK is enough.  The FENCE instructions didn't apply to normal memory operations until very recently--they had always been for streaming operations only.  A LOCK operation is equivalent to an MFENCE on the operation itself.  On platforms where finer-grained barriers are available, assume that they are in the places you expect.
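To make that concrete, the classic trick is a LOCKed no-op read-modify-write, which orders all earlier loads and stores before all later ones on ordinary write-back memory, i.e. the same property you would otherwise ask MFENCE for (x86 sketch only, DMD inline-asm syntax assumed):

void fullBarrier()
{
    asm
    {
        lock;                    // LOCK prefix applies to the next instruction
        add dword ptr [ESP], 0;  // locked no-op RMW on the stack top: full fence
    }
}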

> It is entirely possible that I simply did not understand how it was supposed to work; in any case I find (clearly ;) my version clearer, and I can use it correctly.

I'll have to give it another look.  Last time I checked the API didn't seem to have changed very much.
February 08, 2010



----- Original Message ----
> From: Andrei Alexandrescu <andrei at erdani.com>
> 
> (I assume you meant "BankAccount[string]".)
> 
> It all depends on the behavior of the types shared(BankAccount[string]), shared(BankAccount), and shared(string[]).
> 
> I didn't quite mention what happens to member arrays inside synchronized methods. I think I wouldn't have to, but ~= warrants it. Here's the deal: inside a synchronized method, a member of type T[] is shared(T)[], except if you want to take the address of the member, in which case it's shared(T[]). Yep, that's tail const. I haven't thought a great deal about this, but arrays of shared(T) may be appended to efficiently.

I thought the decision was made (in discussing the array append ideas I implemented) that shared array appending efficiency was not important.

I don't see how it can be...  You must always take a lock while you are expanding a shared array, which has proven to be the major slowdown in the current released runtime.  Not using a lock reintroduces memory stomping, which I think we all agree is worse than inefficient shared array appending.

In short, even a tail-shared array points to a shared capacity field.  Updating that capacity field must be surrounded by a lock.
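Roughly like this (the names and layout here are mine, not the actual runtime implementation; it's just to show where the lock has to sit):

import core.sync.mutex;

__gshared Mutex appendLock;   // guards the array's length/capacity bookkeeping
__gshared int[] data;

shared static this() { appendLock = new Mutex; }

void sharedAppend(int x)
{
    appendLock.lock();
    scope (exit) appendLock.unlock();
    data ~= x;   // capacity check, possible reallocation, and the write all
                 // happen while no other thread can touch the capacity
}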

> I have given zero thought to shared associative arrays, and this is a great time to start talking about it. For my money I think we need to have a whole separate type SharedMap(K, V).

I agree there.

-Steve



