November 14, 2012
On 11/14/12 6:39 AM, Alex Rønne Petersen wrote:
> On 14-11-2012 15:14, Andrei Alexandrescu wrote:
>> On 11/14/12 1:19 AM, Walter Bright wrote:
>>> On 11/13/2012 11:56 PM, Jonathan M Davis wrote:
>>>> Being able to have double-checked locking work would be valuable,
>>>> and having memory barriers would reduce race condition weirdness
>>>> when locks aren't used properly, so I think that it would be
>>>> desirable to have memory barriers.
>>>
>>> I'm not saying "memory barriers are bad". I'm saying that having the
>>> compiler blindly insert them for shared reads/writes is far from the
>>> right way to do it.
>>
>> Let's not be hasty. That works for Java and C#, and is allowed in C++.
>>
>> Andrei
>>
>>
>
> I need some clarification here: By memory barrier, do you mean x86's
> mfence, sfence, and lfence?

Sorry, I was imprecise. We need to (a) define intrinsics for loading and storing data with high-level semantics (a short list: acquire, release, acquire+release, and sequentially-consistent) and THEN (b) implement the needed code generation appropriately for each architecture. Indeed on x86 there is little need to insert fence instructions, BUT there is a definite need for the compiler to prevent certain reorderings. That's why implementing shared data operations (whether implicit or explicit) as sheer library code is NOT possible.
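To illustrate (a): a minimal sketch written against the names core.atomic already uses (where the list is spelled raw/acq/rel/seq), publishing data with a release store and an acquire load:

import core.atomic;

shared int payload;  // the data being published
shared int ready;    // the publication flag

void producer()
{
    atomicStore!(MemoryOrder.raw)(payload, 42);  // relaxed store of the data
    atomicStore!(MemoryOrder.rel)(ready, 1);     // release: publishes payload
}

void consumer()
{
    while (atomicLoad!(MemoryOrder.acq)(ready) == 0) {}  // acquire: pairs
                                                         // with the release
    assert(atomicLoad!(MemoryOrder.raw)(payload) == 42); // guaranteed visible
}

On x86 every one of these compiles to a plain mov; the entire effect of (b) there is the compiler refraining from reordering them.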

> Because as Walter said, inserting those blindly when unnecessary can
> lead to terrible performance because it practically murders
> pipelining.

I think at this point we need to develop a better understanding of what's going on before issuing assessments.


Andrei
November 14, 2012
On 14-11-2012 15:32, Andrei Alexandrescu wrote:
> On 11/14/12 4:23 AM, David Nadlinger wrote:
>> On Wednesday, 14 November 2012 at 00:04:56 UTC, deadalnix wrote:
>>> That is what Java's volatile does. It has several use cases, including
>>> valid double-checked locking (it has to be noted that this idiom is
>>> used incorrectly in druntime ATM, which proves both its usefulness and
>>> that it requires language support) and the disruptor pattern, which I
>>> wanted to implement for message passing in D but couldn't because of
>>> the lack of support at the time.
>>
>> What stops you from using core.atomic.{atomicLoad, atomicStore}? I don't
>> know whether there might be a weird spec loophole which could
>> theoretically lead to them being undefined behavior, but I'm sure that
>> they are guaranteed to produce the right code on all relevant compilers.
>> You can even specify the memory order semantics if you know what you are
>> doing (although this used to trigger a template resolution bug in the
>> frontend, no idea if it works now).
>>
>> David
>
> This is a simplification of what should be going on. The
> core.atomic.{atomicLoad, atomicStore} functions must be intrinsics so
> the compiler generates sequentially consistent code with them (i.e. does
> not perform certain reorderings). Then there are loads and stores with
> weaker consistency semantics (acquire, release, acquire/release, and
> consume).
>
> Andrei

They already work as they should:

* DMD: They use inline asm, so they're guaranteed not to be reordered. Calls aren't reordered by DMD either, so even if that weren't the case, it would still work.
* GDC: They map directly to the GCC __sync_* builtins, which have the semantics you describe (with full sequential consistency).
* LDC: They map to LLVM's load/store instructions with the atomic flag set and with the given atomic consistency, which have the semantics you describe.

I don't think there's anything that actually needs to be fixed there.
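In other words, ordinary user code like the following already compiles to the right thing on all three (a minimal sketch using the seq-cst defaults):

import core.atomic;

shared int counter;

void worker()
{
    atomicOp!"+="(counter, 1);       // atomic read-modify-write
    int seen = atomicLoad(counter);  // sequentially consistent by default
    if (seen > 100)
        cas(&counter, seen, 0);      // compare-and-swap; resets only if
                                     // the value is still the one we saw
}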

-- 
Alex Rønne Petersen
alex@lycus.org
http://lycus.org
November 14, 2012
On 2012-11-14 15:33, Andrei Alexandrescu wrote:

> Actually this hypothesis is false.

That we should remove it, or that it's not working/nobody understands what it should do? If it's the latter, then this thread is evidence that my hypothesis is true.

-- 
/Jacob Carlborg
November 14, 2012
On 14-11-2012 15:50, deadalnix wrote:
> On 14/11/2012 15:39, Alex Rønne Petersen wrote:
>> On 14-11-2012 15:14, Andrei Alexandrescu wrote:
>>> On 11/14/12 1:19 AM, Walter Bright wrote:
>>>> On 11/13/2012 11:56 PM, Jonathan M Davis wrote:
>>>>> Being able to have double-checked locking work would be valuable,
>>>>> and having memory barriers would reduce race condition weirdness
>>>>> when locks aren't used properly, so I think that it would be
>>>>> desirable to have memory barriers.
>>>>
>>>> I'm not saying "memory barriers are bad". I'm saying that having the
>>>> compiler blindly insert them for shared reads/writes is far from the
>>>> right way to do it.
>>>
>>> Let's not be hasty. That works for Java and C#, and is allowed in C++.
>>>
>>> Andrei
>>>
>>>
>>
>> I need some clarification here: By memory barrier, do you mean x86's
>> mfence, sfence, and lfence? Because as Walter said, inserting those
>> blindly when unnecessary can lead to terrible performance because it
>> practically murders pipelining.
>>
>
> In fact, x86 is mostly sequentially consistent due to its memory model.
> It only requires an mfence when a shared store is followed by a shared
> load.

I just used x86's fencing instructions as an example because most people here are familiar with that architecture. The problem is much, much bigger on architectures like ARM, MIPS, and PowerPC, which have far weaker memory ordering guarantees.
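For reference, the store-followed-by-load case mentioned above is the classic Dekker pattern; a sketch using core.atomic's seq-cst defaults:

import core.atomic;

shared int x, y;

void thread1()
{
    atomicStore(x, 1);       // seq-cst store
    int r1 = atomicLoad(y);  // seq-cst load; the barrier implied between
                             // the two is the one store->load fence that
                             // x86 actually needs
}

void thread2()
{
    atomicStore(y, 1);
    int r2 = atomicLoad(x);
    // With seq-cst atomics, r1 == 0 && r2 == 0 cannot happen; with plain
    // stores and loads, x86's store buffering makes it observable.
}

On the weaker architectures, nearly every pairing of operations needs such a barrier, not just this one.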

>
> See : http://g.oswego.edu/dl/jmm/cookbook.html for more information on
> the barrier required on different architectures.
>
>> (And note that you can't optimize this either; since the dependencies
>> memory barriers are supposed to express are subtle and not detectable by
>> a compiler, the compiler would always have to insert them because it
>> can't know when it would be safe not to.)
>>
>
> The compiler is aware of what is thread-local and what isn't. This means
> the compiler can fully optimize TL stores and loads (e.g. doing register
> promotion or reordering them across shared stores/loads).

Thread-local loads and stores are not atomic and thus do not take part in the reordering constraints that atomic operations impose. See e.g. the LLVM docs for atomicrmw and atomic load/store.

>
> This has a cost, indeed, but it is useful, and Walter's solution of
> casting away shared when a mutex is acquired is always available.

-- 
Alex Rønne Petersen
alex@lycus.org
http://lycus.org
November 14, 2012
On 2012-11-14 15:22, Andrei Alexandrescu wrote:

> It's not an advantage, it's a necessity.

Walter seems to indicate that there is no technical reason for "shared" to be part of the language. I don't know how these memory barriers work, that's why I'm asking. Does it need to be in the language or not?

-- 
/Jacob Carlborg
November 14, 2012
On 14-11-2012 16:08, Andrei Alexandrescu wrote:
> On 11/14/12 6:39 AM, Alex Rønne Petersen wrote:
>> On 14-11-2012 15:14, Andrei Alexandrescu wrote:
>>> On 11/14/12 1:19 AM, Walter Bright wrote:
>>>> On 11/13/2012 11:56 PM, Jonathan M Davis wrote:
>>>>> Being able to have double-checked locking work would be valuable,
>>>>> and having memory barriers would reduce race condition weirdness
>>>>> when locks aren't used properly, so I think that it would be
>>>>> desirable to have memory barriers.
>>>>
>>>> I'm not saying "memory barriers are bad". I'm saying that having the
>>>> compiler blindly insert them for shared reads/writes is far from the
>>>> right way to do it.
>>>
>>> Let's not be hasty. That works for Java and C#, and is allowed in C++.
>>>
>>> Andrei
>>>
>>>
>>
>> I need some clarification here: By memory barrier, do you mean x86's
>> mfence, sfence, and lfence?
>
> Sorry, I was imprecise. We need to (a) define intrinsics for loading and
> storing data with high-level semantics (a short list: acquire, release,
> acquire+release, and sequentially-consistent) and THEN (b) implement the
> needed code generation appropriately for each architecture. Indeed on
> x86 there is little need to insert fence instructions, BUT there is a
> definite need for the compiler to prevent certain reorderings. That's
> why implementing shared data operations (whether implicit or explicit)
> as sheer library code is NOT possible.

Let's continue this part of the discussion in my other reply (the one explaining how core.atomic is implemented in the various compilers).

>
>> Because as Walter said, inserting those blindly when unnecessary can
>> lead to terrible performance because it practically murders
>> pipelining.
>
> I think at this point we need to develop a better understanding of
> what's going on before issuing assessments.

I dunno. On low-end architectures like ARM, out-of-order execution and weak memory ordering are pretty much what make them usable at all, because they don't have the raw power that x86 does (I even recall an ARM Holdings executive saying that they couldn't possibly switch to a strong memory model without severely reducing the efficiency of ARM chips). So I'm just putting that out there - it's definitely worth taking into consideration, because very few architectures actually have memory ordering as strong as x86's.

>
>
> Andrei

-- 
Alex Rønne Petersen
alex@lycus.org
http://lycus.org
November 14, 2012
On Wednesday, 14 November 2012 at 14:32:34 UTC, Andrei Alexandrescu wrote:
> On 11/14/12 4:23 AM, David Nadlinger wrote:
>> On Wednesday, 14 November 2012 at 00:04:56 UTC, deadalnix wrote:
>>> That is what Java's volatile does. It has several use cases, including
>>> valid double-checked locking (it has to be noted that this idiom is
>>> used incorrectly in druntime ATM, which proves both its usefulness and
>>> that it requires language support) and the disruptor pattern, which I
>>> wanted to implement for message passing in D but couldn't because of
>>> the lack of support at the time.
>>
>> What stops you from using core.atomic.{atomicLoad, atomicStore}? I don't
>> know whether there might be a weird spec loophole which could
>> theoretically lead to them being undefined behavior, but I'm sure that
>> they are guaranteed to produce the right code on all relevant compilers.
>> You can even specify the memory order semantics if you know what you are
>> doing (although this used to trigger a template resolution bug in the
>> frontend, no idea if it works now).
>>
>> David
>
> This is a simplification of what should be going on. The core.atomic.{atomicLoad, atomicStore} functions must be intrinsics so the compiler generates sequentially consistent code with them (i.e. does not perform certain reorderings). Then there are loads and stores with weaker consistency semantics (acquire, release, acquire/release, and consume).

Sorry, I don't quite see where I simplified things. Yes, in the implementation of atomicLoad/atomicStore, one would probably use compiler intrinsics, as done in LDC's druntime, or inline assembly, as done for DMD.

But an optimizer will never move instructions across opaque function calls, because they could have arbitrary side effects. So either we are fine by definition, or, if the compiler inlines the atomicLoad/atomicStore calls (which is actually possible in LDC), its optimizer will detect the presence of the inline assembly or the load/store intrinsics, respectively, and take care not to reorder the instructions in an invalid way.
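To make that concrete, here is a simplified sketch in the spirit of the DMD approach (32-bit x86 for brevity; not the actual druntime code):

void storeSeqCst(shared(int)* p, int value)
{
    asm
    {
        mov EAX, value;
        mov ECX, p;
        xchg [ECX], EAX;  // xchg with a memory operand is implicitly LOCKed,
                          // i.e. a full barrier on x86; and since DMD treats
                          // the asm block as opaque, no code is moved across
                          // it either
    }
}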

I don't see how this makes my answer to deadalnix (that »volatile« is not necessary to implement sequentially consistent loads/stores) any less valid.

David
November 14, 2012
On Wednesday, 14 November 2012 at 14:16:57 UTC, Andrei Alexandrescu wrote:
> On 11/14/12 1:20 AM, Walter Bright wrote:
>> On 11/13/2012 11:37 PM, Jacob Carlborg wrote:
>>> If the compiler should/does not add memory barriers, then is there a
>>> reason for
>>> having it built into the language? Can a library solution be enough?
>>
>> Memory barriers can certainly be added using library functions.
>
> The compiler must understand the semantics of barriers, e.g. that it doesn't hoist code above an acquire barrier or sink code below a release barrier.

Again, this is true, but it would be a fallacy to conclude that compiler-inserted memory barriers for »shared« are required due to this (and it is »shared« we are discussing here!).

Simply having compiler intrinsics for atomic loads/stores is enough, which is hardly »built into the language«.

David
November 14, 2012
On Wednesday, 14 November 2012 at 15:08:35 UTC, Andrei Alexandrescu wrote:
> Sorry, I was imprecise. We need to (a) define intrinsics for loading and storing data with high-level semantics (a short list: acquire, release, acquire+release, and sequentially-consistent) and THEN (b) implement the needed code generation appropriately for each architecture. Indeed on x86 there is little need to insert fence instructions, BUT there is a definite need for the compiler to prevent certain reorderings. That's why implementing shared data operations (whether implicit or explicit) as sheer library code is NOT possible.

Sorry, I didn't see this message of yours before replying (the perils of threaded news readers…).

You are right about the fact that we need some degree of compiler support for atomic instructions. My point was that it is already available; otherwise it would have been impossible to implement core.atomic.{atomicLoad, atomicStore} (for DMD, inline asm is used, which prohibits compiler code motion).

Thus, »we«, meaning on the language level, don't need to change anything about the current situation, with the possible exception of adding finer-grained control to core.atomic.MemoryOrder/msync [1]. It is the duty of the compiler writers to provide the appropriate means to implement druntime on their code generation infrastructure, and indeed the situation in DMD could be improved; using inline asm there is hitting a fly with a sledgehammer.
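A spin lock is one example where the current interface forces stronger ordering than strictly needed (a sketch; note that cas currently takes no memory-order parameter):

import core.atomic;

shared int locked;

void lock()
{
    while (!cas(&locked, 0, 1)) {}  // seq-cst today; acquire would suffice
}

void unlock()
{
    atomicStore!(MemoryOrder.rel)(locked, 0);  // release is all that's needed
}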

David


[1] I am not sure where the point of diminishing returns is here, although it might make sense to provide the same options as C++11. If I remember correctly, D1/Tango supported a lot more levels of synchronization.
November 14, 2012
On Wednesday, 14 November 2012 at 13:19:12 UTC, deadalnix wrote:
> The main drawback with that solution is that the compiler can't optimize thread-local reads/writes independently of shared reads/writes. This is a wasted opportunity.

You mean moving non-atomic loads/stores across atomic instructions? This is simply a matter of the compiler providing the right intrinsics for implementing the core.atomic functions. LDC already does it.
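A trivial sketch of the kind of optimization in question (raw being the relaxed, unordered variant), assuming the optimizer sees through the intrinsics:

import core.atomic;

shared int progress;

void crunch(int[] items)
{
    int acc;  // thread-local: free to live in a register
    foreach (i, item; items)
    {
        acc += item;  // the raw store below imposes no ordering, so this
                      // accumulation never needs to be spilled to memory
        atomicStore!(MemoryOrder.raw)(progress, cast(int) i);
    }
    // ... use acc ...
}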

David