November 14, 2023
On Tuesday, 14 November 2023 at 03:53:09 UTC, Walter Bright wrote:
> There have been multiple requests to add this support. It can all be done with a library implemented with some inline assembler.
>
> Anyone want the glory of implementing this?

Probably. Wasn't it previously implied that core.* is for the internal needs of druntime and Phobos? And the existence of core.stdc.* is a forced solution, since using libc is the easiest way to create an interface between druntime and the OSes.
November 15, 2023
On 15/11/2023 8:00 AM, Walter Bright wrote:
> On 11/14/2023 12:51 AM, Richard (Rikki) Andrew Cattermole wrote:
>> Question: Why do people want another wrapper around some inline assembly that already exists in core.atomic?
> 
> Because they have existing carefully crafted code in C and want to translate it to D.
> 
> 
>> Writing a wrapper around stdatomic.h would take probably 2 hours.
> 
> Great! That saves each stdatomic C user 2 hours who wants to get their code in D.

So what I'm getting at here is that we can already take a best-effort approach by swapping stdatomic for core.atomic, but that does not bring the value people actually want from this.

The codegen must be similar: if the C compiler for a target uses a function call, so can we; if it uses intrinsics with inlining, so must we.

This way the behavior will be similar, and the port can be a success. Otherwise you are introducing new and potentially wrong behavior, which means a failure to fulfill your goals.
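
For reference, a best-effort wrapper might look something like the following minimal sketch; the C11-style names and the exact core.atomic overloads used here are illustrative assumptions, not a finished API:

```d
// Minimal sketch: C11-style names forwarding to core.atomic,
// pinned here to sequentially consistent ordering.
import core.atomic : atomicLoad, atomicStore, cas, MemoryOrder;

T atomic_load(T)(ref shared T obj)
{
    return atomicLoad!(MemoryOrder.seq)(obj);
}

void atomic_store(T)(ref shared T obj, T value)
{
    atomicStore!(MemoryOrder.seq)(obj, value);
}

bool atomic_compare_exchange_strong(T)(shared(T)* obj, T* expected, T desired)
{
    // This cas overload writes the observed value back into
    // *expected on failure, matching the C11 semantics.
    return cas(obj, expected, desired);
}
```

Without guaranteed inlining, each of these is a function call where the C compiler may emit a bare instruction, which is exactly the codegen gap described above.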
November 15, 2023
On 15/11/2023 7:13 AM, Walter Bright wrote:
> Correct multithreaded code is about doing things in the correct sequence, it is not about timing. If synchronization code is dependent on instruction timing, it is inevitably going to fail because too many things affect timing.

You have mostly caught on to the problems here that I have experienced. About the only people who can reliably write lock-free concurrent data structures work on kernels, and yes, kernels do have them. As a subject it is only about 20 years old (30 for some of the key theory), which is very young for data structures.



To quote Andrei about its significance:

If you believe that's a fundamental enough question to award a prize to the answerer, so did others. In 2003, Maurice Herlihy was awarded the Edsger W. Dijkstra Prize in Distributed Computing for his seminal 1991 paper "Wait-Free Synchronization" (see http://www.podc.org/dijkstra/2003.html, which includes a link to the paper, too). In his tour-de-force paper, Herlihy proves which primitives are good and which are bad for building lock-free data structures. That brought some seemingly hot hardware architectures to instant obsolescence, while clarifying what synchronization primitives should be implemented in future hardware.

https://drdobbs.com/lock-free-data-structures/184401865



So it is timing based: you have set points which act as synchronization events, and then everything after them must work exactly the same on each core. This is VERY HARD!

I got grey hair because of dmd using inline assembly, since function calls do not result in "exact" timings after those synchronization points! It was possible with ldc with a lot of work, just not dmd.



I'm not the only one who has gone down this path:

https://github.com/MartinNowak/lock-free

https://github.com/mw66/liblfdsd

https://github.com/nin-jin/go.d



> As a path forward for DMD:
>
>  1. implement core.stdc.stdatomic in terms of core.atomic and/or
>     core.internal.atomic
>
>  2. eventually add intrinsics to dmd to replace them

Almost. Everything in core.stdc.stdatomic should have codegen similar to the C compiler's; if it doesn't, that is a bug. This is what gives it the desired value: the guarantee that the two will line up.

So my proposal is almost the same but applies to all compilers:

1. Implement functions that wrap core.internal.atomic, iff those function implementations are intrinsics, are equivalent to an intrinsic, or use locking.
2. Implement intrinsics in core.internal.atomic and then implement the corresponding wrapper functions for stdatomic.
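
As a rough illustration of step 1 (a sketch only; it assumes druntime's current core.internal.atomic.atomicFence, and the wrapper name mirrors C11 purely for illustration):

```d
// Sketch: a stdatomic-style wrapper that is forced inline, so the
// codegen can match the bare fence a C compiler emits for
// atomic_thread_fence instead of adding a call.
import core.atomic : MemoryOrder;
import core.internal.atomic : atomicFence;

pragma(inline, true)
void atomic_thread_fence(MemoryOrder order = MemoryOrder.seq)()
{
    atomicFence!order();
}
```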
November 14, 2023
On 11/14/2023 7:49 PM, Richard (Rikki) Andrew Cattermole wrote:
> So it is timing based: you have set points which act as synchronization events, and then everything after them must work exactly the same on each core. This is VERY HARD!

Everything I've read about writing correct synchronization says it's not about timing, it's about sequencing.

For example,
https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691

or maybe you and I are just misunderstanding terms.

For example, fences. Fences enforce memory-ordering constraints, not timing. Happens-before and synchronizes-with are sequencing, not timing.
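
A small D sketch of that distinction, assuming core.atomic's release/acquire orderings (illustrative only):

```d
// Sequencing, not timing: the release store synchronizes-with the
// acquire load, so once `ready` reads true, the write to `data`
// happens-before the assert and is guaranteed visible.
import core.atomic : atomicLoad, atomicStore, MemoryOrder;

shared int data;
shared bool ready;

void producer()
{
    atomicStore!(MemoryOrder.raw)(data, 42);
    atomicStore!(MemoryOrder.rel)(ready, true); // release
}

void consumer()
{
    while (!atomicLoad!(MemoryOrder.acq)(ready)) {} // acquire
    assert(atomicLoad!(MemoryOrder.raw)(data) == 42); // always passes
}
```

No matter how long either thread is delayed, the guarantee holds; only the order of the operations matters.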
November 14, 2023
On 11/14/2023 7:07 PM, Richard (Rikki) Andrew Cattermole wrote:
> This way the behavior will be similar, and the port can be a success. Otherwise you are introducing new and potentially wrong behavior, which means a failure to fulfill your goals.

I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.
November 15, 2023
On 15/11/2023 7:32 PM, Walter Bright wrote:
> On 11/14/2023 7:49 PM, Richard (Rikki) Andrew Cattermole wrote:
>> So it is timing based: you have set points which act as synchronization events, and then everything after them must work exactly the same on each core. This is VERY HARD!
> 
> Everything I've read about writing correct synchronization says it's not about timing, it's about sequencing.
> 
> For example,
> https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691
> 
> or maybe you and I are just misunderstanding terms.
> 
> For example, fences. Fences enforce memory-ordering constraints, not timing. Happens-before and synchronizes-with are sequencing, not timing.

You have understood the simplified parts of the problem.

The problem is when concurrency is in action: multiple cores operating on the same memory in the same time units. You reach a shifting-sands feeling where multiple facts can be true at the same time on different cores, and they can be completely contradictory. Memory can be mapped on one core and not on another.

Did I mention I have grey hair because of this?

That experience might be why I have strong opinions about D's foundations, such as symbols ;)

Also:

https://github.com/dlang/dmd/pull/15816
November 15, 2023
On 15/11/2023 7:33 PM, Walter Bright wrote:
> I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.

Yes, for a memory barrier it ideally wouldn't matter how it is executed.

But it does matter for load/cas: with deep call chains in use, what was true when the operation executed may no longer be true by the time the next operation is performed.
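
To make the load/cas concern concrete, here is a sketch of the usual retry loop (using core.atomic's public API; the details are illustrative):

```d
// The value read by atomicLoad may be stale by the time cas runs;
// cas re-validates it, and the loop retries on failure.
import core.atomic : atomicLoad, cas, MemoryOrder;

void increment(shared(int)* counter)
{
    for (;;)
    {
        int expected = atomicLoad!(MemoryOrder.raw)(*counter);
        // Another core may change *counter between the load above and
        // the cas below; a non-inlined call widens that window and
        // changes how often the retry path is taken.
        if (cas(counter, expected, expected + 1))
            return;
    }
}
```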
November 15, 2023
On 11/14/2023 10:59 PM, Richard (Rikki) Andrew Cattermole wrote:
> On 15/11/2023 7:33 PM, Walter Bright wrote:
>> I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.
> 
> Yes, for a memory barrier it ideally wouldn't matter how it is executed.
> 
> But it does matter for load/cas: with deep call chains in use, what was true when the operation executed may no longer be true by the time the next operation is performed.

Is this only a problem with load/cas?
November 15, 2023
On 15/11/2023 10:33 PM, Walter Bright wrote:
> Is this only a problem with load/cas?

A store by itself should be ok.

If you do any loading, as you do with cas or atomicOp, then the timings might not be in any way predictable.

But the general rule of thumb for this is: if an atomic operation is by itself for a given block of memory, then you can ignore timings. If it is used in conjunction with another atomic instruction (or more), you have to consider whether timings can mess it up.
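
A sketch of what "in conjunction" means in practice (the helper names are hypothetical, built on core.atomic):

```d
import core.atomic : atomicLoad, atomicStore, cas;

shared int slot;

// Broken: the load and the store are two separate atomic steps, so
// another core can claim the slot in the window between them.
bool claimNaive()
{
    if (atomicLoad(slot) == 0)
    {
        atomicStore(slot, 1);
        return true;
    }
    return false;
}

// Correct: one atomic step, so there is no window to worry about.
bool claimCorrect()
{
    int expected = 0;
    return cas(&slot, expected, 1);
}
```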

But wrt. load/cas: they are the two big primitives used very heavily in the literature, so they need to be prioritized over the others when implementing intrinsics. From my own experience, they will also be used very heavily in any such data structure function.


A random fun fact: one of the authors of the book I recommend on the subject, "The Art of Multiprocessor Programming", teaches at the same university as Roy!

https://shop.elsevier.com/books/the-art-of-multiprocessor-programming/herlihy/978-0-12-415950-1
November 15, 2023
On Wednesday, 15 November 2023 at 06:59:45 UTC, Richard (Rikki) Andrew Cattermole wrote:
> On 15/11/2023 7:33 PM, Walter Bright wrote:
>> I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.
>
> Yes, for a memory barrier it ideally wouldn't matter how it is executed.
>
> But it does matter for load/cas: with deep call chains in use, what was true when the operation executed may no longer be true by the time the next operation is performed.

That doesn't make any sense. The whole point of CAS is that it is atomic; immediately after it has completed you have no guarantees anyway, so what difference does it make if it's wrapped in a function call?