January 03
On 1/2/2025 4:17 PM, Guillaume Piolat via Digitalmars-d wrote:
> On Thursday, 2 January 2025 at 19:59:26 UTC, Jin wrote:
>> If someone can tell me what could be wrong here, I would be very grateful.
> 
> https://github.com/nin-jin/go.d/blob/master/source/jin/go/queue.d#L67
> 
> A concurrent `put` and `popFront` will do nothing to avoid races in `Queue`.
> Individual access to _offset is atomic but its point of use in both put and popFront is not.
> 
> Both functions look like this:
> 
>     1. Atomic-read-offset
>     2. Anything can happen
>     3. Atomic-write-offset
> 
> If one function has completed 1) and then the other completes 1+2+3, you get a race.
> Chaining two atomic operations doesn't make the whole thing atomic; this is a classic mistake with mutual exclusion.

Probably the most common atomicity error is an atomic check followed by a non-atomic action based on that check.  The problem overlooked in that scenario is that while the check itself is safe, by the time the action proceeds the condition can have changed.  The only safe way is to combine the check and the action into a single atomic transaction.

The next most common is probably the ABA problem: assuming that because you see A the second time, what you're seeing is still the first A, which would imply that B couldn't and didn't happen.

The lesson here is that you're far better off NOT being clever and trying to avoid longer-lived locks, unless you can demonstrate that it's particularly important and detrimental to the app's performance.  It's super easy to get separate tiny atomic operations wrong, and very hard to detect and debug that, compared to getting it right at a slightly higher cost with multi-instruction simple locking.

Not to mention that CPU and OS designers have invested considerable energy in improving the performance of mutex- and spinlock-style code.

There's a time and place for cleverness, but not at the expense of correctness.

My 2 cents,
Brad

January 04

On Friday, 3 January 2025 at 00:17:27 UTC, Guillaume Piolat wrote:
> On Thursday, 2 January 2025 at 19:59:26 UTC, Jin wrote:
>> If someone can tell me what could be wrong here, I would be very grateful.
>
> https://github.com/nin-jin/go.d/blob/master/source/jin/go/queue.d#L67
>
> A concurrent put and popFront will do nothing to avoid races in Queue.
> Individual access to _offset is atomic but its point of use in both put and popFront is not.
>
> Both functions look like this:
>
>   1. Atomic-read-offset
>   2. Anything can happen
>   3. Atomic-write-offset
>
> If one function has completed 1) and then the other completes 1+2+3, you get a race.
> Chaining two atomic operations doesn't make the whole thing atomic; this is a classic mistake with mutual exclusion.

It's a single-producer, single-consumer queue, so only one thread pushes and one thread pulls. So "2. Anything can happen" doesn't apply, since there's only one thread actually pushing to the queue. The consumer thread only reads producer.offset, so it can't interfere with pushes, and it only sees a pushed message once producer.offset has been updated.

January 04

On Thursday, 2 January 2025 at 19:59:26 UTC, Jin wrote:
> received 100000000 messages in 2906 msec sum=4999999950000000
>
> I use the atomic acquire and release operations, although they are not required on x86, but I hope the compiler takes them into account and does not reorder instructions. But even with stricter memory barriers, I don't get a very stable result. If someone can tell me what could be wrong here, I would be very grateful.

Have you considered not wrapping the offsets, and only taking the modulus with the length when you index into the messages? It would simplify the math, i.e.:

messagesInQueue = producer.offset - consumer.offset

available = Length - messagesInQueue - 1

January 04

On Thursday, 2 January 2025 at 19:59:26 UTC, Jin wrote:
> received 100000000 messages in 2906 msec sum=4999999950000000 speed=34411 msg/msec
> so, it's ~2.7x faster than Java:
> https://github.com/mingwugmail/liblfdsd/tree/master/comparison
> And your https://github.com/nin-jin/go.d on my machine
> go.d is 2~4 times slower than Go.
> 9.7638ms (Go) vs. [19 ms ~ 40 ms] (go.d)

Hello everyone, I've done a little refactoring and optimization of jin.go:

  • I got rid of the vibe.d dependency because it's slow, big, and I couldn't get it to work with version 2. When running only 1000 vibe fibers, not only did the application crash, but the graphics driver even crashed once, which required restarting the laptop.
  • For now, I've settled on native threads with a small stack size (4 KB).
  • I'm really looking forward to photon stabilizing so I can get fiber support back. It would be really awesome to see it in the standard library.
  • I had to abandon move semantics because I couldn't get them to work with delegates. Currently, the number of references to the queue is tracked by the copy constructor.

Good news! After all the optimizations, the channels show impressive speed in the above benchmark for pumping messages between two threads.

Since you are essentially testing the speed of your SPSC ring queue, I would just benchmark the queue directly. I don't think there is much room for improvement unless you adopt the disruptor design and switch to batch consumption. That said, I do see some redundant offset loads and stores in the code that the optimiser might miss.

Testing just the queue would also simplify things and reduce the surface area for finding the error.

January 04
While having a CSP-style mechanism, with cheap stackful tasks/procs/coroutines and channels, has merit, pursuing a performance test with 1000 (or more) trivial instances of such tasks seems misguided.

In the last couple of years, I used Go to design and implement a subsystem as a process using CSP style and many goroutines.  Even there, I only had on the order of 20-40 goroutines present at once, plus possibly around 4-8 additional goroutines for each simultaneous client service I was performing.

Now, it may have been theoretically possible to cause that to spawn on the order of 64k goroutines if stressed the "correct" way in an absurd deployment. That would be improbable; a max of around 500 would be more likely, and even there performance was gated on Internet RTTs.

I'll have to re-read the code next week to get the actual number, but I expect it would be on the order of 30 or so.
January 06

On Thursday, 2 January 2025 at 19:59:26 UTC, Jin wrote:
> Bad news. Sometimes I get incorrect results and I can't figure out why.
> I use the atomic acquire and release operations, although they are not required on x86, but I hope the compiler takes them into account and does not reorder instructions. But even with stricter memory barriers, I don't get a very stable result. If someone can tell me what could be wrong here, I would be very grateful.

I found a bug. I first checked if there was anything in the queue, and if not, I checked if the queue was finalized. Sometimes the queue would fill up between these two steps and I would lose data. I moved the finalization check to the beginning and now everything is stable.

January 08

On Monday, 6 January 2025 at 10:15:39 UTC, Jin wrote:
> I found a bug. I first checked if there was anything in the queue, and if not, I checked if the queue was finalized. Sometimes the queue would fill up between these two steps and I would lose data. I moved the finalization check to the beginning and now everything is stable.

I looked into the core.atomic code and was very disappointed. I expected to see memory barriers (necessary for wait-free), but I saw a CAS spinlock there (and this is lock-free):

```d
asm pure nothrow @nogc @trusted
{
      naked;
      push RBX;
      mov R8, RDX;
      mov RAX, [RDX];
      mov RDX, 8[RDX];
      mov RBX, [RCX];
      mov RCX, 8[RCX];
  L1: lock; cmpxchg16b [R8];
      jne L1;
      pop RBX;
      ret;
}
```
January 08

On Wednesday, 8 January 2025 at 10:10:55 UTC, Jin wrote:
> On Monday, 6 January 2025 at 10:15:39 UTC, Jin wrote:
>> I found a bug. I first checked if there was anything in the queue, and if not, I checked if the queue was finalized. Sometimes the queue would fill up between these two steps and I would lose data. I moved the finalization check to the beginning and now everything is stable.
>
> I looked into the core.atomic code and was very disappointed. I expected to see memory barriers (necessary for wait-free), but I saw a CAS spinlock there (and this is lock-free):

Technically it is lock-free but not wait-free. No thread can actually hold it like a regular lock; the CAS either succeeds or fails in one go, so you are always guaranteed that at least one thread will make progress.

January 08

On Wednesday, 8 January 2025 at 10:16:58 UTC, claptrap wrote:
> Technically it is lock free but not wait free. No thread can actually hold it like a regular lock, the CAS either succeeds or fails in one go, so you are always guaranteed at least one thread will progress.

For wait-freedom in general, and for cyclic buffers in particular, it is important that operations are ordered (first the buffer operations, then the offset update). CAS does not do anything useful in this case, since it always succeeds (only one thread writes to the offset), but the other operations can still be reordered.

January 09
On 08/01/2025 11:10 PM, Jin wrote:
> On Monday, 6 January 2025 at 10:15:39 UTC, Jin wrote:
>> I found a bug. I first checked if there was anything in the queue, and if not, I checked if the queue was finalized. Sometimes the queue would fill up between these two steps and I would lose data. I moved the finalization check to the beginning and now everything is stable.
> 
> I looked into the core.atomic code and was very disappointed. I expected to see memory barriers (necessary for wait-free), but I saw a CAS spinlock there (and this is lock-free):
> 
> ```d
> asm pure nothrow @nogc @trusted
> {
>        naked;
>        push RBX;
>        mov R8, RDX;
>        mov RAX, [RDX];
>        mov RDX, 8[RDX];
>        mov RBX, [RCX];
>        mov RCX, 8[RCX];
>    L1: lock; cmpxchg16b [R8];
>        jne L1;
>        pop RBX;
>        ret;
> }
> ```

DMD will not inline functions containing inline assembly, and any function call should prevent reordering on the CPU side.

So any ordering concern you have shouldn't matter for DMD; it's LDC and GDC that you need to worry about.