Oh, my GoD! Goroutines on D (page 5)

Settings

Help

Index » General » Oh, my GoD! Goroutines on D (page 5)

June 14, 2020

Re: Oh, my GoD! Goroutines on D

Posted by mw
in reply to Jin

Permalink

Posted in reply to Jin

Permalink

On Monday, 25 May 2020 at 16:41:06 UTC, Jin wrote:
>> FYI: channels are also part of vibe-core since a while:
>>
>> https://github.com/vibe-d/vibe-core/blob/master/source/vibe/core/channel.d
>
> Yes. But it uses mutex. My implementation is wait-free (https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom). All threads can easy and fast write and read without any locking. So every queue is 1-provider-1-consumer. But Input and Output channels is roundrobin list of queues. You can found some diagrams there: https://github.com/nin-jin/go.d/blob/master/readme.drawio.svg


Just saw this, for 1-provider-1-consumer queue, I did some experiments with

https://code.dlang.org/packages/lock-free

and the result is here:

https://github.com/mingwugmail/liblfdsd/tree/master/comparison

LDC is ~5x times faster than DMD, I'm not so sure why; and this package may need more stress testing:

------------------------------------------------------------------------
$ make ldcrun
...
received 1000000000 messages in 9845 msec sum=499999999500000000 speed=101574 msg/msec


$ make dmdrun
...
received 1000000000 messages in 53607 msec sum=499999999500000000 speed=18654 msg/msec


$ make javamp
10000000 messages received in 1151.0 ms, sum=49999995000000 speed: 0.1151 microsec/message, 8688.097306689835 messages/msec

$ make dmp
received 100000000 messages in 27574 msec sum=4999999950000000 speed=3626 msg/msec
------------------------------------------------------------------------

June 14, 2020

Re: Oh, my GoD! Goroutines on D

Posted by Jin
in reply to mw

Permalink

Jin

Posted in reply to mw

Permalink

On Sunday, 14 June 2020 at 19:49:46 UTC, mw wrote:
> On Sunday, 14 June 2020 at 17:10:14 UTC, mw wrote:
>> Have you tried lock-free queue?
>>
>> https://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_(unbounded,_many_producer,_many_consumer)
>>
>> Java uses the same algorithm for ConcurrentLinkedQueue (in C implementation).
>>
>> I tried some small examples with liblfds, got slightly better performance than Java. Maybe we don’t want to reinvent the wheels, esp the well tested ones.
>
> You can try it here:
>
> https://github.com/mingwugmail/liblfdsd
>
> only https://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_(bounded,_many_producer,_many_consumer) for now.
>
> ```
> received 100000000 messages in 4632 msec sum=4999999950000000 speed=21588 msg/msec
> ```

My wheels are quite simple: https://github.com/nin-jin/go.d/blob/master/source/jin/go/queue.d

I have written your bench (https://github.com/mingwugmail/liblfdsd/blob/master/liblfds.dpp#L67) with go.d:

```
const int n = 100_000_000;

void threadProducer(Output!int queue)
{
  foreach (int i; 0..n) {
	queue.put(i);
  }
}

void main()
{
	Input!int queue;
	go!threadProducer(queue.pair);

	StopWatch sw;
	sw.start();
	long sum = 0;

	foreach (p; queue)
	{
		sum += p;
	}

	sw.stop();

	writefln("received %d messages in %d msec sum=%d speed=%d msg/msec", n,
			sw.peek.total!"msecs", sum, n / sw.peek.total!"msecs");
	
	assert(sum == (n * (n - 1) / 2));
}
```

The code is simpler. On my laptop it gives:

```
> dub --quiet --build=release
received 100000000 messages in 10011 msec sum=4999999950000000 speed=9989 msg/msec
```

June 14, 2020

Re: Oh, my GoD! Goroutines on D

Posted by mw
in reply to Jin

Permalink

Posted in reply to Jin

Permalink

On Sunday, 14 June 2020 at 22:57:25 UTC, Jin wrote:
>> https://github.com/mingwugmail/liblfdsd
>> speed=21588 msg/msec
...
> My wheels are quite simple: https://github.com/nin-jin/go.d/blob/master/source/jin/go/queue.d
>
...
> speed=9989 msg/msec

Can you make a new go.d release to https://github.com/nin-jin/go.d/releases, or create a PR  using your jin.go.queue to https://github.com/mingwugmail/liblfdsd/tree/master/comparison

So either you or me can run the comparison on the same machine?

June 15, 2020

Re: Oh, my GoD! Goroutines on D

Posted by mw
in reply to Jin

Permalink

Posted in reply to Jin

Permalink

On Sunday, 14 June 2020 at 22:57:25 UTC, Jin wrote:
> My wheels are quite simple: https://github.com/nin-jin/go.d/blob/master/source/jin/go/queue.d
>
...
> received 100000000 messages in 10011 msec sum=4999999950000000 speed=9989 msg/msec

Since you are using 1-provider-1-consumer queue, can you try this package as a drop-in replacement in go.d:

https://github.com/MartinNowak/lock-free/blob/master/src/lock_free/rwqueue.d

(you may need to add align() to that implementation).

according to my test:

received 1000000000 messages in 9845 msec sum=499999999500000000 speed=101574 msg/msec

you may get ~10x boost.

June 15, 2020

Re: Oh, my GoD! Goroutines on D

Posted by Jin
in reply to mw

Permalink

Jin

Posted in reply to mw

Permalink

On Monday, 15 June 2020 at 01:55:27 UTC, mw wrote:
> On Sunday, 14 June 2020 at 22:57:25 UTC, Jin wrote:
>> My wheels are quite simple: https://github.com/nin-jin/go.d/blob/master/source/jin/go/queue.d
>>
> ...
>> received 100000000 messages in 10011 msec sum=4999999950000000 speed=9989 msg/msec
>
> Since you are using 1-provider-1-consumer queue, can you try this package as a drop-in replacement in go.d:
>
> https://github.com/MartinNowak/lock-free/blob/master/src/lock_free/rwqueue.d
>
> (you may need to add align() to that implementation).
>
> according to my test:
>
> received 1000000000 messages in 9845 msec sum=499999999500000000 speed=101574 msg/msec
>
> you may get ~10x boost.

I have added a new release and a PR to your repo.

But I don't think it's a good idea to replace jin.go.queue by lock_free.rwqueue because:

1. My API is std.range compatible.
2. I use the same API for queue and channels.
3. My API supports finalization (by provider or consumer or both).
4. Your queue is fixed size. But my channels operate with set of queues that can have different sizes.

But I would steal some optimization:

1. Power of 2 capacity and bitop instead of division. It can add 10% performance.
2. The cache line size should be a minimum size of the message to prevent false sharing.

You are using atomicLoad!(MemoryOrder.acq) at there: https://github.com/MartinNowak/lock-free/blob/master/src/lock_free/rwqueue.d#L41
Is this really required? CPU can't reorder dependent statements.

June 15, 2020

Re: Oh, my GoD! Goroutines on D

Posted by mw
in reply to Jin

Permalink

Posted in reply to Jin

Permalink

On Monday, 15 June 2020 at 09:09:23 UTC, Jin wrote:
> I have added a new release and a PR to your repo.

on my same machine, go.d:

received 100000000 messages in 2906 msec sum=4999999950000000 speed=34411 msg/msec

so, it's ~2.7x faster than Java:

https://github.com/mingwugmail/liblfdsd/tree/master/comparison

And your https://github.com/nin-jin/go.d on my machine

go.d is 2~4 times slower than Go.

9.7638ms (Go) v.s [19 ms ~ 40 ms] (go.d)

Go's speed is consistent in multiple runs, for go.d it can be 2x difference, maybe because of the scheduler is unstable?


> But I don't think it's a good idea to replace jin.go.queue by lock_free.rwqueue because:

I just want to do a comparison.

> You are using atomicLoad!(MemoryOrder.acq) at there: https://github.com/MartinNowak/lock-free/blob/master/src/lock_free/rwqueue.d#L41
> Is this really required? CPU can't reorder dependent statements.

It's not mine, but @MartinNowak's.  The implementation is based on

https://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular

June 16, 2020

Re: Oh, my GoD! Goroutines on D

Posted by Petar Kirov [ZombineDev]
in reply to mw

Permalink

Petar Kirov [ZombineDev]

Posted in reply to mw

Permalink

On Monday, 15 June 2020 at 22:04:49 UTC, mw wrote:
> On Monday, 15 June 2020 at 09:09:23 UTC, Jin wrote:
>> I have added a new release and a PR to your repo.
>
> on my same machine, go.d:
>
> received 100000000 messages in 2906 msec sum=4999999950000000 speed=34411 msg/msec
>
> so, it's ~2.7x faster than Java:
>
> https://github.com/mingwugmail/liblfdsd/tree/master/comparison
>
> And your https://github.com/nin-jin/go.d on my machine
>
> go.d is 2~4 times slower than Go.
>
> 9.7638ms (Go) v.s [19 ms ~ 40 ms] (go.d)
>
> Go's speed is consistent in multiple runs, for go.d it can be 2x difference, maybe because of the scheduler is unstable?
>
>
>> But I don't think it's a good idea to replace jin.go.queue by lock_free.rwqueue because:
>
> I just want to do a comparison.
>
>> You are using atomicLoad!(MemoryOrder.acq) at there: https://github.com/MartinNowak/lock-free/blob/master/src/lock_free/rwqueue.d#L41
>> Is this really required? CPU can't reorder dependent statements.
>
> It's not mine, but @MartinNowak's.  The implementation is based on
>
> https://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular

There's a SPMC/MPSC queue implementation here: https://github.com/weka-io/mecca/blob/master/src/mecca/containers/otm_queue.d that may be also interesting for you guys to check.

January 02

Re: Oh, my GoD! Goroutines on D

Posted by Jin
in reply to Petar Kirov [ZombineDev]

Permalink

Jin

Posted in reply to Petar Kirov [ZombineDev]

Permalink

> >

received 100000000 messages in 2906 msec sum=4999999950000000 speed=34411 msg/msec
so, it's ~2.7x faster than Java:
https://github.com/mingwugmail/liblfdsd/tree/master/comparison
And your https://github.com/nin-jin/go.d on my machine
go.d is 2~4 times slower than Go.
9.7638ms (Go) v.s [19 ms ~ 40 ms] (go.d)

Hello everyone, I've done a little refactoring and optimization of jin.go:

I got rid of the vibe.d dependency because it's slow, big, and I haven't been able to make friends with version 2. When running only 1000 vibe-fibers, not only did the application crash, but even the graphics system driver crashed once, which required restarting the laptop.
So far, I've settled on native streams with a small stack size (4 kb).
I'm really looking forward to photon's stabilization to get fiber support back. It would be really awesome to see it in the standard library.
I had to abandon the move semantics because I couldn't make friends with the delegates. Currently, the number of references to the queue is controlled by the copy constructor.

Good news! After all the optimizations, the channels show impressive speed in the above benchmark for pumping messages between two streams.

import std.datetime.stopwatch;
import std.range;
import std.stdio;

import jin.go;

const long n = 100_000_000;

    auto threadProducer() {
    	return n.iota;
    }

    void main() {
    	auto queue = go!threadProducer;

    	StopWatch sw;
    	sw.start();
    	long sum = 0;

    	foreach (p; queue) {
    		sum += p;
    	}

    	sw.stop();

    	writefln("received %d messages in %d msec sum=%d speed=%d msg/msec", n,
    		sw.peek.total!"msecs", sum, n / sw.peek.total!"msecs");

    	assert(sum == (n * (n - 1) / 2));
    }
    ```

```sh
received 100000000 messages in 718 msec sum=4999999950000000 speed=139275 msg/msec

I've almost caught up with Go in my goroutines benchmark:

> go run app.go --release
Workers Result          Time
8       49995000000     109.7644ms

> dub --quiet --build=release
Workers Result          Time
0       49995000000     124 ms

Bad news. Sometimes I get incorrect results and I can't figure out why.

> dub --quiet --build=release
Workers Result          Time
0       49945005000     176 ms

I use the atomic acquire and release operations, although they are not required on x86, but I hope the compiler takes them into account and does not reorder instructions. But even with stricter memory barriers, I don't get a very stable result. If someone can tell me what could be wrong here, I would be very grateful.

January 03

Re: Oh, my GoD! Goroutines on D

Posted by Guillaume Piolat
in reply to Jin

Permalink

Guillaume Piolat

Posted in reply to Jin

Permalink

On Thursday, 2 January 2025 at 19:59:26 UTC, Jin wrote:

If someone can tell me what could be wrong here, I would be very grateful.

https://github.com/nin-jin/go.d/blob/master/source/jin/go/queue.d#L67

A concurrent put and popFront will do nothing to avoid races in Queue.
Individual access to _offset is atomic but its point of use in both put and popFront is not.

Both functions look like this:

Atomic-read-offset
Anything can happen
Atomic-write-offset

If one function has completed 1) then other has completed 1+2+3 then you get a race.
Chaining two atomic things doesn't make the whole atomic, rather classic mistake with mutual exclusion.

January 03

Re: Oh, my GoD! Goroutines on D

Posted by Jin
in reply to Guillaume Piolat

Permalink

Jin

Posted in reply to Guillaume Piolat

Permalink

On Friday, 3 January 2025 at 00:17:27 UTC, Guillaume Piolat wrote:

On Thursday, 2 January 2025 at 19:59:26 UTC, Jin wrote:

If someone can tell me what could be wrong here, I would be very grateful.

https://github.com/nin-jin/go.d/blob/master/source/jin/go/queue.d#L67

A concurrent put and popFront will do nothing to avoid races in Queue.
Individual access to _offset is atomic but its point of use in both put and popFront is not.

Both functions look like this:

Atomic-read-offset
Anything can happen
Atomic-write-offset

If one function has completed 1) then other has completed 1+2+3 then you get a race.
Chaining two atomic things doesn't make the whole atomic, rather classic mistake with mutual exclusion.

I didn't really understand what you were talking about. After checking pending/available, we are guaranteed to have the opportunity to take the next step on consumer/provider side. Therefore, we are doing our job and then increasing our offset.

Top | Forum index | About this forum

Forums