NGINX Unit and vibe.d Integration Performance

October 28

Hi there,
I'm looking for help with the performance of an integration I'm trying to write between NGINX Unit and D. Here are two minimal demos I've put together:

The first integration achieves ~43k requests per second on my computer. That matches what I've been able to achieve with a minimal vibe.d project and is, I believe, the maximum my benchmark configuration on macOS can hit.

The second, though, only achieves ~20k requests per second. In that demo I try to make vibe.d's concurrency system available during request handling: NGINX Unit's event loop runs in its own thread, and when requests arrive, Unit sends them to the main thread for handling on vibe.d's event loop. I've tried a few methods to increase performance, but none have been successful:

  • Batching new-request messages to minimize messaging overhead. This increased latency without improving throughput.
  • Using vibe.d channels to pass requests. This achieved the same performance as message passing (a rough sketch of this variant follows the list). I wasn't able to use the channel configuration that prioritizes minimizing overhead because its API didn't fit my use case.
  • Using a lock-free queue (https://github.com/MartinNowak/lock-free) between threads with a loop in the vibe.d thread that constantly polled for requests. This method achieves ~43k requests per second but results in atrocious CPU usage.
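
The channel-based variant was shaped roughly like this (a simplified sketch, not the demo's actual code; UnitRequest and the handler body are placeholders):

import vibe.core.channel : createChannel;
import vibe.core.core : runApplication, runTask;

struct UnitRequest { /* per-request data handed over by Unit */ }

void main()
{
  auto requests = createChannel!UnitRequest();

  // The Unit thread pushes each incoming request into the channel:
  //   requests.put(req);

  // On the vibe.d side a single task drains the channel and fans out one
  // task per request. tryConsumeOne blocks the calling fiber (not the
  // thread) until an element arrives, and returns false once the channel
  // has been closed and drained.
  runTask({
    UnitRequest req;
    while (requests.tryConsumeOne(req))
      runTask((UnitRequest r) { /* handle the request */ }, req);
  });

  runApplication();
}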

~20k requests per second seems to be the best I can hit with everything I've tried. I know vibe.d can do better, so I'm thinking there's something I'm missing. In profiling I can see that the vibe.d thread spends a third of its time in what seems to be event loop management code. Am I seeing the effects of Unit's and vibe.d's loops being 'out-of-sync', i.e. some slack time between a message being sent and it being acted upon? Is there a better way to integrate NGINX Unit with vibe.d?

October 28

Let's take a moment to appreciate how easy it was for you to use nginx unit from D

https://github.com/kyleingraham/unit-d-hello-world/blob/main/source/unit_integration.c

ImportC is great

October 28

On Monday, 28 October 2024 at 05:56:32 UTC, ryuukk_ wrote:

> ImportC is great

It really is. Most of my setup time went to getting include and linking flags working, which is exactly what you'd run into using C from C.
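
For anyone who hasn't tried it, the basic workflow is: write a .c file that includes what you need, hand it to the D compiler alongside your D sources, and import it like a module. A minimal sketch (the file names, the add function, and the exact flags are illustrative):

// unit_shim.c: stands in for the real wrapper around Unit's libunit API
int add(int a, int b) { return a + b; }

// app.d
import unit_shim; // ImportC: the compiler translates the .c file directly

void main()
{
  assert(add(40, 2) == 42);
}

// Build both together; -P forwards include flags to the C preprocessor
// and -L forwards flags to the linker, e.g.:
//   dmd -P=-I/path/to/unit/include app.d unit_shim.c -L-lunit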

October 28

On Monday, 28 October 2024 at 01:06:58 UTC, Kyle Ingraham wrote:

> ...
>
> The second, though, only achieves ~20k requests per second. In that demo I try to make vibe.d's concurrency system available during request handling: NGINX Unit's event loop runs in its own thread, and when requests arrive, Unit sends them to the main thread for handling on vibe.d's event loop. I've tried a few methods to increase performance...

Apparently, vibe.d's event loop is not fully compatible with NGINX Unit's loop, causing performance loss. I wonder whether it would be wise to use something like an IntrusiveQueue or a task pool to bridge them? For example, something like this:

alias IQ = IntrusiveQueue;

// A fixed-capacity ring buffer intended for one producer thread and one
// consumer thread (SPSC). It holds at most capacity - 1 items, because
// head == tail is reserved to mean "empty".
struct IntrusiveQueue(T)
{
  import core.atomic;

  private {
    T[] buffer;
    size_t head, tail;
    alias acq = MemoryOrder.acq;
    alias rel = MemoryOrder.rel;
  }

  size_t capacity;
  this(size_t capacity) {
    this.capacity = capacity;
    buffer.length = capacity;
  }

  alias push = enqueue;
  // Producer side. Returns false when the queue is full.
  bool enqueue(T item) {
    auto currTail = tail.atomicLoad!acq;
    auto nextTail = (currTail + 1) % capacity;

    if (nextTail == head.atomicLoad!acq)
      return false; // full

    buffer[currTail] = item;
    atomicStore!rel(tail, nextTail); // publish the item to the consumer

    return true;
  }

  alias fetch = dequeue;
  // Consumer side. Returns false when the queue is empty.
  bool dequeue(ref T item) {
    auto currHead = head.atomicLoad!acq;

    if (currHead == tail.atomicLoad!acq)
      return false; // empty

    auto nextHead = (currHead + 1) % capacity;
    item = buffer[currHead];
    atomicStore!rel(head, nextHead); // free the slot for the producer

    return true;
  }
}

unittest
{
  enum start = 41;
  auto queue = IQ!int(10);
  queue.push(start);
  queue.push(start + 1);

  int item;
  assert(queue.fetch(item) && item == start);
  assert(queue.fetch(item) && item == start + 1);
  assert(!queue.fetch(item)); // empty again
}

SDB@79

October 28

On Monday, 28 October 2024 at 18:37:18 UTC, Salih Dincer wrote:

> Apparently, vibe.d's event loop is not fully compatible with NGINX Unit's loop, causing performance loss. I wonder whether it would be wise to use something like an IntrusiveQueue or a task pool to bridge them? For example, something like this:
> ...

You are right that they aren't compatible. Running them in the same thread was a no-go (which makes sense given they both want to control when code is run).

How would you suggest reading from the queue you provided in the vibe.d thread? I tried something similar with the lock-free package. Pushing into the queue efficiently from Unit's thread was easy, but popping from it in vibe.d's thread was difficult:

  • Polling too infrequently killed performance; polling too often wrecked CPU usage (see the sketch after this list).
  • Using message passing reduced performance quite a bit.
  • Batching reads was hard because it was tricky to balance performance for single requests against performance for streams of them.
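
The polling consumer looked roughly like this (a simplified sketch, not the demo code; queue is the shared lock-free queue and UnitRequest is a placeholder):

import core.time : usecs;
import vibe.core.core : runTask, sleep;

__gshared bool running = true;

void pollLoop() // spawned on the vibe.d thread via runTask(&pollLoop)
{
  UnitRequest req;
  while (running) {
    // Drain everything the Unit thread has queued so far.
    while (queue.dequeue(req))
      runTask((UnitRequest r) { /* handle the request */ }, req);

    // The tuning knob: sleep longer and throughput suffers; sleep less
    // and the CPU spins on an empty queue.
    sleep(10.usecs);
  }
}
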
October 28

On Monday, 28 October 2024 at 19:57:41 UTC, Kyle Ingraham wrote:

> • Polling too infrequently killed performance; polling too often wrecked CPU usage.
> • Using message passing reduced performance quite a bit.
> • Batching reads was hard because it was tricky to balance performance for single requests against performance for streams of them.

Semaphore?

https://demirten-gitbooks-io.translate.goog/linux-sistem-programlama/content/semaphore/operations.html?_x_tr_sl=tr&_x_tr_tl=en&_x_tr_hl=tr&_x_tr_pto=wapp

SDB@79

October 28

On Monday, 28 October 2024 at 20:53:32 UTC, Salih Dincer wrote:

> Semaphore?

Please see: https://dlang.org/phobos/core_sync_semaphore.html
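
As a rough sketch of the idea (queue and handleRequest are placeholders; note that a bare wait() suspends the whole thread, so on the vibe.d side it would need care not to block the event loop):

import core.sync.semaphore : Semaphore;

__gshared Semaphore ready;

shared static this() { ready = new Semaphore(0); }

// Producer (Unit thread), per request:
//   queue.enqueue(req);
//   ready.notify();

// Consumer thread:
void consumerLoop()
{
  UnitRequest req;
  for (;;) {
    ready.wait(); // sleep until notify(); no busy polling
    while (queue.dequeue(req)) // drain everything available
      handleRequest(req);
  }
}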

SDB@79

October 31

On Monday, 28 October 2024 at 20:53:32 UTC, Salih Dincer wrote:

> On Monday, 28 October 2024 at 19:57:41 UTC, Kyle Ingraham wrote:
>
>> • Polling too infrequently killed performance; polling too often wrecked CPU usage.
>> • Using message passing reduced performance quite a bit.
>> • Batching reads was hard because it was tricky to balance performance for single requests against performance for streams of them.
>
> Semaphore?
>
> https://demirten-gitbooks-io.translate.goog/linux-sistem-programlama/content/semaphore/operations.html?_x_tr_sl=tr&_x_tr_tl=en&_x_tr_hl=tr&_x_tr_pto=wapp
>
> SDB@79

I went back to try a semaphore and ended up with a mutex, an event, and a lock-free queue. My aim was to limit the number of vibe.d events emitted, and with them the event loop overhead. It works as follows (sketched in code after the list):

  • Requests come in on the Unit thread and are added to the lock-free queue.
  • The Unit thread tries to obtain the mutex. If it cannot, it assumes request processing is in progress on the vibe.d thread and does not emit an event.
  • The vibe.d thread waits on an event. When one arrives, it obtains the mutex and pulls from the lock-free queue until the queue is empty.
  • Once the queue is empty the vibe.d thread releases the mutex and waits for another event.
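
In code the scheme looks roughly like this (a simplified sketch rather than the demo's actual code; queue and UnitRequest are placeholders, and the event API shown assumes vibe.core.sync's shared ManualEvent):

import core.sync.mutex : Mutex;
import vibe.core.core : runTask;
import vibe.core.sync : createSharedManualEvent, ManualEvent;

__gshared Mutex draining;
shared ManualEvent requestsReady;

shared static this()
{
  draining = new Mutex;
  requestsReady = createSharedManualEvent();
}

// Unit thread, per incoming request:
void onUnitRequest(UnitRequest req)
{
  queue.enqueue(req);
  if (draining.tryLock()) { // vibe.d thread looks idle: wake it
    draining.unlock();
    requestsReady.emit();
  }
  // Otherwise the vibe.d thread holds the mutex and is already draining
  // the queue, so no event is emitted for this request.
}

// vibe.d thread:
void drainLoop()
{
  int ec = requestsReady.emitCount;
  UnitRequest req;
  for (;;) {
    ec = requestsReady.wait(ec); // fiber-friendly wait for the next emit
    draining.lock();
    while (queue.dequeue(req))
      runTask((UnitRequest r) { /* handle the request */ }, req);
    draining.unlock();
  }
}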

This approach increased the ratio of requests processed to events emitted/waited on from 1:1 to 10:1. It had no impact on event loop overhead, however. The entire program still spends ~50% of its runtime in this function: https://github.com/vibe-d/eventcore/blob/0cdddc475965824f32d32c9e4a1dfa58bd616cc9/source/eventcore/drivers/posix/cfrunloop.d#L38. I'll see if I can get images of my profiling posted here. I'm sure I'm missing something obvious.

October 31

On Thursday, 31 October 2024 at 16:43:09 UTC, Kyle Ingraham wrote:

> This approach increased the ratio of requests processed to events emitted/waited on from 1:1 to 10:1. It had no impact on event loop overhead, however. The entire program still spends ~50% of its runtime in this function: https://github.com/vibe-d/eventcore/blob/0cdddc475965824f32d32c9e4a1dfa58bd616cc9/source/eventcore/drivers/posix/cfrunloop.d#L38. I'll see if I can get images of my profiling posted here. I'm sure I'm missing something obvious.

I forgot to add that once delays are added, my demonstrator and a program using vibe.d's web framework post similar numbers: with a 10ms sleep added, my demonstrator managed 600 req/s and the vibe.d program 630 req/s.

It's encouraging to see vibe.d's concurrency system pay off once delays are in play. I'd like to be able to use it without drastically affecting throughput in the no-delay case, however.

October 31

On Thursday, 31 October 2024 at 16:43:09 UTC, Kyle Ingraham wrote:

> ...I'll see if I can get images of my profiling posted here...

Here are images as promised:

In the flame graph there are two threads: Main Thread and thread_entryPoint. NGINX Unit runs in thread_entryPoint; vibe.d and my request handling code run in Main Thread. My request handling code is grouped under fiber_entryPoint within Main Thread, and vibe.d's code is grouped under 'start'.
