GC hangs on spinlock in ConservativeGC.runLocked
February 20, 2023

This is in a vibe.d application, on Ubuntu server 22.04. Presumably something isn't unlocking the gcLock, but I have no idea what. See GDB backtrace below.

Ideas? And is there some way to get stack traces for any other vibe-tasks in a running process? Could this be caused by a non-clean exit of a vibe task?

It's hard to know if this is me, vibe.d, or druntime...

#0  0x00007f7d13573868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7f7d0e6763d0, rem=0x7f7d0e6763e0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007f7d135786e7 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x000055c709ddca40 in core.thread.osthread.Thread.sleep(core.time.Duration) ()
#3  0x000055c709e7db08 in core.internal.spinlock.SpinLock.yield(ulong) shared ()
#4  0x000055c709e7dab2 in core.internal.spinlock.SpinLock.lock() shared ()
#5  0x000055c709e86dff in core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) ()
#6  0x000055c709e7e776 in core.internal.gc.impl.conservative.gc.ConservativeGC.malloc(ulong, uint, const(TypeInfo)) ()
#7  0x000055c709dd94bf in gc_malloc ()
#8  0x000055c709de0139 in _d_newclass ()
#9  0x000055c709dd1fe1 in core.runtime.defaultTraceHandler(void*) ()
#10 0x000055c709dded7e in _d_traceContext ()
#11 0x000055c709e4269e in _d_createTrace ()
#12 0x000055c709ddf45e in _d_throwdwarf ()
#13 0x000055c709e83130 in core.internal.gc.impl.conservative.gc.Gcx.fullcollect(bool, bool, bool) ()
#14 0x000055c709e810fa in core.internal.gc.impl.conservative.gc.Gcx.bigAlloc(ulong, ref ulong, uint, const(TypeInfo)) ()
#15 0x000055c709e86e52 in core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) ()
#16 0x000055c709e7e8d2 in core.internal.gc.impl.conservative.gc.ConservativeGC.qalloc(ulong, uint, scope const(TypeInfo)) ()
#17 0x000055c709dd9503 in gc_qalloc ()
#18 0x000055c709ad8694 in _D3std5array__T8AppenderTAyaZQo13ensureAddableMFmZ9__lambda9MFNaNbNeZS4core6memory8BlkInfo_ (__capture=0x7f7d0e676a90, __HID4=0x7f7d0e676a60)
    at /usr/local/dlang/dmd-2.100.2/linux/bin64/../../src/phobos/std/array.d:3576
[rest of trace omitted]
February 20, 2023

On 2/20/23 8:01 AM, klasbo wrote:

> This is in a vibe.d application, on Ubuntu server 22.04. Presumably something isn't unlocking the gcLock, but I have no idea what. See GDB backtrace below.
>
> Ideas? And is there some way to get stack traces for any other vibe-tasks in a running process? Could this be caused by a non-clean exit of a vibe task?

This bug was fixed in 2.102.0: https://dlang.org/changelog/2.102.0.html#druntime.nogc-traceinfo

The actual bug: https://issues.dlang.org/show_bug.cgi?id=22616

-Steve

February 20, 2023

On Monday, 20 February 2023 at 16:31:18 UTC, Steven Schveighoffer wrote:

> This bug was fixed in 2.102.0: https://dlang.org/changelog/2.102.0.html#druntime.nogc-traceinfo
>
> The actual bug: https://issues.dlang.org/show_bug.cgi?id=22616
>
> -Steve

I was using 2.100, so that would explain it then.

I don't quite understand the internals of the GC well enough to "get" what happens here. Is it that an assert in sweep() triggers, and this[1] scope(failure) in fullcollect() would re-trigger a GC collection when it tries to allocate the trace info (which is now fixed when trace info is nogc)?
Or the more important (for me) question: Is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?

[1] https://github.com/dlang/dmd/blob/3457018e34ab2333de22a6285622d3190268d5e0/druntime/src/core/internal/gc/impl/conservative/gc.d#L3147

Thank you for the answer, and (to all involved) the work in fixing it!

February 20, 2023

On 2/20/23 4:28 PM, klasbo wrote:

> I don't quite understand the internals of the GC well enough to "get" what happens here. Is it that an assert in sweep() triggers, and this[1] scope(failure) in fullcollect() would re-trigger a GC collection when it tries to allocate the trace info (which is now fixed when trace info is nogc)?

So it was quite a subtle bug.

The throw handler looks at the info member of the Throwable (the trace info), and if it is null, attempts to allocate one.

If you throw inside the GC (i.e. while the GC is collecting), you cannot allocate. The trace info allocator was sneakily using the GC, so it had a special case: it checked the inFinalizer flag, and if that was true, it simply did not allocate the trace info (leaving it null).
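
To make that concrete, the pre-fix decision looked roughly like this (a simplified stand-in, not the actual druntime source; all names here are made up for illustration):

__gshared bool inFinalizer;    // stands in for the GC's "collection in progress" flag

// Stand-in for what the throw path does with a Throwable's trace info slot.
// The real handler works on Throwable.info; this model just uses an Object slot.
void attachTraceInfo(ref Object traceInfo)
{
    if (traceInfo !is null)
        return;                // already traced, or set to a "suppress" sentinel
    if (inFinalizer)
        return;                // pre-fix workaround: inside the GC, leave it null
    traceInfo = new Object;    // pre-fix: this allocation went through the GC
}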

In the GC collection routine, there is a mechanism that catches any Exception and handles it properly by throwing a FinalizerError (whose info is already set to SuppressTraceInfo, which prevents further trace info allocations).

But Error is not caught. It leaks all the way out to a scope(failure) statement.

Now, inside this scope(failure) statement, the inFinalizer flag is set to false. What does this do?

a. Catches the Error (which, remember, still has a null trace info).
b. Sets the inFinalizer flag to false.
c. Rethrows the Error. The rethrow sees a null trace info and tries to allocate one.
d. Allocating tries to take the GC spinlock, which is still locked, and deadlocks.

The whole reason we don't allow allocating inside the finalizer is because of this deadlock!
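
In code, the sequence is roughly this (a single-threaded toy model, not the actual druntime source; every name is made up for illustration):

import std.stdio : writeln;

__gshared bool gcLockHeld;    // stands in for the GC spinlock
__gshared bool inFinalizer;   // stands in for the "collection in progress" flag

void lockGC()
{
    if (gcLockHeld)
    {
        // the real spinlock just spins here forever: this is the observed hang
        writeln("deadlock: GC lock is already held by this thread");
        return;
    }
    gcLockHeld = true;
}

void allocateTraceInfo()
{
    if (inFinalizer)
        return;               // the pre-fix guard: skip allocation while collecting
    lockGC();                 // pre-fix, trace info allocation went through the GC
    // ... allocate via the GC, then unlock ...
}

void collectModel()
{
    lockGC();                 // the collection holds the GC lock the whole time
    inFinalizer = true;
    scope (failure) inFinalizer = false;   // step b: the flag is cleared on the way out
    // step a: an Error (e.g. the failed assert in the sweep) escapes the collection
    throw new Error("sweep invariant violated");
}

void main()
{
    try
        collectModel();
    catch (Error e)
        allocateTraceInfo();  // steps c and d: the rethrow path allocates, lock still held
}

Running this prints the "deadlock" line instead of actually hanging, but the real spinlock has no such escape hatch.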

The solution is to allocate the trace info with C malloc, and free it with free. The trace info was always simply a list of stack frame addresses, and so trivial to allocate/free.
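
Sketched out (again just an illustration, not the actual druntime implementation), the fixed representation is something like:

import core.stdc.stdlib : malloc, free;

// A trace is just a list of return addresses, so it can live entirely in
// C-malloc'd memory and never touch the GC, even mid-collection.
struct TraceBuffer
{
    size_t length;
    void** frames;
}

TraceBuffer* createTrace(void*[] callstack) nothrow @nogc
{
    auto buf = cast(TraceBuffer*) malloc(TraceBuffer.sizeof);
    if (buf is null)
        return null;
    buf.frames = cast(void**) malloc(callstack.length * (void*).sizeof);
    if (buf.frames is null)
    {
        free(buf);
        return null;
    }
    buf.length = callstack.length;
    foreach (i, p; callstack)
        buf.frames[i] = p;
    return buf;
}

void destroyTrace(TraceBuffer* buf) nothrow @nogc
{
    if (buf is null)
        return;
    free(buf.frames);
    free(buf);
}

Since the buffer never comes from the GC heap, attaching it to a Throwable thrown during a collection is safe, which is essentially what the 2.102.0 change does.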

> Or the more important (for me) question: Is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?

The error itself is something that isn't addressed with this fix. So I should clarify that the deadlock was fixed, not your original root cause. Please try with the latest compiler, and elaborate further if you still can't figure it out and/or file a bug!

-Steve

February 20, 2023

On Monday, 20 February 2023 at 22:11:47 UTC, Steven Schveighoffer wrote:

> On 2/20/23 4:28 PM, klasbo wrote:
>
>> Or the more important (for me) question: Is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?
>
> The error itself is something that isn't addressed with this fix. So I should clarify that the deadlock was fixed, not your original root cause. Please try with the latest compiler, and elaborate further if you still can't figure it out and/or file a bug!
>
> -Steve

So at least I will now get a proper error (presumably InvalidMemoryOperationError?) and its stack trace if and when this happens again. Or if I am (un)lucky, this will be "handled" by vibe and only show up in whatever logs I have enabled.

Thanks again.