GC dead-locking ?

Jun 13, 2013

Marco Leise

Jun 13, 2013

Marco Leise

Jun 17, 2013

Jun 18, 2013

Jun 19, 2013

Jul 01, 2013

Here is an excerpt from a stack trace I got while profiling with OProfile: #0 sem_wait () from /lib64/libpthread.so.0 #1 thread_suspendAll () at core/thread.d:2471 #2 gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427 #3 gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099 #4 gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503 #5 gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421 #6 gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203 #7 gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198 #8 _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807 #9 sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444 Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ? This was compiled with GDC using DMD FE 2.062. -- Marco

One more note: I get this consistently during profiling, but not without. I don't count kernel involvement out either, since OProfile is a kernel based profiler and there could be a quirk in its interaction with semaphores. -- Marco

On Jun 13, 2013, at 2:22 AM, Marco Leise <Marco.Leise@gmx.de> wrote: > Here is an excerpt from a stack trace I got while profiling with OProfile: > > #0 sem_wait () from /lib64/libpthread.so.0 > #1 thread_suspendAll () at core/thread.d:2471 > #2 gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427 > #3 gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099 > #4 gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503 > #5 gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421 > #6 gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203 > #7 gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198 > #8 _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807 > #9 sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444 > > Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ? I assume you're running on Linux, which uses signals (SIGUSR1, specifically) to suspend threads for a collection. So I imagine what's happening is that your thread is trying to suspend all the other threads so it can collect, and those threads are ignoring the signal for some reason. I would expect pthread_cond_wait to be interrupted if a signal arrives though. Have you overridden the signal handler for SIGUSR1?

Am Mon, 17 Jun 2013 10:46:19 -0700 schrieb Sean Kelly <sean@invisibleduck.org>: > On Jun 13, 2013, at 2:22 AM, Marco Leise <Marco.Leise@gmx.de> wrote: > > > Here is an excerpt from a stack trace I got while profiling with OProfile: > > > > #0 sem_wait () from /lib64/libpthread.so.0 > > #1 thread_suspendAll () at core/thread.d:2471 > > #2 gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427 > > #3 gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099 > > #4 gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503 > > #5 gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421 > > #6 gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203 > > #7 gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198 > > #8 _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807 > > #9 sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444 > > > > Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ? > > I assume you're running on Linux, which uses signals (SIGUSR1, specifically) to suspend threads for a collection. So I imagine what's happening is that your thread is trying to suspend all the other threads so it can collect, and those threads are ignoring the signal for some reason. I would expect pthread_cond_wait to be interrupted if a signal arrives though. Have you overridden the signal handler for SIGUSR1? No, I have not overridden the signal handler. I'm aware of the fact that signals make pthread_cond_wait() return early and put them in a while loop as one would expect, that is all. -- Marco

On Jun 18, 2013, at 7:01 AM, Marco Leise <Marco.Leise@gmx.de> wrote: > Am Mon, 17 Jun 2013 10:46:19 -0700 > schrieb Sean Kelly <sean@invisibleduck.org>: > >> On Jun 13, 2013, at 2:22 AM, Marco Leise <Marco.Leise@gmx.de> wrote: >> >>> Here is an excerpt from a stack trace I got while profiling with OProfile: >>> >>> #0 sem_wait () from /lib64/libpthread.so.0 >>> #1 thread_suspendAll () at core/thread.d:2471 >>> #2 gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427 >>> #3 gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099 >>> #4 gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503 >>> #5 gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421 >>> #6 gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203 >>> #7 gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198 >>> #8 _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807 >>> #9 sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444 >>> >>> Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ? >> >> I assume you're running on Linux, which uses signals (SIGUSR1, specifically) to suspend threads for a collection. So I imagine what's happening is that your thread is trying to suspend all the other threads so it can collect, and those threads are ignoring the signal for some reason. I would expect pthread_cond_wait to be interrupted if a signal arrives though. Have you overridden the signal handler for SIGUSR1? > > No, I have not overridden the signal handler. I'm aware of the fact that signals make pthread_cond_wait() return early and put them in a while loop as one would expect, that is all. Hrm... Can you trap this in a debugger and post the stack traces of all threads? That stack above is a thread waiting for others to say they're suspended so it can collect.