Thread overview
GC dead-locking ?
Jun 13, 2013
Marco Leise
Jun 13, 2013
Marco Leise
Jun 17, 2013
Sean Kelly
Jun 18, 2013
Marco Leise
Jun 19, 2013
Sean Kelly
Jul 01, 2013
Marco Leise
June 13, 2013
Here is an excerpt from a stack trace I got while profiling with OProfile:

#0  sem_wait () from /lib64/libpthread.so.0
#1  thread_suspendAll () at core/thread.d:2471
#2  gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427
#3  gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099
#4  gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503
#5  gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421
#6  gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203
#7  gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198
#8  _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807
#9  sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444

Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ?

This was compiled with GDC using DMD FE 2.062.

-- 
Marco

June 13, 2013
One more note: I get this consistently during profiling, but
not without.
I don't count kernel involvement out either, since OProfile is
a kernel based profiler and there could be a quirk in its
interaction with semaphores.

-- 
Marco

June 17, 2013
On Jun 13, 2013, at 2:22 AM, Marco Leise <Marco.Leise@gmx.de> wrote:

> Here is an excerpt from a stack trace I got while profiling with OProfile:
> 
> #0  sem_wait () from /lib64/libpthread.so.0
> #1  thread_suspendAll () at core/thread.d:2471
> #2  gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427
> #3  gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099
> #4  gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503
> #5  gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421
> #6  gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203
> #7  gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198
> #8  _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807
> #9  sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444
> 
> Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ?

I assume you're running on Linux, which uses signals (SIGUSR1, specifically) to suspend threads for a collection.  So I imagine what's happening is that your thread is trying to suspend all the other threads so it can collect, and those threads are ignoring the signal for some reason.  I would expect pthread_cond_wait to be interrupted if a signal arrives though.  Have you overridden the signal handler for SIGUSR1?
June 18, 2013
Am Mon, 17 Jun 2013 10:46:19 -0700
schrieb Sean Kelly <sean@invisibleduck.org>:

> On Jun 13, 2013, at 2:22 AM, Marco Leise <Marco.Leise@gmx.de> wrote:
> 
> > Here is an excerpt from a stack trace I got while profiling with OProfile:
> > 
> > #0  sem_wait () from /lib64/libpthread.so.0
> > #1  thread_suspendAll () at core/thread.d:2471
> > #2  gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427
> > #3  gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099
> > #4  gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503
> > #5  gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421
> > #6  gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203
> > #7  gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198
> > #8  _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807
> > #9  sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444
> > 
> > Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ?
> 
> I assume you're running on Linux, which uses signals (SIGUSR1, specifically) to suspend threads for a collection.  So I imagine what's happening is that your thread is trying to suspend all the other threads so it can collect, and those threads are ignoring the signal for some reason.  I would expect pthread_cond_wait to be interrupted if a signal arrives though.  Have you overridden the signal handler for SIGUSR1?

No, I have not overridden the signal handler. I'm aware of the fact that signals make pthread_cond_wait() return early and put them in a while loop as one would expect, that is all.

-- 
Marco

June 19, 2013
On Jun 18, 2013, at 7:01 AM, Marco Leise <Marco.Leise@gmx.de> wrote:

> Am Mon, 17 Jun 2013 10:46:19 -0700
> schrieb Sean Kelly <sean@invisibleduck.org>:
> 
>> On Jun 13, 2013, at 2:22 AM, Marco Leise <Marco.Leise@gmx.de> wrote:
>> 
>>> Here is an excerpt from a stack trace I got while profiling with OProfile:
>>> 
>>> #0  sem_wait () from /lib64/libpthread.so.0
>>> #1  thread_suspendAll () at core/thread.d:2471
>>> #2  gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427
>>> #3  gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099
>>> #4  gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503
>>> #5  gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421
>>> #6  gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203
>>> #7  gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198
>>> #8  _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807
>>> #9  sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444
>>> 
>>> Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ?
>> 
>> I assume you're running on Linux, which uses signals (SIGUSR1, specifically) to suspend threads for a collection.  So I imagine what's happening is that your thread is trying to suspend all the other threads so it can collect, and those threads are ignoring the signal for some reason.  I would expect pthread_cond_wait to be interrupted if a signal arrives though.  Have you overridden the signal handler for SIGUSR1?
> 
> No, I have not overridden the signal handler. I'm aware of the fact that signals make pthread_cond_wait() return early and put them in a while loop as one would expect, that is all.

Hrm... Can you trap this in a debugger and post the stack traces of all threads?  That stack above is a thread waiting for others to say they're suspended so it can collect.
July 01, 2013
Am Tue, 18 Jun 2013 19:12:06 -0700
schrieb Sean Kelly <sean@invisibleduck.org>:

> On Jun 18, 2013, at 7:01 AM, Marco Leise <Marco.Leise@gmx.de> wrote:
> 
> > Am Mon, 17 Jun 2013 10:46:19 -0700
> > schrieb Sean Kelly <sean@invisibleduck.org>:
> > 
> >> On Jun 13, 2013, at 2:22 AM, Marco Leise <Marco.Leise@gmx.de> wrote:
> >> 
> >>> Here is an excerpt from a stack trace I got while profiling with OProfile:
> >>> 
> >>> #0  sem_wait () from /lib64/libpthread.so.0
> >>> #1  thread_suspendAll () at core/thread.d:2471
> >>> #2  gc.gcx.Gcx.fullcollect() (this=...) at gc/gcx.d:2427
> >>> #3  gc.gcx.Gcx.bigAlloc() (this=..., size=16401, poolPtr=0x7fc3d4bfe3c8, alloc_size=0x7fc3d4bfe418) at gc/gcx.d:2099
> >>> #4  gc.gcx.GC.mallocNoSync (alloc_size=0x7fc3d4bfe418, bits=10, size=16401, this=...) gc/gcx.d:503
> >>> #5  gc.gcx.GC.malloc() (this=..., size=16401, bits=10, alloc_size=0x7fc3d4bfe418) gc/gcx.d:421
> >>> #6  gc.gc.gc_qalloc (ba=10, sz=<optimized out>) gc/gc.d:203
> >>> #7  gc_qalloc (sz=<optimized out>, ba=10) gc/gc.d:198
> >>> #8  _d_newarrayT (ti=..., length=4096) rt/lifetime.d:807
> >>> #9  sequencer.algorithm.gzip.HuffmanTree.__T6__ctorTG32hZ.__ctor() (this=..., bitLengths=...) sequencer/algorithm/gzip.d:444
> >>> 
> >>> Two more threads are alive, but waiting on a condition variable (i.e.: in pthread_cond_wait(), but from my own and not from druntime code. Is there some obvious way I could have dead-locked the GC ? Or is there a bug ?
> >> 
> >> I assume you're running on Linux, which uses signals (SIGUSR1, specifically) to suspend threads for a collection.  So I imagine what's happening is that your thread is trying to suspend all the other threads so it can collect, and those threads are ignoring the signal for some reason.  I would expect pthread_cond_wait to be interrupted if a signal arrives though.  Have you overridden the signal handler for SIGUSR1?
> > 
> > No, I have not overridden the signal handler. I'm aware of the fact that signals make pthread_cond_wait() return early and put them in a while loop as one would expect, that is all.
> 
> Hrm... Can you trap this in a debugger and post the stack traces of all threads?  That stack above is a thread waiting for others to say they're suspended so it can collect.

I could do that (with a little work setting the scenario up again), but it wont help. As I said, the other two threads were paused in pthread_cond_wait() in my own code. There was nothing special about their stack trace.

-- 
Marco