April 27, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

Walter Bright <bugzilla@digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla@digitalmars.com

--- Comment #8 from Walter Bright <bugzilla@digitalmars.com> ---
(In reply to Aleksei Preobrazhenskii from comment #7)
> I was running tests for the past five days and didn't see any deadlocks since I switched the GC to using real-time POSIX signals (thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1)). I would recommend changing the default signals accordingly.

Since you've written the code to fix it, please write a Pull Request for it. That way you get the credit!

--
April 27, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

safety0ff.bugz <safety0ff.bugz@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |safety0ff.bugz@gmail.com

--- Comment #9 from safety0ff.bugz <safety0ff.bugz@gmail.com> ---
Could you run strace to get a log of the signal usage?

For example:

strace -f -e signal -o signals.log command_to_run_program

Then attach the resulting signals.log to the bug report?
I don't know if it'll be useful, but it would give us something more to look
through for hints.

I'm wondering if there are any other signal handler invocations in the
"...application stack" part of your stack traces.
I've seen a deadlock caused by an assert firing within the
thread_suspendHandler, which then deadlocks on the GC lock.

(In reply to Aleksei Preobrazhenskii from comment #6)
> Like, if thread_suspendAll happens while some threads are still in the thread_suspendHandler (already handled the resume signal, but still haven't left the suspend handler).

What should happen in this case: since the signal is masked upon signal handler invocation, the new suspend signal is marked as "pending" and runs once thread_suspendHandler returns and the signal is unblocked.

The suspended thread cannot receive another resume or suspend signal until after the sem_post in thread_suspendHandler.

I've mocked up the suspend/resume code and it does not deadlock in the situation you've described.

> Real-time POSIX signals (SIGRTMIN .. SIGRTMAX) have stronger delivery
> guarantees

Their queuing and ordering guarantees should be irrelevant here because of the synchronization and signal masks.

I don't see any other benefits of RT signals.

(In reply to Walter Bright from comment #8)
> 
> Since you've written the code to fix it, please write a Pull Request for it. That way you get the credit!

He modified his code to use the thread_setGCSignals function: https://dlang.org/phobos/core_thread.html#.thread_setGCSignals


P.S.: I don't mean to sound doubtful, I just want a sound explanation of the deadlock so it can be properly addressed at the cause.

--
April 28, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #10 from Aleksei Preobrazhenskii <apreobrazhensky@gmail.com> ---
(In reply to safety0ff.bugz from comment #9)
> Could you run strace to get a log of the signal usage?

I tried that before to catch the deadlock, but I wasn't able to reproduce it while strace was running. And, unfortunately, I don't have the original code running in production anymore.

> I'm wondering if there are any other signal handler invocations in the "...application stack" part of your stack traces.

No, there was no signal-related code in the hidden parts of the stack trace.

> I've seem a deadlock caused by an assert firing within the thread_suspendHandler, which deadlocks on the GC lock.

In my case that was a release build, so I assume no asserts.

> What should happen in this case is since the signal is masked upon signal handler invocation, the new suspend signal is marked as "pending" and run once thread_suspendHandler returns and the signal is unblocked.

Yeah, my reasoning was wrong. I did a quick test and saw that the signals weren't delivered; apparently I forgot that pthread_kill is asynchronous, so the signals must have coalesced in my test.

> Their queuing and ordering guarantees should be irrelevant due to synchronization and signal masks.

Ideally, yeah, but as I said, I just changed SIGUSR1/SIGUSR2 to SIGRTMIN/SIGRTMIN+1 and didn't see any deadlocks for a long time, whereas I saw them pretty consistently before. So either the "irrelevant" part is wrong, or there is something else that is different and relevant (and probably not documented) about real-time signals. The other explanation is that the bug is still there and real-time signals just somehow reduced the probability of it happening.

Also, I have no other explanation for why the stack traces look like that; the simplest one is that the signal wasn't delivered.

--
May 07, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

Илья Ярошенко <ilyayaroshenko@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|nobody@puremagic.com        |ilyayaroshenko@gmail.com

--
May 08, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #11 from Martin Nowak <code@dawg.eu> ---
Having the main thread hang while waiting for semaphore posts in
thread_suspendAll is a good indication that the signal was lost.
Did you have gdb attached while the signal was sent? That sometimes causes
issues with signal delivery.
The setup (a few threads allocating classes and extending arrays) looks simple
enough to be run for a few days; maybe we can reproduce the problem.

Are there any other reasons for switching to real-time signals? Which real-time signals are usually not used for other purposes?

--
May 09, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #12 from Aleksei Preobrazhenskii <apreobrazhensky@gmail.com> ---
(In reply to Martin Nowak from comment #11)
> Did you have gdb attached while the signal was send? That sometime causes issues w/ signal delivery.

No, I didn't. I attached gdb to investigate deadlock which already happened at that point.

> Are there any other reasons for switching to real-time signals?

I read that traditional signals are internally mapped to real-time signals. If that's true, I see no reason to stick with an inferior emulated mechanism with weaker guarantees.

> Which real-time signals are usually not used for other purposes?

Basically all real-time signals in the range SIGRTMIN .. SIGRTMAX are intended for custom use (SIGRTMIN might vary from platform to platform, though, because of things like NPTL and LinuxThreads).

--
May 11, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #13 from Aleksei Preobrazhenskii <apreobrazhensky@gmail.com> ---
I saw new deadlock with different symptoms today.

Stack trace of collecting thread:

Thread XX (Thread 0x7fda6ffff700 (LWP 32383)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x00000000007b4046 in thread_suspendAll ()
#2  0x00000000007998dd in gc.gc.Gcx.fullcollect() ()
#3  0x0000000000797e24 in gc.gc.Gcx.bigAlloc() ()
#4  0x000000000079bb5f in
gc.gc.GC.__T9runLockedS47_D2gc2gc2GC12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS21_D2gc2gc10mallocTimelS21_D2gc2gc10numMallocslTmTkTmTxC8TypeInfoZ.runLocked()
()
#5  0x000000000079548e in gc.gc.GC.malloc() ()
#6  0x0000000000760ac7 in gc_qalloc ()
#7  0x000000000076437b in _d_arraysetlengthT ()
...application stack

Stack traces of other threads:

Thread XX (Thread 0x7fda5cff9700 (LWP 32402)):
#0  0x00007fda78927454 in do_sigsuspend (set=0x7fda5cff76c0) at
../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1  __GI___sigsuspend (set=<optimized out>) at
../sysdeps/unix/sysv/linux/sigsuspend.c:78
#2  0x000000000075d979 in core.thread.thread_suspendHandler() ()
#3  0x000000000075e220 in core.thread.callWithStackShell() ()
#4  0x000000000075d907 in thread_suspendHandler ()
#5  <signal handler called>
#6  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:160
#7  0x0000000000760069 in core.sync.condition.Condition.wait() ()
...application stack


All suspending signals were delivered, but it seems that the number of calls to sem_wait was different from the number of calls to sem_post (or something similar). I have no reasonable explanation for that.

It doesn't invalidate the hypothesis that RT signals helped with the original deadlock, though.

--
May 12, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #14 from safety0ff.bugz <safety0ff.bugz@gmail.com> ---
(In reply to Aleksei Preobrazhenskii from comment #13)
> 
> All suspending signals were delivered, but it seems that the number of calls to sem_wait was different from the number of calls to sem_post (or something similar). I have no reasonable explanation for that.
>
> It doesn't invalidate the hypothesis that RT signals helped with the original deadlock, though.

I haven't looked too closely at whether there are any races in thread
termination.
My suspicions are still on a low-level synchronization bug.
Have you tried a more recent kernel (3.19+) or a newer glibc?

I'm aware of this bug [1], which was supposed to affect kernels 3.14 - 3.18, but perhaps there's a preexisting bug which affects your machine?

[1] https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

--
May 20, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

Artem Tarasov <lomereiter@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lomereiter@gmail.com

--- Comment #15 from Artem Tarasov <lomereiter@gmail.com> ---
I'm apparently bumping into the same problem. Here's the last stack trace that I've received from a user, very similar to the one posted here: https://gist.github.com/rtnh/e2eab6afa7c0a37dbc96578d0f73c540

The prominent kernel bug mentioned here has already been ruled out. Another hint I've got is that reportedly the 'error doesn't happen on XenServer hypervisors, only on KVM' (the full discussion is taking place at https://github.com/lomereiter/sambamba/issues/189)

--
August 09, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #16 from Martin Nowak <code@dawg.eu> ---
(In reply to Aleksei Preobrazhenskii from comment #13)
> All suspending signals were delivered, but it seems that the number of calls to sem_wait was different from the number of calls to sem_post (or something similar). I have no reasonable explanation for that.
> 
> It doesn't invalidate the hypothesis that RT signals helped with the original deadlock, though.

For it to be a hypothesis it must be verifiable, but as we can't explain why RT
signals would help, it's not a real hypothesis. Can anyone reproduce the issue
somewhat reliably?
I would suspect that this issue came with the recent parallel suspend feature
(https://github.com/dlang/druntime/pull/1110), which would affect dmd >= 2.070.0.
Could someone test their code with 2.069.2?

--