August 09, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #17 from Илья Ярошенко <ilyayaroshenko@gmail.com> ---
(In reply to Martin Nowak from comment #16)
> (In reply to Aleksei Preobrazhenskii from comment #13)
> > All suspending signals were delivered, but it seems that the number of calls to sem_wait was different from the number of calls to sem_post (or something similar). I have no reasonable explanation for that.
> > 
> > It doesn't invalidate the hypothesis that RT signals helped with original deadlock though.
> 
> To be a hypothesis it must be verifiable, but as we can't explain why RT signals would help, it's not a real hypothesis. Can anyone reproduce the issue somewhat reliably?

It is not easy to catch on a PC. The bug was found when the program was running on multiple CPUs across multiple servers over the course of a day.

> I would suspect that this issue came with the recent parallel suspend
> feature.
> https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> 2.070.0.
> Could someone test their code with 2.069.2?

Yes, the bug was first found with 2.069.

--
August 09, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #18 from Martin Nowak <code@dawg.eu> ---
Think I just spotted the problem.

There seems to be a race condition between sending the signal and checking
whether the thread exited.
https://github.com/dlang/druntime/blob/c1f285715cf14e80307eafc76d0b25b417b8de19/src/core/thread.d#L2586-L2588
This might lead to a wrong count of active threads, and therefore to a
deadlock.
https://github.com/dlang/druntime/blob/c1f285715cf14e80307eafc76d0b25b417b8de19/src/core/thread.d#L2648-L2649

Those changes were indeed introduced with https://github.com/dlang/druntime/pull/1110.

A fix would be to simply synchronize the reception of the signal in thread_suspendHandler with a variable in Thread, and only sem_wait for threads that did receive the signal, somewhat similar to the FreeBSD workaround.

--
August 09, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #19 from Martin Nowak <code@dawg.eu> ---
(In reply to Илья Ярошенко from comment #17)
> > https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> > 2.070.0.
> > Could someone test their code with 2.069.2?
> 
> Yes, the bug was first found with 2.069.

But that change is not in 2.069.x, only in 2.070.0 and later. Can you reproduce it with some reliability? That would simplify my life a lot.

Following my hypothesis, it should be fairly simple to trigger with one thread continuously looping on GC.collect() while concurrently spawning many short-lived threads, to increase the chance of triggering the race between signal delivery and the thread exiting.
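
A minimal sketch of such a stress test, for illustration only. The names stressSuspend and stopCollector and the round and batch counts are arbitrary assumptions, not taken from any actual test case; this is not a confirmed reproducer, and whether it hangs depends on the compiler version, the kernel, and timing.

```d
import core.atomic : atomicLoad, atomicStore;
import core.memory : GC;
import core.stdc.stdio : printf;
import core.thread : Thread;

// Hypothetical flag that tells the collector thread to stop.
shared bool stopCollector;

/// Spawns `rounds` batches of `batchSize` short-lived threads while another
/// thread loops on GC.collect(). Returns the number of worker threads joined.
size_t stressSuspend(size_t rounds, size_t batchSize)
{
    atomicStore(stopCollector, false);

    // Every collection suspends all other threads known to the runtime.
    auto collector = new Thread({
        while (!atomicLoad(stopCollector))
            GC.collect();
    });
    collector.start();

    size_t joined;
    auto workers = new Thread[](batchSize);
    foreach (round; 0 .. rounds)
    {
        foreach (ref t; workers)
        {
            // A worker that allocates once and exits immediately.
            t = new Thread({ auto buf = new ubyte[](64); });
            t.start();
        }
        foreach (t; workers)
        {
            t.join();
            ++joined;
        }
    }

    atomicStore(stopCollector, true);
    collector.join();
    return joined;
}

void main()
{
    // Arbitrary sizes; if the suspected race is hit, the program would be
    // expected to hang inside a collection instead of printing this line.
    auto n = stressSuspend(100_000, 16);
    printf("joined %zu threads without deadlock\n", n);
}
```

A run that finishes proves nothing about the absence of the race; only a hang (ideally with a stack trace of every thread) would be informative.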

If realtime signals are delivered faster (before pthread_kill returns), then they might indeed avoid the race condition by pure chance.

--
August 10, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #20 from Илья Ярошенко <ilyayaroshenko@gmail.com> ---
I don't have access to the source code anymore :/

--
August 11, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #21 from Martin Nowak <code@dawg.eu> ---
Nope, that doesn't seem to be the problem.
All the thread exit code synchronizes on Thread.slock_nothrow.
It shouldn't even be possible to send a signal to an exiting thread, b/c they
get removed from the thread list before that, and that is synchronized around
the suspend loop.

Might still be a problem with the synchronization of m_isRunning and/or thread_cleanupHandler. Did your apps by any chance use thread cancellation or pthread_exit?

--
August 11, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #22 from Илья Ярошенко <ilyayaroshenko@gmail.com> ---
(In reply to Martin Nowak from comment #21)
> Nope, that doesn't seem to be the problem.
> All the thread exit code synchronizes on Thread.slock_nothrow.
> It shouldn't even be possible to send a signal to an exiting thread, b/c
> they get removed from the thread list before that, and that is synchronized
> around the suspend loop.
> 
> Might still be a problem with the synchronization of m_isRunning and/or thread_cleanupHandler. Did your apps by any chance use thread cancellation or pthread_exit?

No, but an Exception may be thrown in a thread.

--
September 22, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #23 from Martin Nowak <code@dawg.eu> ---
Anyone still experiencing this issue? Can't seem to fix it w/o reproducing it.

--
September 22, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #24 from Aleksei Preobrazhenskii <apreobrazhensky@gmail.com> ---
(In reply to Martin Nowak from comment #23)
> Anyone still experiencing this issue? Can't seem to fix it w/o reproducing it.

Since I changed the signals to real-time and migrated to a recent kernel, I haven't seen the issue in release builds. However, I recently tried running a profile build (unfortunately only on the old kernel), and it got stuck every time. It might be related to this issue; I will try to reproduce it with simpler code when I have time.

--
September 23, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #25 from Martin Nowak <code@dawg.eu> ---
(In reply to Aleksei Preobrazhenskii from comment #24)
> Since I changed the signals to real-time and migrated to a recent kernel, I haven't seen the issue in release builds. However, I recently tried running a profile build (unfortunately only on the old kernel), and it got stuck every time.

Thanks, good to hear from you.

There is a chance that these are kernel bugs fixed in 3.10 https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db and 3.18 https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0.

--
October 04, 2016
https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #26 from Илья Ярошенко <ilyayaroshenko@gmail.com> ---
Probably a related issue: http://forum.dlang.org/post/igqwbqawrtxnigplgnka@forum.dlang.org

--