[Issue 15939] GC.collect causes deadlock in multi-threaded environment

August 09, 2016

Posted by Илья Ярошенко

https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #17 from Илья Ярошенко <ilyayaroshenko@gmail.com> ---
(In reply to Martin Nowak from comment #16)
> (In reply to Aleksei Preobrazhenskii from comment #13)
> > All suspending signals were delivered, but it seems that the number of calls to sem_wait differed from the number of calls to sem_post (or something similar). I have no reasonable explanation for that.
> > 
> > It doesn't invalidate the hypothesis that RT signals helped with original deadlock though.
> 
> To be a hypothesis it must be verifiable, but as we can't explain why RT signals would help, it's not a real hypothesis. Can anyone somewhat repeatedly reproduce the issue?

It is not easy to catch on a PC. The bug was found when the program was running on multiple CPUs on multiple servers over the course of a day.

> I would suspect that this issue came with the recent parallel suspend
> feature.
> https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> 2.070.0.
> Could someone test their code with 2.069.2?

Yes, the bug was found first for 2.069.

--

https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #18 from Martin Nowak <code@dawg.eu> ---
I think I just spotted the problem. There seems to be a race condition between sending the signal and checking whether the thread exited.
https://github.com/dlang/druntime/blob/c1f285715cf14e80307eafc76d0b25b417b8de19/src/core/thread.d#L2586-L2588

This might lead to a wrong count of active threads, and therefore to a deadlock.
https://github.com/dlang/druntime/blob/c1f285715cf14e80307eafc76d0b25b417b8de19/src/core/thread.d#L2648-L2649

Those changes were indeed introduced with https://github.com/dlang/druntime/pull/1110.

A fix would be to simply synchronize the reception of the signal in thread_suspendHandler with a variable in Thread, and only sem_wait for threads that did receive the signal, somewhat similar to the FreeBSD workaround.

--
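
[Editor's sketch] The counting mismatch hypothesized in comment #18 can be illustrated with a toy model. This is not druntime code: `ThreadRecord`, `deliverSuspend`, `expectedAcks` and `missingAcks` are hypothetical names, signal delivery is simulated by a plain function call, and `tryWait` stands in for a `sem_wait` that would otherwise block. It only shows why waiting once per listed thread can hang when one thread never runs the handler, and why waiting only for threads that recorded reception would not.

```d
import core.sync.semaphore : Semaphore;

// Hypothetical stand-in for a runtime thread record; not druntime's Thread.
struct ThreadRecord
{
    bool exited;          // thread finished before the handler could run
    bool receivedSignal;  // set by the simulated suspend handler
}

// Simulated signal delivery: the handler only runs for live threads. It
// records reception and posts the semaphore, as a suspend handler would.
void deliverSuspend(ref ThreadRecord t, Semaphore acks)
{
    if (t.exited)
        return;
    t.receivedSignal = true;
    acks.notify();
}

// Number of acknowledgements the suspender decides to wait for.
size_t expectedAcks(const ThreadRecord[] threads, bool onlyReceivers)
{
    size_t n = 0;
    foreach (ref t; threads)
        if (!onlyReceivers || t.receivedSignal)
            ++n;
    return n;
}

// Returns how many waits would never be satisfied (missing posts).
size_t missingAcks(ThreadRecord[] threads, bool onlyReceivers)
{
    auto acks = new Semaphore(0);
    foreach (ref t; threads)
        deliverSuspend(t, acks);

    size_t missing = 0;
    foreach (_; 0 .. expectedAcks(threads, onlyReceivers))
        if (!acks.tryWait())   // a blocking sem_wait would hang here
            ++missing;
    return missing;
}

void main()
{
    import std.stdio : writeln;

    // Three listed threads; the second one exits before handling the signal.
    auto threads = [ThreadRecord(false), ThreadRecord(true), ThreadRecord(false)];
    writeln("one wait per listed thread, missing acks: ", missingAcks(threads.dup, false));
    writeln("wait for receivers only, missing acks: ", missingAcks(threads.dup, true));
}
```

In this model the first call reports one missing acknowledgement (the wait that would deadlock) and the second reports none.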

https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #19 from Martin Nowak <code@dawg.eu> ---
(In reply to Илья Ярошенко from comment #17)
> > https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> > 2.070.0.
> > Could someone test their code with 2.069.2?
>
> Yes, the bug was found first for 2.069.

But that change is not in 2.069.x, only in 2.070.0 and following.

Can you somewhat reproduce it? Would simplify my life a lot.
Following my hypothesis, it should be fairly simple to trigger with one thread continuously looping on GC.collect(), while concurrently spawning many short-lived threads, to increase the chance of triggering the race between signal delivery and the thread exiting.

If realtime signals are delivered faster (before pthread_kill returns), then they might indeed avoid the race condition by pure chance.

--
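
[Editor's sketch] The reproduction idea from comment #19 might look roughly like the program below: one thread loops on `GC.collect()` while the main thread spawns batches of short-lived threads. The helper name `stressCollect` and the round and batch sizes are assumptions chosen for illustration; nothing in the thread confirms that this actually triggers the deadlock. If the runtime does deadlock, the program simply never returns.

```d
import core.atomic : atomicLoad, atomicOp, atomicStore;
import core.memory : GC;
import core.thread : Thread;

// Hypothetical helper: runs GC.collect() in a loop on one thread while the
// caller spawns `rounds` batches of `threadsPerRound` short-lived threads.
// Returns the number of collections that completed.
size_t stressCollect(size_t rounds, size_t threadsPerRound)
{
    shared bool stop = false;
    shared size_t collections = 0;

    // Collector thread: every GC.collect() suspends all other threads.
    auto collector = new Thread({
        do
        {
            GC.collect();
            atomicOp!"+="(collections, 1);
        }
        while (!atomicLoad(stop));
    });
    collector.start();

    foreach (_; 0 .. rounds)
    {
        Thread[] batch;
        foreach (i; 0 .. threadsPerRound)
        {
            // Each worker allocates a little and exits immediately, so that
            // thread exit overlaps with suspend attempts as often as possible.
            auto t = new Thread({ auto buf = new ubyte[](64); });
            t.start();
            batch ~= t;
        }
        foreach (t; batch)
            t.join();
    }

    atomicStore(stop, true);
    collector.join();
    return atomicLoad(collections);
}

void main()
{
    import std.stdio : writeln;

    // Sizes are arbitrary; larger values give the race more opportunities.
    auto n = stressCollect(200, 16);
    writeln("completed ", n, " collections without hanging");
}
```

Running it under a watchdog such as `timeout` makes a hang easy to tell apart from a slow run.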

https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #21 from Martin Nowak <code@dawg.eu> ---
Nope, that doesn't seem to be the problem.
All the thread exit code synchronizes on Thread.slock_nothrow.
It shouldn't even be possible to send a signal to an exiting thread, b/c they get removed from the thread list before that, and that is synchronized around the suspend loop.

Might still be a problem with the synchronization of m_isRunning and/or thread_cleanupHandler. Did your apps by any chance use thread cancellation or pthread_exit?

--

https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #22 from Илья Ярошенко <ilyayaroshenko@gmail.com> ---
(In reply to Martin Nowak from comment #21)
> Nope, that doesn't seem to be the problem.
> All the thread exit code synchronizes on Thread.slock_nothrow.
> It shouldn't even be possible to send a signal to an exiting thread, b/c
> they get removed from the thread list before that, and that is synchronized
> around the suspend loop.
>
> Might still be a problem with the synchronization of m_isRunning and/or thread_cleanupHandler. Did your apps by any chance use thread cancellation or pthread_exit?

No, but an Exception may be thrown in a thread.

--

https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #24 from Aleksei Preobrazhenskii <apreobrazhensky@gmail.com> ---
(In reply to Martin Nowak from comment #23)
> Anyone still experiencing this issue? Can't seem to fix it w/o reproducing it.

Since I changed signals to real-time and migrated to recent kernel I haven't seen that issue in the release builds, however, I tried running profile build recently (unfortunately I only did it for the old kernel) and it was consistently stuck every time. It might be something related to the issue, I will try to reproduce it with simpler code when I have time.

--

https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #25 from Martin Nowak <code@dawg.eu> ---
(In reply to Aleksei Preobrazhenskii from comment #24)
> Since I changed signals to real-time and migrated to recent kernel I haven't seen that issue in the release builds, however, I tried running profile build recently (unfortunately I only did it for the old kernel) and it was consistently stuck every time.

Thanks, good to hear from you.
There is a chance that these are kernel bugs fixed in 3.10 https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db and 3.18 https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0.

--

https://issues.dlang.org/show_bug.cgi?id=15939

--- Comment #26 from Илья Ярошенко <ilyayaroshenko@gmail.com> ---
Probably related issue
http://forum.dlang.org/post/igqwbqawrtxnigplgnka@forum.dlang.org

--
