GC buggy in windows?
November 08, 2019
We're experiencing some really strange, nasty GC behavior in our IOCP, I/O-heavy Windows app.

Sometimes it hangs with just: "Unable to load thread context"

I've spent the last three days experimenting and trying to narrow it down to find the exact cause :(

The problem is in the GC and its stop-the-world behavior.
In core.thread.osthread's `suspend` function there is basically:

```
SuspendThread( t.m_hndl );
GetThreadContext( t.m_hndl, &context );
```

In some cases `GetThreadContext` fails with `ERROR_GEN_FAILURE` (31), which leads to the error being thrown.

The first problem is that the application doesn't terminate after this error but just hangs.
That's because the thread is still suspended, and somewhere down the line `join` is called on this thread, which won't return - ever.

This is a nice blog post explaining that `SuspendThread` is actually asynchronous: https://devblogs.microsoft.com/oldnewthing/?p=44743

But it also states that when `GetThreadContext` is called on it, we can be sure that it is actually already suspended.

So what could lead to the error? Searching the Windows API documentation - nah, nothing, as usual..

Searching on the internet - sure, a lot of problems with some game engines using a GC (Unity) combined with some anticheat or antivirus programs - not our case.

OK, so I tried to compile a custom druntime (what a pleasure in itself) and found that:

* when you call Thread.yield and then try to get the context again, it doesn't help; the error persists
* the only way I could work around this problem was to resume the thread, call Thread.yield, suspend the thread and try to get the context again; usually the first or second try succeeds - HOORAY.

Then I spent a lot of time figuring out what is actually causing the error, and I have a theory that the problem is some I/O operation running in kernel context that can't finish while the thread is suspended, so the error is returned.

I ended up with this minimized test app that causes this error really fast.

```
import core.memory : GC;
import core.stdc.stdio;
import core.thread;
import std.random;
import std.range;

void main() {
	Thread t;
	while (true) {
		GC.collect();
		if (t is null || !t.isRunning) {
			t = new Thread(&threadProc);
			t.start();
		}
	}
}

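void threadProc() {
	// Open and close a file a random number of times to generate I/O
	// while the main thread keeps forcing collections.
	foreach (_; iota(uniform(0, 100))) {
		FILE* f = fopen("dummy", "a");
		if (f is null) continue; // fopen can fail; never pass null to fclose
		scope (exit) fclose(f);
	}
}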
```

compiled with: `dmd -m64 -debug test.d`
Tested on 64-bit Windows 10.

I definitely think that this is a bug in the Windows GC implementation.

Should I file it?

What seems to be a fix for both problems is:
* retry the resume/suspend/get context sequence on the failing thread a few more times - how many times?
* before returning the error, resume the thread so it can be joined (I haven't looked at where join is called from on termination)
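
To make the two points above concrete, here is a rough sketch of the retry policy. It is not druntime code: `suspendWithRetry` and its three delegate parameters are hypothetical names, and the limit of 8 attempts is an arbitrary assumption, since the right number is exactly the open question. The delegates stand in for the platform calls so the policy can be tried without a custom druntime.

```d
import core.thread : Thread;

/// Hypothetical helper, not part of druntime.
/// suspend, loadContext and resume stand in for the platform calls
/// on one target thread.
bool suspendWithRetry(scope bool delegate() suspend,
                      scope bool delegate() loadContext,
                      scope void delegate() resume,
                      uint maxAttempts = 8) // arbitrary limit (assumption)
{
	foreach (attempt; 0 .. maxAttempts) {
		if (!suspend())
			return false; // nothing to undo: the thread was never suspended

		if (loadContext())
			return true; // success: the thread stays suspended for the collection

		// Failure path: never leave the thread suspended, otherwise a
		// later join on it can block forever.
		resume();
		Thread.yield(); // let the target run before the next attempt
	}
	return false; // the caller decides: report an error or skip this collection
}

unittest
{
	// Simulated thread whose context cannot be read on the first two attempts.
	int suspended, resumed, contextCalls;
	const ok = suspendWithRetry(
		() { ++suspended; return true; },
		() => ++contextCalls > 2,
		() { ++resumed; });
	assert(ok);
	assert(suspended == 3 && resumed == 2);
}
```

On Windows the three delegates would presumably wrap `SuspendThread`, `GetThreadContext` and `ResumeThread` on the thread handle. When every attempt fails, the thread has already been resumed, so reporting the error no longer leaves it stuck for a later `join`.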

For me it is also questionable if terminating the application in this case is even the correct behavior. It might be better to scratch the GC attempt, resume the threads and retry on next collection? That might lead to other problems but as this occurs pretty rarely it might have a better outcome. Ideas?

PS: I'm beginning to understand the C/C++ devs who don't like GC languages ;-)
PPS: Now I hate Windows even more.. (normally a Linux dev)
PPPS: This kind of experience would definitely drive away devs who just need to get "shit done" and don't care about the tool used..
November 09, 2019
Just to confirm, this code snippet is meant to lock the entire process up and CPU usage go down to 0%?

If so, so far I have not confirmed it using dmd 2.087.0.
November 08, 2019
On Friday, 8 November 2019 at 14:39:34 UTC, rikki cattermole wrote:
> Just to confirm, this code snippet is meant to lock the entire process up and CPU usage go down to 0%?
>
> If so, so far I have not confirmed it using dmd 2.087.0.

Yep, it just outputs:

C:\Users\tcha\Workspace>test.exe

core.thread.osthread.ThreadError@src\core\thread\osthread.d(3176): Unable to load thread context
----------------

and hangs on Thread.join (0% CPU).

Tested both on physical and virtual windows 10 x86_64.
November 08, 2019
On Friday, 8 November 2019 at 14:47:18 UTC, tchaloupka wrote:
> On Friday, 8 November 2019 at 14:39:34 UTC, rikki cattermole wrote:
>> Just to confirm, this code snippet is meant to lock the entire process up and CPU usage go down to 0%?
>>
>> If so, so far I have not confirmed it using dmd 2.087.0.
>
> Yep, it just outputs:
>
> C:\Users\tcha\Workspace>test.exe
>
> core.thread.osthread.ThreadError@src\core\thread\osthread.d(3176): Unable to load thread context
> ----------------
>
> and hangs on Thread.join (0% CPU).
>
> Tested both on physical and virtual windows 10 x86_64.

We've just tried it on 5 more physical PCs (all win 10 x86_64 with ssd/m2, core i5/i7 of various models).
With dmd-master, dmd-2.086.1, dmd-2.089.0.

All ended up same within a few secs.
November 08, 2019
On Friday, 8 November 2019 at 15:01:26 UTC, tchaloupka wrote:
> All ended up same within a few secs.

I tried it a few times on my Windows 10 laptop with dmd 2.088, it just sat there for minutes using ~14% CPU (note: I have 8 logical processors) taking 10 Mb, and nothing appeared in the console. So unfortunately I couldn't reproduce it either.
November 09, 2019
On 09/11/2019 4:21 AM, Dennis wrote:
> On Friday, 8 November 2019 at 15:01:26 UTC, tchaloupka wrote:
>> All ended up same within a few secs.
> 
> I tried it a few times on my Windows 10 laptop with dmd 2.088, it just sat there for minutes using ~14% CPU (note: I have 8 logical processors) taking 10 Mb, and nothing appeared in the console. So unfortunately I couldn't reproduce it either.

Yes.

This is looking more and more like an environment issue, not a bug on druntime's end.

Potentially AV related (I use Avast) and I'm on Windows 10 Home 64bit.
November 08, 2019
On Friday, 8 November 2019 at 15:30:18 UTC, rikki cattermole wrote:
> On 09/11/2019 4:21 AM, Dennis wrote:
>> On Friday, 8 November 2019 at 15:01:26 UTC, tchaloupka wrote:
>>> All ended up same within a few secs.
>> 
>> I tried it a few times on my Windows 10 laptop with dmd 2.088, it just sat there for minutes using ~14% CPU (note: I have 8 logical processors) taking 10 Mb, and nothing appeared in the console. So unfortunately I couldn't reproduce it either.
>
> Yes.
>
> This is looking more and more like an environment issue, not a bug on druntime's end.
>
> Potentially AV related (I use Avast) and I'm on Windows 10 Home 64bit.

Thanks for the feedback, I've tried it on more servers where it actually worked as you both described.
In the end, the difference was that Eset antivirus was installed.
I had it completely disabled to rule out exactly this, but only after uninstalling it did it start to work.. So some crap was still active in it.

Well, it's still pretty unfortunate if some 3rd-party app can brick the GC runtime. We can't just say to customers "You've got Eset installed? Screw you it won't work together."

So bug or not a bug?
November 09, 2019
On 09/11/2019 4:54 AM, tchaloupka wrote:
> On Friday, 8 November 2019 at 15:30:18 UTC, rikki cattermole wrote:
>> On 09/11/2019 4:21 AM, Dennis wrote:
>>> On Friday, 8 November 2019 at 15:01:26 UTC, tchaloupka wrote:
>>>> All ended up same within a few secs.
>>>
>>> I tried it a few times on my Windows 10 laptop with dmd 2.088, it just sat there for minutes using ~14% CPU (note: I have 8 logical processors) taking 10 Mb, and nothing appeared in the console. So unfortunately I couldn't reproduce it either.
>>
>> Yes.
>>
>> This is looking more and more like an environment issue, not a bug on druntime's end.
>>
>> Potentially AV related (I use Avast) and I'm on Windows 10 Home 64bit.
> 
> Thanks for the feedback, I've tried it on more servers where it actually worked as you both described.
> In the end, the difference was that Eset antivirus was installed.
> I had it completely disabled to rule out exactly this, but only after uninstalling it did it start to work.. So some crap was still active in it.
> 
> Well, it's still pretty unfortunate if some 3rd-party app can brick the GC runtime. We can't just say to customers "You've got Eset installed? Screw you it won't work together."
> 
> So bug or not a bug?

Bug on Eset's side.

They are misbehaving in some way.

You can confirm that this is the case by installing an AV like Avast with full firewall capability turned on (you may need to pay, but it's worthwhile to confirm).

The reason I am confident that it is a bug on the AV side and not D's is because I don't remember hearing about this happening before.

It may be possible to add a workaround on our end, but we'll need Eset on our side for that I think.

Based on a quick Google search, it's looking like Eset considers this a feature, not a bug. https://forum.unity.com/threads/getthreadcontext-failed.140925/
November 08, 2019
On Friday, 8 November 2019 at 15:54:56 UTC, tchaloupka wrote:

> Well, it's still pretty unfortunate if some 3rd-party app can brick the GC runtime. We can't just say to customers "You've got Eset installed? Screw you it won't work together."

But isn't that the purpose of antivirus software? Isn't the whole point to allow it to be able to interfere with the execution of other programs?

November 09, 2019
On Friday, 8 November 2019 at 20:30:28 UTC, bachmeier wrote:
> On Friday, 8 November 2019 at 15:54:56 UTC, tchaloupka wrote:
>
>> Well, it's still pretty unfortunate if some 3rd-party app can brick the GC runtime. We can't just say to customers "You've got Eset installed? Screw you it won't work together."
>
> But isn't that the purpose of antivirus software? Isn't the whole point to allow it to be able to interfere with the execution of other programs?

It's not OK if the interference consists of injecting random bugs into legitimate programs. Antivirus programs have a pretty awful track record in this regard. I can't think of an antivirus product that I used that didn't turn out to be defective in one way or another.