April 19, 2018
On Wednesday, 18 April 2018 at 22:24:13 UTC, Matthias Klumpp wrote:
> size_t memSize = pooltable.maxAddr - minAddr;
> (https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982 )
> That wouldn't make sense for a pool size...
>
> The machine this is running on has 16G memory, at the time of the crash the software was using ~2.1G memory, with 130G virtual memory due to LMDB memory mapping (I wonder what happens if I reduce that...)

If big LMDB mapping causes a problem, try a test like this:
---
import core.memory;
void testLMDB()
{
    //how do you use it?
}
void test1()
{
    void*[][] a;
    foreach(i;0..100000)a~=new void*[10000];
    void*[][] b;
    foreach(i;0..100000)b~=new void*[10000];
    b=null;
    GC.collect();

    testLMDB();

    GC.collect();
    foreach(i;0..100000)a~=new void*[10000];
    foreach(i;0..100000)b~=new void*[10000];
    b=null;
    GC.collect();
}
---
April 19, 2018
Correction: use foreach(i;0..10000) instead. 100000 iterations allocate roughly 100000 * 10000 * 8 bytes = 8 GB per loop, which is too much.
April 19, 2018
On Thursday, 19 April 2018 at 08:30:45 UTC, Kagamin wrote:
> On Wednesday, 18 April 2018 at 22:24:13 UTC, Matthias Klumpp wrote:
>> size_t memSize = pooltable.maxAddr - minAddr;
>> (https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982 )
>> That wouldn't make sense for a pool size...
>>
>> The machine this is running on has 16G memory, at the time of the crash the software was using ~2.1G memory, with 130G virtual memory due to LMDB memory mapping (I wonder what happens if I reduce that...)
>
> If big LMDB mapping causes a problem, try a test like this:
> ---
> import core.memory;
> void testLMDB()
> {
>     //how do you use it?
> }
> void test1()
> {
>     void*[][] a;
>     foreach(i;0..100000)a~=new void*[10000];
>     void*[][] b;
>     foreach(i;0..100000)b~=new void*[10000];
>     b=null;
>     GC.collect();
>
>     testLMDB();
>
>     GC.collect();
>     foreach(i;0..100000)a~=new void*[10000];
>     foreach(i;0..100000)b~=new void*[10000];
>     b=null;
>     GC.collect();
> }
> ---

I tried something similar, with no effect.
Something that may be relevant, though: I occasionally get the following SIGABRT crash in the tool on machines which also show the SIGSEGV crash:
```
Thread 53 "appstream-gener" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fdfe98d4700 (LWP 7326)]
0x00007ffff5040428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff5040428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff504202a in __GI_abort () at abort.c:89
#2  0x0000000000780ae0 in core.thread.Fiber.allocStack(ulong, ulong) (this=0x7fde0758a680, guardPageSize=4096, sz=20480) at src/core/thread.d:4606
#3  0x00000000007807fc in _D4core6thread5Fiber6__ctorMFNbDFZvmmZCQBlQBjQBf (this=0x7fde0758a680, guardPageSize=4096, sz=16384, dg=...)
    at src/core/thread.d:4134
#4  0x00000000006f9b31 in _D3std11concurrency__T9GeneratorTAyaZQp6__ctorMFDFZvZCQCaQBz__TQBpTQBiZQBx (this=0x7fde0758a680, dg=...)
    at /home/ubuntu/dtc/dmd/generated/linux/debug/64/../../../../../druntime/import/core/thread.d:4126
#5  0x00000000006e9467 in _D5asgen8handlers11iconhandler5Theme21matchingIconFilenamesMFAyaSQCl5utils9ImageSizebZC3std11concurrency__T9GeneratorTQCfZQp (this=0x7fdea2747800, relaxedScalingRules=true, size=..., iname=...) at ../src/asgen/handlers/iconhandler.d:196
#6  0x00000000006ea75a in _D5asgen8handlers11iconhandler11IconHandler21possibleIconFilenamesMFAyaSQCs5utils9ImageSizebZ9__lambda4MFZv (this=0x7fde0752bd00)
    at ../src/asgen/handlers/iconhandler.d:392
#7  0x000000000082fdfa in core.thread.Fiber.run() (this=0x7fde07528580) at src/core/thread.d:4436
#8  0x000000000082fd5d in fiber_entryPoint () at src/core/thread.d:3665
#9  0x0000000000000000 in  ()
```

This is in the constructor of a std.concurrency.Generator:
auto gen = new Generator!string (...)
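
In simplified form (a sketch with made-up icon names, not the actual asgen code), the pattern is:
---
import std.concurrency : Generator, yield;
import std.stdio : writeln;

void main()
{
    // Each Generator runs its delegate on a new Fiber, so every such
    // construction allocates a fresh fiber stack (plus a guard page).
    auto gen = new Generator!string({
        yield("64x64/foo.png");
        yield("128x128/foo.png");
    });

    foreach (iconName; gen)
        writeln(iconName); // consume the yielded names
}
---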

I am not sure what to make of this yet, though... This goes deeper into DRuntime territory than I ever hoped to have to deal with, but apparently I need to now.

April 19, 2018
On Thursday, 19 April 2018 at 17:01:48 UTC, Matthias Klumpp wrote:
> Something that may be relevant, though: I occasionally get the following SIGABRT crash in the tool on machines which also show the SIGSEGV crash:
> ```
> Thread 53 "appstream-gener" received signal SIGABRT, Aborted.
> [Switching to Thread 0x7fdfe98d4700 (LWP 7326)]
> 0x00007ffff5040428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff5040428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007ffff504202a in __GI_abort () at abort.c:89
> #2  0x0000000000780ae0 in core.thread.Fiber.allocStack(ulong, ulong) (this=0x7fde0758a680, guardPageSize=4096, sz=20480) at src/core/thread.d:4606
> #3  0x00000000007807fc in _D4core6thread5Fiber6__ctorMFNbDFZvmmZCQBlQBjQBf (this=0x7fde0758a680, guardPageSize=4096, sz=16384, dg=...)
>     at src/core/thread.d:4134
> #4  0x00000000006f9b31 in _D3std11concurrency__T9GeneratorTAyaZQp6__ctorMFDFZvZCQCaQBz__TQBpTQBiZQBx (this=0x7fde0758a680, dg=...)
>     at /home/ubuntu/dtc/dmd/generated/linux/debug/64/../../../../../druntime/import/core/thread.d:4126
> #5  0x00000000006e9467 in _D5asgen8handlers11iconhandler5Theme21matchingIconFilenamesMFAyaSQCl5utils9ImageSizebZC3std11concurrency__T9GeneratorTQCfZQp (this=0x7fdea2747800, relaxedScalingRules=true, size=..., iname=...) at ../src/asgen/handlers/iconhandler.d:196
> #6  0x00000000006ea75a in _D5asgen8handlers11iconhandler11IconHandler21possibleIconFilenamesMFAyaSQCs5utils9ImageSizebZ9__lambda4MFZv (this=0x7fde0752bd00)
>     at ../src/asgen/handlers/iconhandler.d:392
> #7  0x000000000082fdfa in core.thread.Fiber.run() (this=0x7fde07528580) at src/core/thread.d:4436
> #8  0x000000000082fd5d in fiber_entryPoint () at src/core/thread.d:3665
> #9  0x0000000000000000 in  ()
> ```

You probably already figured that the new Fiber seems to be allocating its 16 KB stack, with an additional 4 KB guard page at its bottom, via a 20 KB mmap() call. The abort seems to be triggered by mprotect() returning -1, i.e., a failure to disallow all access to the guard page; checking `errno` should help.
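
In simplified form, that allocation looks roughly like this (a sketch, not the actual druntime code), which also shows where errno would reveal the cause:
---
import core.stdc.errno : errno;
import core.stdc.stdio : printf;
import core.stdc.stdlib : abort;
import core.sys.posix.sys.mman;

// Sketch: reserve stack plus guard page with mmap(), then revoke all
// access to the guard page at the bottom so a stack overflow traps
// instead of silently corrupting memory.
void* allocStack(size_t stackSize, size_t guardPageSize)
{
    immutable total = stackSize + guardPageSize; // e.g. 16K + 4K = 20K
    void* p = mmap(null, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p is MAP_FAILED)
        abort();
    if (mprotect(p, guardPageSize, PROT_NONE) == -1)
    {
        printf("mprotect failed, errno=%d\n", errno); // the interesting bit
        abort();
    }
    return p + guardPageSize; // usable stack starts above the guard page
}
---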
April 20, 2018
On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:
> On Thursday, 19 April 2018 at 17:01:48 UTC, Matthias Klumpp wrote:
>> Something that may be relevant, though: I occasionally get the following SIGABRT crash in the tool on machines which also show the SIGSEGV crash:
>> ```
>> Thread 53 "appstream-gener" received signal SIGABRT, Aborted.
>> [Switching to Thread 0x7fdfe98d4700 (LWP 7326)]
>> 0x00007ffff5040428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
>> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>> (gdb) bt
>> #0  0x00007ffff5040428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
>> #1  0x00007ffff504202a in __GI_abort () at abort.c:89
>> #2  0x0000000000780ae0 in core.thread.Fiber.allocStack(ulong, ulong) (this=0x7fde0758a680, guardPageSize=4096, sz=20480) at src/core/thread.d:4606
>> #3  0x00000000007807fc in _D4core6thread5Fiber6__ctorMFNbDFZvmmZCQBlQBjQBf (this=0x7fde0758a680, guardPageSize=4096, sz=16384, dg=...)
>>     at src/core/thread.d:4134
>> #4  0x00000000006f9b31 in _D3std11concurrency__T9GeneratorTAyaZQp6__ctorMFDFZvZCQCaQBz__TQBpTQBiZQBx (this=0x7fde0758a680, dg=...)
>>     at /home/ubuntu/dtc/dmd/generated/linux/debug/64/../../../../../druntime/import/core/thread.d:4126
>> #5  0x00000000006e9467 in _D5asgen8handlers11iconhandler5Theme21matchingIconFilenamesMFAyaSQCl5utils9ImageSizebZC3std11concurrency__T9GeneratorTQCfZQp (this=0x7fdea2747800, relaxedScalingRules=true, size=..., iname=...) at ../src/asgen/handlers/iconhandler.d:196
>> #6  0x00000000006ea75a in _D5asgen8handlers11iconhandler11IconHandler21possibleIconFilenamesMFAyaSQCs5utils9ImageSizebZ9__lambda4MFZv (this=0x7fde0752bd00)
>>     at ../src/asgen/handlers/iconhandler.d:392
>> #7  0x000000000082fdfa in core.thread.Fiber.run() (this=0x7fde07528580) at src/core/thread.d:4436
>> #8  0x000000000082fd5d in fiber_entryPoint () at src/core/thread.d:3665
>> #9  0x0000000000000000 in  ()
>> ```
>
> You probably already figured that the new Fiber seems to be allocating its 16 KB stack, with an additional 4 KB guard page at its bottom, via a 20 KB mmap() call. The abort seems to be triggered by mprotect() returning -1, i.e., a failure to disallow all access to the guard page; checking `errno` should help.

Yup, I did that already; it just took a really long time to run because, along with the change to print errno, I also enabled detailed GC profiling (via the PRINTF* debug options). Enabling the INVARIANT option for the GC is completely broken, by the way: I forced the compile to work by casting to shared, with the result that the GC locks up forever at the start of the program.

Anyway, I think for a change I actually produced some useful information via the GC debug options. Given the following crash:
```
#0  0x00000000007f1d94 in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., ptop=0x7fdfce7fc010, pbot=0x7fdfcdbfc010)
    at src/gc/impl/conservative/gc.d:1990
        p1 = 0x7fdfcdbfc010
        p2 = 0x7fdfce7fc010
        stackPos = 0
[...]
```
The scanned range seemed fairly odd to me (it spans exactly 12 MiB), so I searched for it in the (very verbose!) GC debug output, which yielded:
```
235.244445: 0xc4f090.Gcx::addRange(0x8264230, 0x8264270)
235.244460: 0xc4f090.Gcx::addRange(0x7fdfcdbfc010, 0x7fdfce7fc010)
235.253861: 0xc4f090.Gcx::addRange(0x8264300, 0x8264340)
235.253873: 0xc4f090.Gcx::addRange(0x8264390, 0x82643d0)
```
So, something is calling addRange explicitly there, causing the GC to scan a range that it shouldn't scan. Since my own code doesn't add ranges to the GC, and the generated code from girtod/GtkD looks fine to me, I am currently looking into EMSI containers [1] as the possible culprit.
That library being the issue would also make perfect sense, because this problem only started to appear with such frequency after the containers were added (there was a GC-related crash before, but that might have been a different one).

So, I will look into that addRange call next.

[1]: https://github.com/dlang-community/containers

April 20, 2018
On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:
> [...]
> Yup, I did that already; it just took a really long time to run because, along with the change to print errno [...]

I forgot to mention: the error code was 12, ENOMEM, so this is actually likely not a relevant issue after all.

April 20, 2018
On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:
> On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:
>> [...]
>
> Yup, I did that already; it just took a really long time to run because, along with the change to print errno, I also enabled detailed GC profiling (via the PRINTF* debug options). Enabling the INVARIANT option for the GC is completely broken, by the way: I forced the compile to work by casting to shared, with the result that the GC locks up forever at the start of the program.
>
> [...]

I think the order of operations is wrong; here is an example from containers:

allocator.dispose(buckets);
static if (useGC)
    GC.removeRange(buckets.ptr);

If the GC runs a collection between dispose and removeRange, it will likely segfault.
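
Presumably the fix is just to swap the two operations, so the GC can never scan an already-freed buffer:

static if (useGC)
    GC.removeRange(buckets.ptr);
allocator.dispose(buckets);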

> [1]: https://github.com/dlang-community/containers
April 20, 2018
On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
> #2  0x00007ffff51341c8 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambda2MFNbKxSQCpQCpQCfZi (e=...) at treap.d:47
>         dg = {context = 0x7fffffffc140 "\320\065\206", funcptr = 0x7ffff5121d10 <_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi>}
> #3  0x00007ffff5134238 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (node=0x7568700, dg=...) at treap.d:221

Indeed, this is iteration over Treap!Range used to store ranges added with addRange method.
https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L2182
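
For context, addRange registers non-GC memory that may contain pointers into the GC heap, so the GC scans that range on every collection. A minimal sketch of the contract (hypothetical example, not containers code):
---
import core.memory : GC;
import core.stdc.stdlib : malloc, free;

void example()
{
    enum n = 16;
    // C heap memory that will hold pointers to GC objects:
    auto buf = cast(void**) malloc(n * (void*).sizeof);
    // Register it so the GC scans it and keeps those objects alive:
    GC.addRange(buf, n * (void*).sizeof);
    buf[0] = cast(void*) new int;
    // ...
    // Unregister before the memory goes away:
    GC.removeRange(buf);
    free(buf);
}
---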
April 20, 2018
On Friday, 20 April 2018 at 05:32:32 UTC, Dmitry Olshansky wrote:
> On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:
>> On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:
>>> [...]
>>
>> Yup, I did that already; it just took a really long time to run because, along with the change to print errno, I also enabled detailed GC profiling (via the PRINTF* debug options). Enabling the INVARIANT option for the GC is completely broken, by the way: I forced the compile to work by casting to shared, with the result that the GC locks up forever at the start of the program.
>>
>> [...]
>
> I think the order of operations is wrong; here is an example from containers:
>
> allocator.dispose(buckets);
> static if (useGC)
>     GC.removeRange(buckets.ptr);
>
> If the GC runs a collection between dispose and removeRange, it will likely segfault.

Indeed! It's also the only place where this is shuffled around; all other parts of the containers library do this properly.
What I wonder about, though, is that the crash usually appeared in an explicit GC.collect() call while the application was not running multiple threads. At that point the GC, as far as I know, couldn't have triggered between the buckets being disposed of and the ranges being removed. But maybe that assumption is wrong.
Either way, this crash would be explained perfectly by that bug.

April 20, 2018
On Friday, 20 April 2018 at 18:30:30 UTC, Matthias Klumpp wrote:
> On Friday, 20 April 2018 at 05:32:32 UTC, Dmitry Olshansky wrote:
>> On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:
>>> On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:
>>>> [...]
>>> [...]
>>
>> I think the order of operations is wrong; here is an example from containers:
>>
>> allocator.dispose(buckets);
>> static if (useGC)
>>     GC.removeRange(buckets.ptr);
>>
>> If the GC runs a collection between dispose and removeRange, it will likely segfault.
>
> Indeed! It's also the only place where this is shuffled around; all other parts of the containers library do this properly.
> What I wonder about, though, is that the crash usually appeared in an explicit GC.collect() call while the application was not running multiple threads. At that point the GC, as far as I know, couldn't have triggered between the buckets being disposed of and the ranges being removed. But maybe that assumption is wrong.
> Either way, this crash would be explained perfectly by that bug.

Turns out that was indeed the case! I created a small test case that reproduces the issue very reliably on all machines I tested it on. After reordering the dispose/removeRange calls, the crashes went away completely.
I submitted a pull request to the containers library to fix this issue: https://github.com/dlang-community/containers/pull/107
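
The test case boils down to roughly this shape (a simplified sketch, not the exact code from the pull request):
---
import core.memory : GC;
import std.experimental.allocator : dispose, makeArray;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // Buggy order, as in containers' HashMap: free first, unregister later.
    auto buckets = Mallocator.instance.makeArray!(void*)(8192);
    auto p = buckets.ptr;
    GC.addRange(p, buckets.length * (void*).sizeof);
    foreach (ref b; buckets)
        b = cast(void*) new int; // GC-managed elements, as in asgen

    Mallocator.instance.dispose(buckets); // memory is freed here...
    GC.collect();      // ...so the GC can scan a dangling range and crash
    GC.removeRange(p); // the fix: do this BEFORE dispose()
}
---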

I will also try to get the patch into the affected components in Debian and Ubuntu, so we might have a chance of updating the software center metadata for Ubuntu before 18.04 LTS releases next week.
Since asgen uses HashMaps for pretty much everything, and most of the time with GC-managed elements, this should improve the stability of the application greatly.

Thanks a lot for the help in debugging this, I learned a lot about DRuntime internals in the process. Also, it is no exaggeration to say that without the helpful community around D, the appstream-generator project would not have been written in D (there was a Rust prototype once...) and I would probably not be using D as much (or at all).
Thank you :-)