Jump to page: 1 24  
Page
Thread overview
Issues with debugging GC-related crashes #2
Apr 16, 2018
Matthias Klumpp
Apr 16, 2018
Matthias Klumpp
Apr 17, 2018
Kagamin
Apr 17, 2018
Kagamin
Apr 17, 2018
Matthias Klumpp
Apr 18, 2018
Kagamin
Apr 18, 2018
Matthias Klumpp
Apr 18, 2018
Johannes Pfau
Apr 18, 2018
kinke
Apr 18, 2018
Matthias Klumpp
Apr 19, 2018
Johannes Pfau
Apr 19, 2018
Johannes Pfau
Apr 19, 2018
Johannes Pfau
Apr 19, 2018
Kagamin
Apr 19, 2018
Kagamin
Apr 19, 2018
Matthias Klumpp
Apr 19, 2018
kinke
Apr 20, 2018
Matthias Klumpp
Apr 20, 2018
Matthias Klumpp
Apr 20, 2018
Dmitry Olshansky
Apr 20, 2018
Matthias Klumpp
Apr 20, 2018
Matthias Klumpp
Apr 23, 2018
Dmitry Olshansky
Apr 18, 2018
Matthias Klumpp
Apr 19, 2018
Kagamin
Apr 19, 2018
Kagamin
Apr 18, 2018
kinke
Apr 18, 2018
Matthias Klumpp
Apr 18, 2018
Matthias Klumpp
Apr 18, 2018
negi
Apr 20, 2018
Kagamin
April 16, 2018
Hi!

I am developing a software called AppStream Generator in D, which is the default way of Debian and Ubuntu (and Arch Linux) to produce metadata for their software center applications.
D is working well for that purpose now, and - except for high memory usage - there are no issues on Debian. On Ubuntu, however, the software regularly crashes when the GC tries to mark a memory range that is not accessible to it (likely already freed).

The software is compiled using LDC 1.8.0, and uses D language bindings for C libraries generated by gir-to-d[1] as well as the EMSI containers library[2]. All of these are loaded as shared libraries.
You can find the source-code of appstream-generator on Github[3].

The code uses std.typecons.scoped occasionally, does no GC allocations in destructors and does nothing to mess with the GC in general. There are a few calls to GC.add/removeRoot in the gir-to-d generated code (ObjectG.d), but those are very unlikely to cause issues (removing them did yield the same crash, and the same code is used by more projects).

Running the tool under gdb yields backtraces like:
```
Thread 1 "appstream-gener" received signal SIGSEGV, Segmentation fault.
0x00007ffff5121168 in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>,
    ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>) at gc.d:1990
1990    gc.d: No such file or directory.
(gdb) bt full
#0  0x00007ffff5121168 in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>) at gc.d:1990
        p = 0xe256e <error: Cannot access memory at address 0xe256e>
        p1 = 0x7fcf4d721010
        p2 = 0x7fcf4e321010
        stackPos = 0
        stack =
            {{pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x0, ptop = 0x7fcf4f6f3000 "are/icons/Moka/16x16/apps/AdobeReader12.png\n/usr/share/icons/Moka/16x16/apps/AdobeReader8.png\n/usr/share/icons/Moka/16x16/apps/AdobeReader9.png\n/usr/share/icons/Moka/16x16/apps/Blender.png\n/usr/share/"...}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x1083c00 "0V\001\340\337\177", ptop = 0x0}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x18 <error: Cannot access memory at address 0x18>}, {pbot = 0x16 <error: Cannot access memory at address 0x16>, ptop = 0x146a650 ""}, {pbot = 0x0, ptop = 0x7fcf4f68c000 "256x256/apps/homebank.png\n/usr/share/icons/Moka/256x256/apps/hp-logo.png\n/usr/share/icons/Moka/256x256/apps/hugin.png\n/usr/share/icons/Moka/256x256/apps/hydrogen.png\n/usr/share/icons/Moka/256x256/apps"...}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x1083c00 "0V\001\340\337\177", ptop = 0x7fcf4f6bc000 "ons/Moka/48x48/places/distributor-logo-mageia.png\n/usr/share/icons/Moka/48x48/places/distributor-logo-mandriva.png\n/usr/share/icons/Moka/48x48/places/distributor-logo-manjaro.png\n/usr/share/icons/Moka"...}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x18 <error: Cannot access memory at address 0x18>}, {pbot = 0x16 <error: Cannot access memory at address 0x16>, ptop = 0x146a650 ""}, {pbot = 0x0, ptop = 0x7fcf4f466000 "/opera-extension.svg\n/usr/share/icons/Numix/64/mimetypes/package-gdebi.svg\n/usr/share/icons/Numix/64/mimetypes/package-x-generic.svg\n/usr/share/icons/Numix/64/mimetypes/package.svg\n/usr/share/icons/Nu"...}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x1083c00 "0V\001\340\337\177", ptop = 0x7fcf4f01e000 "pirus-Adapta-Nokto/16x16/actions/upcomingevents-amarok.svg\n/usr/share/icons/Papirus-Adapta-Nokto/16x16/actions/upindicator.svg\n/usr/share/icons/Papirus-Adapta-Nokto/16x16/actions/upload-media.svg\n/usr"...}, {pbot = 0x1 <error: Cannot access memory at address 0x1>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x1083c00 "0V\001\340\337\177", ptop = 0x7fdfd8faa000 "icons/ContrastHigh/32x32/status/user-offline.png\n/usr/share/icons/ContrastHigh/32x32/status/user-status-pending.png\n/usr/share/icons/ContrastHigh/32x32/status/user-trash-full.png\n/usr/share/icons/Cont"...}, {pbot = 0x75671e0 "P", ptop = 0x75671e0 "P"}, {pbot = 0x75671a0 "\020\203\244\004", ptop = 0x7fffffffbc00 "s\f"}, {pbot = 0x0, ptop = 0x7567420 "P"}, {pbot = 0x7567420 "P", ptop = 0xc735e0 ""}, {pbot = 0x1 <error: Cannot access memory at address 0x1>, ptop = 0xc73 <error: Cannot access memory at address 0xc73>}, {pbot = 0xc735e <error: Cannot access memory at address 0xc735e>, ptop = 0xc735e0 ""}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x18 <error: Cannot access memory at address 0x18>}, {pbot = 0x16 <error: Cannot access memory at address 0x16>, ptop = 0x146a650 ""}, {pbot = 0x0, ptop = 0x7568230 "P"}, {pbot = 0x7568230 "P", ptop = 0x7568230 "P"}, {pbot = 0x75681f0 "\220\202\337\006", ptop = 0x7fffffffbc90 "\300\274\377\377\377\177"}}
        pcache = 0
        pools = 0x1083c00
        highpool = 59
        minAddr = 0x7fcf45721000 "`&<\365\377\177"
        memSize = 209153867776
        base = 0x17 <error: Cannot access memory at address 0x17>
        top = 0xe256e0 ""
#1  0x00007ffff5121d46 in _D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi (__applyArg0=...) at gc.d:2188
        range = {pbot = 0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop = 0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>, ti = 0x0}
        this =
            @0x8635d0: {rootsLock = {impl = {val = 1, contention = 0 '\000'}}, rangesLock = {impl = {val = 1, contention = 0 '\000'}}, roots = {root = 0x0, rand48 = {rng_state = 8187282149633}}, ranges = {root = 0x703d2d0, rand48 = {rng_state = 637908263724}}, log = false, disabled = 0, pooltable = {pools = 0x1083c00, npools = 60, _minAddr = 0x7fcf45721000 "`&<\365\377\177", _maxAddr = 0x7ffff7fcd000 "\327\207\017+"}, bucket = {0x7fdeebfaf6f0, 0x7fdeebfff480, 0x7fdeebffa200, 0x7fdeebffb880, 0x7fdeebffcc00, 0x0, 0x7fdeebffec00, 0x7fdeebfed800}, smallCollectThreshold = 494324, largeCollectThreshold = 320094, usedSmallPages = 507904, usedLargePages = 290132, mappedPages = 813954, toscan = {_length = 0, _p = 0x7ffff7ebd000, _cap = 4096}}
#2  0x00007ffff51341c8 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambda2MFNbKxSQCpQCpQCfZi (e=...) at treap.d:47
        dg = {context = 0x7fffffffc140 "\320\065\206", funcptr = 0x7ffff5121d10 <_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi>}
#3  0x00007ffff5134238 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (node=0x7568700, dg=...) at treap.d:221
        result = 0
```
See https://paste.debian.net/1020595/ and https://paste.debian.net/1020596/ for long backtraces (and https://paste.debian.net/1020597/ for a short version).

For reasons unknown, this issue only happens at Ubuntu, and only occasionally, in a way that it is frequent enough to make the software impossible to use, but not persistent enough that running Dustmite on the code would make sense.

Given that the code does nothing (that I am aware of) that would mess with the GC, I am pretty much out of ideas by now and started to assume a bug in LDC or the D GC in general now.

Does anyone of you have an idea what is going on here? Is there anything more to try out to find out the root cause of the issue and figure out if there is a bug (and where to report it)?

The only major difference between Ubuntu and Debian in terms of how things are compiled is that Ubuntu enabled --as-needed linking options, which doesn't seem to be relevant here.

I would be happy for any help with figuring out what this issue actually is!

Regards,
    Matthias

[1]: https://github.com/gtkd-developers/gir-to-d
[2]: https://github.com/dlang-community/containers
[3]: https://github.com/ximion/appstream-generator


April 16, 2018
On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
> [...]
> The code uses std.typecons.scoped occasionally, does no GC allocations in destructors and does nothing to mess with the GC in general. There are a few calls to GC.add/removeRoot in the gir-to-d generated code (ObjectG.d), but those are very unlikely to cause issues (removing them did yield the same crash, and the same code is used by more projects).
> [...]

Another thing to mention is that the software uses LMDB[1] and mmaps huge amounts of data into memory (gigabyte range).
Not sure if that information is relevant at all though.

[1]: https://symas.com/lmdb/technical/

April 17, 2018
On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
> The code uses std.typecons.scoped occasionally, does no GC allocations in destructors and does nothing to mess with the GC in general.

What do you use destructors for?
April 17, 2018
Other stuff to try:
1. run application compiled on debian against ubuntu libs
2. can you mix dependencies from debian and ubuntu?
April 17, 2018
On Tuesday, 17 April 2018 at 08:23:07 UTC, Kagamin wrote:
> Other stuff to try:
> 1. run application compiled on debian against ubuntu libs
> 2. can you mix dependencies from debian and ubuntu?

I haven't tried that yet (next on my todo list), if I do run the program compiled with address sanitizer on Debian, I do get errors like:
```
AddressSanitizer:DEADLYSIGNAL
=================================================================
==25964==ERROR: AddressSanitizer: SEGV on unknown address 0x7fac8db3f800 (pc 0x7fac9c433430 bp 0x000000000008 sp 0x7ffc92be3dd0 T0)
==25964==The signal is caused by a READ memory access.
    #0 0x7fac9c43342f in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa142f)
    #1 0x7fac9c433a2f in _D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa1a2f)
    #2 0x7fac9c459ad4 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ad4)
    #3 0x7fac9c459ac6 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
    #4 0x7fac9c459ac6 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
    #5 0x7fac9c459ac6 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
    #6 0x7fac9c459a51 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7a51)
    #7 0x7fac9c430f26 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0x9ef26)
    #8 0x7fac9c431226 in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs18fullCollectNoStackMFNbZ2goFNbPSQEaQEaQDyQEj3GcxZmTQvZQDfMFNbKQBgZm (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0x9f226)
    #9 0x7fac9c4355d0 in gc_term (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa35d0)
    #10 0x7fac9c443ab2 in rt_term (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1ab2)
    #11 0x7fac9c443e65 in _D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZv (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1e65)
    #12 0x7fac9c443d0b in _d_run_main (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1d0b)
    #13 0x7fac9b9cfa86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21a86)
    #14 0x55acd1dbe1d9 in _start (/home/matthias/Development/AppStream/generator/build/src/asgen/appstream-generator+0xba1d9)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa142f) in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv
==25964==ABORTING
```
So, I don't think this bug is actually limited to Ubuntu, it just shows up there more often for some reason.
April 18, 2018
You can call GC.collect at some points in the program to see if they can trigger the crash https://dlang.org/library/core/memory/gc.collect.html
If you link against debug druntime, GC can check invariants for correctness of its structures. There's a number of debugging options for GC, though not sure which ones are enabled in default debug build of druntime: https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1388
April 18, 2018
On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
> You can call GC.collect at some points in the program to see if they can trigger the crash

I already do that, and indeed I get crashes. I could throw those calls into every function though, or make a minimal pool size, maybe that yields something...

> https://dlang.org/library/core/memory/gc.collect.html
> If you link against debug druntime, GC can check invariants for correctness of its structures. There's a number of debugging options for GC, though not sure which ones are enabled in default debug build of druntime: https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1388

I get compile errors for the INVARIANT option, and I don't actually know how to deal with those properly:
```
src/gc/impl/conservative/gc.d(1396): Error: shared mutable method core.internal.spinlock.SpinLock.lock is not callable using a shared const object
src/gc/impl/conservative/gc.d(1396):        Consider adding const or inout to core.internal.spinlock.SpinLock.lock
src/gc/impl/conservative/gc.d(1403): Error: shared mutable method core.internal.spinlock.SpinLock.unlock is not callable using a shared const object
src/gc/impl/conservative/gc.d(1403):        Consider adding const or inout to core.internal.spinlock.SpinLock.unlock
```

Commenting out the locks (eww!!) yields no change in behavior though.

The crashes always appear in https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L1990

Meanwhile, I also tried to reproduce the crash locally in a chroot, with no result. All libraries used between the machine where the crashes occur and my local machine were 100% identical, the only differences I am aware of are obviously the hardware (AWS cloud vs. home workstation) and the Linux kernel (4.4.0 vs 4.15.0)

The crash happens when built with LDC or DMD, that doesn't influence the result. Copying over a binary from the working machine to the crashing one also results in the same errors.

I am completely out of ideas here. Since I think I can rule out a hardware fault at Amazon, I don't even know what else would make sense to try.
April 18, 2018
On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
> There's a number of debugging options for GC, though not sure which
> ones are enabled in default debug build of druntime

Speaking for LDC, none are, they all need to be enabled explicitly. There's a whole bunch of them (https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L20-L31), so enabling most of them would surely help in tracking this down, but it's most likely still going to be very tedious.
I'm not really surprised that there are compilation errors when enabling the debug options, that's a likely fate of untested code unfortunately.

If possible, I'd give static linking a try.
April 18, 2018
Am Wed, 18 Apr 2018 17:40:56 +0000 schrieb Matthias Klumpp:
> 
> The crashes always appear in https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/
gc.d#L1990
> 

The important point to note here is that this is not one of these 'GC collected something because it was not reachable' bugs. A crash in the GC mark routine means it somehow scans an invalid address range. Actually, I've seen this before...


> Meanwhile, I also tried to reproduce the crash locally in a chroot, with
> no result. All libraries used between the machine where the crashes
> occur and my local machine were 100% identical,
> the only differences I am aware of are obviously the hardware (AWS cloud
> vs. home workstation) and the Linux kernel (4.4.0 vs 4.15.0)
> 
> The crash happens when built with LDC or DMD, that doesn't influence the result. Copying over a binary from the working machine to the crashing one also results in the same errors.


Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236

it took us quite some time to reduce and debug this:

https://github.com/D-Programming-GDC/GDC/pull/236/commits/ 5021b8d031fcacac52ee43d83508a5d2856606cd

So I wondered why I couldn't find this in the upstream druntime code. Turns out our pull request has never been merged....

https://github.com/dlang/druntime/pull/1678


-- 
Johannes
April 18, 2018
On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
> ...

This reminds me of (otherwise unrelated) problems I had involving Linux 4.15.

If you feel out of ideas, I suggest you take a look at the kernels.  It might
be that Ubuntu is turning some security-related knob in a different direction
than Debian.  Or it might be some bug in 4.15 (I found it to be quite buggy,
specially during the first few point releases; 4.15 was the first upstream
release including large amounts of meltdown/spectre-related work).
« First   ‹ Prev
1 2 3 4