April 18, 2018
On Wednesday, 18 April 2018 at 18:55:48 UTC, kinke wrote:
> On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
>> There's a number of debugging options for GC, though not sure which
>> ones are enabled in default debug build of druntime
>
> Speaking for LDC, none are, they all need to be enabled explicitly. There's a whole bunch of them (https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L20-L31), so enabling most of them would surely help in tracking this down, but it's most likely still going to be very tedious.
> I'm not really surprised that there are compilation errors when enabling the debug options, that's a likely fate of untested code unfortunately.

Yeah... Maybe adding a CI build with "enable all the things" would make sense to combat that...
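For context, those switches are ordinary D `debug(Identifier)` conditionals, so they only take effect when druntime itself is rebuilt with the matching flag; setting the flag for the application alone changes nothing. A minimal illustration of the mechanism (not actual druntime code; the identifier names are taken from the list linked above):
```
// Illustration only, not druntime code: the GC diagnostics are guarded by
// debug identifiers like the ones listed at the top of
// gc/impl/conservative/gc.d. The guarded code is only compiled in when
// druntime itself is rebuilt with the matching identifier, e.g.
//   dmd  ... -debug=SENTINEL -debug=MEMSTOMP
//   ldc2 ... -d-debug=SENTINEL -d-debug=MEMSTOMP
debug (SENTINEL)
    pragma(msg, "GC sentinel checks compiled in");

debug (MEMSTOMP)
    pragma(msg, "GC memory-stomp checks compiled in");
```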

> If possible, I'd give static linking a try.

I tried that, at least linking druntime and Phobos statically. I did not, however, link all the things statically.
That is something to try (at least statically linking all the D libraries).

April 18, 2018
On Wednesday, 18 April 2018 at 20:40:52 UTC, Matthias Klumpp wrote:
> [...]
>> If possible, I'd give static linking a try.
>
> I tried that, at least linking druntime and Phobos statically. I did not, however, link all the things statically.
> That is something to try (at least statically linking all the D libraries).

No luck...
```
#0  0x00000000007f10e8 in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., ptop=0x7fcf6a11b010, pbot=0x7fcf6951b010)
    at src/gc/impl/conservative/gc.d:1990
        p1 = 0x7fcf6951b010
        p2 = 0x7fcf6a11b010
        stackPos = 0
        stack =
            {{pbot = 0x7fffffffcc60, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87b4118, ptop = 0x87b4118}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcca0, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af258, ptop = 0x87af258}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcce0, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af158, ptop = 0x87af158}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcd20, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af0d8, ptop = 0x87af0d8}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fdf6b265000, ptop = 0x69b96a0}, {pbot = 0x28, ptop = 0x7fcf5951b000}, {pbot = 0x309eab7000, ptop = 0x7fdf6b265000}, {pbot = 0x0, ptop = 0x0}, {pbot = 0x1381d00, ptop = 0x1c}, {pbot = 0x1d, ptop = 0x1c}, {pbot = 0x1a44100, ptop = 0x1a4410}, {pbot = 0x1a44, ptop = 0x4}, {pbot = 0x7fdf6b355000, ptop = 0x69b96a0}, {pbot = 0x28, ptop = 0x7fcf5951b000}, {pbot = 0x309eab7000, ptop = 0x4ac0}, {pbot = 0x4a, ptop = 0x0}, {pbot = 0x1381d00, ptop = 0x1c}, {pbot = 0x1d, ptop = 0x1c}, {pbot = 0x4ac00, ptop = 0x4ac0}, {pbot = 0x4a, ptop = 0x4}}
        pcache = 0
        pools = 0x69b96a0
        highpool = 40
        minAddr = 0x7fcf5951b000
        memSize = 208820465664
        base = 0xaef0
        top = 0xae
        p = 0x4618770
        pool = 0x0
        low = 110859936
        high = 40
        mid = 140528533483520
        offset = 208820465664
        biti = 8329709
        pn = 142275872
        bin = 1
        offsetBase = 0
        next = 0xc4cc80
        next = {pbot = 0x7fffffffcbe0, ptop = 0x7f19ed <_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi+57>}
        __r292 = 0x7fffffffd320
        __key293 = 8376632
        rng = @0x0: <error reading variable>
#1  0x00000000007f19ed in _D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi (this=0x7fffffffd360, __applyArg0=...)
    at src/gc/impl/conservative/gc.d:2188
        range = {pbot = 0x7fcf6951b010, ptop = 0x7fcf6a11b010, ti = 0x0}
#2  0x00000000007fd161 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambda2MFNbKxSQCpQCpQCfZi (this=0x7fffffffd320, e=...) at src/rt/util/container/treap.d:47
#3  0x00000000007fd539 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x80396c0) at src/rt/util/container/treap.d:221
        result = 0
#4  0x00000000007fd565 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x87c8140) at src/rt/util/container/treap.d:224
        result = 0
#5  0x00000000007fd516 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000950) at src/rt/util/container/treap.d:218
        result = 16844032
#6  0x00000000007fd516 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000a50) at src/rt/util/container/treap.d:218
        result = 0
#7  0x00000000007fd516 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000c50) at src/rt/util/container/treap.d:218
        result = 0

[etc...]
#37 0x000000000077e889 in core.memory.GC.collect() () at src/core/memory.d:207
#38 0x00000000006b4791 in asgen.engine.Engine.gcCollect() (this=0x7ffff7ee13c0) at ../src/asgen/engine.d:122
```


April 18, 2018
On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
> Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236

Interesting, but I don't think it applies here. Both start and end addresses are 16-byte aligned, and neither can be accessed according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>`). That's quite interesting too: `memSize = 209153867776`. I don't know what exactly it is, but it's a pretty large number (~194 GB).
April 18, 2018
On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
> [...]
>
> Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236
>
> it took us quite some time to reduce and debug this:
>
> https://github.com/D-Programming-GDC/GDC/pull/236/commits/5021b8d031fcacac52ee43d83508a5d2856606cd
>
> So I wondered why I couldn't find this in the upstream druntime code. Turns out our pull request has never been merged....
>
> https://github.com/dlang/druntime/pull/1678

Just to be sure, I applied your patch, but unfortunately I still get the same result...

On Wednesday, 18 April 2018 at 20:38:20 UTC, negi wrote:
> On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
>> ...
>
> This reminds me of (otherwise unrelated) problems I had involving Linux 4.15.
>
> If you feel out of ideas, I suggest you take a look at the kernels.  It might
> be that Ubuntu is turning some security-related knob in a different direction
> than Debian.  Or it might be some bug in 4.15 (I found it to be quite buggy,
> especially during the first few point releases; 4.15 was the first upstream
> release including large amounts of meltdown/spectre-related work).

All the crashes are happening on a 4.4 kernel though... I am currently considering digging out a 4.4 kernel here to see whether that lets me reproduce the crash locally.
April 18, 2018
On Wednesday, 18 April 2018 at 22:12:12 UTC, kinke wrote:
> On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
>> Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236
>
> Interesting, but I don't think it applies here. Both start and end addresses are 16-byte aligned, and neither can be accessed according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>`). That's quite interesting too: `memSize = 209153867776`. I don't know what exactly it is, but it's a pretty large number (~194 GB).

size_t memSize = pooltable.maxAddr - minAddr;
(https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982)
That wouldn't make sense for a pool size...

The machine this is running on has 16G of memory; at the time of the crash the software was using ~2.1G of memory, with 130G of virtual memory due to the LMDB memory mapping (I wonder what happens if I reduce that...)
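If I do try reducing it, something like this is what I have in mind; just a sketch, not code from asgen, and the `extern(C)` declarations below are a minimal hand-written slice of LMDB's C API (`mdb_env_create`, `mdb_env_set_mapsize`):
```
// Sketch only: shrink the LMDB map size to test whether the huge virtual
// address range plays a role in the GC crash. The declarations are a
// minimal hand-written binding to LMDB's C API, not taken from asgen.
extern (C)
{
    struct MDB_env;                                      // opaque handle
    int mdb_env_create(MDB_env** env);
    int mdb_env_set_mapsize(MDB_env* env, size_t size);
}

void openWithSmallerMap()
{
    MDB_env* env;
    mdb_env_create(&env);
    // e.g. 8 GiB instead of the current ~130 GiB mapping
    mdb_env_set_mapsize(env, 8UL << 30);
    // ... mdb_env_open() and the rest of the setup would follow here ...
}
```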

April 19, 2018
On Wed, 18 Apr 2018 22:24:13 +0000, Matthias Klumpp wrote:

> On Wednesday, 18 April 2018 at 22:12:12 UTC, kinke wrote:
>> On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
>>> Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236
>>
>> Interesting, but I don't think it applies here. Both start and end addresses are 16-byte aligned, and neither can be accessed according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>`). That's quite interesting too: `memSize = 209153867776`. I don't know what exactly it is, but it's a pretty large number (~194 GB).
> 
> size_t memSize = pooltable.maxAddr - minAddr;
> (https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982)
> That wouldn't make sense for a pool size...
> 
> The machine this is running on has 16G of memory; at the time of the crash the software was using ~2.1G of memory, with 130G of virtual memory due to the LMDB memory mapping (I wonder what happens if I reduce that...)

I see. Then I'd try to debug where the range originally comes from: try adding breakpoints in _d_dso_registry, registerGCRanges, and similar functions here: https://github.com/dlang/druntime/blob/master/src/rt/sections_elf_shared.d#L421

Generally, a crash produced in gdb should be reproducible when you restart the program in gdb. So once you have a crash, you should be able to restart the program, look at the _dso_registry, and see the same addresses somewhere. If you then think you see memory corruption somewhere, you could also use read or write watchpoints.

But just to be sure: you're not adding any GC ranges manually, right? You could also try to compare the GC range to the address range layout in /proc/$PID/maps .
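Something like the following could do that comparison programmatically; it is only a sketch I'm adding for illustration, not existing druntime or asgen code:
```
// Sketch: print the /proc/self/maps entries that contain the given
// addresses, so that pbot/ptop values from the GC backtrace can be
// matched against the process's actual memory layout.
import std.algorithm : splitter;
import std.conv : to;
import std.stdio : File, writefln;

void printMappingsFor(size_t[] addrs...)
{
    foreach (line; File("/proc/self/maps").byLine)
    {
        auto span = line.splitter(' ').front;        // first field: "start-end"
        auto bounds = span.splitter('-');
        immutable start = bounds.front.to!size_t(16);
        bounds.popFront();
        immutable end = bounds.front.to!size_t(16);

        foreach (a; addrs)
            if (a >= start && a < end)
                writefln("0x%x lies in: %s", a, line);
    }
}
```
Calling it with the pbot/ptop values from the backtrace (0x7fcf6951b010 / 0x7fcf6a11b010) right before the offending collection would show whether those addresses fall inside the LMDB mapping, inside a GC pool, or in no mapping at all.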



-- 
Johannes
April 19, 2018
On Thu, 19 Apr 2018 06:33:27 +0000, Johannes Pfau wrote:

> 
> Generally, a crash produced in gdb should be reproducible when you restart the program in gdb. So once you have a crash, you should be able to restart the program, look at the _dso_registry, and see the same addresses somewhere. If you then think you see memory corruption somewhere, you could also use read or write watchpoints.
> 
> But just to be sure: you're not adding any GC ranges manually, right? You could also try to compare the GC range to the address range layout in /proc/$PID/maps .

Of course, if this is a GC pool / heap range, adding breakpoints in the sections code won't be useful. Then I'd try to add a write watchpoint on pooltable.minAddr / maxAddr, restart the program in gdb, and see where / why the values are set.

-- 
Johannes
April 19, 2018
On Thu, 19 Apr 2018 07:04:14 +0000, Johannes Pfau wrote:

> On Thu, 19 Apr 2018 06:33:27 +0000, Johannes Pfau wrote:
> 
> 
>> Generally, a crash produced in gdb should be reproducible when you restart the program in gdb. So once you have a crash, you should be able to restart the program, look at the _dso_registry, and see the same addresses somewhere. If you then think you see memory corruption somewhere, you could also use read or write watchpoints.
>> 
>> But just to be sure: you're not adding any GC ranges manually, right? You could also try to compare the GC range to the address range layout in /proc/$PID/maps .
> 
> Of course, if this is a GC pool / heap range, adding breakpoints in the sections code won't be useful. Then I'd try to add a write watchpoint on pooltable.minAddr / maxAddr, restart the program in gdb, and see where / why the values are set.

Having a quick look at https://github.com/ldc-developers/druntime/blob/ldc/src/gc/pooltable.d: the GC seems to allocate multiple pools using malloc, but it only keeps track of a single minimum/maximum address across all pools. Now if some other memory area is malloced in between these pools, you end up with a huge memory block. When this gets scanned, and if any of the memory in between the GC pools is protected, you might see the GC crash. However, I don't really know anything about the GC code, so a GC expert would have to confirm this.
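To put rough numbers on that scenario, here is a small sketch (my own illustration with hypothetical pool addresses loosely taken from the backtrace earlier in the thread; this is not druntime code):
```
// Hypothetical numbers: two 16 MiB GC pools that happen to be mapped far
// apart. A single minAddr/maxAddr pair covering both also spans everything
// mapped in between them (e.g. a large LMDB mmap), which is how memSize
// can reach hundreds of GiB on a 16 GiB machine.
import std.algorithm : max, min;
import std.stdio : writefln;

void main()
{
    immutable size_t poolSize = 16UL << 20;    // assume 16 MiB per pool
    immutable size_t[] poolBases = [0x7fcf_5951_b000, 0x7fdf_6b26_5000];

    size_t minAddr = size_t.max, maxAddr = 0;
    foreach (base; poolBases)
    {
        minAddr = min(minAddr, base);
        maxAddr = max(maxAddr, base + poolSize);
    }

    // memSize = pooltable.maxAddr - minAddr, as in gc.d line 1982
    immutable memSize = maxAddr - minAddr;
    writefln("memSize = %s bytes (~%.1f GiB)", memSize,
             memSize / cast(double)(1UL << 30));
}
```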



-- 
Johannes
April 19, 2018
On Wednesday, 18 April 2018 at 17:40:56 UTC, Matthias Klumpp wrote:
> On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
>> You can call GC.collect at some points in the program to see if they can trigger the crash
>
> I already do that, and indeed I get crashes. I could throw those calls into every function though, or set a minimal pool size; maybe that yields something...

Can you narrow down the earliest point at which it starts to crash? That might show whether something in particular is causing the crash.
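For example, wrapping the existing GC.collect() calls in a small helper would make the last point reached visible when the process dies. This is only a sketch, and `gcCheckpoint` is a hypothetical name rather than anything in asgen:
```
// Hypothetical helper: log a label, force a collection, log again. The
// last "entering" line without a matching "passed" line marks the first
// collection that crashes.
import core.memory : GC;
import std.stdio : stderr;

void gcCheckpoint(string label)
{
    stderr.writefln("GC checkpoint entering: %s", label);
    stderr.flush();            // make sure the label survives a segfault
    GC.collect();
    stderr.writefln("GC checkpoint passed:   %s", label);
    stderr.flush();
}
```
Calling it between the engine's major stages should narrow the first crashing collection down to a single stage.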
April 19, 2018
On Wednesday, 18 April 2018 at 17:40:56 UTC, Matthias Klumpp wrote:
> I get compile errors for the INVARIANT option, and I don't actually know how to deal with those properly:
> ```
> src/gc/impl/conservative/gc.d(1396): Error: shared mutable method core.internal.spinlock.SpinLock.lock is not callable using a shared const object
> src/gc/impl/conservative/gc.d(1396):        Consider adding const or inout to core.internal.spinlock.SpinLock.lock
> src/gc/impl/conservative/gc.d(1403): Error: shared mutable method core.internal.spinlock.SpinLock.unlock is not callable using a shared const object
> src/gc/impl/conservative/gc.d(1403):        Consider adding const or inout to core.internal.spinlock.SpinLock.unlock
> ```
>
> Commenting out the locks (eww!!) yields no change in behavior though.

As a workaround:
(cast(shared)rangesLock).lock();
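Spelled out a little more, here is the same cast in a self-contained form (my sketch, not a tested druntime patch; in gc.d itself the casts would go directly on the failing lines 1396 and 1403):
```
// Illustration of the cast: in a const context the shared SpinLock is seen
// as `shared const`, so its mutating lock()/unlock() are rejected; casting
// the const away (here through a pointer, so the real lock and not a
// temporary copy gets locked) makes the calls compile again.
import core.internal.spinlock : SpinLock;

void withLock(ref shared const SpinLock lk, scope void delegate() checks)
{
    auto mlk = cast(shared(SpinLock)*) &lk;   // drop the const qualifier
    mlk.lock();
    scope (exit) mlk.unlock();
    checks();
}
```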