April 18, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to kinke

On Wednesday, 18 April 2018 at 18:55:48 UTC, kinke wrote:
> On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
>> There's a number of debugging options for the GC, though I'm not sure which ones are enabled in the default debug build of druntime.
>
> Speaking for LDC, none are; they all need to be enabled explicitly. There's a whole bunch of them (https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L20-L31), so enabling most of them would surely help in tracking this down, but it's most likely still going to be very tedious.

> I'm not really surprised that there are compilation errors when enabling the debug options; that's the likely fate of untested code, unfortunately.

Yeah... Maybe setting up a CI build with "enable all the things" would make sense to combat that...

> If possible, I'd give static linking a try.

I tried that, at least linking druntime and Phobos statically. I did not, however, link everything statically. That is something to try (at least statically linking all the D libraries).
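For readers wondering how these options are turned on: they are plain D `debug` version identifiers, so they have to be enabled when druntime itself is compiled. A hedged sketch of what that looks like (the identifier names come from the gc.d link above, and `INVARIANT`/`SENTINEL` are the ones discussed later in this thread; the `...` stands for the rest of the real druntime build command, which differs per compiler and build system):

```
# Sketch: enabling GC debug instrumentation when building druntime.
# dmd-style flag spelling:
dmd -debug=INVARIANT -debug=SENTINEL -debug=MEMSTOMP ... src/gc/impl/conservative/gc.d ...
# LDC uses a different spelling for the same flags:
ldc2 -d-debug=INVARIANT -d-debug=SENTINEL -d-debug=MEMSTOMP ...
```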
April 18, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to Matthias Klumpp

On Wednesday, 18 April 2018 at 20:40:52 UTC, Matthias Klumpp wrote:
> [...]
>> If possible, I'd give static linking a try.
>
> I tried that, with at least linking druntime and phobos statically. I did not, however, link all the things statically.
> That is something to try (at least statically linking all the D libraries).

No luck...

```
#0  0x00000000007f10e8 in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., ptop=0x7fcf6a11b010, pbot=0x7fcf6951b010) at src/gc/impl/conservative/gc.d:1990
        p1 = 0x7fcf6951b010
        p2 = 0x7fcf6a11b010
        stackPos = 0
        stack = {{pbot = 0x7fffffffcc60, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87b4118, ptop = 0x87b4118}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcca0, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af258, ptop = 0x87af258}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcce0, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af158, ptop = 0x87af158}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcd20, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af0d8, ptop = 0x87af0d8}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fdf6b265000, ptop = 0x69b96a0}, {pbot = 0x28, ptop = 0x7fcf5951b000}, {pbot = 0x309eab7000, ptop = 0x7fdf6b265000}, {pbot = 0x0, ptop = 0x0}, {pbot = 0x1381d00, ptop = 0x1c}, {pbot = 0x1d, ptop = 0x1c}, {pbot = 0x1a44100, ptop = 0x1a4410}, {pbot = 0x1a44, ptop = 0x4}, {pbot = 0x7fdf6b355000, ptop = 0x69b96a0}, {pbot = 0x28, ptop = 0x7fcf5951b000}, {pbot = 0x309eab7000, ptop = 0x4ac0}, {pbot = 0x4a, ptop = 0x0}, {pbot = 0x1381d00, ptop = 0x1c}, {pbot = 0x1d, ptop = 0x1c}, {pbot = 0x4ac00, ptop = 0x4ac0}, {pbot = 0x4a, ptop = 0x4}}
        pcache = 0
        pools = 0x69b96a0
        highpool = 40
        minAddr = 0x7fcf5951b000
        memSize = 208820465664
        base = 0xaef0
        top = 0xae
        p = 0x4618770
        pool = 0x0
        low = 110859936
        high = 40
        mid = 140528533483520
        offset = 208820465664
        biti = 8329709
        pn = 142275872
        bin = 1
        offsetBase = 0
        next = 0xc4cc80
        next = {pbot = 0x7fffffffcbe0, ptop = 0x7f19ed <_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi+57>}
        __r292 = 0x7fffffffd320
        __key293 = 8376632
        rng = @0x0: <error reading variable>
#1  0x00000000007f19ed in _D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi (this=0x7fffffffd360, __applyArg0=...) at src/gc/impl/conservative/gc.d:2188
        range = {pbot = 0x7fcf6951b010, ptop = 0x7fcf6a11b010, ti = 0x0}
#2  0x00000000007fd161 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambda2MFNbKxSQCpQCpQCfZi (this=0x7fffffffd320, e=...) at src/rt/util/container/treap.d:47
#3  0x00000000007fd539 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x80396c0) at src/rt/util/container/treap.d:221
        result = 0
#4  0x00000000007fd565 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x87c8140) at src/rt/util/container/treap.d:224
        result = 0
#5  0x00000000007fd516 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000950) at src/rt/util/container/treap.d:218
        result = 16844032
#6  0x00000000007fd516 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000a50) at src/rt/util/container/treap.d:218
        result = 0
#7  0x00000000007fd516 in _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMDFNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000c50) at src/rt/util/container/treap.d:218
        result = 0
[etc...]
#37 0x000000000077e889 in core.memory.GC.collect() () at src/core/memory.d:207
#38 0x00000000006b4791 in asgen.engine.Engine.gcCollect() (this=0x7ffff7ee13c0) at ../src/asgen/engine.d:122
```
April 18, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to Johannes Pfau

On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
> Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236

Interesting, but I don't think it applies here. Both start and end addresses are 16-byte aligned, and both cannot be accessed according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>`).

This is quite interesting too: `memSize = 209153867776`. I don't know what exactly it is, but it's a pretty large number (~194 GB).
April 18, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to Johannes Pfau

On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
> [...]
>
> Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236
>
> It took us quite some time to reduce and debug this:
> https://github.com/D-Programming-GDC/GDC/pull/236/commits/5021b8d031fcacac52ee43d83508a5d2856606cd
>
> So I wondered why I couldn't find this in the upstream druntime code. Turns out our pull request has never been merged...
> https://github.com/dlang/druntime/pull/1678

Just to be sure, I applied your patch, but unfortunately I still get the same result...

On Wednesday, 18 April 2018 at 20:38:20 UTC, negi wrote:
> On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
>> ...
>
> This reminds me of (otherwise unrelated) problems I had involving Linux 4.15.
>
> If you feel out of ideas, I suggest you take a look at the kernels. It might be that Ubuntu is turning some security-related knob in a different direction than Debian. Or it might be some bug in 4.15 (I found it to be quite buggy, especially during the first few point releases; 4.15 was the first upstream release including large amounts of Meltdown/Spectre-related work).

All the crashes are happening on a 4.4 kernel though... I am currently pondering digging out a 4.4 kernel here to see if that lets me reproduce the crash locally.
April 18, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to kinke

On Wednesday, 18 April 2018 at 22:12:12 UTC, kinke wrote:
> On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
>> Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236
>
> Interesting, but I don't think it applies here. Both start and end addresses are 16-byte aligned, and both cannot be accessed according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>`). That's quite interesting too: `memSize = 209153867776`. Don't know what exactly it is, but it's a pretty large number (~194 GB).

```
size_t memSize = pooltable.maxAddr - minAddr;
```

(https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982)

That wouldn't make sense for a pool size...

The machine this is running on has 16G of memory; at the time of the crash the software was using ~2.1G of memory, with 130G of virtual memory due to LMDB memory mapping (I wonder what happens if I reduce that...)
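As an editorial aside, the arithmetic in this exchange is easy to double-check. This small illustrative Python sketch only restates figures quoted in the thread (the `memSize` from the trace, and the machine's 16G of RAM):

```python
# Sanity-check of the numbers discussed above (figures quoted from the thread).
mem_size = 209_153_867_776      # memSize reported in the crash
gib = mem_size / 2**30
print(f"{gib:.1f} GiB")         # ~194.8 GiB, matching the "~194 GB" estimate

# The machine has 16 GiB of RAM, so a span this large can only be sparse
# virtual address space (e.g. the ~130G of LMDB mappings mentioned above),
# not actual pool memory.
ram = 16 * 2**30
print(mem_size // ram)          # the span is ~12x the physical RAM
```

This supports the observation that `memSize` here is an address-space *span*, not an amount of allocated memory.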
April 19, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to Matthias Klumpp

On Wed, 18 Apr 2018 22:24:13 +0000, Matthias Klumpp wrote:
> On Wednesday, 18 April 2018 at 22:12:12 UTC, kinke wrote:
>> On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
>>> Actually this sounds very familiar: https://github.com/D-Programming-GDC/GDC/pull/236
>>
>> Interesting, but I don't think it applies here. Both start and end addresses are 16-byte aligned, and both cannot be accessed according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>`). That's quite interesting too: `memSize = 209153867776`. Don't know what exactly it is, but it's a pretty large number (~194 GB).
>
> size_t memSize = pooltable.maxAddr - minAddr;
> (https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982)
> That wouldn't make sense for a pool size...
>
> The machine this is running on has 16G memory, at the time of the crash the software was using ~2.1G memory, with 130G virtual memory due to LMDB memory mapping (I wonder what happens if I reduce that...)

I see. Then I'd try to debug where the range originally comes from; try adding breakpoints in _d_dso_registry, registerGCRanges, and similar functions here:
https://github.com/dlang/druntime/blob/master/src/rt/sections_elf_shared.d#L421

Generally, if you produced a crash in gdb, it should be reproducible if you restart the program in gdb. So once you have a crash, you should be able to restart the program, look at the _dso_registry, and see the same addresses somewhere. If you then think you see memory corruption somewhere, you could also use read or write watchpoints.

But just to be sure: you're not adding any GC ranges manually, right?

You could also try to compare the GC range to the address range layout in /proc/$PID/maps.

-- 
Johannes
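The `/proc/$PID/maps` comparison can be scripted. Below is an illustrative Python sketch: `find_mapping` is a hypothetical helper, and `sample_maps` is invented content standing in for a real maps file, which on Linux you would read from `/proc/<pid>/maps` instead:

```python
# Illustrative sketch: check which /proc/<pid>/maps entry (if any) contains a
# given address, e.g. the failing GC range bounds from the crash above.
# The sample content below is made up for demonstration.
sample_maps = """\
00400000-00c00000 r-xp 00000000 fd:01 1234 /usr/bin/appstream-generator
7fcf5951b000-7fcf6a11c000 rw-p 00000000 00:00 0
7fdf6b265000-7fdf6b355000 r--p 00000000 fd:01 5678 /some/lmdb/data.mdb
"""

def find_mapping(maps_text, addr):
    """Return (start, end, rest) of the mapping containing addr, or None."""
    for line in maps_text.splitlines():
        rng, _, rest = line.partition(" ")
        lo, _, hi = rng.partition("-")
        lo, hi = int(lo, 16), int(hi, 16)
        if lo <= addr < hi:
            return (lo, hi, rest)
    return None

# An address inside the second (anonymous) mapping is found:
print(find_mapping(sample_maps, 0x7fcf6951b010))
# An address the GC span covers but no mapping backs yields None --
# dereferencing such an address in the real process would fault:
print(find_mapping(sample_maps, 0x7fd000000000))
```

If the crashing `pbot`/`ptop` values fall outside every mapping (or inside a mapping without read permission), that would explain the "Cannot access memory" errors gdb shows.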
April 19, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to Johannes Pfau

On Thu, 19 Apr 2018 06:33:27 +0000, Johannes Pfau wrote:
> Generally, if you produced a crash in gdb, it should be reproducible if you restart the program in gdb. So once you have a crash, you should be able to restart the program, look at the _dso_registry, and see the same addresses somewhere. If you then think you see memory corruption somewhere, you could also use read or write watchpoints.
>
> But just to be sure: you're not adding any GC ranges manually, right? You could also try to compare the GC range to the address range layout in /proc/$PID/maps.

Of course, if this is a GC pool / heap range, adding breakpoints in the sections code won't be useful. In that case I'd add a write watchpoint on pooltable.minAddr / maxAddr, restart the program in gdb, and see where / why the values are set.

-- 
Johannes
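For concreteness, a watchpoint session along those lines might look like the sketch below. The expressions for reaching the pool table (`instance.gcx.pooltable` etc.) are assumptions; the actual symbol and field names depend on the compiler and druntime version, so they would need adjusting:

```
# Sketch of a gdb session for watching the GC pool table bounds.
(gdb) run                                      # reproduce the crash once
(gdb) run                                      # restart; addresses should recur
# hypothetical expressions -- adjust to the actual druntime symbols:
(gdb) watch -l instance.gcx.pooltable.minAddr
(gdb) watch -l instance.gcx.pooltable.maxAddr
(gdb) continue                                 # stops whenever a bound changes
(gdb) backtrace                                # see which allocation grew the range
```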
April 19, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to Johannes Pfau

On Thu, 19 Apr 2018 07:04:14 +0000, Johannes Pfau wrote:
> On Thu, 19 Apr 2018 06:33:27 +0000, Johannes Pfau wrote:
>> Generally, if you produced a crash in gdb, it should be reproducible if you restart the program in gdb. So once you have a crash, you should be able to restart the program, look at the _dso_registry, and see the same addresses somewhere. If you then think you see memory corruption somewhere, you could also use read or write watchpoints.
>>
>> But just to be sure: you're not adding any GC ranges manually, right? You could also try to compare the GC range to the address range layout in /proc/$PID/maps.
>
> Of course, if this is a GC pool / heap range, adding breakpoints in the sections code won't be useful. In that case I'd add a write watchpoint on pooltable.minAddr / maxAddr, restart the program in gdb, and see where / why the values are set.

Having a quick look at https://github.com/ldc-developers/druntime/blob/ldc/src/gc/pooltable.d: the GC seems to allocate multiple pools using malloc, but it only keeps track of one minimum/maximum address across all pools. Now, if some other memory area is malloc'd in between these pools, you end up with a huge tracked memory block. When that block gets scanned and any of the memory in between the GC pools is protected, you might see the GC crash. However, I don't really know anything about the GC code, so a GC expert would have to confirm this.

-- 
Johannes
April 19, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to Matthias Klumpp

On Wednesday, 18 April 2018 at 17:40:56 UTC, Matthias Klumpp wrote:
> On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
>> You can call GC.collect at some points in the program to see if they can trigger the crash.
>
> I already do that, and indeed I get crashes. I could throw those calls into every function, though, or set a minimal pool size; maybe that yields something...

Can you narrow down the earliest point at which it starts to crash? That might help identify whether something in particular triggers it.
April 19, 2018 Re: Issues with debugging GC-related crashes #2
Posted in reply to Matthias Klumpp

On Wednesday, 18 April 2018 at 17:40:56 UTC, Matthias Klumpp wrote:
> I get compile errors for the INVARIANT option, and I don't actually know how to deal with those properly:
> ```
> src/gc/impl/conservative/gc.d(1396): Error: shared mutable method core.internal.spinlock.SpinLock.lock is not callable using a shared const object
> src/gc/impl/conservative/gc.d(1396):        Consider adding const or inout to core.internal.spinlock.SpinLock.lock
> src/gc/impl/conservative/gc.d(1403): Error: shared mutable method core.internal.spinlock.SpinLock.unlock is not callable using a shared const object
> src/gc/impl/conservative/gc.d(1403):        Consider adding const or inout to core.internal.spinlock.SpinLock.unlock
> ```
>
> Commenting out the locks (eww!!) yields no change in behavior though.

As a workaround:

```
(cast(shared)rangesLock).lock();
```
Copyright © 1999-2021 by the D Language Foundation