Thread overview
How to debug (potential) GC bugs?
Sep 25, 2016
Matthias Klumpp
Sep 27, 2016
Marco Leise
Sep 27, 2016
Guillaume Piolat
Sep 27, 2016
Kapps
Sep 29, 2016
Kagamin
Oct 01, 2016
Matthias Klumpp
Oct 03, 2016
Kagamin
Oct 03, 2016
Kagamin
Oct 07, 2016
Martin Nowak
Oct 03, 2016
Kagamin
Oct 04, 2016
Ilya Yaroshenko
Oct 07, 2016
Martin Nowak
Oct 07, 2016
Johannes Pfau
September 25, 2016
Hello!
I am working together with others on the D-based appstream-generator[1] project, which is generating software metadata for "software centers" and other package-manager functionality on Linux distributions, and is used by default on Debian, Ubuntu and Arch Linux.

For Ubuntu, some modifications to the code were needed, and apparently for them the code is currently crashing in the GC collection thread: http://paste.debian.net/840490/

The project is running a lot of stuff in parallel and is using the GC (if the extraction is a few seconds slower due to the GC being active, it doesn't matter much).

We also link against a lot of 3rd-party libraries and use a large amount of existing C code in the project.

So, I would like to know the following things:

1) Is there any caveat when linking to C libraries and using the GC in a project? So far, it seems to be working well, but there have been a few cases where I was suspicious about the GC actually doing something to malloc'ed stuff or C structs present in the bindings.

2) How can one debug issues like the one mentioned above properly? Since it seems to happen in the GC and doesn't give me information on where to start searching for the issue, I am a bit lost.

3) The tool seems to leak memory somewhere and OOMs pretty quickly on some machines. All the stuff using C code frees resources properly though, and using Valgrind on the project is a pain due to large amounts of data being mmapped. I worked around this a while back, but then the GC interfered with Valgrind, making the information less useful. Is there any information on how to find memory leaks, or e.g. large structs the GC cannot free because something is still holding a needless reference to them?

Unfortunately I can't reproduce the crash from 2) myself; it only seems to happen on Ubuntu (but Ubuntu is using some different codepaths too).

Any insights would be highly appreciated!
Cheers,
   Matthias

[1]: https://github.com/ximion/appstream-generator

September 27, 2016
On Sun, 25 Sep 2016 16:23:11 +0000
Matthias Klumpp <matthias@tenstral.net> wrote:

> So, I would like to know the following things:
> 
> 1) Is there any caveat when linking to C libraries and using the GC in a project? So far, it seems to be working well, but there have been a few cases where I was suspicious about the GC actually doing something to malloc'ed stuff or C structs present in the bindings.

If you pass callbacks into the C code, make sure they never throw. Stack unwinding and exception handling generally don't work across language boundaries.
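
A minimal sketch of such a callback, assuming the C side is the standard `qsort` from libc. The names `compareInts`, `compareCallback` and `sortedCopy` are made up for illustration. Marking the callback `nothrow` lets the compiler reject any exception that could escape into C:

```d
import core.stdc.stdlib : qsort;

// Hypothetical D logic that might throw.
int compareInts(int a, int b)
{
    if (a == int.min || b == int.min)
        throw new Exception("unsupported value");
    return (a > b) - (a < b);
}

// Callback handed to C. It is nothrow, so no Exception can leave it.
extern (C) int compareCallback(scope const void* pa, scope const void* pb) nothrow
{
    try
    {
        return compareInts(*cast(const int*) pa, *cast(const int*) pb);
    }
    catch (Exception)
    {
        return 0; // swallow here and report the error through some other channel
    }
}

int[] sortedCopy(const int[] input)
{
    auto data = input.dup;
    qsort(data.ptr, data.length, int.sizeof, &compareCallback);
    return data;
}

void main()
{
    assert(sortedCopy([3, 1, 2]) == [1, 2, 3]);
}
```

Returning a neutral value from the `catch` is only one option; a real callback would probably record the failure somewhere the D caller can check after the C function returns.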

A tracing garbage collector starts with the assumption that
all the memory that it allocated is no longer reachable and
then starts scanning the known memory for any pointers to
allocations that falsify this assumption.
What you malloc'ed is unknown to the GC and won't be scanned.
Should you ever have GC memory pointers in your malloc'ed
stuff, then you need to call GC.addRange() to make those
pointers keep the allocations alive. Otherwise you will get a
"use after free" error: data corruption or access violations.
A simple case would be a string that you constructed in D and
store in C as a pointer. The GC can automatically scan the
stack and any globals/statics on the D side, but that's about
it.
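
The string case could look roughly like this. It is only a sketch: `CRecord`, `createRecord` and `destroyRecord` are hypothetical names standing in for whatever the C binding really uses.

```d
import core.memory : GC;
import core.stdc.stdlib : malloc, free;

// Hypothetical C-side record that stores a pointer to D string data.
struct CRecord
{
    const(char)* name; // points into GC-managed memory
    size_t length;
}

CRecord* createRecord(string name)
{
    auto rec = cast(CRecord*) malloc(CRecord.sizeof);
    assert(rec !is null);

    auto copy = name.dup; // GC allocation, soon referenced only from malloc'ed memory
    rec.name = copy.ptr;
    rec.length = copy.length;

    // Tell the GC to scan the malloc'ed block for pointers into its heap.
    GC.addRange(rec, CRecord.sizeof);
    return rec;
}

void destroyRecord(CRecord* rec)
{
    // Unregister before freeing, otherwise the GC would later scan freed memory.
    GC.removeRange(rec);
    free(rec);
}

void main()
{
    auto rec = createRecord("appstream");
    GC.collect(); // the copy stays alive: the GC finds it through the registered range
    assert(rec.name[0 .. rec.length] == "appstream");
    destroyRecord(rec);
}
```

Without the `GC.addRange` call, nothing the GC scans would be guaranteed to reference the copy once `createRecord` returns, so a later collection could free it while the C side still holds the pointer.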

I know of no tools similar to Valgrind that are specifically designed to debug the D GC. You can plug into the GC API and keep track of the allocation sizes, i.e. write a proxy GC.
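
A full proxy GC is more than fits in a post. As a lighter substitute (not the proxy approach itself), here is a sketch that samples the heap through `GC.stats`, assuming a druntime recent enough to provide it in `core.memory`. `reportHeap` is a made-up helper name.

```d
import core.memory : GC;
import std.stdio : writefln;

// Print how much memory the GC heap currently uses, tagged with a label.
void reportHeap(string label)
{
    const stats = GC.stats();
    writefln("%s: used=%s bytes, free=%s bytes", label, stats.usedSize, stats.freeSize);
}

void main()
{
    reportHeap("start");

    auto buffers = new ubyte[][](64);
    foreach (ref b; buffers)
        b = new ubyte[](1024 * 1024); // 64 blocks of 1 MiB each
    reportHeap("after allocating");

    buffers = null;
    GC.collect();
    reportHeap("after collect");
}
```

Calling something like this around suspicious phases of a long run at least shows whether the GC heap itself is growing or whether the memory is going somewhere else, e.g. into malloc'ed C structures.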

-- 
Marco

September 27, 2016
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp wrote:
> [...]
>
> 1) Is there any caveat when linking to C libraries and using the GC in a project? So far, it seems to be working well, but there have been a few cases where I was suspicious about the GC actually doing something to malloc'ed stuff or C structs present in the bindings.

The GC never scans memory allocated with malloc (unless you tell it to) or memory used inside the bindings.
A caveat is that if you are called from C (not your case), you must initialize the runtime, and attach/detach threads.

The GC could well stop threads that are currently in the C code if they were registered to the runtime.


> 2) How can one debug issues like the one mentioned above properly? Since it seems to happen in the GC and doesn't give me information on where to start searching for the issue, I am a bit lost.

There can be multiple reasons.
  - The GC is collecting some object that is unreachable from its POV while you are actually still using it.
  - The GC is calling destructors that should not be called by the GC and that perform illegal operations. Usually this is solved by using deterministic destruction instead and never relying on a destructor being called by the GC.
  - The GC tries to stop threads that don't exist anymore or are not interruptible.

My advice is to have a fully deterministic tree of objects, like a C++ program, and Google for "GC-proof resource class" in case you are using classes.
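
A rough sketch of that idea follows. It is not necessarily the exact idiom the search turns up. `NativeBuffer` is a hypothetical class, and `GC.inFinalizer` is assumed to be available, which needs a fairly recent druntime.

```d
import core.memory : GC;
import core.stdc.stdio : fprintf, stderr;
import core.stdc.stdlib : malloc, free;

// Hypothetical class owning a C resource.
final class NativeBuffer
{
    private void* handle;

    this(size_t size)
    {
        handle = malloc(size);
    }

    // @nogc nothrow: the compiler rejects GC allocations and throwing in here.
    ~this() @nogc nothrow
    {
        // Complain loudly if cleanup was left to the GC.
        if (GC.inFinalizer)
            fprintf(stderr, "NativeBuffer finalized by the GC; call destroy() explicitly\n");
        free(handle);
        handle = null;
    }

    bool isOpen() const
    {
        return handle !is null;
    }
}

void main()
{
    auto buf = new NativeBuffer(4096);
    scope (exit) destroy(buf); // deterministic release at scope exit
    assert(buf.isOpen);
}
```

With `scope (exit) destroy(buf)` the destructor runs at a known point on the owning thread, so the warning branch only fires for objects somebody forgot to release.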


September 27, 2016
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp wrote:
> [...]

First, make sure any C threads calling D code are attached with thread_attachThis (from core.thread). Otherwise the GC will not suspend those threads during a collection, which will cause crashes. I'd guess this is your issue.
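
Roughly like this, assuming a POSIX system. The thread is created with `pthread_create` directly to imitate a thread owned by a C library, and `foreignThreadEntry` and `runOnForeignThread` are made-up names.

```d
import core.sys.posix.pthread;
import core.thread : thread_attachThis, thread_detachThis;

// Stands in for a callback that a C library invokes on a thread it created itself.
extern (C) void* foreignThreadEntry(void* arg) nothrow
{
    try
    {
        thread_attachThis();              // register this thread with the D runtime and GC
        scope (exit) thread_detachThis(); // unregister before the thread ends

        auto result = cast(size_t*) arg;
        auto data = new int[](1000);      // GC allocation on the attached thread
        *result = data.length;
    }
    catch (Exception)
    {
        // never let an exception escape into the C caller
    }
    return null;
}

size_t runOnForeignThread()
{
    pthread_t tid;
    size_t result;
    // Created behind druntime's back, so the GC does not know this thread yet.
    const rc = pthread_create(&tid, null, &foreignThreadEntry, &result);
    assert(rc == 0);
    pthread_join(tid, null);
    return result;
}

void main()
{
    assert(runOnForeignThread() == 1000);
}
```

Threads started through `core.thread.Thread` or `std.parallelism` are registered automatically; the attach and detach calls only matter for threads the C side creates.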

Second, tell the GC about non-GC memory that holds pointers to GC memory by using GC.addRange / GC.addRoot as needed. Make sure to remove them again once the non-GC memory is deallocated, otherwise you'll get memory leaks. The GC is also conservative, not precise, so false positives are possible: a value that merely looks like a pointer can keep a block alive. In 64-bit programs this shouldn't be much of an issue, though.

Finally, make sure you're not doing any GC allocations in dtors.
September 29, 2016
Does it crash only in rt_finalize2? It calls the class destructor, and the destructor must not allocate or touch the GC in any way because the GC doesn't yet support allocation during collection.
October 01, 2016
On Thursday, 29 September 2016 at 09:56:34 UTC, Kagamin wrote:
> Does it crash only in rt_finalize2? It calls the class destructor, and the destructor must not allocate or touch GC in any way because the GC doesn't yet support allocation during collection.

Thank you all for the good advice! I do none of those things in my code though...
Unfortunately, to have deterministic memory management, I would essentially need to develop GC-less and would lose classes. This means many nice features of D aren't available, e.g. I couldn't use interfaces (AFAIK they don't work on structs) or constraints.

Strangely, after switching from the GDC compiler to the LDC compiler, all crashes observed on Ubuntu are gone.
So, this problem is:
 A) A compiler / DRuntime bug, or
 B) A bug in my code (not) triggered by a certain compiler / DRuntime

For the excessive memory usage, I have no idea yet - the GC not freeing its memory pool on exit is quite bad for Valgrinding the code.
Memory consumption has improved recently after I stopped opening an LMDB database twice in the same process from multiple threads, which is not supported by LMDB. I haven't done longer runs yet, so I am not sure if that really was the problem (seems unlikely, but you never know...).

Cheers,
    Matthias

October 03, 2016
On Saturday, 1 October 2016 at 00:06:05 UTC, Matthias Klumpp wrote:
> I do none of those things in my code though...

`grep "~this" *.d` gives nothing? It can also be a struct with a destructor stored in a class. Can you observe the error? Try to set a breakpoint at onInvalidMemoryOperationError https://github.com/dlang/druntime/blob/master/src/core/exception.d#L559 and see what stack leads to it.

> Unfortunately, to have deterministic memory management, I would essentially need to develop GC-less and would lose classes. This means many nice features of D aren't available, e.g. I couldn't use interfaces (AFAIK they don't work on structs) or constraints.

Not necessarily. You only need to dispose of the resources in time, like in C#. But if you don't have destructors, you have nothing to dispose of.
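
A small sketch of the C#-like approach, which keeps classes and interfaces. `Disposable`, `TempStore` and `use` are hypothetical names, not something Phobos provides.

```d
// Hypothetical C#-style disposable interface.
interface Disposable
{
    void dispose();
}

final class TempStore : Disposable
{
    private bool open = true;

    bool isOpen() const
    {
        return open;
    }

    // Release files, handles or C memory here; harmless if called twice.
    void dispose()
    {
        open = false;
    }
}

// Runs `work` and always disposes the resource afterwards, even if `work` throws.
void use(Disposable res, scope void delegate() work)
{
    scope (exit) res.dispose();
    work();
}

void main()
{
    auto store = new TempStore;
    use(store, () { assert(store.isOpen); });
    assert(!store.isOpen);
}
```

The object's memory is still reclaimed by the GC whenever it likes, but nothing important depends on when that happens, and there is no destructor for the GC to run.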

> Strangely, after switching from the GDC compiler to the LDC compiler, all crashes observed on Ubuntu are gone.

That doesn't sound good.
October 03, 2016
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp wrote:
> For Ubuntu, some modifications to the code were needed, and apparently for them the code is currently crashing in the GC collection thread: http://paste.debian.net/840490/

Oh, wait, what do you mean by crashing?
October 03, 2016
If it's heap corruption, the GC has a debugging option, -debug=SENTINEL, for buffer overrun checks (druntime has to be built with it). Also, that particular stack trace shows that the object being destroyed is allocated in bin 512, i.e. its size is between 256 and 512 bytes.
October 04, 2016
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp wrote:
> Hello!
> I am working together with others on the D-based appstream-generator[1] project, which is generating software metadata for "software centers" and other package-manager functionality on Linux distributions, and is used by default on Debian, Ubuntu and Arch Linux.
>
> [...]

Probably a related issue: https://issues.dlang.org/show_bug.cgi?id=15939