Thread overview
Debugging a Memory Leak
Nov 18, 2014
Vladimir Panteleev
Nov 18, 2014
Etienne
November 17, 2014
There seems to be a memory leak in the Higgs compiler. This problem shows up when running our test suite (`make test` command).

A new VM object is created for each unittest block, e.g.:
https://github.com/maximecb/Higgs/blob/master/source/runtime/tests.d#L201

These VM objects are unfortunately *never freed*. Not until the whole series of tests is run and the process terminates. The VM objects keep references to many other objects, and so the process keeps using more and more memory, up to over 2GB.

The VM allocates its own JS data heap that it manages itself, i.e.:
https://github.com/maximecb/Higgs/blob/master/source/runtime/gc.d#L186

This memory is clearly marked as NO_SCAN, and so references to the VM in there should presumably not be counted. There is also executable memory I allocate with mmap, but this should also be ignored by the D GC in principle (I do not mark executable code as roots):
https://github.com/maximecb/Higgs/blob/master/source/jit/codeblock.d#L129
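
For readers following along, here is a minimal sketch of what marking a block as NO_SCAN looks like with `core.memory`. It is not the Higgs code: the helper name `allocOpaqueHeap` and the sizes are made up for illustration. `GC.getAttr` is used to read back the attribute bits the collector recorded for the block.

```d
import core.memory : GC;
import std.stdio : writeln;

// Hypothetical helper (not from Higgs): allocate a raw heap whose
// contents the D collector should not inspect.
void* allocOpaqueHeap(size_t size)
{
    // NO_SCAN: the block's contents are not scanned for pointers.
    // NO_INTERIOR: only a pointer to the block's base address is
    // supposed to keep the block alive.
    return GC.malloc(size, GC.BlkAttr.NO_SCAN | GC.BlkAttr.NO_INTERIOR);
}

void main()
{
    auto heap = allocOpaqueHeap(1 << 20);

    // Read back the attribute bits recorded for this block.
    immutable attr = GC.getAttr(heap);
    writeln("NO_SCAN set: ", (attr & GC.BlkAttr.NO_SCAN) != 0);
    writeln("NO_INTERIOR set: ", (attr & GC.BlkAttr.NO_INTERIOR) != 0);
}
```

Note that NO_SCAN only stops the collector from following pointers stored inside the block. It says nothing about who points at the block, or at the VM, from elsewhere.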

I don't know where the problem lies. There could be false pointers, but I'm on a 64-bit system, which should presumably make this less likely. I wish there was a way to ask the D runtime "can you tell me what is pointing to this object?", but the situation is more complex because many objects in my system refer to the VM object; there is a complicated graph of references. If anything points into that graph, the whole thing stays "live".

Help or advice on solving this problem is welcome.
November 18, 2014
On 11/17/14 6:12 PM, Maxime Chevalier-Boisvert wrote:
> There seems to be a memory leak in the Higgs compiler. This problem
> shows up when running our test suite (`make test` command).
>
> A new VM object is created for each unittest block, e.g.:
> https://github.com/maximecb/Higgs/blob/master/source/runtime/tests.d#L201
>
> These VM objects are unfortunately *never freed*. Not until the whole
> series of tests is run and the process terminates. The VM objects keep
> references to many other objects, and so the process keeps using more
> and more memory, up to over 2GB.
>
> The VM allocates its own JS data heap that it manages itself, i.e.:
> https://github.com/maximecb/Higgs/blob/master/source/runtime/gc.d#L186
>
> This memory is clearly marked as NO_SCAN, and so references to the VM in
> there should presumably not be counted. There is also executable memory
> I allocate with mmap, but this should also be ignored by the D GC in
> principle (I do not mark executable code as roots):
> https://github.com/maximecb/Higgs/blob/master/source/jit/codeblock.d#L129
>
> I don't know where the problem lies. There could be false pointers, but
> I'm on a 64-bit system, which should presumably make this less likely. I
> wish there was a way to ask the D runtime "can you tell me what is
> pointing to this object?", but the situation is more complex because
> many objects in my system refer to the VM object; there is a complicated
> graph of references. If anything points into that graph, the whole thing
> stays "live".

Hm... such a function could be created. However, it would be tricky to make work.

First, you would need a way to store the pointer without having it actually point at the data. Clearly, if you pass the pointer to the function, it's going to be on the stack, so that would then refer to it. You have to somehow obfuscate it the whole time.
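
One way to do that obfuscation, as a rough sketch: keep the bitwise complement of the address in an integer, so a conservative scan that stumbles on the stored value should not mistake it for a pointer into the heap. The `HiddenPtr` type and its methods are hypothetical names, not anything that exists in druntime.

```d
import std.stdio : writeln;

// Hypothetical wrapper: remembers an address without holding a value
// that a conservative scan would read as a pointer to the target.
struct HiddenPtr
{
    private size_t bits; // bitwise complement of the real address

    static HiddenPtr hide(const(void)* p)
    {
        return HiddenPtr(~cast(size_t) p);
    }

    // Recover the real pointer. From this point on the address is
    // visible again on the stack or in registers.
    void* reveal() const
    {
        return cast(void*) ~bits;
    }

    // Compare a raw machine word against the hidden address without
    // materializing the real address first.
    bool matches(size_t word) const
    {
        return ~word == bits;
    }
}

void main()
{
    auto obj = new Object;
    auto h = HiddenPtr.hide(cast(void*) obj);

    writeln(h.reveal() is cast(void*) obj);
    writeln(h.matches(cast(size_t) cast(void*) obj));
}
```

The argument passed to `hide` still sits on the stack for the duration of the call, so this only helps with how the address is stored afterwards.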

Second, you may be given "memory x is pointing at your target", but what does memory x actually mean? That isn't something the GC can deal with. Perhaps when precise scanning is included (and I think we are close on that), you will have at least some type info.

> Help or advice on solving this problem is welcome.

GC problems are *nasty*. My advice is to run the simplest program you can think of that still exhibits the problem, and then put in printf debugging everywhere to see where it breaks down.

Not sure if this is useful.

-Steve
November 18, 2014
> GC problems are *nasty*. My advice is to run the simplest program you can think of that still exhibits the problem, and then put in printf debugging everywhere to see where it breaks down.
>
> Not sure if this is useful.

Unfortunately, the program doesn't break or crash. It just keeps allocating memory that doesn't get freed. There must be some false reference somewhere. I'm not sure how I can printf debug my way out of that.
November 18, 2014
On Monday, 17 November 2014 at 23:12:10 UTC, Maxime Chevalier-Boisvert wrote:
> Help or advice on solving this problem is welcome.

The D GC has some debugging code which might be a little helpful (check the commented-out debug = X lines in druntime/src/gc/gc.d). Specifically, debug=LOGGING activates some sort of leak detector, though I'm not sure how effective it is as I've never used it.

I've begun work on reviving Diamond to work for D2, multiple threads and x64. Once complete it should be able to answer such questions definitively, but it'll probably take a few days at least. Watch this space:

https://github.com/CyberShadow/druntime/commits/diamond
https://github.com/CyberShadow/Diamond
November 18, 2014
On 11/17/14 11:41 PM, Maxime Chevalier-Boisvert wrote:
>> GC problems are *nasty*. My advice is to run the simplest program you
>> can think of that still exhibits the problem, and then put in printf
>> debugging everywhere to see where it breaks down.
>>
>> Not sure if this is useful.
>
> Unfortunately, the program doesn't break or crash. It just keeps
> allocating memory that doesn't get freed. There must be some false
> reference somewhere. I'm not sure how I can printf debug my way out of
> that.

By "break down", I mean it does what you don't want :)

You will need to instrument the GC and/or druntime.

Note, if there is a false pointer, it's likely stack based, and there are likely not very many of them.

But you have NO_INTERIOR set. This means the false pointer MUST point at the beginning of the block in order to keep it alive.

As I said, these are tricky issues. It would not be easy to determine.

One thing you can try -- allocate the block as a class, with a finalizer. This gives you the ability to sense when/if a block is finalized. That can help you determine the point at which your program starts to misbehave.
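
A minimal sketch of the finalizer idea, assuming a stand-in class rather than the real VM (`TrackedVM`, `finalizedCount` and `makeGarbage` are hypothetical names). The destructor only bumps a counter and calls C `printf`, since allocating from the GC inside a finalizer is not allowed.

```d
import core.atomic : atomicLoad, atomicOp;
import core.memory : GC;
import core.stdc.stdio : printf;

// Number of TrackedVM instances whose destructor has run so far.
shared size_t finalizedCount;

// Hypothetical stand-in for the VM class.
final class TrackedVM
{
    size_t id;

    this(size_t id) { this.id = id; }

    ~this()
    {
        // No GC allocation in here: just count and log via C printf.
        atomicOp!"+="(finalizedCount, 1);
        printf("TrackedVM %zu finalized\n", id);
    }
}

// Create n objects and drop every reference to them.
void makeGarbage(size_t n)
{
    foreach (i; 0 .. n)
    {
        auto vm = new TrackedVM(i);
    }
}

void main()
{
    makeGarbage(100);
    GC.collect();

    // With a conservative collector the count can be lower than 100
    // even when nothing is really leaking.
    printf("%zu of 100 finalized after GC.collect()\n",
           atomicLoad(finalizedCount));
}
```

If the count stays at or near zero across many test blocks, something is holding on to the objects. If it tracks the number of tests closely, the leak is more likely in memory the objects own than in the objects themselves.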

-Steve
November 18, 2014
On 2014-11-17 6:12 PM, Maxime Chevalier-Boisvert wrote:
> Help or advice on solving this problem is welcome.

I've tried dumping logs from the garbage collection process and it's the biggest waste of time. Even if you left a reference somewhere, the logs will not help identify the code that caused it.

Instead, you should do a test with the following:

Store in a string[size_t] a list of pointers that should have been collected, along with the variable name. Once you assume they should have been collected, run this:

http://dlang.org/phobos/core_thread.html#.thread_scanAll

The thread_scanAll function calls your delegate with the memory ranges, such as thread stacks, that the GC scans for references.

Run the stored size_t list against each value contained in the memory range. Accumulate everything that matches into another hashmap, and then fail with the error "Variables [list of identifiers] still have references in the code!"
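
A rough sketch of that procedure follows, with several assumptions worth stating loudly. It uses a plain array instead of a `string[size_t]` hashmap so the scan delegate stays trivially `nothrow`. It stores the complement of each address so the table itself does not look like a reference. It assumes no other threads are running while `thread_scanAll` is called. The names `Watched`, `watch` and `scanForWatched` are made up.

```d
import core.thread : thread_scanAll;
import std.stdio : writeln;

// One watched address plus the variable name to report.
struct Watched
{
    size_t hidden; // bitwise complement of the address
    string name;
    bool seen;
}

Watched[] watchList; // thread-local table of watched addresses

void watch(const(void)* p, string name)
{
    watchList ~= Watched(~cast(size_t) p, name, false);
}

// Walk every range reported by thread_scanAll word by word and flag
// the watched addresses that show up. Returns how many were found.
size_t scanForWatched()
{
    foreach (ref w; watchList)
        w.seen = false;

    thread_scanAll((void* lo, void* hi) nothrow {
        enum mask = size_t.sizeof - 1;
        auto cur = (cast(size_t) lo + mask) & ~mask; // align upwards
        immutable end = cast(size_t) hi;
        for (; cur + size_t.sizeof <= end; cur += size_t.sizeof)
        {
            immutable word = *cast(const(size_t)*) cur;
            foreach (ref w; watchList)
                if (~word == w.hidden)
                    w.seen = true;
        }
    });

    size_t count;
    foreach (ref w; watchList)
        if (w.seen)
            ++count;
    return count;
}

void main()
{
    auto leaked = new Object; // deliberately kept alive
    watch(cast(void*) leaked, "leaked");

    immutable n = scanForWatched();
    foreach (ref w; watchList)
        if (w.seen)
            writeln("Variable ", w.name, " still has references");
    writeln(n, " watched address(es) found; still alive: ", leaked !is null);
}
```

This only covers the ranges `thread_scanAll` reports. References held inside other GC heap blocks will not show up, and stale stack slots can produce hits for objects that are no longer really in use.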