Why does this simple test program leak 500MB of RAM?
June 19, 2023

Consider this program:

module test;

import core.memory : GC;
import core.sync.semaphore;
import core.thread;
import std;

enum keys = [
    "foo", "bar", "baz", "whee",
    "foo1", "foo2", "foo3", "foo4", "foo5", "foo6", "foo7", "foo8",
    "bar1", "bar2", "bar3", "bar4", "bar5", "bar6", "bar7", "bar8",
    "baz1", "baz2", "baz3", "baz4", "baz5", "baz6", "baz7", "baz8"];

void spamGC(Semaphore sema) {
    void recursionTest(int depth) {
        if (depth < 1000) {
            recursionTest(depth + 1);
            if (depth % 300 == 0)
                recursionTest(depth + 1);
        }
        string[string] assocArray;
        static foreach (a; keys)
        {
            assocArray[a] = a;
        }
        // ?????
        assocArray.clear;
        assocArray = null;
    }
    recursionTest(0);
    sema.notify;
    Thread.sleep(3600.seconds);
}

void main() {
    Thread[] threads;
    auto sema = new Semaphore(0);
    enum threadCount = 100;
    threadCount.iota.each!((_) {
        auto thread = new Thread({ spamGC(sema); });
        thread.start;
        threads ~= thread;
    });
    // this clears the leak up!
    // threads.each!((thread) { thread.join; });
    threadCount.iota.each!((i) {
        sema.wait;
    });
    writefln!"Done.";
    100.iota.each!((_) {
        GC.collect;
        GC.minimize;
    });
    writefln!"Collected.";

    // Now look at resident memory (RSS) for the process.
    Thread.sleep(3600.seconds);
}

We've had problems with excessive memory leaks in JSON decoding in production. So I've put together this test program.

The spamGC recursive call represents recursive JSON decoding.

The Thread.sleep call represents a thread idling in a threadpool.

After completion, the program sits at ~600MB resident. I believe that, from looking at the code, it should be obvious that this memory is entirely dead.
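For reference, here is a small Linux-only helper (my addition, not part of the original test) that prints the number being discussed, by reading VmRSS from /proc/self/status:

import std.algorithm : startsWith;
import std.stdio : File, writeln;

void printRss() {
    // /proc/self/status contains a line like "VmRSS:   614400 kB" on Linux
    foreach (line; File("/proc/self/status").byLine)
        if (line.startsWith("VmRSS:"))
            writeln(line);
}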

If I alloca 10KB of stack at the beginning of the recursion, the resident RAM drops to ~180MB. I don't think this completely solves the issue, but it sure is interesting.
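A minimal sketch of that experiment (not the exact code; I'm assuming the block is grabbed once at the top of the recursion, and the memset is my reading of "purging"):

import core.stdc.stdlib : alloca;
import core.stdc.string : memset;

void recursionTestPurged(int depth) {
    if (depth == 0) {
        // Grab ~10KB of the current stack frame and wipe it, so stale
        // pointer-looking values left behind on the stack are overwritten.
        enum purgeSize = 10 * 1024;
        void* scratch = alloca(purgeSize);
        memset(scratch, 0, purgeSize);
    }
    // ... recursion body as in recursionTest above ...
}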

-gx (dmd's stack-stomping flag) does nothing.

June 19, 2023

On 6/19/23 6:15 AM, FeepingCreature wrote:

> Consider this program:
>
> [full test program quoted; see the original post above]
>
> We've had problems with excessive memory leaks in JSON decoding in production. So I've put together this test program.
>
> The spamGC recursive call represents recursive JSON decoding.
>
> The Thread.sleep call represents a thread idling in a threadpool.
>
> After completion, the program sits at ~600MB resident. I believe that, from looking at the code, it should be obvious that this memory is entirely dead.

The clear should remove all references to the bucket elements (it basically nulls out the bucket elements without deallocating the elements or the bucket array). The assocArray = null could be considered a dead store by the optimizer, so it's possible that assignment isn't actually being performed.
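To illustrate the distinction (a minimal sketch, not code from the test program):

void aaClearVsNull() {
    string[string] aa = ["foo": "foo", "bar": "bar"];
    aa.clear;          // empties the buckets in place; the AA implementation stays allocated
    assert(aa.length == 0);
    aa = null;         // drops the reference to the implementation itself
    assert(aa is null);
}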

> If I alloca 10KB of stack at the beginning of the recursion, the resident RAM drops to ~180MB. I don't think this completely solves the issue, but it sure is interesting.

That is interesting...

So I'm looking at the druntime code. If I understand this correctly, the GC doesn't try running a collection when allocating small blocks until 2.5GB of memory is consumed? That can't be right...

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/os.d#L215-L252

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/impl/conservative/gc.d#L1949-L1967

But maybe that's only if some other condition is not met. Can you use the GC's profile option to determine how many collections run during the affected time? I'm curious whether the problem is the GC not collecting things it should collect, vs. not collecting at all.
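For reference, profiling can be enabled with the --DRT-gcopt=profile:1 runtime switch, or baked into the binary via the documented druntime mechanism (this snippet is an illustration, not code from the test program):

extern(C) __gshared string[] rt_options = ["gcopt=profile:1"];
// With this in place, druntime prints a summary at program exit showing how
// many collections ran and how much time they took.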

This also seems to be a factor, but I don't understand it.

https://github.com/dlang/dmd/blob/17c3f994e845ff0b63d7b5f6443fe5a7fdc08609/druntime/src/core/internal/gc/impl/conservative/gc.d#L1871-L1892

-Steve

June 19, 2023

On Monday, 19 June 2023 at 15:59:39 UTC, Steven Schveighoffer wrote:

> On 6/19/23 6:15 AM, FeepingCreature wrote:
>
> > If I alloca 10KB of stack at the beginning of the recursion, the resident RAM drops to ~180MB. I don't think this completely solves the issue, but it sure is interesting.
>
> That is interesting...

Some additional info I probably should have mentioned.

Note that I'm manually calling the GC at the end. At that point, everything should be dead. heaptrack shows that we get something like 8 big GC runs over the program runtime: as expected, because it's overprovisioning a "healthy" amount.

My standing theory is that it's the zombie stack issue again: see https://forum.dlang.org/post/befrzndhowlwnvlqcoxx@forum.dlang.org for what I think is happening.

I'm currently experimenting with setting up a fiber, then zeroing out its stack completely after the function has run in it. Results are mixed, but promising. I still end up with ~100MB live at the end though - which admittedly is a lot better than 600MB.

June 19, 2023

On 6/19/23 12:40 PM, FeepingCreature wrote:

> Some additional info I probably should have mentioned.
>
> Note that I'm manually calling the GC at the end. At that point, everything should be dead. heaptrack shows that we get something like 8 big GC runs over the program runtime: as expected, because it's overprovisioning a "healthy" amount.
>
> My standing theory is that it's the zombie stack issue again: see https://forum.dlang.org/post/befrzndhowlwnvlqcoxx@forum.dlang.org for what I think is happening.
>
> I'm currently experimenting with setting up a fiber, then zeroing out its stack completely after the function has run in it. Results are mixed, but promising. I still end up with ~100MB live at the end though - which admittedly is a lot better than 600MB.

Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so also -- an entire allocated pool has to be unallocated for it to release it to the OS.

I know there has always been mysterious "reasons" for the GC to not collect things that are stack-referenced, but I tend to first try to rule out looking at the wrong information, and/or other simple explanations before getting there.

Having somewhat recently looked at how the GC scanning works for threads, I'm reasonably certain that unused stack isn't being scanned (even if it's allocated). On a 64-bit machine, the likelihood of accidentally keeping things referenced on the stack is low.

I would also point out that in your specific case, most of the allocated memory is only secondarily referenced from the stack. That is, you might refer to the AA structure from the stack, but not the bucket elements (where the bulk is allocated). And using aa.clear should make that even less likely. All of the allocated buckets should be definitively garbage, with no possibility of zombie references.

-Steve

June 20, 2023

On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer wrote:

> Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so also -- an entire allocated pool has to be unallocated for it to release it to the OS.
>
> [...]

I'm looking at process usage. But the thing is, I can make it go down to below 100MB just by playing with the stack.

I'll post code for that version when I get to work.

June 20, 2023

On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer wrote:

> On 6/19/23 12:40 PM, FeepingCreature wrote:
>
> > I'm currently experimenting with setting up a fiber, then zeroing out its stack completely after the function has run in it. Results are mixed, but promising. I still end up with ~100MB live at the end though - which admittedly is a lot better than 600MB.
>
> Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so also -- an entire allocated pool has to be unallocated for it to release it to the OS.
>
> I know there has always been mysterious "reasons" for the GC to not collect things that are stack-referenced, but I tend to first try to rule out looking at the wrong information, and/or other simple explanations before getting there.

Okay, here's the "full stack cleanup" version: https://gist.github.com/FeepingCreature/3baf6bd62dc64441fe73f46e3df9b482

It spawns a fiber, runs the recursion in the fiber, then uses madvise and druntime abuse to set the entire fiber stack to zero.
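The gist itself is linked above; as a standalone illustration of the mechanism it leans on, here is a minimal madvise demo (not the gist's code; constants are declared by hand with their x86-64 Linux values, error checking omitted):

extern(C) void* mmap(void* addr, size_t len, int prot, int flags, int fd, long off);
extern(C) int munmap(void* addr, size_t len);
extern(C) int madvise(void* addr, size_t len, int advice);

enum PROT_READ = 0x1, PROT_WRITE = 0x2;
enum MAP_PRIVATE = 0x02, MAP_ANONYMOUS = 0x20;
enum MADV_DONTNEED = 4;

void main() {
    enum len = 1024 * 1024;
    auto p = cast(ubyte*) mmap(null, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0 .. len] = 0xAA;             // dirty the pages: they become resident
    madvise(p, len, MADV_DONTNEED); // kernel drops them; RSS shrinks
    assert(p[0] == 0);              // anonymous pages come back zero-filled on next touch
    munmap(p, len);
}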

RSS averaged over five runs:

With fiber, without madvise: ~177MB
With fiber, with madvise: ~75MB

This is extremely weird, though. Thinking about it, the madvise call really shouldn't matter! The fiber has exited, so its stack shouldn't be a GC root. Even if it is, there should be no way that any references to our string[string]s live in any possibly-live stack area in there. And yet, explicitly zeroing it out repeatably improves the GC's results. Conversely, just running the recursion in a fiber shouldn't do anything either: either that stack area is scanned or it isn't.

Well, empirically, "fiber plus cleanup makes RSS go down", which is the only thing that actually matters, I guess.

June 30, 2023

On Monday, 19 June 2023 at 17:38:17 UTC, Steven Schveighoffer wrote:

> Are you looking at GC memory usage or process usage? Because the GC doesn't generally give back memory to the OS even when it could. And fragmentation can stop it from doing so also -- an entire allocated pool has to be unallocated for it to release it to the OS.

Okay, I can confirm it's some sort of fragmentation issue.

Printing GC stats:

    writefln!"Done. (%s used, %s free)"(GC.stats.usedSize, GC.stats.freeSize);
    100.iota.each!((_) {
        GC.collect;
        GC.minimize;
    });
    writefln!"Collected. (%s used, %s free)"(GC.stats.usedSize, GC.stats.freeSize);

Without alloca purging:

Done. (155347088 used, 527275888 free)
Collected. (495808 used, 682127168 free)

So all but ~500KB of GC memory is free, but the process still sits at 796MB resident.

With alloca purging:

Done. (35950960 used, 646672016 free)
Collected. (26528 used, 36673632 free)

We only got rid of another ~450KB of used GC memory, but resident memory dropped to 164MB!

I think minuscule stack leaks must be keeping large GC pools alive here.

I kind of wish there were a way to say "1. All allocations in this thread should go into a dedicated pool X" and "2. Okay, we're done with the work; any remaining references to these allocations are a bug: scan for them and crash (with a message) if you find any."
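The closest thing available today is a crude check against GC.stats after a forced collection (a sketch with a hand-picked slack threshold, not a real per-pool scan):

import core.memory : GC;

void assertNothingSurvives(size_t allowedBytes) {
    GC.collect();
    GC.minimize();
    // If more than the allowed slack is still live, something is pinning
    // allocations that should be dead.
    assert(GC.stats.usedSize <= allowedBytes, "unexpected live GC memory");
}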

June 30, 2023

On Friday, 30 June 2023 at 13:18:12 UTC, FeepingCreature wrote:

> I think minuscule stack leaks must be keeping large GC pools alive here.

Just throwing this in here in case it is of significance: Does adding "--DRT-gcopt=gc:precise" as a command line option make any difference?

https://dlang.org/spec/garbage.html#precise_gc

-- Bastiaan.

July 01, 2023

On Friday, 30 June 2023 at 18:49:58 UTC, Bastiaan Veelo wrote:

> On Friday, 30 June 2023 at 13:18:12 UTC, FeepingCreature wrote:
>
> > I think minuscule stack leaks must be keeping large GC pools alive here.
>
> Just throwing this in here in case it is of significance: Does adding "--DRT-gcopt=gc:precise" as a command line option make any difference?
>
> https://dlang.org/spec/garbage.html#precise_gc
>
> -- Bastiaan.

It's never made any difference any time I tried.

I'd expect precise scanning to only be relevant on 32-bit, since it doesn't do precise scanning of the stack anyway.

edit: Heh. The precise GC manages to collect less garbage in the test above, leaving the program standing at 670MB resident.

Explain that one. :P

July 02, 2023

On 7/1/23 5:24 AM, FeepingCreature wrote:

> edit: Heh. The precise GC manages to collect less garbage in the test above, leaving the program standing at 670MB resident.
>
> Explain that one. :P

Probably a coincidental adjustment in GC bookkeeping allocations.

If I had a nickel for every time someone posted a "why does the GC behave this way!" message, I'd probably have like $2-3.

-Steve