December 09, 2022

We have a process that balloons up to 5GB in production. That's a bit much, so I started looking into ways to rein it in.

tl;dr: Add --DRT-gcopt=heapSizeFactor:0.25 if you don't care about CPU use, but want to keep RAM usage low.
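
For reference, here is a minimal sketch of the two ways to hand this option to a D program: at startup on the command line, or compiled in via the documented rt_options hook. The program name is just a placeholder.

// Option 1: pass at program startup; druntime parses --DRT-* arguments
// from the command line by default:
//   ./ourprocess --DRT-gcopt=heapSizeFactor:0.25
// Option 2: embed the default in the binary itself:
extern(C) __gshared string[] rt_options = [
  "gcopt=heapSizeFactor:0.25"
];

void main() {
  // normal program logic; the GC now aims for a much smaller heap
}

As far as I can tell, the command-line form takes precedence over the compiled-in default, which is handy for experimenting with different factors.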

Setup

This is a process that heavily uses std.json decoded into an object hierarchy in a multithreaded setup to load data from a central store.
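
The load path looks roughly like this simplified sketch; Node, decode and loadOne are made up for illustration, and the real hierarchy and store access are much larger:

import std.json;

// Each worker thread repeatedly pulls JSON documents from the central
// store and turns them into long-lived in-memory objects.
class Node {
  string name;
  Node[] children;
}

Node decode(JSONValue v) {
  auto node = new Node;
  node.name = v["name"].str;
  foreach (child; v["children"].array)
    node.children ~= decode(child);
  return node;
}

Node loadOne(string jsonText) {
  // every temporary JSONValue, every Node and every array append
  // lands on the GC heap, so the process is very sensitive to GC tuning
  return decode(parseJSON(jsonText));
}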

I let it run until it had loaded all its data, then recorded the RES value from top.
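
(I read top by hand, but for anyone who wants to automate that step, RES can also be read from inside the process; a minimal Linux-only sketch with a made-up helper name:)

import std.algorithm : startsWith;
import std.array : split;
import std.conv : to;
import std.stdio : File;

// Reads the resident set size (the RES column in top) in kB from
// /proc/self/status; Linux-only.
size_t residentKiB() {
  foreach (line; File("/proc/self/status").byLine)
    if (line.startsWith("VmRSS:"))
      return line.split[1].to!size_t;
  return 0;
}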

Each configuration was run three times and averaged. Note that the actual numbers are from memory because I lost the data in a freak LibreOffice crash (the first time Restore Document has outright failed on me), but I don't think they're more than ±100MB off.

heaptrack-d (thanks Tim Schendekehl, your heaptrack branch keeps on being useful) says that the used memory is split roughly 1:2 between internal allocations and std.json detritus. It also shows the memory usage sawtoothing up and down by ~2.5 several times during startup, as expected for a heavily GC-reliant process.

Premise

I have a standing theory (the "zombie stack hypothesis") that the D GC can leak memory via dead references that are falsely kept alive because successive calls never overwrite their slots on the stack. I.e.:

import core.thread;
import core.time;

void main() {
  void foo() {
    Object obj = new Object;
  }
  foo();
  // foo has returned, but the stale pointer to obj still sits just above
  // main's stack frame, so a conservative scan keeps seeing it as live
  void bar() {
    // somehow a gap arises in bar's stackframe, so the stale slot is
    // never overwritten?
    void* ptr = void;
    Thread.sleep(600.seconds);
  }
  bar;
}

Now obj is dead but will be kept alive for at least 10 minutes, because its pointer value is never erased from the stack.

It is unclear how much this actually happens in practice. Still, I tried the -gx flag (which adds stack-stomping code) in the hope of suppressing this effect.

All builds targeted 64-bit x86, using DMD 2.100.2 and LDC2 1.30.0.

Results

  • DMD stock: 3.8GB
  • DMD -gx: 3.4GB
  • LDC stock: 3.1GB
  • LDC --DRT-gcopt=heapSizeFactor:0.25: 800MB!!
  • LDC with "--DRT-gcopt=heapSizeFactor:0.25 gc:precise": 800MB

Analysis

DMD stock loses by a massive margin, even compared to LDC stock. It's unclear what is going on there: we may hypothesize that LDC packs the stack more densely than DMD, which would explain its advantage, but that hypothesis would predict that DMD with -gx ends up on par with stock LDC. In reality, even with -gx on DMD, LDC (without -gx!) still beats it by a good margin.

It's important to note that these values carry significant noise. Because GC runs are sparse, the result may be sensitive to where exactly in the sawtooth pattern the measurement was taken. However, since we averaged over multiple runs, LDC still seems to have an advantage here that neither noise nor the zombie stack hypothesis can fully explain.

Now to the big one: heapSizeFactor is massive. For some reason, running the GC vastly more often yields a >2x benefit. This is surprising because the default heapSizeFactor is 2: if the stock LDC figure of ~3.1GB corresponds to roughly twice the live data, then about 1.5GB of it is actually alive, and shrinking the factor should at most get us down to around 1.5GB, not 800MB.

It is possible that simply running the GC more often helps it clean up dead values more effectively. The zombie stack hypothesis has some opinions on this: maybe if a thread happens to be idle when the GC runs, its small live stack helps the GC notice that references which would usually appear fake-alive are really dead? Am I using this hypothesis for everything because it's all I've got? MAYBE!!

Caveat: the LDC run with heapSizeFactor:0.25 was also 2x-3x slower than without it. This is okay in our case because the process in question spends the great majority of its lifetime sitting at ~5% CPU anyway.

Interestingly, the precise GC provided no benefit. My understanding is that precise garbage collection only considers those fields in heap-allocated blocks that are actually pointers, rather than merely classifying whole allocations as "may contain pointers" or "pointer-free." If so, the reason it provided no advantage may be that we're running on 64-bit: it's much less likely than on 32-bit that a random non-pointer value aliases a valid heap pointer. And if the zombie stack hypothesis holds up, the major benefit would come from precise stack scanning rather than precise heap scanning: because memory is cleared on allocation by default, undead values on the heap are inherently much less likely. However, precise stack scanning is not implemented in any D compiler.

(Zombie heap leaks would arise only when a data structure like an array or hashmap is downsized in place without clearing the now-free fields.)
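
To make the conservative-vs-precise distinction concrete, here is a minimal sketch (hypothetical types, not from the actual codebase): a non-pointer field whose bit pattern happens to equal a heap address keeps that block alive under conservative heap scanning, while a precise GC knows from the type's pointer map to skip it.

import core.memory : GC;

struct Record {
  Record* next; // a real pointer, so the block gets scanned at all
  size_t id;    // a plain number, but the conservative GC scans this word too
}

void main() {
  auto payload = new ubyte[](64 * 1024 * 1024);
  auto rec = new Record;
  // Storing an address as an integer creates a "false pointer". On 32-bit,
  // arbitrary ids collide with valid heap addresses far more often.
  rec.id = cast(size_t) payload.ptr;
  payload = null;
  GC.collect();
  // Under conservative heap scanning, the 64MB block can survive this
  // collection because rec.id still looks like a pointer to it; with
  // --DRT-gcopt=gc:precise, the pointer map for Record excludes id.
}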

It remains an open question what the tradeoff is for different values of heapSizeFactor. Ostensibly, values smaller than 1 shouldn't make any difference (being approximately "run the GC on every allocation"), but that doesn't seem to be how it works; there is evidently some internal smoothing of the actual target value at work as well.

In any case, we will keep --DRT-gcopt=heapSizeFactor in mind as our front-line response for processes using excessive amounts of memory.

December 09, 2022

A while back, I played around with GC params too, for the compiler frontend itself (-lowmem). maxPoolSize might be interesting too. https://github.com/ldc-developers/ldc/pull/2916#issuecomment-443433594