What's the go with the GC these days? (page 3)

On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
> How much truth is in here?
> What is this business about write barriers making the GC fast? Is that
> actually fact, or are there other hurdles?

First of all, stop-the-world GC gets an unnecessarily bad rap these
days. A high throughput stop-the-world GC can actually be pretty
competitive in terms of throughput, and latency may be lower than what
you'd expect. You probably aren't going to run AAA video games with one,
but there are many application domains where latency just doesn't matter
or doesn't matter as much; even normal interactive use (such as your
typical GUI application) often works plenty fine. Stop-the-world
collectors are more of a problem for a functional programming language
(or languages that exercise allocation-heavy functional idioms a lot),
but less so for imperative languages that have more control over
allocation frequency.

In general, a modern precise stop-the-world collector should be able to
handle 1GB worth of pointers in something like .2 seconds on modern
hardware, give or take (especially if you can leverage parallelism).
Given that much of the heap will generally be occupied by non-pointer
data in an imperative language, such as strings, bitmaps, or numerical
data, this is plenty for typical interactive applications (you will
often have interaction hiccups of comparable size from other system
sources) and reasonably written server-side software can use a
multi-process approach to limit the per-process heap size. For batch
programs and most command line tools, latency matters little, anyway.
Throughput-wise, you should be able to perform on par with something
like jemalloc, possibly faster, especially if you have compiler support
(to e.g. inline pool allocations and avoid unnecessary zeroing of newly
allocated memory if you statically know the object's size and layout).
You generally won't be able to beat specialized allocators for
throughput, though, just general purpose allocators.

Note that manual memory allocation is not a free lunch, either. As an
extreme case, normal reference counting is like having a very expensive
write barrier that even applies to stack frames. Manual allocation only
really wins out for throughput if you can use optimized allocation
techniques (such as a pool/arena allocator where you throw the
pool/arena away at the end).

The problem with the D garbage collector is that it doesn't do well in
terms of either throughput or latency. Even so, GC work is proportional
to allocation work, so the impact on throughput in an imperative
language like D is often less than you think, even if the GC doesn't
have good performance. For example, I've used Tiny GC [1] in real world
C/C++ applications where I needed to avoid external dependencies and the
code still had good performance. That despite the fact that due its
design (a conservative GC based on system malloc()), its throughput
isn't particularly great.

Fun fact: Erlang uses a stop-the-world collector (or used to, I haven't
checked in a while), but as it is a distributed system with tons of
processes with small heaps, it can also keep latency low, as each
individual collection only has to traverse a limited amount of memory.

In order to get incremental GC with low pause times, you will need one
of several approaches:

-   Write barriers.
-   Read barriers.
-   Snapshot techniques.
-   Optimized reference counting.

In practice, read barriers and snapshots without hardware support are
often too inefficient for typical programming languages. Read barriers
are still used (with heavy compiler support) in Azul's GC and
Shenandoah, because (inter alia) they allow you to get around the last
problem for extremely low latency GC when it comes to scanning stacks on
a multi-threaded system.

Snapshot-based systems are rare, but there was at one point a
fork()-based snapshot collector for D1. The dual problem with that is
that fork() is POSIX-specific and fork() does not get along well with
threads. (It's possible, but a pain in the neck.)

Write barriers are usually the technique of choice, because they only
incur overhead when pointers are written to the heap, so they have
fairly low and predictable overhead.

Write barriers can also improve throughput. If you use generational
collection *and* the generational hypothesis holds (i.e. you have a fair
amount of short-lived objects), then the cost of the write barrier is
usually more than offset by the greatly reduced tracing work necessary.
You will need a write barrier for generational collection, anyway, so
you get the incremental part for (mostly) free. Write barriers are also
particularly attractive in functional (including mostly functional)
languages, where pointer writes to the heap are uncommon to begin with.
That said, in an imperative language you can often avoid having
short-lived heap-allocated objects, so the generational hypothesis may
not always apply. The most common use cases where generational GC might
benefit D are probably if you are doing lots of work with strings or are
incrementally growing a dynamic array starting from a small size.

The overhead of a write barrier is context-dependent. Probably the worst
case is something like sorting an array in place, where each pointer
swap will require two write barriers. That said, these situations can be
worked around, especially in an optimizing compiler that can elide
multiple write barriers for the same object, and more typical code will
not write pointers to the heap that much to begin with.

Basic reference counting is normally dog slow due to memory locality
issues. However, you can optimize the most performance degrading case
(assignments to local variables) in a couple of ways. You can optimize
away some, which is what Swift does [2]. Or you can use deferred
reference counting, which counts only references from global variables
and the heap, but requires occasional stack scanning. Reference counting
is less attractive for the multi-threaded case with shared heaps,
because atomic reference counting is even slower; competitive
implementations of deferred reference counting for Java therefore buffer
increments and decrements in thread-local storage and process them in
batches [3]. Also, additional work has to be done to deal with cycles.

And, of course, there are various hybrid approaches, such as
generational collection for the minor heap and reference counting for
the major heap.

Going back to D, the obvious low-hanging fruit is to bring its
throughput up to par with modern stop-the-world collectors. If your
latency requirements are much higher than what that can give you,
chances are that you are already operating in a highly specialized
domain and have very specific requirements where you'd also have to
invest non-trivial effort into tuning a general purpose GC.

That said, having (optional) support for write barriers in the compiler
would be pretty nice, though I don't know how much effort that would be.

[1] https://github.com/ivmai/tinygc

[2] https://github.com/apple/swift/blob/master/docs/ARCOptimization.rst

[3] See David Bacon's seminal paper at
https://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon01Concurrent.pdf

Forums