January 06, 2019
On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
> How much truth is in here?
> What is this business about write barriers making the GC fast? Is that
> actually fact, or are there other hurdles?

First of all, stop-the-world GC gets an unnecessarily bad rap these
days. A high throughput stop-the-world GC can actually be pretty
competitive in terms of throughput, and latency may be lower than what
you'd expect. You probably aren't going to run AAA video games with one,
but there are many application domains where latency just doesn't matter
or doesn't matter as much; even normal interactive use (such as your
typical GUI application) often works plenty fine. Stop-the-world
collectors are more of a problem for a functional programming language
(or languages that exercise allocation-heavy functional idioms a lot),
but less so for imperative languages that have more control over
allocation frequency.

In general, a modern precise stop-the-world collector should be able to
handle 1GB worth of pointers in something like .2 seconds on modern
hardware, give or take (especially if you can leverage parallelism).
Given that much of the heap will generally be occupied by non-pointer
data in an imperative language, such as strings, bitmaps, or numerical
data, this is plenty for typical interactive applications (you will
often have interaction hiccups of comparable size from other system
sources) and reasonably written server-side software can use a
multi-process approach to limit the per-process heap size. For batch
programs and most command line tools, latency matters little, anyway.
Throughput-wise, you should be able to perform on par with something
like jemalloc, possibly faster, especially if you have compiler support
(to e.g. inline pool allocations and avoid unnecessary zeroing of newly
allocated memory if you statically know the object's size and layout).
You generally won't be able to beat specialized allocators for
throughput, though, just general purpose allocators.

Note that manual memory allocation is not a free lunch, either. As an
extreme case, normal reference counting is like having a very expensive
write barrier that even applies to stack frames. Manual allocation only
really wins out for throughput if you can use optimized allocation
techniques (such as a pool/arena allocator where you throw the
pool/arena away at the end).
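
To give a concrete picture of the kind of optimized technique that wins on
throughput, here is a minimal bump-the-pointer arena. This is a sketch, not a
drop-in allocator; the growth policy, error handling, and alignment are
deliberately simplified.

```d
import core.stdc.stdlib : malloc, free;

struct Arena
{
    ubyte* base;
    size_t used, capacity;

    static Arena create(size_t bytes)
    {
        return Arena(cast(ubyte*) malloc(bytes), 0, bytes);
    }

    // Allocation is just a pointer bump; no per-object bookkeeping.
    void* alloc(size_t bytes)
    {
        size_t aligned = (bytes + 15) & ~cast(size_t) 15;
        if (used + aligned > capacity) return null;   // a real arena would grow here
        void* p = base + used;
        used += aligned;
        return p;
    }

    // Throwing the whole arena away at once is what makes the scheme cheap.
    void releaseAll()
    {
        free(base);
        base = null;
        used = capacity = 0;
    }
}
```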

The problem with the D garbage collector is that it doesn't do well in
terms of either throughput or latency. Even so, GC work is proportional
to allocation work, so the impact on throughput in an imperative
language like D is often less than you think, even if the GC doesn't
have good performance. For example, I've used Tiny GC [1] in real world
C/C++ applications where I needed to avoid external dependencies and the
code still had good performance. That is despite the fact that, due to its
design (a conservative GC based on the system malloc()), its throughput
isn't particularly great.

Fun fact: Erlang uses a stop-the-world collector (or used to, I haven't
checked in a while), but as it is a distributed system with tons of
processes with small heaps, it can also keep latency low, as each
individual collection only has to traverse a limited amount of memory.

In order to get incremental GC with low pause times, you will need one
of several approaches:

-   Write barriers.
-   Read barriers.
-   Snapshot techniques.
-   Optimized reference counting.

In practice, read barriers and snapshots without hardware support are
often too inefficient for typical programming languages. Read barriers
are still used (with heavy compiler support) in Azul's GC and
Shenandoah, because (among other things) they let you get around the
remaining obstacle to extremely low latency GC: scanning the stacks of a
multi-threaded system.

Snapshot-based systems are rare, but there was at one point a
fork()-based snapshot collector for D1. The twofold problem with that is
that fork() is POSIX-specific and that it does not get along well with
threads. (It's possible, but a pain in the neck.)

Write barriers are usually the technique of choice, because they only
incur overhead when pointers are written to the heap, so they have
fairly low and predictable overhead.
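
As a rough picture of what a compiler-inserted barrier can look like, here is
a hypothetical card-marking write barrier. The names and the card size are
invented for the example, and the current D runtime has nothing like this;
the idea is simply that every pointer store into the heap marks the card
covering the written slot, so the collector only re-examines dirty cards.

```d
enum cardShift = 9;                 // 512-byte cards (an assumed size)
__gshared ubyte[] cardTable;        // one byte per card, set up by the runtime
__gshared void* heapBase;           // start of the managed heap

// The compiler would lower `*slot = value;` (for pointer stores into the
// heap) to a call like this, or inline it.
void storePointer(void** slot, void* value)
{
    *slot = value;                                                  // the actual store
    size_t card = (cast(size_t) slot - cast(size_t) heapBase) >> cardShift;
    cardTable[card] = 1;                                            // remember the dirty card
}
```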

Write barriers can also improve throughput. If you use generational
collection *and* the generational hypothesis holds (i.e. you have a fair
amount of short-lived objects), then the cost of the write barrier is
usually more than offset by the greatly reduced tracing work necessary.
You will need a write barrier for generational collection, anyway, so
you get the incremental part for (mostly) free. Write barriers are also
particularly attractive in functional (including mostly functional)
languages, where pointer writes to the heap are uncommon to begin with.
That said, in an imperative language you can often avoid having
short-lived heap-allocated objects, so the generational hypothesis may
not always apply. The most common use cases where generational GC might
benefit D are probably if you are doing lots of work with strings or are
incrementally growing a dynamic array starting from a small size.
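
As a tiny illustration of the second case (ordinary D, nothing hypothetical
here):

```d
void main()
{
    int[] squares;
    foreach (i; 0 .. 1_000_000)
        squares ~= i * i;   // each capacity bump abandons the old backing store,
                            // leaving exactly the short-lived garbage a
                            // generational collector is good at reclaiming
}
```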

The overhead of a write barrier is context-dependent. Probably the worst
case is something like sorting an array in place, where each pointer
swap will require two write barriers. That said, these situations can be
worked around, especially in an optimizing compiler that can elide
multiple write barriers for the same object, and more typical code will
not write pointers to the heap that much to begin with.
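
To make the worst case concrete, this is roughly what each swap inside the
sort lowers to, reusing the hypothetical storePointer() from the sketch
above. An optimizer that can prove both slots sit in the same card (or the
same object) could merge the two marks into one.

```d
// Two heap pointer stores per swap means two barriers per swap.
void swapSlots(void** a, void** b)
{
    void* tmp = *a;
    storePointer(a, *b);   // barrier #1
    storePointer(b, tmp);  // barrier #2
}
```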

Basic reference counting is normally dog slow due to memory locality
issues. However, you can optimize the most performance-degrading case
(assignments to local variables) in a couple of ways. You can optimize
some of them away, which is what Swift does [2]. Or you can use deferred
reference counting, which counts only references from global variables
and the heap, but requires occasional stack scanning. Reference counting
is less attractive for the multi-threaded case with shared heaps,
because atomic reference counting is even slower; competitive
implementations of deferred reference counting for Java therefore buffer
increments and decrements in thread-local storage and process them in
batches [3]. Also, additional work has to be done to deal with cycles.
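
A minimal, single-threaded sketch of the deferred variant, with invented
names and an assumed header-before-object layout, just to show where the
work goes: only heap and global pointer stores touch the counts, and a count
of zero merely makes the object a candidate until the next stack scan.

```d
struct Header { size_t rc; }

// Assumed layout: the header sits immediately in front of the object.
Header* headerOf(void* obj)
{
    return (cast(Header*) obj) - 1;
}

__gshared void*[] zeroCountTable;   // reclamation candidates, checked at scan time

// Compiler-inserted for pointer stores into the heap or into globals;
// stores into stack variables are deliberately left uncounted.
void writeRef(void** slot, void* newObj)
{
    void* old = *slot;
    if (newObj) ++headerOf(newObj).rc;
    if (old && --headerOf(old).rc == 0)
        zeroCountTable ~= old;      // might still be live via a stack reference
    *slot = newObj;
}

// A periodic step scans the stacks; entries in zeroCountTable with no stack
// reference are freed, the rest wait for the next scan.
```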

And, of course, there are various hybrid approaches, such as
generational collection for the minor heap and reference counting for
the major heap.

Going back to D, the obvious low-hanging fruit is to bring its
throughput up to par with modern stop-the-world collectors. If your
latency requirements are much tighter than what that can give you,
chances are that you are already operating in a highly specialized
domain and have very specific requirements where you'd also have to
invest non-trivial effort into tuning a general purpose GC.

That said, having (optional) support for write barriers in the compiler
would be pretty nice, though I don't know how much effort that would be.

[1] https://github.com/ivmai/tinygc

[2] https://github.com/apple/swift/blob/master/docs/ARCOptimization.rst

[3] See David Bacon's seminal paper at
https://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon01Concurrent.pdf

January 06, 2019
On 1/6/2019 1:42 AM, Manu wrote:
> Why did opInc/opDec fail? It was never available in any experimental
> form, I was never able to try it out. I've been waiting eagerly for
> almost a decade...

There was a looong thread about it. It was never implemented, because major problems were found with it before implementation even started.

See:

https://digitalmars.com/d/archives/digitalmars/D/draft_proposal_for_ref_counting_in_D_211885.html

There's more on the Dlang-study mailing list:

Dlang-study@puremagic.com
http://lists.puremagic.com/cgi-bin/mailman/listinfo/dlang-study
January 06, 2019
On 1/6/2019 1:40 AM, Manu wrote:
> It's all irrelevant though, Nobody's asking for multiple pointer types
> in D. All I want from language support for ARC in D, is an opInc/opDec
> function which are called appropriately around assignments, and elided
> appropriately.

It turns out to be fiendishly difficult to automatically elide counter bumps.
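
To give a flavour of why (this is an illustration, not taken from the thread
linked below): the hard part is proving that nothing between the increment
and the matching decrement can drop the last *other* reference. rcInc/rcDec
stand in for the proposed opInc/opDec, and globalRef / somethingOpaque() are
invented for the example.

```d
struct Node { size_t rc; int payload; }

__gshared Node* globalRef;          // some globally reachable, ref-counted object

void rcInc(Node* p) { if (p) ++p.rc; }
void rcDec(Node* p) { if (p && --p.rc == 0) { /* free p */ } }

// Stands in for any code the compiler cannot see through; it might rebind
// globalRef and thereby drop that reference's count.
void somethingOpaque() { }

void example()
{
    Node* local = globalRef;
    rcInc(local);                   // tempting to elide this bump...
    somethingOpaque();              // ...but if globalRef gets rebound here,
                                    // its rcDec may free the object
    int x = local.payload;          // use-after-free if the bump was elided
    rcDec(local);
}
```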

> copy ctor's can't offer this functionality.

They can produce a working ref counting solution.

D's will have a couple fundamental advantages over the C++ one:

1. D won't need the locking on the increment, because D has different types for threaded vs non-threaded.

2. With dip25 and dip1000, D can offer memory safe access to rc object contents.
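
For the curious, a rough sketch of the shape a copy-constructor-based
solution can take. The names, the malloc-backed payload, and the
single-threaded (no locking) assumption are mine, not the DIP's actual
design; the point is that the copy constructor plays the role of opInc and
the destructor the role of opDec.

```d
import core.lifetime : emplace;
import core.stdc.stdlib : malloc, free;

struct RC(T)
{
    private static struct Payload { size_t count; T value; }
    private Payload* payload;

    this(T value)
    {
        payload = cast(Payload*) malloc(Payload.sizeof);
        emplace(payload, 1, value);
    }

    this(ref return scope RC rhs)      // copy construction bumps the count
    {
        payload = rhs.payload;
        if (payload) ++payload.count;
    }

    ~this()                            // destruction drops it, frees at zero
    {
        if (payload && --payload.count == 0)
        {
            destroy(payload.value);
            free(payload);
        }
    }

    ref T get() return { return payload.value; }
}
```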

January 06, 2019
On 1/6/2019 3:32 AM, Russel Winder wrote:
> Java GC continues to change and improve in leaps and bounds, both speed and
> latency. And indeed a lack of "stop the world time". The G1 GC that was seen
> as the very best there could be two Java versions ago has been surpassed again
> with Shenandoah. JVM GC people just keep coming up with ways of killing off GC
> cost.

1. Java has a very constrained interface to C, with a lot of rules to follow, mainly so the GC is not corrupted. D, being a systems programming language, simply does not have that luxury.

2. Let me know when Java lets go of write barriers!

January 06, 2019
On Sun, Jan 06, 2019 at 01:25:32PM -0800, Walter Bright via Digitalmars-d wrote:
> On 1/6/2019 1:40 AM, Manu wrote:
> > It's all irrelevant though, Nobody's asking for multiple pointer types in D. All I want from language support for ARC in D, is an opInc/opDec function which are called appropriately around assignments, and elided appropriately.
> 
> It turns out to be fiendishly difficult to automatically elide counter bumps.

Just out of curiosity, any concrete examples of difficulties that prevent easy elision of counter bumps? Just so we have a better idea of the challenges we're up against.  Are there any common cases that are easy enough to implement, that might be "good enough", leaving more tricky cases aside?


> > copy ctor's can't offer this functionality.
> 
> They can produce a working ref counting solution.

Care to elaborate?  Is it related to how current move + postblit semantics make it impractical to maintain a consistent ref count?


> D's will have a couple fundamental advantages over the C++ one:
> 
> 1. D won't need the locking on the increment, because D has different types for threaded vs non-threaded.
> 
> 2. With dip25 and dip1000, D can offer memory safe access to rc object contents.

dip25 and dip1000 have been around for a long time now. What are the remaining issues blocking their full implementation / deployment?


T

-- 
Give a man a fish, and he eats once. Teach a man to fish, and he will sit forever.
January 06, 2019
On 1/6/2019 6:00 PM, H. S. Teoh wrote:
> On Sun, Jan 06, 2019 at 01:25:32PM -0800, Walter Bright via Digitalmars-d wrote:
>> It turns out to be fiendishly difficult to automatically elide counter
>> bumps.
> Just out of curiosity, any concrete examples of difficulties that
> prevent easy elision of counter bumps? Just so we have a better idea of
> the challenges we're up against.  Are there any common cases that are
> easy enough to implement, that might be "good enough", leaving more
> tricky cases aside?

See:

https://digitalmars.com/d/archives/digitalmars/D/draft_proposal_for_ref_counting_in_D_211885.html

There's more on the Dlang-study mailing list:

Dlang-study@puremagic.com
http://lists.puremagic.com/cgi-bin/mailman/listinfo/dlang-study

>>> copy ctor's can't offer this functionality.
>> They can produce a working ref counting solution.
> Care to elaborate?
https://github.com/RazvanN7/DIPs/blob/efacbf8efde8027291113633984b0e7a039e8f30/DIPs/DIP1xxx-rn.md


>> D's will have a couple fundamental advantages over the C++ one:
>>
>> 1. D won't need the locking on the increment, because D has different
>> types for threaded vs non-threaded.
>>
>> 2. With dip25 and dip1000, D can offer memory safe access to rc object
>> contents.
> 
> dip25 and dip1000 have been around for a long time now. What are the
> remaining issues blocking their full implementation / deployment?

dip25 is becoming the default compiler behavior.

What remains for dip1000 is getting Phobos to compile with dip1000. Now that https://github.com/dlang/dmd/pull/8504 is merged, I still need to improve it with `return` inference.
January 07, 2019
On Monday, 7 January 2019 at 02:00:05 UTC, H. S. Teoh wrote:
> dip25 and dip1000 have been around for a long time now. What are the remaining issues blocking their full implementation / deployment?

For a long while, documentation. See the dates on https://github.com/dlang/dlang.org/pull/2453: a good waste of 4 1/2 months. But the documentation still needs to improve a lot.

dip25 is in the process of being turned on by default as an opt-out -transition switch and going through a deprecation cycle. dip1000 has changed quite a bit since the reviews, and I suspect it will need to go through more once the spec for it is nailed down further (hello dconf?); it too will probably need to be transitioned in.
January 06, 2019
On 1/6/2019 12:28 PM, Reimer Behrends wrote:
> In general, a modern precise stop-the-world collector should be able to
> handle 1GB worth of pointers in something like .2 seconds on modern
> hardware,

I use Thunderbird mail, latest version, and it has random pauses that are up to 10 seconds. It'll happen when I'm in the middle of typing, which is frustrating.

I'm not positive it is a GC pause, maybe it's something else, but the randomness of it suggests it is GC.
January 06, 2019
On Sun, Jan 06, 2019 at 07:34:41PM -0800, Walter Bright via Digitalmars-d wrote:
> On 1/6/2019 12:28 PM, Reimer Behrends wrote:
> > In general, a modern precise stop-the-world collector should be able to handle 1GB worth of pointers in something like .2 seconds on modern hardware,
>
> I use Thunderbird mail, latest version, and it has random pauses that are up to 10 seconds. It'll happen when I'm in the middle of typing, which is frustrating.
> 
> I'm not positive it is a GC pause, maybe it's something else, but the randomness of it suggests it is GC.

I've experienced a similar effect in Firefox, and though I cannot say for sure it isn't a GC problem, I notice that it causes a long spike of intensive I/O and appears to be correlated with occasional segfaults and signs of memory corruption / memory leaks. It generally happens only after significant usage over a prolonged timeframe, ending up in a state of extreme memory usage (several GBs in resident set size for just a small number of persistent tabs) that resets to more reasonable levels upon restarting and restoring exactly the same tabs.


T

-- 
What are you when you run out of Monet? Baroque.
January 07, 2019
On Monday, 7 January 2019 at 06:52:40 UTC, H. S. Teoh wrote:
> On Sun, Jan 06, 2019 at 07:34:41PM -0800, Walter Bright via Digitalmars-d wrote:
>> [...]
>
> I've experienced a similar effect in Firefox, and though I cannot say for sure it isn't a GC problem, I notice that it causes a long spike of intensive I/O and appears to be correlated with occasional segfaults and signs of memory corruption / memory leaks. It generally happens only after significant usage over a prolonged timeframe, ending up in a state of extreme memory usage (several GBs in resident set size for just a small number of persistent tabs) that resets to more reasonable levels upon restarting and restoring exactly the same tabs.

I have that too. It always happens on JavaScript-heavy pages like Facebook or YouTube. Killing the process with the task manager is the quickest way to resolve the issue.