January 05, 2019
On Saturday, January 5, 2019 3:05:19 PM MST Manu via Digitalmars-d wrote:
> Is progress possible, or is the hard reality that the language is just designed such to be resistant to a quality GC, while the ecosystem sadly tends to rely on it?

There are some fundamental limitations put on a GC for D, because D is a systems language and is not particularly limited in what it can do, rendering a lot of the stuff that languages like Java or Go do with their GC impossible. And it's not that uncommon in GC discussions that someone suggests something that won't work even though it initially seems like it should - e.g. having separate memory pools for each thread doesn't work with how D currently works; at minimum, how casting to and from shared and immutable works would have to be changed (which would impact performance). And at one point, Daniel Murphy explained some other technical hurdles that he didn't seem to think were surmountable, but unfortunately, I can't remember the details now - something to do with the vtable and TypeInfo, I think, but I could be misremembering. In any case, a lot of the stuff that would typically be done to improve a GC and make it more modern is off the table with D.

But that doesn't mean that D's GC can't be improved. Work has been done on that - e.g. Martin Nowak was going to give a talk at dconf 2015 about the improvements that he had made to the GC, but he missed his plane, so we unfortunately didn't get to hear that talk. Other work has been done towards GC improvements, some of which has been merged, and some of which hasn't. For instance, IIRC, Rainer made improvements to how precise the GC is a year or two back (though I don't think that it's fully precise yet, and I don't know how possible it is to make it fully precise). Also, destructors now get called for structs on the heap in many cases, whereas before, that pretty much never happened. Other work has been done towards more radical changes to the GC, though I don't think that any of it has made it in (e.g. the forking GC that Sociomantic worked on, which significantly reduced how long the world got stopped - but AFAIK, no one has yet figured out how to make that work on Windows, so none of that work has made it into druntime). So, while not a huge amount of work has been done on the GC (at least not work that has made it in yet), improvements have been made. They just don't get advertised much when they happen, and it's a favorite assumption that GC performance sucks. We can and should improve the GC, but that requires that folks spend time on it, and it's not easy.

And really, for the most part, it's just folks saying that the GC is a problem without any real evidence to back it up, whereas a number of folks in this newsgroup have talked about how they used the GC for stuff that was highly performance-sensitive, and all they had to do was keep it out of a particular piece of code for their programs to be quite performant. I don't think that there's any question that some types of workloads are heavily impacted by the GC, but if anything, experience seems to show that D's GC isn't a performance problem in most cases (particularly given how typical it is for D code to use structs on the stack where possible rather than allocating on the heap), and when it is a problem, it's often possible to adjust sections of your code so that it isn't anymore. But for the most part, what we have to go on is anecdotal evidence. There isn't much in the way of actual performance comparisons with the GC (at least not that get made public), and it's just a common assumption that D's GC is bad and hurts performance. How true that really is in practice is an open question, and the answer is going to depend heavily on what your code is doing. D's idioms do tend to lean towards minimizing heap usage of any kind, which usually improves performance regardless of whether the allocations you avoid would have come from the GC heap or from somewhere else.

> Where's the ARC stuff? What happened to opAddRef/opRelease?

I don't know that we're ever going to get ARC specifically, but as I understand it, whatever reference counting stuff Walter was working on got put on hold for DIP 1000, because he needed DIP 1000 in order to make the reference counting @safe. There's also the issue of copy constructors and Andrei's __mutable proposal, which could impact what happens (since right now, as soon as const or immutable get involved, reference counting doesn't work). So, basically, language improvements needed to be made in order to make any kind of built-in reference counting work (at least with @safe or with const/immutable), and I don't know how much more needs to be completed before a reference-counting mechanism can be added. At minimum, DIP 1000 will have to be completed.
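To make the const/immutable problem concrete, here's a hypothetical minimal reference-counted wrapper (all names invented for illustration, not any proposed design). Both copying and destruction must mutate the shared count, which the type system forbids through a const reference -- the gap that something like __mutable is meant to fill:

```d
// Sketch only: a toy reference-counted wrapper. Real code would use
// malloc/free; here the payload is GC-allocated to keep the sketch short.
struct RC(T)
{
    private struct Payload { T value; size_t count; }
    private Payload* payload;

    this(T value) { payload = new Payload(value, 1); }

    this(this)  // postblit: copying bumps the count...
    {
        if (payload !is null)
            ++payload.count;  // ...which is illegal if `this` is const
    }

    ~this()
    {
        if (payload !is null && --payload.count == 0)
            payload = null;  // real code would free the payload here
    }

    ref T get() { return payload.value; }
}
```

A `const RC!int` copy would need to run that postblit mutation, which const forbids, so reference counting currently stops working as soon as const or immutable enter the picture.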

- Jonathan M Davis



January 05, 2019
On Sat, Jan 05, 2019 at 11:12:52PM +0000, Neia Neutuladh via Digitalmars-d wrote:
> On Sat, 05 Jan 2019 14:05:19 -0800, Manu wrote:
> > I'm somewhere between a light GC user and a @nogc user, and I don't really know much about where we're at, or much about state-of-the-art GC in general.
> 
> I use the GC unabashedly and only try to make sure I reuse memory when it's reasonably convenient. I've also looked into GC a bit.

I also use the GC freely, and only bother with GC optimization when my profiler shows that there's an actual problem.

As I've said numerous times before, unless you're working on extremely time-sensitive code like real-time applications or 3D game engines, D's GC usually does not make much of a noticeable difference, barring extremes like allocating tens of millions of small strings (or other small objects) per second, or allocating huge objects rapidly and expecting unreferenced memory to be quickly reused.

And when the GC does start becoming a problem, there are often easy first-stab solutions that offer improvements that suffice for most cases, the most obvious one being calling GC.disable and then scheduling GC.collect yourself at more convenient times than when the default system might run collections. This requires the least code changes, and IME often yields quite good improvements.
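As a sketch of that first-stab strategy (function names here are placeholders, not any particular program):

```d
import core.memory : GC;

// Placeholder workload: allocations still succeed while the GC is disabled.
int processBatch(int i)
{
    auto tmp = new int[](1024);
    tmp[] = i;
    return tmp[0];
}

int runBatches(int n)
{
    GC.disable();              // no automatic collections from here on
    scope (exit) GC.enable();

    int last;
    foreach (i; 0 .. n)
    {
        last = processBatch(i);
        if (i % 100 == 0)      // collect at points we choose, not whenever
            GC.collect();      // an allocation happens to trigger one
    }
    return last;
}
```

Note that GC.disable doesn't stop allocation; it only suppresses automatic collections, and the runtime may still collect if it would otherwise run out of memory.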

Past that, the next most obvious step is to reduce GC pressure by reusing frequently-allocated objects -- standard advice for improving GC performance, y'know.  This requires a bit more work: identify code hotspots with a profiler and examine which objects are being allocated most frequently, then use stdx.allocator to allocate from a pool instead, or just retain the object in a cache and reuse it the next time round, etc.
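One cheap way to do that reuse in D (invented names, just a sketch): keep a scratch buffer around and reset its length between uses, so appends reuse the existing capacity instead of allocating each call.

```d
int[] scratch;  // reused across calls instead of allocating each time

int[] fillScaled(int n, int factor)
{
    scratch.length = 0;
    scratch.assumeSafeAppend();  // let ~= reuse the old capacity
    foreach (i; 0 .. n)
        scratch ~= i * factor;
    return scratch;
}
```

The usual caveat applies: assumeSafeAppend is only safe when no other slice still references the buffer's old contents.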

If the GC is still an issue after this, look into preallocating objects outside your main loop, so that you can control exactly when GC pauses happen.  Again, standard advice for GC optimization.

Past this point, if GC remains a big issue, you could start pulling out @nogc and using malloc/free, etc. (Though I might do malloc/free in the previous step if I know there are big objects I'm gonna need, and they have straightforward lifetimes that are easy to track -- this reduces the size of the GC heap, and thereby also improves collection performance.)
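A minimal sketch of that last resort (the function is invented for illustration): a @nogc routine using a scoped buffer from the C heap, so none of it is ever seen by the GC.

```d
import core.stdc.stdlib : free, malloc;

@nogc nothrow
double sumSquares(size_t n)
{
    auto buf = cast(double*) malloc(n * double.sizeof);
    if (buf is null)
        return 0.0;            // real code would handle OOM properly
    scope (exit) free(buf);    // trivially scoped lifetime, easy to track

    double total = 0.0;
    foreach (i; 0 .. n)
    {
        buf[i] = cast(double)(i * i);
        total += buf[i];
    }
    return total;
}
```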

//

As far as improving the GC itself is concerned, that would surely be nice and an overall win for D, and we certainly shouldn't delay in doing it. But I don't think it's a life-or-death problem that we must fix Right Here And Now Or Else(tm).


> > How much truth is in here?
> 
> D uses a simple GC. Simple means easy to reason about. It's also got better tools to stress the GC less.

Yeah, I find that with my recent, more idiomatic D code, I tend to use ranges with lazy evaluation a lot more than plain arrays, meaning that where I'd normally allocate arrays in equivalent C/C++ code, in D no allocation (or significantly less allocation) actually happens because of lazy evaluation.  The one place where I still tend to allocate more is string manipulation, but even there, using slices instead of copying substrings like in C/C++ still reduces the actual number of allocations.
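For instance (an invented example), this pipeline performs no heap allocation at all, where a C/C++ equivalent would likely materialize at least one intermediate array:

```d
import std.algorithm : filter, map;
import std.range : iota;

// The whole pipeline is lazy structs on the stack, hence @nogc.
@nogc nothrow
long sumOfEvenSquares(int n)
{
    long total = 0;
    foreach (x; iota(n).filter!(i => i % 2 == 0).map!(i => cast(long) i * i))
        total += x;
    return total;
}
```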

The situation is significantly different from GC-heavy languages like Java, where basically every non-trivial type requires an allocation with few or no alternatives or ways of bypassing it. The absence of by-value aggregates in Java causes your average code to allocate far more than the equivalent D code, not to mention the heavy OO emphasis often leads to many indirections like interfaces and vtables, which often also entail allocating adaptor objects, wrappers, etc..  So in Java, GC performance would play a much larger role in overall performance, whereas in D, the prevalence of by-value types like ranges, and passing small parcels of information as structs rather than classes makes the GC pressure much lower, so GC performance is not as significant.

(Of course, it's possible for Java compilers to optimize away some allocations if object lifetimes can be statically determined -- I don't know if any Java compilers actually do this. Still, Java code does tend to be more allocation-heavy than typical D code.)


> But one thing it could get that would be interesting is a thread-local GC.

Yes, yes, and yes!  This would allow per-thread segregation of the GC heap, which would allow much better control of GC pauses.


> (You'd have a stop-the-world phase that happened infrequently. Casting something to shared would pin it in the thread it was allocated from.)

How does this solve the problem of shared, though?  The last time I checked, casting to/from shared is the main showstopper for a thread-local GC.


[...]
> > Is progress possible, or is the hard reality that the language is just designed such to be resistant to a quality GC, while the ecosystem sadly tends to rely on it?
> 
> Three things about D make it harder to make a good GC for it:
> * unaligned pointers
> * unions
> * externally allocated memory (malloc and friends)
> 
> We've pretty much addressed malloc by telling people to manually add and remove malloced memory from what the GC scans. A union is pretty much just a pointer that might not be valid. Unaligned pointers just kind of suck.

Unaligned pointers are generally just a bad idea IMO.  I'm tempted to say they should be defined as UB, along with obscured pointers (like the trick of building a doubly-linked list with 1 pointer per node using XOR), maybe with a compiler / runtime switch to enable a more conservative GC for code that needs them.

Now that I think of it, we could deal with pointers in unions the same way -- if the compiler detects it, then trigger conservative mode in the GC.
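For reference, the problem case is as simple as this (illustrative type): the GC has no way to know which member is live, so the word must be treated as a possible pointer.

```d
// Whether `bits` merely *looks* like a pointer is undecidable at scan
// time, so a precise GC must fall back to conservative scanning here.
union Slot
{
    int* ptr;     // if active, the target must be kept alive
    size_t bits;  // if active, may coincidentally resemble a pointer
}
```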

With these two out of the way, a generational GC for D seems closer to the realm of possibility.


T

-- 
"How are you doing?" "Doing what?"
January 06, 2019
On Sunday, 6 January 2019 at 04:34:30 UTC, H. S. Teoh wrote:
> On Sat, Jan 05, 2019 at 11:12:52PM +0000, Neia Neutuladh via
>> But one thing it could get that would be interesting is a thread-local GC.
>
> Yes, yes, and yes!  This would allow per-thread segregation of the GC heap, which would allow much better control of GC pauses.
>
>
>> (You'd have a stop-the-world phase that happened infrequently. Casting something to shared would pin it in the thread it was allocated from.)
>
> How does this solve the problem of shared, though?  The last time I checked, casting to/from shared is the main showstopper for a thread-local GC.

The recent shared thread by Manu was trying (only somewhat successfully) to tie shared to @safe/@trusted with an inductive model, i.e. the user writes a @trusted foundational layer and @safety is verified from there on up the call stack. I wonder if that sort of model would safely permit a thread-local GC?

>> Three things about D make it harder to make a good GC for it:
>> * unaligned pointers
>> * unions
>> * externally allocated memory (malloc and friends)
>> 
>> We've pretty much addressed malloc by telling people to manually add and remove malloced memory from what the GC scans. A union is pretty much just a pointer that might not be valid. Unaligned pointers just kind of suck.
>
> Unaligned pointers are generally just a bad idea IMO.  I'm tempted to say it should be defined as UB, along with obscured pointers (like the doubly-linked list with 1 pointer per node trick using XOR). Maybe with a compiler / runtime switch to enable a more conservative GC.
>
> Now that I think of it, we could deal with pointers in unions the same way -- if the compiler detects it, then trigger conservative mode in the GC.
>
> With these two out of the way, a generational GC for D seems closer to the realm of possibility.

With a precise GC that knows the types/layout of what it's dealing with, it ought to be possible to have a way to retrieve all the pointers from a type for scanning. For an inactive union, that would return nothing. It could possibly work for an XOR list as well, not that I think we should be letting that kind of monstrosity dictate our designs.  Ditto for unaligned pointers.

January 05, 2019
On Sat, Jan 05, 2019 at 08:38:34PM -0700, Jonathan M Davis via Digitalmars-d wrote: [...]
> But that doesn't mean that D's GC can't be improved. Work has been done on that - e.g. Martin Nowak was going to do a talk at dconf 2015 about the improvements to the GC that he had done, but he missed his plane, so we unfortunately didn't get to hear that talk. Other work has been done towards GC improvements, some of which has been merged, and some of which hasn't.  For instance, IIRC, Rainer made improvements to how precise the GC was a year or two back (though I don't think that it's yet fully precise, and I don't know how possible it is to make it fully precise). Also, destructors now get called for structs on the heap in many cases, whereas before, that pretty much never happened.

Yeah, glancing over druntime's git history, it appears that since around May last year, there's been a slow but steady trickle of fixes / improvements committed to the GC.  So work *is* happening, just perhaps not at the rate people might want, but the GC *has* improved over the years.


> Other work has been done towards more radical changes to the GC, though I don't think that any of that has made it in (e.g. the forking GC that Sociomantic worked on which significantly reduced how long the world got stopped - but AFAIK, no one has yet figured out how to make that work on Windows, so none of that work has made it into druntime).

I wonder if we could just make that forking GC a plugin available on Linux / Posix?  IIRC improvements have been made over recent years to better encapsulate the GC API, so drop-in replacements should be possible, at least in theory.

OTOH, the last time I heard about it, the codebase of the forking GC was too tied to the (very) old version of druntime it was originally based on, and it may have been Tango-specific as well, so a lot of work would have to be put in to make it work with the modern druntime.  As long as nobody is willing to put in the effort, nothing will materialize.


[...]
> And really, for the most part, it's just folks saying that the GC is a problem without any real evidence to back it up, whereas there are a number of cases of folks in this newsgroup talking about how they used the GC for stuff that was highly performance sensitive, and all they had to do was avoid having it be used in some particular piece of code in order to make it so that their programs were quite performant.

Indeed.  One particular project of mine -- which may be an atypical workload, depending on what you usually work with -- involves batch-processing an extremely CPU-intensive computation that stores data in an AA which grows to very large sizes, along with a couple of arrays and other simple structures.  To maximize throughput, I used a profiler.  After optimizing the few hotspots that it found, I started seeing GC collection cycles show up at the top of the profiler output, so I set about investigating what could be done to improve things.

After some experiments, I found that collections were taking place frequently, and each one took a long time.  Since the AA only ever grew (nothing gets deleted from it), frequent collections were not really necessary, so I settled on the simple strategy of calling GC.disable at the beginning of the program and scheduling GC.collect calls myself at a lower frequency.  I managed to cut total running times by about 40-50% just by tweaking the frequency of the collection cycles.

Then after that, closer investigation revealed a very frequent reallocation of a small array inside an inner loop, which was causing a lot of GC pressure.  Refactoring the code to reuse previous arrays where possible, I reduced GC pressure sufficiently that I could scale back the collection cycle frequency even more -- I managed to reduce total running time by about another 10-20% or so.

These are all rather simple, localized code changes -- I'm quite certain even more improvements can be made were I to spend the time to refactor the code more.


[...]
> But for the most part, what we have to go on is just anecdotal evidence. There isn't much in the way of actual performance comparisons with the GC (at least not that get made public), and it's just a common assumption that D's GC is bad and hurts performance. How true that really is in practice is an open question.
[...]

Well, then what about doing some actual measurements with a well-known D codebase?  Let's get an actual profiling of GC performance of an actual project and see how the numbers look.  Maybe compare it with equivalent Java code or something with a known "advanced" GC.  Or compare against different workloads to see how the GC does.

Then post the results on the wiki or something, with concrete measurements that we can point people to when they complain about the GC.


T

-- 
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
January 05, 2019
On 1/5/2019 8:34 PM, H. S. Teoh wrote:
> Of course, it's possible for Java compilers to optimize away some
> allocations if object lifetimes can be statically determined -- I don't
> know if any Java compilers actually do this.

They do. I forgot the jargon term for it, but they'll determine if the lifetime does not escape the function scope, and allocate it on the stack instead of the heap.
January 06, 2019
On Sunday, 6 January 2019 at 06:57:40 UTC, Walter Bright wrote:
> On 1/5/2019 8:34 PM, H. S. Teoh wrote:
>> Of course, it's possible for Java compilers to optimize away some
>> allocations if object lifetimes can be statically determined -- I don't
>> know if any Java compilers actually do this.
>
> They do. I forgot the jargon term for it, but they'll determine if the lifetime does not escape the function scope, and allocate it on the stack instead of the heap.

Heap to stack promotion? BTW LDC does this https://github.com/ldc-developers/ldc/blob/master/gen/passes/GarbageCollect2Stack.cpp
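D also exposes a manual form of this optimization: a `scope`-qualified class reference is placed on the stack, since the compiler knows the object can't outlive the function (class and function names below are illustrative).

```d
class Point
{
    int x, y;
    this(int x, int y) { this.x = x; this.y = y; }
}

int manhattan()
{
    scope p = new Point(3, 4);  // allocated on the stack, no GC involvement
    return p.x + p.y;
}
```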
January 06, 2019
On 1/5/2019 11:31 PM, Nicholas Wilson wrote:
> Heap to stack promotion? BTW LDC does this https://github.com/ldc-developers/ldc/blob/master/gen/passes/GarbageCollect2Stack.cpp 


I suppose that's as good a word as any!

I just remember SROA as a candidate for the most pretentious, obtuse compiler acronym ever, right behind SFINAE.
January 07, 2019
On Sunday, 6 January 2019 at 08:55:31 UTC, Walter Bright wrote:
> On 1/5/2019 11:31 PM, Nicholas Wilson wrote:
>> Heap to stack promotion? BTW LDC does this https://github.com/ldc-developers/ldc/blob/master/gen/passes/GarbageCollect2Stack.cpp
>
>
> I suppose that's as good a word as any!
>
> I just remember SROA as a candidate for the most pretentious, obtuse compiler acronym ever, right behind SFINAE.

SROA is at least intuitive in what it does, and makes sense when you consider padding. SFINAE, OTOH - I know what it stands for, but I have no idea what it does or why. Mind you, that's par for the course for C++.
January 06, 2019
On 1/6/2019 7:05 PM, Nicholas Wilson wrote:
> SROA is at least intuitive in what is does and makes sense when you consider padding,

I would have called it something like Aggregate Slicing, or simply Slice-O-Matic :-)

> SFINAE OTOH, I know what it stands for, I have no idea what it does or why.

SFINAE => Substitution Failure Is Not An Error

It means if a function template instantiation does not produce compilable code, the function is (silently) not instantiated. In D we do this with template constraints.
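A sketch of the contrast (invented example): where C++ relies on substitution failure to silently drop a template from overload resolution, D states the requirement up front as a constraint.

```d
import std.traits : isNumeric;

// Instantiable only for numeric types; for anything else, the template
// simply isn't a candidate -- no substitution-failure machinery needed.
T twice(T)(T x) if (isNumeric!T)
{
    return x + x;
}

// twice("hi");  // rejected at the call site with a constraint error
```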