H. S. Teoh
Posted in reply to thebluepandabear
| On Tue, Dec 13, 2022 at 07:11:34AM +0000, thebluepandabear via Digitalmars-d wrote:
> Hello,
>
> I was speaking to one of my friends on D language and he spoke about how he doesn't like D language due to the fact that its standard library is built on top of GC (garbage collection).
>
> He said that if he doesn't want to implement GC he misses out on the standard library, which for him is a big disadvantage.
>
> Does this claim have merit? I am not far enough into learning D, so I haven't touched GC stuff yet, but I am curious what the D community has to say about this issue.
1) No, this claim has no merit. However, I sympathize with the reaction because that's the reaction I myself had when I first found D online. I came from a strong C/C++ background, got fed up with C++ and was looking for a new language closer to my ideals of what a programming language should be. Stumbled across D, which caught my interest. Then I saw the word "GC" and my knee-jerk reaction was, "what a pity, the rest of the language looks so promising, but GC? No thanks." It took me a while to realize the flaw in my reasoning. Today, I wholeheartedly embrace the GC.
2) Your friend has incomplete/inaccurate information about the standard library being dependent on the GC. A pretty significant chunk of Phobos is actually usable without the GC -- a large part of the range-based stuff (std.range, std.algorithm, etc.), for example. True, some parts are GC-dependent, but you can still get pretty good mileage out of the @nogc-compatible subset of Phobos.
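For instance, here's a minimal sketch (the function name is made up for illustration) that chains std.algorithm primitives over a slice, with the compiler verifying via @nogc that none of it touches the GC:

    @nogc nothrow
    int sumOfEvenSquares(scope const(int)[] data)
    {
        import std.algorithm : filter, map, sum;
        // filter and map are lazy range wrappers; sum just iterates.
        // None of this allocates, so it all passes under @nogc.
        return data.filter!(a => a % 2 == 0)
                   .map!(a => a * a)
                   .sum;
    }

    void main()
    {
        static immutable int[] nums = [1, 2, 3, 4, 5, 6];
        assert(sumOfEvenSquares(nums) == 4 + 16 + 36);
    }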
//
The thing about GC vs. non-GC is that, coming from a C/C++ background, my philosophy was that I must be in control of every detail of my program; I had to know exactly what it does at any given point. Especially when it comes to managing memory allocations. The idea being that if I kept my memory tidy (i.e., free allocated chunks when I'm done with them) then there wouldn't be an accumulation of garbage that would cost a lot of time to clean up later. The idea of a big black box called the GC that I don't understand, randomly taking over management of my memory, scared me. What if it triggered a collection at an inconvenient time when performance is critical?
Not an entirely wrong line of reasoning, but manual memory management comes with costs:
a) The biggest cost is the additional mental load it adds to your programming tasks. Once you go beyond your trivial hello-world and add-two-numbers-together type of functions, you have to start thinking about memory management at every turn, every juncture. "My function needs space to sort this list of stuff, hmm, I need to allocate a buffer. How big of a buffer do I need? When should I allocate it? When should I free it? I also need this other scratchpad buffer for caching this other bit of data that I'll need 2 blocks down the function body. Better allocate it too. Oh no, now I have to free it, so both branches of the if-statement have to check the pointer and free it. Oh, and inside this loop too; I can't just short-circuit it by returning from the function, I need an exit block for cleaning up my allocations. Oh, but this function might be called from a performance-critical part of the code! Better not do allocations here, let the caller pass it in. Oh wait, that changes the signature of this function, so I can't put it in the generic table of function pointers to callbacks anymore, I need a control block to store the necessary information. Oh wait, I have to allocate the control block too. Who's gonna free it? When should it be freed?"
And on and on it goes. Pretty soon, you find yourself spending an inordinate amount of time and effort fiddling with memory management rather than making progress in the problem domain, i.e., actually solving the problem you set out to solve in the first place.
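To make that concrete, here's a hedged sketch of that bookkeeping using C-style allocation from D (all names are made up for illustration):

    import core.stdc.stdlib : free, malloc;

    bool processList(const(int)* items, size_t n)
    {
        int* sortBuf = cast(int*) malloc(n * int.sizeof);
        if (sortBuf is null) return false;

        int* scratch = cast(int*) malloc(n * int.sizeof);
        if (scratch is null)
        {
            free(sortBuf);   // must remember this on *every* exit path
            return false;
        }

        // ... the actual problem-domain work goes here ...

        free(scratch);       // cleanup duplicated at every return point
        free(sortBuf);
        return true;
    }

Nearly every line is memory bookkeeping; the problem-domain work is reduced to a single comment's worth of space.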
And worse yet:
b) Your APIs become cluttered with memory management paraphernalia. Instead of only input parameters that are directly related to the problem domain the function is supposed to do work in, you must also include memory-management related stuff. Like allocators, and wrapped pointers -- because nobody can keep track of raw pointers without eventually tripping up, you'd better wrap them in a managed pointer like unique_ptr<> or some ref-counted handle. But should you use unique_ptr<> or shared_ptr<> or something else? In a large project, some functions will expect unique_ptr<>, others will expect shared_ptr<>, and when you need to put them together, you have to insert additional code for interconverting between your wrapped pointer types. (And you need to take extra care not to screw up the semantics and leak or corrupt memory.)
The net result is, memory management paraphernalia percolates throughout your code, polluting every API and demanding extra code for interconverting / gluing disparate memory management conventions together. Extra code that doesn't help you make any progress in your problem domain, but has to be there because of manual memory management.
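As a sketch of that friction in D terms (the handle types and functions here are purely illustrative, standing in for whatever wrapper conventions a project accumulates):

    struct UniqueHandle(T)  { T* ptr; /* frees ptr in its dtor */ }
    struct CountedHandle(T) { T* ptr; size_t* refs; /* ref-counted */ }

    struct Image { int width, height; }

    // One part of the codebase settled on ref-counted handles...
    void compress(ref CountedHandle!Image img) { /* ... */ }

    // ...another insists on unique ownership:
    void render(ref UniqueHandle!Image img) { /* ... */ }

A caller holding one kind of handle can't call the other API without writing ownership-transfer glue between the two conventions -- code that serves memory management only.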
c) So you went through all of the above troubles because you believed that it would save you from the bogeyman of unwanted GC pauses and keep you in control of the inner workings of your program. But does it really live up to its promises? Not necessarily.
If you have a graph of allocated objects, for example, then when the last reference to some node in that graph goes out of scope, you have to deallocate the entire graph. The dtor must recursively traverse the entire structure and destruct everything, because past that point you no longer have a reference to the graph, and would leak the memory if you didn't clean up now. And here's the thing: in a sufficiently complex program, (1) you cannot predict the size of this graph -- it's potentially unbounded; and (2) you cannot predict where in the code the last reference will go out of scope (when the refcount drops to 0, if you're using refcounting). The net result is: your program will unpredictably get to a point where it must spend an unbounded amount of time deallocating a large graph of allocated objects.
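A sketch of that teardown in D terms (using a binary tree instead of a general graph, for brevity):

    import core.stdc.stdlib : free;

    struct Node
    {
        Node* left, right;
    }

    // The pause is proportional to the size of the whole structure,
    // and it happens at whatever point in the code the last reference
    // happens to be dropped.
    void freeTree(Node* n)
    {
        if (n is null) return;
        freeTree(n.left);
        freeTree(n.right);
        free(n);
    }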
IOW, this is not that much different from the GC having to pause and do a collection at an unpredictable time.
So you put in all this effort just to avoid this bogeyman, and lo and behold you haven't got rid of it at all!
Furthermore, on today's CPU architectures with cache hierarchies and memory access prediction units, one very important factor in performance is locality. I.e., if your program accesses memory in a sequential pattern, or within close proximity, it tends to run faster than if it had to successively access multiple random locations in memory. If you manage memory yourself, then when a large graph of objects goes out of scope you're forced to clean it up right there and then -- even if the nodes happen to be widely scattered across memory (because they were allocated at different times in the program and attached to the graph). If you used a GC, however, the GC could change the order in which it scans for garbage in a way that has better cache utility -- because the GC isn't obligated to clean up immediately, but can wait until there's enough garbage that a single sweep would pick up pieces of diverse object graphs that happen to be close to each other in memory, and clean them up in sequential order so that there are fewer CPU cache misses.
Or, to put it succinctly, the GC can sometimes outperform your manual management of memory!
d) Lastly, memory management is hard. Very hard. So hard that, after how many decades of industry experience with manual memory management in C/C++, well-known, battle-worn large software projects are still riddled with memory management bugs that lead to crashes and security exploits. Just check the CVE database, for example. An inordinately large proportion of security bugs are related to memory management.
Using a GC immediately gets rid of 90% of these issues. (Not 100%, unfortunately, because there are still cases where problems may arise. See: "memory management is hard".) If you don't need to write the code that frees memory, then by definition you cannot introduce bugs while doing so.
This leads us to the advantages of having a GC:
1) It greatly reduces the number of memory-related bugs in your program. Gets rid of an entire class of bugs related to manually managing your allocations.
2) It frees up your mental resources to make progress in your problem domain, instead of endlessly worrying about the nitty-gritty of memory management at every turn. More mental resources available means you can make progress in your problem domain faster, and with lower chances of bugs.
3) Your APIs become cleaner. You no longer need memory management paraphernalia polluting your APIs; your parameters can be restricted to only those that are required for your problem domain and nothing else. Cleaner APIs lead to less boilerplate / glue code for interfacing between APIs that expect different memory management schemes (e.g., converting between unique_ptr<> and shared_ptr<> or whatever). Diverse modules become more compatible with each other, and can call each other with less friction. Less friction means shorter development times, fewer bugs, and better maintainability (code without memory management paraphernalia is much easier to read -- and understand correctly, so that you can make modifications without introducing bugs).
4) In some cases, you may even get better runtime performance than if you manually managed everything.
//
And as a little footnote: D's GC does not run in the background independently of your program's threads; GC collections will NOT trigger unless you're allocating memory and the GC runs out of memory to give you. Meaning that you *do* have some control over GC pauses in your program -- if you want to be sure you have no collections in some piece of code, simply don't do any allocations, and collections won't start.
If you're worried that another thread might trigger a collection, you can always bring out the GC.disable() hammer to stop the GC from running any collections even in the face of continuing allocations. (And then call GC.enable() later when it's safe for collections to run again.)
And if you're like me, and you like more control over how things are run in your program, you can even call GC.disable() and then periodically call GC.collect() on your own schedule, at your own convenience. (In one of my D projects, I managed to eke out a 20-25% performance boost just by reducing the frequency of GC collections, running GC.collect() on my own schedule.)
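A sketch of that pattern (doFrame and the collection interval are illustrative stand-ins for a real main loop and real tuning):

    import core.memory : GC;

    void doFrame()
    {
        auto tmp = new int[](64);   // allocate freely; no pause here
        tmp[0] = 42;
    }

    void main()
    {
        GC.disable();       // suppress automatic collections (the runtime
                            // may still collect in a true out-of-memory pinch)
        scope(exit) GC.enable();

        foreach (frame; 0 .. 10_000)
        {
            doFrame();
            if (frame % 1_000 == 999)
                GC.collect();   // collect on *our* schedule instead
        }
    }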
//
Also, in those few places in your code where the GC really *does* get in your way, there's @nogc at your disposal. The compiler will statically enforce zero GC usage in such functions, so that you can be sure you won't trigger any collections and you won't make any new GC allocations.
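A minimal sketch of that enforcement (names are illustrative): the function below compiles because it only writes into caller-provided memory, and uncommenting the marked line makes the compiler reject it:

    @nogc nothrow
    void frameUpdate(int[] scratch)
    {
        scratch[] = 0;   // fine: no allocation, just writes
        // auto buf = new int[](256);  // Error: `new` allocates from the
        //                             // GC, so @nogc code can't use it
    }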
//
So you see, the GC isn't really *that* bad, as if it were a plague that you have to avoid at all costs. It's actually a good helper if you know how to make use of its advantages.
T
--
Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway
|