So I am helping to add a new GC to dlang, and one thing I have run into is that all correctly-written GC implementations must implement the function collectNoStack.
This is defined in the GC interface here: https://github.com/dlang/dmd/blob/c09adbbc2793aedcc3569681acfc42260d3b0e4b/druntime/src/core/gc/gcinterface.d#L59
When looking further into what this actually means, alarmingly, it means exactly what it says -- run a collection cycle without examining any thread stacks for roots.
What in God's name is the point of this? Won't this just collect things that are still actively being referenced by threads?! The answer is -- yes.
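To make the problem concrete, here is a toy model of a mark phase. This is not druntime code: the heap is just a list of block indices, and `mark`, `refs`, `staticRoots` and `stackRoots` are names I made up for the illustration. It only shows that leaving the stack roots out makes a block that a thread still uses look like garbage.

```d
// Toy model, NOT druntime code. Block i points to the blocks in refs[i].
bool[] mark(size_t[][] refs, size_t[] roots)
{
    auto live = new bool[](refs.length);
    size_t[] work = roots.dup;
    while (work.length > 0)
    {
        // Pop one block off the work list.
        size_t b = work[$ - 1];
        work = work[0 .. $ - 1];
        if (live[b])
            continue;
        live[b] = true;
        // Everything a live block points to is live too.
        foreach (child; refs[b])
            work ~= child;
    }
    return live;
}

void main()
{
    // Block 0 points to block 1; block 2 points to nothing.
    size_t[][] refs = [[1], [], []];
    size_t[] staticRoots = [0]; // e.g. a global variable
    size_t[] stackRoots = [2];  // e.g. a local variable in a running thread

    // Full scan: every block is found.
    assert(mark(refs, staticRoots ~ stackRoots) == [true, true, true]);
    // "No stack" scan: block 2 is unmarked and would be freed.
    assert(mark(refs, staticRoots) == [true, true, false]);
}
```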
Well, I wanted to know more about how this could be valid, so I did more research and it's kind of a fun story. Some of this is conjecture as I wasn't around for the beginnings of this, and having help filling in the holes is appreciated.
In looking to see which code actually calls this, I can find only one use, here: https://github.com/dlang/dmd/blob/c09adbbc2793aedcc3569681acfc42260d3b0e4b/druntime/src/core/internal/gc/proxy.d#L119
Quoting the code in that snippet, so you can keep the "explanatory comment" in mind:
// NOTE: There may be daemons threads still running when this routine is
// called. If so, cleaning memory out from under then is a good
// way to make them crash horribly. This probably doesn't matter
// much since the app is supposed to be shutting down anyway, but
// I'm disabling cleanup for now until I can think about it some
// more.
//
// NOTE: Due to popular demand, this has been re-enabled. It still has
// the problems mentioned above though, so I guess we'll see.
instance.collectNoStack(); // not really a 'collect all' -- still scans
// static data area, roots, and ranges.
That is the "collect" configuration for terminating the GC.
In a way, this makes sense, because if you are terminating the GC, the GC is going away, and it doesn't really matter if anything is referring to the data, those references are all gonna die.
OK.... but why skip just the thread stacks? In fact, why scan anything at all? I'm not the first one to think this: there's a second configuration that does exactly this, in another case of that switch.
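For anyone who wants to try the different cases of that switch, here is a sketch of how I believe the mode is selected. The assumption (from my reading of the druntime GC config, so double check it) is that the `cleanup` GC option accepts `none`, `collect` or `finalize`, with `collect` as the default, and that it can be embedded in the binary through `rt_options` or passed on the command line as `--DRT-gcopt=cleanup:finalize`.

```d
// Assumption: the `cleanup` GC option accepts none|collect|finalize.
// "finalize" should pick the other case of the switch, which runs
// finalizers without doing any scan.
extern(C) __gshared string[] rt_options = ["gcopt=cleanup:finalize"];

struct S
{
    ~this()
    {
        import core.stdc.stdio : printf;
        printf("Destroying!\n");
    }
}

void main()
{
    // Never explicitly destroyed; left for GC termination to deal with.
    auto s = new S;
}
```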
To try and pin down why this is there, and what the "popular demand" note means, I started using git blame (I have to say, the world is a better place with git and github around, I shudder to think how I would have had to find the history of this with subversion).
Aaaaand I traced it back to the beginning of druntime. Yes, this is the repository after the very first commit from Sean Kelly for druntime: https://github.com/dlang/dmd/blob/6837c0cd426f7e828aec1a2bdc941ac9b722dd14/src/gc/basic/gc.d#L73
So, I thought, maybe I will email Sean? He might know why this note is there.
But wait! druntime takes its lineage from Tango! And Tango is also on github >:)
And now, we find out when the first note was written: https://github.com/SiegeLord/Tango-D2/commit/03ea5067558829b8c99e3cf12bb0e55c43e29269
Hoooold on a second. The line that was commented out was... not the full collect. That was already commented out, and actually, it was just doing what I proposed above -- collecting all blocks regardless of roots.
The note was added when that was commented out, and apparently, the Tango runtime just didn't do any collection at the end of a program.
What about the second note? That got added "by popular demand" later:
https://github.com/SiegeLord/Tango-D2/commit/5984ec967eaffb1d3c1c7504e9349f18c8b36038
This means the _fullCollectNoStack call was added back in (and apparently the second call to run the destructors and clean all garbage, which must have been separated back then). I can only guess that this was because people thought it should be done.
The note concerns "daemon threads". What is a daemon thread? It's a thread that does not get joined at the end of execution (that is still the same, and you can see the explanation here: https://dlang.org/phobos/core_thread_threadbase.html#.ThreadBase.isDaemon). I checked, and literally this is the only place the isDaemon flag is used. Daemon threads are still stopped for GC, and still get scanned. They just aren't waited for at the end of main.
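If you haven't used one, this is roughly what a daemon thread looks like. `worker` and `startDaemon` are hypothetical names for this sketch; the only real API here is `Thread` and its `isDaemon` property.

```d
import core.thread : Thread;
import core.time : msecs;

// Hypothetical background job that would outlive main if allowed to.
void worker()
{
    foreach (i; 0 .. 1000)
        Thread.sleep(10.msecs);
}

// Hypothetical helper: start fn on a thread the runtime will not join.
Thread startDaemon(void function() fn)
{
    auto t = new Thread(fn);
    t.isDaemon = true; // not waited for at the end of main
    t.start();
    return t;
}

void main()
{
    auto t = startDaemon(&worker);
    assert(t.isDaemon);
    // main returns right away. The runtime shuts down (including the GC
    // termination discussed here) while worker may still be running.
}
```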
OK, now the note actually makes sense -- if you clean all the garbage at the end of main without scanning thread stacks, then you clean out memory that the daemon threads may still be using.
But.. does it? When did this ever work? Isn't the GC going away?
I wanted to find out the true etymology of this... "thing". So I kept going back. And as it turns out, the collectNoStack function comes from D1! That's right, we still have that to look at as well: https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gc.d#L171
Hm.. OK, so this is what D always did. But why? I wanted to find out exactly what happened differently when the fullCollectNoStack function was called, and I got my answer:
Peruse through that file, and you'll see the nostack variable is used in one place: https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L2030
And look there... it's only skipping the stack scanning if there is exactly one thread.
In other words, with D1, where this poorly named fullCollectNoStack function existed, it actually would scan with stacks as long as you created multiple threads. That is, in certain (very common) cases, the fullCollectNoStack would scan stacks. Should it have been called fullCollectMaybeNoStacksIfSingleThreaded? I digress...
And in fact, when D1 was compiled in "single threaded mode", scanning of the thread stack was indeed skipped: https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L2117
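Boiled down, the D1 behavior amounts to the following. This is my paraphrase of the condition, not the original code, and `scansStacks` is a made-up name.

```d
// Hypothetical paraphrase of the D1 condition, not the original code.
bool scansStacks(bool noStack, size_t threadCount)
{
    // Stacks are skipped only when "no stack" was requested AND the
    // program has exactly one thread.
    return !(noStack && threadCount == 1);
}

void main()
{
    assert(scansStacks(false, 1));  // normal collection: stacks scanned
    assert(!scansStacks(true, 1));  // "no stack", single thread: skipped
    assert(scansStacks(true, 2));   // "no stack", two threads: scanned anyway
}
```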
Let's think back to why the heck we have this going on. My theory is that people who are new to GC, or don't really understand how GC works, run a test like the following:
import core.stdc.stdio : printf;

struct S
{
    ~this() {
        // Runs only if the GC actually finalizes this block.
        printf("Destroying!\n");
    }
}

void main()
{
    S *s = new S;
}
If they don't get a printout, they post an angry/confused message on the forums saying
Y U No work GC?
If the stack of main is scanned, it's possible there's still a reference to the s there. It could even still be in registers for the thread. And that might mean that the GC won't clean it up.
The truth is, there is no guarantee any destructors are run. And especially in 32-bit D (which is what D was exclusively for a long time), random 32-bit numbers might accidentally "point" at the memory block.
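Here is what "accidentally pointing" means to a conservative scanner. The function name and the addresses are invented for the illustration; the point is only that the scanner sees a machine word and cannot tell an integer from a pointer.

```d
// Toy conservative check: does this word look like a pointer into the block?
// Hypothetical helper, not druntime code.
bool looksLikePointerInto(size_t word, size_t base, size_t size)
{
    return word >= base && word - base < size;
}

void main()
{
    // Pretend a 16-byte block lives at this (made up) 32-bit address.
    size_t base = 0x0804_A000;

    // An unrelated integer on the stack that happens to have an unlucky value.
    size_t unrelated = 0x0804_A004;

    // The scanner must treat it as a reference, so the block stays alive.
    assert(looksLikePointerInto(unrelated, base, 16));
    // One byte past the end no longer counts.
    assert(!looksLikePointerInto(0x0804_A010, base, 16));
}
```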
So maybe, the solution Walter came up with (and I'm just guessing here), is hey, we are shutting down anyways, just avoid scanning the main thread stack, and we can satisfy the unwashed masses.
But that brings us back to WHY THE HELL DO WE STILL HAVE THIS? My guess is that the note keeps people from removing it. If we are doing a scan at all, scanning thread stacks as roots should be a trivial addition to the scan. Skipping it just adds an extra layer of complication to the implementation that is unnecessary. But that note where "I'm disabling cleanup for now until I can think about it some more" seems to be applying to an actual scan (not the blunt destruction of all memory, which is the line commented out when the note was added). That is causing people to hesitate and leave things be. Someone was behind that "I", and I probably shouldn't step on that someone's toes; they knew what they were doing.
And they did, but what they did isn't what the code says (my hypothesis).
So my solution is, let's just get rid of this extra function. Let's get rid of any idea of doing a half-ass scan that at best collects some extra stuff that might not be referenced and at worst pulls the rug out from still-running threads. And if you actually called this somehow in the middle of a program, it will corrupt all your memory immediately.
I did a PR to just see what happens when we do a full scan instead of the "no stack" scan, and the results are pretty positive. I'm going to update the PR to really remove all the tentacles of the "nostack" variable, but I wanted to bring this story to light because it's too long and bizarre to explain in the notes of a PR.
https://github.com/dlang/dmd/pull/16401
If there are any good reasons why we should have this, or I got something wrong, please let me know!
-Steve