April 21

So I am helping to add a new GC to dlang, and one thing I have run into is that all correctly-written GC implementations must implement the function collectNoStack.

This is defined in the GC interface here: https://github.com/dlang/dmd/blob/c09adbbc2793aedcc3569681acfc42260d3b0e4b/druntime/src/core/gc/gcinterface.d#L59

When looking further into what this actually means, alarmingly, it means exactly what it says -- run a collection cycle without examining any thread stacks for roots.

What in God's name is the point of this? Won't this just collect things that are still actively being referenced by threads?! The answer is -- yes.

Well, I wanted to know more about how this could be valid, so I did more research, and it's kind of a fun story. Some of this is conjecture, as I wasn't around at the beginning, and any help filling in the holes is appreciated.

In looking to see which code actually calls this, I can find only one use, here: https://github.com/dlang/dmd/blob/c09adbbc2793aedcc3569681acfc42260d3b0e4b/druntime/src/core/internal/gc/proxy.d#L119

Quoting the code in that snippet, so you can keep the "explanatory comment" in mind:

```d
// NOTE: There may be daemons threads still running when this routine is
//       called.  If so, cleaning memory out from under then is a good
//       way to make them crash horribly.  This probably doesn't matter
//       much since the app is supposed to be shutting down anyway, but
//       I'm disabling cleanup for now until I can think about it some
//       more.
//
// NOTE: Due to popular demand, this has been re-enabled.  It still has
//       the problems mentioned above though, so I guess we'll see.

instance.collectNoStack();  // not really a 'collect all' -- still scans
                            // static data area, roots, and ranges.
```

This is the "collect" configuration for terminating the GC.

In a way, this makes sense, because if you are terminating the GC, the GC is going away, and it doesn't really matter if anything is referring to the data, those references are all gonna die.

OK... but why skip just the thread stacks? In fact, why scan anything at all? I'm not the first one to think this: there's a second configuration, in another case of that switch, which does exactly this.

To try and pin down why this is there, and what the "popular demand" note means, I started using git blame (I have to say, the world is a better place with git and github around, I shudder to think how I would have had to find the history of this with subversion).

Aaaaand I traced it back to the beginning of druntime. Yes, this is the repository after the very first commit from Sean Kelly for druntime: https://github.com/dlang/dmd/blob/6837c0cd426f7e828aec1a2bdc941ac9b722dd14/src/gc/basic/gc.d#L73

So, I thought, maybe I will email Sean? He might know why this note is there.

But wait! druntime takes its lineage from Tango! And Tango is also on github >:)

And now, we find out when the first note was written: https://github.com/SiegeLord/Tango-D2/commit/03ea5067558829b8c99e3cf12bb0e55c43e29269

Hoooold on a second. The line that was commented out was... not the full collect. That was already commented out, and actually, it was just doing what I proposed above -- collecting all blocks regardless of roots.

The note was added when that was commented out, and apparently, the Tango runtime just didn't do any collection at the end of a program.

What about the second note? That got added "by popular demand" later:

https://github.com/SiegeLord/Tango-D2/commit/5984ec967eaffb1d3c1c7504e9349f18c8b36038

This means the _fullCollectNoStack call was added back in (along with, apparently, the second call to run the destructors and clean all garbage, which must have been a separate step back then). My guess is that it came back because people thought it should be done.

The note concerns "daemon threads". What is a daemon thread? It's a thread that does not get joined at the end of execution (that is still the case, and you can see the explanation here: https://dlang.org/phobos/core_thread_threadbase.html#.ThreadBase.isDaemon). I checked, and this is literally the only place the isDaemon flag is used. Daemon threads are still stopped for GC, and still get scanned. They just aren't waited for at the end of main.
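To make the scenario concrete, here is a minimal sketch (my own illustration, not code from druntime) of a daemon thread that is still alive when main returns. It uses the `core.thread` API (`Thread`, `isDaemon`, `Thread.sleep`); the `worker` and `startDaemonWorker` names are hypothetical.

```d
import core.thread : Thread;
import core.time : msecs;

// Hypothetical worker that never finishes on its own.
void worker()
{
    while (true)
        Thread.sleep(100.msecs);
}

// Hypothetical helper: start `worker` as a daemon thread.
Thread startDaemonWorker()
{
    auto t = new Thread(&worker);
    t.isDaemon = true; // the runtime will not wait for this thread at exit
    t.start();
    return t;
}

void main()
{
    auto t = startDaemonWorker();
    assert(t.isDaemon);
    // main returns right away; runtime shutdown (including GC termination)
    // proceeds while `t` may still be sleeping inside worker().
}
```

If you drop the `isDaemon` line, the program should hang at exit instead, because the runtime would wait for a thread that never finishes.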

OK, now the note actually makes sense -- if you clean all the garbage at the end of main without scanning thread stacks, then you clean out memory that the daemon threads may still be using.

But.. does it? When did this ever work? Isn't the GC going away?

I wanted to find out the true entomology of this... "thing". So I kept going back. And as it turns out, the collectNoStack function comes from D1! That's right, we still have that to look at as well: https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gc.d#L171

Hm.. OK, so this is what D always did. But why? I wanted to find out exactly what happened differently when the fullCollectNoStack function was called, and I got my answer:

https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L1031

Peruse through that file, and you'll see the nostack variable is used in one place: https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L2030

And look there... it's only skipping the stack scanning if there is exactly one thread.

In other words, in D1, where this poorly named fullCollectNoStack function existed, it would actually scan stacks as long as you had created multiple threads. That is, in certain (very common) cases, fullCollectNoStack would scan stacks. Should it have been called fullCollectMaybeNoStacksIfSingleThreaded? I digress...
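Boiled down, the D1 behavior amounts to a single predicate. This is a hypothetical simplification for illustration only; `shouldScanThreadStacks` and its parameters are my names, not the actual gcx.d code linked above.

```d
// Hypothetical simplification of the D1 logic, NOT the real gcx.d code:
// thread stacks are skipped only when `nostack` was requested AND the
// process has exactly one thread.
bool shouldScanThreadStacks(bool nostack, size_t threadCount)
{
    return !(nostack && threadCount == 1);
}

void main()
{
    assert(!shouldScanThreadStacks(true, 1)); // the only case that skips stacks
    assert(shouldScanThreadStacks(true, 2));  // "NoStack", yet stacks get scanned
    assert(shouldScanThreadStacks(false, 1)); // a normal collection
}
```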

And in fact, when D1 was compiled in "single threaded mode", scanning of the thread stack was indeed skipped: https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L2117

Let's think back to why the heck we have this going on. My theory is that people who are new to GC, or don't really understand how GC works, run a test like the following:

```d
import core.stdc.stdio : printf;

struct S
{
    ~this() {
        printf("Destroying!\n");
    }
}

void main()
{
    S *s = new S;
}
```

If they don't get a printout, they post an angry/confused message on the forums saying

Y U No work GC?

If the stack of main is scanned, it's possible there's still a reference to the s there. It could even still be in registers for the thread. And that might mean that the GC won't clean it up.

The truth is, there is no guarantee any destructors are run. And especially in 32-bit D (which is what D was exclusively for a long time), random 32-bit numbers might accidentally "point" at the memory block.

So maybe, the solution Walter came up with (and I'm just guessing here), is hey, we are shutting down anyways, just avoid scanning the main thread stack, and we can satisfy the unwashed masses.

But that brings us back to WHY THE HELL DO WE STILL HAVE THIS? My guess is that the note keeps people from removing it. If we are doing a scan at all, scanning thread stacks as roots should be a trivial addition to the scan. Skipping them just adds an unnecessary layer of complication to the implementation. But that note, "I'm disabling cleanup for now until I can think about it some more", seems to apply to an actual scan (not the blunt destruction of all memory, which is what the line commented out when the note was added did). That causes people to hesitate and leave things be. Someone was behind that "I", and I probably shouldn't step on that someone's toes; they knew what they were doing.

And they did, but what they did isn't what the code says (my hypothesis).

So my solution is: let's just get rid of this extra function. Let's get rid of any idea of doing a half-ass scan that at best collects some extra stuff that might not be referenced and at worst pulls the rug out from under still-running threads. And if you somehow called this in the middle of a program, it would corrupt all your memory immediately.

I did a PR to just see what happens when we do a full scan instead of the "no stack" scan, and the results are pretty positive. I'm going to update the PR to really remove all the tentacles of the "nostack" variable, but I wanted to bring this story to light because it's too long and bizarre to explain in the notes of a PR.

https://github.com/dlang/dmd/pull/16401

If there are any good reasons why we should have this, or I got something wrong, please let me know!

-Steve

April 21

On Sunday, 21 April 2024 at 19:28:11 UTC, Steven Schveighoffer wrote:

> If they don't get a printout, they post an angry/confused message on the forums saying
>
> Y U No work GC?
>
> If the stack of main is scanned, it's possible there's still a reference to the s there. It could even still be in registers for the thread. And that might mean that the GC won't clean it up.
>
> The truth is, there is no guarantee any destructors are run. And especially in 32-bit D (which is what D was exclusively for a long time), random 32-bit numbers might accidentally "point" at the memory block.

Great digging, I suspect you are correct.

Why do we even support destructors for GC-allocated objects? You can't depend on them closing file descriptors or other resources.

April 22
On 22/04/2024 7:57 AM, Daniel N wrote:
> On Sunday, 21 April 2024 at 19:28:11 UTC, Steven Schveighoffer wrote:
>>
>> If they don't get a printout, they post an angry/confused message on the forums saying
>>
>> ### Y U No work GC?
>>
>> If the stack of `main` is scanned, it's possible there's still a reference to the `s` there. It could even still be in registers for the thread. And that might mean that the GC won't clean it up.
>>
>> The truth is, there is no guarantee any destructors are run. And especially in 32-bit D (which is what D was exclusively for a long time), random 32-bit numbers might accidentally "point" at the memory block.
>>
> 
> Great digging, I suspect your are correct.
> 
> Why do we even support destructors for GC allocated objects? You can't depend on it closing file-descriptors or other resources.

I understand why removing it would seem like a great idea.

But it would mean completely segregating reference-counted types, keeping them out of non-RC memory.

I can't even get Walter to agree to have RC in the language (even if it is reluctantly), let alone that...
April 21

On Sunday, 21 April 2024 at 19:28:11 UTC, Steven Schveighoffer wrote:

> So I am helping to add a new GC to dlang, and one thing I have run into is that all correctly-written GC implementations must implement the function collectNoStack.

Great to hear that something is coming in the GC land.

> I did a PR to just see what happens when we do a full scan instead of the "no stack" scan, and the results are pretty positive. I'm going to update the PR to really remove all the tentacles of the "nostack" variable, but I wanted to bring this story to light because it's too long and bizarre to explain in the notes of a PR.
>
> https://github.com/dlang/dmd/pull/16401
>
> If there are any good reasons why we should have this, or I got something wrong, please let me know!

As you've found: to run finalizers on everything that might still be referenced by the main thread, for the unwashed masses that expect finalizers to run at shutdown.

--
Dmitry Olshansky
CEO @ Glowlabs
https://olshansky.me

April 21
Quite an awesome bit of detective work! A fine and most entertaining read it is. I recommend you include that text in any PR to fix it.

> So maybe, the solution Walter came up with (and I'm just guessing here), is hey, we are shutting down anyways, just avoid scanning the main thread stack, and we can satisfy the unwashed masses.

If I did write that function, I don't remember doing so, or why I would. Unfortunately, D1 predates our use of github, so it wouldn't be easy trying to figure out who wrote that.
April 22
On 22/04/2024 8:58 AM, Walter Bright wrote:
> Quite an awesome bit of detective work! A fine and most entertaining read it is. I recommend you include that text in any PR to fix it.
> 
>  > So maybe, the solution Walter came up with (and I'm just guessing here), is hey, we are shutting down anyways, just avoid scanning the main thread stack, and we can satisfy the unwashed masses.
> 
> If I did write that function, I don't remember doing so, or why I would. Unfortunately, D1 predates our use of github, so it wouldn't be easy trying to figure out who wrote that.

Nothing in your email archive?
April 21
On 4/21/2024 1:01 PM, Richard (Rikki) Andrew Cattermole wrote:
> I can't even get Walter to agree to have RC in the language (even if it is reluctantly), let alone that...

A few years back, we really tried to find a way to do it that was memory safe. We failed. Timon presented us with a use-after-free case that we couldn't resolve. Re-orienting D around a memory-unsafe construct is not the future for D.

There's also the problem that the decrement has to be in a finally block, because of exceptions, which makes things bloated and slow.
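To illustrate the exception point, here is a minimal sketch in which a plain global counter stands in for a reference count (`liveCount`, `Counted`, and `mayThrow` are hypothetical names). The destructor, where an RC decrement would live, has to run on the unwinding path too; as I understand it, the compiler emits that cleanup as what is effectively a `finally` block.

```d
import std.exception : assertThrown;

int liveCount; // hypothetical stand-in for a reference count

struct Counted
{
    this(int) { ++liveCount; }
    ~this() { --liveCount; } // the "decrement"; must also run during unwinding
}

void mayThrow(bool fail)
{
    auto c = Counted(0);
    if (fail)
        throw new Exception("boom");
}

void main()
{
    assertThrown(mayThrow(true));
    assert(liveCount == 0); // the destructor ran even though we threw
}
```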

P.S. A borrow checker resolves the use-after-free case, but RC isn't needed if one is using a borrow checker.
April 22
On 22/04/2024 9:07 AM, Walter Bright wrote:
> On 4/21/2024 1:01 PM, Richard (Rikki) Andrew Cattermole wrote:
>> I can't even get Walter to agree to have RC in the language (even if it is reluctantly), let alone that...
> 
> A few years back, we really tried to find a way to do it that was memory safe. We failed. Timon presented us with a use-after-free case that we couldn't resolve. Re-orienting D around a memory-unsafe construct is not the future for D.
> 
> There's also the problem that the decrement has to be in a finally block, because of exceptions, which makes things bloated and slow.

I suspect that I need to learn more about the exception hooks.

My current understanding is that there is a subset that is entirely optional and only needs to be called if cleanup occurs. That's not quite the same thing as an exception catch.

> P.S. A borrow checker resolves the use-after-free case, but RC isn't needed if one is using a borrow checker.

You are applying the borrow checker to both the owner and the borrow.

I want it only on the borrow.

The borrow isn't exclusive, it doesn't have the guarantees isolated would give. All it guarantees is that the borrow cannot escape the owner.

If you apply the borrow checker to the owner as well, you miss out on the ability to use it with data structures, DOMs, that sort of thing. All things I want to use it for.
April 22
On 22/04/2024 9:03 AM, Richard (Rikki) Andrew Cattermole wrote:
> On 22/04/2024 8:58 AM, Walter Bright wrote:
>> Quite an awesome bit of detective work! A fine and most entertaining read it is. I recommend you include that text in any PR to fix it.
>>
>>  > So maybe, the solution Walter came up with (and I'm just guessing here), is hey, we are shutting down anyways, just avoid scanning the main thread stack, and we can satisfy the unwashed masses.
>>
>> If I did write that function, I don't remember doing so, or why I would. Unfortunately, D1 predates our use of github, so it wouldn't be easy trying to figure out who wrote that.
> 
> Nothing in your email archive?

dmd 0.5, the oldest dmd available in the release archive, has it.

So either it's in your email archive or you wrote it.
April 21
Before github we used dsource.org and subversion:
    http://dsource.org/projects/druntime/browser

It looks like github retains that history, which matches my memory of how we transitioned over. The history seems to start at about 2008.

What I'm not seeing is much pre-D2 history. There's some, but I thought I'd built up a full D1 history, or mostly full, at least a per-release snapshot. Maybe those early release tarballs/zips didn't contain the druntime code?

On 4/21/2024 1:58 PM, Walter Bright via Digitalmars-d wrote:
> Quite an awesome bit of detective work! A fine and most entertaining read it is. I recommend you include that text in any PR to fix it.
> 
>  > So maybe, the solution Walter came up with (and I'm just guessing here), is hey, we are shutting down anyways, just avoid scanning the main thread stack, and we can satisfy the unwashed masses.
> 
> If I did write that function, I don't remember doing so, or why I would. Unfortunately, D1 predates our use of github, so it wouldn't be easy trying to figure out who wrote that.