January 22, 2022

On Friday, 21 January 2022 at 14:09:18 UTC, rikki cattermole wrote:
> On 22/01/2022 2:56 AM, Chris Katko wrote:
>> So a related question: Has anyone ever thought about "thread-local garbage collection" or some sort of "multiple pool [same process/thread] garbage collection"? The idea here is, each thread [or thread collection] would have its own garbage collector, and, be the only thread that pauses during a collection event.
>
> Indeed we have thought about this.
>
> What I want is a fiber aware GC and that implicitly means thread-local too.

Yes, this is better, although you would need to change the language semantics so that non-shared means local to a fiber/actor as the fiber/actor can move from one thread to another.

You also need a per-thread runtime that can collect from sleeping fibers/actors and, in the worst case, signal other threads' runtimes to collect theirs. Otherwise you could run out of memory even though reclaimable memory is available...

In general I'd say you would want to have support for many different memory managers in the same executable, as different fibers/actors may have different memory management needs.
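One way to picture "many different memory managers in the same executable" with what D already ships: a sketch (the function names shortLivedTask/longLivedTask are made up) using std.experimental.allocator's Region as a per-task manager next to the normal GC heap.

```d
import std.experimental.allocator.building_blocks.region : Region;
import std.experimental.allocator.mallocator : Mallocator;

// A short-lived task (say, one request handled by one fiber) can use a
// bump-the-pointer region: everything it allocates is released in one go
// when the task finishes, with no tracing at all.
void shortLivedTask()
{
    auto scratch = Region!Mallocator(64 * 1024);
    void[] buf = scratch.allocate(4096);
    // ... request-local work using buf ...
    scratch.deallocateAll();    // drop the whole region at once
}

// A task whose data has unpredictable lifetimes can keep using the GC heap.
int[] longLivedTask()
{
    auto data = new int[](1000);
    // ... data stays alive for as long as something references it ...
    return data;
}
```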

January 22, 2022
On 22/01/2022 10:24 PM, Ola Fosheim Grøstad wrote:
> Yes, this is better, although you would need to change the language semantics so that non-shared means local to a fiber/actor as the fiber/actor can move from one thread to another.

This isn't what I had in mind.

If we tell the GC about fibers, it should be able to take advantage of this information and limit the time it takes to do stop-the-world scanning.

If it knows memory is live, it doesn't need to scan for it.

If it knows that memory isn't referenced in that fiber anymore, then it only needs to confirm that it isn't anywhere else.
January 22, 2022

On Saturday, 22 January 2022 at 10:05:07 UTC, rikki cattermole wrote:
> If it knows that memory isn't referenced in that fiber anymore, then it only needs to confirm that it isn't anywhere else.

I don't quite see how you plan to make this work without tracking dirty objects, which has a performance/storage penalty.

1. Fiber1 has memory chunk A.
2. Fiber2 has memory chunks A and B.
3. You confirm that sleeping Fiber1 does not have access to B.
4. Fiber2 sets A.next = B.

At this point, you assume wrongly that Fiber1 does not have access to B?
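Spelled out as a minimal D sketch (Node, a and b are just stand-ins for chunks A and B; no real fibers are spawned, the comments mark where they would run):

```d
import std.stdio;

class Node { Node next; }

void main()
{
    auto a = new Node();    // chunk A: reachable from both fibers
    auto b = new Node();    // chunk B: initially reachable only from Fiber2

    // (1) The collector inspects sleeping Fiber1 and concludes
    //     "Fiber1 cannot reach B".

    // (2) Fiber2 keeps running and links the chunks:
    a.next = b;             // B is now reachable from Fiber1 through A

    // (3) Without a write barrier or dirty tracking, the conclusion from
    //     step (1) is stale, and freeing B based on it would be unsound.
    writeln(a.next is b);   // true
}
```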

January 22, 2022
That's not the direction I'm thinking. It's not about freeing memory. It's about deferment.

The best way to make your code faster is to simply do less work.

So to make global memory scanning faster, we either have to decrease the number of roots (likely to end in segfaults) or decrease the number of memory ranges that have to be scanned.

And since we have a context to associate with the memory that is allocated, and with it the roots, we have a pretty good idea of whether that memory is still alive during the execution of said context.

Think about the context of a web server/service. You have some global heap memory, router, plugins, adminy type stuff, that sort of thing. Most of that sort of memory is either reused or sticks around till the end of the process.

But then there is request specific memory that gets allocated then thrown away. If you can defer needing to add those memory ranges into the global heap, that'll speed up scanning drastically as that is where most of your free-able memory will originate from.
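As a hedged illustration of the deferment angle using what druntime already exposes (this is not the fiber-aware scheme itself; handleRequest and the sizes are made up): request-scoped memory that is only part of the scanned range set while it could actually hold GC pointers.

```d
import core.memory : GC;
import core.stdc.stdlib : free, malloc;

// Hypothetical request handler: its scratch memory lives outside the GC
// heap and is only registered as a scanned range for the window in which
// it could actually hold pointers into the GC heap.
void handleRequest()
{
    enum size = 64 * 1024;
    void* scratch = malloc(size);
    scope(exit) free(scratch);

    GC.addRange(scratch, size);          // opt this block in to scanning...
    scope(exit) GC.removeRange(scratch); // ...and out again when the request ends

    // ... build the response using scratch ...
}
```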

To top it off, when you have a concurrent GC design like forking or snapshotting, you can scan those dead fibers' memory ranges without stopping the world.
January 22, 2022

On Saturday, 22 January 2022 at 10:38:36 UTC, rikki cattermole wrote:
> So to make global memory scanning faster, we either have to decrease the number of roots (likely to end in segfaults) or decrease the number of memory ranges that have to be scanned.

I believe the number of roots is irrelevant; what you need is heap separation, where you ensure through the type system that there can never be an owning pointer in heap #1 to an object located in heap #2. You can still have borrowing pointers from heap #1 to heap #2, since a non-null borrowing pointer means that the type system (or the programmer) is responsible for keeping the pointed-to object live.

So, if you let a fiber have its own heap of this kind, then you can reduce the scanning?

But you need to either improve the type system or put the burden on the programmer, C++-style.

Either way, I believe distinguishing between ownership and borrows can improve GC performance, not just enable unique_ptr-style ownership.
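A rough sketch of what that split could look like in D, with made-up wrapper types Owned/Borrowed/borrow (only the shape of the idea, not concrete semantics):

```d
// Made-up wrapper types, only to show the shape of the idea.
struct Owned(T)
{
    T* ptr;          // the only kind of pointer a collector would have to trace
}

struct Borrowed(T)
{
    private T* ptr;  // never traced: the type system (or the programmer)
                     // guarantees the target outlives the borrow
    ref T get() { return *ptr; }
}

Borrowed!T borrow(T)(ref Owned!T owner)
{
    return Borrowed!T(owner.ptr);
}
```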

> But then there is request specific memory that gets allocated then thrown away. If you can defer needing to add those memory ranges into the global heap, that'll speed up scanning drastically as that is where most of your free-able memory will originate from.

But you need to scan those memory ranges if they can contain pointers that can reach GC-memory?

> To top it off, when you have a concurrent GC design like forking or snapshotting, you can scan those dead fibers' memory ranges without stopping the world.

You want to require taking a snapshot-copy of the pointer-containing heap? That will make the collector much less useful, I think.

January 23, 2022
On 22/01/2022 11:59 PM, Ola Fosheim Grøstad wrote:
>> But then there is request specific memory that gets allocated then thrown away. If you can defer needing to add those memory ranges into the global heap, that'll speed up scanning drastically as that is where most of your free-able memory will originate from.
> 
> But you need to scan those memory ranges if they can contain pointers that can reach GC-memory?

Yeah those are roots. They still have to be added into the global heap to be scanned.

The goal is to decrease the number of memory ranges that need to be scanned in a full scan.

Cost = Roots * Ranges

Getting Ranges down (at least until a range actually does need to be scanned) lowers the cost; that's the goal of what I am suggesting!
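To put purely made-up numbers on that: with 1,000 roots and 100,000 ranges, a full scan is on the order of 1,000 × 100,000 = 100,000,000 units of work. If 90,000 of those ranges are request-local and can stay out of the global set until they genuinely need scanning, you are down to 1,000 × 10,000 = 10,000,000, a 10x reduction without touching the roots at all.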

>> To top it off, when you have a concurrent GC design like forking or snapshotting, you can scan those dead fibers' memory ranges without stopping the world.
> 
> You want to require taking a snapshot-copy of the pointer-containing heap? That will make the collector much less useful, I think.

We already have forking, and that is super useful for getting user threads to not stop. We are only missing snapshotting for Windows now.
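For anyone unfamiliar with the term, here is a toy of the principle only (collectInChild is made up; the real forking GC in druntime is considerably more involved):

```d
// Toy of the principle only, not druntime's actual forking collector.
version (Posix)
{
    import core.sys.posix.sys.wait : waitpid;
    import core.sys.posix.unistd : _exit, fork;

    void collectInChild()
    {
        auto pid = fork();
        if (pid == 0)
        {
            // Child: gets a copy-on-write snapshot of the whole address
            // space, so it can mark at leisure while the parent's threads
            // keep mutating.
            // ... mark, report unreachable blocks back via pipe/shm ...
            _exit(0);
        }
        // Parent: this toy just waits; a real collector reaps the child
        // asynchronously and frees the reported blocks.
        waitpid(pid, null, 0);
    }
}
```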
January 22, 2022
On Saturday, 22 January 2022 at 10:05:07 UTC, rikki cattermole wrote:
>
> This isn't what I had in mind.
>
> If we tell the GC about fibers, it should be able to take advantage of this information and limit the time it takes to do stop-the-world scanning.
>
> If it knows memory is live, it doesn't need to scan for it.
>
> If it knows that memory isn't referenced in that fiber anymore, then it only needs to confirm that it isn't anywhere else.

Does C# have thread-pinned GC memory? I haven't seen anything suggesting that you need to "move" memory to other threads in that language, but I don't really know what is going on under the hood there.

D is very similar to C# and tries to copy its primitives, which is one of the reasons D is nice to use. Manually moving memory around would affect how the language looks and its ease of use. Another thing is that thread-pinned memory doesn't work well with thread pools, as any thread in a pool might tamper with the memory.

In general, tracing GC doesn't scale well with large amounts of memory. I would rather go with something similar to ORC in Nim in order to reduce the tracing.

January 22, 2022

On Saturday, 22 January 2022 at 11:06:24 UTC, rikki cattermole wrote:
> We already have forking, and that is super useful for getting user threads to not stop.

Well, but I would never use it. Forking is crazy expensive. Messes with TLB and therefore caches IIRC.

January 22, 2022

On Saturday, 22 January 2022 at 11:27:39 UTC, Ola Fosheim Grøstad wrote:
> On Saturday, 22 January 2022 at 11:06:24 UTC, rikki cattermole wrote:
>> We already have forking, and that is super useful for getting user threads to not stop.
>
> Well, but I would never use it. Forking is crazy expensive. Messes with TLB and therefore caches IIRC.

Here are some downsides of a forking collector:

  1. messes with TLB, wipes all caches completely (AFAIK).

  2. will hog extra threads, thus if your program is using all cores, you will see a penalty

  3. requires a reduced activity in pointer mutation in the main program, so the forking collector has to saturate the databus in order to complete quickly, which is bad for the main process

  4. requires carefulness in configuring OS resource handling

  5. if you actually are out of memory, or close to it, then there is no way for you to fork, so it will fail and the process will instead be killed

  6. makes it more difficult to coordinate with real time threads (throttling/backpressure)

  7. not really portable, platform dependent

A forking collector is just not suitable for system-level programming; it is very much a solution for a high-level programming language running on hardware with lots of headroom.

If you are going high level, you might as well introduce write barriers for pointer mutation.
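To make "write barrier" concrete, a deliberately naive sketch with made-up names (gcStore, rememberedSet); real barriers use card tables or bitmaps, compiler support and proper synchronization, all omitted here:

```d
// gcStore and rememberedSet are made up; real barriers use card tables or
// bitmaps, compiler support and proper synchronization, all omitted here.
__gshared void*[] rememberedSet;   // slots mutated since the last collection

void gcStore(T)(ref T slot, T value)
{
    slot = value;                        // the actual pointer mutation
    rememberedSet ~= cast(void*) &slot;  // tell the collector which slot changed
}

class Node { Node next; }

void link(Node a, Node b)
{
    // The compiler would emit this instead of a plain `a.next = b;`,
    // letting the collector rescan only the remembered slots instead of
    // stopping the world.
    gcStore(a.next, b);
}
```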

January 23, 2022
On 23/01/2022 1:39 AM, Ola Fosheim Grøstad wrote:
> A forking collector is just not suitable for system-level programming; it is very much a solution for a high-level programming language running on hardware with lots of headroom.
> 
> If you are going high level, you might as well introduce write barriers for pointer mutation.

Agreed. There is no one-size-fits-all solution.

Having more GCs as options, including ones that require write barriers, would be a great thing!