D GC theory (February 23, 2015)
Posted by Sativa

How hard would it be to modify D's GC to do the following two things:

1. Scan the heap in the background on another thread/CPU for compactification.

2. Move blocks of memory in predictable (timewise) chunks, so that the main thread is never locked for any noticeable period of time.

E.g., in the first step the GC finds some blocks of memory that need to be freed/compacted. In the second step it starts freeing/compacting them in predictable pieces by limiting the time it spends working.

The point is that the GC might be run more often, but in smaller and predictable steps. That is, the GC should be able to calculate how long it will take to free/compact a block. If it would take too long, it simply does the work in stages.

This way there is essentially a very predictable and consistent CPU usage with the GC running, but never any major lag spikes that would throw real-time behavior out the window.

It would seem that such a "feature" would be easy to implement by modifying the existing GC code to be "incremental".

I'd prefer a constant 1-5% CPU usage given to the GC, provided it didn't blow up for no reason. Being very predictable, it would just mean one has to get a slightly faster CPU to compensate, or optimize the code slightly.

It would be analogous to game programming:

1. We can have the GC steal, say, 1 fps to do its work...

2. Or we can keep the GC asleep, doing nothing, until it has so much work that it has to pause the entire engine for half a second, dropping the fps by half momentarily. This might happen every minute or so... but that is still unacceptable for most gamers (assuming 30-60 fps).

I'd prefer the first case.
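The time-budgeted step being proposed could be sketched roughly as follows. This is purely hypothetical: `markSomeBlocks` and `sweepSomeBlocks` are assumed incremental primitives that the current GC does not actually expose.

```d
import core.time : Duration, MonoTime, msecs;

// Hypothetical bounded-work primitives; the real GC has no such entry points.
bool markSomeBlocks() { /* mark a bounded number of blocks */ return false; }
void sweepSomeBlocks() { /* sweep a bounded number of blocks */ }

enum budget = 1.msecs;   // hard cap per step, as proposed above

void incrementalStep()
{
    immutable deadline = MonoTime.currTime + budget;
    // Do bounded chunks of work until the budget is exhausted.
    while (MonoTime.currTime < deadline)
    {
        if (!markSomeBlocks())   // false once the mark phase is complete
        {
            sweepSomeBlocks();
            break;
        }
    }
}
```

Calling `incrementalStep` once per frame would give the steady 1-5% CPU profile described, provided each primitive really can stop mid-phase.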
Re: D GC theory (February 23, 2015)
Posted in reply to Sativa

Basically, I am simply wondering if the GC can "throttle" itself so as to reduce the *maximum* time the program has to wait.
Re: D GC theory (February 23, 2015)
Posted in reply to Sativa

On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
> How hard would it be to modify D's GC to do the following two things:
>
> 1. Scan the heap in the BG on another thread/cpu for compactification.
>
> 2. Move blocks of memory in predictable(timewise) chunks that prevents locking the main thread for any period of time.
>
> e.g., in the first step the GC finds some blocks of memory that need to be freed/compacted. In the 2nd step it starts freeing/compacting it in predictable pieces by limiting the time it takes while working.
>
> The point is, that maybe the GC is ran more often but in smaller and predictable steps.
>
> That is, the GC should be able to calculate how long it will take to free/compact a block. If it takes too long then it simply does it in stages.
>
> This way, there is essentially a very predictable and consistent cpu usage with the GC running but never any major lag spikes that are going to throw real time behavior out the window.
>
> It would seem that such a "Feature" would be easy to implement by modifying the existing GC code to be "incremental".
>
> I'd prefer a constant 1-5% cpu usage given to the GC if it didn't blow up for no reason. This way, it being very predictable, just mean one has to get a slightly faster cpu to compensate or optimize the code slightly.
Hi,
D's GC actually is predictable. It will not collect unless you allocate. You can also disable it and manually collect. I use it this way and essentially use it as a smart freelist.
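The "smart freelist" usage weaselcat describes maps directly onto the real `core.memory.GC` API, along these lines (the frame loop and counts are illustrative):

```d
import core.memory : GC;

void main()
{
    GC.disable();              // suppress automatic collections
    foreach (frame; 0 .. 600)
    {
        auto scratch = new ubyte[](4096);   // allocate freely per frame
        // ... per-frame work ...
        if (frame % 60 == 0)
            GC.collect();      // collect only at a moment we choose
    }
    GC.enable();
}
```

Since the GC only collects when an allocation triggers it, disabling it and calling `GC.collect` at chosen points (e.g. between levels, or every Nth frame) moves the pause to where it hurts least.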
Re: D GC theory (February 24, 2015)
Posted in reply to weaselcat

On Monday, 23 February 2015 at 22:11:48 UTC, weaselcat wrote:
> On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
>> How hard would it be to modify D's GC to do the following two things:
>>
>> 1. Scan the heap in the BG on another thread/cpu for compactification.
>>
>> 2. Move blocks of memory in predictable(timewise) chunks that prevents locking the main thread for any period of time.
>>
>> e.g., in the first step the GC finds some blocks of memory that need to be freed/compacted. In the 2nd step it starts freeing/compacting it in predictable pieces by limiting the time it takes while working.
>>
>> The point is, that maybe the GC is ran more often but in smaller and predictable steps.
>>
>> That is, the GC should be able to calculate how long it will take to free/compact a block. If it takes too long then it simply does it in stages.
>>
>> This way, there is essentially a very predictable and consistent cpu usage with the GC running but never any major lag spikes that are going to throw real time behavior out the window.
>>
>> It would seem that such a "Feature" would be easy to implement by modifying the existing GC code to be "incremental".
>>
>> I'd prefer a constant 1-5% cpu usage given to the GC if it didn't blow up for no reason. This way, it being very predictable, just mean one has to get a slightly faster cpu to compensate or optimize the code slightly.
>
> Hi,
> D's GC actually is predictable. It will not collect unless you allocate. You can also disable it and manually collect. I use it this way and essentially use it as a smart freelist.
That isn't the problem. The problem is that when it does collect, the time it takes is unpredictable. It could take 1us or 10m. If there is a cap on how long the GC can run in any particular time interval, then its time complexity is simple and predictable, and the GC can be better used for RT applications.

Effectively, I would rather the GC run every second and spend a maximum of 1ms cleaning up (not necessarily finishing) instead of running whenever and potentially taking seconds to clean up.

It's all about how to distribute the GC's running time. As it stands now, it can run anytime and take as long as it wants. I'd rather have it run "continuously" but not take as long as it wants. By allowing it to run continuously, in short bursts, one should get the same long-term behavior but without the spikes in CPU usage that prevent its usefulness in RT applications.
Re: D GC theory (February 24, 2015)
Posted in reply to Sativa

On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

> How hard would it be to modify D's GC to do the following two things:
>
> 1. Scan the heap in the BG on another thread/cpu for compactification.

needs read/write barriers added to generated code. a major slowdown for ALL memory access.

> 2. Move blocks of memory in predictable(timewise) chunks that prevents locking the main thread for any period of time.

as you can't do "partial scans" without barriers... see above.

> e.g., in the first step the GC finds some blocks of memory that need to be freed/compacted. In the 2nd step it starts freeing/compacting it in predictable pieces by limiting the time it takes while working.

it's not freeing that takes time, it's scanning that takes time. after scanning is complete there is not much left to do. ah, well, if you have thousands of small objects, this can be slow too, but with such a use case partial frees will lead to OOM situations very fast (as your code has to allocate like crazy to get into such a situation).

tl;dr: you can't do this in a compiled language without introducing a major slowdown for indirect memory access. and indirect memory access is used *everywhere*. you will not like the results.
Re: D GC theory (February 24, 2015)
Posted in reply to ketmar

On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
> On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:
>
>> How hard would it be to modify D's GC to do the following two things:
>>
>> 1. Scan the heap in the BG on another thread/cpu for compactification.
>
> needs read/write barriers added to generated code. a major slowdown for
> ALL memory access.
Only modifications of pointers that introduce new cross-block dependencies need a barrier (so that the GC knows to recheck the new dependency). Other memory access goes without slowdown.
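The kind of barrier being described here could look roughly like this sketch; all names are hypothetical, and in practice the compiler would emit the barrier at every pointer store rather than the programmer calling it:

```d
// Locations the concurrent marker must revisit ("remembered set").
__gshared void*[] rememberedSet;

void storePointer(void** slot, void* value)
{
    *slot = value;                       // the actual pointer store
    rememberedSet ~= cast(void*) slot;   // tell the GC to recheck this slot
}

void storeInt(int* slot, int value)
{
    *slot = value;   // plain data store: no barrier, no slowdown
}
```

The point of contention in the thread is exactly how much of a program's stores fall into the first category versus the second.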
Re: D GC theory (February 24, 2015)
Posted in reply to ketmar

On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:

> On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:
>
>> How hard would it be to modify D's GC to do the following two things:
>>
>> 1. Scan the heap in the BG on another thread/cpu for compactification.
>
> needs read/write barriers added to generated code. a major slowdown for ALL memory access.

It's possible to (ab)use the MMU as a write barrier. Sociomantic's GC implementation does this by forking and running the scan in the child, then communicating information back about what can be freed.

Video: http://dconf.org/talks/lucarella.html

IIRC, at the end the speaker was asked about the performance overhead of the COW that is involved, but he couldn't give any numbers.
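The fork trick can be sketched roughly as follows (POSIX only). `findGarbage` is hypothetical, and the real Sociomantic collector is considerably more involved:

```d
import core.sys.posix.unistd : _exit, close, fork, pipe, write;

// Hypothetical: mark the snapshot and return addresses of dead blocks.
size_t[] findGarbage() { return null; }

void concurrentCollect()
{
    int[2] fds;
    pipe(fds);
    if (fork() == 0)
    {
        // Child: sees a copy-on-write snapshot of the heap, so it can
        // scan undisturbed while the parent keeps running and mutating.
        auto dead = findGarbage();
        write(fds[1], dead.ptr, dead.length * size_t.sizeof);
        _exit(0);
    }
    close(fds[1]);
    // Parent: continues immediately; later reads the list of dead blocks
    // from fds[0] and frees them.
}
```

The "write barrier" here is the kernel's page-fault handler: any page the parent writes after the fork gets copied, so the child's view stays consistent without instrumenting generated code.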
Re: D GC theory (February 24, 2015)
Posted in reply to Sativa

On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
> 1. Scan the heap in the BG on another thread/cpu for compactification.
You need a precise GC to do this, because you need to know what is a pointer. Otherwise, if you have two variables, say a size_t and a pointer, that happen to contain the same raw value, then on compacting you could overwrite the size_t despite it just being a number and not a pointer.
Also you need to ban certain pointer shenanigans (like xor fun stuff), though the current GC doesn't work when those are in use anyway, so I guess it's not too bad.
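A two-line illustration of the ambiguity being described:

```d
void main()
{
    int* p = new int;
    size_t n = cast(size_t) p;   // same bit pattern, but just a number

    // If a compacting GC moved *p and then patched every word equal to
    // the old address, it would rewrite n as well, silently corrupting
    // the program. A precise GC knows n is not a pointer and leaves it,
    // while a conservative GC cannot tell the two words apart.
}
```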
Re: D GC theory (February 24, 2015)
Posted in reply to Marc Schütz

On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:
> It's possible to (ab)use the MMU as a write barrier.
Uhm...
The throughput for LFENCE is about 5 cycles and for SFENCE about 33 cycles... The cost of a page miss is on the order of 1000 cycles, plus the overhead of copying.

And that's assuming that you have enough memory and that the TLB isn't affected...
Re: D GC theory (February 24, 2015)
Posted in reply to Kagamin

On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:
> On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:
>> On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:
>>
>>> How hard would it be to modify D's GC to do the following two things:
>>>
>>> 1. Scan the heap in the BG on another thread/cpu for compactification.
>>
>> needs read/write barriers added to generated code. a major slowdown for
>> ALL memory access.
>
> Only modifications of pointers, which introduce new cross-block dependencies (so that GC knows to recheck the new dependency). Other memory access goes without slowdown.
But this type of thinking is the reason why the current GC is in the state it is.

The compiler knows which pointers are "free" and which ones are "bound". Bound pointers are pointers that are not assigned freely by the user, e.g., a pointer to an array whose address is never arbitrarily set by the user. The compiler knows where and how such a pointer is assigned. Most pointers are this way.

Bound pointers are pointers the GC can easily clean up, because it knows when and how they are used. If all pointers of a program were bound, the GC could work in the background and never pause the program to clean up (essentially, the compiler would need to insert special code). Most pointers are bound pointers.

Free pointers are more difficult, as they can, say, be randomly initiated, point to anywhere on the heap, and have to be looked at under a lock (to prevent them from changing in the middle of some GC operation).

But if one distinguishes bound and free pointers (easily done with a bit in the pointer) and has the compiler keep track of when free pointers are used (by setting a dirty bit when they are written to), then one can more easily scan the heap in the background.

In fact, one could potentially get away from all syncing issues by doing the following:

Whenever free pointers are used, a simple spin lock is used. The spin lock checks a flag in the free-pointer table that signals that a pointer is being changed by the code. When this is true, the free-pointer table is in a state of flux and can't be relied on. In the meantime, the GC can build up information about the heap for the bound pointers. It can figure out what needs to be changed, set up buffering (which can be done using bits in the pointer), etc., all in the background, because the bound pointers are "stable" and change deterministically.

When the free-pointer table's dirty flag is unset, it means that the free pointers are not changing and the GC can lock the table using another flag. When that flag is set, the spin lock kicks in and pauses the program while the GC is working on the free-pointer table. (Or, to be more efficient, the program can yield to some other background task.)

By having multiple tables of free pointers one can reduce the overhead: the GC looks at one piece at a time and locks only a fraction of the code at any point in time. The compiler can distribute the locks vs. pages in an optimized way through profiling.
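The dirty-flag handshake being proposed could be sketched as below. This is a rough, hypothetical illustration only: no such table exists in the real GC, and the sketch glosses over the race between checking one flag and setting the other, which a real implementation would have to close with a compare-and-swap or similar:

```d
import core.atomic : atomicLoad, atomicStore;

shared bool tableDirty;   // mutator is mid-update on the free-pointer table
shared bool tableLocked;  // GC currently owns the table

// Mutator side: mark the table dirty around each free-pointer update,
// spinning first if the GC holds the table.
void updateFreePointer(void delegate() change)
{
    while (atomicLoad(tableLocked)) { }
    atomicStore(tableDirty, true);
    change();
    atomicStore(tableDirty, false);
}

// GC side: wait for a quiescent table, then take ownership for the scan.
void lockTableForScan()
{
    while (atomicLoad(tableDirty)) { }
    atomicStore(tableLocked, true);
}
```

Splitting the state across several tables, as suggested, would shrink each lock's scope so the mutator only ever spins on the fraction of pointers the GC is currently scanning.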
Copyright © 1999-2021 by the D Language Foundation