February 23, 2012
On Wednesday, February 22, 2012 16:45:29 H. S. Teoh wrote:
> But still, some benchmarks do appear to be showing signs of a large performance hit on the GC when there happens to be many integers that look like valid pointers.

As I understand it, this is mitigated considerably on 64-bit platforms due to the large pointer size.

- Jonathan M Davis
February 23, 2012
On Wed, Feb 22, 2012 at 07:48:38PM -0500, Jonathan M Davis wrote:
> On Wednesday, February 22, 2012 16:45:29 H. S. Teoh wrote:
> > But still, some benchmarks do appear to be showing signs of a large performance hit on the GC when there happens to be many integers that look like valid pointers.
> 
> As I understand it, this is mitigated considerably on 64-bit platforms due to the large pointer size.
[...]

True. You'd have to deliberately want to break the GC in order for coincidental integer values to cause a significant problem.

But this problem could be a major issue if D were to become a significant player on handhelds, which, if I understand correctly, still largely run 32-bit CPUs.

It's also a major problem on 16-bit platforms, but the only uses of those I can conceive of are toy applications, so it's probably not worth considering. :)


T

-- 
That's not a bug; that's a feature!
February 23, 2012
> I don't know, but after reading this:
>
> 	http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps
>
> I think there might be a possibility of (almost) free GC.
>
COWing whole pages is about the last thing you need with a high mutation rate.
February 23, 2012
> 4) Thread local / global memory
>
> A callback into the runtime needs to happen in the following cases:
> - a __gshared variable is assigned
> - a reference / pointer is casted to immutable
> - a reference / pointer is casted to shared
>
> void _d_castToGlobalMem(void* ptr);
>
> This can be used by the GC to keep thread local pools of memory and move a memory block to a global memory pool as soon as it is needed there.
>
If we'd want per-thread mark/sweeping then shared memory must never own
unshared memory. I think this could be done using a separate allocator for
shared/immutable data. For casts this would require a transitive move of the
data or it'd need to be prohibited.
February 23, 2012
On 22.02.2012 22:32, H. S. Teoh wrote:
> On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
> [...]
>> 2) Tracking references on the stack:
>>
>> The D compiler always needs to emit a full stack frame so that the
>> GC can walk up the stack at any time in the program. The stack frame
> >> of every function generated by the D compiler starts with a
>> bitfield (usually the size of a machine register) where each bit
>> indicates that these bytes are a pointer / reference. The bitfield
>> needs to be large enough to cover the whole stack frame of the
>> function.
> [...]
>
> I was thinking about this a bit more, and I had an idea: why bother with
> storing bitfields on the stack? Any function's local pointer variables
> are known at compile-time. So store a function ID (probably a pointer)
> that maps to some static storage where this information is stored. Then
> we always only need 1 word of extra storage on the stack frame, and the
> GC can follow the pointer to get the info it needs. A recursively called
> function won't incur the cost of duplicated copies of bitfields, its ID
> points to same place. You can even have two different functions share
> the same ID if they have pointer variables in exactly the same places.
>
> The static storage can then be an array of relative stack offsets to the
> function's pointer variables, so the GC can easily use this info to find
> roots. No need to complicate the GC with manipulating bitfields, it's
> just an int[].
>
> If you want to get fancy, have the compiler reorder local variables so
> that pointers are clustered together in blocks, then in the static
> storage you can just encode pointer blocks by offset+length. (Although
> this may not help much with (pointer,length) pairs on the stack, like
> slices.) Or the compiler can reorder variables to maximize ID merges.
>
> The same thing can be done for scopes, since their local variables are
> also all known at compile-time.
>
>
> T
>

But how would you know which scope variables are already (or still) valid and which are not?

void func()
{
  void* ptr = gc.alloc(...);
  // ptr2, ptr3 not valid yet
  void* ptr2 = gc.alloc(...);
  // ptr3 not valid yet
  {
    void* ptr3 = ptr;
  }
  // ptr3 not valid anymore
}

Also, as the bitfield is stored on the stack, it will most likely already be in the cache. With your approach, scanning one stack frame would very likely also cause a cache miss because of the additional indirection. So if you are scanning 30 stack frames, that could mean 30 cache misses.
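For reference, the offset-table idea quoted above can be sketched like this. It is a C model with an invented frame layout, not what any D compiler emits; the point is only to show the one-word-per-frame indirection being debated:

```c
/* Sketch of H. S. Teoh's idea: instead of an in-frame bitfield, each
 * frame carries one word (the "function ID") pointing at static,
 * per-function data listing which locals hold GC pointers.
 * Layout and names are illustrative assumptions. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Static, per-function: word offsets (relative to the frame base) of
 * every local that holds a GC pointer; -1 terminates the list. */
static const int func_ptr_offsets[] = { 0, 2, -1 };

/* A frame as the GC would see it: word 0 = function ID, rest = locals. */
typedef struct {
    const int *ptr_offsets;   /* the 1-word "function ID" */
    uintptr_t  locals[4];
} frame;

/* Collect the roots the GC would report for one frame. */
static int scan_frame(const frame *f, uintptr_t *roots, int max) {
    int n = 0;
    for (const int *o = f->ptr_offsets; *o >= 0 && n < max; ++o)
        roots[n++] = f->locals[*o];  /* each listed offset is a root slot */
    return n;
}
```

Benjamin's cache objection is visible here: `f->ptr_offsets` is a load into static data that an in-frame bitfield would avoid, at the cost of one word per frame either way.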

-- 
Kind Regards
Benjamin Thaut
February 23, 2012
On 2012-02-22 20:40, H. S. Teoh wrote:
> On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
>> As I'm not satisfied with the current GC D has and don't see the
>> situation improving in the future without significant changes to the
>> compiler I wrote the following document that points out all the
>> possible issues with garbage collection I could think of and
>> possible solutions for them. This document is just a draft and only
> a proposal; criticism and comments are welcome.
>
> Have you seen this?
>
> 	http://www.llucax.com.ar/proj/dgc/index.html
>
>
> [...]
>> 2) Tracking references on the stack:
>>
>> The D compiler always needs to emit a full stack frame so that the
>> GC can walk up the stack at any time in the program. The stack frame
> >> of every function generated by the D compiler starts with a
>> bitfield (usually the size of a machine register) where each bit
>> indicates that these bytes are a pointer / reference. The bitfield
>> needs to be large enough to cover the whole stack frame of the
>> function.
>
> This adds a lot of overhead to the runtime stack, esp. if you have deep
> recursion. It's also not necessarily faster, since the GC now has to
> parse a bitfield (a variable-length encoded bitfield, no less), instead
> of just scanning words directly, which can be optimized by CPU-specific
> microcode depending on the target platform.
>
>
> [...]
>> Every scope generated by the D compiler would need additional code
>> at the start and end of the scope. When the scope is entered the
>> bitfield would be patched to represent the new variables inside the
>> scope and when the scope is left the bitfield is patched again to
>> remove the changes that were made on entering the scope.
>
> This would introduce quite a lot of overhead per scope. It will also
> lead to strange things like:
>
> 	if (x) y();	// faster
> 	if (x) { y(); }	// slower
>
> which will encourage people to omit {} after if, which makes code more
> fragile and hard to read.

Doesn't the "faster" example introduce an implicit scope as well?

-- 
/Jacob Carlborg
February 23, 2012

On Feb 22, 2012, at 8:07 PM, "Martin Nowak" <dawg@dawgfoto.de> wrote:

>> 4) Thread local / global memory
>> 
>> A callback into the runtime needs to happen in the following cases:
>> - a __gshared variable is assigned
>> - a reference / pointer is casted to immutable
>> - a reference / pointer is casted to shared
>> 
>> void _d_castToGlobalMem(void* ptr);
>> 
>> This can be used by the GC to keep thread local pools of memory and move a memory block to a global memory pool as soon as it is needed there.
>> 
> If we'd want per-thread mark/sweeping then shared memory must never own unshared memory. I think this could be done using a separate allocator for shared/immutable data. For casts this would require a transitive move of the data or it'd need to be prohibited.

Casting to/from shared needs a bit more logic anyway, so that the proper thread finalizes unshared objects. Casting to shared clears the block's owner, and casting away shared sets it. A nice perk is that casting away shared could then detect a sharing violation if the block already has another owner.
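The owner-tracking idea in the post above can be modeled in a few lines. This is a hypothetical sketch in C (single block header field, integer thread ids); none of these names exist in druntime:

```c
/* Sketch: each GC block records the thread that owns it (and must
 * finalize it); 0 means the block is shared with no owner. Casting
 * away shared claims ownership, casting to shared releases it, and a
 * second claim flags a sharing violation. Names are hypothetical. */
#include <assert.h>

#define NO_OWNER 0

typedef struct {
    int owner;  /* thread id that must finalize this block, 0 = shared */
} gc_block;

/* Models cast(T)sharedRef; returns 0 on a detected sharing violation. */
static int cast_away_shared(gc_block *b, int tid) {
    if (b->owner != NO_OWNER && b->owner != tid)
        return 0;            /* another thread already owns it */
    b->owner = tid;
    return 1;
}

/* Models cast(shared T)ref: the block becomes ownerless again. */
static void cast_to_shared(gc_block *b) { b->owner = NO_OWNER; }
```

A real version would need the owner update to be atomic, since the whole point is that several threads may race on the same block.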
February 23, 2012
On Wed, 22 Feb 2012 15:32:37 -0600, H. S. Teoh <hsteoh@quickfur.ath.cx> wrote:
> On Wed, Feb 22, 2012 at 07:56:15PM +0100, Benjamin Thaut wrote:
> [...]
>> 2) Tracking references on the stack:
>>
>> The D compiler always needs to emit a full stack frame so that the
>> GC can walk up the stack at any time in the program. The stack frame
>> of every function generated by the D compiler starts with a
>> bitfield (usually the size of a machine register) where each bit
>> indicates that these bytes are a pointer / reference. The bitfield
>> needs to be large enough to cover the whole stack frame of the
>> function.
> [...]
>
> I was thinking about this a bit more, and I had an idea: why bother with
> storing bitfields on the stack? Any function's local pointer variables
> are known at compile-time. So store a function ID (probably a pointer)
> that maps to some static storage where this information is stored. Then
> we always only need 1 word of extra storage on the stack frame, and the
> GC can follow the pointer to get the info it needs. A recursively called
> function won't incur the cost of duplicated copies of bitfields, its ID
> points to same place. You can even have two different functions share
> the same ID if they have pointer variables in exactly the same places.
>
> The static storage can then be an array of relative stack offsets to the
> function's pointer variables, so the GC can easily use this info to find
> roots. No need to complicate the GC with manipulating bitfields, it's
> just an int[].
>
> If you want to get fancy, have the compiler reorder local variables so
> that pointers are clustered together in blocks, then in the static
> storage you can just encode pointer blocks by offset+length. (Although
> this may not help much with (pointer,length) pairs on the stack, like
> slices.) Or the compiler can reorder variables to maximize ID merges.
>
> The same thing can be done for scopes, since their local variables are
> also all known at compile-time.
>
>
> T

The break-even point between bit-fields and pointers is 512 bytes. Although, if one is thinking about on-stack storage, this probably doesn't matter, since for alignment purposes you'll always end up using at least 1 word if not 2. However, a lot of functions use less than 512 (or 1024) bytes of stack space. I'd think it would be much more space efficient to have a separate bitfield for the stack. Cache efficiency should be about the same as an on-stack representation, and scanning would, in theory, be quicker. IIRC, a separate bit-field was the approach used by at least one precise C GC.
February 23, 2012
On 22/02/2012 20:53, Benjamin Thaut wrote:
>
> If you have a better idea for precise stack scanning I'm open for
> suggestions.
>


This is the problem with your proposal: it doesn't consider pros and cons or actual data, and it doesn't consider the alternatives. You go straight to "How can we do that?" without considering "Should we do that?"

What would be the impact of being precise on the heap but not on the stack?

1/ It would add some false positives. As the future is 64-bit, false positives will be far less common than on 32-bit machines. I searched for numbers on this but couldn't find any. Considering this concerns only the stack, it may be negligible (or not — it definitely requires data).

2/ Data pointed to by the stack are not movable. Again, what is the impact of that? How much data could be promoted from the young generation to the old one (assuming we have young and old generations)? How much data couldn't be compacted? What would be the overhead on allocators?

This definitely lacks the required data and/or analysis of pros and cons.

Additionally, the stack is laid out like a linked list: each function calling another registers the return address. With this information, we can know what is on the stack — except for the very last function called — with no runtime overhead. This is another alternative.
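That linked-list structure can be simulated without touching real registers. The sketch below models frames as structs; a real walk would follow the saved frame pointer (EBP/RBP) instead, and only works when frame pointers aren't omitted:

```c
/* Simulation of a frame-pointer chain: each frame stores its caller's
 * frame pointer and the saved return address, so the stack can be
 * walked like a linked list. Purely illustrative; real code reads the
 * machine frame pointer, which this model replaces with a struct. */
#include <assert.h>
#include <stddef.h>

typedef struct sim_frame {
    struct sim_frame *caller;   /* saved frame pointer of the caller */
    void             *ret_addr; /* return address: identifies the caller,
                                   hence which locals its frame holds */
} sim_frame;

/* Walk the chain and count the frames on the simulated stack. */
static int walk_depth(const sim_frame *fp) {
    int depth = 0;
    for (; fp; fp = fp->caller) ++depth; /* each hop = one stack frame */
    return depth;
}
```

The return address at each hop is what would let a runtime look up per-function layout data, which is the "no runtime overhead" alternative being proposed.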

But you have to consider that, even with a mask, you cannot be sure that what is marked as a pointer really is a pointer. A memory location can represent different things during a function's execution. So those values can only be considered probable pointers — or we disable some compiler optimizations. As we cannot be sure, point 2/ stays valid.

Given the overhead of the operation, it may not be worth it. To know that, we need actual data on how much of the stack actually holds pointers, and how many false positives we get. As the future is 64-bit, I'm not sure it is interesting for us.
February 23, 2012
You didn't mention what is the most important point, IMO.

In D, most data is thread-local. Data visible to other threads is either shared or immutable.

Both thread local data and immutable data lead to very interesting GC optimisations. This is where we need language support.