June 05, 2006
Ok, I see now that my "best and simplest" solution is neither of the two :) But I don't want to throw the idea away so fast. Please let's discuss it a bit further.

Let's think about the scenario of having reference counting (RC) and
mark&sweep (MS) in parallel. Like Daniel Keep said, it is used in
Python. If Python uses this, the idea cannot be so crazy.

> 1) cyclical data structures won't get free'd

Combine RC and MS, so cycles are detected. And it should be possible to avoid cycles altogether if you don't want to have any MS pause.

> 2) every pointer copy requires an increment and a corresponding decrement - including when simply passing a reference to a function

This is true for a smart pointer class. But is it really true for
compiler-supported RC?
If I call a func/method, I hold a valid reference to the object. While
this reference is copied down the stack, those copies are only temporary
references, and all of them are released before the call returns to the
caller. You can ignore them all for RC. Only if the reference is stored
does the ref counter have to be incremented. This is only possible if it
is done by the compiler. No smart pointer can do this :)
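
Roughly what I have in mind - only a sketch, and the escape analysis is
something the compiler would have to do for us, it does not exist today:

# class Node { int value; }
#
# // 'n' is only borrowed: the caller already holds a counted reference
# // and nothing stores 'n', so no refcount traffic is needed at all.
# int readValue(Node n)
# {
#     return n.value;
# }
#
# Node lastSeen;   // module-level variable
#
# // here 'n' escapes into something longer lived, so the compiler has to
# // emit an increment exactly at the point of the store - and only there.
# void remember(Node n)
# {
#     lastSeen = n;
# }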

> 3) in a multithreaded app, the incs and decs must be synchronized

Isn't an atomic inc/dec enough? (I don't know anything about the performance of the asm "lock" instruction.)
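
What I mean is something like this - only a sketch; atomicIncrement and
atomicDecrement stand for single lock-prefixed instructions (they are not an
existing API), and 'refs' is a counter the compiler would maintain per object:

# // counter updates as single lock-prefixed instructions, no mutex anywhere
# void incref(Object o) { atomicIncrement(o.refs); }
# bool decref(Object o) { return atomicDecrement(o.refs) == 0; } // true => last ref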

> 4) exception handlers (finally's) must be inserted to handle all the decs so there are no leaks. Contrary to assertions otherwise, there is no such thing as "zero overhead exceptions."

Yes, this is additional overhead in code size and runtime. And additional post-destructors are needed to set to null all member refs that weren't nulled by the dtor.

> 5) in order to support slicing and interior pointers, as well as supporting reference counting on arbitrary allocations of non-object data, a separate "wrapper" object must be allocated for each allocation to be ref counted. This essentially doubles the number of allocations needed.

The MS GC already has a memory registry. Isn't it recorded there where an object begins? Let the ref counter live in this registry. This should work with all types, with slicing, and with references to object members.
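
Roughly like this - only a sketch; gcFindBase and refCounterOf are invented
names for the registry lookup, not the real GC API:

# // the counter lives with the block the GC already knows about
# void incref(void* p)
# {
#     void* base = gcFindBase(p);  // find the start of the block in the registry
#     ++*refCounterOf(base);       // counter stored alongside the block
# }

That way an interior pointer or a slice still finds the right counter through
the base address of its block.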

> 6) The wrapper object will mean that all pointers will need to be double-dereferenced to access the data.

=> 5.)

> 7) Fixing the compiler to hide all this stuff from the programmer will make it difficult to interface cleanly with C.

Hm, ok. Some more manual work is necessary here.
But if you hand the last reference over to a C lib, isn't that a problem
for the MS GC as well?

> 8) Ref counting can fragment the heap thereby consuming more memory just like the gc can, though the gc typically will consume more memory overall.

Doesn't the current GC implementation also fragment the heap? But I don't really have an opinion on this issue.

> 9) Ref counting does not eliminate latency problems, it just reduces them.

Where do you see the latency with RC? The collecting work is done in the smallest possible pieces, all the time. Only the allocation of new memory needs an unbounded time to complete, but it does not need to stop all other threads; e.g. hardware interrupts don't need to be disabled.

> 
> The proposed C++ shared_ptr<>, which implements ref counting, suffers from all these faults. I haven't seen a heads up benchmark of shared_ptr<> vs mark/sweep, but I wouldn't be surprised if shared_ptr<> turned out to be a significant loser in terms of both performance and memory consumption.
> 
> That said, I'd like for D to optionally support some form of ref counting, as rc is better for managing scarce resources like file handles.

As I said, I would be happy to be able to switch the whole runtime system to RC. Losing 50% performance would be acceptable to me if I get latency < 100us in return.

Frank Benoit




June 05, 2006

Frank Benoit wrote:
> In nearly every other GC implementation you will also need a read or write barrier. With reference counting you have a write barrier.
> 
> Yes, the imprecise, non-incremental, "stop-the-world", non-concurrent, non-generational, conservative Phobos GC implementation is an exception.
> 
> But a write barrier alone doesn't make ref counting ridiculous. Well, you are right, it is not the solution for each and everything, but I think it's a big step in the right direction. And a mark&sweep GC should be the option.

I guess we'll just have to disagree, then :)

Like I said before, I think the current GC should be the default since, even if it's not the best in terms of latency, it's the *safest*. There's no need for someone coding under it to worry about things like cycles, etc.  By making refcounting an option, we can say to programmers: "turn this on if you like, but watch your back for cycles!"

<joking>Plus, the current system gives us bitchin' benchmarks with wc! ;)</joking>

>> * because references are on the object, you can't have array slices
> 
> I don't understand that point.

An array slice is a pointer and a length.  The problem is the "pointer" part:

>> * it's also incompatible with pointers: if the pointer is to a member of an object, how on earth do you update the refcount then?
> 
> In the current implementation of the GC, every object's memory range is registered by the GC. With that you can get the object's refcounter address.

Aaah, ok: you're putting the refcount somewhere in the allocated chunk. Sorry, I thought you meant hiding it as a member on objects.

However, I imagine that this is still going to do horrible things to pointer operations in terms of speed.  Every time you move or copy a pointer, you'd have to ask the GC "ok, find which pool this pointer is located in, then inc/dec the refcount."

>> * slower in the immediate sense (incref & decref calls, plus it'll mean less code in L1 and L2 CPU cache, which can kill performance in some architectures)
> 
> Performance has two sides. Throughput is one thing, worst-case latency the
> other. We can think about techniques to optimize away unnecessary
> increments/decrements.
> BTW all incremental GCs will also need read and/or write barriers, which
> will have the same performance issues.

I concede that, as well as that the current GC has (probably) the worst all-at-once latency you can get.

>> * requires all objects to have destructors to decref child objects
>> (implicit or otherwise)
> 
> Yes, all objects having child objects. But if you think that destructor calls are a performance issue, you will have problems with mark&sweep as well.

Well, I don't *need* destructors in most general objects under mark&sweep.  It's only when they're holding some external resource, and that's when I usually manage them with auto or scoped destruction.

... which would admittedly be easier with refcounting, if a little bit slower :)

>> * without auto, makes it impossible to ensure an object's destruction when you leave scope: you can leak references
> 
> How can that happen if we have them natively? Well, ok, e.g. if you store
> the reference in a static variable, then it is not freed. But this is
> really a programmer's bug. And I think that a closer look at it will find
> solutions for such cases. First we have to decide to go for ref counting :)

What I mean is that if I say "auto Foo x = ..." then I know that come hell or high water, D is going to destroy that object at the end of scope.  Plus, because of auto rules, I *cannot* leak a reference to that object: D simply won't let me.
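
i.e. something like this (Foo is just a stand-in class):

# class Foo { ~this() { /* release whatever we hold */ } }
#
# void useIt()
# {
#     auto Foo x = new Foo();   // 'auto' reference
#     // ... use x ...
# }   // x's destructor runs right here, however we leave the scope, and the
#     // compiler rejects any attempt to leak x out of the function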

As for the static variable case, it's no different to the current implementation: you save a reference to something in a global or static variable, and it's never going to die :)

>> * inability to free cycles
>> * inability to even *run* destructors where cycles are involved, because
>> there is no order in which they can logically be executed and still be valid
> 
> Right, here a manual ref=null is needed. And perhaps a mark&sweep algorithm, called on request, to detect such cycles in the debug version. So if you know your program, you can avoid cycles, or you can break them before releasing the last reference.

The problem I have is that even if I *do* "know" my program, I don't want to have to worry about this.  I just want to write my damn code! This is one of the great advantages of D: if you don't give a rat's about memory management, you don't have to.
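
That is, I don't want to be forced to remember something like this everywhere
a cycle might show up (just a sketch):

# class Node { Node other; }
#
# void makeAndDrop()
# {
#     Node a = new Node;
#     Node b = new Node;
#     a.other = b;
#     b.other = a;      // cycle: a -> b -> a
#
#     b.other = null;   // manual break needed under pure refcounting...
#     a = null;
#     b = null;         // ...before the last outside references are dropped
# }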

That said, D *does* have one advantage: at compile time, the compiler can determine if it's possible for an object to indirectly refer to itself.  Using this, it can insert something like this into the decref:

# static if( typeof(ptr).canPointToItself )
#     registerForCycleChecking(ptr);

That way, you know that cycles will get checked for *eventually*.

> Doing a realtime GC seems to be impossible.

Oh, you'll get no argument from me THERE. :)

> The existing "realtime GC"
> seems to me, incremental ones, which stop the world for a short limited
> time. And that time goes into your worst latency. So if I want to do
> things with a latency less 100us, there is no chance to implement a GC.
> Ref counting whould be a solution, but this can only work if the whole
> program does not depend on "Stop-the-world".

I saw an interesting paper once that was talking about using a tri-colour mark&sweep collector that was fully concurrent: sounded heinously complicated to write, but could run a full collect in the background of your app.

Buggered if I can find the damn thing again, though...

>> As it stands, you can implement reference counting yourself: you just need to write an allocator, and call the incref and decref functions manually.  However, I do think that having built-in *support* for Reference counting would be fantastic.  The less that stuff like this is left in the hands of the programmer, the better.
> No, I can't.
> * The dereference operator cannot be overloaded.

MyRef.value

Yeah, it's not that great, but at least it's *possible* :)

> * Existing classes, native arrays etc. do not work like this.

Fair enough.

> I do not need just a few classes managed with ref counting. I want to avoid the "stop the world", and Phobos needs it.

I don't think phobos "needs" it.  What I think would help immensely is if phobos were easier to rebuild with different compiler flags, collector, etc.

Would you be happy if it was trivial to recompile phobos using refcounting instead of the current GC, and then use that?

# build -phobos -collector=refcount -version=SomethingElse
# build MyProggy -collector=refcount -- uses phobos_refcount.lib

> Where is the proof that ref counting is so slow? In mark&sweep you have to run over the whole memory. If memory is low, you have to do that often. That is really slow then.

I don't have any proof on me, although knowing the internet I'm sure I could find some "proof" without too much trouble :P

And yes; if memory is low, m&s will probably take a long time. Actually, I should find out how long that is.  Would be interesting to know exactly how long it takes to collect 512 meg of garbage :P

Personally, I don't mind how D's default collector is implemented, so long as it is efficient in the general case, and I don't have to know anything about the way it works in order to write code.

That said, I'd still like to see D support collectors other than the base GC.  Ideally, I think D should support at least:

# -collector=simple        -- current style: no special support needed
# -collector=writebarrier  -- calls std.gc.writeBarrier(ptr) on writes
# -collector=refcount      -- calls std.gc.writeBarrier(ptr), as well as
#                             std.gc.incref and std.gc.decref as
#                             appropriate.

That way, we can all have our cake, be it chocolate cake, fruit cake or cheesecake.  Plus, we even get to *eat* it :)

	-- Daniel

-- 
Unlike Knuth, I have neither proven or tried the above; it may not even make sense.

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/
June 05, 2006
Just a few comments, since you invoked my name :P

Frank Benoit wrote:
> Let's think about the scenario of having reference counting (RC) and
> mark&sweep (MS) in parallel. Like Daniel Keep said, it is used in
> Python. If Python uses this, the idea cannot be so crazy.

Just wanted to point out that if I remember correctly (and I *could* be wrong), the cycle checking is only done on decrefs, since that's the only time that cycles become a problem.

>> 3) in a multithreaded app, the incs and decs must be synchronized
> 
> Isn't an atomic inc/dec enough? (I don't know anything about the performance of the asm "lock" instruction.)

They definitely have to be synched: because in decref you've got to:

1. ref--;
2. if( ref == 0 ) free(ptr);
3. if( obj.canHaveCycles ) registerForCycleCheck(obj);

If not more.  You then also need the incs synched with that, otherwise you could end up "ref++"ing in the middle of "free(ptr)".
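
In code, the decref path I'm imagining is roughly this (a sketch; obj.refs,
atomicDecrement, canHaveCycles and registerForCycleCheck are all invented
names):

# void decref(Object obj)
# {
#     if (atomicDecrement(obj.refs) == 0)
#     {
#         delete obj;                    // last reference: run dtor, free memory
#     }
#     else if (obj.canHaveCycles)
#     {
#         registerForCycleCheck(obj);    // might now only be kept alive by a cycle
#     }
# }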

>> 7) Fixing the compiler to hide all this stuff from the programmer will make it difficult to interface cleanly with C.
> 
> Hm, ok. Some more manual work is necessary here.
> But if you hand the last reference over to a C lib, isn't that a problem
> for the MS GC as well?

MS' GC doesn't use refcounting (I assume you're talking about .NET, btw).  In fact, to solve this problem, it uses pinning:

"""
When the runtime marshaler sees that your code is passing to native code
a reference to a managed reference object, it automatically pins the
object. What this means is that the object is put in a special state
where the garbage collector will neither move the object in memory nor
remove the object from memory. ...

When the native function returns, the marshaled object is automatically
unpinned. Automatic pinning is very convenient, but it raises another
question. What happens if the native function caches the pointer for use
later on? When the function returns, won't the collector be free to move
the object? The answer is yes, and the solution for your code in such a
situation is to manually pin the object using the
System.Runtime.InteropServices.GCHandle type.
"""

 -- http://msdn.microsoft.com/msdnmag/issues/04/10/NET/

>> 8) Ref counting can fragment the heap thereby consuming more memory just like the gc can, though the gc typically will consume more memory overall.
> 
> Doesn't the current GC implementation also fragment the heap? But I don't really have an opinion on this issue.

Not as badly.  The current GC is pretty good at keeping fragmentation to a minimum.  Fragmentation happens when you allocate lots of little objects, then free some but not all, and end up with teeny-tiny "holes" in memory that are hard to fill.

I think Walter said this because of his comment on using "wrapper" objects to implement the references: you'd end up with lots of "holes" in memory.

>> 9) Ref counting does not eliminate latency problems, it just reduces them.
> 
> Where do you see the latency with RC? The collecting work is done in the smallest possible pieces, all the time. Only the allocation of new memory needs an unbounded time to complete, but it does not need to stop all other threads; e.g. hardware interrupts don't need to be disabled.

I think he means that it doesn't make latency just disappear; it spreads it over time.  It's the old "drop a frog into hot water and it'll jump out, drop it in cold water and then boil it and it won't notice" anecdote.

	-- Daniel

-- 
Unlike Knuth, I have neither proven or tried the above; it may not even make sense.

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/
June 05, 2006
Frank Benoit wrote:
> Ok, I see now that my "best and simplest" solution is neither of the two :)
> But I don't want to throw the idea away so fast. Please let's discuss it
> a bit further.

I'm not throwing it away so fast, I've thought about it for several years <g>.


> Let's think about the scenario of having reference counting (RC) and
> mark&sweep (MS) in parallel. Like Daniel Keep said, it is used in
> Python. If Python uses this, the idea cannot be so crazy.

Maybe, maybe not. Python isn't known for its speed. Java uses gc, not rc.


>> 1) cyclical data structures won't get free'd
> 
> Combine RC and MS, so cycles are detected. And it should be possible to
> avoid cycles altogether, if you don't want to have any MS pause.

There are ways to avoid the RC cycle problem, but I don't know how expensive they are.


>> 2) every pointer copy requires an increment and a corresponding
>> decrement - including when simply passing a reference to a function
> This is true for a smart pointer class. But is it really true for
> compiler-supported RC?

Yes.


> If I call a func/method, I hold a valid reference to the object. While
> this reference is copied down the stack, those copies are only temporary
> references, and all of them are released before the call returns to the
> caller. You can ignore them all for RC.

This fails if the passed reference 'escapes'. It can escape by being assigned to a global, being assigned to a class member, being inserted into some data structure that lives longer than the function, and being returned by the function.
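
For example (Holder and the list parameter are just stand-ins for "some data
structure that lives longer than the function"):

# class Node   { int x; }
# class Holder { Node member; }
#
# Node g;                                   // module-level variable
#
# Node escapes(Node n, Holder h, ref Node[] list)
# {
#     g = n;          // escapes into a global
#     h.member = n;   // escapes into a class member
#     list ~= n;      // escapes into a longer-lived data structure
#     return n;       // escapes through the return value
# }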

> Only if the reference is stored
> does the ref counter have to be incremented. This is only possible if it
> is done by the compiler. No smart pointer can do this :)
> 
>> 3) in a multithreaded app, the incs and decs must be synchronized
> 
> Isn't an atomic inc/dec enough?

That requires synchronizing.

> (I don't know anything about the
> performance of the asm "lock" instruction.)

It's slow enough to be noticable.

>> 4) exception handlers (finally's) must be inserted to handle all the
>> decs so there are no leaks. Contrary to assertions otherwise, there is
>> no such thing as "zero overhead exceptions."
> 
> Yes, this is additional overhead in code size and runtime. And additional
> post-destructors are needed to set to null all member refs
> that weren't nulled by the dtor.
> 
>> 5) in order to support slicing and interior pointers, as well as
>> supporting reference counting on arbitrary allocations of non-object
>> data, a separate "wrapper" object must be allocated for each allocation
>> to be ref counted. This essentially doubles the number of allocations
>> needed.
> 
> The MS GC already has a memory registry. Isn't it recorded there where an
> object begins? Let the ref counter live in this registry. This should work
> with all types, with slicing, and with references to object members.

Yes, you can find out the beginning of an arbitrary pointer's memory block. But that isn't a cheap operation.

>> 6) The wrapper object will mean that all pointers will need to be
>> double-dereferenced to access the data.
> 
> => 5.)

Finding the start of the memory chunk will be several times more expensive than the double indirection.

>> 7) Fixing the compiler to hide all this stuff from the programmer will
>> make it difficult to interface cleanly with C.
> 
> Hm, ok. Some more manual work is necessary here.
> But if you hand the last reference over to a C lib, isn't that a problem
> for the MS GC as well?

I don't know anything about the MS GC.

>> 8) Ref counting can fragment the heap thereby consuming more memory just
>> like the gc can, though the gc typically will consume more memory overall.
> 
> Doesn't the current GC implementation also fragment the heap?

Yes. malloc will fragment the heap, too. But only GC has the hope of doing compaction.

> But I don't really have an opinion on this issue.
> 
>> 9) Ref counting does not eliminate latency problems, it just reduces them.
> 
> Where do you see the latency with RC? The collecting work is done in
> the smallest possible pieces, all the time. Only the allocation of new memory
> needs an unbounded time to complete.

That's the same with gc. Only upon an allocation is the time unbounded.

> But it does not need to stop all
> other threads. e.g. hardware interrupts don't need to be disabled.

GC doesn't require disabling hardware interrupts. It does not need to stop any threads that do not hold references to GC allocated data.


>> The proposed C++ shared_ptr<>, which implements ref counting, suffers
>> from all these faults. I haven't seen a heads up benchmark of
>> shared_ptr<> vs mark/sweep, but I wouldn't be surprised if shared_ptr<>
>> turned out to be a significant loser in terms of both performance and
>> memory consumption.
>>
>> That said, I'd like for D to optionally support some form of ref
>> counting, as rc is better for managing scarce resources like file handles.
> 
> As I said, I would be happy to be able to switch the whole runtime
> system to RC. Losing 50% performance would be acceptable to me
> if I get latency < 100us in return.

Latency cannot be guaranteed even with malloc().

I don't know what constraints you have on the system you're developing. But when I've written real time interrupt code, the ISRs were written to not make any OS calls or any allocations. All they'd do is just gather data and post it to a FIFO buffer. A non-realtime thread would then pull the data out of the buffer and process it. The system would work without dropping data as long as the average (not maximum) processing time per datum was less than the data acquisition rate.
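
The shape of it, roughly (just a sketch; a real one needs the proper memory
barriers, and the element type and size are arbitrary):

# // single-producer / single-consumer ring buffer: the ISR only writes,
# // the worker thread only reads; no allocation, no OS calls, no locks
# struct Fifo
# {
#     uint[1024] buf;
#     size_t head;    // written only by the ISR
#     size_t tail;    // written only by the worker thread
#
#     bool push(uint sample)          // called from the ISR
#     {
#         size_t next = (head + 1) % buf.length;
#         if (next == tail)
#             return false;           // full: the sample is dropped
#         buf[head] = sample;
#         head = next;
#         return true;
#     }
#
#     bool pop(out uint sample)       // called from the non-realtime thread
#     {
#         if (tail == head)
#             return false;           // empty
#         sample = buf[tail];
#         tail = (tail + 1) % buf.length;
#         return true;
#     }
# }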
June 05, 2006
> They definitely have to be synched: because in decref you've got to:
>
> 1. ref--;
> 2. if( ref == 0 ) free(ptr);
> 3. if( obj.canHaveCycles ) registerForCycleCheck(obj);
>
> If not more.  You then also need the incs synched with that, otherwise you could end up "ref++"ing in the middle of "free(ptr)".

The thread doing the ++ apparently has a pointer to the object without a reference. How could that ever happen?

L.


June 05, 2006

Lionello Lunesu wrote:
>> They definitely have to be synched: because in decref you've got to:
>>
>> 1. ref--;
>> 2. if( ref == 0 ) free(ptr);
>> 3. if( obj.canHaveCycles ) registerForCycleCheck(obj);
>>
>> If not more.  You then also need the incs synched with that, otherwise you could end up "ref++"ing in the middle of "free(ptr)".
> 
> The thread doing the ++ apparently has a pointer to the object without a reference. How could that ever happen?
> 
> L.
> 

*slaps forehead*

Of course; you're right.  I was thinking about race conditions, but obviously got confused.

Please refer to my signature.  Specifically the "it may not even make sense." part :P

	-- Daniel

-- 
Unlike Knuth, I have neither proven or tried the above; it may not even make sense.

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/
June 05, 2006
>
> If not more.  You then also need the incs synched with that, otherwise you could end up "ref++"ing in the middle of "free(ptr)".
> 
As already pointed out by Lionello, if the counter reaches zero there is no other thread left that could manipulate it.

>>> 7) Fixing the compiler to hide all this stuff from the programmer will make it difficult to interface cleanly with C.
>> Hm, ok. Some more manual work is necessary here.
>> But if you hand the last reference over to a C lib, isn't that a problem
>> for the MS GC as well?

MS=mark&sweep :)
June 05, 2006
> I'm not throwing it away so fast, I've thought about it for several years <g>.
I meant myself, hehe.

>> If I call a func/method, I hold a valid reference to the object. While this reference is copied down the stack, those copies are only temporary references, and all of them are released before the call returns to the caller. You can ignore them all for RC.
> 
> This fails if the passed reference 'escapes'. It can escape by being assigned to a global, being assigned to a class member, being inserted into some data structure that lives longer than the function, and being returned by the function.

Yes, and the compiler can catch these cases. You win nothing by incrementing/decrementing for the parameter reference itself, so it should be possible to optimize those away.

>>> 3) in a multithreaded app, the incs and decs must be synchronized
>>
>> Isn't an atomic inc/dec enough?
> 
> That requires synchronizing.
> 
>> (I don't know anything about the
>> performance of the asm "lock" instruction.)
> 
> It's slow enough to be noticable.
> 
> Yes, you can find out the beginning of an arbitrary pointer's memory block. But that isn't a cheap operation.

Yikes, you are absolutely right.
So here comes the uglier solution:
every pointer is always a struct holding the pointer to the data and a
pointer to the refcounter.
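
Something like this (just a sketch):

# // every reference carries the data pointer plus the address of the shared
# // counter, so inc/dec never has to search the GC's registry
# struct RcRef(T)
# {
#     T*      data;    // the object, or an interior pointer / slice into it
#     size_t* count;   // counter for the whole allocation
# }

Copying such a reference costs two words instead of one, but at least the
counter is reachable with a single dereference.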

>>> 7) Fixing the compiler to hide all this stuff from the programmer will make it difficult to interface cleanly with C.
>>
>> Hm, ok. Some more manual work is necessary here.
>> But if you hand the last reference over to a C lib, isn't that a problem
>> for the MS GC as well?
> 
> I don't know anything about the MS GC.

you do :) MS=mark&sweep


>>> 9) Ref counting does not eliminate latency problems, it just reduces them.
>>
>> Where do you see the latency with RC? The collecting work is done in the smallest possible pieces, all the time. Only the allocation of new memory needs an unbounded time to complete.
> 
> That's the same with gc. Only upon an allocation is the time unbounded.

And in addition the time is unbounded for the GC fullcollect...

> 
>> But it does not need to stop all
>> other threads. e.g. hardware interrupts don't need to be disabled.
> 
> GC doesn't require disabling hardware interrupts. It does not need to stop any threads that do not hold references to GC allocated data.

If you want to have your ISR written in D, you have two choices:
a) Be sure that you do not move a reference (refb=refa; refa=null;), since
doing that while the GC is running can end up with the GC deleting a living
object. If you cannot guarantee this, then
b) also disable the interrupting code while the GC is running. But that
is not acceptable, so go back to a) or use another GC.

> Latency cannot be guaranteed even with malloc().
> 
> I don't know what constraints you have on the system you're developing. But when I've written real time interrupt code, the ISRs were written to not make any OS calls or any allocations. All they'd do is just gather data and post it to a FIFO buffer. A non-realtime thread would then pull the data out of the buffer and process it. The system would work without dropping data as long as the average (not maximum) processing time per datum was less than the data acquisition rate.

Yes, I will also not make any OS calls, nor will I do a "new". And I
can easily check the code for that. I can even add a check to
the GC allocator that it is not called from an ISR.
But a moving reference can corrupt the running GC, and there is no way
to check for that.






June 05, 2006
> * The cost of removing the last pointer to an object is unbounded
Can delete be unbounded? I didn't know that.

> * Even if the general latency is good, the total overhead of adjusting references is significantly greater than that of a tracing GC


> * It has a substantial space overhead, making it less useful for smaller heaps
In code size yes, but in heap size?
I thought the tracing GC needs much more heap, because it should be
called very seldom.

> * Cyclic references (already mentioned)
> 
> I think these make RC less suited for the general case, whereas it might fit very well in a tightly controlled system where low latency is required (which I'm well aware that you need ;).
> 
> IMO, the best general GC (the default) would be one which can handle as many
> cases as possible, and which is manipulable in those (preferably known)
> cases which it doesn't handle out of the box. I think that one needs to
> combine features from several GC techniques (possibly also RC) to achieve
> this.
> 

agreed.
June 05, 2006
Daniel Keep wrote:

> 
> 
> Frank Benoit wrote:
>> In nearly every other GC implementation you will also need a read or write barrier. With reference counting you have a write barrier.
>> 
>> Yes, the imprecise, non-incremental, "stop-the-world", non-concurrent, non-generational, conservative Phobos GC implementation is an exception.
>> 
>> But a write barrier alone doesn't make ref counting ridiculous. Well, you are right, it is not the solution for each and everything, but I think it's a big step in the right direction. And a mark&sweep GC should be the option.
> 
> I guess we'll just have to disagree, then :)
> 
> Like I said before, I think the current GC should be the default since, even if it's not the best in terms of latency, it's the *safest*. There's no need for someone coding under it to worry about things like cycles, etc.

Well, actually, the current GC is kinda bad with cycles. Try to make a circular linked list where you keep a single additional reference to one of the elements. Then remove the reference and collect. In theory this memory should now "just" be lost/leaked, but you might also experience crashes. I think Chris Miller had some interesting revelations on the subject.
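
Something like this (a sketch of the scenario; fullCollect is in std.gc):

# import std.gc;
#
# class Elem { Elem next; }
#
# void test()
# {
#     Elem first = new Elem;
#     Elem cur = first;
#     for (int i = 0; i < 3; i++)
#     {
#         cur.next = new Elem;
#         cur = cur.next;
#     }
#     cur.next = first;   // close the circle
#     first = null;
#     cur = null;         // the only outside references are gone
#     fullCollect();      // now see what the collector makes of the cycle
# }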

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource & #D: larsivi