February 10, 2010
On Wed, 10 Feb 2010 02:12:04 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:

> Robert Jacques wrote:
>> On Tue, 09 Feb 2010 22:23:05 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:
>>> Robert Jacques wrote:
>>> [snip]
>>>
>>> Just to make sure, I'm tuning out this entire GC-related discussion because I assume it's not related to the core language. If there are bits that prompt language changes please let me know. Thanks.
>>>
>>> Andrei
>>  Yes, there was one important language change which my first post
>> advocated: the separation of local and shared (+immutable & unique)
>> data into logically separate memory pools. I'm sorry that I didn't make
>> it clearer. From a TDPL point of view, this allows implementations to
>> eliminate false sharing (an important fact to highlight) and to use
>> thread-local allocation and/or thread-local collectors. The downside is
>> that it adds language complexity and makes casting from local to
>> shared/immutable/unique implementation defined.
>
> Doing this would disallow people from transporting large amounts of data across thread boundaries, which is now possible with cast and the requisite care. I'm not sure we are ready to give that up.

The recommended way of transporting large amounts of data across thread boundaries is already to use immutable or unique data, both of which can and should be allocated from the shared memory pool. So while the vast majority of use cases would be covered, naturally there will be exceptions.

>> Now, I've mentioned this enhancement to D before, in regard to thread-local garbage collectors, but Walter didn't think the cost/benefit was worth it at the time. Indeed, thread-local collectors are just one of many possible modern GCs a D runtime could adopt. Except, the follow-up discussion with Fawzi has started to give me a sinking feeling. All those wonderful modern GCs seem to be designed for functional or (mostly) pure-OO languages, not systems programming languages. So far, everything but thread-local GCs has gone down in flames. This is disconcerting as even TLGCs are meant to be paired with a modern shared GC of some kind. I'm not a GC expert, so there might be something out there, but I think it would be very prudent to find out what they are, in case they too require some change to the language to work.
>
> I'm not sure your concerns are valid. Most other OO languages are messier than D with regard to concurrency, yet their garbage collectors do manage.

Language concurrency and GC concurrency are very different beasts.

> I agree you have a good point, but the constraints within which we are operating are forcing me to ignore this aspect unless a very clear vision of the problem and a possible solution comes forward. (But by all means feel free to continue the discussion.) From what I can tell after spending little time (th|s)inking about this matter, it's certain casts that thread-local mutable storage would disallow, and that is a possibility that is easy to introduce later.

I understand, and it is a minor change, as you pointed out. The only issue I saw was that a while ago it was recommended to build immutable objects/strings using local copies first and then casting, although I think unique is now taking over this role.
February 10, 2010
On 10-feb-10, at 08:25, Walter Bright wrote:

> Robert Jacques wrote:
>>
>> Yes, there was one important language change which my first post advocated: the separation of local and shared (+immutable & unique) data into logically separate memory pools. I'm sorry that I didn't make it clearer. From a TDPL point of view, this allows implementations to eliminate false sharing (an important fact to highlight) and to use thread-local allocation and/or thread-local collectors. The downside is that it adds language complexity and makes casting from local to shared/immutable/unique implementation defined.
>>
>
> There's another way to do this that requires no language changes. Simply have each thread have its own thread local pool to allocate from, but memory once allocated is treated as global. This allows individual allocations to be cast to shared, but the vast bulk will wind up being thread local without sharing cache lines, and it will not require locks to be taken for most allocations.

Yes, this was exactly what I was advocating; probably I was too unclear about it. Doing this removes the worst contention (the global GC lock on allocation) for all the "smallish" allocations. Large allocations will probably still have a global lock somewhere, but that is probably OK.

> A collection cycle, however, will still need to pause all threads and do the whole shebang.

Yes; if that is parallelized, then I don't think the GC time overhead is much larger than having separate local GCs (just trigger the global GC collection when the global allocation exceeds X).

In this setting one can also try to have a concurrent GC, but that is
more complex (that was my subsequent discussion), because then one has
to ensure that modifications during the mark phase don't lead to
"lost" objects:
suppose you have a.ptr=b and you do { c.ptr=a.ptr; a.ptr=null; }.
If you are unlucky, the mark phase will lose b: when you look at
c.ptr it doesn't yet point to b, and when you look at a.ptr it no
longer points at b...
There are several methods to avoid this, but all introduce an overhead.

Both these GC approaches work without shared; the second one could
have a reduced overhead due to shared.
The gain due to shared is not so clear to me, because the main
bottleneck (the global allocation lock) is removed in both cases.

Fawzi
February 10, 2010
On 10-feb-10, at 03:14, Robert Jacques wrote:

> On Tue, 09 Feb 2010 05:12:29 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
>> On 8-feb-10, at 19:15, Robert Jacques wrote:
>>> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
>
> [snip]
>
>>> Concurrent collectors trade throughput (GC time / program time) for program responsiveness. They also often trade program correctness for latency, as they are notoriously hard to get correct for languages with mutation. While there are some very elegant concurrent GCs for functional languages, the high-level descriptions of the Boehm concurrent GC seem to be inherently racy, while Java concurrent GCs tend to be highly instrumented and lock-tastic (mark-bit locking on every ref assignment).
>> that can be done with atomic ops...
>
> It can only be done correctly using a DCAS (not dwCAS), so it isn't implementable on any mainstream chip. So you have to lock the mark bits on an object. This can be done as a spin-lock using atomic ops, but that doesn't change the number or frequency of the locks taken.

I might be missing something (and I don't know what DCAS is), but as you always go in one direction (unmarked -> marked) I don't see why something like this wouldn't work (for 32 bit):

void mark(uint[] bitfield, uint bit) {
	auto idx = bit / 32;              // word index (not (bit+31)/32)
	auto mask = 1u << (bit & 0x1F);   // left shift, not right
	volatile auto oldV = bitfield[idx];
	while ((oldV & mask) == 0) {
		oldV = atomicCAS(bitfield[idx], oldV | mask, oldV);  // no shadowing
	}
}

This can be extended if you have 3 states (white, gray, black) for a
concurrent GC, or to ref counting.
You have collisions only when two threads try to modify the same
memory location, and even then, given that the function is so fast to
compute, it should not be a big problem.
False sharing makes things worse, but even that can be made
acceptable by distributing the mark bits a bit in memory.

At the end of the mark phase you need a memory barrier in each working thread, and a semaphore (or some flag) to ensure that all changes have global visibility, but that is it.

>>> I'd disagree in that memory partitioning should be equally difficult on either 32-bit or 64-bit platforms. Consider memory pages A B C. If thread 1 allocates some memory and gets page A, then thread 2 allocates some memory and gets page B, when thread 1 next allocates memory it'll get page C and have a discontinuous memory set (A C).
>> you can use mmap or similar to reserve big ranges of memory that you might never actually use (commit), so yes there is a difference. On 32-bit you cannot afford reserving large chunks of memory just to quickly know which thread owns that memory.
>
> Interesting, though the fact that this makes each pool finite would appear to be a design flaw to me.

Well, you cannot have it both ways: if you know that you have lots of virtual space, you can make them finite but large reservations. Large allocations should use another method anyway, and if you really fill a pool you might think of using several pools for the same thread...

>>>> In general one would not expect heavy sharing between threads so for example an explicit list of external pointers could be kept if smaller than X.
>>>
>>> Do you know an example of this? Given the Java thread-local collection papers, this doesn't seem to be done in practice. Also, it seems very overhead intensive and prone to races.
>>
>> well not directly, but I know that for example http://www.research.ibm.com/people/d/dfb/papers/Bacon03Pure.ps
>>   keeps a list of changes on a per thread basis. That example
>> actually keeps a reference count instead of a simple mark flag to
>> avoid having to keep the "externally marked" list explicitly.
>> Probably a better approach if one goes in that direction.
>
> Ahh, that sounds more reasonable and an interesting hybrid approach to boot. Although, to the best of my knowledge, ref-counting GCs are not 100% correct given the multi-paradigm nature of the D programming language. (To say nothing of efficiency)

Ref counting simply allows you to immediately discard objects that
aren't part of cycles, at the cost of having a counter instead of a
simple flag (marked/unmarked).
Cycle removal might be delayed, but otherwise correctness/
incorrectness is the same as for other GCs (it depends only on the
information you have about pointers, and the probability of treating
as a pointer something that is not one).

> [snip]
>
>>>> This is an advantage, but using stack variables, scope variables, good escape analysis and automatic collection for what does not escape, a good concurrent gc,... the gain offered by shared is not so clear cut to me, seeing its complexity.
>>>
>>> First, full escape analysis is both difficult and slow in
>>> interpreted languages and impossible in compiled languages such as
>>> D. So automatic collection is out too.
>>> Second, scope/stack variables are limited in use and size; for
>>> example, you can't define trees or dynamic arrays on the stack nor
>>> exceed the stack size (typically 1MB)
>> Sure, those things alone don't solve all problems, but does what doesn't fall into them generate so much overhead, compared with "normal" shared allocation pools and a parallel or concurrent GC, as to be worthwhile?
>
> I assume you're talking about programmer and not implementation overhead? (Since the implementation overhead is definitely lower) Apple seems to think so. Microsoft and Intel seem to think so too. Pretty much everybody has basically agreed that the "normal" way has failed and we need a different solution. Just that nobody's sure what those solutions should be, so we're getting lots of different answers: lib-dispatch, OpenCL, CUDA, TBB, PLINQ, Erlang, etc.

mmh, I think several things are mixed up in this, not only the GC.
As for the GC, as far as I know Apple, Microsoft and Intel don't have
purely local GCs; can you point to a reference for what you were
saying? Or maybe it was because my statement was kind of badly
formulated: I meant several pools (note the plural there) while
having memory shared by everybody.

>>> Third, thread-local mark-sweep GCs actually equal or out perform modern concurrent-parallel-generational GCs (according to Apple)
>> I fully believe this, I don't contest that purely local mark and sweep is more efficient, but is it that much more efficient to justify the extra complexity in the language?

Does Apple really have a purely thread-local GC? I thought they had something like a ref-counting GC that could be concurrent.

>>> , and that's in a language which supports them. Further, concurrent collectors need to access the mark bits of an object. In D this means taking the global GC lock on _every_ reference assignment and traversing a log N memory tree.
>> setting the mark bits is a prime example of something that can be
>> done with an atomic op.
>> I haven't understood what you mean with "taking the global GC lock
>> on _every_ reference assignment and traversing a log N memory tree"
>> you can find the root pointer and check the mark bit without lock,
>> only when you think that need to set it you need to use an atomic
>> op (during the mark phase you can just go unmarked->marked).
>> You mean the operations that are needed by a concurrent GC to ensure
>> that modifications don't make the GC "lose" objects that are in use
>> if modified while collecting?
>> So basically the advantage of introducing shared is that concurrent
>> GC become cheaper and more feasible because the overhead of the
>> mutation is paid only by shared objects?
>
> Okay, the problem I think you're having is that you're thinking of Java and not of a systems programming language like C, C++ or D. In Java, all heap memory items are objects and there are no interior pointers. So it is trivial to find the mark bits of a class via its meta-data. You still can't update the mark bits atomically, since you need a DCAS to store the reference while making sure the mark bits are still what you think they are. So you have to lock-update-unlock. Anyway, in C/C++/D you have structs, interior pointers and arrays, which all mean that to find those mark bits you have to traverse the memory tree inside the GC. So you need to lock it. By the way, the same thing goes for capacity, which is why the lookup-cache was added to enhance array append performance.

I am confused again: DCAS, locking, ...
 From my understanding, the structure with the mark bits is not
modified during the mark phase; only its contents are modified.
Having interior pointers simply means that you need to find the root
before marking, but finding the root is something that can be done
without any extra locking (assuming again that during the mark phase
GC structures are not moved, which holds for all GCs that I know).
A shared cache for performance reasons: well yes, there you need
locking of some sort (note how preallocated pools make these ops
faster; in the normal case they are not terrible either, but you need
extra info to know the layout of the pool, as it isn't fixed).

>>>> Page reuse between different local pools is indeed an issue, I would say that with 64 bits one can simply give empty pages back to the system if they are over a given threshold, whereas with 32 bit indeed some more tricks (stealing dirty pages) might be in order. Please note that stealing pages from another numa node is not something you want to do unless it is unavoidable
>>>
>>> I don't see what 64-bit vs 32-bit has to do with this; the former simply lets you access more memory, it doesn't magically prevent page fragmentation, false sharing, etc.
>> yes but I don't expect page fragmentation to be such a problem that
>> you need to steal dirty pages from other pools, unless memory is
>> really tight.
>> false sharing is not an issue if you don't steal pages.
>
> False sharing of local objects isn't, but false sharing of shared/ immutable objects still is.
ok, fair point, so allocation of shared/immutable data should be padded if one wants to avoid false sharing.

> And memory can be just as tight on a 64-bit system as a 32-bit system; one just has the ability to see more of it.
yes, but I have difficulty imagining a thread generating lots of dirty pages and never using them again, such that stealing is worthwhile.
February 10, 2010
On Feb 10, 2010, at 3:59 AM, Fawzi Mohamed wrote:

> 
> On 10-feb-10, at 03:14, Robert Jacques wrote:
> 
>> On Tue, 09 Feb 2010 05:12:29 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
>>> On 8-feb-10, at 19:15, Robert Jacques wrote:
>>>> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
>> 
>> [snip]
>> 
>>>> Concurrent collectors trade throughput (GC time / program time) for program responsiveness. They also often trade program correctness for latency, as they are notoriously hard to get correct for languages with mutation. While there are some very elegant concurrent GCs for functional languages, the high-level descriptions of the Boehm concurrent GC seem to be inherently racy, while Java concurrent GCs tend to be highly instrumented and lock-tastic (mark-bit locking on every ref assignment).
>>> that can be done with atomic ops...
>> 
>> It can only be done correctly using a DCAS (not dwCAS), so it isn't implementable on any mainstream chip. So you have to lock the mark bits on an object. This can be done as a spin-lock using atomic ops, but that doesn't change the number or frequency of the locks taken.
> 
> I might be missing something (and I don't know what DCAS is), but as you always go in one direction (unmarked -> marked) I don't see why something like this wouldn't work (for 32 bit):
> 
> void mark(uint[] bitfield, uint bit) {
> 	auto idx = bit / 32;              // word index (not (bit+31)/32)
> 	auto mask = 1u << (bit & 0x1F);   // left shift, not right
> 	volatile auto oldV = bitfield[idx];
> 	while ((oldV & mask) == 0) {
> 		oldV = atomicCAS(bitfield[idx], oldV | mask, oldV);  // no shadowing
> 	}
> }
> 
> This can be extended if you have 3 states (white, gray, black) for a concurrent GC, or to ref counting.
> You have collisions only when two threads try to modify the same memory location, and even then, given that the function is so fast to compute, it should not be a big problem.
> False sharing makes things worse, but even that can be made acceptable by distributing the mark bits a bit in memory.
> 
> At the end of the mark phase you need a memory barrier in each working thread, and a semaphore (or some flag) to ensure that all changes have global visibility, but that is it.

This works in a world where all reference updates are done via in-language operations.  But D supports both extern (C) functions and inline assembly, both of which defeat incremental collection.
February 10, 2010
On 10-feb-10, at 17:41, Sean Kelly wrote:

> On Feb 10, 2010, at 3:59 AM, Fawzi Mohamed wrote:
>
>>
>> On 10-feb-10, at 03:14, Robert Jacques wrote:
>>
>>> On Tue, 09 Feb 2010 05:12:29 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
>>>> On 8-feb-10, at 19:15, Robert Jacques wrote:
>>>>> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
>>>
>>> [snip]
>>>
>>>>> Concurrent collectors trade throughput (GC time / program time) for program responsiveness. They also often trade program correctness for latency, as they are notoriously hard to get correct for languages with mutation. While there are some very elegant concurrent GCs for functional languages, the high-level descriptions of the Boehm concurrent GC seem to be inherently racy, while Java concurrent GCs tend to be highly instrumented and lock-tastic (mark-bit locking on every ref assignment).
>>>> that can be done with atomic ops...
>>>
>>> It can only be done correctly using a DCAS (not dwCAS), so it isn't implementable on any mainstream chip. So you have to lock the mark bits on an object. This can be done as a spin-lock using atomic ops, but that doesn't change the number or frequency of the locks taken.
>>
>> I might be missing something (and I don't know what DCAS is), but as you always go in one direction (unmarked -> marked) I don't see why something like this wouldn't work (for 32 bit):
>>
>> void mark(uint[] bitfield, uint bit) {
>> 	auto idx = bit / 32;              // word index (not (bit+31)/32)
>> 	auto mask = 1u << (bit & 0x1F);   // left shift, not right
>> 	volatile auto oldV = bitfield[idx];
>> 	while ((oldV & mask) == 0) {
>> 		oldV = atomicCAS(bitfield[idx], oldV | mask, oldV);  // no shadowing
>> 	}
>> }
>>
>> This can be extended if you have 3 states (white, gray, black) for a
>> concurrent GC, or to ref counting.
>> You have collisions only when two threads try to modify the same
>> memory location, and even then, given that the function is so fast
>> to compute, it should not be a big problem.
>> False sharing makes things worse, but even that can be made
>> acceptable by distributing the mark bits a bit in memory.
>>
>> At the end of the mark phase you need a memory barrier in each working thread, and a semaphore (or some flag) to ensure that all changes have global visibility, but that is it.
>
> This works in a world where all reference updates are done via in-language operations.  But D supports both extern (C) functions and inline assembly, both of which defeat incremental collection.

This works with a stop-the-world approach, exactly as now (and it is parallel).

I agree that to use that in a concurrent GC you need some way to
ensure that an update to a reference does not lose references.
That is possible only if you have some control over reference
updates: either that control has to be exposed to external programs,
or you declare programs that move a reference to an object from one
memory location to another (simply losing a reference or adding one
is not a problem, only the combination is) as unsafe to run during a
GC mark phase, if you use a concurrent GC and do not notify it of the
change.
Not the ideal solution, but that is all you can do: either you stop
the threads, or you exercise some control over the modifications...

February 10, 2010

Robert Jacques wrote:
>
> Yes. The reason I said that casting from local to shared was implementation defined was that I knew the operation could be valid or invalid based on the type of GC used. However, your comment sounds like the standard C/C++ approach to thread-local allocation, which I discussed in the original post. In short, this model has been shown to be insufficient in C/C++ with regard to false sharing and has created several hacks to remove false sharing for identifiable hot-spots. Your comment also doesn't address the key challenge to D's GC, which is: where is the garbage returned to?

The garbage is returned to the global memory pool, not the local one. When the local ones run out, they go and get more from the global pool.

> (which I also discussed) Anyways, in order to prevent false sharing, small immutable objects need their own memory pool and shared objects need to be created in a different manner than local objects. To me this creates a logical separation of their memory pools and at a minimum requires the GC implementation to perform allocation using specific flags for each type qualifier.
>

False sharing isn't just a problem with shared objects, it happens with local objects. Using separate local pools to allocate from will minimize this problem.
February 10, 2010
On Wed, Feb 10, 2010 at 1:30 PM, Walter Bright <walter at digitalmars.com>wrote:

>
>
> Robert Jacques wrote:
>
>>
>> Yes. The reason I said that casting from local to shared was
>> implementation defined was that I knew the operation could be valid or invalid
>> based on the type of GC used. However, your comment sounds like the standard
>> C/C++ approach to thread-local allocation, which I discussed in the original
>> post. In short, this model has been shown to be insufficient in C/C++ with
>> regard to false sharing and has created several hacks to remove false
>> sharing for identifiable hot-spots. Your comment also doesn't address the
>> key challenge to D's GC, which is: where is the garbage returned to?
>>
>
> The garbage is returned to the global memory pool, not the local one. When the local ones run out, they go and get more from the global pool.


Would it be practical for particular cache lines to have some kind of affinity for a given local allocator, or for a cache line with both live data and garbage to have an affinity for one of the thread pools that already owns an object within it?


>  (which I also discussed) Anyways, in order to prevent false sharing, small
>> immutable objects need their own memory pool and shared objects need to be created in a different manner than local objects. To me this creates a logical separation of their memory pools and at a minimum requires the GC implementation to perform allocation using specific flags for each type qualifier.
>>
>>
> False sharing isn't just a problem with shared objects, it happens with local objects. Using separate local pools to allocate from will minimize this problem.
>
> _______________________________________________
> dmd-concurrency mailing list
> dmd-concurrency at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
>
February 10, 2010

Kevin Bealer wrote:
>
>
> Would it be practical for particular cache lines to have some kind of affinity for a given local allocator, or for a cache line with both live data and garbage to have an affinity for one of the thread pools that already owns an object within it?
>

I don't know.
February 10, 2010
On Wed, 10 Feb 2010 06:16:08 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote: [snip]
>> A collection cycle, however, will still need to pause all threads and do the whole shebang.
>
> yes if that is parallelized then I don't think that the GC time overhead is much larger than having separate local GC (just trigger the global GC collection when the global allocation exceed X).

Parallel GCs do have a fair amount of overhead. They have to pause and restart every running thread, plus start every GC thread. Then the marking algorithm itself is usually lock-free, so every mark-bit read or write is atomic at a minimum. Some implementations also maintain a to-mark list, which needs to be synchronized for load-balancing reasons. And parallel GCs don't really address the embarrassing pause problem.

[snip]
> Both these GC approaches work without shared, the second one could have
> a reduced overhead due to shared.
> The gain due to shared is not so clear to me because the main bottleneck
> (allocation global lock) is removed in both cases.

The main bottleneck in GCs is collection, not allocation, because solving the allocation problem is well known and simple. We just haven't done it yet in D. What shared brings to the table is the ability to eliminate false sharing, which affects C/C++/C#/Java/etc., and to give first-class support to thread-local garbage collection, which besides its great theoretical properties is being used practically to great effect by Apple. And while, granted, we don't have to impose actual separation between memory pools to achieve the false-sharing guarantee, if we do there's an opportunity to choose the best GC for each memory type, instead of one that can handle everything but does so poorly.
February 10, 2010
On Wed, 10 Feb 2010 15:14:12 -0500, Walter Bright <walter at digitalmars.com> wrote:
> Kevin Bealer wrote:
>>
>>
>> Would it be practical for particular cache lines to have some kind of affinity for a given local allocator, or for a cache line with both live data and garbage to have an affinity for one of the thread pools that already owns an object within it?
>>
>
> I don't know.

Yes, it's possible. It requires DMD to give the GC allocator the type information (local/immutable/shared). The allocator can then make the capacities of shared objects multiples of the cache-line size. These are properly aligned in D's current GC. Immutable objects would either have to be promoted as well, or draw objects under a cache-line in size from a different set of free-list bins. The reason you can get away with immutable objects from different threads peacefully co-existing on a cache line is that they are read-only, so there's no false sharing.