[dmd-concurrency] D's Memory Model
February 05, 2010
From a practical perspective, it is quite infuriating to correctly
implement a parallel algorithm, only to have it perform slower than its
single-threaded counterpart for no apparent reason. This commonly happens
when some form of hidden serialization creeps in, which can be
deterministic or non-deterministic and can arise at either the hardware or
the software level. Although fixing these issues is more of an
optimization than a language-design issue, some of the solutions require a
slight change to D's memory model and thus a change to the language.

False sharing
False sharing occurs when two separate memory locations on the same cache
line are modified by different threads. Automatic cache synchronization
then causes the entire line to be repeatedly flushed and re-loaded,
causing stalls and wasting bandwidth. This has led to the practice of
padding certain heap-allocated data structures (like message queues) by a
cache line before and by [cache line size - T.sizeof] after. However, this
is awkward to use, only fixes hard-to-find hot-spots and is not portable;
cache line sizes differ from CPU to CPU. In D, this issue affects all
objects 32 bytes or smaller (on most CPUs). It occurs when a thread's
local objects are allocated next to shared, immutable or a different
thread's local objects. It also occurs when shared objects are located
next to each other or next to immutable objects. (Immutable objects can be
safely allocated next to each other because they aren't written to.)
Thread-local memory pools (see thread-local allocation below) should solve
the first of these issues by segregating local objects from each other and
from shared state. The second issue can be addressed by having the GC
always allocate memory in cache-line units for shared objects. This would
be portable (runtime cache-line determination) and more memory efficient
than manual object padding; only a single cache line, instead of two, is
used for small shared objects, like message nodes. The downside of this
approach is that it requires separate memory pools for local and shared
objects, which makes casting thread-local data to shared or immutable a
compiler error.
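To make the padding idiom above concrete, here is a minimal C++ sketch (C++ rather than D, purely for illustration). `kCacheLine`, `Unpadded`, `Padded` and `hammer` are all hypothetical names; the fixed 64-byte line size is an assumption, whereas real code would determine it at runtime, as noted above:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed 64-byte line size; as the post notes, real code should
// determine this at runtime because it differs from CPU to CPU.
constexpr std::size_t kCacheLine = 64;

// Unpadded: both counters will usually land on the same cache line,
// so two threads writing them ping-pong the line between cores.
struct Unpadded {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padded: alignas forces each counter onto its own cache line,
// eliminating false sharing at the cost of extra memory.
struct Padded {
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

static_assert(sizeof(Unpadded) <= kCacheLine, "counters share a line");
static_assert(sizeof(Padded) >= 2 * kCacheLine, "counters on separate lines");

// Two threads hammer logically independent counters; with Unpadded this
// is correct but slow, with Padded it is correct and fast.
template <typename T>
void hammer(T& s, int iterations) {
    std::thread t1([&] {
        for (int i = 0; i < iterations; ++i)
            s.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (int i = 0; i < iterations; ++i)
            s.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
}
```

Both structs compute the same result; only the padded layout avoids the cache-line ping-pong, which is exactly why the post argues the GC, not the programmer, should provide the padding.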

The Garbage Collection Lock
Today in D, the single, global GC lock is a major parallel performance
bottleneck. The solution is to use thread-local allocation, which has been
well proven in other languages (C/C++/C#/Java/etc.); the basic concept is
that each thread maintains its own free-list (or bump allocators, etc.)
from which it allocates and to which it returns freed objects. And as
objects tend to live and die on the thread that allocates them, even
without separate local and shared memory pools, false sharing is greatly
reduced. In D, however, thread-local allocation is challenging, as the
collection algorithm has to determine which threads to return garbage to.
Getting this wrong leads to both slow performance and memory leaks. The
Boehm GC, for example, hands out a page of objects to the local allocators
when they run low. However, if the pages are 'clean', memory is wasted
through excessive heap fragmentation, while if they are 'dirty', the
amortized cost of the GC lock is greater and false sharing occurs. To
prevent false sharing, the garbage from 'dirty' (i.e. fragmented) pages
should be returned to its allocating thread, at least for pages whose
objects are under the cache-line size. In general, there is a trade-off
between giving garbage back preemptively and holding it globally: the
former tends to lock infrequently, but always stops at least one other
thread when it does (to steal clean pages), while the latter locks
frequently but doesn't necessarily stop any other threads.
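The thread-local free-list idea above can be sketched as follows (a C++ illustration, not any actual D runtime code; `LocalPool`, `kBlockSize` and `kBatch` are invented names, and plain `malloc` stands in for the page handed out under the global lock):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Fixed block size and refill batch are illustrative, not from any D runtime.
constexpr std::size_t kBlockSize = 32;
constexpr int kBatch = 16;

class LocalPool {
    std::vector<void*> freeList;  // blocks currently owned by this thread
public:
    void* allocate() {
        if (freeList.empty())
            refill();  // the only step that would need the global GC lock
        void* p = freeList.back();
        freeList.pop_back();
        return p;
    }
    // Objects freed by their owning thread go straight back on the local
    // list -- no lock taken, no false sharing with other threads' pools.
    void deallocate(void* p) { freeList.push_back(p); }
    std::size_t available() const { return freeList.size(); }
private:
    void refill() {
        // A real GC would take the global lock here and carve blocks out
        // of a fresh page; malloc stands in for that slow path.
        for (int i = 0; i < kBatch; ++i)
            freeList.push_back(std::malloc(kBlockSize));
    }
};

// One pool per thread, so the allocation fast path is entirely lock-free.
thread_local LocalPool pool;
```

The trade-off discussed above lives in `refill` and in what the collector does with blocks freed by *other* threads: hand them back to the owner (infrequent locking, but stalls a peer) or hold them globally (frequent locking, but no stalls).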

One additional advantage of segregating the local and shared/immutable memory pools is that it naturally lends itself to thread-local garbage collection: a modern GC style which is pseudo-concurrent, parallel, high-throughput and low-overhead, scales linearly, and has reduced false pointers and cache misses.
February 08, 2010
I was away for a couple of days, and coming back I found your interesting post, Robert Jacques.

I had not thought much about false sharing, thanks for having raised the issue.

In general, both for false sharing and for the global GC lock (which is *really* a bottleneck for current heavily multithreaded code if you allocate in the threads), I fully agree that having thread-local (or CPU-local) allocators is the correct way out.

Without shared you can have two approaches.

One is the simplistic "stop the world" approach (but making the mark & co concurrent). The problem is the latency introduced by this approach, but if done correctly (well parallelized) the ratio of GC time to computation time should not be worse than with other approaches.

The other approach is similar to a generational GC: you need to
establish checkpoints, and know which allocations are performed before
or after the checkpoint.
At that point, after the mark phase, you can free the unmarked
allocations that were done before the checkpoint.
With multithreading, on the other hand, each local pool needs to collect
all the "retains" coming from other threads before the checkpoint,
and then free all the unmarked allocations.

This introduces a delay in the deallocation. A generational GC then makes another collection just for the newer generation, without bothering to mark things in the "old" generation, but that is another story.

The communication of the retains from the other threads, and the
establishing of checkpoints, can be done either preemptively or upon
request (one can simply have the information stored in some way in the
pools, for example by keeping an order on the allocations and storing a
checkpoint on the stack, which suspends the thread if it wants to
reduce the stack before the mark phase on it is performed).
The thread doing the GC can be one of the running threads (or even
several), or a thread dedicated to it.

On a 64-bit system one can really do memory partitioning without
fear of exhausting the address space, thus making the question "which
pool/thread owns this pointer?" very fast to answer; with 32 bits it is
more tricky.
In general one would not expect heavy sharing between threads, so for
example an explicit list of external pointers could be kept if it stays
smaller than some threshold X.
Virtual memory is another issue, but one can assume that the program
is all in real memory (tackling that needs OS collaboration).

This is a discussion *without* shared... using it should allow one to
reduce the communication of pointers retained by other threads to just
the shared objects, and to make a fully local GC for the non-shared
objects, without needing to synchronize with other threads.
This is an advantage, but using stack variables, scope variables, good
escape analysis with automatic collection of what does not escape, a
good concurrent GC... the gain offered by shared is not so clear cut
to me, given its complexity.

Page reuse between different local pools is indeed an issue. I would say that with 64 bits one can simply give empty pages back to the system if they are over a given threshold, whereas with 32 bits some more tricks (stealing dirty pages) might indeed be in order. Please note that stealing pages from another NUMA node is not something you want to do unless it is unavoidable.

ciao
Fawzi

February 08, 2010
On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
> I was away a couple of days, and coming back I found your interesting post Robert Jacques.
>
> I had not thought much about false sharing, thanks for having raised the issue.
>
> In general both for false sharing and for the global GC lock (that is *really* a bottleneck for current heavily multithreaded code if you allocate in the threads) I fully agree that having thread-local (or cpu local) allocators is the correct way out.
>
> Without shared you can have two approaches.

I'm assuming you are talking about the collection, not the allocation phase. The empirical tests I've seen indicate that currently the allocation bottleneck dwarfs the collection bottleneck.

> One is  the simplistic "stop the world" (but making the mark & co concurrent). The problem is the latency that is introduced by this approach, but if done correctly (well parallelized) the ratio GC time/ computation time should not be worse than other approaches.

Your statement is a bit confusing, since concurrent collectors are generally not considered "stop the world". Concurrent collectors trade throughput (GC time / program time) for program responsiveness. They also often trade program correctness for latency, as they are notoriously hard to get correct for languages with mutation. While there are some very elegant concurrent GCs for functional languages, the high-level descriptions of the Boehm concurrent GC seem to be inherently racy, while Java concurrent GCs tend to be highly instrumented and lock-tastic (mark-bit locking on every ref assignment). As D is a multi-paradigm language which allows C function calls, most of the existing concurrent collector algorithms don't seem to be applicable to D. (Please feel free to correct me if you know of a paper on a concurrent GC that would be good for D.)

Parallel collectors, on the other hand, address throughput and come with no theoretical issues. As parallel GCs can be "stop the world", this is what I assumed you meant.

> The other approach is similar to the generational GC: you need to
> establish checkpoints, and know which allocations are performed before
> or after the checkpoint.
> at that point, after the mark phase, you can free the non marked
> allocations that were done before the checkpoint.
> With multithreading one local pool on the other hand needs to collect
> all the "retains" coming from other threads done before the checkpoint,
> and then free all the non marked allocation.

This sounds like "very concurrent garbage collection", which is one of the elegant but immutable-objects-only collectors. Generational collectors are a form of escape analysis: when a new/young object reference is assigned to an old object's member variable, it escapes to the older generation. To be efficient, generational collectors need bump allocators (and a moving/copying GC) or an "objects only" language, so one can efficiently edit a class's metadata. D's use of interior pointers, structs and array slices, not to mention C function calls, makes generational collectors difficult to implement correctly, let alone efficiently.

> This introduces a delay in the deallocation. Generational GC then make another GC just for the newer generation, without caring to mark things in the "old" generation, but that is another story.
>
> The communication of the retains of the other threads, and the
> establishing of checkpoints can be done either preemptively or upon
> request (one can simply have the information stored in some way in the
> pools, for example keeping an order in the allocations, and storing a
> checkpoint on the stack, that suspends the thread if it wants to reduce
> the stack before the mark phase on it is performed).
> The thread doing the GC can be one of the running threads (or even
> several), or a thread dedicated to it.
>
> In a 64 bit system one can really do a memory partitioning without fear of exhausting the memory space, thus making the question "which pool/thread owns this pointer" very fast to answer, with 32 bit is more tricky.

I'd disagree, in that memory partitioning should be equally difficult on either 32-bit or 64-bit platforms. Consider memory pages A, B and C. If thread 1 allocates some memory and gets page A, then thread 2 allocates some memory and gets page B, when thread 1 next allocates memory it'll get page C and end up with a discontiguous memory set (A C).

> In general one would not expect heavy sharing between threads so for example an explicit list of external pointers could be kept if smaller than X.

Do you know of an example of this? Given the Java thread-local collection papers, this doesn't seem to be done in practice. Also, it seems very overhead-intensive and prone to races.

> Virtual memory is another issue, but one could assume that the program is all in real memory (tackling that needs OS collaboration).
>
> This is a discussion *without* shared,... using it should allow one to reduce the communication of pointers retained by other threads to shared objects. and make a fully local gc for the non shared objects, without needing to synchronize with other threads.

Yes, this is called thread-local garbage collection (which I mentioned).

> This is an advantage, but using stack variables, scope variables, good escape analysis and automatic collection for what does not escape, a good concurrent gc,... the gain offered by shared is not so clear cut to me, seeing its complexity.

First, full escape analysis is both difficult and slow in interpreted
languages and impossible in compiled languages such as D. So automatic
collection is out too.
Second, scope/stack variables are limited in use and size; for example,
you can't define trees or dynamic arrays on the stack, nor exceed the
stack size (typically 1MB).
Third, thread-local mark-sweep GCs actually equal or outperform modern
concurrent-parallel-generational GCs (according to Apple), and that's in a
language which supports them. Further, concurrent collectors need to
access the mark bits of an object. In D this means taking the global GC
lock on _every_ reference assignment and traversing a log N memory tree.
Even in Java, where the lack of interior pointers and an objects-only
paradigm allows the use of two-three atomic instructions on every
assignment, eliminating this unneeded and excess overhead would be a huge
win.
Fourth, Bearophile brought this up on the newsgroup; one of the biggest
mistakes of Java is considered to be that every object is also a monitor.
Without shared, D would have to make the same mistake.

> Page reuse between different local pools is indeed an issue, I would say that with 64 bits one can simply give empty pages back to the system if they are over a given threshold, whereas with 32 bit indeed some more tricks (stealing dirty pages) might be in order. Please note that stealing pages from another numa node is not something you want to do unless it is unavoidable

I don't see what 64-bit vs 32-bit has to do with this; the former simply lets you access more memory, it doesn't magically prevent page fragmentation, false sharing, etc.

February 09, 2010
On 8-feb-10, at 19:15, Robert Jacques wrote:

> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
>> I was away a couple of days, and coming back I found your interesting post Robert Jacques.
>>
>> I had not thought much about false sharing, thanks for having raised the issue.
>>
>> In general both for false sharing and for the global GC lock (that is *really* a bottleneck for current heavily multithreaded code if you allocate in the threads) I fully agree that having thread-local (or cpu local) allocators is the correct way out.
>>
>> Without shared you can have two approaches.
>
> I'm assuming you are talking the collection, not allocation phase. The empirical tests I've seen indicate that currently the allocation bottleneck dwarfs the collection bottleneck.

yes, I fully agree; note "thread-local (or CPU-local) *allocators*". A thread-local pool for allocation can work basically the same way as the current one; one can replicate the pools for "small" objects. So I concentrated on how to handle the collection step.

>> One is  the simplistic "stop the world" (but making the mark & co concurrent). The problem is the latency that is introduced by this approach, but if done correctly (well parallelized) the ratio GC time/computation time should not be worse than other approaches.
>
> Your statement is a bit confusing since concurrent collectors are generally not considered "stop the world".

ok, I should have said mark & co *parallel*; I was not implying that the resulting GC is what is called a concurrent GC.

> Concurrent collectors trade throughput (GC time / program time) for program responsiveness. They also often trade program correctness for latency, as they are notoriously hard to get correct for languages with mutation. While, there are some very elegance concurrent GCs for functional languages, the high level descriptions of the Boehm concurrent GC seems to be inherently racy, while Java concurrent GCs tend to be highly instrumented and lock-tastic (mark- bit locking on every ref assignment).
that can be done with atomic ops...

> As D is a multi-paradigm language which allows C function calls, most of the existing concurrent collector algorithms don't seem to be applicable to D. (Please feel free to correct me if you know of a paper on a concurrent GC that would be good for D)
Well, all moving GCs are not OK, but others can be done (see below)
>
> Parallel collectors, on the other hand, address throughput and come with no theoretical issues. As parallel GCs can be "stop the world", this is what I assumed you meant.

indeed that is what I meant

>> The other approach is similar to the generational GC: you need to
>> establish checkpoints, and know which allocations are performed
>> before or after the checkpoint.
>> at that point, after the mark phase, you can free the non marked
>> allocations that were done before the checkpoint.
>> With multithreading one local pool on the other hand needs to
>> collect all the "retains" coming from other threads done before the
>> checkpoint, and then free all the non marked allocation.
>
> This sounds like "very concurrent garbage collection" which is one of the elegant but immutable-objects-only collectors.
mmh I fear you are right... it did seem a little too simple... one should always think a little more before posting...

> Generational collectors are a form of escape analysis: when a new/ young object reference is assigned to an old object's member variable it escapes to the older generation. To be efficient, generational collectors need bump-allocators (and a moving/coping GC) or an "objects only" language, so one can efficiency edit a class's meta-data. D's use of interior pointers, structs and array slices, not to mention C function calls, make generational collectors difficult to implement correctly, let alone efficiently.
yes I was not proposing a generational collector.

>> This introduces a delay in the deallocation. Generational GC then make another GC just for the newer generation, without caring to mark things in the "old" generation, but that is another story.
>>
>> The communication of the retains of the other threads, and the establishing of checkpoints can be done either preemptively or upon request (one can simply have the information stored in some way in the pools, for example keeping an order in the allocations, and storing a checkpoint on the stack, that suspends the thread if it wants to reduce the stack before the mark phase on it is performed). The thread doing the GC can be one of the running threads (or even several), or a thread dedicated to it.
>>
>> In a 64 bit system one can really do a memory partitioning without fear of exhausting the memory space, thus making the question "which pool/thread owns this pointer" very fast to answer, with 32 bit is more tricky.
>
> I'd disagree in that memory partitioning should be equally difficult on either 32-bit or 64-bit platforms. Consider memory pages A B C. If thread 1 allocates some memory and gets page A, then thread 2 allocates some memory and gets page B, when thread 1 next allocates memory it'll get page C and have a discontinuous memory set (A C).
you can use mmap or similar to reserve big ranges of memory that you might never actually use (commit), so yes, there is a difference: on 32 bits you cannot afford to reserve large chunks of memory just to quickly know which thread owns that memory
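The reserve-without-commit trick being described could look roughly like this on a 64-bit POSIX system (a hedged sketch: `kPoolSpan`, `reservePool` and `poolIndexOf` are invented names, and the constant-time owner lookup assumes pools are carved contiguously out of one reserved region):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Illustrative span: 4 GiB of *address space* per pool. On 64-bit this
// costs nothing until pages are committed; on 32-bit it is impossible.
constexpr std::size_t kPoolSpan = std::size_t(1) << 32;

// Reserve a pool's address range without committing memory. PROT_NONE +
// MAP_NORESERVE consumes no physical pages or swap; the pool commits
// pages later (e.g. via mprotect) as it actually grows.
void* reservePool() {
    void* base = mmap(nullptr, kPoolSpan, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return base == MAP_FAILED ? nullptr : base;
}

// Assuming pools are laid out contiguously from a common base, "which
// pool/thread owns this pointer?" becomes a subtraction and a divide --
// no tree walk, no lock.
std::size_t poolIndexOf(const void* commonBase, const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) -
            reinterpret_cast<std::uintptr_t>(commonBase)) / kPoolSpan;
}
```

This is exactly why the 32-bit case is harder: the address space is too small to hand each thread a span this large, so ownership must be tracked with finer-grained (and slower) bookkeeping.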
>
>> In general one would not expect heavy sharing between threads so for example an explicit list of external pointers could be kept if smaller than X.
>
> Do you know an example of this? Given the Java thread-local collection papers, this doesn't seem to be done in practice. Also, it seems very overhead intensive and prone to races.

well, not directly, but I know that for example
http://www.research.ibm.com/people/d/dfb/papers/Bacon03Pure.ps
keeps a list of changes on a per-thread basis. That example actually
keeps a reference count instead of a simple mark flag, to avoid having
to keep the "externally marked" list explicitly. Probably a better
approach if one goes in that direction.

>> Virtual memory is another issue, but one could assume that the program is all in real memory (tackling that needs OS collaboration).
>>
>> This is a discussion *without* shared,... using it should allow one to reduce the communication of pointers retained by other threads to shared objects. and make a fully local gc for the non shared objects, without needing to synchronize with other threads.
>
> Yes, this is called thread-local garbage collection (which I mentioned).
>
>> This is an advantage, but using stack variables, scope variables, good escape analysis and automatic collection for what does not escape, a good concurrent gc,... the gain offered by shared is not so clear cut to me, seeing its complexity.
>
> First, full escape analysis is both difficult and slow in
> interpreted languages and impossible in compiled languages such as
> D. So automatic collection is out too.
> Second, scope/stack variables are limited in use and size; for
> example, you can't define trees or dynamic arrays on the stack nor
> exceed the stack size (typically 1MB)
sure, those things alone don't solve all problems, but does what doesn't fall into them generate so much overhead, compared with "normal" shared allocation pools and a parallel or concurrent GC, as to be worthwhile?

> Third, thread-local mark-sweep GCs actually equal or out perform modern concurrent-parallel-generational GCs (according to Apple)
I fully believe this; I don't contest that purely local mark and sweep is more efficient, but is it so much more efficient as to justify the extra complexity in the language?

> , and that's in a language which supports them. Further, concurrent collectors need to access the mark bits of an object. In D this means taking the global GC lock on _every_ reference assignment and traversing a log N memory tree.
setting the mark bits is a prime example of something that can be done
with an atomic op.
I haven't understood what you mean by "taking the global GC lock on
_every_ reference assignment and traversing a log N memory tree": you
can find the root pointer and check the mark bit without a lock; only
when you think you need to set it do you need to use an atomic op
(during the mark phase you can just go unmarked->marked).
Do you mean the operations that are needed by a concurrent GC to ensure
that modifications don't make the GC "lose" objects that are in use if
they are modified while collecting?
So basically the advantage of introducing shared is that concurrent GC
becomes cheaper and more feasible because the overhead of the mutation
is paid only by shared objects?

> Even in Java, where the lack of interior pointers and an object-only paradigm allows the use of two-three atomic instructions on every assignment, eliminating this unneeded and excess overhead would be a huge win.
I am not sure that I understand this excess overhead; if the objects
are local there shouldn't be many problems, and even for the shared
ones I don't expect a lot of problems.
mmh, on a second reading I think that you probably refer again to the
overhead needed to make concurrent GC feasible, and to ensure that
mutations don't "lose" an object.

> Fourth, Bearophile brought this up on the newsgroup; one of the biggest mistakes of Java is considered to be that every object is also a monitor. Without shared, D would have to make the same mistake.
why? explicit locks are not an option? I use mutexes... quite often...

>> Page reuse between different local pools is indeed an issue, I would say that with 64 bits one can simply give empty pages back to the system if they are over a given threshold, whereas with 32 bit indeed some more tricks (stealing dirty pages) might be in order. Please note that stealing pages from another numa node is not something you want to do unless it is unavoidable
>
> I don't see what 64-bit vs 32-bit has to do with this; the former simply lets you access more memory, it doesn't magically prevent page fragmentation, false sharing, etc.
yes but I don't expect page fragmentation to be such a problem that
you need to steal dirty pages from other pools, unless memory is
really tight.
false sharing is not an issue if you don't steal pages.
February 09, 2010
On Tue, 09 Feb 2010 05:12:29 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
> On 8-feb-10, at 19:15, Robert Jacques wrote:
>> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:

[snip]

>> Concurrent collectors trade throughput (GC time / program time) for program responsiveness. They also often trade program correctness for latency, as they are notoriously hard to get correct for languages with mutation. While, there are some very elegance concurrent GCs for functional languages, the high level descriptions of the Boehm concurrent GC seems to be inherently racy, while Java concurrent GCs tend to be highly instrumented and lock-tastic (mark-bit locking on every ref assignment).
> that can be done with atomic ops...

It can only be done correctly using a DCAS (not a dwCAS), so it isn't implementable on any mainstream chip. So you have to lock the mark bits on an object. This can be done as a spin-lock using atomic ops, but that doesn't change the number of locks or the frequency with which they have to be taken.
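The two sides of this disagreement can be separated with a small C++ sketch (`Header` and `tryMark` are hypothetical; no real GC's layout is implied). The collector-side transition Fawzi describes really is a single-word CAS; the mutator-side problem is what needs the missing DCAS:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// A per-object mark word; the layout is invented for illustration.
struct Header {
    std::atomic<std::uint32_t> bits{0};
};
constexpr std::uint32_t kMarked = 1u;

// During the mark phase the only transition is unmarked -> marked, so a
// single-word CAS suffices, as Fawzi says. Returns true if this call did
// the marking, false if the object was already marked.
bool tryMark(Header& h) {
    std::uint32_t old = h.bits.load(std::memory_order_relaxed);
    while (!(old & kMarked)) {
        if (h.bits.compare_exchange_weak(old, old | kMarked,
                                         std::memory_order_acq_rel))
            return true;  // we won the race
        // compare_exchange_weak reloaded `old`; loop and retry.
    }
    return false;
}

// What a single-word CAS *cannot* express is the mutator-side operation
// Robert describes: atomically store a reference at one address while
// verifying the mark word at a different address. Two independent words
// need a DCAS, hence the spin-lock fallback on mainstream CPUs.
```

So both posters are right about different phases: marking itself is lock-free, while concurrent mutation during a collection is where the locking cost appears.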

[snip]

>> I'd disagree in that memory partitioning should be equally difficult on either 32-bit or 64-bit platforms. Consider memory pages A B C. If thread 1 allocates some memory and gets page A, then thread 2 allocates some memory and gets page B, when thread 1 next allocates memory it'll get page C and have a discontinuous memory set (A C).
> you can use mmap or similar to reserve big ranges of memory that you might never actually use (commit), so yes there is a difference. 32bit you cannot afford reserving large chuncks of memory to quickly know which thread owns that memory

Interesting, though the fact that this makes each pool finite would appear to be a design flaw to me.

>>> In general one would not expect heavy sharing between threads so for example an explicit list of external pointers could be kept if smaller than X.
>>
>> Do you know an example of this? Given the Java thread-local collection papers, this doesn't seem to be done in practice. Also, it seems very overhead intensive and prone to races.
>
> well not directly, but I know that for example http://www.research.ibm.com/people/d/dfb/papers/Bacon03Pure.ps  keeps a list of changes on a per thread basis. That example actually keeps a reference count instead of a simple mark flag to avoid having to keep the "externally marked" list explicitly. Probably a better approach if one goes in that direction.

Ahh, that sounds more reasonable, and an interesting hybrid approach to boot. Although, to the best of my knowledge, ref-counting GCs are not 100% correct given the multi-paradigm nature of the D programming language. (To say nothing of efficiency.)

[snip]

>>> This is an advantage, but using stack variables, scope variables, good escape analysis and automatic collection for what does not escape, a good concurrent gc,... the gain offered by shared is not so clear cut to me, seeing its complexity.
>>
>> First, full escape analysis is both difficult and slow in interpreted
>> languages and impossible in compiled languages such as D. So automatic
>> collection is out too.
>> Second, scope/stack variables are limited in use and size; for example,
>> you can't define trees or dynamic arrays on the stack nor exceed the
>> stack size (typically 1MB)
> sure those things alone don't solve all problems, but what does not fall into them generates so much overhead wrt. a "normal" shared allocation pools, parallel or concurrent GC to be worthwhile?

I assume you're talking about programmer and not implementation overhead? (Since the implementation overhead is definitely lower.) Apple seems to think so. Microsoft and Intel seem to think so too. Pretty much everybody has agreed that the "normal" way has failed and that we need a different solution; it's just that nobody's sure what those solutions should be, so we're getting lots of different answers: libdispatch, OpenCL, CUDA, TBB, PLINQ, Erlang, etc.

>> Third, thread-local mark-sweep GCs actually equal or out perform modern concurrent-parallel-generational GCs (according to Apple)
> I fully believe this, I don't contest that purely local mark and sweep is more efficient, but is it that much more efficient to justify the extra complexity in the language?
>
>> , and that's in a language which supports them. Further, concurrent collectors need to access the mark bits of an object. In D this means taking the global GC lock on _every_ reference assignment and traversing a log N memory tree.
> setting the mark bits is a prime example of something that can be done
> with an atomic op.
> I haven't understood what you mean with "taking the global GC lock on
> _every_ reference assignment and traversing a log N memory tree" you can
> find the root pointer and check the mark bit without lock, only when you
> think that need to set it you need to use an atomic op (during the mark
> phase you can just go unmarked->marked).
> You mean the operations that are needed by concurrent GC to ensure that
> modifications don't make the GC "loose" that are in use if modified
> while collecting?
> So basically the advantage of introducing shared is that concurrent GC
> become cheaper and more feasible because the overhead of the mutation is
> paid only by shared objects?

Okay, the problem I think you're having is that you're thinking of Java
and not of a systems programming language like C, C++ or D. In Java, all
heap memory items are objects and there are no interior pointers, so it is
trivial to find the mark bits of a class via its metadata. You still
can't update the mark bits atomically, since you need a DCAS to store the
reference while making sure the mark bits are still what you think they
are. So you have to lock-update-unlock. Anyway, in C/C++/D you have
structs, interior pointers and arrays, which all mean that to find an
object's mark bits you have to traverse the memory tree inside the GC. So
you need to lock it. By the way, the same thing goes for capacity, which
is why the lookup cache was added to enhance array-append performance.
And you're right that shared means you only have to pay for the
concurrency overhead where it's really needed.

[snip]

>> Fourth, Bearophile brought this up on the newsgroup; one of the biggest mistakes of Java is considered to be that every object is also a monitor. Without shared, D would have to make the same mistake.
> why? explicit locks are not an option? I use mutex,... quite often...

There are several good posts in the mailing list (both this one and the main D one), as well as on the web in general, on the fallacy of using raw locks. Monitors are known to be much safer. I think the point of the article was not that monitors were used to protect objects, but that _all_ objects were protected by monitors, which requires a bunch of excess overhead when in most cases it's not needed. It also leads people to share any old class, even if that's logically incorrect.

>>> Page reuse between different local pools is indeed an issue, I would say that with 64 bits one can simply give empty pages back to the system if they are over a given threshold, whereas with 32 bit indeed some more tricks (stealing dirty pages) might be in order. Please note that stealing pages from another numa node is not something you want to do unless it is unavoidable
>>
>> I don't see what 64-bit vs 32-bit has to do with this; the former simply lets you access more memory, it doesn't magically prevent page fragmentation, false sharing, etc.
> yes but I don't expect page fragmentation to be such a problem that you
> need to steal dirty pages from other pools, unless memory is really
> tight.
> false sharing is not an issue if you don't steal pages.

False sharing of local objects isn't, but false sharing of shared/immutable objects still is. And memory can be just as tight on a 64-bit system as a 32-bit system; one just has the ability to see more of it.

February 09, 2010
Robert Jacques wrote:
[snip]

Just to make sure, I'm tuning out this entire GC-related discussion because I assume it's not related to the core language. If there are bits that prompt language changes please let me know. Thanks.

Andrei

February 10, 2010
On Tue, 09 Feb 2010 22:23:05 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:
> Robert Jacques wrote:
> [snip]
>
> Just to make sure, I'm tuning out this entire GC-related discussion because I assume it's not related to the core language. If there are bits that prompt language changes please let me know. Thanks.
>
> Andrei

Yes, there was one important language change which my first post advocated: the separation of local and shared (+immutable & unique) data into logically separate memory pools. I'm sorry that I didn't make it clearer. From a TDPL point of view, this allows implementations to eliminate false sharing (an important fact to highlight) and to use thread-local allocation and/or thread-local collectors. The downside is that it adds language complexity and makes casting from local to shared/immutable/unique implementation defined.

Now, I've mentioned this enhancement to D before, in regard to thread-local garbage collectors, but Walter didn't think the cost/benefit was worth it at the time. Indeed, thread-local collectors are just one of many possible modern GCs a D runtime could adopt. Except, the follow-up discussion with Fawzi has started to give me a sinking feeling. All those wonderful modern GCs seem to be designed for functional or (mostly) pure-OO languages, not systems programming languages. So far, everything but thread-local GCs has gone down in flames. This is disconcerting as even TLGCs are meant to be paired with a modern shared GC of some kind. I'm not a GC expert, so there might be something out there, but I think it would be very prudent to find out what they are, in case they too require some change to the language to work.
February 09, 2010
Robert Jacques wrote:
> On Tue, 09 Feb 2010 22:23:05 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:
>> Robert Jacques wrote:
>> [snip]
>>
>> Just to make sure, I'm tuning out this entire GC-related discussion because I assume it's not related to the core language. If there are bits that prompt language changes please let me know. Thanks.
>>
>> Andrei
> 
> Yes, there was one important language change which my first post advocated: the separation of local and shared (+immutable & unique) data into logically separate memory pools. I'm sorry that I didn't make it clearer. From a TDPL point of view, this allows implementations to eliminate false sharing (an important fact to highlight) and to use thread-local allocation and/or thread-local collectors. The downside is that it adds language complexity and makes casting from local to shared/immutable/unique implementation defined.

Doing this would prevent people from transporting large amounts of data across thread boundaries, which is now possible with a cast and the requisite care. I'm not sure we are ready to give that up.

> Now, I've mentioned this enhancement to D before, in regard to thread-local garbage collectors, but Walter didn't think the cost/benefit was worth it at the time. Indeed, thread-local collectors are just one of many possible modern GCs a D runtime could adopt. Except, the follow-up discussion with Fawzi has started to give me a sinking feeling. All those wonderful modern GCs seem to be designed for functional or (mostly) pure-OO languages, not systems programming languages. So far, everything but thread-local GCs has gone down in flames. This is disconcerting as even TLGCs are meant to be paired with a modern shared GC of some kind. I'm not a GC expert, so there might be something out there, but I think it would be very prudent to find out what they are, in case they too require some change to the language to work.

I'm not sure your concerns are valid. Most other OO languages are messier than D with regard to concurrency, yet their garbage collectors do manage.

I agree you have a good point, but the constraints within which we are operating are forcing me to ignore this aspect unless a very clear vision of the problem and a possible solution comes forward. (But by all means feel free to continue the discussion.) From what I can tell after spending little time (th|s)inking about this matter, it's certain casts that thread-local mutable storage would disallow, and that is a possibility that is easy to introduce later.


Andrei

February 09, 2010

Robert Jacques wrote:
>
> Yes, there was one important language change which my first post advocated: the separation of local and shared (+immutable & unique) data into logically separate memory pools. I'm sorry that I didn't make it clearer. From a TDPL point of view, this allows implementations to eliminate false sharing (an important fact to highlight) and to use thread-local allocation and/or thread-local collectors. The downside is that it adds language complexity and makes casting from local to shared/immutable/unique implementation defined.
>

There's another way to do this that requires no language changes. Simply have each thread have its own thread local pool to allocate from, but memory once allocated is treated as global. This allows individual allocations to be cast to shared, but the vast bulk will wind up being thread local without sharing cache lines, and it will not require locks to be taken for most allocations.

A collection cycle, however, will still need to pause all threads and do the whole shebang.
February 10, 2010
On Wed, 10 Feb 2010 02:25:12 -0500, Walter Bright <walter at digitalmars.com> wrote:
> Robert Jacques wrote:
>>
>> Yes, there was one important language change which my first post advocated: the separation of local and shared (+immutable & unique) data into logically separate memory pools. I'm sorry that I didn't make it clearer. From a TDPL point of view, this allows implementations to eliminate false sharing (an important fact to highlight) and to use thread-local allocation and/or thread-local collectors. The downside is that it adds language complexity and makes casting from local to shared/immutable/unique implementation defined.
>>
>
> There's another way to do this that requires no language changes. Simply have each thread have its own thread local pool to allocate from, but memory once allocated is treated as global. This allows individual allocations to be cast to shared, but the vast bulk will wind up being thread local without sharing cache lines, and it will not require locks to be taken for most allocations.
>
> A collection cycle, however, will still need to pause all threads and do the whole shebang.

Yes. The reason I said that casting from local to shared was implementation defined was that I knew the operation could be valid or invalid based on the type of GC used. However, your comment sounds like the standard C/C++ approach to thread-local allocation, which I discussed in the original post. In short, this model has been shown to be insufficient in C/C++ with regard to false sharing and has spawned several hacks to remove false sharing for identifiable hot-spots. Your comment also doesn't address the key challenge to D's GC, which is: where does the garbage get returned to? (which I also discussed) Anyways, in order to prevent false sharing, small immutable objects need their own memory pool and shared objects need to be created in a different manner than local objects. To me this creates a logical separation of their memory pools and, at a minimum, requires the GC implementation to perform allocation using specific flags for each type qualifier.