October 14, 2013

On 14.10.2013 19:50, John Colvin wrote:
> On Monday, 14 October 2013 at 17:43:33 UTC, Rainer Schuetze wrote:
>>
>>
>> On 13.10.2013 19:05, Sönke Ludwig wrote:
>>> Am 13.10.2013 17:15, schrieb Sean Kelly:
>>>> On Sunday, 13 October 2013 at 07:03:39 UTC, Rainer Schuetze wrote:
>>>>>
>>>>> According to the "Handbook of Garbage Collection" by Richard Jones
>>>>> eager lock-free reference counting can only be done with a cas2
>>>>> operation modifying two seperate locations atomically (algorithm 18.2
>>>>> "Eager reference counting with CompareAndSwap is broken"). This might
>>>>> be the quoted paper:
>>>>> http://scholr.ly/paper/2199608/lock-free-reference-counting
>>>>>
>>>>> Unfortunately the CAS2 operation does not exist in most processors.
>>>>
>>>> I suppose it's worth noting that Boost (and now standard C++) has a
>>>> shared_ptr that works across threads and the implementation I've seen
>>>> doesn't use a mutex.  In fact, I think the Boost one doesn't even use
>>>> CAS on x86, though it's been quite a few years so my memory could be
>>>> wrong on that last detail.
>>>
>>> I didn't read the paper, but I'd suspect that the paper refers to the
>>> case where both, the reference count _and_ the reference is thread-safe,
>>> since the boost/c++ shared_ptr only has a thread-safe reference count
>>> after all.
>>
>> I haven't read it either, but AFAICT the cas2 operation is used two
>> modify the pointer and the reference count at the same time atomically.
>>
>> I just checked boost::shared_ptr, it uses cas operations on the
>> reference counts. It has the same problem as described in my example,
>> see the read/write example 3 here:
>> http://www.boost.org/doc/libs/1_54_0/libs/smart_ptr/shared_ptr.htm#ThreadSafety
>>
>>
>> boost::shared_ptr is also unsafe with respect to calling member
>> functions through "->" as it doesn't increment the reference count.
>
> I'm totally out of my depth here but can't you store the reference count
> adjacent to the pointer and use CMPXCHG16B

This might work for a single pointer, but not if you have multiple pointers to the same object.
October 14, 2013
On 2013-10-14 17:57:18 +0000, Rainer Schuetze <r.sagitario@gmx.de> said:

> On 13.10.2013 13:48, Michel Fortin wrote:
>> 
>> For one of my projects I implemented a shared pointer like this. It uses
>> the pointer value itself as a spin lock with the assumption that -1 is
>> an invalid pointer value:
>> 
>> 1. read pointer value
>> 2. if read value is -1 go to line 1 (spin)
>> 3. compare and swap (previously read value <-> -1)
>> 4. if failure go to line 1 (spin)
>> // now pointer is "locked", its value is -1 but we have a copy of the
>> original
>> 5. copy pointer locally or assign to it (and update counter)
>> 6. write back pointer value atomically to replace the -1
>> 
>> No mutex, but there's a spin lock so it's not good if there's contention.
>> 
>> That said, I find it extremely rare to want a shared pointer that isn't
>> already protected by a mutex alongside other variables, or that isn't
>> propagated using some form of message passing.
> 
> Locking is very bad if you have threads at different priorities as it might introduce priority inversion. Spinning is probably even worse in that scenario.

Spinning is good only when you very rarely expect contention, which is the case for me. The above code is used once per object the first time someone requests a weak pointer for it. Having contention for that just doesn't make sense in most usage patterns. But still, being curious I added a log message anytime it does actually has to spin, and I have yet to see that message once in my logs (which probably means aggressive enough unit tests are missing).

If you have a lot of read accesses and rarely write to the pointer you could instead try a read-write mutex with concurrent read access. In any case, there's no solution that will be ideal in all cases. Different situations asks for different trade-offs.

> At work, I use shared pointers all the time to pass information to a real time audio thread. The scheme uses triple-buffering of pointers for a lock free safe transport from/to the real time thread.
> 
> Not having to worry about these low-level locking stuff is one of the good aspects about garbage collecting.

Indeed. The current garbage collector makes it easy to have shared pointers to shared objects. But the GC can also interrupt real-time threads for an unpredictable duration, how do you cope with that in a real-time thread?

I know ARC isn't the ideal solution for all use cases. But neither is the GC, especially for real-time applications. So, which one would you recommend for a project having a real-time audio thread?

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

October 14, 2013
On Monday, 14 October 2013 at 19:42:36 UTC, Michel Fortin wrote:
> Indeed. The current garbage collector makes it easy to have shared pointers to shared objects. But the GC can also interrupt real-time threads for an unpredictable duration, how do you cope with that in a real-time thread?
>
> I know ARC isn't the ideal solution for all use cases. But neither is the GC, especially for real-time applications. So, which one would you recommend for a project having a real-time audio thread?

If you don't want any pause, concurrent GC is the way to go. This type of GC come at a cost of increased memory usage (everything is a tradeoff) but exists.
October 14, 2013
Am 14.10.2013 21:42, schrieb Michel Fortin:
> On 2013-10-14 17:57:18 +0000, Rainer Schuetze <r.sagitario@gmx.de> said:
>
>> On 13.10.2013 13:48, Michel Fortin wrote:
>>>
>>> For one of my projects I implemented a shared pointer like this. It uses
>>> the pointer value itself as a spin lock with the assumption that -1 is
>>> an invalid pointer value:
>>>
>>> 1. read pointer value
>>> 2. if read value is -1 go to line 1 (spin)
>>> 3. compare and swap (previously read value <-> -1)
>>> 4. if failure go to line 1 (spin)
>>> // now pointer is "locked", its value is -1 but we have a copy of the
>>> original
>>> 5. copy pointer locally or assign to it (and update counter)
>>> 6. write back pointer value atomically to replace the -1
>>>
>>> No mutex, but there's a spin lock so it's not good if there's
>>> contention.
>>>
>>> That said, I find it extremely rare to want a shared pointer that isn't
>>> already protected by a mutex alongside other variables, or that isn't
>>> propagated using some form of message passing.
>>
>> Locking is very bad if you have threads at different priorities as it
>> might introduce priority inversion. Spinning is probably even worse in
>> that scenario.
>
> Spinning is good only when you very rarely expect contention, which is
> the case for me. The above code is used once per object the first time
> someone requests a weak pointer for it. Having contention for that just
> doesn't make sense in most usage patterns. But still, being curious I
> added a log message anytime it does actually has to spin, and I have yet
> to see that message once in my logs (which probably means aggressive
> enough unit tests are missing).
>
> If you have a lot of read accesses and rarely write to the pointer you
> could instead try a read-write mutex with concurrent read access. In any
> case, there's no solution that will be ideal in all cases. Different
> situations asks for different trade-offs.
>
>> At work, I use shared pointers all the time to pass information to a
>> real time audio thread. The scheme uses triple-buffering of pointers
>> for a lock free safe transport from/to the real time thread.
>>
>> Not having to worry about these low-level locking stuff is one of the
>> good aspects about garbage collecting.
>
> Indeed. The current garbage collector makes it easy to have shared
> pointers to shared objects. But the GC can also interrupt real-time
> threads for an unpredictable duration, how do you cope with that in a
> real-time thread?
>
> I know ARC isn't the ideal solution for all use cases. But neither is
> the GC, especially for real-time applications. So, which one would you
> recommend for a project having a real-time audio thread?
>


Well, if real time concurrent GC for Java systems is good enough for systems that control militar missile systems, maybe it is good enough for real-time audio as well.

--
Paulo
October 14, 2013
On 2013-10-14 20:45:35 +0000, "deadalnix" <deadalnix@gmail.com> said:

> On Monday, 14 October 2013 at 19:42:36 UTC, Michel Fortin wrote:
>> Indeed. The current garbage collector makes it easy to have shared pointers to shared objects. But the GC can also interrupt real-time threads for an unpredictable duration, how do you cope with that in a real-time thread?
>> 
>> I know ARC isn't the ideal solution for all use cases. But neither is the GC, especially for real-time applications. So, which one would you recommend for a project having a real-time audio thread?
> 
> If you don't want any pause, concurrent GC is the way to go. This type of GC come at a cost of increased memory usage (everything is a tradeoff) but exists.

I'm not an expert in GCs, but as far as I know a concurrent GC also requires some bookkeeping to be done when assigning to pointers, similar to ARC, and also when moving pointers, unlike ARC. So it requires hooks in the codegen that will perform atomic operations, just like ARC.

Unless of course you use "fork" and scan inside the child process… but that has its own set of problems on some platforms.

The only consensus we'll reach is that different projects have different needs. In theory being able to swap the GC for something else could bring everyone together. But to be able to replace the GC for another with a strategy different enough to matter (concurrent GC or ARC) you need the codegen to be different. So we can either:

1. make the codegen configurable -- which brings its own set of compatibility problems for compiled code but is good for experimentation, or

2. settle on something middle-ground performance-wise that can accommodate a couple of strategies, or

3. choose one GC strategy that ought to satisfy everybody and adapt the codegen to fit, or

4. keep everything as is.

We're stuck with 4 until something else is decided.

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

October 14, 2013
On Monday, 14 October 2013 at 21:25:29 UTC, Michel Fortin wrote:
> I'm not an expert in GCs, but as far as I know a concurrent GC also requires some bookkeeping to be done when assigning to pointers, similar to ARC, and also when moving pointers, unlike ARC. So it requires hooks in the codegen that will perform atomic operations, just like ARC.
>

Usual strategy include :
 - When you JIT, change the function itself, to write pointers through a function that mark the old value as live.
 - When AOT, always go throw that function, which make a test and mark alive the old value if this is done during a collection. This basically add a read to a global and a test for each pointer write.
 - Use the page protection mechanism and do regular write. This can be done via fork, but also via remapping the GCed memory as COW. The tax is then more expensive, but you only pay it once per page you actually write and only when actually collecting.

The good news, is that this tax is only required for object that contains shared mutable pointers. In D, most data are thread local or immutable. D's type system is really friendly to concurrent GC, and we definitively should go in that direction.

> The only consensus we'll reach is that different projects have different needs. In theory being able to swap the GC for something else could bring everyone together. But to be able to replace the GC for another with a strategy different enough to matter (concurrent GC or ARC) you need the codegen to be different. So we can either:
>

ARC like system need a different codegen, but you can do this with regular codegen if you use page protection to detect writes.

> 1. make the codegen configurable -- which brings its own set of compatibility problems for compiled code but is good for experimentation, or
>

Bad, we will end up having different incompatible binaries.
October 15, 2013
On 2013-10-14 23:45:42 +0000, "deadalnix" <deadalnix@gmail.com> said:

> On Monday, 14 October 2013 at 21:25:29 UTC, Michel Fortin wrote:
>> I'm not an expert in GCs, but as far as I know a concurrent GC also requires some bookkeeping to be done when assigning to pointers, similar to ARC, and also when moving pointers, unlike ARC. So it requires hooks in the codegen that will perform atomic operations, just like ARC.
> 
> Usual strategy include :
>   - When you JIT, change the function itself, to write pointers through a function that mark the old value as live.
>   - When AOT, always go throw that function, which make a test and mark alive the old value if this is done during a collection. This basically add a read to a global and a test for each pointer write.
>   - Use the page protection mechanism and do regular write. This can be done via fork, but also via remapping the GCed memory as COW. The tax is then more expensive, but you only pay it once per page you actually write and only when actually collecting.
> 
> The good news, is that this tax is only required for object that contains shared mutable pointers. In D, most data are thread local or immutable. D's type system is really friendly to concurrent GC, and we definitively should go in that direction.
> 
>> The only consensus we'll reach is that different projects have different needs. In theory being able to swap the GC for something else could bring everyone together. But to be able to replace the GC for another with a strategy different enough to matter (concurrent GC or ARC) you need the codegen to be different. So we can either:
> 
> ARC like system need a different codegen, but you can do this with regular codegen if you use page protection to detect writes.

Very insightful. Thank you.


>> 1. make the codegen configurable -- which brings its own set of compatibility problems for compiled code but is good for experimentation, or
> 
> Bad, we will end up having different incompatible binaries.

So as I understand it, your plan would then be:

- use concurrent GC using the page protection mechanism and COW to catch writes during collection
- add thread-local awareness to the GC so only shared and mutable memory needs COW

Makes sense. But adding thread-local awareness requires a couple of language changes. It won't work as long as people keep casting things around so you need to fix a lot of cases where casts are needed.

But otherwise, seems like a good plan.

I'm a little wary of the cost of doing COW during collection. Obviously the GC isn't pausing the program per-see, but it'll be slowing it down. By how much is probably dependent on what you're doing. At the very least the GC should allocate types with no pointers on separate pages from those with pointers.

Also, what are the calls required to implement page protection and COW on posix? I'd like to check whether those are allowed within the OS X and iOS sandbox. For instance fork() isn't allowed for sandboxed apps.

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

October 15, 2013
On Tuesday, 15 October 2013 at 00:38:20 UTC, Michel Fortin wrote:
> So as I understand it, your plan would then be:
>
> - use concurrent GC using the page protection mechanism and COW to catch writes during collection
> - add thread-local awareness to the GC so only shared and mutable memory needs COW
>

Yes, I think this is what make the most sense for D. If the GC allow explicit free; the reference counting and other manual memory management techniques can be built on top of it.

> Makes sense. But adding thread-local awareness requires a couple of language changes. It won't work as long as people keep casting things around so you need to fix a lot of cases where casts are needed.
>

Type qualifiers provide the necessary information.

However, some practice (that are already mentioned as being undefined behavior) will become really unsafe. It also require to enrich the GC api, but users aren't supposed to use it directly :D

> I'm a little wary of the cost of doing COW during collection. Obviously the GC isn't pausing the program per-see, but it'll be slowing it down. By how much is probably dependent on what you're doing. At the very least the GC should allocate types with no pointers on separate pages from those with pointers.
>

Nothing is free, and indeed, trapping the write via memory protection is quite expensive. Hopefully, we can segregate object that contain mutable pointer from others and only memory protect theses.

It will indeed cause trouble for code that mutate a large amount of shared pointers. I'd say that such code is probably asking for trouble in the first place, but as always, no silver bullet. I still think solution is the one that fit D the best.

> Also, what are the calls required to implement page protection and COW on posix? I'd like to check whether those are allowed within the OS X and iOS sandbox. For instance fork() isn't allowed for sandboxed apps.

You need mmap, mprotect and all the signal handling machinery.
October 15, 2013
On 2013-10-15 02:20:49 +0000, "deadalnix" <deadalnix@gmail.com> said:

>> Also, what are the calls required to implement page protection and COW on posix? I'd like to check whether those are allowed within the OS X and iOS sandbox. For instance fork() isn't allowed for sandboxed apps.
> 
> You need mmap, mprotect and all the signal handling machinery.

mprotect is the one I'm worried about, as it lets you set the executable bit (among other things) which could be exploited to run arbitrary code. So I tested it and it seems to work fine on OS X inside the sandbox (including for setting the executable bit). I guess an executable with a reference to mprotect would probably also pass Apple's Mac App Store validation, but I haven't tested.

mprotect isn't available at all with the iOS SDK. So making this collector work on iOS (and the iOS Simulator) would require a different codegen.

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

October 15, 2013

On 14.10.2013 21:42, Michel Fortin wrote:
> Indeed. The current garbage collector makes it easy to have shared
> pointers to shared objects. But the GC can also interrupt real-time
> threads for an unpredictable duration, how do you cope with that in a
> real-time thread?

The work I was talking about uses C++, not D, so there is no GC involved.

The options I see for real-time threads in D is either a concurrent GC (which means read/write barriers for pointer accesses) or just excluding the real time thread from suspension by the GC. This forces the programmer to ensure that references in the real time thread are also found elsewhere. I'm not sure if this eliminates the benefits regarding locking, though.

>
> I know ARC isn't the ideal solution for all use cases. But neither is
> the GC, especially for real-time applications. So, which one would you
> recommend for a project having a real-time audio thread?

ARC doesn't work for real time threads anyway, because you are not allowed to deallocate if it can cause locks. It can only work if you defer reference counting into another thread through some buffering.

Realistically I would currently recommend the approach above: exclude the thread from suspension, and keep a reference to used object elsewhere. This is probably about as difficult as avoiding allocations/deallocations in C++, but harder to debug.