June 24, 2013
On Jun 24, 2013, at 2:04 PM, Rainer Schuetze <r.sagitario@gmx.de> wrote:

> On 24.06.2013 16:40, Steven Schveighoffer wrote:
>> 
>> Here is a possibility for rectifying: any time you share immutable data by reference, the compiler generates a call to some 'makeImmutableArrayShared' function with the pointer, and that removes the block's APPENDABLE bit (if part of the GC).  That would make any immutable shared arrays truly immutable, and avoid this problem.
>> 
>> Thoughts?  Is it possible to intercept the moment immutable data becomes shared?
> 
> I think you can make the length update lock- and race-free with a single CAS operation on the length, that pretty much does what is currently done in __setArrayAllocLength. You don't even have to care for problems like ABA, as the size always increases.
> 

This doesn't solve the problem that we cannot use thread-local caching of blockinfos for immutable and const array types (the main source of the speedup).

assumeSafeAppend can decrease the size, also.

> The cost of adding this to the non-shared code path aswell (or better not making the distinction at all) is propably more than offset by removing the tests for sharedness.

BTW, we can solve this problem by just making the compiler do the work, or making it instantiate a template to do the work of determining sharedness :)  The only reason I'm checking for sharedness in the function at runtime is because that's what the compiler gives me.

-Steve
_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 24, 2013
On 24.06.2013 11:19, Leandro Lucarella wrote:
> I'll repeat what I said in my talk. I think the "rebuild" option is an
> extremely bad idea.

Especially in the transition to precise GC I would expect one or two hickups and everybody will tend to blame the changed GC. It should be easy to switch to non-precise GC to verify if the accusation is rectified.

> I used "initialization-time" configurability in CDGC and worked pretty
> well. It would be much more convenient to have a way to change GC
> options at source code level too, but I had the same problem, I couldn't
> find a way to do anything before the GC is initialized. I think being
> able to configure runtime options via environment variables is an
> extremely convenient option.

I was thinking about environment variables aswell, but that's not so convenient for applications run by non-developers.

One way to do it might be to put configuration variables into a module gc.config (it might still read environment variables to change defaults). This module is part of the druntime library. If the programmer then provides a different gc.config that fulfills all the link dependencies, on the link command line, it is preferred over the module from the library.

One thing I'll do for sure is to convert versions(GC_PRECISE) to if(gc_precise), so it is easy to switch between a configuration variable and an enum that allows the optimizer to just remove the disabled parts.

_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 24, 2013
On Jun 24, 2013, at 2:28 PM, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> On Jun 24, 2013, at 2:04 PM, Rainer Schuetze <r.sagitario@gmx.de> wrote:
> 
>> On 24.06.2013 16:40, Steven Schveighoffer wrote:
>>> 
>>> Here is a possibility for rectifying: any time you share immutable data by reference, the compiler generates a call to some 'makeImmutableArrayShared' function with the pointer, and that removes the block's APPENDABLE bit (if part of the GC).  That would make any immutable shared arrays truly immutable, and avoid this problem.
>>> 
>>> Thoughts?  Is it possible to intercept the moment immutable data becomes shared?
>> 
>> I think you can make the length update lock- and race-free with a single CAS operation on the length, that pretty much does what is currently done in __setArrayAllocLength. You don't even have to care for problems like ABA, as the size always increases.
>> 
> 
> This doesn't solve the problem that we cannot use thread-local caching of blockinfos for immutable and const array types (the main source of the speedup).

I should say, we can cache block info lookups for everything that's under 1 PAGE, including shared, since those will never change size (I actually never thought of this before).  But we can't cache the lookup for >= 1PAGE because the block's length may change from call to call.

Thinking about this some more, it may be I'm being too conservative.

Consider that:

1. a block will only grow in size.  That is, a memory block (not the allocated space) will never be SMALLER than the cached size.  I don't think the append runtime will ever "shrink" a block's allocated length.
2. Because I store the allocated size at the front of larger blocks, I am guaranteed that no matter the block size, the location of the "stored data" is constant.

So, I would like to do the following in order:

1. verify with memory gurus, that CAS in this case is safe to do.  I'm super-worried about subtle memory issues (your ABA comment scares me)
2. Submit pull request which ignores whether type is shared, and uses CAS to ensure integrity.  All blockinfo lookups, including ones to shared data, will go through the cache.
3. In general, I would like the runtime to be generated instead of defined by the compiler.  To that end, I think it would be good to have the compiler change array append calls from runtime-only calls with TypeInfo as a parameter to runtime template calls that can possibly optimize.
4. Add special "unshared" appending calls that are called by the template when the type says it's mutable.  Call those instead of the CAS version when the slice is mutable.

Can someone answer 1?

Also, does this all make sense to everyone?

-Steve
_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 25, 2013
On 24.06.2013 20:59, Steven Schveighoffer wrote:
>
> On Jun 24, 2013, at 2:28 PM, Steven Schveighoffer
> <schveiguy@yahoo.com> wrote:
>
>> On Jun 24, 2013, at 2:04 PM, Rainer Schuetze <r.sagitario@gmx.de>
>> wrote:
>>
>>> On 24.06.2013 16:40, Steven Schveighoffer wrote:
>>>>
>>>> Here is a possibility for rectifying: any time you share
>>>> immutable data by reference, the compiler generates a call to
>>>> some 'makeImmutableArrayShared' function with the pointer, and
>>>> that removes the block's APPENDABLE bit (if part of the GC).
>>>> That would make any immutable shared arrays truly immutable,
>>>> and avoid this problem.
>>>>
>>>> Thoughts?  Is it possible to intercept the moment immutable
>>>> data becomes shared?
>>>
>>> I think you can make the length update lock- and race-free with a
>>> single CAS operation on the length, that pretty much does what is
>>> currently done in __setArrayAllocLength. You don't even have to
>>> care for problems like ABA, as the size always increases.
>>>
>>
>> This doesn't solve the problem that we cannot use thread-local
>> caching of blockinfos for immutable and const array types (the main
>> source of the speedup).
>
> I should say, we can cache block info lookups for everything that's
> under 1 PAGE, including shared, since those will never change size (I
> actually never thought of this before).  But we can't cache the
> lookup for >= 1PAGE because the block's length may change from call
> to call.
>
> Thinking about this some more, it may be I'm being too conservative.
>
> Consider that:
>
> 1. a block will only grow in size.  That is, a memory block (not the
> allocated space) will never be SMALLER than the cached size.  I don't
> think the append runtime will ever "shrink" a block's allocated
> length. 2. Because I store the allocated size at the front of larger
> blocks, I am guaranteed that no matter the block size, the location
> of the "stored data" is constant.
>
> So, I would like to do the following in order:
>
> 1. verify with memory gurus, that CAS in this case is safe to do.
> I'm super-worried about subtle memory issues (your ABA comment scares
> me)

I don't know if I qualify, but I think it is the job of the operations in core.atomic to handle the memory issues of the architecture, the CAS operation should be safe to use. Regarding ABA when shrinking: you are in trouble anyway if you shrink the array and append to to it at the same time from different threads. I recommend throwing an InvalidMemoryError whenever appending to an array if the actual allocation size is smaller than the array.

> 2. Submit pull request which ignores whether type is shared, and
> uses CAS to ensure integrity.  All blockinfo lookups, including ones
> to shared data, will go through the cache.

Sounds good. I think we should also look into making gc_query faster, e.g. I'm not sure it needs the global GC lock.

> 3. In general, I would
> like the runtime to be generated instead of defined by the compiler.
> To that end, I think it would be good to have the compiler change
> array append calls from runtime-only calls with TypeInfo as a
> parameter to runtime template calls that can possibly optimize.

That's what I was also hoping for. But I guess it will have to wait until a successful transformation of the associative arrays.

_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 25, 2013
On 23.06.2013 18:34, Andrei Alexandrescu wrote:
> On 6/21/13 11:36 AM, Rainer Schuetze wrote:
>> I want to make precise garbage collection as presented at the D
>> conference ready for inclusion into druntime.
>
> I'm all for this. One area we should be careful about is not introducing
> performance regressions.
>

One thing we can do to mitigate the additional cost when allocating: make all GC functions nothrow (they are declared to be nothrow anyway in core.memory, but this gets lost by different extern(C) declarations). This removes exception-handling-unwinding from scope(exit) statements. Last time I tried, it improved performance of testgc3 from the test suite by about 5%.

When doing this I wondered why so many functions in druntime are declared extern(C) and are even declared again in other modules. Sometimes it is caused by an undesired dependency from core package to the rt package that should be invisible to the user of the core package, but this is rare. It gives the impression the module system is no good for systems programming :-p

_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 25, 2013
On Mon, Jun 24, 2013 at 08:29:36PM +0200, Rainer Schuetze wrote:
> On 24.06.2013 11:19, Leandro Lucarella wrote:
> >I'll repeat what I said in my talk. I think the "rebuild" option is an extremely bad idea.
> 
> Especially in the transition to precise GC I would expect one or two hickups and everybody will tend to blame the changed GC. It should be easy to switch to non-precise GC to verify if the accusation is rectified.

Yes, at first I would even leave it disabled by default, so only the courageous ones can test it (like for real, not just for a week in the "beta testing"). I would even say to wait a couple of releases or more to enable it by default, this will allow us to get a lot of "real world experiences" without disturbing people that don't really care ("Why did you break my code when I don't even need precise scanning?!").

It has to be easily enabled for people to test it though ;)

> >I used "initialization-time" configurability in CDGC and worked pretty well. It would be much more convenient to have a way to change GC options at source code level too, but I had the same problem, I couldn't find a way to do anything before the GC is initialized. I think being able to configure runtime options via environment variables is an extremely convenient option.
> 
> I was thinking about environment variables aswell, but that's not so convenient for applications run by non-developers.

Very true.

> One way to do it might be to put configuration variables into a module gc.config (it might still read environment variables to change defaults). This module is part of the druntime library. If the programmer then provides a different gc.config that fulfills all the link dependencies, on the link command line, it is preferred over the module from the library.

Mmm, interesting. I thought about something similar but with a function (that would return a similar struct). I think the configuration mechanism should be part of the runtime, not only the GC, there are other runtime stuff that can or could be configured.

> One thing I'll do for sure is to convert versions(GC_PRECISE) to if(gc_precise), so it is easy to switch between a configuration variable and an enum that allows the optimizer to just remove the disabled parts.

Good. I would say to just use that and then we can figure out where that gc_precise can come from. Maybe just configuration through environment variables is good enough as a start for people wanting to try it to be able to enable it easily.

-- 
Leandro Lucarella
Senior R&D Developer
Sociomantic Labs GmbH <http://www.sociomantic.com>
_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 28, 2013
On Jun 23, 2013, at 9:07 AM, Rainer Schuetze <r.sagitario@gmx.de> wrote:

> On 23.06.2013 00:48, Steven Schveighoffer wrote:
>> 
>> Note, you can have a mix of solutions -- for up to 16*sizeof(void*) bytes allocations, store bitmaps.   For everything else, use a typeinfo.  Both solutions could try off-page or in-block.  You can use the same philosophy as array append code -- for non-extendable GC blocks (i.e. < PAGE size) place at the end, for PAGE size and above, place at the front.
> 
> I was thinking in the same direction, using bitmaps for anything < page size and typeinfo for larger blocks. The downside is that you have to implement both versions, and that you loose the flexibility of the bitmap for larger allocations. One way to solve this might be to switch to bitmaps once gc_emplace is used on large allocations.

Another consideration to think about -- we currently have no calls to non-class finalizers in the GC.  Using TypeInfo pointers would facilitate this.  I feel like this may tip the scales towards an all-TypeInfo solution.

See also: http://d.puremagic.com/issues/show_bug.cgi?id=9334 and http://d.puremagic.com/issues/show_bug.cgi?id=9335

-Steve
_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 28, 2013
On Friday, June 28, 2013 10:16:29 Steven Schveighoffer wrote:
> Another consideration to think about -- we currently have no calls to non-class finalizers in the GC. Using TypeInfo pointers would facilitate this. I feel like this may tip the scales towards an all-TypeInfo solution.

Yeah. It occurred to me that that would be the case when Rainer gave his talk. Getting it fixed so that struct finalizers can be run would be really cool.

- Jonathan M Davis
_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 28, 2013
On 28.06.2013 16:16, Steven Schveighoffer wrote:
> Another consideration to think about -- we currently have no calls
> to non-class finalizers in the GC.  Using TypeInfo pointers would
> facilitate this.

I think that could also be implemented by wrapping struct types with
destructor in a class that contains the struct. That would go in
_d_newitem* or _d_newarray*. Creating the appropriate type info at
runtime might be a problem though.

> I feel like this may tip the scales towards an all-TypeInfo
> solution.

There is still the problem of manually arranged data to be solved. E.g. the associative array combine node-list entry, key and value into a single allocation, and only the single type infos are available (if at all). Same problem as above: how to create a combined type info at runtime?

The current implementation "emplaces" the pointer bits at the corresponding addresses, but this needs two additional calls into the GC, resulting in far from optimal performance.
_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime

June 29, 2013
On Jun 28, 2013, at 5:28 PM, Rainer Schuetze <r.sagitario@gmx.de> wrote:

> On 28.06.2013 16:16, Steven Schveighoffer wrote:
>> Another consideration to think about -- we currently have no calls to non-class finalizers in the GC.  Using TypeInfo pointers would facilitate this.
> 
> I think that could also be implemented by wrapping struct types with destructor in a class that contains the struct. That would go in _d_newitem* or _d_newarray*. Creating the appropriate type info at runtime might be a problem though.

Wrapping in a class may work, but seems heavyweight for just storing a finalizer pointer.

> 
>> I feel like this may tip the scales towards an all-TypeInfo solution.
> 
> There is still the problem of manually arranged data to be solved. E.g. the associative array combine node-list entry, key and value into a single allocation, and only the single type infos are available (if at all). Same problem as above: how to create a combined type info at runtime?

If you are inventing the type of the block, you are responsible for it.  Allocate it as a void * and you get to own how it works.  Otherwise, use a struct and let the system generate a type info for it.

The associative array could just as easily make a struct out of that pair.  It so happens that when AA's were written, they were a purely runtime feature, and did not have the ability to "invent" typeinfo.  The situation has changed -- AAs are now a template.

> The current implementation "emplaces" the pointer bits at the corresponding addresses, but this needs two additional calls into the GC, resulting in far from optimal performance.

You mean the current implementation of the precise GC or the AA?

BTW, I want to stress again that we really should attempt to move the compiler away from calling runtime functions, and instead have it call template functions.  Such things will make life SOOO much easier for experimentation, inlining runtime calls, etc.  For example, having the rtInfo has opened several doors already that weren't already envisioned on its creation.  Imagine if all runtime calls were wrapped in a template, and one could simply update the template to experiment with new ideas, instead of hooking the runtime function, which is limited to whatever was expected when written.

If this is the case, then instead of _d_newarray struggling to create a dummy typeinfo so a dtor is called (I have actually done this before in a project on D1, and it isn't pretty), a template can trivially generate the class.

-Steve
_______________________________________________
D-runtime mailing list
D-runtime@puremagic.com
http://lists.puremagic.com/mailman/listinfo/d-runtime