April 09, 2012
On 4/9/2012 11:30 AM, deadalnix wrote:
> In the other hand, TLS can be collected independently and only influence the
> thread that own the data. Both are every powerfull improvement, and the design
> you propose « as this » cannot provide any mean to handle that. Which is a big
> missed opportunity, and will be hard to change in the future.

I think this is an orthogonal issue.
April 09, 2012
On 10 April 2012 00:06, deadalnix <deadalnix@gmail.com> wrote:

> Le 09/04/2012 20:33, Manu a écrit :
>
>  Eh?
>> Not sure what you mean. The idea is the template would produce a
>> struct/table of data instead of being a pointer to a function, this way
>> the GC could work without calling anything. If the GC was written to
>> assume GC info in a particular format/structure, it could be written
>> without any calls.
>> I'm just saying to leave that as a possibility, and not REQUIRE an
>> indirect function call for every single allocation in the system. Some
>> GC might be able to make better use of that sort of setup.
>>
>
> If you have reference to objects, you can't avoid a function call. If you have something you know at compile time, the generated function can directly call the other function that mark the pointed data (or even can do it itself, if you don't fear code bloat) without going back to the GC and its indirect call.
>
> So it make no difference in the number of indirect calls you have, but the struct proposal is a stronger constraint on the GC that the function one.
>
> BTW, starting you answer by « Not sure what you mean. » should have been a red flag.
>

It is, and I still don't follow. I can't imagine there are any indirect
function calls, except for the ones introduced by this proposal, where you
may register a function to mark the pointers in complex structs.
You seem to be suggesting that another one already exists anyway? Where is
it? Why is it there?


April 10, 2012
Le 09/04/2012 23:27, Walter Bright a écrit :
> On 4/9/2012 11:30 AM, deadalnix wrote:
>> In the other hand, TLS can be collected independently and only
>> influence the
>> thread that own the data. Both are every powerfull improvement, and
>> the design
>> you propose « as this » cannot provide any mean to handle that. Which
>> is a big
>> missed opportunity, and will be hard to change in the future.
>
> I think this is an orthogonal issue.

You mean an allocator/deallocator one ?

I'm not sure. For instance, concurrent shared memory scanning will require some magic on reference changes (it can be hooked into the program using page protection). In such a case, you have constraint in what the scanning function can do or can't.

If the function is scanning immutable data, such a constraint disappears.

In a similar way, when scanning TLS, you'll want to avoid going into non TLS world. This is currently possible only of you go back to main GC code and trigger the indirect call every single time you encounter a pointer or a reference. This is going to be a performance killer on many architecture.

So this code, in a way or another will need to be aware of the qualifier. Or it will either require to pass every single pointer/reference into an indirect function call, or forget about optimizations that the type system has been made to allow (in the program in general, not especially in the GC).
April 10, 2012
Le 10/04/2012 00:39, Manu a écrit :
> It is, and I still don't follow. I can't imagine there are any indirect
> function calls, except for the ones introduced by this proposal, where
> you may register a function to mark the pointers in complex structs.
> You seem to be suggesting that another one already exists anyway? Where
> is it? Why is it there?

OK, back to basics.

For every type, a function template (let's call it GCscan) will be instantiated to scan it. This function can be ANY code. ANY code include the possibility for GCscan!A to call GCscan!B directly, without going back to GC main loop and indirect call. If inlined, you can forget about function call at all (and you can force that using mixin template for example, but it is likely to massively generate code bloat).

This can't be done for reference/pointer to polymorphic types, but for any other it is an available option, and it can reduce dramatically the number of indirect calls.
April 10, 2012
On 4/10/12 3:03 AM, deadalnix wrote:
> For every type, a function template (let's call it GCscan) will be
> instantiated to scan it. This function can be ANY code. ANY code include
> the possibility for GCscan!A to call GCscan!B directly, without going
> back to GC main loop and indirect call. If inlined, you can forget about
> function call at all (and you can force that using mixin template for
> example, but it is likely to massively generate code bloat).

That is correct. The code bloat is equal to that generated if the end user sat down and wrote by hand appropriate routines for collection for all types.

> This can't be done for reference/pointer to polymorphic types, but for
> any other it is an available option, and it can reduce dramatically the
> number of indirect calls.

Indeed. For non-final class member variables, the template will fetch their Typeinfo, from that the pointer to the mark function, and will issue the call through pointer.


Andrei
April 13, 2012
On Sunday, 8 April 2012 at 12:02:10 UTC, Alex Rønne Petersen wrote:
>> This sounds important to me. If it is also possible to do the work with
>> generated tables, and not calling thousands of indirect functions in
>> someone's implementation, it would be nice to reserve that possibility.
>> Indirect function calls in hot loops make me very nervous for non-x86
>> machines.
>
> Yes, I agree here. The last thing we need is a huge amount of kinda-sorta-virtual function calls on ARM, MIPS, etc. It may work fine on x86, but anywhere else, it's really not what you want in a GC.

What's the problem with virtual calls on ARM?
April 13, 2012
On 13 April 2012 15:53, Kagamin <spam@here.lot> wrote:

> On Sunday, 8 April 2012 at 12:02:10 UTC, Alex Rønne Petersen wrote:
>
>> This sounds important to me. If it is also possible to do the work with
>>> generated tables, and not calling thousands of indirect functions in someone's implementation, it would be nice to reserve that possibility. Indirect function calls in hot loops make me very nervous for non-x86 machines.
>>>
>>
>> Yes, I agree here. The last thing we need is a huge amount of kinda-sorta-virtual function calls on ARM, MIPS, etc. It may work fine on x86, but anywhere else, it's really not what you want in a GC.
>>
>
> What's the problem with virtual calls on ARM?
>

No other processors have branch prediction units anywhere near
the sophistication of modern x86. Any call through a function pointer
stalls the pipeline, pipelines are getting longer all the time, and PPC has
even more associated costs/hazards.
Most processors can only perform trivial binary branch prediction around an
'if'.
It also places burden on the icache (unable to prefetch), and of course the
dcache, both of which are much less sophisticated than x86 aswell.
Compiler can't do anything with code locality (improves icache usage),
since the target is unknown at compile time... there are also pipeline
stalls introduced by the sequence of indirect pointer lookups preceding any
virtual call.
Virtuals are possibly the worst hazard to modern CPU's, and the hardest to
detect/profile, since their cost is evenly spread throughout the entire
code base, you can never gauge their true impact on your performance. You
also can't easily measure the affect of icache misses on your code, suffice
to say, you will have MANY more in virtual-heavy code.

While I'm at it. 'final:' and 'virtual' keyword please ;)


April 13, 2012
On 2012-04-13 16:54:28 +0300 Manu <turkeyman@gmail.com> wrote:
> While I'm at it. 'final:' and 'virtual' keyword please ;)

Hmmm, I thought we decided that was a good idea, anybody in the know if this going to happen or not?

--
James Miller
April 13, 2012
On Friday, 13 April 2012 at 13:54:39 UTC, Manu wrote:
> No other processors have branch prediction units anywhere near
> the sophistication of modern x86. Any call through a function pointer
> stalls the pipeline, pipelines are getting longer all the time, and PPC has
> even more associated costs/hazards.
> Most processors can only perform trivial binary branch prediction around an
> 'if'.
> It also places burden on the icache (unable to prefetch), and of course the
> dcache, both of which are much less sophisticated than x86 aswell.

Allocation of small aggregated objects usually involves allocation of several equally small objects of different types in a row, so they sit one after another in heap and gc will visit them in a row every time calling function different from the previous time, so to x86 processor it would result in constant misprediction: AFAIK x86 processor caches only one target address per branch (ARM caches a flag?). And icache should not suffer in both cases: once you prefetched the function, it will remain in the icache and be reused from there the next time.
April 14, 2012
On 13 April 2012 17:25, Kagamin <spam@here.lot> wrote:
>
> once you prefetched the function, it will remain in the icache and be reused from there the next time.
>

All depends how much you love object orientation. If you follow the C++
book and make loads of classes for everything, you'll thrash the hell out
of it. If you only have a couple of different object, maybe they'll coexist
in the icache.
The GC is a bit of a special case though because it runs in a tight loop.
That said, the pipeline hazards still exist regardless of the state of
icache.
Conventional virtuals are worse, since during the process of executing
regular code, there's not usually such a tight loop pattern.
(note: I was answering the prior question about virtual functions in
general, not as applied to the specific use case of a GC scan)

The latest 'turbo' ARM chips (Cortex, etc) and such may store a branch
target table, they are alleged to have improved prediction, but I haven't
checked.
Prior chips (standard Arm9 and down, note: most non-top-end androids fall
in this category, and all games consoles with arms in them) don't
technically have branch prediction at all. ARM has conditional execution
bits on instructions, so it can filter opcodes based on the state of the
flags register. This is a cool design for binary branching, or performing
'select' operations, but it can't do anything to help an indirect jump.

Point is, the GC is the most fundamental performance hazard to D, and I
just think it's important to make sure the design is such that it is
possible to write a GC loop which can do its job with generated data tables
if possible, instead of requiring generated marker functions.
It would seem to me that carefully crafted tables of data could technically
perform the same function as marker functions, but without any function
calls... and the performance characteristics may be better/worse for
different architectures.