What would be the consequence of implementing interfaces as fat pointers ? (page 6)

On Tuesday, 1 April 2014 at 09:38:09 UTC, Manu wrote:
> under the impression that the typical implementation would also keep the
> value around in a renamed register, and when it pops up again at a later
> time, it would use the register directly, rather than load from memory.

Not sure how that would work; the memory page/cache line would have to be marked as read-only.

Section 10.8 in this document only talks about elimination of register-to-register moves:

http://www.agner.org/optimize/microarchitecture.pdf

But new x86s have a cache for decoded instructions and special looping optimizations for tight inner loops that bypass decoding (micro-op cache).

Anyway, I think the best solution to multiple inheritance and interfaces is whole program optimization, either in the compiler or the linker. The cost of long vtables is probably quite low on today's desktops.
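For readers joining on this page, the "interface as fat pointer" idea from the thread title can be sketched roughly as below. This is a minimal C++ illustration under stated assumptions, not how any D compiler actually represents interfaces: `ShapeVTable`, `ShapeRef`, `Rect`, `Square` and the helper functions are all hypothetical names made up for the example.

```cpp
#include <cassert>

// Hypothetical interface "Shape" with a single method, described by an
// explicit table of function pointers (a hand-written vtable).
struct ShapeVTable {
    int (*area)(const void* self);
};

// The fat pointer: an object pointer plus a pointer to the interface's
// vtable. The object itself needs no embedded vtable pointer.
struct ShapeRef {
    const void* object;
    const ShapeVTable* vtable;

    int area() const { return vtable->area(object); }
};

struct Rect { int w, h; };
struct Square { int side; };

inline int rect_area(const void* self) {
    const Rect* r = static_cast<const Rect*>(self);
    return r->w * r->h;
}

inline int square_area(const void* self) {
    const Square* s = static_cast<const Square*>(self);
    return s->side * s->side;
}

// One vtable per (type, interface) pair.
const ShapeVTable rect_vtable = {&rect_area};
const ShapeVTable square_vtable = {&square_area};

inline ShapeRef as_shape(const Rect& r) { return ShapeRef{&r, &rect_vtable}; }
inline ShapeRef as_shape(const Square& s) { return ShapeRef{&s, &square_vtable}; }

// Passed by value. Under the x86-64 System V calling convention a struct
// of two pointers like this is normally passed in two integer registers.
inline int total_area(ShapeRef a, ShapeRef b) { return a.area() + b.area(); }

static_assert(sizeof(ShapeRef) == 2 * sizeof(void*), "fat pointer is two words");

// Usage example.
inline void fat_pointer_example() {
    Rect r{3, 4};
    Square s{5};
    assert(total_area(as_shape(r), as_shape(s)) == 12 + 25);
}
```

The trade-off discussed in the thread shows up directly: each interface reference costs two words (and two registers when passed), in exchange for not having to adjust or look up the vtable through the object.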

On 2014-04-01 14:17:33 +0000, Manu <turkeyman@gmail.com> said:

> On 1 April 2014 22:03, Michel Fortin <michel.fortin@michelf.ca> wrote:
>
>> On 2014-04-01 07:11:51 +0000, Manu <turkeyman@gmail.com> said:
>>
>>> Of course, I use alias this all the time too for various stuff. I said
>>> before, it's a useful tool and it's great D *can* do this stuff, but I'm
>>> talking about this particular super common use case where it's used to
>>> hack together nothing more than a class without a vtable, ie, a basic
>>> ref type. I'd say that's worth serious consideration as a 1st-class
>>> concept?
>>
>> You don't need it as a 1st-class D concept though. Just implement the
>> basics of the C++ object model in D, similar to what I did for Objective-C,
>> and let people define their own extern(C++) classes with no base class.
>> Bonus if it's binary compatible with the equivalent C++ class. Hasn't
>> someone done that already?
>
> I don't think the right conceptual solution to a general ref-type intended
> for use throughout D code is to mark it extern C++... That makes no sense.

I was thinking of having classes that'd be semantically equivalent to those in D but would follow the C++ ABI, hence the extern(C++). It doesn't have to support all of C++, just the parts that intersect with what you can express in D. For instance, those classes would be reference types, just like D classes; if you need value-type behaviour, use a struct.

But maybe that doesn't make sense.

--
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

On 01/04/14 07:53, Walter Bright wrote:
> 'alias this' is inelegant (sorry Andrei) but it was designed for precisely this
> purpose - being able to use a struct to wrap any other type, and forward to and
> override behaviors of that type. Nobody has found a better way.

There are currently some unfortunate problems with it. Unless I've missed a recent fix, the examples in TDPL pp.230-233 don't work as they should, because of problems with protection attributes. :-(

> Fortunately, the inelegance can be encapsulated within that type, and the user
> of the type need not be even aware of it.
>
> Remember my halffloat implementation? It relied on 'alias this' to work. Just
> try doing that in C++ <g>.

It's fantastic that we can do this, but I have to say, having spent over a year exploring different reference-type solutions for a successor to std.random, it was striking how nice and simple it felt to just use classes.

April 01, 2014

Re: What would be the consequence of implementing interfaces as fat pointers ?

Posted by dajones
in reply to Manu


"Manu" <turkeyman@gmail.com> wrote in message news:mailman.9.1396345088.19942.digitalmars-d@puremagic.com...
> On 1 April 2014 18:33, dajones <dajones@hotmail.com> wrote:
>
>> "Manu" <turkeyman@gmail.com> wrote in message news:mailman.122.1396231817.25518.digitalmars-d@puremagic.com...
>> > On 30 March 2014 13:39, Walter Bright <newshound2@digitalmars.com> wrote:
>> >>>
>> >>> Two-pointer structs are passed in registers, which is fast. If that
>> >>> spills, it spills onto the stack, which is hot and prefetcher friendly.
>> >>>
>> >>
>> >> That underestimates how precious register real estate is on the x86.
>> >
>> > This is only a concern when passing args. x86 has huge internal register
>> > files and uses aggressive register renaming,
>>
>> If we could use them that would be great but we can't. We have to store/load
>> to memory, and that means approx. 3 cycle latency each way. The CPU can't
>> guess that we're only saving it for later, it has to do the memory write,
>> and even with the store-to-load forwarding mechanism, spilling and reloading
>> is expensive.
>>
> Can you detail this more?

x86 uses something called (IIRC) a "store forwarding buffer". Essentially it keeps track of stores until they have been completed. Any time you read from an address the store forwarding buffer is checked first, then caches and main memory. If it can't forward the value, you have to wait for the store to finalize, and that can be a lot slower again. If there's no pending store, it comes from the cache.

Either way, memory stores/loads generally have at best a 3 cycle latency.
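The lookup order described above (pending stores first, then the cache) can be sketched with a toy software model. This is only an illustration of the idea, not a description of any real core: actual store buffers match on byte ranges, have a fixed number of entries and drain in hardware. `ToyCore`, `PendingStore` and all member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Toy model only. All names here are made up for the example.
struct PendingStore {
    std::uintptr_t address;
    std::uint32_t value;
};

class ToyCore {
public:
    // A store only enters the store buffer; the cache is not updated yet.
    void store(std::uintptr_t address, std::uint32_t value) {
        store_buffer_.push_back(PendingStore{address, value});
    }

    // A load checks pending stores first (youngest wins), then the cache.
    std::uint32_t load(std::uintptr_t address) const {
        for (auto it = store_buffer_.rbegin(); it != store_buffer_.rend(); ++it) {
            if (it->address == address) {
                return it->value;  // store-to-load forwarding
            }
        }
        return read_cache(address);
    }

    // Commit the oldest pending store to the cache.
    // Returns false if the buffer was already empty.
    bool drain_one() {
        if (store_buffer_.empty()) {
            return false;
        }
        const PendingStore oldest = store_buffer_.front();
        store_buffer_.pop_front();
        cache_[oldest.address] = oldest.value;
        return true;
    }

    // What another core could observe in this model: committed stores only.
    std::uint32_t read_cache(std::uintptr_t address) const {
        const auto hit = cache_.find(address);
        return hit != cache_.end() ? hit->second : 0u;  // untouched memory reads as 0
    }

    std::size_t pending() const { return store_buffer_.size(); }

private:
    std::deque<PendingStore> store_buffer_;
    std::unordered_map<std::uintptr_t, std::uint32_t> cache_;
};

// Usage: the storing core sees its own write at once, the cache does not.
inline void store_buffer_example() {
    ToyCore core;
    core.store(0x1000, 42);
    assert(core.load(0x1000) == 42);        // forwarded from the store buffer
    assert(core.read_cache(0x1000) == 0);   // not yet committed
    core.drain_one();
    assert(core.read_cache(0x1000) == 42);  // now committed
}
```

The model says nothing about cycle counts; it only shows why a reload of a just-spilled value does not have to wait for the store to reach the cache, and why the value is not immediately visible elsewhere.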

> Obviously it must perform the store to maintain memory coherency, but I was
> under the impression that the typical implementation would also keep the
> value around in a renamed register, and when it pops up again at a later
> time, it would use the register directly, rather than load from memory.

I've never read of any x86 doing what you describe. But I'm not too well up on the latest CPUs.

> The store shouldn't take any significant time since there's no dependency on the stored value, it should only take as long as issuing the store instruction; latency is irrelevant, since it's never read back, there's nothing waiting on it.

True.

> Not sure what you mean by 'each way', since stored values shouldn't be read
> back if it gets the value from a stashed register.
>
> I'm not an expert on the topic, but I read about it some years back, and haven't given it much thought since.

Check out the Agner Fog microarchitecture and instruction timings PDFs. That's pretty much the holy scripture when it comes to this stuff.

It may even be that reducing contention on the memory unit helps; modern x86s tend to have multiple ALUs but only 1 memory unit. So instructions with memory operands can't be done in parallel as often.

On Tuesday, 1 April 2014 at 23:05:55 UTC, dajones wrote:
> x86 uses something called (IIRC) a "store forwarding buffer". Essentially it
> keeps track of stores until they have been completed. Any time you read
> from an address the store forwarding buffer is checked first, then caches and
> main memory. If it can't do that you have to wait for the store to finalize,
> and that can be a lot slower again. If there's no pending store it comes
> from the cache.
>

Isn't it commonly called a store buffer? Most CPUs have one these days.

Indeed, stores are put in the store buffer until they are completed (which can take some time, as you have to acquire the cache line from another core or from memory). When you load, the CPU snoops the store buffer in parallel with the L1 cache for a value.

On Tuesday, 1 April 2014 at 23:05:55 UTC, dajones wrote:
> x86 uses something called (IIRC) a "store forwarding buffer". Essentially it
> keeps track of stores until they have been completed. Any time you read
> from an address the store forwarding buffer is checked first, then caches and main memory.

Store forwarding is probably important for passing parameters on the stack (where you have frequent subsequent writes/reads to the same memory location), but optimizing for it seems like a very CPU-dependent PITA and you are usually better off using SIMD registers IMO. After all, store forwarding is only relevant until the store hits the L1 cache of the core.

> either way memory stores/loads generally have at best a 3 cycle latency.

Because the CPU has to check the dirty flag of the L3 cache line in case another core has a dirty L1 from a store to the same memory?

On Wednesday, 2 April 2014 at 10:21:50 UTC, Ola Fosheim Grøstad wrote:
>> either way memory stores/loads generally have at best a 3 cycle latency.
>
> Because the CPU has to check the dirty flag of the L3 cache line in case another core has a dirty L1 from a store to the same memory?

You don't even come close to L3 in 3 cycles. Propagating a signal takes time. You end up with two constraints in tension: the bigger your cache, the longer the round trip. That is why we have had L1 caches of 32 KB for ages now. Making them bigger would require increasing the response time, which lowers performance.

On Wednesday, 2 April 2014 at 18:21:26 UTC, deadalnix wrote:
> You don't even come close to L3 in 3 cycles. Propagating a signal
> takes time. You end up with two constraints in tension: the bigger
> your cache, the longer the round trip.

I was thinking about it the wrong way. I guess it does not matter if a read gets the wrong value when there are concurrent writes to the same location and there is no synchronization. It feels weird, but speculative out-of-order execution etc. is not very intuitive in the first place...

On Wednesday, 2 April 2014 at 19:29:29 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 2 April 2014 at 18:21:26 UTC, deadalnix wrote:
>> You don't even come close to L3 in 3 cycles. Propagating a signal
>> takes time. You end up with two constraints in tension: the bigger
>> your cache, the longer the round trip.
>
> I was thinking about it the wrong way. I guess it does not matter if a read gets the wrong value when there are concurrent writes to the same location and there is no synchronization. It feels weird, but speculative out-of-order execution etc. is not very intuitive in the first place...

Yes, this mechanism can cause memory operations to be seen out of order by other cores.
