July 30, 2012
On 30-Jul-12 06:01, Andrei Alexandrescu wrote:
>> In fact memcpy could and should be replaced with a word-by-word copy
>> for almost all struct sizes up to ~32 bytes (as the size is known in
>> advance for this particular function pointer, i.e. handler!int).
>
> In fact memcpy should be smart enough to do all that, but apparently it
> doesn't.
>

I'd say array ops could and should do this (since the compiler has all the info at compile time). On the other hand, memcpy is just one tired C function aimed at fast blitting of arbitrary memory chunks.
(Even just a call/ret pair is too much overhead in the case of int.)
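For illustration, here is a minimal sketch of the kind of copy handler!T could emit when the size is statically known. The name copyStatic and the 32-byte cutoff are mine, not anything in std.variant:

void copyStatic(size_t size)(void* dst, const(void)* src)
    if (size <= 32)
{
    enum words = size / size_t.sizeof;
    // whole machine words first (assumes word-aligned storage)
    auto dw = cast(size_t*) dst;
    auto sw = cast(const(size_t)*) src;
    foreach (i; 0 .. words)
        dw[i] = sw[i];
    // then any leftover bytes
    auto db = cast(ubyte*) dst;
    auto sb = cast(const(ubyte)*) src;
    foreach (i; words * size_t.sizeof .. size)
        db[i] = sb[i];
}

// usage: copyStatic!(T.sizeof)(&dstStore, &srcStore);
// inlinable to a handful of moves the optimizer can schedule freely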

>> I've found something far more evil:
> [snip]
>> So to boot, in order to get int + int _detected_, it needs 4 calls to
>> fptr(OpID.testConversion, null, &info) == 0, acquiring a TypeInfo each
>> time. Not counting the 2 gets to obtain the result. It's just madness.
>
> Heh, this all gives me a new, humbling perspective on my earlier self
> running my mouth about other designs :o). As always, history and context
> make one much more empathic.
>
> Variant has a very old design and implementation. It is, in fact, the
> very first design I ever carried out in D; I'd decided that if I couldn't
> do that, D wasn't powerful enough. The compiler was so buggy back then
> that the implementation needed many contortions, and it was a minor
> miracle the whole house of cards held together.
>

Those were indeed very rough days. I think it's a minor miracle I didn't give up on D after seeing tons of bugs back in 2010.

> The basic approach was to use a pointer to function (instead of an
> integral) as a discriminator. The pointer to function was taken from the
> instantiation of a template function. That means Variant could work with
> any type, even types not yet defined. I'd made it a clear goal that I
> must not enumerate all types in one place, or "register" types for use
> with Variant - that would have been failure. Using a pointer to template
> function naturally associated a distinct tag with each type,
> automatically. I think it was a good idea then, and it's a good idea now.

It's indeed a nice trick, and the design sounds great (no register function, yay!).
The only piece missing is that e.g. arithmetic operations are binary, thus requiring double dispatch or something like arithmeticHandler!(T1,T2) to achieve optimal performance.

The hypothetical handcrafted C competition would do it via nested switches.
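Something like this hedged sketch, with a small integer tag in place of the handler pointer (Tag and add are made-up names, and a real table would cover all the common types):

enum Tag { Int, Double }

// dispatch on both operand types with nested switches;
// returns double for brevity
double add(Tag t1, const(void)* p1, Tag t2, const(void)* p2)
{
    switch (t1)
    {
    case Tag.Int:
        switch (t2)
        {
        case Tag.Int:    return *cast(const(int)*) p1 + *cast(const(int)*) p2;
        case Tag.Double: return *cast(const(int)*) p1 + *cast(const(double)*) p2;
        default: assert(0);
        }
    case Tag.Double:
        switch (t2)
        {
        case Tag.Int:    return *cast(const(double)*) p1 + *cast(const(int)*) p2;
        case Tag.Double: return *cast(const(double)*) p1 + *cast(const(double)*) p2;
        default: assert(0);
        }
    default: assert(0);
    }
}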

[snip]
>>> We can use the pointer to function as a tag for improving performance on
>>> primitive types.
>>>
>>
>> Indeed we can, though it won't be as fast as a simple tag - pointers
>> have large, sparse addresses, giving us the same as:
>>
>> if(ptr == handler!int)
>> ....
>> else if(ptr == handler!uint)
>> ....
>> else
>> ....
>
> Yes, and what we gain is speed for select known types. As long as the
> number of those is relatively small, using simple test and branch should
> serve us well. And we get to keep the generality, too.
>

OK, I'll try it for opArithmetic; that should help a lot.
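A rough sketch of what that fast path could look like. fptr, store and handler mirror the names already in this thread; addFast itself is a hypothetical helper, not part of std.variant:

private VariantN addFast(ref VariantN other)
{
    // fast path: one pointer comparison per operand for the common case
    if (fptr == &handler!int && other.fptr == &handler!int)
    {
        VariantN result;
        result = *cast(int*) &store + *cast(int*) &other.store;
        return result;
    }
    // ... same test for long, double, etc., then the generic slow path:
    return opArithmetic!(VariantN, "+")(other);
}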

>> Another way to go is to make a 2-D function table for all built-ins and
>> all operations on them. It's in fact a bunch of tables:
>> Type1 x Type2 for each of the important basic operations. The tables are
>> static so it won't be much of a problem. At least int + int would then be
>> 1 memory read, 1 call, 2 reads (cast as needed) & 1 store.
>
> I think that's not worth it, but hey, worth a try.
>

I'm thinking more about it, and the common types are just uint, int, long, ulong, double, and string. In fact most entries, e.g. int x string and the like, are "throw me an exception" entries, so they could be merged.
Then we could make a 2-stage table to save some space; it sounds like a trie, hm....
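A hedged sketch of the flat version of such a table, with made-up names; the shared error entry is the merging idea from above:

alias BinOp = void function(const(void)* a, const(void)* b, void* r);

enum TypeTag { Int, Double }  // the real table would cover all common types

void addII(const(void)* a, const(void)* b, void* r)
{ *cast(int*) r = *cast(const(int)*) a + *cast(const(int)*) b; }

void addDD(const(void)* a, const(void)* b, void* r)
{ *cast(double*) r = *cast(const(double)*) a + *cast(const(double)*) b; }

void addErr(const(void)* a, const(void)* b, void* r)
{ throw new Exception("invalid operand types for +"); }

// addTable[t1][t2]; all mismatched pairs share the single error entry
immutable BinOp[2][2] addTable =
    [[&addII,  &addErr],
     [&addErr, &addDD ]];

// usage: int x = 2, y = 3, r;
//        addTable[TypeTag.Int][TypeTag.Int](&x, &y, &r);
// i.e. one table read, one indirect call, two loads and a store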


-- 
Dmitry Olshansky
July 30, 2012
On 7/30/12 4:34 AM, Dmitry Olshansky wrote:
> On 30-Jul-12 06:01, Andrei Alexandrescu wrote:
>>> In fact memcpy could and should be replaced with a word-by-word copy
>>> for almost all struct sizes up to ~32 bytes (as the size is known in
>>> advance for this particular function pointer, i.e. handler!int).
>>
>> In fact memcpy should be smart enough to do all that, but apparently it
>> doesn't.
>>
>
> I'd say array ops could and should do this (since the compiler has all
> the info at compile time). On the other hand, memcpy is just one tired C
> function aimed at fast blitting of arbitrary memory chunks.
> (Even just a call/ret pair is too much overhead in the case of int.)

memcpy is implemented as an intrinsic on many platforms. I'm not sure whether it is on dmd, but it is on dmc (http://www.digitalmars.com/ctg/sc.html), icc, and gcc (http://software.intel.com/en-us/articles/memcpy-performance/). But then clearly using simple word assignments wherever possible makes for a more robust performance profile.

> I'm thinking more about it, and the common types are just uint, int,
> long, ulong, double, and string. In fact most entries, e.g. int x string
> and the like, are "throw me an exception" entries, so they could be merged.
> Then we could make a 2-stage table to save some space; it sounds like a
> trie, hm....

You can implement it via visitation too (two indirect calls, no search). "Modern C++ Design" discusses possible approaches at length.
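For the record, a minimal sketch of that visitation scheme in D. The class names are illustrative only; std.variant is not structured this way:

abstract class Num
{
    abstract Num addTo(Num lhs);        // first dispatch: on the rhs type
    abstract Num addInt(IntNum lhs);    // second dispatch: on the lhs type
    abstract Num addDouble(DoubleNum lhs);
}

class IntNum : Num
{
    int v;
    this(int v) { this.v = v; }
    override Num addTo(Num lhs) { return lhs.addInt(this); }
    override Num addInt(IntNum lhs) { return new IntNum(lhs.v + v); }
    override Num addDouble(DoubleNum lhs) { return new DoubleNum(lhs.v + v); }
}

class DoubleNum : Num
{
    double v;
    this(double v) { this.v = v; }
    override Num addTo(Num lhs) { return lhs.addDouble(this); }
    override Num addInt(IntNum lhs) { return new DoubleNum(lhs.v + v); }
    override Num addDouble(DoubleNum lhs) { return new DoubleNum(lhs.v + v); }
}

// lhs + rhs becomes rhs.addTo(lhs): exactly two virtual calls, no search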


Andrei


July 30, 2012
On 30/07/12 13:24, Andrei Alexandrescu wrote:
> On 7/30/12 4:34 AM, Dmitry Olshansky wrote:
>> On 30-Jul-12 06:01, Andrei Alexandrescu wrote:
>>>> In fact memcpy could and should be replaced with a word-by-word copy
>>>> for almost all struct sizes up to ~32 bytes (as the size is known in
>>>> advance for this particular function pointer, i.e. handler!int).
>>>
>>> In fact memcpy should be smart enough to do all that, but apparently it
>>> doesn't.
>>>
>>
>> I'd say array ops could and should do this (since the compiler has all
>> the info at compile time). On the other hand, memcpy is just one tired C
>> function aimed at fast blitting of arbitrary memory chunks.
>> (Even just a call/ret pair is too much overhead in the case of int.)
>
> memcpy is implemented as an intrinsic on many platforms. I'm not sure
> whether it is on dmd, but it is on dmc
> (http://www.digitalmars.com/ctg/sc.html), icc, and gcc
> (http://software.intel.com/en-us/articles/memcpy-performance/). But then
> clearly using simple word assignments wherever possible makes for a more
> robust performance profile.

It is an intrinsic on DMD, but it isn't done optimally. Mostly it just compiles to a couple of loads plus the single instruction
rep movsd; / rep movsq;
which is perfect for medium-sized lengths when everything is aligned, but once the length exceeds a few hundred bytes, it should be done as a function call (the optimal method involves cache blocking).
Also, for very short lengths it should be done as a couple of loads.
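In code, the dispatch Don describes would look roughly like this; the 16-byte threshold is an illustrative guess, not dmd's actual cutoff:

import core.stdc.string : memcpy;

void blit(void* dst, const(void)* src, size_t n)
{
    if (n <= 16)
    {
        // very short: plain loads/stores beat the rep movs setup cost
        auto d = cast(ubyte*) dst;
        auto s = cast(const(ubyte)*) src;
        foreach (i; 0 .. n)
            d[i] = s[i];
    }
    else
    {
        // medium lengths suit rep movsd/movsq when aligned; past a few
        // hundred bytes a full library memcpy (which can do cache
        // blocking) wins, so just call it
        memcpy(dst, src, n);
    }
}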
August 03, 2012
On Sunday, 29 July 2012 at 20:24:42 UTC, Dmitry Olshansky wrote:
> On 30-Jul-12 00:11, Dmitry Olshansky wrote:
>> I've found something far more evil:
>>
>> @property bool convertsTo(T)() const
>>     {
>>         TypeInfo info = typeid(T);
>>         return fptr(OpID.testConversion, null, &info) == 0;
>>     }
>>
>> Okay... now let me pull off another piece of rag:
>>
>> private VariantN opArithmetic(T, string op)(T other)
>>     {
>>         VariantN result;
>>         static if (is(T == VariantN))
>>         {
>>            /*if (convertsTo!(uint) && other.convertsTo!(uint))
>>                 result = mixin("get!(uint) " ~ op ~ " other.get!(uint)");
>>             else*/ if (convertsTo!(int) && other.convertsTo!(int))
>>                 result = mixin("get!(int) " ~ op ~ " other.get!(int)");
> ...
> Apparently I'm spot on.
> Just commenting one extra branch of this horror movie
> gives interesting change:
>
> 2779us
> 2667us
> 3153762us
>
> After:
>
> 2319us
> 2523us
> 288581us
>
> Aye, 10x :o)

I profiled it and found that much of the time is spent inside TypeInfo.opEquals, called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 s on my machine. I don't know if that is the proper solution, since I don't know whether typeid can ever return two TypeInfo objects that aren't the same object but compare equal (I haven't used typeid and TypeInfo much before). The fib function here does return the correct result after that change, though.
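The difference in isolation, for anyone unfamiliar with TypeInfo; whether typeid is guaranteed to return the same object for the same type (which would make "is" always safe here) is exactly the open question above:

void main()
{
    TypeInfo a = typeid(int);
    TypeInfo b = typeid(int);

    assert(a is b);  // identity: a single pointer comparison
    assert(a == b);  // equality: goes through TypeInfo.opEquals -- far slower
}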
August 03, 2012
On Fri, Aug 3, 2012 at 8:28 AM, jerro <a@a.com> wrote:

> On Sunday, 29 July 2012 at 20:24:42 UTC, Dmitry Olshansky wrote:
>
>> On 30-Jul-12 00:11, Dmitry Olshansky wrote:
>>
>>> I've found something far more evil:
>>>
>>> @property bool convertsTo(T)() const
>>>     {
>>>         TypeInfo info = typeid(T);
>>>         return fptr(OpID.testConversion, null, &info) == 0;
>>>     }
>>>
>>> Okay... now let me pull off another piece of rag:
>>>
>>> private VariantN opArithmetic(T, string op)(T other)
>>>     {
>>>         VariantN result;
>>>         static if (is(T == VariantN))
>>>         {
>>>            /*if (convertsTo!(uint) && other.convertsTo!(uint))
>>>                 result = mixin("get!(uint) " ~ op ~ " other.get!(uint)");
>>>             else*/ if (convertsTo!(int) && other.convertsTo!(int))
>>>                 result = mixin("get!(int) " ~ op ~ " other.get!(int)");
>>>
>> ...
>> Apparently I'm spot on.
>> Just commenting one extra branch of this horror movie
>> gives interesting change:
>>
>> 2779us
>> 2667us
>> 3153762us
>>
>> After:
>>
>> 2319us
>> 2523us
>> 288581us
>>
>> Aye, 10x :o)
>>
>
> I profiled it and found that much of the time is spent inside TypeInfo.opEquals, called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 s on my machine. I don't know if that is the proper solution, since I don't know whether typeid can ever return two TypeInfo objects that aren't the same object but compare equal (I haven't used typeid and TypeInfo much before). The fib function here does return the correct result after that change, though.
>

Wow! Now that's an impressive improvement. If TypeInfo instances are unique for every type, then we should be good to go!

-- 
Bye,
Gor Gyolchanyan.


August 03, 2012
>> I profiled it and found that much of the time is spent inside
>> TypeInfo.opEquals, called from tryPutting. So I tried replacing "!=
>> typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s
>> down to 0.12 s on my machine. I don't know if that is the proper solution,
>> since I don't know whether typeid can ever return two TypeInfo objects
>> that aren't the same object but compare equal (I haven't used typeid and
>> TypeInfo much before). The fib function here does return the correct
>> result after that change, though.
>>
>
> Wow! Now that's an impressive improvement. If TypeInfo instances are unique
> for every type, then we should be good to go!

I've tried Dmitry's fork now (he optimized opArithmetic to not make as many calls to fptr) and it's even a little faster (1.01 s for 100000 iterations). Replacing != with !is in Dmitry's fork speeds it up a little more (0.935 s for 100000 iterations).