July 29, 2012
On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:
> On 29-Jul-12 18:17, Andrei Alexandrescu wrote:
>> On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
>>> std.variant is so incredibly slow! It's practically unusable for
>>> anything that requires even a tiny bit of performance.
>>
>> You do realize you actually benchmark against a function that does
>> nothing, right? Clearly there are ways in which we can improve
>> std.variant to the point initialization costs assignment of two words,
>> but this benchmark doesn't help. (Incidentally I just prepared a class
>> at C++ and Beyond on benchmarking, and this benchmark makes a lot of the
>> mistakes described therein...)
>>
>>
>> Andrei
>
>
> This should be more relevant then:
>
> //fib.d
> import std.datetime, std.stdio, std.variant;
>
> auto fib(Int)()
> {
> 	Int a = 1, b = 1;
> 	for(size_t i=0; i<100; i++){
> 		Int c = a + b;
> 		a = b;
> 		b = c;
> 	}
> 	return a;	
> }
>
> void main()
> {
> 	writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
> }
>
>
> dmd -O -inline -release fib.d
>
> Output:
>
> [TickDuration(197), TickDuration(276), TickDuration(93370107)]
>
> I'm horrified. Who was working on std.variant enhancements? Please chime in.

You underestimate DMD's optimisations :-)

For int and long, DMD does the whole loop at compile time, so again you are benchmarking against an empty function. It's easy to see that this is the case by changing the size of the loop and noting that the int and long versions take the same amount of time.

Here's my results with optimisations turned *off*:

int: 2,304,868
long: 1,559,679
Variant: 2,667,320,252

Yes, it's not good to test without optimisations, but I think this gives a clearer picture of the relative differences between the two.

Still not great at ~1000x difference, but much better than a million :-)
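
For what it's worth, here's one way the benchmark could be made resistant to that folding (a rough, untested sketch of mine, not from the original post): take the iteration count from a runtime value, so the compiler can no longer compute the int/long results at compile time. A sufficiently aggressive optimiser could still discard the unused work, so treat it as illustrative only.

//fib2.d
import std.datetime, std.stdio, std.variant;

size_t iterations;   // set at runtime, so the loop bound is not a constant

auto fib(Int)()
{
    Int a = 1, b = 1;
    for(size_t i=0; i<iterations; i++){
        Int c = a + b;
        a = b;
        b = c;
    }
    return a;
}

void main(string[] args)
{
    iterations = 100 + args.length - 1;   // depends on runtime input
    writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
    writeln(fib!int());                   // also consume one result
}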




July 29, 2012
On 29-Jul-12 23:20, Peter Alexander wrote:
> On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:
>> On 29-Jul-12 18:17, Andrei Alexandrescu wrote:
>>> On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
>>>> std.variant is so incredibly slow! It's practically unusable for
>>>> anything that requires even a tiny bit of performance.
>>>
>>> You do realize you actually benchmark against a function that does
>>> nothing, right? Clearly there are ways in which we can improve
>>> std.variant to the point initialization costs assignment of two words,
>>> but this benchmark doesn't help. (Incidentally I just prepared a class
>>> at C++ and Beyond on benchmarking, and this benchmark makes a lot of the
>>> mistakes described therein...)
>>>
>>>
>>> Andrei
>>
>>
>> This should be more relevant then:
>>
>> //fib.d
>> import std.datetime, std.stdio, std.variant;
>>
>> auto fib(Int)()
>> {
>>     Int a = 1, b = 1;
>>     for(size_t i=0; i<100; i++){
>>         Int c = a + b;
>>         a = b;
>>         b = c;
>>     }
>>     return a;
>> }
>>
>> void main()
>> {
>>     writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
>> }
>>
>>
>> dmd -O -inline -release fib.d
>>
>> Output:
>>
>> [TickDuration(197), TickDuration(276), TickDuration(93370107)]
>>
>> I'm horrified. Who was working on std.variant enhancements? Please
>> chime in.
>
> You underestimate DMD's optimisations :-)
>

Yeah, guilty as charged. I was in a hurry, but I did test it without optimization before posting and the results were not good either.

> For int and long, DMD does the whole loop at compile time, so again you
> are benchmarking against an empty function. It's easy to see that this
> is the case by changing the size of the loop and noting that the int and
> long versions take the same amount of time.
>
> Here's my results with optimisations turned *off*:
>
> int: 2,304,868
> long: 1,559,679

Guess you are on a 64-bit system? The long version takes somewhat longer for me.

> Variant: 2,667,320,252
>

Mine with -release -inline:
2307us
2577us
3170315us

3 orders of magnitude. Still too bad.

> Yes, it's not good to test without optimisations, but I think this gives
> a clearer picture of the relative differences between the two.
>
> Still not great at ~1000x difference, but much better than a million :-)
>



-- 
Dmitry Olshansky
July 29, 2012
On 29-Jul-12 22:59, Andrei Alexandrescu wrote:
> On 7/29/12 10:43 AM, Dmitry Olshansky wrote:
>> I'm horrified. Who was working on std.variant enhancements? Please chime
>> in.
>
> I guess you just volunteered! When I looked at it this morning I noticed
> a few signs of bit rot, e.g. opAssign returns by value and such. (Only
> planting a "ref" there improves performance a good amount.)
>
Strange, but numbers stay the same for me.

> Variant has a simple design with (in case of int) an int and a pointer
> to a function. Many of its operations incur an indirect call through
> that pointer. This makes operations slower than the time-honored design
> of using an integral tag and switching on it, but offers in return the
> ability to hold any type without needing to enumerate all types explicitly.
>

Well, obviously integers do not take it lightly when they are copied around with memcpy like some pompous arrays; that alone incurs *some* penalty.

In fact memcpy could and should be replaced with a word-by-word copy for almost all struct sizes up to ~32 bytes (as the size is known in advance for this particular function pointer, i.e. handler!int).
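
Something like this hypothetical helper would do (copySmall is a made-up name for illustration, not Phobos code), since the handler knows T statically:

// hypothetical, untested sketch - not the actual std.variant code
void copySmall(T)(void* dst, const(void)* src) @trusted
{
    static if (T.sizeof <= 32)
    {
        // size is a compile-time constant, so this is a plain fixed-size
        // copy the compiler can lower to a few moves instead of a libc call
        *cast(ubyte[T.sizeof]*) dst = *cast(const(ubyte[T.sizeof])*) src;
    }
    else
    {
        import core.stdc.string : memcpy;
        memcpy(dst, src, T.sizeof);
    }
}

The int handler, say, would call copySmall!int and the branch disappears at compile time.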

I've found something far more evil:

@property bool convertsTo(T)() const
    {
        TypeInfo info = typeid(T);
        return fptr(OpID.testConversion, null, &info) == 0;
    }

Okay... now let me pull off another piece of rag:

private VariantN opArithmetic(T, string op)(T other)
    {
        VariantN result;
        static if (is(T == VariantN))
        {
            if (convertsTo!(uint) && other.convertsTo!(uint))
                result = mixin("get!(uint) " ~ op ~ " other.get!(uint)");
            else if (convertsTo!(int) && other.convertsTo!(int))
                result = mixin("get!(int) " ~ op ~ " other.get!(int)");
            else if (convertsTo!(ulong) && other.convertsTo!(ulong))
                result = mixin("get!(ulong) " ~ op ~ " other.get!(ulong)");
            else if (convertsTo!(long) && other.convertsTo!(long))
                result = mixin("get!(long) " ~ op ~ " other.get!(long)");
            else if (convertsTo!(double) && other.convertsTo!(double))
                result = mixin("get!(double) " ~ op ~ " other.get!(double)");
            else
                result = mixin("get!(real) " ~ op ~ " other.get!(real)");
        }
        else
        {
            if (is(typeof(T.max) : uint) && T.min == 0 && convertsTo!(uint))
                result = mixin("get!(uint) " ~ op ~ " other");
            else if (is(typeof(T.max) : int) && T.min < 0 && convertsTo!(int))
                result = mixin("get!(int) " ~ op ~ " other");
            else if (is(typeof(T.max) : ulong) && T.min == 0
                     && convertsTo!(ulong))
                result = mixin("get!(ulong) " ~ op ~ " other");
            else if (is(typeof(T.max) : long) && T.min < 0 && convertsTo!(long))
                result = mixin("get!(long) " ~ op ~ " other");
            else if (is(T : double) && convertsTo!(double))
                result = mixin("get!(double) " ~ op ~ " other");
            else
                result = mixin("get!(real) " ~ op ~ " other");
        }
        return result;
    }


So, to boot, just to get int + int _detected_ it needs 4 calls to
fptr(OpID.testConversion, null, &info) == 0, acquiring a TypeInfo each time.
That's not counting the 2 gets to obtain the result. It's just madness.

Plain hardcoded Type to integer switch seems so much better now.

> We can use the pointer to function as a tag for improving performance on
> primitive types.
>

Indeed we can, though it won't be as fast as a simple tag - pointer
addresses are big and sparse, so we end up with the equivalent of:

if(ptr == handler!int)
...
else if(ptr == handler!uint)
...
else
...

Another way to go is to make a 2-D function table for all built-ins and all
operations on them. It's in fact a bunch of tables: Type1 x Type2 for each of
the important basic operations. The tables are static so it won't be much of a
problem. At least int + int would then be
1 memory read, 1 call, 2 (cast as needed) reads & 1 store.
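
Roughly like this toy with just int and long (all names are made up; a real table would also have to record the result type and cover the full set of built-ins and operators):

// untested sketch of the 2-D table idea, not Phobos code
import std.stdio;

alias BinOp = void function(const(void)* a, const(void)* b, void* result);

enum Tag { Int, Long }   // just two built-ins to keep the sketch short

// one instantiation per (left, right) pair; result type fixed statically
void addImpl(A, B, R)(const(void)* a, const(void)* b, void* result)
{
    *cast(R*) result = *cast(const(A)*) a + *cast(const(B)*) b;
}

// addTable[left][right] - static, built once, shared by every Variant
immutable BinOp[2][2] addTable = [
    [&addImpl!(int,  int,  int ), &addImpl!(int,  long, long)],
    [&addImpl!(long, int,  long), &addImpl!(long, long, long)],
];

void main()
{
    int a = 2; long b = 40; long r;
    addTable[Tag.Int][Tag.Long](&a, &b, &r);   // 1 read, 1 call, 2 loads, 1 store
    writeln(r);   // 42
}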

-- 
Dmitry Olshansky
July 29, 2012
On 30-Jul-12 00:11, Dmitry Olshansky wrote:
> I've found something far more evil:
>
> @property bool convertsTo(T)() const
>      {
>          TypeInfo info = typeid(T);
>          return fptr(OpID.testConversion, null, &info) == 0;
>      }
>
> Okay... now let me pull off another piece of rag:
>
> private VariantN opArithmetic(T, string op)(T other)
>      {
>          VariantN result;
>          static if (is(T == VariantN))
>          {
>             /*if (convertsTo!(uint) && other.convertsTo!(uint))
>                  result = mixin("get!(uint) " ~ op ~ " other.get!(uint)");
>              else*/ if (convertsTo!(int) && other.convertsTo!(int))
>                  result = mixin("get!(int) " ~ op ~ " other.get!(int)");
...
Apparently I'm spot on.
Just commenting out one extra branch of this horror movie
gives an interesting change:

2779us
2667us
3153762us

After:

2319us
2523us
288581us

Aye, 10x :o)

-- 
Dmitry Olshansky
July 29, 2012
On Sunday, July 29, 2012 18:43:07 Dmitry Olshansky wrote:
> I'm horrified. Who was working on std.variant enhancements? Please chime in.

I believe that Robert Jaques fixed up std.variant as part of his work on fixing up std.json and that he intended to get the std.variant changes reviewed and merged in before doing the same with his std.json changes, but he's never asked for a review of either, and I don't know if his code is available online or not. I think that it's another one of those cases where someone did a fair bit of work but became too busy to finish the job.

- Jonathan M Davis
July 29, 2012
On Mon, 30 Jul 2012 00:24:40 +0400,
Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> Just commenting out one extra branch of this horror movie
> gives an interesting change:
> 
> 2779us
> 2667us
> 3153762us
> 
> After:
> 
> 2319us
> 2523us
> 288581us
> 
> Aye, 10x :o)

Very nice finding. Now it is only 100x slower. :D It seems well worth changing the style Variant is written in.

-- 
Marco

July 29, 2012
https://jshare.johnshopkins.edu/rjacque2/public_html/variant.mht
https://jshare.johnshopkins.edu/rjacque2/public_html/variant.d
July 30, 2012
On 7/29/12 4:11 PM, Dmitry Olshansky wrote:
> On 29-Jul-12 22:59, Andrei Alexandrescu wrote:
>> On 7/29/12 10:43 AM, Dmitry Olshansky wrote:
>>> I'm horrified. Who was working on std.variant enhancements? Please chime
>>> in.
>>
>> I guess you just volunteered! When I looked at it this morning I noticed
>> a few signs of bit rot, e.g. opAssign returns by value and such. (Only
>> planting a "ref" there improves performance a good amount.)
>>
> Strange, but numbers stay the same for me.

Depends on what you measure.

>> Variant has a simple design with (in case of int) an int and a pointer
>> to a function. Many of its operations incur an indirect call through
>> that pointer. This makes operations slower than the time-honored design
>> of using an integral tag and switching on it, but offers in return the
>> ability to hold any type without needing to enumerate all types
>> explicitly.
>>
>
> Well, obviously integers do not take it lightly when they are copied around
> with memcpy like some pompous arrays; that alone incurs *some* penalty.

Yah, when I replaced memcpy with simple assignment for Variant<int> I saw some improvements.

> In fact memcpy could and should be replaced with a word-by-word copy for
> almost all struct sizes up to ~32 bytes (as the size is known in advance
> for this particular function pointer, i.e. handler!int).

In fact memcpy should be smart enough to do all that, but apparently it doesn't.

> I've found something far more evil:
[snip]
> So, to boot, just to get int + int _detected_ it needs 4 calls to
> fptr(OpID.testConversion, null, &info) == 0, acquiring a TypeInfo each
> time. That's not counting the 2 gets to obtain the result. It's just madness.

Heh, this all gives me a new, humbling perspective on my earlier self running my mouth about other designs :o). As always, history and context make one much more empathic.

Variant has a very old design and implementation. It is, in fact, the very first design I've ever carried in D; I'd decided, if I can't do that, D isn't powerful enough. The compiler was so buggy back then that the implementation needed to take many contortions, and it was a minor miracle the whole house of cards stood together.

The approach was to design for usability first; performance could come later. So the design only had to not prevent performance from being improved transparently later, which I think it has achieved.

The basic approach was to use a pointer to function (instead of an integral) as a discriminator. The pointer to function was taken from the instantiation of a template function. That means Variant could work with any type, even types not yet defined. I'd made it a clear goal that I must not enumerate all types in one place, or "register" types for use with Variant - that would have been failure. Using a pointer to template function naturally associated a distinct tag with each type, automatically. I think it was a good idea then, and it's a good idea now.

On this data layout functionality can be built in two ways:

(a) Implement it inside the dispatcher function template, or
(b) Test against it "outside", in the template Variant code.

The first approach is more general, the second is faster. Right now Variant uses (a) almost exclusively. It's time to start doing (b) for more types.

> Plain hardcoded Type to integer switch seems so much better now.

But we can do that now.

>> We can use the pointer to function as a tag for improving performance on
>> primitive types.
>>
>
> Indeed we can, though it won't be as fast as a simple tag - pointer
> addresses are big and sparse, so we end up with the equivalent of:
>
> if(ptr == handler!int)
> ...
> else if(ptr == handler!uint)
> ...
> else
> ...

Yes, and what we gain is speed for select known types. As long as the number of those is relatively small, using simple test and branch should serve us well. And we get to keep the generality, too.
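
Something like this toy illustrates it (it only mirrors the handler/fptr naming from the discussion; none of it is the actual Variant code, and it's untested):

import std.stdio;

alias Handler = size_t function();          // stand-in for Variant's fptr

size_t handler(T)() { return T.sizeof; }    // one instance, one address, per T

struct Toy
{
    Handler fptr;
    union { int i; long l; }

    static Toy of(int v)  { Toy t; t.fptr = &handler!int;  t.i = v; return t; }
    static Toy of(long v) { Toy t; t.fptr = &handler!long; t.l = v; return t; }

    long opBinary(string op : "+")(Toy rhs) const
    {
        // fast path: known types recognized by comparing handler addresses
        if (fptr is &handler!int && rhs.fptr is &handler!int)
            return cast(long) i + rhs.i;
        // anything else would fall back to the generic dispatcher
        return (fptr is &handler!int ? i : l) + (rhs.fptr is &handler!int ? rhs.i : rhs.l);
    }
}

void main()
{
    writeln(Toy.of(40) + Toy.of(2));   // 42, through the fast path
}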

> Another way to go is to make a 2-D function table for all built-ins and
> all operations on them. It's in fact a bunch of tables: Type1 x Type2
> for each of the important basic operations. The tables are static so it
> won't be much of a problem. At least int + int would then be
> 1 memory read, 1 call, 2 (cast as needed) reads & 1 store.

I think that's not worth it, but hey, worth a try.


Thanks,

Andrei


July 30, 2012
On 7/29/12 11:20 AM, Gor Gyolchanyan wrote:
> On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
>     This should be more relevant then:
>
>     //fib.d
>     import std.datetime, std.stdio, std.variant;
>
>     auto fib(Int)()
>     {
>              Int a = 1, b = 1;
>              for(size_t i=0; i<100; i++){
>                      Int c = a + b;
>                      a = b;
>                      b = c;
>              }
>              return a;
>     }
>
>     void main()
>     {
>              writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
>     }
>
>
>     dmd -O -inline -release fib.d
>
>     Output:
>
>     [TickDuration(197), TickDuration(276), TickDuration(93370107)]
>
>     I'm horrified. Who was working on std.variant enhancements? Please
>     chime in.
>
>     --
>     Dmitry Olshansky
>
>
> Thank you for demonstrating my point. :-)

I don't think he demonstrated your point (even leaving aside that the benchmark above is also flawed). I don't think there was a point, unless it was you wanted to vent and looked for a pretext - which is, by the way, entirely reasonable.

You mentioned you need "a very fast typeless storage with maximum performance and type safety." Well, if it's typeless then it means you're using it for storage, not for computationally-intensive operations like those in the code above. So what you'd presumably want to test is the speed of data storage and retrieval. A loop that does a bunch of adds on Variant does not go in that direction.

Benchmarks are notoriously hard to get right. You need to think of reasonable baselines - if you want to use Variant for typeless storage, what is your vanilla implementation, the "obvious contender"? What are the primitives whose speed is important? Then you need to make sure you subtract noise from all your comparisons. Then you need to figure out whether the issue is with the design or with the implementation of Variant. In the former case, maybe Variant isn't for you. In the latter, bug reports are always welcome.
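
For instance, a storage-oriented micro-benchmark with an obvious baseline might look something like this (a rough sketch with arbitrary sizes, not a recommendation):

import std.datetime, std.stdio, std.variant;

enum N = 100_000;

// store ints into Variant slots and read them back
long viaVariant()
{
    Variant[16] slots;
    long sum = 0;
    foreach (i; 0 .. N)
    {
        slots[i & 15] = i;                 // storage
        sum += slots[i & 15].get!int;      // retrieval
    }
    return sum;
}

// the "obvious contender": plain typed storage, same access pattern
long viaBaseline()
{
    int[16] slots;
    long sum = 0;
    foreach (i; 0 .. N)
    {
        slots[i & 15] = i;
        sum += slots[i & 15];
    }
    return sum;
}

void main()
{
    writeln(benchmark!(viaBaseline, viaVariant)(10));
}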


Andrei
July 30, 2012
On Mon, Jul 30, 2012 at 6:07 AM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 7/29/12 11:20 AM, Gor Gyolchanyan wrote:
>
>> On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
>>     This should be more relevant then:
>>
>>     //fib.d
>>     import std.datetime, std.stdio, std.variant;
>>
>>     auto fib(Int)()
>>     {
>>              Int a = 1, b = 1;
>>              for(size_t i=0; i<100; i++){
>>                      Int c = a + b;
>>                      a = b;
>>                      b = c;
>>              }
>>              return a;
>>     }
>>
>>     void main()
>>     {
>>              writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
>>     }
>>
>>
>>     dmd -O -inline -release fib.d
>>
>>     Output:
>>
>>     [TickDuration(197), TickDuration(276), TickDuration(93370107)]
>>
>>     I'm horrified. Who was working on std.variant enhancements? Please
>>     chime in.
>>
>>     --
>>     Dmitry Olshansky
>>
>>
>> Thank you for demonstrating my point. :-)
>>
>
> I don't think he demonstrated your point (even leaving aside that the benchmark above is also flawed). I don't think there was a point, unless it was you wanted to vent and looked for a pretext - which is, by the way, entirely reasonable.
>
> You mentioned you need "a very fast typeless storage with maximum performance and type safety." Well, if it's typeless then it means you're using it for storage, not for computationally-intensive operations like those in the code above. So what you'd presumably want to test is the speed of data storage and retrieval. A loop that does a bunch of adds on Variant does not go in that direction.
>
> Benchmarks are notoriously hard to get right. You need to think of reasonable baselines - if you want to use Variant for typeless storage, what is your vanilla implementation, the "obvious contender"? What are the primitives whose speed is important? Then you need to make sure you subtract noise from all your comparisons. Then you need to figure out whether the issue is with the design or with the implementation of Variant. In the former case, maybe Variant isn't for you. In the latter, bug reports are always welcome.
>
>
> Andrei
>

You're right. I'm trying to make a subject-oriented system, where there are nodes and accessors and each accessor has access to a specific part of the node. I must have the data stored in a typeless manner and cast it to the appropriate type (depending on the accessor). This is all necessary for my very flexible graph data structure.

-- 
Bye,
Gor Gyolchanyan.