April 01, 2007
On Sun, 1 Apr 2007 11:43:58 -0400, "Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote:

>"Max Samukha" <samukha@voliacable.com> wrote in message news:dopu039g3u5dtb0ubovm8c7ach304bdbb7@4ax.com...
>> Note that the mixin ctor is slower than a manually coded one because of the loop. To improve performance, you could skip the struct initialization:
>
>That loop is unrolled at compile time since it's iterating over a tuple.

I thought it should, too. But when tested on Windows with dmd 1.010, the tuple version is significantly slower. I'm still not sure why.
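
(For reference, the mixin ctor in question looks like this; I'll post the full test program later in the thread:)

template StructCtor()
{
    static typeof(*this) opCall(typeof(typeof(*this).tupleof) args)
    {
        typeof(*this) t = void;

        foreach(i, arg; args)
            t.tupleof[i] = arg;

        return t;
    }
}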

>It'll be just as fast as writing the initialization out line-by-line.  But using =void is another nice optimization :)
>

April 01, 2007
"Max Samukha" <samukha@voliacable.com> wrote in message news:nmmv03h5g5mtbkn6hetbnu33ei6nnnhd67@4ax.com...
> I thought it should, too. But when tested on Windows with dmd 1.010, the tuple version is significantly slower. I'm still not sure why.

Ahh, looking at the disassembly it makes sense now.  What happens is that when you write:

foreach(i, arg; args)
    t.tupleof[i] = arg;

It gets turned into something like _this_:

typeof(args[0]) arg0 = args[0];
t.tupleof[0] = arg0;
typeof(args[1]) arg1 = args[1];
t.tupleof[1] = arg1;
typeof(args[2]) arg2 = args[2];
t.tupleof[2] = arg2;

Notice it copies the argument value into a temp variable, then that temp variable into the struct.  Very inefficient.

Unfortunately I don't know of any way to get around this...  I was hoping

t.tupleof[] = args[];

would work, but it's illegal.

A static for loop would be nice.
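
One idea that might dodge the temporary (untested, and I'm not sure dmd 1.010 even accepts ref iteration over a tuple) is a foreach by reference, so the loop variable aliases the argument instead of copying it:

foreach(i, ref arg; args)
    t.tupleof[i] = arg;  // arg aliases args[i], so no extra copy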


April 01, 2007
Jarrett Billingsley wrote:
> "Max Samukha" <samukha@voliacable.com> wrote in message news:nmmv03h5g5mtbkn6hetbnu33ei6nnnhd67@4ax.com...
>> I thought it should, too. But when tested on Windows with dmd 1.010,
>> the tuple version is significantly slower. I'm still not sure why.
> 
> Ahh, looking at the disassembly it makes sense now.  What happens is that when you write:
> 
> foreach(i, arg; args)
>     t.tupleof[i] = arg;
> 
> It gets turned into something like _this_:
> 
> typeof(args[0]) arg0 = args[0];
> t.tupleof[0] = arg0;
> typeof(args[1]) arg1 = args[1];
> t.tupleof[1] = arg1;
> typeof(args[2]) arg2 = args[2];
> t.tupleof[2] = arg2;
> 
> Notice it copies the argument value into a temp variable, then that temp variable into the struct.  Very inefficient.
> 
> Unfortunately I don't know of any way to get around this..

Yes, DMD does that, *unless you turn on optimizations* ;).
Measuring performance without optimization switches is pretty much useless.
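(That is, build with something like "dmd -O -inline -release test.d" rather than a plain "dmd test.d".)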

With optimizations it just moves mem->reg, reg->mem. It generates code bit-for-bit identical to:
---
    static S opCall(int x_, float y_, char[] z_) {
	    S s = void;
	    s.x = x_;
	    s.y = y_;
	    s.z = z_;
	    return s;
    }
---
for the version Max posted (with =void)

(The only difference is the mangled name; the mixin name is in there for the mixed-in version)
April 01, 2007
On Sun, 01 Apr 2007 22:19:03 +0200, Frits van Bommel <fvbommel@REMwOVExCAPSs.nl> wrote:

>Jarrett Billingsley wrote:
>> "Max Samukha" <samukha@voliacable.com> wrote in message news:nmmv03h5g5mtbkn6hetbnu33ei6nnnhd67@4ax.com...
>>> I thought it should, too. But when tested on Windows with dmd 1.010, the tuple version is significantly slower. I'm still not sure why.
>> 
>> Ahh, looking at the disassembly it makes sense now.  What happens is that when you write:
>> 
>> foreach(i, arg; args)
>>     t.tupleof[i] = arg;
>> 
>> It gets turned into something like _this_:
>> 
>> typeof(args[0]) arg0 = args[0];
>> t.tupleof[0] = arg0;
>> typeof(args[1]) arg1 = args[1];
>> t.tupleof[1] = arg1;
>> typeof(args[2]) arg2 = args[2];
>> t.tupleof[2] = arg2;
>> 
>> Notice it copies the argument value into a temp variable, then that temp variable into the struct.  Very inefficient.
>> 
>> Unfortunately I don't know of any way to get around this..
>
>Yes, DMD does that, *unless you turn on optimizations* ;).
>Measuring performance without optimization switches is pretty much useless.
>
>With optimizations it just moves mem->reg, reg->mem. It generates code bit-for-bit identical to:
>---
>     static S opCall(int x_, float y_, char[] z_) {
>	    S s = void;
>	    s.x = x_;
>	    s.y = y_;
>	    s.z = z_;
>	    return s;
>     }
>---
>for the version Max posted (with =void)
>
>(The only difference is the mangled name; the mixin name is in there for the mixed-in version)
When compiling on Win XP with dmd 1.010 using -O -inline -release, the time difference is more than 40%. The source is this:

import std.stdio;
import std.c.time;

template StructCtor()
{
    static typeof(*this) opCall(typeof(typeof(*this).tupleof) args)
    {
        typeof(*this) t = void;

        foreach(i, arg; args)
            t.tupleof[i] = arg;

        return t;
    }
}


struct Bar
{
    int x;
    int y;
    int z;

    //mixin StructCtor;

    static Bar opCall(int x, int y, int z)
    {
        Bar result = void;
        result.x = x;
        result.y = y;
        result.z = z;
        return result;
    }
}

void main()
{
    auto c = clock();
    for (int i = 0; i < 100000000; i++)
    {
        auto test = Bar(i, i, i);
    }
    writefln(clock() - c);
}

What am I doing wrong?




April 01, 2007
Max Samukha wrote:
> On Sun, 01 Apr 2007 22:19:03 +0200, Frits van Bommel
> <fvbommel@REMwOVExCAPSs.nl> wrote:
[snip]
>> With optimizations it just moves mem->reg, reg->mem. It generates code bit-for-bit identical to:
>> ---
>>     static S opCall(int x_, float y_, char[] z_) {
>> 	    S s = void;
>> 	    s.x = x_;
>> 	    s.y = y_;
>> 	    s.z = z_;
>> 	    return s;
>>     }
>> ---
>> for the version Max posted (with =void)
>>
>> (The only difference is the mangled name; the mixin name is in there for the mixed-in version)
> When compiling on Win XP with dmd 1.010 using -O -inline -release, the
> time difference is more than 40%. The source is this:
> 
[snip]
> 
> What am I doing wrong?

For one thing, clock() isn't exactly accurate. Also, you didn't mention how many times you ran the test; the first run will likely take longer because the program has to be loaded into cache first, so it's best to run it a couple of times before looking at the results.
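
For what it's worth, here's a sketch of a slightly more careful harness: repeat the measurement a few times and keep the best run, so program load and cache warm-up matter less. (Bar is trimmed to your hand-written opCall; 5 runs is an arbitrary choice.)
---
import std.stdio;
import std.c.time;

struct Bar
{
    int x;
    int y;
    int z;

    static Bar opCall(int x, int y, int z)
    {
        Bar result = void;
        result.x = x;
        result.y = y;
        result.z = z;
        return result;
    }
}

void main()
{
    clock_t best = clock_t.max;

    for (int run = 0; run < 5; run++)
    {
        auto c = clock();
        for (int i = 0; i < 100000000; i++)
        {
            auto test = Bar(i, i, i);
        }
        auto elapsed = clock() - c;
        if (elapsed < best)
            best = elapsed; // keep only the fastest run
    }

    writefln(best);
}
---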

However, those don't seem to be the issue here.
Looking at the generated assembly I see that while the functions compile to the exact same thing (as I mentioned in my previous post), DMD doesn't seem to inline the mixed-in version :(...
(You can also see this without inspecting the generated code: if you leave off -inline from the command line the two versions take the same amount of time, at least on my computer)

So it would seem there's no way to get the mixed-in version up to the same speed, simply because it won't be inlined by DMD...

Note: GDC (with -O3 -finline) doesn't seem to have this problem. In fact, I had to add some code so it doesn't optimize out the entire loop :P. Even then, the code seems to be identical and (unsurprisingly) runs just as fast.


P.S. I performed these tests on Linux (amd64). Another fun fact: the GDC-compiled version ran about twice as fast as the fastest DMD-compiled one. I think that my GDC being set up to generate 64-bit code may have had something to do with this though, so it's not really a fair comparison of the optimizers in the compilers. (Unless you count generating 64-bit code for 64-bit processors as an optimization ;) )
April 02, 2007
On Mon, 02 Apr 2007 01:24:03 +0200, Frits van Bommel <fvbommel@REMwOVExCAPSs.nl> wrote:

>Max Samukha wrote:
>> On Sun, 01 Apr 2007 22:19:03 +0200, Frits van Bommel <fvbommel@REMwOVExCAPSs.nl> wrote:
>[snip]
>>> With optimizations it just moves mem->reg, reg->mem. It generates code bit-for-bit identical to:
>>> ---
>>>     static S opCall(int x_, float y_, char[] z_) {
>>> 	    S s = void;
>>> 	    s.x = x_;
>>> 	    s.y = y_;
>>> 	    s.z = z_;
>>> 	    return s;
>>>     }
>>> ---
>>> for the version Max posted (with =void)
>>>
>>> (The only difference is the mangled name; the mixin name is in there for the mixed-in version)
>> When compiling on Win XP with dmd 1.010 using -O -inline -release, the time difference is more than 40%. The source is this:
>> 
>[snip]
>> 
>> What am I doing wrong?
>
>For one thing, clock() isn't exactly accurate. Also, you didn't mention how many times you ran the test; the first run will likely take longer because the program has to be loaded into cache first, so it's best to run it a couple of times before looking at the results.

I was running it over and over again. You are right, of course. I shouldn't have run those silly speed tests at all. The disassembly is a D programmer's friend :).
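(On Windows, Digital Mars' obj2asm makes that easy: compile with -c, then something like "obj2asm bar.obj" dumps the generated code.)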
>
>However, those don't seem to be the issue here.
>Looking at the generated assembly I see that while the functions compile
>to the exact same thing (as I mentioned in my previous post), DMD
>doesn't seem to inline the mixed-in version :(...
>(You can also see this without inspecting the generated code: if you
>leave off -inline from the command line the two versions take the same
>amount of time, at least on my computer)
>
>So it would seem there's no way to get the mixed-in version up to the same speed, simply because it won't be inlined by DMD...
>
>Note: GDC (with -O3 -finline) doesn't seem to have this problem. In fact, I had to add some code so it doesn't optimize out the entire loop :P. Even then, the code seems to be identical and (unsurprisingly) runs just as fast.
>
>
>P.S. I performed these tests on Linux (amd64). Another fun fact: the GDC-compiled version ran about twice as fast as the fastest DMD-compiled one. I think that my GDC being set up to generate 64-bit code may have had something to do with this though, so it's not really a fair comparison of the optimizers in the compilers. (Unless you count generating 64-bit code for 64-bit processors as an optimization ;) )

It seems like dmd is not going to support 64-bit processors in the foreseeable future (stdio super-performance seems to be the priority).