Thread overview
DMD2 out parameters
Dec 23, 2010
Pete
Dec 23, 2010
Pete
Dec 23, 2010
Pete
Dec 23, 2010
Don
Dec 23, 2010
Pete
Dec 23, 2010
Johann MacDonagh
December 23, 2010
Hi,

I'm not sure if this is already a widely known phenomenon but I ran across a little gotcha yesterday regarding floating point out parameters using DMD2.

A year or so ago I wrote a ray tracer using DMD1. A few months ago I tried compiling and running it using DMD2. It was 50% slower. This disappointed me so much that I stopped using D2 until about a week ago. I spent a few hours yesterday investigating why the D2 version of the code was so much worse than the D1 version. After some head scratching and use of -profile and objconv, I eventually managed to isolate the problem. It boiled down to this example:

float f;
func(f);

void func(out float ff) {
    ff = 1;
}

This use of 'out' causes func to execute in around 250 ticks on DMD2. Change 'out' to 'ref' and it takes around 10 ticks (the same time the 'out' version takes on DMD1). If you initialise f to 0 before calling func then it all works quickly again, which makes me wonder whether it's some strange DMD2 NaN/FPU exception quirk that may be documented somewhere. When I looked at the generated assembly I saw that both DMD1 and DMD2 seem to generate the same thing (using -O -inline -release):

func  LABEL NEAR
push    ebp
mov     ebp, esp
push    eax       // eax = ptr to ff
fld     dword ptr [_nan]
fstp    dword ptr [eax]
fld     dword ptr [_one]
fstp    dword ptr [eax]

mov     esp, ebp
pop     ebp
ret

Now this code looks ok if you ignore the fact that 'ff' is being written to twice, and the strange, seemingly redundant push of EAX.

Has anyone else come across this and if so is it a bug? I'm also interested in people's thoughts on the strange code gen.

My D2 version is now running faster than the old D1 version by the way :)

Regards,
Pete.
December 23, 2010
> If you initialise f to 0 before calling func then it all works quickly again

Actually I think this is a red herring. I don't think initialising f helps.
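
A minimal sketch of why initialising f in the caller would not be expected to matter: D resets an 'out' parameter to its type's .init value (NaN for float) on entry to the function, whatever the caller stored there. The function names nanOnEntryOut and nanOnEntryRef are hypothetical, made up for this illustration.

```d
import std.math : isNaN;

// Hypothetical helper: reports whether x was NaN on entry, then assigns 1.
// With 'out', x is reset to float.init (a NaN) before the body runs.
bool nanOnEntryOut(out float x)
{
    bool wasNaN = isNaN(x);
    x = 1;
    return wasNaN;
}

// Same body, but 'ref' leaves the caller's value untouched on entry.
bool nanOnEntryRef(ref float x)
{
    bool wasNaN = isNaN(x);
    x = 1;
    return wasNaN;
}

void main()
{
    float a = 0; // explicitly initialised by the caller
    float b = 0; // explicitly initialised by the caller

    assert(!nanOnEntryRef(a)); // 'ref' sees the caller's 0
    assert(nanOnEntryOut(b));  // 'out' sees NaN despite the caller's 0
    assert(a == 1 && b == 1);  // both end up holding 1
}
```

So the NaN store happens inside the callee either way, which matches the fld/fstp pair at the top of func in the listing above.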
December 23, 2010
Ok, I've done some more investigating and it appears that in DMD2 a float NaN is 0x7FE00000 (in dword format), but when it initialises a float 'out' parameter it initialises it with 0x7FA00000. This causes an FPU trap, which is where the time is going. This looks like a bug to me. Can anyone confirm?

Thanks.
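
For anyone wanting to check the two bit patterns, here is a small sketch that decodes them as IEEE 754 single-precision values. On x86 the top mantissa bit (0x00400000) is the quiet bit, so a NaN with that bit clear is a signalling NaN. The helper names isNaNBits and isSignallingBits are hypothetical, and whether the signalling pattern is the actual cause of the slowdown is an assumption, not something this code demonstrates.

```d
import std.stdio : writefln;

// Hypothetical helper: true if bits encode a single-precision NaN
// (exponent field all ones, mantissa non-zero).
bool isNaNBits(uint bits)
{
    return (bits & 0x7F80_0000) == 0x7F80_0000
        && (bits & 0x007F_FFFF) != 0;
}

// Hypothetical helper: true if bits encode a NaN whose quiet bit
// (top mantissa bit) is clear, i.e. a signalling NaN on x86.
bool isSignallingBits(uint bits)
{
    return isNaNBits(bits) && (bits & 0x0040_0000) == 0;
}

void main()
{
    // The two patterns quoted in this thread.
    foreach (bits; [0x7FE0_0000u, 0x7FA0_0000u])
        writefln("0x%08X: %s NaN", bits,
                 isSignallingBits(bits) ? "signalling" : "quiet");
}
```

This reports 0x7FE00000 as a quiet NaN and 0x7FA00000 as a signalling NaN.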
December 23, 2010
Pete wrote:
> Ok, i've done some more investigating and it appears that in DMD2 a float NaN is
> 0x7FE00000 (in dword format) but when it initialises a float 'out' parameter it
> initialises it with 0x7FA00000H. This causes an FPU trap which is where the time
> is going. This looks like a bug to me. Can anyone confirm?
> 
> Thanks.

Yes, it sounds like a NaN-related performance issue. Note, though, that the slowdown you experience is processor-model specific. It's a penalty of ~250 cycles on a Pentium 4 with x87 instructions, but zero cycles on many other processors. (In fact, it's also zero cycles with SSE on Pentium 4!)
December 23, 2010
On 12/23/2010 12:19 PM, Pete wrote:
> Ok, i've done some more investigating and it appears that in DMD2 a float NaN is
> 0x7FE00000 (in dword format) but when it initialises a float 'out' parameter it
> initialises it with 0x7FA00000H. This causes an FPU trap which is where the time
> is going. This looks like a bug to me. Can anyone confirm?
>
> Thanks.

I just did a test with DMD 2.051 on Linux:

void F1(ref float a)
{
        a++;
}

void F2(out float a)
{
        a++;
}

void main()
{
        float a;
        float b;

        F1(a);
        F2(b);
}

And ASM:

080490e4 <_D3out2F1FKfZv>:
 80490e4:       55                      push   ebp
 80490e5:       8b ec                   mov    ebp,esp
 80490e7:       83 ec 04                sub    esp,0x4
 80490ea:       d9 e8                   fld1
 80490ec:       d8 00                   fadd   DWORD PTR [eax]
 80490ee:       d9 18                   fstp   DWORD PTR [eax]
 80490f0:       c9                      leave
 80490f1:       c3                      ret
 80490f2:       90                      nop
 80490f3:       90                      nop

080490f4 <_D3out2F2FJfZv>:
 80490f4:       55                      push   ebp
 80490f5:       8b ec                   mov    ebp,esp
 80490f7:       83 ec 04                sub    esp,0x4
 80490fa:       d9 05 00 81 05 08       fld    DWORD PTR ds:0x8058100
 8049100:       d9 18                   fstp   DWORD PTR [eax]
 8049102:       d9 e8                   fld1
 8049104:       d8 00                   fadd   DWORD PTR [eax]
 8049106:       d9 18                   fstp   DWORD PTR [eax]
 8049108:       c9                      leave
 8049109:       c3                      ret
 804910a:       90                      nop
 804910b:       90                      nop

0804910c <_Dmain>:
 804910c:       55                      push   ebp
 804910d:       8b ec                   mov    ebp,esp
 804910f:       83 ec 08                sub    esp,0x8
 8049112:       d9 05 00 81 05 08       fld    DWORD PTR ds:0x8058100
 8049118:       d9 5d f8                fstp   DWORD PTR [ebp-0x8]
 804911b:       d9 05 00 81 05 08       fld    DWORD PTR ds:0x8058100
 8049121:       d9 5d fc                fstp   DWORD PTR [ebp-0x4]
 8049124:       8d 45 f8                lea    eax,[ebp-0x8]
 8049127:       e8 b8 ff ff ff          call   80490e4 <_D3out2F1FKfZv>
 804912c:       8d 45 fc                lea    eax,[ebp-0x4]
 804912f:       e8 c0 ff ff ff          call   80490f4 <_D3out2F2FJfZv>
 8049134:       31 c0                   xor    eax,eax
 8049136:       c9                      leave
 8049137:       c3                      ret

And the value stored at address 0x8058100 is 0x7FA00000. As you can see, 'out' doesn't force the loading and storing of a different NaN value.

Of course, maybe the compiler should skip initializing a float that gets passed into a routine as an out parameter as its first use. E.g.

float a;
a = 1.0;

wouldn't generate two separate assignments.
December 23, 2010
I noticed this on an Intel Core 2. I skipped the Pentium 4 generation :)