February 15, 2008
downs wrote:
> downs wrote:
>> I rewrote my version for freestanding functions .. 9.5s :confused: Why do struct members (which are inlined, I checked) take such a speed hit?
>>
> 
> My version had a bug. x__X
> 
> The correct version takes 11.2s again.
> 
>  --downs

If I fix the bug, the 'external function' version is exactly as fast as the opFoo version.

Sorry.

I think the 8s version posted earlier has a similar bug.

Look at the output. :)

 -- downs
February 15, 2008
I've been playing around with the 8-9s version posted earlier.

The problem seems to lie in ray_sphere.

Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s and produces correct output once the printf at the bottom has been fixed,
but Vec v = center - ray.orig; runs in 11.1s.
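
For anyone following along, here is a minimal sketch of the two forms being compared. It is not the benchmark's actual code: the Vec layout and the signature of Vec.sub are assumptions, and it is written for a current D2 compiler, where the operator is spelled opBinary!"-" rather than the opSub used in this thread.

```d
import std.stdio;

// Hypothetical stand-in for the ray tracer's vector type.
struct Vec
{
    double x = 0, y = 0, z = 0;

    // Operator form: returns the difference by value.
    Vec opBinary(string op)(const Vec rhs) const
        if (op == "-")
    {
        return Vec(x - rhs.x, y - rhs.y, z - rhs.z);
    }

    // Out-parameter form: writes the difference into an existing Vec.
    // (Assumed signature; the original post only shows the call site.)
    static void sub(ref const(Vec) a, ref const(Vec) b, ref Vec result)
    {
        result.x = a.x - b.x;
        result.y = a.y - b.y;
        result.z = a.z - b.z;
    }
}

void main()
{
    auto center = Vec(1, 2, 3);
    auto orig = Vec(0.5, 0.5, 0.5);

    // Operator form (the 11.1s case in the post above).
    Vec v1 = center - orig;

    // Out-parameter form (the 8.8s case). "= void" skips default
    // initialization, which is fine because sub fills every field.
    Vec v2 = void;
    Vec.sub(center, orig, v2);

    assert(v1 == v2);
    writeln(v1.x, " ", v1.y, " ", v1.z);
}
```

Both forms compute the same value, so any timing difference has to come from the code the compiler generates for them.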

Still investigating why this happens.

 --downs
February 15, 2008
downs wrote:
> I've been playing around with the 8-9s version posted earlier.
> 
> The problem seems to lie in ray_sphere.
> 
> Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing a correct output once the printf at the bottom has been fixed,
> but Vec v = center - ray.orig; runs in 11.1s.
> 
> Still investigating why this happens.
> 
>  --downs

Okay, found the cause, if not the reason, by looking at the assembler output.

For some reason, the bad case, although inlined, stores its values back into memory. The fast case keeps working with them without writing them out.

Here's the disassembly for ray_sphere for both cases:

slow (opSub)

http://paste.dprogramming.com/dpcds3p3

fast

http://paste.dprogramming.com/dpd6pi8n

So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?

 --downs
February 15, 2008
downs wrote:
>> Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing a correct output once the printf at the bottom has been fixed,
>> but Vec v = center - ray.orig; runs in 11.1s.
> 
> For some reason, the bad case, although inlined, stores its values back into memory. The fast case keeps working with them.
> 
> So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?

Hey good deal on figuring this out!  It's good to know, especially for those of us using D for real-time simulation type stuff.

Is there really a GDC that compiles against gcc >= 4.2?!
February 15, 2008
Tim Burrell wrote:
> downs wrote:
>>> Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing a correct output once the printf at the bottom has been fixed,
>>> but Vec v = center - ray.orig; runs in 11.1s.
>> For some reason, the bad case, although inlined, stores its values back into memory. The fast case keeps working with them.
>>
>> So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?
> 
> Hey good deal on figuring this out!  It's good to know, especially for those of us using D for real-time simulation type stuff.
> 
> Is there really a GDC that compiles against gcc >= 4.2?!

I'm not sure; I remember somebody saying he'd managed to build it. And there's a post on d.gnu from somebody saying he'd gotten it to work, although he couldn't build phobos.

Since GDC seems to be .. inert at the moment, it'd probably be up to some volunteer effort to upgrade it to 4.[23]. That, or get llvmdc up to speed.

I am, of course, mostly clueless about both compilers myself. :/

 --downs
February 15, 2008
With a little bit of commenting, this could be an excellent tutorial.


February 15, 2008
downs wrote:
> Tim Burrell wrote:
>> downs wrote:
>>>> Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing a correct output once the printf at the bottom has been fixed,
>>>> but Vec v = center - ray.orig; runs in 11.1s.
>>> For some reason, the bad case, although inlined, stores its values back into memory. The fast case keeps working with them.
>>>
>>> So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?
>> 
>> Hey good deal on figuring this out!  It's good to know, especially for those of us using D for real-time simulation type stuff.
>> 
>> Is there really a GDC that compiles against gcc >= 4.2?!
> 
> I'm not sure; I remember somebody saying he'd managed to build it. And there's a post on d.gnu from somebody saying he'd gotten it to work, although he couldn't build phobos.
> 
> Since GDC seems to be .. inert at the moment, it'd probably be up to some volunteer effort to upgrade it to 4.[23]. That, or get llvmdc up to speed.
> 
> I am, of course, mostly clueless about both compilers myself. :/

I notice that the Ubuntu team appears to have a working 4.2-based gdc that, according to the changelog, also works with 4.3:

http://packages.ubuntu.com/hardy/devel/gdc-4.2

Changelog is here: http://changelogs.ubuntu.com/changelogs/pool/universe/g/gdc-4.2/gdc-4.2_0.25-4.2.3-0ubuntu1/changelog

It'd be really nice to see a new gdc release!

I wonder if David even knows about these patches!?
February 15, 2008
downs wrote:
> Here's the disassembly for ray_sphere for both cases:
> 
> slow (opSub)
> 
> http://paste.dprogramming.com/dpcds3p3
> 
> fast
> 
> http://paste.dprogramming.com/dpd6pi8n
> 
> So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?
> 
>  --downs

Especially interesting to note (slow case):

    fstpl    -24(%ebp)
[...]
    movl    -24(%ebp), %eax
    movl    %eax, -48(%ebp)
    movl    -20(%ebp), %eax
    movl    %eax, -44(%ebp)

Translation:
	Store floating-point number to ebp[-24]. No, wait, move it to ebp[-48].

This indicates a pretty serious problem with optimization, since the whole thing is basically redundant.

The "fast" version doesn't have any memory writes at all during the computation.

 --downs
February 15, 2008
downs wrote:
> Especially interesting to note (slow case):
> 
>     fstpl    -24(%ebp)
> [...]
>     movl    -24(%ebp), %eax
>     movl    %eax, -48(%ebp)
>     movl    -20(%ebp), %eax
>     movl    %eax, -44(%ebp)
> 
> Translation:
> 	Store floating-point number to ebp[-24]. No, wait, move it to ebp[-48].

I left something out.

    fstpl    -24(%ebp)
[...]
    movl    -24(%ebp), %eax
    movl    %eax, -48(%ebp)
    movl    -20(%ebp), %eax
    movl    %eax, -44(%ebp)
[...]
    fldl    -48(%ebp)


So, the whole thing comes down to "Store FP number to memory. No wait, move it somewhere else! No wait, read it back!"

No wonder it's slow.
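
To spell out why there are two movl pairs: fstpl stores an 8-byte double, while each movl through %eax moves only 4 bytes, so copying the value from -24(%ebp) to -48(%ebp) takes two word-sized moves. A small sketch of the same copy in D (current D2 syntax; the union and function names are made up for illustration):

```d
import std.stdio;

// View the same 8 bytes either as one double or as two 32-bit words.
union Bits
{
    double value;
    uint[2] words;
}

// Copy a double the way the slow case does: one 32-bit word at a time.
double copyViaTwoWords(double src)
{
    Bits from, to;
    from.value = src;              // fstpl -24(%ebp)
    to.words[0] = from.words[0];   // movl -24(%ebp), %eax ; movl %eax, -48(%ebp)
    to.words[1] = from.words[1];   // movl -20(%ebp), %eax ; movl %eax, -44(%ebp)
    return to.value;               // fldl -48(%ebp)
}

void main()
{
    assert(copyViaTwoWords(3.25) == 3.25);
    writeln(copyViaTwoWords(3.25));
}
```

The value that comes back is bit-for-bit the one that was stored, which is why the round trip through memory adds cost without changing the result.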
February 15, 2008
downs wrote:
> No difference. But then why the obvious speed difference? Color me confused ._.

Test to see if the stack is aligned, i.e. if the doubles start on 16 byte address boundaries.
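
One rough way to run that check, as a sketch: print the address of a local double (and of a Vec-like struct, whose layout is assumed here) and see whether it is a multiple of 8 or 16. The helper names isAligned and report are made up for illustration, and the code targets a current D2 compiler.

```d
import std.stdio;

// Hypothetical stand-in for the ray tracer's vector type.
struct Vec { double x = 0, y = 0, z = 0; }

// True if the address p is a multiple of boundary bytes.
bool isAligned(const(void)* p, size_t boundary)
{
    return (cast(size_t) p) % boundary == 0;
}

// Print one address together with its 8- and 16-byte alignment status.
void report(string name, const(void)* p)
{
    writefln("%s at %s: 8-byte aligned = %s, 16-byte aligned = %s",
             name, p, isAligned(p, 8), isAligned(p, 16));
}

void main()
{
    double d = 0;
    Vec v;
    report("double", &d);
    report("Vec", &v);
}
```

The results depend on the compiler, flags, and platform, so it is worth running this with the same build settings as the benchmark.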