Operator overloading leads to bad code optimization
December 03

Just a simple function to split a bezier in two.

Using "-O3":

LDC: the operator version is 84 instructions.
LDC: the hand-expanded math is 49 instructions.

It seems something as simple as this should be better optimised? Or am I missing something?

https://godbolt.org/z/4h9vob3Yo

In fact, there are quite a few places where it looks like completely redundant code is left in, e.g.:

movss dword ptr [rsp - 24], xmm1
movss xmm0, dword ptr [rip + .LCPI4_0]
mulss xmm1, xmm0
movss dword ptr [rsp - 24], xmm1

movss dword ptr [rsp - 24], xmm2
mulss xmm2, xmm0
movss dword ptr [rsp - 24], xmm2
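
For readers without the godbolt link open, here is a minimal sketch of the kind of code being compared. It is not the code behind the link: the `Point` struct, the cubic curve with four control points, the split at t = 0.5, and the bodies of `split` and `split2` are all assumptions made for illustration. Only the names `split` and `split2` come from the thread. In particular, this `split2` keeps the same add-then-scale order per component, which may not match the original.

```d
import std.stdio : writeln;

// Hypothetical 2D point type; an assumption, not the code behind the link.
struct Point
{
    float x, y;

    // Component-wise addition: a + b.
    Point opBinary(string op : "+")(Point rhs) const
    {
        return Point(x + rhs.x, y + rhs.y);
    }

    // Uniform scaling: p * s.
    Point opBinary(string op : "*")(float s) const
    {
        return Point(x * s, y * s);
    }
}

// Operator version: de Casteljau split of a cubic at t = 0.5.
void split(const ref Point[4] c, out Point[4] left, out Point[4] right)
{
    auto p01 = (c[0] + c[1]) * 0.5f;
    auto p12 = (c[1] + c[2]) * 0.5f;
    auto p23 = (c[2] + c[3]) * 0.5f;
    auto p012 = (p01 + p12) * 0.5f;
    auto p123 = (p12 + p23) * 0.5f;
    auto mid = (p012 + p123) * 0.5f;

    left[0] = c[0]; left[1] = p01; left[2] = p012; left[3] = mid;
    right[0] = mid; right[1] = p123; right[2] = p23; right[3] = c[3];
}

// Hand-expanded version: the same steps written out per component.
void split2(const ref Point[4] c, out Point[4] left, out Point[4] right)
{
    float x01 = (c[0].x + c[1].x) * 0.5f, y01 = (c[0].y + c[1].y) * 0.5f;
    float x12 = (c[1].x + c[2].x) * 0.5f, y12 = (c[1].y + c[2].y) * 0.5f;
    float x23 = (c[2].x + c[3].x) * 0.5f, y23 = (c[2].y + c[3].y) * 0.5f;
    float x012 = (x01 + x12) * 0.5f, y012 = (y01 + y12) * 0.5f;
    float x123 = (x12 + x23) * 0.5f, y123 = (y12 + y23) * 0.5f;
    float xm = (x012 + x123) * 0.5f, ym = (y012 + y123) * 0.5f;

    left[0] = c[0]; left[1] = Point(x01, y01);
    left[2] = Point(x012, y012); left[3] = Point(xm, ym);
    right[0] = Point(xm, ym); right[1] = Point(x123, y123);
    right[2] = Point(x23, y23); right[3] = c[3];
}

void main()
{
    // Control points chosen so every intermediate value is exactly representable.
    Point[4] c = [Point(0, 0), Point(0, 4), Point(4, 4), Point(4, 0)];
    Point[4] l1, r1, l2, r2;
    split(c, l1, r1);
    split2(c, l2, r2);
    assert(l1 == l2 && r1 == r2); // both versions agree on this input
    writeln(l1, " ", r1);
}
```

Both functions do the same arithmetic here, so any difference in instruction count for a sketch like this would come from how the compiler handles the struct temporaries and the templated operators, not from the math itself.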

December 05

On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
> Just a simple function to split a bezier in two.
>
> Using "-O3":
>
> LDC: the operator version is 84 instructions.
> LDC: the hand-expanded math is 49 instructions.
>
> It seems something as simple as this should be better optimised? Or am I missing something?
>
> https://godbolt.org/z/4h9vob3Yo
>
> In fact, there are quite a few places where it looks like completely redundant code is left in, e.g.:
>
> movss dword ptr [rsp - 24], xmm1
> movss xmm0, dword ptr [rip + .LCPI4_0]
> mulss xmm1, xmm0
> movss dword ptr [rsp - 24], xmm1
>
> movss dword ptr [rsp - 24], xmm2
> mulss xmm2, xmm0
> movss dword ptr [rsp - 24], xmm2

This is (to me at least) an odd one. Maybe there's a pass-ordering issue here leading to bad code.

Seems like GCC does not have this issue.

December 05

On Sunday, 5 December 2021 at 21:38:55 UTC, max haughton wrote:
> On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
>> Just a simple function to split a bezier in two.
>>
>> Using "-O3":
>>
>> LDC: the operator version is 84 instructions.
>> LDC: the hand-expanded math is 49 instructions.
>>
>> It seems something as simple as this should be better optimised? Or am I missing something?
>>
>> https://godbolt.org/z/4h9vob3Yo
>> [...]
>
> [...]
> Seems like GCC does not have this issue.

With gdc v11.1, I count 69 instructions for split and 51 for split2 (59 with -O3). So I guess there's a semantic difference here with the slightly changed evaluation order (2D addition before scaling).

With alias Point = __vector(float[2]), split is reduced to 28 instructions: https://godbolt.org/z/7ffebjaz8

December 06

On Sunday, 5 December 2021 at 23:36:21 UTC, kinke wrote:
> On Sunday, 5 December 2021 at 21:38:55 UTC, max haughton wrote:
>> On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
>>> Just a simple function to split a bezier in two.
>>>
>>> Using "-O3":
>>>
>>> LDC: the operator version is 84 instructions.
>>> LDC: the hand-expanded math is 49 instructions.
>>>
>>> It seems something as simple as this should be better optimised? Or am I missing something?
>>>
>>> https://godbolt.org/z/4h9vob3Yo
>>> [...]
>>
>> [...]
>> Seems like GCC does not have this issue.
>
> With gdc v11.1, I count 69 instructions for split and 51 for split2 (59 with -O3). So I guess there's a semantic difference here with the slightly changed evaluation order (2D addition before scaling).

gdc v11.1 doesn't inline the operator calls when I try it. If you try an earlier version, 10.2, it does, which reduces it to 48 instructions.

> With alias Point = __vector(float[2]), split is reduced to 28 instructions: https://godbolt.org/z/7ffebjaz8

Wow, that's awesome!

December 06

On Monday, 6 December 2021 at 00:38:18 UTC, ClapTrap wrote:
> On Sunday, 5 December 2021 at 23:36:21 UTC, kinke wrote:
>> On Sunday, 5 December 2021 at 21:38:55 UTC, max haughton wrote:
>>> On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
>>>> Just a simple function to split a bezier in two.
>>>>
>>>> Using "-O3":
>>>>
>>>> LDC: the operator version is 84 instructions.
>>>> LDC: the hand-expanded math is 49 instructions.
>>>>
>>>> It seems something as simple as this should be better optimised? Or am I missing something?
>>>>
>>>> https://godbolt.org/z/4h9vob3Yo
>>>> [...]
>>>
>>> [...]
>>> Seems like GCC does not have this issue.
>>
>> With gdc v11.1, I count 69 instructions for split and 51 for split2 (59 with -O3). So I guess there's a semantic difference here with the slightly changed evaluation order (2D addition before scaling).
>
> gdc v11.1 doesn't inline the operator calls when I try it. If you try an earlier version, 10.2, it does, which reduces it to 48 instructions.
>
>> With alias Point = __vector(float[2]), split is reduced to 28 instructions: https://godbolt.org/z/7ffebjaz8
>
> Wow, that's awesome!

To make GCC inline properly without LTO you can use -fwhole-program.

Maybe Iain also has a flag that restores the old template behaviour.

These kinds of wacky phase-ordering (I assume) issues are why I am slightly distrustful of GDC post-inlining decision.

December 06

On Monday, 6 December 2021 at 00:41:06 UTC, max haughton wrote:
> These kinds of wacky phase-ordering (I assume) issues are why I am slightly distrustful of GDC post-inlining decision.

Multiplying by 0.5 only affects the exponent, but the add could overflow/underflow. Maybe that is wacky for D since it specifies IEEE compliance, but in the gcc family, -O# is really a shortcut for a set of options. If I specify -O or -O3 I would expect the same options as gcc. Otherwise people will claim that C++ is faster?
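
The point about evaluation order can be illustrated with a small sketch. The helper names `midAddFirst` and `midScaleFirst` are hypothetical, and the example uses `real` rather than `float` because the D spec allows intermediates to be evaluated at higher precision than the declared type, so a `float` version might not overflow on every compiler. Under strict IEEE semantics the two forms can give different results, which is one reason an optimizer may not be free to rewrite one into the other.

```d
import std.stdio : writeln;

// Hypothetical helper: add first, then scale (the "addition before scaling" order).
real midAddFirst(real a, real b)
{
    return (a + b) * 0.5L;
}

// Hypothetical helper: scale each operand first, then add.
real midScaleFirst(real a, real b)
{
    return a * 0.5L + b * 0.5L;
}

void main()
{
    real big = real.max;

    // a + b exceeds the largest finite value, so the sum is already infinite
    // before the multiply by 0.5 can bring it back into range.
    assert(midAddFirst(big, big) == real.infinity);

    // Halving only lowers the exponent, so each half is exact and the sum fits.
    assert(midScaleFirst(big, big) == real.max);

    writeln(midAddFirst(big, big), " vs ", midScaleFirst(big, big));
}
```

The same mismatch exists in the other direction near the bottom of the range, where scaling first can lose bits to subnormals that adding first would keep.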

December 08

On Monday, 6 December 2021 at 11:55:08 UTC, Ola Fosheim Grøstad wrote:
> On Monday, 6 December 2021 at 00:41:06 UTC, max haughton wrote:
>> These kinds of wacky phase-ordering (I assume) issues are why I am slightly distrustful of GDC post-inlining decision.
>
> Multiplying by 0.5 only affects the exponent, but the add could overflow/underflow. Maybe that is wacky for D since it specifies IEEE compliance, but in the gcc family, -O# is really a shortcut for a set of options. If I specify -O or -O3 I would expect the same options as gcc. Otherwise people will claim that C++ is faster?

My sentence was referring to Iain's decision to refuse to inline templates (i.e. defer to LTO). It makes it harder to work out what the compiler is going to do / is doing.