Performance issue with @fastmath and vectorization
dextorious | November 12, 2016
As part of slowly learning the basics of programming in D, I ported some of my fluid dynamics code from C++ to D and quickly noticed a rather severe performance degradation by a factor of 2-3x. I've narrowed it down to a simple representative benchmark of virtually identical C++ and D code.

The D version: http://pastebin.com/Rs9CUA5j
The C++ code:  http://pastebin.com/XzStHXA2

I compile the D code with the latest LDC beta release from GitHub, using the switches -release -O5 -mcpu=haswell -boundscheck=off. The C++ version is compiled with Clang 3.9.0 and the switches -std=c++14 -Ofast -fno-exceptions -fno-rtti -flto -ffast-math -march=native, which is my usual configuration for numerical code.

On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration while the D code takes 25ms. Comparing profiler output with the generated assembly code quickly reveals the reason - while Clang fully unrolls the inner loop and uses FMA instructions wherever possible, the inner loop assembly produced by LDC looks like this:

  0.24 │6c0:   vmovss (%r15,%rbp,4),%xmm4
  1.03 │       vmovss (%r12,%rbp,4),%xmm5
  3.51 │       add    $0x4,%rdi
  6.96 │       add    $0x4,%rax
  1.04 │6d4:   vmulss (%rax,%rcx,1),%xmm4,%xmm4
  4.66 │       vmulss (%rax,%rdx,1),%xmm5,%xmm5
  8.44 │       vaddss %xmm4,%xmm5,%xmm4
  1.09 │       vmulss %xmm0,%xmm4,%xmm5
  3.73 │       vmulss %xmm4,%xmm5,%xmm4
  7.48 │       vsubss %xmm3,%xmm4,%xmm4
  1.13 │       vmulss %xmm1,%xmm4,%xmm4
  2.00 │       vaddss %xmm2,%xmm5,%xmm5
  3.46 │       vmovss 0x0(%r13,%rbp,4),%xmm6
  7.85 │       vmulss (%rax,%rsi,1),%xmm6,%xmm6
  2.50 │       vaddss %xmm4,%xmm5,%xmm4
  6.49 │       vmulss %xmm4,%xmm6,%xmm4
 25.48 │       vmovss %xmm4,(%rdi)
  8.26 │       cmp    $0x20,%rax
  0.00 │     ↑ jne    6c0

Am I doing something blatantly wrong here or have I run into a compiler limitation? Is there anything short of using intrinsics or calling C/C++ code I can do here to get to performance parity?

Also, while on the subject, is there a way to force LDC to apply the relaxed floating point model to the entire program, rather than individual functions (the equivalent of --fast-math)?
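For reference, the per-function form I mean looks like this (a minimal sketch with a made-up kernel, not the pastebin code; the @fastmath attribute comes from LDC's ldc.attributes module):

import ldc.attributes;  // provides @fastmath

// Relaxed floating-point semantics apply only inside this annotated function.
@fastmath
void scaleAdd(float[] a, const float[] b) {   // illustrative kernel only
    foreach (i; 0 .. a.length)
        a[i] = a[i] * b[i] + 1.0f;
}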
rikki cattermole | November 12, 2016
On 12/11/2016 1:03 PM, dextorious wrote:
> As part of slowly learning the basics of programming in D, I ported some
> of my fluid dynamics code from C++ to D and quickly noticed a rather
> severe performance degradation by a factor of 2-3x. I've narrowed it
> down to a simple representative benchmark of virtually identical C++ and
> D code.
>
> The D version: http://pastebin.com/Rs9CUA5j
> The C++ code:  http://pastebin.com/XzStHXA2
>
> I compile the D code using the latest beta release on GitHub, using the
> compiler switches -release -O5 -mcpu=haswell -boundscheck=off. The C++
> version is compiled using Clang 3.9.0 with the switches -std=c++14
> -Ofast -fno-exceptions -fno-rtti -flto -ffast-math -march=native, which
> is my usual configuration for numerical code.
>
> On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration
> while the D code takes 25ms. Comparing profiler output with the
> generated assembly code quickly reveals the reason - while Clang fully
> unrolls the inner loop and uses FMA instructions wherever possible, the
> inner loop assembly produced by LDC looks like this:
>
>   0.24 │6c0:   vmovss (%r15,%rbp,4),%xmm4
>   1.03 │       vmovss (%r12,%rbp,4),%xmm5
>   3.51 │       add    $0x4,%rdi
>   6.96 │       add    $0x4,%rax
>   1.04 │6d4:   vmulss (%rax,%rcx,1),%xmm4,%xmm4
>   4.66 │       vmulss (%rax,%rdx,1),%xmm5,%xmm5
>   8.44 │       vaddss %xmm4,%xmm5,%xmm4
>   1.09 │       vmulss %xmm0,%xmm4,%xmm5
>   3.73 │       vmulss %xmm4,%xmm5,%xmm4
>   7.48 │       vsubss %xmm3,%xmm4,%xmm4
>   1.13 │       vmulss %xmm1,%xmm4,%xmm4
>   2.00 │       vaddss %xmm2,%xmm5,%xmm5
>   3.46 │       vmovss 0x0(%r13,%rbp,4),%xmm6
>   7.85 │       vmulss (%rax,%rsi,1),%xmm6,%xmm6
>   2.50 │       vaddss %xmm4,%xmm5,%xmm4
>   6.49 │       vmulss %xmm4,%xmm6,%xmm4
>  25.48 │       vmovss %xmm4,(%rdi)
>   8.26 │       cmp    $0x20,%rax
>   0.00 │     ↑ jne    6c0
>
> Am I doing something blatantly wrong here or have I run into a compiler
> limitation? Is there anything short of using intrinsics or calling C/C++
> code I can do here to get to performance parity?
>
> Also, while on the subject, is there a way to force LDC to apply the
> relaxed floating point model to the entire program, rather than
> individual functions (the equivalent of --fast-math)?

Just a thought but try this:

import ldc.attributes;  // provides the @fastmath attribute used below

void compute_neq(float[] neq,
                 const float[] ux,
                 const float[] uy,
                 const float[] rho,
                 const float[] ex,
                 const float[] ey,
                 const float[] w,
                 const size_t N) @fastmath {
    foreach(idx; 0 .. N*N) {
        float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];

        foreach(q; 0 .. 9) {
            float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
            tmp *= w[q] * rho[idx];
            neq[idx * 9 + q] = tmp;
        }
    }
}

It may not make any difference since it is semantically the same, but I thought rewriting it to be a bit more idiomatic might at least help.
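A quick way to check whether the rewrite changes anything is to time it. A hypothetical driver for the function above (array sizes, fill values and iteration count are placeholders, not the original benchmark):

import core.time : MonoTime;
import std.stdio : writefln;

void main() {
    enum size_t N = 128;
    auto neq = new float[N * N * 9];
    auto ux  = new float[N * N];
    auto uy  = new float[N * N];
    auto rho = new float[N * N];
    auto ex  = new float[9];
    auto ey  = new float[9];
    auto w   = new float[9];
    ux[] = 0.1f; uy[] = 0.2f; rho[] = 1.0f;       // arbitrary fill values
    ex[] = 1.0f; ey[] = 1.0f; w[]  = 1.0f / 9.0f;

    enum iterations = 100;
    auto start = MonoTime.currTime;
    foreach (i; 0 .. iterations)
        compute_neq(neq, ux, uy, rho, ex, ey, w, N);
    auto elapsed = MonoTime.currTime - start;
    writefln("%s ms/iteration", elapsed.total!"usecs" / 1000.0 / iterations);
}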
Nicholas Wilson | November 12, 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> As part of slowly learning the basics of programming in D, I ported some of my fluid dynamics code from C++ to D and quickly noticed a rather severe performance degradation by a factor of 2-3x. I've narrowed it down to a simple representative benchmark of virtually identical C++ and D code.
>
> [...]

You can apply attributes to a whole file with:

import ldc.attributes;  // provides @fastmath

@fastmath:  // applies to every declaration that follows in this module

void IamFastMath(){}

void SoAmI(){}

Don't know about whole program.

I got some improvements with -vectorize-loops and by making the stencil array static and passing it by ref. I couldn't get it to unroll the inner loop, though.
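For what it's worth, the static-array variant I mean looks roughly like this (a sketch based on the snippet rikki posted above; the exact benchmark code may differ). Fixed-size stencil arrays passed by ref have a compile-time-known length and avoid passing a (pointer, length) slice:

import ldc.attributes;  // provides @fastmath

@fastmath
void compute_neq(float[] neq,
                 const float[] ux,
                 const float[] uy,
                 const float[] rho,
                 ref const float[9] ex,   // 9-entry stencils as static arrays,
                 ref const float[9] ey,   // passed by ref
                 ref const float[9] w,
                 const size_t N) {
    foreach (idx; 0 .. N * N) {
        immutable float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];
        foreach (q; 0 .. 9) {
            immutable float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            neq[idx * 9 + q] = w[q] * rho[idx]
                * (1.0f + eu + 0.5f * eu * eu - 1.5f * usqr);
        }
    }
}

At the call site the stencils would then be declared as float[9] static arrays and passed directly.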
LiNbO3 | November 12, 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration while the D code takes 25ms. Comparing profiler output with the generated assembly code quickly reveals the reason - while Clang fully unrolls the inner loop and uses FMA instructions wherever possible, the inner loop assembly produced by LDC looks like this:

Compiling your code on the godbolt service (https://d.godbolt.org/) with the same set of flags you used, I do see FMA instructions being used.
Johan Engelen | November 12, 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
>
> Also, while on the subject, is there a way to force LDC to apply the relaxed floating point model to the entire program, rather than individual functions (the equivalent of --fast-math)?

Not yet. If you really think this has value, please file an issue for it on GH.
It will be easy to add it to LDC.

-Johan



deXtoRious | November 12, 2016
On Saturday, 12 November 2016 at 03:30:47 UTC, rikki cattermole wrote:
> Just a thought but try this:
>
> void compute_neq(float[] neq,
>                  const float[] ux,
>                  const float[] uy,
>                  const float[] rho,
>                  const float[] ex,
>                  const float[] ey,
>                  const float[] w,
>                  const size_t N) @fastmath {
>     foreach(idx; 0 .. N*N) {
>         float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];
>
>         foreach(q; 0 .. 9) {
>             float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
>             float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
>             tmp *= w[q] * rho[idx];
>             neq[idx * 9 + q] = tmp;
>         }
>     }
> }
>
> It may not make any difference since it is semantically the same but I thought at the very least rewriting it to be a bit more idiomatic may help.

That's how I originally wrote the code; I then reverted to the C++ style for the comparison, to keep the two versions as identical as possible and to make sure the style doesn't make a difference. As expected, it doesn't.
deXtoRious | November 12, 2016
On Saturday, 12 November 2016 at 07:38:16 UTC, Nicholas Wilson wrote:
> On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
>> As part of slowly learning the basics of programming in D, I ported some of my fluid dynamics code from C++ to D and quickly noticed a rather severe performance degradation by a factor of 2-3x. I've narrowed it down to a simple representative benchmark of virtually identical C++ and D code.
>>
>> [...]
>
> you can apply attributes to whole files with
> @fastmath:
>
> void IamFastMath(){}
>
> void SoAmI(){}
>
> Don't know about whole program.
>
> i got some improvements with -vectorize-loops and making the stencil array static and passing by ref. I couldn't get it to unroll the inner loop though.

Isn't -vectorize-loops already enabled by the other flags? Simply adding it doesn't seem to make a difference to the inner loop assembly for me. I'll try passing a static array by ref, which should slightly improve the function call performance, but I'd be surprised if it actually lets the compiler properly vectorize the inner loop or fully unroll it.
deXtoRious | November 12, 2016
On Saturday, 12 November 2016 at 09:45:29 UTC, LiNbO3 wrote:
> On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
>> On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration while the D code takes 25ms. Comparing profiler output with the generated assembly code quickly reveals the reason - while Clang fully unrolls the inner loop and uses FMA instructions wherever possible, the inner loop assembly produced by LDC looks like this:
>
> By compiling your code with the same set of flags you used on the godbolt (https://d.godbolt.org/) service I do see the FMA instructions being used.

There are three vfmadd231ss in the entire assembly, but none of them are in the inner loop. The presence of any FMA instructions at all does show that the compiler properly accepts the -mcpu switch, but it doesn't seem to recognize the opportunities present in the inner loop. The assembly generated by the godbolt service seems largely identical to the one I got on my local machine.
deXtoRious | November 12, 2016
On Saturday, 12 November 2016 at 10:10:44 UTC, Johan Engelen wrote:
> On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
>>
>> Also, while on the subject, is there a way to force LDC to apply the relaxed floating point model to the entire program, rather than individual functions (the equivalent of --fast-math)?
>
> Not yet. If you really think this has value, please file an issue for it on GH.
> It will be easy to add it to LDC.
>
> -Johan

Will do. The syntax Nicholas Wilson mentioned previously does make it easier to apply the attribute to multiple functions at once, but in many cases numerical code is written with the uniform assumption of the relaxed floating point model, so it seems much more appropriate to set it at the compiler level in those cases. It would also simplify benchmarking.
Johan Engelen | November 12, 2016
On Saturday, 12 November 2016 at 10:27:53 UTC, deXtoRious wrote:
>
> There are three vfmadd231ss in the entire assembly, but none of them are in the inner loop. The presence of any FMA instructions at all does show that the compiler properly accepts the -mcpu switch, but it doesn't seem to recognize the opportunities present in the inner loop.

Does the C++ need `__restrict__` for the parameters to get the assembly you want?

> The assembly generated by the godbolt service seems largely identical to the one I got on my local machine.

By the way, it is easier for the discussion if you paste godbolt.org links, so we don't have to set things up manually ourselves ;-)

-Johan
