Thread overview
Rather bizarre slowdowns using Complex!float with AVX (LDC).
Sep 30, 2021
james.p.leblanc
Sep 30, 2021
Johan
Oct 01, 2021
james.p.leblanc
Oct 02, 2021
Guillaume Piolat
September 30, 2021

D-Ers,

I have been getting counterintuitive results on avx/no-avx timing
experiments. Storyline to date (notes at end):

Experiment #1) Real float data type (i.e. non-complex numbers),
speed comparison.
a) moving from non-avx --> avx shows an unrealistic speed-up of 15-25x.
b) this is weird, but the story continues ...

Experiment #2) Real double data type (non-complex numbers),
a) moving from non-avx --> avx again shows amazing gains, but the
gains are about half of those seen in Experiment #1, so maybe
this looks plausible?

Experiment #3) Complex!float data type:
a) now going from non-avx to avx shows a serious performance LOSS
of 40%, or breaking even at best. What is happening here?

Experiment #4) Complex!double:
a) non-avx --> avx shows performance gains of again about 2x (so the
gains appear to be reasonable).

The main question I have is:

"What is going on with the Complex!float performance?" One might expect
floats to have better performance than doubles, as we saw with the
real-valued data (because of vector packing, memory bandwidth, etc.).

But Complex!float shows MUCH WORSE avx performance than Complex!double
(by a factor of almost 4).

//            Table of Computation Times
//
//       self math              std math
// explicit  no-explicit   explicit  no-explicit
//   align      align        align      align
//   0.12       0.21          0.15      0.21 ;  # Float with AVX
//   3.23       3.24          3.30      3.22 ;  # Float without AVX
//   0.31       0.42          0.31      0.42 ;  # Double with AVX
//   3.25       3.24          3.24      3.27 ;  # Double without AVX
//   6.42       6.62          6.61      6.59 ;  # Complex!float with AVX
//   4.04       4.17          6.68      5.82 ;  # Complex!float without AVX
//   1.67       1.69          1.73      1.71 ;  # Complex!double with AVX
//   3.34       3.42          3.28      3.31    # Complex!double without AVX

Notes:

  1. Based on forum hints from LDC experts, I got good guidance
    on enabling AVX (i.e., compiling all modules on the command line,
    using --ffast-math and -mcpu=haswell).

  2. From Mir-glas experts I received hints to try implementing my own
    version of the complex math (this is what the "self math" column
    refers to).
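To illustrate the "self math" idea, here is a sketch of what such a
hand-rolled kernel can look like (illustrative only, not my exact
benchmark code):

```d
import std.complex;

// Illustrative "self math" kernel: expand (a+bi) * conj(c+di) into plain
// float arithmetic so the compiler sees operations it can fuse and
// vectorize, instead of calling through the std.complex operators.
Complex!float dotprodSelf(Complex!float[] x, Complex!float[] y)
{
    float re = 0.0f, im = 0.0f;
    foreach (i; 0 .. x.length)
    {
        // (xr + xi*i) * (yr - yi*i) = (xr*yr + xi*yi) + (xi*yr - xr*yi)*i
        re += x[i].re * y[i].re + x[i].im * y[i].im;
        im += x[i].im * y[i].re - x[i].re * y[i].im;
    }
    return Complex!float(re, im);
}
```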

I understand that the details of the computations are not included here.
(I can provide them if there is interest, once I figure out an effective
way to present them in a forum post.)

But I thought I might begin with a simple question: "Is there some
well-known issue that I am missing here?" Have others been down this
road as well?

Thanks for any and all input.
Best Regards,
James

PS Sorry for the inelegant table ... I do not believe there is a way
to include beautiful bar charts on this forum. (Please correct me
if there is a way.)

September 30, 2021

On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc wrote:

> D-Ers,
>
> I have been getting counterintuitive results on avx/no-avx timing
> experiments.

This could be a template instantiation culling problem. If the compiler determines that Complex!float has already been instantiated (codegen'd) inside Phobos, it may decide not to codegen it again when compiling your code with AVX+fastmath enabled. That would explain why you see no improvement for Complex!float but do see improvement with Complex!double. It does not, however, explain the worse performance with AVX+fastmath than without it.
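One way to probe this hypothesis (a sketch, assuming the culling theory
holds; the type and function names are illustrative) is to use a locally
defined complex type, which must be codegen'd in your own module with
your own flags and cannot reuse a prebuilt Phobos instantiation:

```d
// Sketch: a locally defined type is guaranteed to be codegen'd in this
// module, with this module's -mcpu/--ffast-math settings, so it cannot
// fall back on a Complex!float instantiation prebuilt inside Phobos.
struct MyComplex
{
    float re, im;
}

MyComplex mulConj(MyComplex a, MyComplex b)
{
    // a * conj(b), written out in plain float operations
    return MyComplex(a.re * b.re + a.im * b.im,
                     a.im * b.re - a.re * b.im);
}
```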

Generally, for performance issues like this you need to study assembly output (--output-s) or LLVM IR (--output-ll).
First thing I would look out for is function inlining yes/no.

cheers,
Johan

October 01, 2021

On Thursday, 30 September 2021 at 16:52:57 UTC, Johan wrote:

> On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc wrote:
>
> Generally, for performance issues like this you need to study assembly output (--output-s) or LLVM IR (--output-ll).
> First thing I would look out for is function inlining yes/no.
>
> cheers,
> Johan

Johan,

Thanks kindly for your reply. As suggested, I have looked at the assembly output.

Strangely, the fused multiply-adds are indeed there in the AVX version, but the
example still runs slower for the Complex!float data type.

I have stripped the code down to a minimum that demonstrates the weird result:


import ldc.attributes;  // with or without this line makes no difference
import std.stdio;
import std.datetime.stopwatch;
import std.complex;

alias T = Complex!float;
auto typestr = "COMPLEX FLOAT";
/* alias T = Complex!double; */
/* auto typestr = "COMPLEX DOUBLE"; */

auto alpha = cast(T) complex(0.1, -0.2);  // dummy values to fill arrays
auto beta = cast(T) complex(-0.7, 0.6);

auto dotprod(T[] x, T[] y)
{
   auto sum = cast(T) 0;
   foreach (size_t i; 0 .. x.length)
      sum += x[i] * conj(y[i]);
   return sum;
}

void main()
{
   int nEle = 1000;
   int nIter = 2000;

   auto startTime = MonoTime.currTime;
   auto dur = cast(double) (MonoTime.currTime-startTime).total!"usecs";

   T[] x, y;
   x.length = nEle;
   y.length = nEle;
   T z;
   x[] = alpha;
   y[] = beta;

   startTime = MonoTime.currTime;
   foreach( i ; 0 .. nIter){
      foreach( j ; 0 .. nIter){
            z = dotprod(x,y);
      }
   }
   auto etime = cast(double) (MonoTime.currTime-startTime).total!"msecs" / 1.0e3;
   writef(" result:  % 5.2f%+5.2fi  comp time:  %5.2f \n", z.re, z.im, etime);
}

For convenience, here is the bash script used to compile, run, generate the assembly, and grep:

echo
echo "With AVX:"
ldc2 -O3 -release question.d --ffast-math -mcpu=haswell
./question
ldc2 -output-s -O3 -release question.d --ffast-math -mcpu=haswell
mv question.s question_with_avx.s

echo
echo "Without AVX"
ldc2 -O3 -release question.d
./question
ldc2 -output-s -O3 -release question.d
mv question.s question_without_avx.s

echo
echo "fused multiply adds are found in avx code (as desired)"
grep vfmadd *.s /dev/null

Here is output when run on my machine:

With AVX:
 result:  -190.00+80.00i  comp time:   6.45

Without AVX
 result:  -190.00+80.00i  comp time:   5.74

fused multiply adds are found in avx code (as desired)
question_with_avx.s:    vfmadd231ss     %xmm2, %xmm5, %xmm3
question_with_avx.s:    vfmadd231ss     %xmm0, %xmm2, %xmm3
question_with_avx.s:    vfmadd231ss     %xmm2, %xmm4, %xmm1
question_with_avx.s:    vfmadd231ss     %xmm3, %xmm5, %xmm1
question_with_avx.s:    vfmadd231ss     %xmm3, %xmm1, %xmm0

Repeating the experiment with the Complex!double data type shows the
AVX code to be about twice as fast (perhaps more in line with expectations).

I admit my confusion as to why Complex!float is misbehaving.

Does anyone have insight to what is happening?

Thanks,
James

October 02, 2021

On Friday, 1 October 2021 at 08:32:14 UTC, james.p.leblanc wrote:

> Does anyone have insight to what is happening?
>
> Thanks,
> James

Maybe something related to: https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774 ?

AVX is not always a clear win in terms of performance.
Processing 8 floats at once may not help if you are memory-bound, etc.
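As a rough sanity check on the memory-bound question, a back-of-envelope
sketch using the array sizes from the benchmark posted above (all numbers
taken from that code):

```d
import std.stdio;

void main()
{
    // Sizes from the posted benchmark.
    enum nEle = 1000;
    enum bytesPerElem = 2 * float.sizeof;       // Complex!float: re + im
    enum workingSet = 2 * nEle * bytesPerElem;  // two arrays, x and y

    // 16000 bytes fits comfortably in a typical 32 KB L1 data cache,
    // so this particular kernel is unlikely to be memory-bound.
    writeln("working set: ", workingSet, " bytes");
}
```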