Thread overview
Rather bizarre slowdowns using Complex!float with AVX (LDC).
Sep 30, 2021
james.p.leblanc
Sep 30, 2021
Johan
Oct 01, 2021
james.p.leblanc
Oct 02, 2021
Guillaume Piolat
September 30, 2021

D-Ers,

I have been getting counterintuitive results on avx/no-avx timing
experiments. Storyline to date (notes at end):

Experiment #1) Real float data type (i.e. non-complex numbers),
speed comparison.
a) moving from non-avx --> avx shows an unrealistic speed-up of 15-25x.
b) this is weird, but the story continues ...

Experiment #2) Real double data type (non-complex numbers),
a) moving from non-avx --> avx again shows amazing gains, but the
gains are about half of those seen in Experiment #1, so maybe
this looks plausible?

Experiment #3) Complex!float data type:
a) now going from non-avx to avx shows a serious performance LOSS
of 40%, or breaking even at best. What is happening here?

Experiment #4) Complex!double:
a) non-avx --> avx shows performance gains of again about 2x (so the
gains appear to be reasonable).

The main question I have is:

"What is going on with the Complex!float performance?" One might expect
floats to have better performance than doubles, as we saw with the
real-valued data (because of vector packing, memory bandwidth, etc.).

But Complex!float shows MUCH WORSE avx performance than Complex!double
(by a factor of almost 4).

//            Table of Computation Times
//
//       self math              std math
// explicit  no-explicit   explicit  no-explicit
//   align      align        align      align
//   0.12       0.21          0.15      0.21 ;  # Float with AVX
//   3.23       3.24          3.30      3.22 ;  # Float without AVX
//   0.31       0.42          0.31      0.42 ;  # Double with AVX
//   3.25       3.24          3.24      3.27 ;  # Double without AVX
//   6.42       6.62          6.61      6.59 ;  # Complex!float with AVX
//   4.04       4.17          6.68      5.82 ;  # Complex!float without AVX
//   1.67       1.69          1.73      1.71 ;  # Complex!double with AVX
//   3.34       3.42          3.28      3.31    # Complex!double without AVX

Notes:

  1. Based on forum hints from LDC experts, I got good guidance
    on enabling AVX (i.e., compiling all modules on the command line,
    using --ffast-math and -mcpu=haswell).

  2. From Mir-glas experts I received hints to try implementing my own
    version of the complex math (this is what the "self math" column
    refers to).
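To illustrate the "self math" idea, here is a sketch of what such a
hand-rolled kernel can look like (illustrative only, not my exact
benchmark code):

```d
import std.complex;

// Illustrative "self math" kernel: expand (a+bi) * conj(c+di) into plain
// float arithmetic so the compiler sees operations it can fuse and
// vectorize, instead of calling through the std.complex operators.
Complex!float dotprodSelf(Complex!float[] x, Complex!float[] y)
{
    float re = 0.0f, im = 0.0f;
    foreach (i; 0 .. x.length)
    {
        // (xr + xi*i) * (yr - yi*i) = (xr*yr + xi*yi) + (xi*yr - xr*yi)*i
        re += x[i].re * y[i].re + x[i].im * y[i].im;
        im += x[i].im * y[i].re - x[i].re * y[i].im;
    }
    return Complex!float(re, im);
}
```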

I understand that the details of the computations are not included here.
(I can provide them if there is interest, once I figure out an effective
way to present them in a forum post.)

But I thought I might begin with a simple question: "Is there some
well-known issue that I am missing here?" Have others been down this
road as well?

Thanks for any and all input.
Best Regards,
James

PS Sorry for the inelegant table ... I do not believe there is a way
to include beautiful bar charts on this forum. (Please correct me
if there is a way.)

September 30, 2021

On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc wrote:

> D-Ers,
>
> I have been getting counterintuitive results on avx/no-avx timing
> experiments.

This could be a template instantiation culling problem. If the compiler determines that Complex!float has already been instantiated (codegen'd) inside Phobos, it may decide not to codegen it again when compiling your code with AVX+fastmath enabled. That would explain why you see no improvement for Complex!float but do see improvement with Complex!double. It does not, however, explain the worse performance with AVX+fastmath than without it.
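One way to probe this hypothesis (a sketch, assuming the culling theory
holds; the type and function names are illustrative) is to use a locally
defined complex type, which must be codegen'd in your own module with
your own flags and cannot reuse a prebuilt Phobos instantiation:

```d
// Sketch: a locally defined type is guaranteed to be codegen'd in this
// module, with this module's -mcpu/--ffast-math settings, so it cannot
// fall back on a Complex!float instantiation prebuilt inside Phobos.
struct MyComplex
{
    float re, im;
}

MyComplex mulConj(MyComplex a, MyComplex b)
{
    // a * conj(b), written out in plain float operations
    return MyComplex(a.re * b.re + a.im * b.im,
                     a.im * b.re - a.re * b.im);
}
```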

Generally, for performance issues like this you need to study assembly output (--output-s) or LLVM IR (--output-ll).
First thing I would look out for is function inlining yes/no.

cheers,
Johan

October 01, 2021

On Thursday, 30 September 2021 at 16:52:57 UTC, Johan wrote:

> On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc wrote:
>
> Generally, for performance issues like this you need to study assembly output (--output-s) or LLVM IR (--output-ll).
> First thing I would look out for is function inlining yes/no.
>
> cheers,
> Johan

Johan,

Thanks kindly for your reply. As suggested, I have looked at the assembly output.

Strangely, the fused multiply-adds are indeed there in the AVX version, but the
example still runs slower for the Complex!float data type.

I have stripped the code down to a minimum that demonstrates the weird result:


import ldc.attributes;  // with or without this line makes no difference
import std.stdio;
import std.datetime.stopwatch;
import std.complex;

alias T = Complex!float;
auto typestr = "COMPLEX FLOAT";
/* alias T = Complex!double; */
/* auto typestr = "COMPLEX DOUBLE"; */

auto alpha = cast(T) complex(0.1, -0.2);  // dummy values to fill arrays
auto beta = cast(T) complex(-0.7, 0.6);

auto dotprod(T[] x, T[] y)
{
   auto sum = cast(T) 0;
   foreach (size_t i; 0 .. x.length)
      sum += x[i] * conj(y[i]);
   return sum;
}

void main()
{
   int nEle = 1000;
   int nIter = 2000;

   auto startTime = MonoTime.currTime;
   auto dur = cast(double) (MonoTime.currTime-startTime).total!"usecs";

   T[] x, y;
   x.length = nEle;
   y.length = nEle;
   T z;
   x[] = alpha;
   y[] = beta;

   startTime = MonoTime.currTime;
   foreach( i ; 0 .. nIter){
      foreach( j ; 0 .. nIter){
            z = dotprod(x,y);
      }
   }
   auto etime = cast(double) (MonoTime.currTime-startTime).total!"msecs" / 1.0e3;
   writef(" result:  % 5.2f%+5.2fi  comp time:  %5.2f \n", z.re, z.im, etime);
}

For convenience, here is the bash script used to compile, run, generate the assembly, and grep:

echo
echo "With AVX:"
ldc2 -O3 -release question.d --ffast-math -mcpu=haswell
./question
ldc2 -output-s -O3 -release question.d --ffast-math -mcpu=haswell
mv question.s question_with_avx.s

echo
echo "Without AVX"
ldc2 -O3 -release question.d
./question
ldc2 -output-s -O3 -release question.d
mv question.s question_without_avx.s

echo
echo "fused multiply adds are found in avx code (as desired)"
grep vfmadd *.s /dev/null

Here is output when run on my machine:

With AVX:
 result:  -190.00+80.00i  comp time:   6.45

Without AVX
 result:  -190.00+80.00i  comp time:   5.74

fused multiply adds are found in avx code (as desired)
question_with_avx.s:    vfmadd231ss     %xmm2, %xmm5, %xmm3
question_with_avx.s:    vfmadd231ss     %xmm0, %xmm2, %xmm3
question_with_avx.s:    vfmadd231ss     %xmm2, %xmm4, %xmm1
question_with_avx.s:    vfmadd231ss     %xmm3, %xmm5, %xmm1
question_with_avx.s:    vfmadd231ss     %xmm3, %xmm1, %xmm0

Repeating the experiment with the Complex!double data type shows the
AVX code to be about twice as fast (perhaps more in line with expectations).

I admit my confusion as to why Complex!float is misbehaving.

Does anyone have insight to what is happening?

Thanks,
James

October 02, 2021

On Friday, 1 October 2021 at 08:32:14 UTC, james.p.leblanc wrote:

> Does anyone have insight to what is happening?
>
> Thanks,
> James

Maybe something related to: https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774 ?

AVX is not always a clear win in terms of performance.
Processing 8 floats at once may not help if you are memory-bound, etc.
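As a rough sanity check on the memory-bound question, a back-of-envelope
sketch using the array sizes from the benchmark posted above (all numbers
taken from that code):

```d
import std.stdio;

void main()
{
    // Sizes from the posted benchmark.
    enum nEle = 1000;
    enum bytesPerElem = 2 * float.sizeof;       // Complex!float: re + im
    enum workingSet = 2 * nEle * bytesPerElem;  // two arrays, x and y

    // 16000 bytes fits comfortably in a typical 32 KB L1 data cache,
    // so this particular kernel is unlikely to be memory-bound.
    writeln("working set: ", workingSet, " bytes");
}
```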