Thread overview
AVX for math code ... avx instructions later disappearing ?
Sep 26, 2021
james.p.leblanc
Sep 26, 2021
kinke
Sep 26, 2021
james.p.leblanc
September 26, 2021

Dear D-ers,

I enjoyed reading some details of incorporating AVX into math code
from Johan Engelen's programming blog post:

http://johanengelen.github.io/ldc/2016/10/11/Math-performance-LDC.html

Basically, one can use the LDC compiler to emit AVX code, nice!

In playing with some variants of his example code, I realize
that there are issues I do not understand. For example, the following
code successfully incorporates the avx instructions:

// File here is called dotFirst.d
import ldc.attributes : fastmath;
@fastmath

double dot( double[] a, double[] b)
{
    double s = 0.0;
    foreach (size_t i; 0 .. a.length) {
        s += a[i] * b[i];
    }
    return s;
}

double[8] x =[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, ];
double[8] y =[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, ];

void main()
{
    double z = 0.0;
    z = dot(x, y);
}

If we run:

ldc2 -c -output-s -O3 -release dotFirst.d -mcpu=haswell
echo "Results of grep ymm dotFirst.s:"
grep ymm dotFirst.s

The "grep" shows a number of vector instructions, such as:

vfmadd132pd 160(%rcx,%rdi,8), %ymm5, %ymm1

However, subtle changes in the code (such as moving the dot product
function to a module, or even moving the array declarations to before
the dot product function) make the AVX instructions disappear!

import ldc.attributes : fastmath;
@fastmath

double[8] x =[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, ];
double[8] y =[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, ];

double dot( double[] a, double[] b)
{
    double s = 0.0;
    foreach (size_t i; 0 .. a.length) {
...

Now a grep will not find a single ymm.

It is understood that LDC needs properly aligned data to be able to emit the
vector instructions...

But my question is: how is proper alignment guaranteed, and most importantly,
how is it guaranteed across code split into modules? (There are related stack
alignment issues: is it limited to 16 bytes?)

Best Regards,
James

PS I have come across scattered bits of (sometimes contradictory) information on
avx/simd for dlang. Is there a canonical source for vector info?

September 26, 2021

On Sunday, 26 September 2021 at 18:08:46 UTC, james.p.leblanc wrote:

> or even moving the array declarations to before
> the dot product function, and the avx instructions will disappear!

That's because the @fastmath UDA applies to the next declaration only, which is the x array in your 2nd example (where it obviously has no effect). Either use @fastmath: with the colon to apply it to the entire scope, or use -ffast-math in the LDC cmdline.
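As a minimal sketch of the colon form (not taken from the original posts), the second example could be arranged as follows. The file name dotScoped.d is hypothetical, and the sketch assumes compilation with LDC, since ldc.attributes is LDC-specific. The arrays are deliberately kept above the `@fastmath:` line so that the attribute covers only the functions.

```d
// File: dotScoped.d (hypothetical name); assumes LDC, e.g.
//   ldc2 -c -output-s -O3 -release dotScoped.d -mcpu=haswell
import ldc.attributes : fastmath;

// Declared before the attribute: @fastmath has no effect on data anyway.
double[8] x = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7];
double[8] y = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7];

// With the colon, the UDA applies to every declaration that follows
// in this scope, not just to the next one.
@fastmath:

double dot(double[] a, double[] b)
{
    double s = 0.0;
    foreach (size_t i; 0 .. a.length)
        s += a[i] * b[i];
    return s;
}

void main()
{
    // Slice the static arrays explicitly to match the double[] parameters.
    double z = dot(x[], y[]);
}
```

With the same ldc2 command line as in the first post, `grep ymm` on the generated .s file should again find vector instructions, though which instructions appear depends on the LDC version and the -mcpu setting.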

Similarly, if you move the function to another module and don't include that module in the cmdline, it's only imported, not compiled, and so won't show up in the resulting assembly.
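The module case might look like the sketch below. The module name mathutil and the file name mathutil.d are hypothetical; the only point is that the file defining dot has to be passed to ldc2 itself for its code to be generated.

```d
// File: mathutil.d (hypothetical module holding the dot product)
// Compile this file itself so that dot() is actually code-generated, e.g.
//   ldc2 -c -output-s -O3 -release -mcpu=haswell mathutil.d
//   grep ymm mathutil.s
module mathutil;

import ldc.attributes : fastmath;

// The UDA sits directly on dot, the declaration it is meant to affect.
@fastmath
double dot(double[] a, double[] b)
{
    double s = 0.0;
    foreach (size_t i; 0 .. a.length)
        s += a[i] * b[i];
    return s;
}
```

A program that does `import mathutil;` and calls dot then needs mathutil.d (or its object file) added to its own build. Compiling only the importing file leaves dot out of that file's assembly.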

Wrt. stack alignment, there aren't any issues with LDC AFAIK (not limited to 16 or whatever like DMD).

September 26, 2021

Kinke,

Thanks very much for your response. There were many issues that I
had been misunderstanding in my attempts. The provided explanation
helped me understand the broader scope of what is happening.

(I never even thought about the @fastmath UDA aspect! ... a bit
embarrassing for me!) Using -ffast-math in the LDC
cmdline seems to be the most elegant solution.

Much appreciated!
Regards,
James