March 07, 2021
On Sunday, 7 March 2021 at 13:26:37 UTC, z wrote:
> On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> However, AVX512 support seems limited to being able to use the 16 other YMM registers, rather than using the same code template but changed to use ZMM registers and double the offsets to take advantage of the new size.
> Compiled with «-g -enable-unsafe-fp-math -enable-no-infs-fp-math -ffast-math -O -release -mcpu=skylake» :

You're not compiling with AVX512 enabled. You would need to use -mcpu=skylake-avx512.
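For example (hypothetical invocation; the file name app.d is assumed, and the flags are the same ones from your quoted command with only the -mcpu value changed):

ldc2 -g -enable-unsafe-fp-math -enable-no-infs-fp-math -ffast-math -O -release -mcpu=skylake-avx512 app.d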

However, LLVM's code generation for AVX512 still seems to be pretty terrible, so you'll need to either use some inline ASM or stick with AVX2. Here's a structure-of-arrays style example:

import std.meta : Repeat;
// a and b each hold the x, y, z component vectors of a batch of points
// (structure of arrays); result receives one distance per SIMD lane.
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, out V result)
    if(is(V : __vector(float[length]), size_t length))
{
    Repeat!(3, V) diffSq = a;
    static foreach(i; 0 .. 3) {
        diffSq[i] -= b[i];
        diffSq[i] *= diffSq[i]; // (a[i] - b[i])^2, component-wise
    }

    result = diffSq[0];
    static foreach(i; 0 .. 3)
        result += diffSq[i];

    version(LDC) { version(X86_64) {
        enum isSupportedPlatform = true;
        import ldc.llvmasm : __asm;
        // Force a plain vsqrtps on the whole vector.
        result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result);
    } }
    static assert(isSupportedPlatform); // fails to compile on unsupported targets
}
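
For reference, a hypothetical call site (variable names are made up, not from any posted code) passes the three coordinate vectors of each point plus the output vector:

void example() // hypothetical wrapper, only to show the call shape
{
    alias V16 = __vector(float[16]);
    V16 ax, ay, az; // x, y, z of 16 points (structure of arrays), filled elsewhere
    V16 bx, by, bz; // x, y, z of the 16 points to compare against
    V16 dist;       // receives the 16 distances
    euclideanDistanceFixedSizeArray(ax, ay, az, bx, by, bz, dist);
}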

Resulting asm with is(V == __vector(float[16])):

.LCPI1_0:
        .long   0x7fc00000
pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), out __vector(float[16])):
        mov     rax, qword ptr [rsp + 8]
        vbroadcastss    zmm0, dword ptr [rip + .LCPI1_0]
        vmovaps zmmword ptr [rdi], zmm0
        vmovaps zmm0, zmmword ptr [rax]
        vmovaps zmm1, zmmword ptr [r9]
        vmovaps zmm2, zmmword ptr [r8]
        vsubps  zmm0, zmm0, zmmword ptr [rcx]
        vmulps  zmm0, zmm0, zmm0
        vsubps  zmm1, zmm1, zmmword ptr [rdx]
        vsubps  zmm2, zmm2, zmmword ptr [rsi]
        vaddps  zmm0, zmm0, zmm0
        vfmadd231ps     zmm0, zmm1, zmm1
        vfmadd231ps     zmm0, zmm2, zmm2
        vmovaps zmmword ptr [rdi], zmm0
        vsqrtps zmm0, zmm0
        vmovaps zmmword ptr [rdi], zmm0
        vzeroupper
        ret

March 07, 2021
On Sunday, 7 March 2021 at 18:00:57 UTC, z wrote:
> On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:
>>>>   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
>>>>       distance += a[i].abs;//abs required by the caller
>>
>> (a * a) above is always positive for real numbers. You don't need the call to abs unless you're trying to guarantee that even nan values will have a clear sign bit.
>>
> I do not know why but the caller's performance nosedives

My way is definitely (slightly) better; something is going wrong in either the caller or the optimizer. Show me the code for the caller and maybe I can figure it out.

> whenever there is no .abs at this particular line. (There's a 3x difference, no joke.)

Perhaps the compiler is performing a value range propagation (VRP) based optimization in the caller, but isn't smart enough to figure out that the value is already always positive without the `abs` call? I've run into that specific problem before.

Alternatively, sometimes trivial changes to the code that *shouldn't* matter make the difference between hitting a smart path in the optimizer, and a dumb one. Automatic SIMD optimization is quite sensitive and temperamental.

Either way, the problem can be fixed by figuring out what optimization the compiler is doing when it knows that distance is positive, and just doing it manually.
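
For example, if the difference turns out to be in how the square root gets lowered when the compiler can't prove its argument is non-negative, the same inline-asm trick as in the euclideanDistanceFixedSizeArray example earlier in this thread forces the instruction directly (hypothetical helper name; LDC/x86-64 only):

import ldc.llvmasm : __asm;

// Hypothetical helper: square root of a vector the *programmer* knows is
// non-negative, bypassing whatever extra handling the optimizer would add.
V sqrtAssumeNonNegative(V)(V v)
    if(is(V : __vector(float[length]), size_t length))
{
    return __asm!V(`vsqrtps $1, $0`, `=x, x`, v);
}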

> Same for assignment instead of addition, but with a 2x difference instead.

Did you fix the nan bug I pointed out earlier? More generally, are you actually verifying the correctness of the results in any way for each alternative implementation? You can sometimes get big speedups from buggy code when the compiler realizes that a later logic error makes earlier code irrelevant, but that doesn't mean the buggy code is better...
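
For example, a minimal check (the scalar reference and tolerance here are made up, not taken from your code) that each SIMD variant's lanes can be compared against:

import std.math : abs, sqrt;

// Naive scalar reference: plain Euclidean distance between two 3D points.
float referenceDistance(const float[3] a, const float[3] b)
{
    float sum = 0;
    foreach(i; 0 .. 3)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(sum);
}

// Accept a small relative error, since -ffast-math reorders the arithmetic.
bool matchesReference(float simdLane, const float[3] a, const float[3] b)
{
    const expected = referenceDistance(a, b);
    return abs(simdLane - expected) <= 1e-5f * abs(expected) + 1e-6f;
}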
March 07, 2021
On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:
> import std.meta : Repeat;
> void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, out V result)
>     if(is(V : __vector(float[length]), size_t length))
> ...
>
> Resulting asm with is(V == __vector(float[16])):
>
> .LCPI1_0:
>         .long   0x7fc00000
> pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), out __vector(float[16])):
>         mov     rax, qword ptr [rsp + 8]
>         vbroadcastss    zmm0, dword ptr [rip + .LCPI1_0]
> ...

Apparently the optimizer is too stupid to skip the redundant float.nan broadcast when result is an `out` parameter, so just make it `ref V result` instead for better code gen:
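
That is, only the storage class of the last parameter changes (signature shown here; the body stays exactly as in the previous post):

void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, ref V result)
    if(is(V : __vector(float[length]), size_t length))
{
    // body unchanged from the previous post
}

Resulting asm: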

pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref __vector(float[16])):
        mov     rax, qword ptr [rsp + 8]
        vmovaps zmm0, zmmword ptr [rax]
        vmovaps zmm1, zmmword ptr [r9]
        vmovaps zmm2, zmmword ptr [r8]
        vsubps  zmm0, zmm0, zmmword ptr [rcx]
        vmulps  zmm0, zmm0, zmm0
        vsubps  zmm1, zmm1, zmmword ptr [rdx]
        vsubps  zmm2, zmm2, zmmword ptr [rsi]
        vaddps  zmm0, zmm0, zmm0
        vfmadd231ps     zmm0, zmm1, zmm1
        vfmadd231ps     zmm0, zmm2, zmm2
        vmovaps zmmword ptr [rdi], zmm0
        vsqrtps zmm0, zmm0
        vmovaps zmmword ptr [rdi], zmm0
        vzeroupper
        ret

March 07, 2021
On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:
> ...
>     result = diffSq[0];
>     static foreach(i; 0 .. 3)
>         result += diffSq[i];
> ...

Oops, that's supposed to say `i; 1 .. 3`. Fixed:

import std.meta : Repeat;
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, ref V result)
    if(is(V : __vector(float[length]), size_t length))
{
    Repeat!(3, V) diffSq = a;
    static foreach(i; 0 .. 3) {
        diffSq[i] -= b[i];
        diffSq[i] *= diffSq[i];
    }

    result = diffSq[0];
    static foreach(i; 1 .. 3)
        result += diffSq[i];

    version(LDC) { version(X86_64) {
        enum isSupportedPlatform = true;
        import ldc.llvmasm : __asm;
        result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result);
    } }
    static assert(isSupportedPlatform);
}

Fixed asm:

pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref __vector(float[16])):
        mov     rax, qword ptr [rsp + 8]
        vmovaps zmm0, zmmword ptr [rax]
        vmovaps zmm1, zmmword ptr [r9]
        vmovaps zmm2, zmmword ptr [r8]
        vsubps  zmm0, zmm0, zmmword ptr [rcx]
        vsubps  zmm1, zmm1, zmmword ptr [rdx]
        vmulps  zmm1, zmm1, zmm1
        vsubps  zmm2, zmm2, zmmword ptr [rsi]
        vfmadd231ps     zmm1, zmm0, zmm0
        vfmadd231ps     zmm1, zmm2, zmm2
        vmovaps zmmword ptr [rdi], zmm1
        vsqrtps zmm0, zmm1
        vmovaps zmmword ptr [rdi], zmm0
        vzeroupper
        ret

(I really wish I could just edit my posts here...)