Optimizing for SIMD: best practices? (i.e. what features are allowed?)
February 25, 2021
How does one optimize code to make full use of the CPU's SIMD capabilities?
Is there any way to guarantee that "packed" versions of SIMD instructions will be used? (e.g. vmulps, vsqrtps, etc.)
To give some context, this is a sample of one of the functions that could benefit from better SIMD usage:
>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
>   float distance;
>   a[] -= b[];
>   a[] *= a[];
>   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
>       distance += a[i].abs;//abs required by the caller
>   }
>   return sqrt(distance);
>}
>vmovsd xmm0,qword ptr ds:[rdx]
>vmovss xmm1,dword ptr ds:[rdx+8]
>vmovsd xmm2,qword ptr ds:[rcx+4]
>vsubps xmm0,xmm0,xmm2
>vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
>vmulps xmm0,xmm0,xmm0
>vmulss xmm1,xmm1,xmm1
>vbroadcastss xmm2,dword ptr ds:[<__real@7fffffff>]
>vandps xmm0,xmm0,xmm2
>vpermilps xmm3,xmm0,F5
>vaddss xmm0,xmm0,xmm3
>vandps xmm1,xmm1,xmm2
>vaddss xmm0,xmm0,xmm1
>vsqrtss xmm0,xmm0,xmm0
>vmovaps xmm6,xmmword ptr ss:[rsp+20]
>add rsp,38
>ret

I've tried to experiment with dynamic arrays of float[3], but the output assembly seemed to be worse.[1] (In short, it calls internal D functions that use "vxxxss" instructions while performing many moves.)

Big thanks
[1] https://run.dlang.io/is/F3Xye3
February 25, 2021
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> How does one optimize code to make full use of the CPU's SIMD capabilities?
> Is there any way to guarantee that "packed" versions of SIMD instructions will be used?(e.g. vmulps, vsqrtps, etc...)

https://code.dlang.org/packages/intel-intrinsics


February 26, 2021
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> Is there any way to guarantee that "packed" versions of SIMD instructions will be used?(e.g. vmulps, vsqrtps, etc...)
> To give some context, this is a sample of one of the functions that could benefit from better SIMD usage :
>>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {

You need to use __vector(float[4]) instead of float[3] to tell the compiler to pack multiple elements per SIMD register. Right now your data lacks proper alignment for SIMD loads/stores.

Beyond that, SIMD code is rather difficult to optimize. Code written in ignorance or in a rush is unlikely to be meaningfully faster than ordinary scalar code, unless the data flow is very simple. You will probably get a bigger speedup for less effort and pain by first minimizing heap allocations, maximizing locality of reference, minimizing indirections, and minimizing memory use. (And, of course, it should go without saying that choosing an asymptotically efficient high-level algorithm is more important than any micro-optimization for large data sets.) Nevertheless, if you are up to the challenge, SIMD can sometimes provide a final 2-3x speed boost.

Your algorithms will need to be designed to minimize mixing of data between SIMD channels, as this forces the generation of lots of extra instructions to swizzle the data, or worse to unpack and repack it. Something like a Cartesian dot product or cross product will benefit much less from SIMD than vector addition, for example. Sometimes the amount of swizzling can be greatly reduced with a little algebra, other times you might need to refactor an array of structures into a structure of arrays.
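
For example, the array-of-structures to structure-of-arrays refactoring is roughly this layout change (hypothetical Vec3/PointsSoA names, purely for illustration):

    // Array of structures (AoS): x, y, z are interleaved in memory,
    // so SIMD lanes end up holding mixed components.
    struct Vec3 { float x, y, z; }
    Vec3[] pointsAoS;

    // Structure of arrays (SoA): each component is contiguous,
    // so a whole register can be filled with x values at once.
    struct PointsSoA
    {
        float[] x;
        float[] y;
        float[] z;
    }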

Per-element conditional branches are very bad, and often completely defeat the benefits of SIMD. For very short segments of code (like conditional assignment), replace them with a SIMD conditional move (vcmp and vblend). Bit-twiddling is your friend.
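
For a scalar illustration of that bit-twiddling pattern (the same idea applies per SIMD lane, where the compiler can lower it to a compare plus blend):

    // mask is all ones where the condition holds and all zeros elsewhere,
    // so the selection needs no branch at all.
    int select(int mask, int onTrue, int onFalse)
    {
        return (onTrue & mask) | (onFalse & ~mask);
    }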

Finally, do not trust the compiler or the optimizer. People love to make the claim that "The Compiler" is always better than humans at micro-optimizations, but this is not at all the case for SIMD code with current systems. I have found even LLVM to produce quite bad SIMD code for complex algorithms, unless I carefully structure my code to make it as easy as possible for the optimizer to get to the final assembly I want. A sprinkling of manual assembly code (directly, or via a library) is also necessary to fill in certain instructions that the compiler doesn't know when to use at all.

Resources I have found very helpful:

Matt Godbolt's Compiler Explorer online visual disassembler (supports D):
    https://godbolt.org/

Felix Cloutier's x86 and amd64 instruction reference:
    https://www.felixcloutier.com/x86/

Agner Fog's optimization guide (especially the instruction tables):
    https://agner.org/optimize/
February 26, 2021
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
>>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {

Use __vector(float[4]), not float[3].

>>   float distance;

The default value for float is float.nan. You need to explicitly initialize it to 0.0f or something if you want this function to actually do anything useful.

>>   a[] -= b[];
>>   a[] *= a[];

With __vector types, this can be simplified (not optimized) to just:
    a -= b;
    a *= a;

>>   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
>>       distance += a[i].abs;//abs required by the caller

(a * a) above is always positive for real numbers. You don't need the call to abs unless you're trying to guarantee that even nan values will have a clear sign bit.

Also, there is no point in adding the first component to zero, and copying element [0] from a SIMD register into a scalar is free, so this can become:

    float distance = a[0];
    static foreach(size_t i; 1 .. 3)
        distance += a[i];

>>   }
>>   return sqrt(distance);
>>}
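
Putting those changes together, a sketch of the full rewrite (assuming the data is widened to __vector(float[4]); the fourth lane is simply ignored) would be:

    import core.simd : float4; // alias for __vector(float[4])
    import std.math : sqrt;

    float euclideanDistanceFixedSizeArray(float4 a, float4 b) {
        a -= b;
        a *= a;
        float distance = a[0];
        static foreach (size_t i; 1 .. 3)
            distance += a[i];
        return sqrt(distance);
    }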

Final assembly output (ldc 1.24.0 with -release -O3 -preview=intpromote -preview=dip1000 -m64 -mcpu=haswell -fp-contract=fast -enable-cross-module-inlining):

    vsubps  xmm0, xmm1, xmm0
    vmulps  xmm0, xmm0, xmm0
    vmovshdup       xmm1, xmm0
    vaddss  xmm1, xmm0, xmm1
    vpermilpd       xmm0, xmm0, 1
    vaddss  xmm0, xmm0, xmm1
    vsqrtss xmm0, xmm0, xmm0
    ret
February 26, 2021
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> How does one optimize code to make full use of the CPU's SIMD capabilities?
> Is there any way to guarantee that "packed" versions of SIMD instructions will be used?(e.g. vmulps, vsqrtps, etc...)
> To give some context, this is a sample of one of the functions that could benefit from better SIMD usage :
>>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
>>   float distance;
>>   a[] -= b[];
>>   a[] *= a[];
>>   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
>>       distance += a[i].abs;//abs required by the caller
>>   }
>>   return sqrt(distance);
>>}
>>vmovsd xmm0,qword ptr ds:[rdx]
>>vmovss xmm1,dword ptr ds:[rdx+8]
>>vmovsd xmm2,qword ptr ds:[rcx+4]
>>vsubps xmm0,xmm0,xmm2
>>vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
>>vmulps xmm0,xmm0,xmm0
>>vmulss xmm1,xmm1,xmm1
>>vbroadcastss xmm2,dword ptr ds:[<__real@7fffffff>]
>>vandps xmm0,xmm0,xmm2
>>vpermilps xmm3,xmm0,F5
>>vaddss xmm0,xmm0,xmm3
>>vandps xmm1,xmm1,xmm2
>>vaddss xmm0,xmm0,xmm1
>>vsqrtss xmm0,xmm0,xmm0
>>vmovaps xmm6,xmmword ptr ss:[rsp+20]
>>add rsp,38
>>ret
>
> I've tried to experiment with dynamic arrays of float[3] but the output assembly seemed to be worse.[1](in short, it's calling internal D functions which use "vxxxss" instructions while performing many moves.)
>
> Big thanks
> [1] https://run.dlang.io/is/F3Xye3

If you are developing for deployment to a platform that has a GPU, you might consider going SIMT (dcompute) rather than SIMD.  SIMT is a lot easier on the eyes.  More importantly, if you're targeting an SoC or console, or have relatively chunky computations that allow you to work around the PCIe transit costs, the path is open to very large performance improvements.  I've only been using dcompute for a week or so, but so far it's been great.

If your algorithms are very branchy, or you decide to stick with multi-core/SIMD for any of a number of other good reasons, here are a few things I learned before decamping to dcompute land that may help:

  1)  LDC is pretty good at auto vectorization as you have probably observed.  Definitely worth a few iterations to try and get the vectorizer engaged.

  2)  LDC auto vectorization was good but explicit __vector programming is more predictable and was, at least for my tasks, much faster. I couldn't persuade the auto vectorizer to "do the right thing" throughout the hot path but perhaps you'll have better luck.

  3)  LDC does a good job of going between T[N] <==> __vector(T[N]), so using the static array types as your input/output types and the __vector types as your compute types works out well whenever you have to interface with an unaligned world. LDC issues unaligned vector loads/stores for casts or full array assigns: v = cast(VT)sa[];  or v[] = sa[];  These are quite good on modern CPUs.  To calibrate, note that Ethan recently talked about a 10% gain he experienced using full alignment, IIRC, so there's that.

  4) LDC also does a good job of discovering SIMD equivalents given static foreach unrolled loops with explicit compile-time indexing of vector element operands.  You can use those along with pragma(inline, true) to develop your own "intrinsics" that supplement other libs (see the sketch after this list).

  5) If you adopt the __vector approach you'll have to handle the partials manually. (array length % vec length != 0 indicates a partial or tail fragment).  If the classic (copying/padding) approaches to such fragmentation don't work for you I'd suggest using nested static functions that take ref T[N] inputs and outputs.  The main loops become very simple and the tail handling reduces to loading stack allocated T[N] variables explicitly, calling the static function, and unloading.
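
A minimal sketch of points 3) and 4) combined (hypothetical vmax/maxInto names, and assuming an AVX-capable -mcpu so that __vector(float[8]) is legal):

alias VT = __vector(float[8]);

// Home-made "intrinsic": element-wise max. The static foreach with
// compile-time indices lets LDC discover the packed equivalent (vmaxps).
pragma(inline, true)
VT vmax(VT a, VT b)
{
    VT r;
    static foreach (i; 0 .. 8)
        r[i] = a[i] > b[i] ? a[i] : b[i];
    return r;
}

// Static arrays at the interface, vectors for the compute.
void maxInto(ref float[8] dst, ref float[8] x, ref float[8] y)
{
    VT vx = cast(VT) x[];    // unaligned vector load from a plain static array
    VT vy = cast(VT) y[];
    dst[] = vmax(vx, vy)[];  // unaligned vector store back into T[N]
}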

Good luck.


February 26, 2021
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat wrote:
> On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
>> How does one optimize code to make full use of the CPU's SIMD capabilities?
>> Is there any way to guarantee that "packed" versions of SIMD instructions will be used?(e.g. vmulps, vsqrtps, etc...)
>
> https://code.dlang.org/packages/intel-intrinsics

A bit of elaboration on why you might want to prefer intel-intrinsics:
- it supports all D compilers, including DMD 32-bit target
- targets arm32 and arm64 with same code (LDC only)
- core.simd just gives you the basic operators, but not, say, pmaddwd or any of the complex instructions. Some instructions need very specific work to get them.
- at least with LLVM, the optimizer works reliably over subsequent versions of the compiler.
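
For instance, the original distance function written against intel-intrinsics might look roughly like this (SSE level; the inputs are assumed to be padded to float[4], and the intrinsic names follow the package's Intel-style naming):

    import inteli.xmmintrin; // SSE intrinsics from the intel-intrinsics package

    float euclideanDistance(const float[4] a, const float[4] b)
    {
        __m128 va = _mm_loadu_ps(a.ptr);
        __m128 vb = _mm_loadu_ps(b.ptr);
        __m128 d = _mm_sub_ps(va, vb);
        d = _mm_mul_ps(d, d);                 // packed squared differences
        float[4] t;
        _mm_storeu_ps(t.ptr, d);
        float sum = t[0] + t[1] + t[2];       // lane 3 is padding
        return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(sum)));
    }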

March 07, 2021
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> ...
It seems that using static foreach with pointer parameters exclusively is the best way to "guide" LDC into optimizing code. (Using arr1[] += arr2[] syntax resulted in worse performance for me.)
However, AVX512 support seems limited to using the 16 additional YMM registers, rather than reusing the same code template with ZMM registers and doubled offsets to take advantage of the new size.
Compiled with «-g -enable-unsafe-fp-math -enable-no-infs-fp-math -ffast-math -O -release -mcpu=skylake»:
>alias simdf = float[8];
>enum simdlength = simdf.length; // = 8, used by the static foreach below
>__gshared simdf init = [0f,0f,0f,0f,0f,0f,0f,0f];
>extern(C)//with extern(D) (the default), the assembly output uses one register for two pointers.
>void vEUCLIDpsptr_void(simdf* a0, simdf* a1, simdf* a2, simdf* a3, simdf* b1, simdf* b2, simdf* b3) {
>	simdf amm0 = init;//returned simdf
>	simdf amm1 = *a1;
>	simdf amm2 = *a2;
>	simdf amm3 = *a3;
>	static foreach(size_t i; 0..simdlength) {
>		//Needs to be interleaved like this, otherwise LDC generates worse code.
>		amm1[i] -= (*b1)[i];
>		amm1[i] *= amm1[i];
>		amm2[i] -= (*b2)[i];
>		amm2[i] *= amm2[i];
>		amm3[i] -= (*b3)[i];
>		amm3[i] *= amm3[i];
>		amm0[i] += amm1[i];
>		amm0[i] += amm2[i];
>		amm0[i] += amm3[i];
>		amm0[i] = sqrt(amm0[i]);
>	}
>	*a0 = amm0;
>	return;
>}

>mov r10,qword ptr ss:[rsp+38]
>mov r11,qword ptr ss:[rsp+30]
>mov rax,qword ptr ss:[rsp+28]
>vmovups ymm0,yword ptr ds:[rdx]
>vmovups ymm1,yword ptr ds:[r8]
>vsubps ymm0,ymm0,yword ptr ds:[rax]
>vmovups ymm2,yword ptr ds:[r9]
>vfmadd213ps ymm0,ymm0,yword ptr ds:[<_D12euclideandst4initG8f>]
>vsubps ymm1,ymm1,yword ptr ds:[r11]
>vfmadd213ps ymm1,ymm1,ymm0
>vsubps ymm0,ymm2,yword ptr ds:[r10]
>vfmadd213ps ymm0,ymm0,ymm1
>vsqrtps ymm0,ymm0
>vmovups yword ptr ds:[rcx],ymm0
>vzeroupper
>ret

The speed difference is near 400% for the same amount of distances, compared with the single-distance function example.
However, the generated assembly isn't the fastest; for example, removing vzeroupper and using the unused and known-zeroed YMM15 register as the zero-filled operand for the first vfmadd213ps instruction improves performance by 10% (70 vs 78 ms for 256 million distances).
The function can then be improved further to use pointer offsets and more registers; this is more efficient and results in a ~500% improvement:
>extern(C)
>void vEUCLIDpsptr_void_40(simdf* a0, simdf* a1, simdf* a2, simdf* a3, simdf* b1, simdf* b2, simdf* b3) {
>	simdf amm0 = init;
>	simdf amm1 = *a1;
>	simdf amm2 = *a2;
>	simdf amm3 = *a3;
>	simdf emm0 = init;
>	simdf emm1 = amm1;
>	simdf emm2 = amm2;//mirror AMM for positions
>	simdf emm3 = amm3;
>	simdf imm0 = init;
>	simdf imm1 = emm1;
>	simdf imm2 = emm2;
>	simdf imm3 = emm3;
>	simdf omm0 = init;
>	simdf omm1 = emm1;
>	simdf omm2 = emm2;
>	simdf omm3 = emm3;
>	simdf umm0 = init;
>	simdf umm1 = omm1;
>	simdf umm2 = omm2;
>	simdf umm3 = omm3;
>	//cascading assignment may not be the fastest way, especially compared to just loading from the pointer!
>
>	static foreach(size_t i; 0..simdlength) {
>		amm1[i] -= (b1[0])[i];
>		amm1[i] *= amm1[i];
>		amm0[i] += amm1[i];
>
>		amm2[i] -= (b2[0])[i];
>		amm2[i] *= amm2[i];
>		amm0[i] += amm2[i];
>
>		amm3[i] -= (b3[0])[i];
>		amm3[i] *= amm3[i];
>		amm0[i] += amm3[i];
>
>		amm0[i] = sqrt(amm0[i]);
>		//template
>		emm1[i] -= (b1[1])[i];
>		emm1[i] *= emm1[i];
>		emm0[i] += emm1[i];
>
>		emm2[i] -= (b2[1])[i];
>		emm2[i] *= emm2[i];
>		emm0[i] += emm2[i];
>
>		emm3[i] -= (b3[1])[i];
>		emm3[i] *= emm3[i];
>		emm0[i] += emm3[i];
>
>		emm0[i] = sqrt(emm0[i]);
>		//
>		imm1[i] -= (b1[2])[i];
>		imm1[i] *= imm1[i];
>		imm0[i] += imm1[i];
>
>		imm2[i] -= (b2[2])[i];
>		imm2[i] *= imm2[i];
>		imm0[i] += imm2[i];
>
>		imm3[i] -= (b3[2])[i];
>		imm3[i] *= imm3[i];
>		imm0[i] += imm3[i];
>
>		imm0[i] = sqrt(imm0[i]);
>		//
>		omm1[i] -= (b1[3])[i];
>		omm1[i] *= omm1[i];
>		omm0[i] += omm1[i];
>
>		omm2[i] -= (b2[3])[i];
>		omm2[i] *= omm2[i];
>		omm0[i] += omm2[i];
>
>		omm3[i] -= (b3[3])[i];
>		omm3[i] *= omm3[i];
>		omm0[i] += omm3[i];
>
>		omm0[i] = sqrt(omm0[i]);
>		//
>		umm1[i] -= (b1[4])[i];
>		umm1[i] *= umm1[i];
>		umm0[i] += umm1[i];
>
>		umm2[i] -= (b2[4])[i];
>		umm2[i] *= umm2[i];
>		umm0[i] += umm2[i];
>
>		umm3[i] -= (b3[4])[i];
>		umm3[i] *= umm3[i];
>		umm0[i] += umm3[i];
>
>		umm0[i] = sqrt(umm0[i]);
>
>	}
>
>	//Sadly, writing to the arrays from within the static foreach causes LDC to not use SIMD at all
>	*a0 = amm0;
>	*(a0+1) = emm0;
>	*(a0+2) = imm0;
>	*(a0+3) = omm0;
>	*(a0+4) = umm0;
>	return;
>
>}

>sub rsp,38
>vmovaps xmmword ptr ss:[rsp+20],xmm8
>vmovaps xmmword ptr ss:[rsp+10],xmm7
>vmovaps xmmword ptr ss:[rsp],xmm6
>mov r10,qword ptr ss:[rsp+70]
>mov r11,qword ptr ss:[rsp+68]
>mov rax,qword ptr ss:[rsp+60]
>vmovaps ymm1,yword ptr ds:[<_D12euclideandst4initG8f>]
>vmovups ymm3,yword ptr ds:[rdx]
>vmovups ymm2,yword ptr ds:[r8]
>vmovups ymm0,yword ptr ds:[r9]
>vsubps ymm4,ymm3,yword ptr ds:[rax]
>vfmadd213ps ymm4,ymm4,ymm1
>vsubps ymm5,ymm2,yword ptr ds:[r11]
>vfmadd213ps ymm5,ymm5,ymm4
>vsubps ymm4,ymm0,yword ptr ds:[r10]
>vfmadd213ps ymm4,ymm4,ymm5
>vsqrtps ymm4,ymm4
>vsubps ymm5,ymm3,yword ptr ds:[rax+20]
>vfmadd213ps ymm5,ymm5,ymm1
>vsubps ymm6,ymm2,yword ptr ds:[r11+20]
>vfmadd213ps ymm6,ymm6,ymm5
>vsubps ymm5,ymm0,yword ptr ds:[r10+20]
>vfmadd213ps ymm5,ymm5,ymm6
>vsqrtps ymm5,ymm5
>vsubps ymm6,ymm3,yword ptr ds:[rax+40]
>vsubps ymm7,ymm2,yword ptr ds:[r11+40]
>vfmadd213ps ymm6,ymm6,ymm1
>vfmadd213ps ymm7,ymm7,ymm6
>vsubps ymm6,ymm0,yword ptr ds:[r10+40]
>vfmadd213ps ymm6,ymm6,ymm7
>vsqrtps ymm6,ymm6
>vsubps ymm7,ymm3,yword ptr ds:[rax+60]
>vfmadd213ps ymm7,ymm7,ymm1
>vsubps ymm8,ymm2,yword ptr ds:[r11+60]
>vfmadd213ps ymm8,ymm8,ymm7
>vsubps ymm7,ymm0,yword ptr ds:[r10+60]
>vfmadd213ps ymm7,ymm7,ymm8
>vsqrtps ymm7,ymm7
>vsubps ymm3,ymm3,yword ptr ds:[rax+80]
>vfmadd213ps ymm3,ymm3,ymm1
>vsubps ymm1,ymm2,yword ptr ds:[r11+80]
>vfmadd213ps ymm1,ymm1,ymm3
>vsubps ymm0,ymm0,yword ptr ds:[r10+80]
>vfmadd213ps ymm0,ymm0,ymm1
>vsqrtps ymm0,ymm0
>vmovups yword ptr ds:[rcx],ymm4
>vmovups yword ptr ds:[rcx+20],ymm5
>vmovups yword ptr ds:[rcx+40],ymm6
>vmovups yword ptr ds:[rcx+60],ymm7
>vmovups yword ptr ds:[rcx+80],ymm0
>vmovaps xmm6,xmmword ptr ss:[rsp]
>vmovaps xmm7,xmmword ptr ss:[rsp+10]
>vmovaps xmm8,xmmword ptr ss:[rsp+20]
>add rsp,38
>vzeroupper
>ret

This is slightly more efficient (54 vs 78 ms for the same amount of distances, 256 million), but still far from perfect. I also do not know what it is doing with the XMM registers, given that I can safely remove those instructions with a debugger and it doesn't crash or corrupt data (sometimes it was doing this for 0x100 bytes).
March 07, 2021
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat wrote:
> On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
>> How does one optimize code to make full use of the CPU's SIMD capabilities?
>> Is there any way to guarantee that "packed" versions of SIMD instructions will be used?(e.g. vmulps, vsqrtps, etc...)
>
> https://code.dlang.org/packages/intel-intrinsics

I'd try to use it, but the platform I'm building on requires AVX to get the most performance.


March 07, 2021
On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:
>>>   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
>>>       distance += a[i].abs;//abs required by the caller
>
> (a * a) above is always positive for real numbers. You don't need the call to abs unless you're trying to guarantee that even nan values will have a clear sign bit.
>
I do not know why, but the caller's performance nosedives whenever there is no .abs at this particular line (there's a 3x difference, no joke).
Same for assignment instead of addition, but with a 2x difference instead.
March 07, 2021
On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
> On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat wrote:
>> On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
>>> How does one optimize code to make full use of the CPU's SIMD capabilities?
>>> Is there any way to guarantee that "packed" versions of SIMD instructions will be used?(e.g. vmulps, vsqrtps, etc...)
>>
>> https://code.dlang.org/packages/intel-intrinsics
>
> I'd try to use it but the platform i'm building on requires AVX to get the most performance.

The code below might be worth a try on your AVX512 machine.

Unless you're looking for a combined result, you might need to separate out the memory access overhead by running multiple passes over a "known optimal for L2" data set.

Also note that I compiled with -preview=in.  I don't know if that matters.



import std.math : sqrt;
enum SIMDBits = 512; // 256 was tested, 512 was not
alias A = float[SIMDBits / (float.sizeof * 8)];
pragma(inline, true)
    void soaEuclidean(ref A a0, in A a1, in A a2, in A a3, in A b1, in A b2, in A b3)
{
    alias V = __vector(A);
    static V vsqrt(V v)
    {
        A a = cast(A) v;
        static foreach (i; 0 .. A.length)
            a[i] = sqrt(a[i]);
        return cast(V)a;
    }

    static V sd(in A a, in A b)
    {
        V v = cast(V) b - cast(V) a;
        return v * v;
    }

    auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3);
    a0[] = vsqrt(v)[];
}
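
To separate out the memory overhead as suggested above, a rough timing harness (assuming the soaEuclidean above, with a working set sized to stay in L2) could look like:

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

void main()
{
    enum N = 512;          // 7 arrays of N * A.sizeof bytes -- small enough for L2
    enum passes = 10_000;  // repeat so compute dominates the first-touch memory cost
    static A[N] a0, a1, a2, a3, b1, b2, b3;

    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. passes)
        foreach (i; 0 .. N)
            soaEuclidean(a0[i], a1[i], a2[i], a3[i], b1[i], b2[i], b3[i]);
    sw.stop();
    writeln(sw.peek.total!"msecs", " ms for ", passes * N, " calls");
}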

