Thread overview
Using SSE3 vector shuffle with LDC
May 26, 2019
KytoDragon
May 26, 2019
kinke
May 26, 2019
KytoDragon
May 26, 2019
KytoDragon
May 26, 2019
kinke
May 29, 2019
Guillaume Piolat
May 26, 2019
Nicholas Wilson
May 26, 2019
I have been trying to port some programs to D that heavily use SSE instructions.
In particular, I still need _mm_shuffle_epi8, _mm_alignr_epi8 and _mm_aesdec_si128.
LDC does not support the core.simd approach, and ldc.simd only supports a few operations, including a vector shuffle with a fixed mask (I need a variable mask).
So how would one go about using these with LDC?

I need to be able to:
- consistently generate SSE instructions, even in debug builds.
- inline the function.

I have been unable to find a solution using either the simd package, inline asm or inline llvm-ir.
May 26, 2019
On Sunday, 26 May 2019 at 12:10:30 UTC, KytoDragon wrote:
> I have been trying to port some programs to D that heavily use SSE instructions.
> In particular, I still need _mm_shuffle_epi8, _mm_alignr_epi8 and _mm_aesdec_si128.
> LDC does not support the core.simd approach, and ldc.simd only supports a few operations, including a vector shuffle with a fixed mask (I need a variable mask).
> So how would one go about using these with LDC?
>
> I need to be able to:
> - consistently generate SSE instructions, even in debug builds.
> - inline the function.
>
> I have been unable to find a solution using either the simd package, inline asm or inline llvm-ir.

There's https://github.com/AuburnSounds/intel-intrinsics which tries to be compatible with the Intel intrinsic names.

_mm_aesdec_si128 is available in ldc.gccbuiltins_x86 as __builtin_ia32_aesdec128; _mm_shuffle_epi8 as __builtin_ia32_pshufb128. Make sure to specify that the instructions are available via something like `-mattr=+ssse3` in the LDC command line.
I haven't found something corresponding to _mm_alignr_epi8, but inline asm can always be used. Here's an example for a manual __builtin_ia32_pshufb128 using LLVM inline assembly:

alias byte16 = __vector(byte[16]);

version (Manual)
{
    pragma(inline, true)
    byte16 _mm_shuffle_epi8(byte16 a, byte16 b)
    {
        import ldc.llvmasm;
        return __asm!byte16("pshufb $2, $1", "=x,0,x", a, b);
    }
}
else
{
    import ldc.gccbuiltins_x86 : _mm_shuffle_epi8 = __builtin_ia32_pshufb128;
}

void main()
{
    byte16 a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ];
    byte16 b = [ -1, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 ];
    const actual = _mm_shuffle_epi8(a, b);
    byte16 expected = b;
    expected[0] = 0;
    assert(actual == expected);
}
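
For checking a wrapper like the one above on more inputs, it can help to compare it against a plain scalar model of what PSHUFB does. The following is only a sketch: `shuffleReference` is a made-up helper name, not something from LDC or intel-intrinsics, and it works on ordinary static arrays so it needs no SSE support at all.

```d
// Scalar reference model of PSHUFB / _mm_shuffle_epi8.
// `shuffleReference` is a hypothetical helper for testing, not a library function.
byte[16] shuffleReference(const byte[16] a, const byte[16] mask)
{
    byte[16] result; // byte.init is 0, so every lane starts out zeroed
    foreach (i; 0 .. 16)
    {
        // A set high bit in the mask byte zeroes the lane;
        // otherwise the low four bits select a byte from `a`.
        if ((mask[i] & 0x80) == 0)
            result[i] = a[mask[i] & 0x0F];
    }
    return result;
}

unittest
{
    // Same data as the example above.
    byte[16] a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ];
    byte[16] b = [ -1, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 ];
    byte[16] expected = b;
    expected[0] = 0; // the mask byte -1 has its high bit set
    assert(shuffleReference(a, b) == expected);
}
```

Because `a` holds 0 to 15 in order, the result equals the mask itself except for the lane whose mask byte has the high bit set, which is the same reasoning the `expected` value in the example relies on.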
May 26, 2019
On Sunday, 26 May 2019 at 12:10:30 UTC, KytoDragon wrote:
> I have been trying to port some programs to D that heavily use SSE instructions.
> In particular, I still need _mm_shuffle_epi8, _mm_alignr_epi8 and _mm_aesdec_si128.
> LDC does not support the core.simd approach, and ldc.simd only supports a few operations, including a vector shuffle with a fixed mask (I need a variable mask).
> So how would one go about using these with LDC?
>
> I need to be able to:
> - consistently generate SSE instructions, even in debug builds.
> - inline the function.
>
> I have been unable to find a solution using either the simd package, inline asm or inline llvm-ir.

Have you seen https://github.com/AuburnSounds/intel-intrinsics ? (See also http://dconf.org/2019/talks/piolat.html)
May 26, 2019
On Sunday, 26 May 2019 at 13:54:32 UTC, kinke wrote:
> There's https://github.com/AuburnSounds/intel-intrinsics which tries to be compatible with the Intel intrinsic names.
>
> _mm_aesdec_si128 is available in ldc.gccbuiltins_x86 as __builtin_ia32_aesdec128; _mm_shuffle_epi8 as __builtin_ia32_pshufb128. Make sure to specify that the instructions are available via something like `-mattr=+ssse3` in the LDC command line.
> I haven't found something corresponding to _mm_alignr_epi8, but inline asm can always be used. Here's an example for a manual __builtin_ia32_pshufb128 using LLVM inline assembly:
> <skip>

Thank you! I already have the intel-intrinsics package; it just didn't have these specific ones. I also didn't know about the SSE compiler option and have got that working now.
Concerning inline asm, I thought that it prevented inlining. I will try out ldc.llvmasm.
May 26, 2019
After tinkering with ldc.llvmasm (and figuring out that the asm arguments are specified in reverse order) I have got everything working. E.g.

__m128i _mm_alignr_epi8(u8 count)(__m128i A, __m128i B) {
    return __asm!__m128i("palignr $3, $2, $1", "=x,0,x,i", A, B, count);
}

Thank you again!
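
For anyone wanting to sanity-check a wrapper like this, here is a scalar model of PALIGNR as a rough sketch. `alignrReference` is a hypothetical helper name, not part of LDC or intel-intrinsics. It assumes the Intel argument convention for _mm_alignr_epi8, where the first argument is the high half of the 32-byte concatenation, and it takes the byte count as a template parameter to mirror the compile-time immediate.

```d
// Scalar reference model of PALIGNR / _mm_alignr_epi8.
// `alignrReference` is a hypothetical helper for testing, not a library function.
ubyte[16] alignrReference(uint count)(const ubyte[16] hi, const ubyte[16] lo)
{
    ubyte[16] result; // lanes shifted in from beyond the 32 bytes stay 0
    foreach (i; 0 .. 16)
    {
        // Index into the 32-byte concatenation hi:lo after shifting
        // it right by `count` bytes.
        const src = i + count;
        if (src < 16)
            result[i] = lo[src];
        else if (src < 32)
            result[i] = hi[src - 16];
    }
    return result;
}

unittest
{
    ubyte[16] lo, hi;
    foreach (i; 0 .. 16)
    {
        lo[i] = cast(ubyte) i;        // 0 .. 15
        hi[i] = cast(ubyte) (16 + i); // 16 .. 31
    }
    const shifted = alignrReference!4(hi, lo);
    foreach (i; 0 .. 16)
        assert(shifted[i] == 4 + i); // bytes 4 .. 19 of the concatenation
    assert(alignrReference!16(hi, lo) == hi);
}
```

Comparing the asm version against this on a few counts (0, a value below 16, 16, and a value above 16) should catch swapped operands quickly.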
May 26, 2019
On Sunday, 26 May 2019 at 16:35:48 UTC, KytoDragon wrote:
> After tinkering with ldc.llvmasm (and figuring out that the asm arguments are specified in reverse order) I have got everything working. E.g.
>
> __m128i _mm_alignr_epi8(u8 count)(__m128i A, __m128i B) {
>     return __asm!__m128i("palignr $3, $2, $1", "=x,0,x,i", A, B, count);
> }
>
> Thank you again!

Excellent. Wrt. order, yeah, LLVM uses AT&T syntax. Guillaume would surely welcome an intel-intrinsics PR. :)
May 29, 2019
On Sunday, 26 May 2019 at 16:40:58 UTC, kinke wrote:
> On Sunday, 26 May 2019 at 16:35:48 UTC, KytoDragon wrote:
>> After tinkering with ldc.llvmasm (and figuring out that the asm arguments are specified in reverse order) I have got everything working. E.g.
>>
>> __m128i _mm_alignr_epi8(u8 count)(__m128i A, __m128i B) {
>>     return __asm!__m128i("palignr $3, $2, $1", "=x,0,x,i", A, B, count);
>> }
>>
>> Thank you again!
>
> Excellent. Wrt. order, yeah, LLVM uses AT&T syntax. Guillaume would surely welcome an intel-intrinsics PR. :)

Absolutely, SSE3 up to SSE4.2 are on the roadmap; there was just a lack of people showing up with more needs.