Using SSE3 vector shuffle with LDC

May 26, 2019

KytoDragon

I have been trying to port some programs to D that heavily use SSE instructions. In particular, I still need _mm_shuffle_epi8, _mm_alignr_epi8 and _mm_aesdec_si128. LDC does not support the core.simd approach and ldc.simd only supports a few operations, including a vector shuffle with a fixed mask (I need a variable mask). So how would one go about using these with LDC?

I need to be able to:
- consistently generate SSE instructions, even in debug builds.
- inline the function.

I have been unable to find a solution using either the simd package, inline asm or inline llvm-ir.

May 26, 2019

Re: Using SSE3 vector shuffle with LDC

Posted by kinke
in reply to KytoDragon

On Sunday, 26 May 2019 at 12:10:30 UTC, KytoDragon wrote:
> I have been trying to port some programs to D that heavily use SSE instructions.
> In particular, I still need _mm_shuffle_epi8, _mm_alignr_epi8 and _mm_aesdec_si128.
> LDC does not support the core.simd approach and ldc.simd only supports a few operations, including a vector shuffle with a fixed mask (I need a variable mask).
> So how would one go about using these with LDC?
>
> I need to be able to:
> - consistently generate SSE instructions, even in debug builds.
> - inline the function.
>
> I have been unable to find a solution using either the simd package, inline asm or inline llvm-ir.

There's https://github.com/AuburnSounds/intel-intrinsics which tries to be compatible with the Intel intrinsic names.

_mm_aesdec_si128 is available in ldc.gccbuiltins_x86 as __builtin_ia32_aesdec128; _mm_shuffle_epi8 as __builtin_ia32_pshufb128. Make sure to specify that the instructions are available via something like `-mattr=+ssse3` in the LDC command line (the AES builtin additionally needs `+aes`).
I haven't found anything corresponding to _mm_alignr_epi8, but inline asm can always be used. Here's an example of a manual __builtin_ia32_pshufb128 using LLVM inline assembly:

alias byte16 = __vector(byte[16]);

version (Manual)
{
    pragma(inline, true)
    byte16 _mm_shuffle_epi8(byte16 a, byte16 b)
    {
        import ldc.llvmasm;
        return __asm!byte16("pshufb $2, $1", "=x,0,x", a, b);
    }
}
else
{
    import ldc.gccbuiltins_x86 : _mm_shuffle_epi8 = __builtin_ia32_pshufb128;
}

void main()
{
    byte16 a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ];
    byte16 b = [ -1, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 ];
    const actual = _mm_shuffle_epi8(a, b);
    byte16 expected = b;
    expected[0] = 0;
    assert(actual == expected);
}
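
Editorial note: the expected value in the example above follows from how PSHUFB treats each mask byte. If the high bit is set, the result byte is zeroed, otherwise the low four bits select a byte of the first operand. Below is a minimal scalar sketch of that rule in portable D. The name `shuffleBytesRef` is a hypothetical helper, not part of LDC or intel-intrinsics, and the model is only meant as a reference to compare a SIMD wrapper against.

```d
// Scalar reference model of PSHUFB / _mm_shuffle_epi8.
// shuffleBytesRef is a hypothetical helper name (an assumption for illustration).
byte[16] shuffleBytesRef(const byte[16] a, const byte[16] mask)
{
    byte[16] result; // default-initialized to all zeros
    foreach (i; 0 .. 16)
    {
        // High bit set in the mask byte: the result byte stays zero.
        // Otherwise the low four bits of the mask byte index into a.
        if ((mask[i] & 0x80) == 0)
            result[i] = a[mask[i] & 0x0F];
    }
    return result;
}

unittest
{
    // Same inputs as the example above.
    byte[16] a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ];
    byte[16] b = [ -1, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 ];
    byte[16] expected = b;
    expected[0] = 0; // mask byte -1 has its high bit set
    assert(shuffleBytesRef(a, b) == expected);
}
```

Because `a` holds the identity sequence 0..15, every selected byte equals its own mask index, which is why the result matches `b` except for the zeroed first byte. The sketch can be checked with something like `ldc2 -unittest -main -run` on the file.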

On Sunday, 26 May 2019 at 12:10:30 UTC, KytoDragon wrote:
> I have been trying to port some programs to D that heavily use SSE instructions.
> In particular, I still need _mm_shuffle_epi8, _mm_alignr_epi8 and _mm_aesdec_si128.
> LDC does not support the core.simd approach and ldc.simd only supports a few operations, including a vector shuffle with a fixed mask (I need a variable mask).
> So how would one go about using these with LDC?
>
> I need to be able to:
> - consistently generate SSE instructions, even in debug builds.
> - inline the function.
>
> I have been unable to find a solution using either the simd package, inline asm or inline llvm-ir.

Have you seen https://github.com/AuburnSounds/intel-intrinsics? (See also http://dconf.org/2019/talks/piolat.html)

On Sunday, 26 May 2019 at 13:54:32 UTC, kinke wrote:
> There's https://github.com/AuburnSounds/intel-intrinsics which tries to be compatible with the Intel intrinsic names.
>
> _mm_aesdec_si128 is available in ldc.gccbuiltins_x86 as __builtin_ia32_aesdec128; _mm_shuffle_epi8 as __builtin_ia32_pshufb128. Make sure to specify that the instructions are available via something like `-mattr=+ssse3` in the LDC command line.
> I haven't found something corresponding to _mm_alignr_epi8, but inline asm can always be used. Here's an example for a manual __builtin_ia32_pshufb128 using LLVM inline assembly:
> <skip>

Thank you! I already have the intel-intrinsics package; it just didn't have these specific ones. I also didn't know about the SSE compiler option, and I have that working now. As for inline asm, I thought it prevented inlining. I will try out ldc.llvmasm.

After tinkering with ldc.llvmasm (and figuring out that the asm arguments are specified in reverse order) I have got everything working. E.g.

__m128i _mm_alignr_epi8(u8 count)(__m128i A, __m128i B) {
    return __asm!__m128i("palignr $3, $2, $1", "=x,0,x,i", A, B, count);
}

Thank you again!
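
Editorial note: to sanity-check the operand order of a wrapper like the one above, it can help to have a scalar model of what `_mm_alignr_epi8(A, B, count)` is expected to return. The first argument forms the high 16 bytes and the second the low 16 bytes of a 32-byte value, which is shifted right by `count` bytes, and the low 16 bytes are kept. The sketch below is plain D with no SIMD; `alignrRef` is a hypothetical helper name and byte arrays stand in for `__m128i`.

```d
// Scalar reference model of PALIGNR / _mm_alignr_epi8(a, b, count).
// alignrRef is a hypothetical helper name (an assumption for illustration).
// count is a template parameter, mirroring the immediate operand of the instruction.
byte[16] alignrRef(uint count)(const byte[16] a, const byte[16] b)
{
    byte[32] joined;
    joined[0 .. 16] = b[];  // second argument supplies the low 16 bytes
    joined[16 .. 32] = a[]; // first argument supplies the high 16 bytes

    byte[16] result; // default-initialized to all zeros
    foreach (i; 0 .. 16)
    {
        const src = i + count;
        // Bytes shifted in from beyond the 32-byte value stay zero.
        if (src < 32)
            result[i] = joined[src];
    }
    return result;
}

unittest
{
    byte[16] a, b;
    foreach (i; 0 .. 16)
    {
        b[i] = cast(byte) i;        // low half:  0 .. 15
        a[i] = cast(byte)(16 + i);  // high half: 16 .. 31
    }

    const r = alignrRef!4(a, b);
    assert(r[0] == 4 && r[11] == 15);  // tail of b
    assert(r[12] == 16 && r[15] == 19); // head of a
    assert(alignrRef!0(a, b) == b);
    assert(alignrRef!16(a, b) == a);
}
```

Comparing the output of the inline-asm version against this model for a few counts would show quickly whether the two register operands ended up swapped.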

On Sunday, 26 May 2019 at 16:35:48 UTC, KytoDragon wrote:
> After tinkering with ldc.llvmasm (and figuring out that the asm argument a specified in reverse order) i have got everything working. E.g.
>
> __m128i _mm_alignr_epi8(u8 count)(__m128i A, __m128i B) {
>     return __asm!__m128i("palignr $3, $2, $1", "=x,0,x,i", A, B, count);
> }
>
> Thank you again!

Excellent. Wrt. order, yeah, LLVM uses AT&T syntax. Guillaume would surely welcome an intel-intrinsics PR. :)

On Sunday, 26 May 2019 at 16:40:58 UTC, kinke wrote:
> On Sunday, 26 May 2019 at 16:35:48 UTC, KytoDragon wrote:
>> After tinkering with ldc.llvmasm (and figuring out that the asm argument a specified in reverse order) i have got everything working. E.g.
>>
>> __m128i _mm_alignr_epi8(u8 count)(__m128i A, __m128i B) {
>>     return __asm!__m128i("palignr $3, $2, $1", "=x,0,x,i", A, B, count);
>> }
>>
>> Thank you again!
>
> Excellent. Wrt. order, yeah, LLVM uses AT&T syntax. Guillaume would surely welcome an intel-intrinsics PR. :)

Absolutely, SSE3 up to SSE4.2 are on the roadmap; there was just a lack of people showing up with more needs.
