Thread overview
LDC inline assembly expressions
Dec 23, 2018
NaN
Dec 23, 2018
NaN
Dec 23, 2018
kinke
Dec 23, 2018
NaN
Dec 23, 2018
kinke
Dec 23, 2018
NaN
Dec 24, 2018
NaN
Dec 24, 2018
NaN
Jan 01, 2019
Guillaume Piolat
December 23, 2018
OK, so I'm delving into LDC intrinsics and hit a wall: I can't find the

_mm_cmpgt_epi32

instruction anywhere. It looks like it's not included, neither in gcc_builtins nor anywhere else. I'm using the wrapper lib that gives you Intel-style intrinsics from

https://github.com/AuburnSounds/intel-intrinsics

And from what I can tell, if it's not an LLVM intrinsic and not in the GCC builtins, you're out of luck. So I wondered if I could use inline assembly expressions, but I'm obviously missing something. I've got as far as...

// load dst into EAX and return it
int4 _mm_cmpgt_epi32(int4 a, int4 b) {
  return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
}

but get the error...

error: non-trivial scalar-to-vector conversion, possible invalid constraint for vector type

Compiler returned: 1


December 23, 2018
On Sunday, 23 December 2018 at 01:27:40 UTC, NaN wrote:
> OK, so I'm delving into LDC intrinsics and hit a wall: I can't find the
>
> _mm_cmpgt_epi32
>
> instruction anywhere. It looks like it's not included, neither in gcc_builtins nor anywhere else. I'm using the wrapper lib that gives you Intel-style intrinsics from
>
> https://github.com/AuburnSounds/intel-intrinsics
>
> And from what I can tell, if it's not an LLVM intrinsic and not in the GCC builtins, you're out of luck. So I wondered if I could use inline assembly expressions, but I'm obviously missing something. I've got as far as...
>
> // load dst into EAX and return it
> int4 _mm_cmpgt_epi32(int4 a, int4 b) {
>   return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
> }
>
> but get the error...
>
> error: non-trivial scalar-to-vector conversion, possible invalid constraint for vector type
>
> Compiler returned: 1

Ignore the comment line; that's just a leftover copy and paste from the wiki example.
December 23, 2018
On Sunday, 23 December 2018 at 01:27:40 UTC, NaN wrote:
> int4 _mm_cmpgt_epi32(int4 a, int4 b) {
>   return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
> }

This is a working variant (`r` is a GP register, `x` a vector register for x86, see https://llvm.org/docs/LangRef.html#supported-constraint-code-list):

int4 _mm_cmpgt_epi32(int4 a, int4 b) {
  int4 r = void;
  __asm("pcmpgtd $1,$0; movdqa $0,$2", "x,x,*m", a, b, &r);
  return r;
}
December 23, 2018
On Sunday, 23 December 2018 at 12:54:01 UTC, kinke wrote:
> On Sunday, 23 December 2018 at 01:27:40 UTC, NaN wrote:
>> int4 _mm_cmpgt_epi32(int4 a, int4 b) {
>>   return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
>> }
>
> This is a working variant (`r` is a GP register, `x` a vector register for x86, see https://llvm.org/docs/LangRef.html#supported-constraint-code-list):
>
> int4 _mm_cmpgt_epi32(int4 a, int4 b) {
>   int4 r = void;
>   __asm("pcmpgtd $1,$0; movdqa $0,$2", "x,x,*m", a, b, &r);
>   return r;
> }

Hi, thanks. I was just coming here to post the solution, as I figured it out after following the link to the LLVM docs.

Is there any difference between using this vs the other method of doing intrinsics?


December 23, 2018
On Sunday, 23 December 2018 at 13:00:54 UTC, NaN wrote:
> Is there any difference between using this vs the other method of doing intrinsics?

Assuming there's really no LLVM intrinsic for your desired instruction, the manual variant is what it is, a regular function with an inline asm expression. I guess the LLVM backends lower calls to these instruction-intrinsics directly to inline asm expressions in the caller. With inlining, it might result in equivalent final asm.

My version above with the memory indirection isn't nice, this is better:

extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
  return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
}

and is going to be inlined with `-O`.

Note that if you used equivalent naked DMD-style inline asm instead, e.g.,

extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
  asm {
    naked;
    pcmpgtd XMM0, XMM1;
    ret;
  }
}

that is lowered to *module*-level inline asm and the function is NOT inline-able.
December 23, 2018
On Sunday, 23 December 2018 at 13:33:51 UTC, kinke wrote:
> On Sunday, 23 December 2018 at 13:00:54 UTC, NaN wrote:
>> Is there any difference between using this vs the other method of doing intrinsics?
>
> Assuming there's really no LLVM intrinsic for your desired instruction, the manual variant is what it is, a regular function with an inline asm expression. I guess the LLVM backends lower calls to these instruction-intrinsics directly to inline asm expressions in the caller. With inlining, it might result in equivalent final asm.
>
> My version above with the memory indirection isn't nice, this is better:
>
> extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
>   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
> }
>
> and is going to be inlined with `-O`.

That's pretty much what I've got. I've been using Compiler Explorer so I can see what actually gets generated. It's been quite an eye-opener how good the LLVM optimizer is, tbh.

> Note that if you used equivalent naked DMD-style inline asm instead, e.g.,
>
> extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
>   asm {
>     naked;
>     pcmpgtd XMM0, XMM1;
>     ret;
>   }
> }
>
> that is lowered to *module*-level inline asm and the function is NOT inline-able.

I'm ignoring DMD since it kills performance by about 60% anyway.
December 24, 2018
On Sunday, 23 December 2018 at 13:33:51 UTC, kinke wrote:
> On Sunday, 23 December 2018 at 13:00:54 UTC, NaN wrote:
>> Is there any difference between using this vs the other method of doing intrinsics?
>
> Assuming there's really no LLVM intrinsic for your desired instruction, the manual variant is what it is, a regular function with an inline asm expression. I guess the LLVM backends lower calls to these instruction-intrinsics directly to inline asm expressions in the caller. With inlining, it might result in equivalent final asm.
>
> My version above with the memory indirection isn't nice, this is better:
>
> extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
>   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
> }

So I had this...

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
  return __asm!__m128i("pcmpgtd $2,$1","=x,x,x",a,b);
}

It looked OK at first, but it's actually wrong: the cmp instruction writes to $1, which is actually 'a', and it doesn't write anything to $0, which is the return value. So it overwrites one of the inputs and doesn't write the output. It actually needs to be this...

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
    return __asm!__m128i("
        movdqu $1,$0
        pcmpgtd $2,$0",
        "=x,x,x", a,b);
}

Basically, copy 'a' to the output, then compare the output with 'b'.

I don't think there's any way to get around the temporary copy, since it depends on knowing whether 'a' is ever used after it's used in the compare. And it doesn't seem like the optimiser can cull it away in this case.
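
For anyone following along, here is a minimal scalar sketch of the semantics described above: pcmpgtd is a two-operand instruction that overwrites its destination with an all-ones mask (-1) in each lane where the destination is greater than the source (signed), and 0 otherwise. The name `pcmpgtdModel` is hypothetical and only models the instruction in plain D; it is not part of LDC or intel-intrinsics.

```d
// Scalar model of PCMPGTD's destructive two-operand form.
// `pcmpgtdModel` is a hypothetical helper for illustration only.
void pcmpgtdModel(ref int[4] dst, const int[4] src)
{
    foreach (i; 0 .. 4)
        dst[i] = dst[i] > src[i] ? -1 : 0; // -1 is the all-ones lane mask
}

unittest
{
    int[4] a = [1, 5, -3, 7];
    int[4] b = [0, 5, 2, -1];
    int[4] r = a;        // models "movdqu $1,$0": copy a into the output
    pcmpgtdModel(r, b);  // models "pcmpgtd $2,$0": compare in place
    assert(r == [-1, 0, 0, -1]);
    assert(a == [1, 5, -3, 7]); // the input is left intact
}
```

The copy into `r` before the compare is what keeps `a` from being overwritten, which is the job of the movdqu in the version above.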




December 24, 2018
On Monday, 24 December 2018 at 01:40:42 UTC, NaN wrote:
>
> I don't think there's any way to get around the temporary copy, since it depends on knowing whether 'a' is ever used after it's used in the compare. And it doesn't seem like the optimiser can cull it away in this case.

OK, I think I've figured it out...

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
  return __asm!__m128i("pcmpgtd $2,$0","=x,0,x",a,b);
}

Basically....

$0 is the return value; the constraint '=x' means it's the output and uses an XMM register.
$1 is 'a'; the constraint '0' means this param uses the same register as $0.
$2 is 'b'; the constraint 'x' means it uses an XMM register.

It's also AT&T syntax, so the operands are reversed from what I'm used to (source first, destination last).

Although $1 is not written in the asm expression, it has been tied to $0 by the '0' constraint. So as far as the compiler is concerned, 'a' comes in on the same register that the output goes out in. Knowing this, it can create a temporary copy of 'a' if it needs to avoid trashing 'a'.

I've done some tests and if you do...

r = _mm_cmpgt_epi32(a,b)

it only creates the temporary if you use 'a' again afterwards.

So it's all working, I think.
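
A small self-checking sketch of the tied-constraint version, assuming LDC targeting x86 or x86-64 with SSE2 (it will not build with DMD or GDC). The function name `cmpgtEpi32` is hypothetical, chosen so it does not clash with the library's own `_mm_cmpgt_epi32`, and `int4` from core.simd stands in for `__m128i`.

```d
// Assumes LDC on x86/x86-64; ldc.llvmasm provides the __asm template.
import core.simd;    // int4
import ldc.llvmasm;  // __asm

// `cmpgtEpi32` is a hypothetical name used for illustration.
int4 cmpgtEpi32(int4 a, int4 b)
{
    // "=x": the result ($0) lives in an XMM register
    // "0" : a ($1) is tied to the same register as $0
    // "x" : b ($2) is passed in any XMM register
    return __asm!int4("pcmpgtd $2,$0", "=x,0,x", a, b);
}

unittest
{
    int4 a = [1, 5, -3, 7];
    int4 b = [0, 5, 2, -1];
    int4 r = cmpgtEpi32(a, b);
    assert(r.array == [-1, 0, 0, -1]); // all-ones where a > b (signed)
    assert(a.array == [1, 5, -3, 7]);  // a is still usable afterwards
}
```

Because `a` is read again after the call, this is the case where the compiler has to keep a copy of it.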

January 01, 2019
On Sunday, 23 December 2018 at 01:27:40 UTC, NaN wrote:
> OK, so I'm delving into LDC intrinsics and hit a wall: I can't find the
>
> _mm_cmpgt_epi32
>
> instruction anywhere. It looks like it's not included, neither in gcc_builtins nor anywhere else. I'm using the wrapper lib that gives you Intel-style intrinsics from
>
> https://github.com/AuburnSounds/intel-intrinsics
>
> And from what I can tell, if it's not an LLVM intrinsic and not in the GCC builtins, you're out of luck. So I wondered if I could use inline assembly expressions, but I'm obviously missing something. I've got as far as...
>

Just a note that after you posted here, the intrinsic has been implemented in the "intel-intrinsics" package through ldc.simd:

https://github.com/AuburnSounds/intel-intrinsics/blob/fa3866dc782b0d2c4a567f6547bdc0b321ada8cc/source/inteli/emmintrin.d#L293

It generates pcmpgtd: https://d.godbolt.org/z/ronCG_
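
A rough sketch of what the ldc.simd route can look like, assuming LDC and its `greaterMask` template from ldc.simd. The wrapper name `cmpgtViaSimd` is hypothetical, and this is not a copy of the code at the linked line; check the link for the package's actual implementation.

```d
// Assumes LDC; ldc.simd is an LDC-specific module.
import core.simd; // int4
import ldc.simd;  // greaterMask (assumed to be available in your LDC version)

// `cmpgtViaSimd` is a hypothetical wrapper name for illustration.
int4 cmpgtViaSimd(int4 a, int4 b)
{
    // Element-wise signed compare: all-ones lanes where a > b, zero elsewhere.
    return greaterMask!int4(a, b);
}

unittest
{
    int4 a = [1, 5, -3, 7];
    int4 b = [0, 5, 2, -1];
    assert(cmpgtViaSimd(a, b).array == [-1, 0, 0, -1]);
}
```

Since no inline asm is involved, the optimizer sees an ordinary vector compare and is free to pick the registers itself.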