Thread overview
What is the better signature for this?
Oct 09, 2022
Guillaume Piolat
Oct 09, 2022
Guillaume Piolat
Oct 10, 2022
Bruce Carneal
Oct 10, 2022
Guillaume Piolat
Oct 11, 2022
Johan
Oct 11, 2022
Kagamin
Oct 11, 2022
Bruce Carneal
October 09, 2022

Consider the following "intrinsic" signature.

__m256i _mm256_loadu_si256 (const(__m256i)* mem_addr) pure @trusted;  // (A)

The Intel intrinsic signatures have the problem that you must pass a pointer to an implicitly aligned __m256i (aka long4), even though the pointer doesn't need to be aligned for an unaligned load. So this is playing a bit with the type system. Inside the "intrinsic" implementation, nothing should rely on that non-existent alignment. So far, this hasn't blown up.
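
A minimal sketch of what "not relying on the alignment" could look like for a function shaped like (A). Vec256 and loadUnaligned are hypothetical stand-ins (the real type is __m256i / long4, and a real implementation would presumably use a compiler builtin rather than a byte loop). The only point is that the body never reads through the pointer as an aligned vector.

```d
// Hypothetical stand-in for __m256i: 32 bytes with 32-byte alignment.
align(32) struct Vec256
{
    long[4] lanes;
}

// Same shape as signature (A): the parameter type claims 32-byte alignment,
// but the body only reads single bytes, so it never depends on that claim.
Vec256 loadUnaligned(const(Vec256)* mem_addr) pure nothrow @nogc @trusted
{
    Vec256 result;
    auto src = cast(const(ubyte)*) mem_addr;
    auto dst = cast(ubyte*) &result;
    foreach (i; 0 .. Vec256.sizeof)
        dst[i] = src[i];
    return result;
}

unittest
{
    ubyte[64] buffer;
    foreach (i, ref b; buffer)
        b = cast(ubyte) i;

    // &buffer[1] carries no alignment guarantee; the cast is the caller's burden.
    auto v = loadUnaligned(cast(const(Vec256)*) &buffer[1]);
    assert((cast(ubyte*) &v)[0] == 1);
    assert((cast(ubyte*) &v)[31] == 32);
}
```

The cast at the call site is where the type system is being bent: the pointer is typed as if it were aligned, and only the implementation's discipline keeps that harmless.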

It is tempting to fix that and just take a long* or void* instead.

__m256i _mm256_loadu_si256 (const(void)* mem_addr) pure @system;      // (B)

However, in that case, the function is not @trusted anymore, but becomes @system.
Indeed, it is safe to dereference a pointer, but not index from it.
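
To make that distinction concrete, here is a small sketch of which pointer operations the compiler accepts in @safe code. loadC is a hypothetical stand-in that works on a plain long[4] rather than a real vector type; an actual intrinsic would call a builtin and be @trusted instead of @safe.

```d
// Dereferencing a typed pointer is accepted in @safe code:
// the pointee type says how many bytes are read.
long[4] loadC(const(long[4])* mem_addr) pure nothrow @nogc @safe
{
    return *mem_addr;
}

// Indexing from a pointer and pointer arithmetic are rejected in @safe code.
static assert(!__traits(compiles, (const(long)* p) @safe => p[3]));
static assert(!__traits(compiles, (const(long)* p) @safe => *(p + 3)));

// Casting const(void)* to a typed pointer is rejected as well, which is
// why a void* signature cannot promise anything about the memory it reads.
static assert(!__traits(compiles,
    (const(void)* p) @safe => *cast(const(long[4])*) p));

unittest
{
    long[4] a = [1, 2, 3, 4];
    assert(loadC(&a) == a);
}
```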

What about float[4] then? We can get back @trusted.

 __m256i _mm256_loadu_si256 (const(float[4])* mem_addr) pure @trusted; // (C)

Then, we lose compatibility with intrinsics code originally written in C++. Casting to const(float[4])* is even more annoying to type than casting to const(__m256i)*.

What do you think is the better signature?
I'd prefer to go A > B > C, but figured I might be missing something.

October 09, 2022

On Sunday, 9 October 2022 at 19:44:13 UTC, Guillaume Piolat wrote:

> What about float[4] then? We can get back @trusted.
>
>  __m256i _mm256_loadu_si256 (const(float[4])* mem_addr) pure @trusted; // (C)
>
> Then, we lose compatibility with intrinsics code originally written in C++. Casting to const(float[4])* is even more annoying to type than casting to const(__m256i)*.

Erratum: it is long[4], not float[4]

October 10, 2022

On Sunday, 9 October 2022 at 19:44:13 UTC, Guillaume Piolat wrote:

> Consider the following "intrinsic" signature.
>
> ...
>
> What do you think is the better signature?
> I'd prefer to go A > B > C, but figured I might be missing something.

Could using the static array representation type of the vector (.array) be a useful idiom here? I ask because I don't know the constraints/preferences of veteran intrinsic programmers. That idiom does work well in other SIMD formulations but may not be well suited here.

October 10, 2022

On Monday, 10 October 2022 at 12:31:04 UTC, Bruce Carneal wrote:

> On Sunday, 9 October 2022 at 19:44:13 UTC, Guillaume Piolat wrote:
>
>> Consider the following "intrinsic" signature.
>>
>> ...
>>
>> What do you think is the better signature?
>> I'd prefer to go A > B > C, but figured I might be missing something.
>
> Could using the static array representation type of the vector (.array) be a useful idiom here? I ask because I don't know the constraints/preferences of veteran intrinsic programmers. That idiom does work well in other SIMD formulations but may not be well suited here.

That is solution C.
It could work.

The slight problem is that functions taking __m128i* use it to mean "any packed integers occupying 128 bits", and it's not immediately obvious that __m128i is int4 and __m256i is long4; it's rather counterintuitive. Semantically, it could just as well be short8 or byte16...

GCC vectors can be unaligned, and there are types for it (eg: __m128i_u), but I don't think the other compilers can do that. That would be a prime contender.

October 11, 2022

On Monday, 10 October 2022 at 12:44:18 UTC, Guillaume Piolat wrote:

> GCC vectors can be unaligned, and there are types for it (eg: __m128i_u), but I don't think the other compilers can do that. That would be a prime contender.

I think you should be able to define the unaligned type like this:

struct __m128u {
    align(1) __m128 data;
    alias data this;
}

It works, but I am not 100% sure that this type will always behave the same (ABI-wise) as __m128 when used as a value, e.g. when passed to a function (for void fun(__m128u a, __m128u b), are the arguments passed in SIMD registers?).

But unfortunately currently it runs into this LDC bug: https://github.com/ldc-developers/ldc/issues/4236 .
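
For illustration only, here is the same wrapper pattern on a plain struct instead of a real vector type (which sidesteps the compiler issue linked above). Vec128, Vec128u, Packet and firstLane are hypothetical names. The sketch only checks layout; it says nothing about the calling-convention question.

```d
// Hypothetical stand-in for __m128: 16 bytes with 16-byte alignment.
align(16) struct Vec128
{
    float[4] lanes;
}

// The wrapper pattern: the field's alignment requirement is lowered to 1.
struct Vec128u
{
    align(1) Vec128 data;
    alias data this;
}

static assert(Vec128.alignof == 16);
static assert(Vec128u.alignof == 1);
static assert(Vec128u.sizeof == 16);

// Because Vec128u only needs 1-byte alignment, it can sit at an odd offset.
struct Packet
{
    ubyte tag;
    Vec128u payload;
}

static assert(Packet.payload.offsetof == 1);
static assert(Packet.sizeof == 17);

// Takes the aligned type by value; a Vec128u converts through alias this.
float firstLane(Vec128 v)
{
    return v.lanes[0];
}

unittest
{
    Packet p;
    p.payload.lanes = [1f, 2f, 3f, 4f];
    assert(firstLane(p.payload) == 1f);
}
```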

cheers,
Johan

October 11, 2022

On Monday, 10 October 2022 at 12:44:18 UTC, Guillaume Piolat wrote:

> That is solution C.
> It could work.

Static array works like this:

int Load4LE(in ref ubyte[4] b) pure
{
	return (b[3]<<24)|(b[2]<<16)|(b[1]<<8)|b[0];
}

ubyte[] data;
int val=Load4LE(data[0..4]);

It's safe, bound checked, ctfeable, no casts.
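
The same idiom stretched to 256 bits might look like the sketch below. loadu256 is a hypothetical name, and assembling the lanes byte by byte is just an assumption to keep the example portable; whether a given compiler turns it into a single unaligned load is not something this sketch guarantees.

```d
// Hypothetical 256-bit counterpart of Load4LE: takes a reference to
// exactly 32 bytes and assembles four little-endian 64-bit lanes.
ulong[4] loadu256(ref const ubyte[32] mem) pure nothrow @nogc @safe
{
    ulong[4] lanes;
    foreach (lane; 0 .. 4)
        foreach (i; 0 .. 8)
            lanes[lane] |= cast(ulong) mem[lane * 8 + i] << (8 * i);
    return lanes;
}

@safe unittest
{
    ubyte[] data = new ubyte[40];
    data[3] = 0x01;   // byte 0 of lane 0 when loading from offset 3
    data[4] = 0x02;   // byte 1 of lane 0
    data[11] = 0xFF;  // byte 0 of lane 1

    // A slice with compile-time-known bounds binds to ref ubyte[32];
    // the slice itself is bounds-checked at run time.
    auto v = loadu256(data[3 .. 35]);
    assert(v[0] == 0x0201);
    assert(v[1] == 0xFF);
    assert(v[2] == 0 && v[3] == 0);
}
```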

> The slight problem is that functions taking __m128i* use it to mean "any packed integers occupying 128 bits", and it's not immediately obvious that __m128i is int4 and __m256i is long4; it's rather counterintuitive. Semantically, it could just as well be short8 or byte16...

There's no solution, only tradeoffs.

October 11, 2022

On Tuesday, 11 October 2022 at 12:51:13 UTC, Kagamin wrote:

> On Monday, 10 October 2022 at 12:44:18 UTC, Guillaume Piolat wrote:
>
>> That is solution C.
>> It could work.
>
> Static array works like this:
>
> int Load4LE(in ref ubyte[4] b) pure
> {
> 	return (b[3]<<24)|(b[2]<<16)|(b[1]<<8)|b[0];
> }
>
> ubyte[] data;
> int val=Load4LE(data[0..4]);
>
> It's safe, bound checked, ctfeable, no casts.

Yes. Starting at line 57 you'll find examples of the above for a target-adaptive/generic environment: https://godbolt.org/z/qW6PYT3Yd

I've not found a way to trigger those one-instruction unaligned loads from DMD, but LDC and GDC are doing great.