512 bit static array to vector
June 19

Here's a comparison between ldc and gdc converting static arrays to 512 bit vectors:
https://godbolt.org/z/8jxafh76W

A few observations:

  1. LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
  2. LDC emits worse code for the cleaner .array assignment than for the union hack.
  3. LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.

Is improving LLVM/LDC wrt any of the above relatively simple?
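For context, the two conversion styles being compared presumably look something like this (a reconstruction rather than the actual godbolt source; the function names a2vArray/a2vUnion follow the ones discussed later in the thread, details may differ):

```d
alias V = __vector(byte[64]);   // 512-bit vector type

// The cleaner form: assign through the vector's .array property.
V a2vArray(ref const byte[64] a)
{
    V v;
    v.array = a;
    return v;
}

// The "union hack": overlay the static array with the vector.
V a2vUnion(ref const byte[64] a)
{
    union U { byte[64] arr; V vec; }
    U u;
    u.arr = a;
    return u.vec;
}
```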

June 19

On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:

> 1. LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.

Different results (actually using a 512-bit move, not 2x256) with -mattr=avx512bw. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

The biggest difference with gdc is an ABI difference - gdc returning the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).

> 3. LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.

I consider that useful, e.g., it allows using a double4 without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.

June 19

On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:

> On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
>
> > 1. LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
>
> Different results (actually using a 512-bit move, not 2x256) with -mattr=avx512bw. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

An unexpected choice there given that x86-64-v4 requires avx512bw. Still, glad to hear that the narrower specialization works well. It would be part of any multi-target binary where the programmer was concerned about maximum width-sensitive performance across the widest range of machines.

> The biggest difference with gdc is an ABI difference - gdc returning the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).

So, IIUC, gdc and ldc are not interoperable currently but will be once the frontend is updated?

> > 3. LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.
>
> I consider that useful, e.g., it allows using a double4 without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.

I agree. Perhaps a template: maxISAVectorLengthFor(T). If we're getting fancy we could do: maxMicroarchVectorLengthFor(T). Even better if these work correctly in multi-target compilation scenarios and for the expanding set of types (f16, bf16, other?).

Having both variants could be useful when targeting split/paired architectures, as AMD is fond of lately, or the SVE/RVV machines.
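A minimal sketch of what such a trait might look like as a library template; the name comes from the suggestion above, but the implementation is an assumption built on the predefined version identifiers D_SIMD / D_AVX / D_AVX2 (note there is currently no D_AVX512 version, which is part of the gap):

```d
// Hypothetical: element count of the widest ISA-native vector for T,
// derived from compile-time target version identifiers.
template maxISAVectorLengthFor(T)
{
    version (D_AVX2)
        enum size_t maxBytes = 32;  // 256-bit vectors for all element types
    else version (D_AVX)
        // AVX without AVX2: 256-bit float ops, but only 128-bit integer ops
        enum size_t maxBytes = __traits(isFloating, T) ? 32 : 16;
    else version (D_SIMD)
        enum size_t maxBytes = 16;  // SSE baseline
    else
        enum size_t maxBytes = T.sizeof;  // no SIMD: scalar "vectors"

    enum size_t maxISAVectorLengthFor = maxBytes / T.sizeof;
}

// Usage: pick the widest native vector for the current target.
alias WidestFloatVec = __vector(float[maxISAVectorLengthFor!float]);
```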

June 19

On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:

> On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
>
> > 1. LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
>
> Different results (actually using a 512-bit move, not 2x256) with -mattr=avx512bw. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

Note that llvm/ldc chooses a 512-bit, 2-instruction ld/st sequence for a2vUnion given x86-64-v4 as the target, but goes for a 256-bit wide, 4-instruction ld/st sequence in a2vArray.

As you note, -mattr=avx512bw forces a2vArray into the 2-instruction form, but apparently some difference in the IR presented to LLVM enables the choice of the shorter sequence for a2vUnion in either case?

Just curious. Thanks for taking a look and for highlighting the workaround (specifying avx512bw explicitly).