512 bit static array to vector
June 19

Here's a comparison between ldc and gdc converting static arrays to 512 bit vectors:
https://godbolt.org/z/8jxafh76W

A few observations:

  1. LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
  2. LDC emits worse code for the cleaner .array assignment than for the union hack.
  3. LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.

Is improving LLVM/LDC wrt any of the above relatively simple?
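For context, the two conversion styles being compared presumably look something like this (a reconstruction rather than the actual godbolt source; the function names a2vArray/a2vUnion follow the ones discussed later in the thread, details may differ):

```d
alias V = __vector(byte[64]);   // 512-bit vector type

// The cleaner form: assign through the vector's .array property.
V a2vArray(ref const byte[64] a)
{
    V v;
    v.array = a;
    return v;
}

// The "union hack": overlay the static array with the vector.
V a2vUnion(ref const byte[64] a)
{
    union U { byte[64] arr; V vec; }
    U u;
    u.arr = a;
    return u.vec;
}
```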

June 19

On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:

> 1. LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.

Different results (actually using a 512-bit move, not 2x256) with -mattr=avx512bw. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

The biggest difference with gdc is an ABI difference - gdc returning the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).

> 3. LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.

I consider that useful, e.g., it allows using a double4 without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.

June 19

On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:

> On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
>
> > 1. LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
>
> Different results (actually using a 512-bit move, not 2x256) with -mattr=avx512bw. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

An unexpected choice there given that x86-64-v4 requires avx512bw. Still, glad to hear that the narrower specialization works well. It would be part of any multi-target binary where the programmer was concerned about maximum width-sensitive performance across the widest range of machines.

> The biggest difference with gdc is an ABI difference - gdc returning the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).

So, IIUC, gdc and ldc are not interoperable currently but will be once the frontend is updated?

> > 3. LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.
>
> I consider that useful, e.g., it allows using a double4 without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.

I agree. Perhaps a template: maxISAVectorLengthFor(T). If we're getting fancy we could do: maxMicroarchVectorLengthFor(T). Even better if these work correctly in multi-target compilation scenarios and for the expanding set of types (f16, bf16, other?).

Having both variants could be useful when targeting split/paired architectures, as AMD is fond of lately, or the SVE/RVV machines.
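A minimal sketch of what such a trait might look like as a library template; the name comes from the suggestion above, but the implementation is an assumption built on the predefined version identifiers D_SIMD / D_AVX / D_AVX2 (note there is currently no D_AVX512 version, which is part of the gap):

```d
// Hypothetical: element count of the widest ISA-native vector for T,
// derived from compile-time target version identifiers.
template maxISAVectorLengthFor(T)
{
    version (D_AVX2)
        enum size_t maxBytes = 32;  // 256-bit vectors for all element types
    else version (D_AVX)
        // AVX without AVX2: 256-bit float ops, but only 128-bit integer ops
        enum size_t maxBytes = __traits(isFloating, T) ? 32 : 16;
    else version (D_SIMD)
        enum size_t maxBytes = 16;  // SSE baseline
    else
        enum size_t maxBytes = T.sizeof;  // no SIMD: scalar "vectors"

    enum size_t maxISAVectorLengthFor = maxBytes / T.sizeof;
}

// Usage: pick the widest native vector for the current target.
alias WidestFloatVec = __vector(float[maxISAVectorLengthFor!float]);
```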

June 19

On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:

> On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
>
> > 1. LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
>
> Different results (actually using a 512-bit move, not 2x256) with -mattr=avx512bw. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

Note that llvm/ldc chooses a 512-bit, 2-instruction ld/st sequence for a2vUnion given x86-64-v4 as the target, but goes for a 256-bit wide, 4-instruction ld/st sequence in a2vArray.

As you note, -mattr=avx512bw forces a2vArray into the 2-instruction form, but apparently some difference in the IR presented to LLVM enables the choice of the shorter sequence for a2vUnion in either case?

Just curious. Thanks for taking a look and for highlighting the workaround (specifying avx512bw explicitly).