On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:
On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
- LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
Different results (actually using a 512-bit move, not 2x256) with -mattr=avx512bw. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.
An unexpected choice there, given that x86-64-v4 requires avx512bw. Still, glad to hear that the narrower specialization works well. It would be part of any multi-target binary where the programmer is concerned about maximum width-sensitive performance across the widest range of machines.
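For anyone wanting to reproduce the comparison, something along these lines should show the split (the file name and exact flag combinations are my guesses, not necessarily kinke's invocation):

    ldc2 -O3 -mcpu=x86-64-v3 --output-s vec.d                  # copies lowered to 2x256-bit moves
    ldc2 -O3 -mcpu=x86-64-v3 -mattr=avx512bw --output-s vec.d  # single 512-bit move, per the above
    gdc  -O3 -march=x86-64-v4 -S vec.d                         # gdc codegen for comparison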
The biggest difference with gdc is an ABI difference - gdc returning the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).
So, IIUC, gdc and ldc are currently not ABI-compatible when returning 512-bit vectors, but will be once the frontend is updated?
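For concreteness, this is the kind of signature where the mismatch would bite (double8 is my own alias here; core.simd doesn't predefine one):

    import core.simd;

    alias double8 = __vector(double[8]); // 512-bit vector type

    // With AVX-512 enabled, gdc hands the result back directly in a ZMM register,
    // while LDC (per argtypes_sysv_x64.d above) returns it through a hidden sret
    // pointer, so a caller built with one compiler and a callee built with the
    // other would disagree about where the return value lives.
    double8 axpy(double8 a, double8 x, double8 y)
    {
        return a * x + y; // element-wise multiply-add
    }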
- LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.
I consider that useful, e.g., allowing use of a double4 without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.
I agree. Perhaps a template: maxISAVectorLengthFor(T). If we're getting fancy we could do: maxMicroarchVectorLengthFor(T). Even better if these work correctly in multi-target compilation scenarios and for the expanding set of types (f16, bf16, other?).
Having both variants could be useful when targeting split/paired architectures, as AMD is fond of lately, or the SVE/RVV machines.
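To make the first of those concrete, here's a rough library-side sketch of the idea, built from the D_AVX / D_AVX2 version identifiers we have today. D_AVX512F is an assumption on my part (I don't think any compiler defines it), which is exactly why a compiler-exposed trait would be nicer:

    // Widest ISA vector register, in bytes, for the compile-time target.
    version (D_AVX512F)   enum size_t maxISAVectorBytes = 64; // hypothetical identifier
    else version (D_AVX2) enum size_t maxISAVectorBytes = 32;
    else version (D_AVX)  enum size_t maxISAVectorBytes = 32;
    else                  enum size_t maxISAVectorBytes = 16; // SSE2 baseline on x86-64

    /// Number of T lanes in the widest ISA-supported vector.
    template maxISAVectorLengthFor(T)
    {
        enum size_t maxISAVectorLengthFor = maxISAVectorBytes / T.sizeof;
    }

    // e.g. on an x86-64-v3 target: maxISAVectorLengthFor!float == 8
    static assert(maxISAVectorLengthFor!double >= 2);

The maxMicroarch variant would need compiler help in any case, since the preferred (as opposed to merely supported) width isn't visible to user code, and a per-instantiation answer is what you'd want in a multi-target build.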