On Monday, 10 January 2022 at 03:04:22 UTC, Bruce Carneal wrote:
On Monday, 10 January 2022 at 00:17:48 UTC, Johan wrote:
On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
With -mcpu=native, -O3, and an EPYC/zen1 target, ldc 1.28 vectorizes the code below when T == ubyte but does not vectorize that code when T == ushort.
Intra-cache throughput testing on a 2.4GHz zen1 reveals:
30GB/sec -- custom template function in vanilla D (no asm, no intrinsics)
27GB/sec -- auto vectorized ubyte
6GB/sec -- non vectorized ushort
I'll continue to use that custom code, so there's no particular urgency here, but if any of the LDC crew can shed some light on this off the top of their head, I'd be interested. My guess is that the cost/benefit function in play here does not take bandwidth into account at all.
void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads)
    {
        dst[0] = s0[i];
        dst[1] = s1[i];
        dst[2] = s2[i];
        dst[3] = s3[i];
    }
}
This could be due to a number of things. Most likely it's the possibility of pointer aliasing; it could also be alignment assumptions.
I don't think it's pointer aliasing, since the ten-line template function seen above was used for both the ubyte and ushort instantiations. The ubyte instantiation auto-vectorized nicely; the ushort instantiation did not.
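If aliasing does turn out to be the blocker, LDC's @restrict attribute (from ldc.attributes, LDC-only; assumed applicable to these parameters) promises the optimizer that the pointers don't overlap, which can unblock vectorization. A sketch, not a verified fix for this case:

```d
// Sketch: promise no aliasing to LDC so the vectorizer need not emit
// runtime overlap checks. @restrict adds LLVM's noalias to a parameter.
import ldc.attributes : restrict;

void interleave(T)(@restrict T* s0, @restrict T* s1,
                   @restrict T* s2, @restrict T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads)
    {
        dst[0] = s0[i];
        dst[1] = s1[i];
        dst[2] = s2[i];
        dst[3] = s3[i];
    }
}
```

The attribute is a promise, not a check: passing overlapping pointers then becomes undefined behavior.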
Also, I don't think the unaligned vector load/store instructions have alignment restrictions. They are generated by LDC when you have something like:
ushort* sap = ...;
auto tmp = *cast(__vector(ushort[8])*) sap; // turns into: vmovups ...
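To make that concrete, here is a self-contained sketch of the same pattern using the ushort8 alias from druntime's core.simd (available on SIMD-capable targets); per the observation above, LDC lowers the unaligned loads/stores here to vmovups-style moves:

```d
import core.simd : ushort8; // __vector(ushort[8]) alias from druntime

// Copy 8 ushorts through a vector register; no alignment is assumed,
// so the compiler must use unaligned vector moves.
void copy8(ushort* dst, const(ushort)* src)
{
    *cast(ushort8*) dst = *cast(const(ushort8)*) src;
}
```

The cast-and-dereference is the usual way to express an unaligned vector load in D; the actual instruction selection is up to the backend.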
The compiler complains about aliasing when optimizing. For example, the write to dst may alias with s1, so s1[i] needs to be reloaded. I think the problem gets worse with 16-bit numbers because they may partially overlap (8 bits of dst overlapping with s1[i]). Just a guess as to why the lookup tables .LCPI0_x are generated...
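The partial-overlap worry can be made concrete: two byte ranges alias exactly when each one starts before the other ends, which is the runtime check a vectorizer has to emit when it cannot prove no-alias statically. A minimal sketch (the function name is made up for illustration):

```d
// True iff the byte ranges [a, a+bytes) and [b, b+bytes) intersect.
// With ushort elements, even a 1-byte offset between ranges produces
// a partial overlap within a single 16-bit element.
bool mayOverlap(const(ubyte)* a, const(ubyte)* b, size_t bytes)
{
    return a < b + bytes && b < a + bytes;
}
```

When such checks (or the versioned loops they guard) get too expensive relative to the estimated win, the vectorizer can give up, which would match the observed non-vectorized ushort path.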