On Tuesday, 7 June 2022 at 18:21:57 UTC, Walter Bright wrote:
>On 6/7/2022 2:23 AM, Bruce Carneal wrote:
...
>I've never much liked autovectorization:
Same here, which is why my initial CPU-side implementation was all explicit __vector/intrinsics code (with corresponding static arrays to get a sane unaligned load/store capability).
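For the curious, the static-array trick looks roughly like this (a minimal sketch, not the actual app code; the helper names are made up, and it leans on core.simd's `.array` property):

```d
import core.simd;

// Minimal sketch (helper names hypothetical): float[4] carries no 16-byte
// alignment requirement, so unaligned memory traffic is routed through it
// instead of dereferencing memory as float4 directly.
float4 loadU(const(float)* p)
{
    float[4] tmp = p[0 .. 4];   // alignment-agnostic copy from memory
    float4 v;
    v.array = tmp;              // hand the lanes to the SIMD type
    return v;
}

void storeU(float* p, float4 v)
{
    float[4] tmp = v.array;     // lanes back out as a plain static array
    p[0 .. 4] = tmp[];          // alignment-agnostic copy back to memory
}
```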
>- you never know if it is going to vectorize or not. The vector instruction sets vary all over the place, and whether they line up with your loops or not is not determinable in general - you have to look at the assembler dump.
I now take this as an argument for autovectorization: since the vector instruction sets vary so much across targets, I'd rather let the compiler handle the per-target mapping and catch any misses at perf-test time.
>- when autovectorization doesn't happen, the compiler reverts to non-vectorized slow code. Often, you're not aware this has happened, and the expected performance doesn't happen. You can usually refactor the loop so it will autovectorize, but that's something only an expert programmer can accomplish, but he can't do it if he doesn't realize the autovectorization didn't happen. You said it yourself: "if perf drops"!
Well, presumably you're "unittesting" performance anyway to know where the hot spots are. It's always nicer to know things at compile time, but for me it's acceptable to find out at "unittest time", since the measurements will be part of any performance code development setup regardless.
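A perf "unittest" can be as simple as timing the hot loop against a budget. A sketch only: the loop, sizes, and threshold below are placeholders, and a real budget would come from a recorded baseline per target.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;

unittest
{
    enum n = 1 << 20;
    auto x = new float[](n);
    auto y = new float[](n);
    x[] = 1.0f;

    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n)
        y[i] = 2.0f * x[i] + 1.0f;   // stand-in for the real hot kernel
    sw.stop();

    // Placeholder budget: a real setup compares against a recorded baseline,
    // so a silent vectorization failure shows up as a test failure.
    assert(sw.peek.total!"usecs" < 10_000, "kernel perf regression");
}
```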
>- it's fundamentally a backwards thing. The programmer writes low level code (explicit loops) and the compiler tries to work backwards to create high level code (vectors) for it! This is completely backwards to how compilers normally work - specify a high level construct, and the compiler converts it into low level.
I see it as a choice on the "time to develop" <==> "performance achieved" axis. Fortunately autovectorization can be a win here: develop simple/correct code with an eye to compiler-visible indexing, and hand-vectorize only if there's a problem. (I actually went the other way, starting with hand-optimized core functions, and discovered that autovectorization worked as well or better for many of them.)
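By "compiler-visible indexing" I mean roughly this shape (illustrative only, not code from the app): unit-stride indexing over plain slices, a trip count the compiler can see, and no aliasing games.

```d
// Illustrative loop shape that autovectorizers handle well: unit stride,
// known bounds, simple dependence structure.
void axpy(float[] y, const(float)[] x, float a)
{
    assert(y.length == x.length);
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}
```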
>- with vector code, the compiler will tell you when the instruction set won't map onto it, so you have a chance to refactor it so it will.
Yes, better to know things at compile time but OK to know them at perf "unittest" time.
Here are some of the reasons I'm migrating much of my code from the initial __vector/intrinsic implementation to autovectorization with perf regression tests:
- It's more readable.
- It's auto-upgradeable (with @target metaprogramming for multi-target deployability; see the sketch after this list).
- It's measurably (slightly) faster in many instances (it helps that I can shape the operand flows for this app).
- It fits more readily with upcoming CPU-centric vector architectures (SVE, SVE2, RVV, ...). Cray vectors ride again! :-)
- It aligns stylistically with SIMT (I think in terms of index spaces and memory subsystem blocking rather than HW details). SIMT is where I believe we should be looking for future, significant performance gains (the PCIe bottleneck is a stumbling block, but SoCs and consoles have the right idea).
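On the auto-upgradeable point, the @target approach looks roughly like this (an LDC-specific sketch: the feature strings and function names are illustrative, and the runtime cpuid dispatch between the generated variants is left out):

```d
import ldc.attributes : target;

// Sketch of "@target metaprogramming": stamp out one copy of the same
// autovectorizable loop per ISA level; a cpuid check at startup would then
// select which generated function to call.
static foreach (isa; ["sse2", "avx2"])
{
    mixin(`
        @target("` ~ isa ~ `")
        void scale_` ~ isa ~ `(float[] y, const(float)[] x, float a)
        {
            foreach (i; 0 .. y.length)
                y[i] = a * x[i];
        }
    `);
}
```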
The mid-range goal is to develop in an it-just-works, no-big-deal SIMT environment where the traditional SIMD awkwardness is in the rear view mirror and where we can surf the improving HW performance wave (clock increases were nice while they lasted but ...). dcompute is already a good ways down that road but it can be friendlier and more capable. As I've mentioned elsewhere, I already prefer it to CUDA.
Finally, thanks for creating D. It's great.