On Thursday, 20 January 2022 at 13:29:26 UTC, Ola Fosheim Grøstad wrote:
On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal wrote:
Because compilers are not sufficiently advanced to extract all the performance that is available on their own.
Well, but D developers cannot test on all available CPU/GPU combinations either, so you don't know whether SIMD would perform better than the GPU.
It can be very expensive to write and test all the permutations, yes, but often you'll understand the bottlenecks of your algorithms well enough to filter out most of that work up front. Restating here, these are a few of the traditional ways to look at it: Are you throughput limited or latency limited? Operand/memory limited or arithmetic limited? Is power (watts) the priority, or some other measure of performance?
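To make the operand/memory vs. arithmetic question concrete, here is a minimal worked sketch (my illustrative numbers, not anything measured in this thread):

```d
// saxpy: y[i] = a * x[i] + y[i]
// Per element: 2 FLOPs (one multiply, one add) against roughly 12 bytes
// of traffic (load x[i], load y[i], store y[i], 4 bytes each).
// Arithmetic intensity ~ 2/12 ~ 0.17 FLOP/byte, far below the machine
// balance of any modern CPU or GPU, so this kernel is operand/memory
// limited everywhere; extra arithmetic horsepower can't speed it up.
void saxpy(float a, const(float)[] x, float[] y)
{
    foreach (i; 0 .. y.length)
        y[i] = a * x[i] + y[i];
}
```

An estimate like that is often all you need to filter a platform in or out before writing a line of tuned code.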
It's possible, for instance, that you can know, from first principles, that you'll never meet objective X if forced to use platform Y. In general, though, you'll just have a sense of the order in which things should be evaluated.
Something automated has to be present, at least at install time; otherwise you risk performance degradation relative to a pure SIMD implementation. In that case it is better (and cheaper) to just avoid the GPU altogether.
Yes, SIMD can be the better performance choice sometimes. I think that many people will choose to do a SIMD implementation as a performance, correctness-testing, and portability baseline regardless of the accelerator possibilities.
A good example of where the automated/simple approach was not good enough is CUB (CUDA UnBound), a high-performance CUDA library found here: https://github.com/NVIDIA/cub/tree/main/cub
I'd recommend taking a look at the specializations that occur in CUB in the name of performance.
I am sure you are right, but I didn't find anything special when I browsed through the repo?
The key thing to note is how much effort the authors put into specialization across the HW x SW cross product. There are entire subdirectories devoted to specialization.
At least some of this complexity, this programming burden, can be factored out with better language support.
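As a sketch of what factoring that specialization through the language can look like with features D already has (templates, static if, and the __vector type test; the function and its interior details are hypothetical):

```d
import core.simd;

// Generic entry point: one name, with the HW-specific path selected at
// compile time. Callers never see the HW x SW cross product.
void scale(T)(T[] data, T k)
{
    static if (is(__vector(T[4])))        // does this target have a T[4] vector?
    {
        alias V = __vector(T[4]);
        V kv = k;                         // broadcast the scalar to all lanes
        size_t i;
        for (; i + 4 <= data.length; i += 4)
        {
            V v;
            v.array[] = data[i .. i + 4]; // unaligned load via static-array copy
            v *= kv;
            data[i .. i + 4] = v.array[]; // unaligned store, same trick
        }
        foreach (j; i .. data.length)     // scalar tail
            data[j] *= k;
    }
    else
    {
        foreach (ref x; data)             // portable scalar fallback
            x *= k;
    }
}
```

The interior is still boilerplate-heavy, which is exactly the part better language support could absorb.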
If you can achieve your performance objectives with automated or hinted solutions, great! But what if you can't?
Well, my gut instinct is that if you want maximal performance for a specific GPU, then you would be better off using Metal/Vulkan/etc. directly?
That's what seems reasonable, yes, but fortunately I don't think it's correct. By analogy: you can get maximum performance from assembly-level programming if you have all of the compiler back-end knowledge in your head. But if your language lets you communicate all of the relevant information (mainly dependencies and operand localities, but also "intrinsics"), then the compiler can do at least as well as the assembly-level programmer. Add language support for inline and factored specialization, and the lower-level alternatives become even less attractive.
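A plain-D illustration of the dependencies point (my example, not from the thread): a reduction written with one accumulator hides the fact that the adds could proceed in parallel, while independent accumulators expose it.

```d
// One accumulator: every add waits on the previous one, a single
// serial dependency chain the compiler must preserve.
float sumSerial(const(float)[] a)
{
    float s = 0;
    foreach (x; a)
        s += x;
    return s;
}

// Four independent accumulators: four dependency chains that can run
// in SIMD lanes or overlapping pipeline slots. Note this changes the
// summation order, which is precisely why the compiler won't make the
// transformation on its own without fast-math-style permission: the
// programmer has to communicate that the reordering is acceptable.
float sumIndependent(const(float)[] a)
{
    float[4] s = 0;
    size_t i;
    for (; i + 4 <= a.length; i += 4)
    {
        static foreach (l; 0 .. 4)
            s[l] += a[i + l];
    }
    float t = s[0] + s[1] + s[2] + s[3];
    foreach (j; i .. a.length)  // leftover tail elements
        t += a[j];
    return t;
}
```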
But I have no experience with that, as it is quite time-consuming to go that route. Right now basic SIMD is time-consuming enough… (but OK)
Indeed. I'm currently working on the SIMD variant of something I partially prototyped earlier on a 2080, and it has been slow going compared to either that GPU implementation or the scalar/serial variant.
There are some very nice assists from D for SIMD programming: __vector typing, __vector arithmetic, unaligned vector loads/stores via static array operations, and static foreach to enable portable expression of single-instruction SIMD functions like min, max, select, various shuffles, masks, and so on. But, yes, SIMD programming is definitely a slog compared to either scalar or SIMT GPU programming.
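For concreteness, here's roughly what one of those static foreach single-instruction helpers looks like (a sketch, names are mine; with optimization enabled the per-lane selects typically collapse into a single min instruction, e.g. minps on x86):

```d
import core.simd;

// Portable element-wise minimum over any __vector type, written once.
V vmin(V)(V a, V b)
{
    V r;
    static foreach (i; 0 .. V.init.array.length)
        r.array[i] = a.array[i] < b.array[i] ? a.array[i] : b.array[i];
    return r;
}

unittest
{
    float4 a = [4, 3, 2, 1];
    float4 b = [1, 2, 3, 4];
    auto m = vmin(a, b);
    assert(m.array == [1f, 2, 2, 1]);
}
```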