On Wednesday, 19 January 2022 at 10:17:45 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 19 January 2022 at 09:49:59 UTC, Nicholas Wilson wrote:
>> Arguably that already describes Nvidia. Luckily for us, it has an intermediate layer in PTX that LLVM can target, and that's exactly what dcompute does.
>
> For desktop applications one has to support Intel, AMD, Nvidia, and Apple. So does that mean one has to support Metal, Vulkan, PTX and ROCm? Sounds like too much…
That was a comment mostly about the market share and "business practices" of Nvidia.
Intel is well supported by OpenCL/SPIR-V.
There are some murmurings that AMD is getting SPIR-V support for ROCm. Even if that turns out to be insufficient, I don't think it would be too difficult to hook the AMDGPU backend up to LDC+DCompute (the runtime libraries would be a bit tedious, given the lack of familiarity and the volume of code), but I have no hardware to run ROCm on at the moment.
Metal should also not be too difficult to hook LDC up to (the kernel argument format is different, which is annoying); the main thing lacking is Objective-C support to bind the runtime libraries for DCompute (which would also need to be written).
LDC can already target Vulkan compute (although the pipeline is tedious, and there is no runtime library support).
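For a sense of what stays the same across all of those backends, here is a minimal kernel sketch along the lines of the published DCompute saxpy example; the module and attribute names are as published and may have drifted between versions:

```d
// Minimal sketch adapted from the published DCompute saxpy example; names such
// as CompileFor, GlobalPointer and GlobalIndex are as in that example and may
// differ slightly in current versions.
@compute(CompileFor.deviceOnly) module kernels;

import ldc.dcompute;        // @compute, @kernel, GlobalPointer
import dcompute.std.index;  // GlobalIndex

alias gf = GlobalPointer!float;

// The same source is lowered by LDC to SPIR-V (OpenCL/Vulkan) or PTX (CUDA).
@kernel void saxpy(gf res, float alpha, gf x, gf y, size_t N)
{
    auto i = GlobalIndex.x;
    if (i >= N) return;
    res[i] = alpha * x[i] + y[i];
}
```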
>> Unlike C++, D can much more easily statically condition on aspects of the hardware, making the tuning process faster to navigate the parameter configuration space.
>
> Not sure what you meant here?
I mean there are parametric attributes of the hardware, for example cache size (or available registers on GPUs), that have a direct effect on how many times you can unroll the inner loop of, say, a windowing function, and you want to ship optimised code for multiple hardware configurations.
You can much more easily create multiple copies for different cache sizes (or register availability) in D than you can in C++, because static foreach and static if are far more flexible than if constexpr (see the sketch below).
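As a rough illustration (the parameter regsPerThread and the thresholds are made up for the example, not a real LDC/DCompute query), choosing the unroll factor at compile time might look like:

```d
// Rough sketch only: `regsPerThread` and the thresholds are illustrative.
// The point is that the unroll factor is selected at compile time and the
// inner loop body is stamped out accordingly.
enum size_t regsPerThread = 64; // would come from a per-target config in practice

static if (regsPerThread >= 128)     enum unroll = 8;
else static if (regsPerThread >= 64) enum unroll = 4;
else                                 enum unroll = 2;

float windowedSum(const(float)[] data)
{
    float acc = 0;
    size_t i = 0;
    for (; i + unroll <= data.length; i += unroll)
    {
        // static foreach expands this statement `unroll` times, so the inner
        // loop is fully unrolled with no runtime loop overhead.
        static foreach (j; 0 .. unroll)
            acc += data[i + j];
    }
    for (; i < data.length; ++i) // remainder
        acc += data[i];
    return acc;
}
```

Shipping one such copy per hardware configuration is then just another static foreach over the configurations, rather than the template and if-constexpr gymnastics you would need in C++.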