January 19, 2022

On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim Grøstad wrote:

> On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto wrote:
>
>> It also needs to plug into the libraries, IDEs and GPGPU debuggers available to the community.
>
> But the presentation is not only about HPC, but about making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU.
>
> I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manages to get out of their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.

Currently Vulkan Compute is not to be taken seriously.

Yes, the end goal of the industry efforts is that C++ will be the lingua franca of GPGPUs and FPGAs; that is why SYCL is collaborating with ISO C++ efforts.

As for HPC, that is where the money for these kinds of efforts comes from.

January 19, 2022

On Wednesday, 19 January 2022 at 14:24:14 UTC, Paulo Pinto wrote:

> On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim Grøstad wrote:
>
>> On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto wrote:
>>
>>> It also needs to plug into the libraries, IDEs and GPGPU debuggers available to the community.
>>
>> But the presentation is not only about HPC, but about making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU.
>>
>> I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manages to get out of their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
>
> Currently Vulkan Compute is not to be taken seriously.
>
> Yes, the end goal of the industry efforts is that C++ will be the lingua franca of GPGPUs and FPGAs; that is why SYCL is collaborating with ISO C++ efforts.
>
> As for HPC, that is where the money for these kinds of efforts comes from.

Is Rust utterly irrelevant in this space? It feels weird not seeing it at all in this discussion. With all the talk about how flexible its type system is and its emphasis on the functional paradigm (things like the typestate pattern), I thought it would matter quite a bit in this context as well, since functional programming languages are found to model hardware more fluidly (naturally?) than imperative languages like C++ (yes, it's multi-paradigm as well, but come on).

January 19, 2022

On Wednesday, 19 January 2022 at 15:25:31 UTC, Tejas wrote:

> I thought it would matter quite a bit in this context as well, since functional programming languages are found to model hardware more fluidly (naturally?) than imperative languages like C++ (yes, it's multi-paradigm as well, but come on).

I haven't experienced that at all. Functional programming is nothing like an HDL (VHDL, Verilog), and those languages function completely differently from functional programming languages. They are somewhat parallel in nature, at least, but not in the way functional programming is.

I've found that imperative languages model a CPU (a sequence of instructions) better than functional programming languages, which seem to work at a higher level of abstraction.

January 19, 2022

On Wednesday, 19 January 2022 at 14:24:14 UTC, Paulo Pinto wrote:

> On Wednesday, 19 January 2022 at 13:32:37 UTC, Ola Fosheim Grøstad wrote:
>
>> On Wednesday, 19 January 2022 at 12:49:11 UTC, Paulo Pinto wrote:
>>
>>> It also needs to plug into the libraries, IDEs and GPGPU debuggers available to the community.
>>
>> But the presentation is not only about HPC, but about making parallel GPU computing as easy as writing regular C++ code and being able to debug that code on the CPU.
>>
>> I actually think it is sufficient to support Metal and Vulkan for this to be of value. The question is how much more performance Nvidia manages to get out of their nvc++ compiler for regular GPUs in comparison to a Vulkan solution.
>
> Currently Vulkan Compute is not to be taken seriously.

For those wishing to deploy today, I agree, but it should be considered for future deployments. That said, it's just one way for dcompute to tie in. My current dcompute work comes in, for example, via PTX JIT courtesy of an Nvidia driver.

> Yes, the end goal of the industry efforts is that C++ will be the lingua franca of GPGPUs and FPGAs; that is why SYCL is collaborating with ISO C++ efforts.

Yes, apparently there's a huge amount of time/money being spent on SYCL. We can co-opt much of that work underneath (the upcoming LLVM SPIR-V backend, debuggers, profilers, some libs) and provide a much better language on top. C++/SYCL is, to put it charitably, cumbersome.

> As for HPC, that is where the money for these kinds of efforts comes from.

Perhaps, but I suspect other market segments will be (or already are?) more important going forward. Gaming generally and ML on SoCs come to mind.

January 20, 2022

On Wednesday, 19 January 2022 at 10:17:45 UTC, Ola Fosheim Grøstad wrote:

> On Wednesday, 19 January 2022 at 09:49:59 UTC, Nicholas Wilson wrote:
>
>> Arguably that already describes Nvidia. Luckily for us, it has an intermediate layer in PTX that LLVM can target, and that's exactly what dcompute does.
>
> For desktop applications one has to support Intel, AMD, Nvidia and Apple. So, does that mean that one has to support Metal, Vulkan, PTX and ROCm? Sounds like too much…

That was a comment mostly about Nvidia's market share and "business practices".

Intel is well supported by OpenCL/SPIR-V.

There are some murmurings that AMD is getting SPIR-V support for ROCm. If that turns out to be insufficient, I don't think it would be too difficult to hook the AMDGPU backend up to LDC+DCompute (the runtime libraries would be a bit tedious, given the lack of familiarity and the volume of code), but I have no hardware to run ROCm on at the moment.

Metal should also not be too difficult to hook LDC up to (the kernel argument format is different, which is annoying); the main thing lacking is Objective-C support to bind the runtime libraries for DCompute (which would also need to be written).

LDC can already target Vulkan compute (although the pipeline is tedious, and there is no runtime library support).

>> Unlike C++, D can much more easily statically condition on aspects of the hardware, making it faster to navigate the parameter configuration space when tuning.
>
> Not sure what you meant here?

I mean there are parametric attributes of the hardware, say for example cache size (or available registers for GPUs), that have a direct effect on how many times you can unroll the inner loop, say for a windowing function, and you want to ship optimised code for multiple configurations of hardware.

You can much more easily create multiple copies for different cache sizes (or register availability) in D than you can in C++, because static foreach and static if >>> if constexpr.
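To make that concrete, here is a rough sketch in plain D (host-side, not dcompute-specific; the windowing function and the unroll factors are made up for illustration): one copy of the inner loop is generated per unroll factor at compile time, and a variant is picked at runtime once the relevant hardware parameter is known.

```d
import std.meta : AliasSeq;

// Unroll factors we want to ship code for (illustrative values).
alias UnrollFactors = AliasSeq!(4, 8, 16);

// A toy "windowing" kernel whose inner loop is fully unrolled at compile time.
void window(size_t unroll)(float[] dst, const(float)[] src, const(float)[] coeff)
{
    foreach (block; 0 .. dst.length / unroll)
    {
        static foreach (j; 0 .. unroll)
            dst[block * unroll + j] = src[block * unroll + j] * coeff[j];
    }
}

// Pick one of the pre-instantiated variants from a hardware parameter
// (e.g. register budget or cache size) queried at runtime.
void runWindow(float[] dst, const(float)[] src, const(float)[] coeff, size_t budget)
{
    switch (budget)
    {
        static foreach (u; UnrollFactors)
        {
            case u:
                window!u(dst, src, coeff);
                return;
        }
        default:
            window!(UnrollFactors[0])(dst, src, coeff);
            break;
    }
}
```

Doing the same in C++ takes considerably more template machinery, which is the point of the ">>> if constexpr" comparison above.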

January 20, 2022

On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson wrote:

> I mean there are parametric attributes of the hardware, say for example cache size (or available registers for GPUs), that have a direct effect on how many times you can unroll the inner loop, say for a windowing function, and you want to ship optimised code for multiple configurations of hardware.
>
> You can much more easily create multiple copies for different cache sizes (or register availability) in D than you can in C++, because static foreach and static if >>> if constexpr.

And you can do that even more easily with an AST macro system. Which Julia has...

January 20, 2022

On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson wrote:

> I mean there are parametric attributes of the hardware, say for example cache size (or available registers for GPUs), that have a direct effect on how many times you can unroll the inner loop, say for a windowing function, and you want to ship optimised code for multiple configurations of hardware.
>
> You can much more easily create multiple copies for different cache sizes (or register availability) in D than you can in C++, because static foreach and static if >>> if constexpr.

Hmm, I don't understand: shouldn't the unrolling happen at runtime so that you can target all GPUs with one executable?

If you have to do the unrolling in D, then a lot of the advantage is lost and I might just as well write in a shader language...

January 20, 2022

On Thursday, 20 January 2022 at 06:57:28 UTC, Ola Fosheim Grøstad wrote:

> On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson wrote:
>
>> I mean there are parametric attributes of the hardware, say for example cache size (or available registers for GPUs), that have a direct effect on how many times you can unroll the inner loop, say for a windowing function, and you want to ship optimised code for multiple configurations of hardware.
>>
>> You can much more easily create multiple copies for different cache sizes (or register availability) in D than you can in C++, because static foreach and static if >>> if constexpr.
>
> Hmm, I don't understand: shouldn't the unrolling happen at runtime so that you can target all GPUs with one executable?

Now you've confused me. You can select which implementation to use at runtime with e.g. CPUID or more sophisticated methods. LDC targeting DCompute can produce multiple objects with the same compiler invocation, i.e. you can get CUDA for any set of SM versions as well as OpenCL-compatible SPIR-V; for each GPU you can inspect its hardware characteristics and then select which of your kernels to run.
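Purely for illustration, here is a hypothetical host-side sketch of that last step; the Device struct and the file names are made up and are not part of the dcompute runtime API:

```d
// Hypothetical device description, filled in from a runtime query
// (CUDA driver API, OpenCL clGetDeviceInfo, etc.); not a dcompute type.
struct Device
{
    bool   hasCUDA;          // Nvidia device reachable via PTX
    uint   smVersion;        // e.g. 70 for sm_70
    size_t registersPerThread;
}

// Pick which pre-compiled kernel object to load for this device.
string pickKernelObject(Device dev)
{
    if (dev.hasCUDA)
        return dev.smVersion >= 70 ? "kernels_cuda700_64.ptx"
                                   : "kernels_cuda350_64.ptx";
    // Everything else goes through the OpenCL/SPIR-V path.
    return "kernels_ocl220_64.spv";
}
```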

> If you have to do the unrolling in D, then a lot of the advantage is lost and I might just as well write in a shader language...

D can be your compute shading language for Vulkan and, with a bit of work, whatever you'd use HLSL for; it can also be your compute kernel language, substituting for OpenCL and CUDA. The same caveats apply for Metal (it should be pretty easy to do: we need Objective-C support in LDC and Metal bindings).
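To give a feel for what using D as the kernel language looks like, here is roughly the saxpy example from the dcompute README, reproduced from memory (attribute, module and type names may not match the current sources exactly):

```d
@compute(CompileFor.deviceOnly) module kernels;
import ldc.dcompute;
import dcompute.std.index;

// One work item / thread per element, as in OpenCL or CUDA.
@kernel void saxpy(GlobalPointer!float res,
                   float alpha,
                   GlobalPointer!float x,
                   GlobalPointer!float y,
                   size_t N)
{
    auto i = GlobalIndex.x;   // global thread index
    if (i >= N) return;
    res[i] = alpha * x[i] + y[i];
}
```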

January 20, 2022

On Thursday, 20 January 2022 at 04:01:09 UTC, Araq wrote:

> And you can do that even more easily with an AST macro system. Which Julia has...

I think these approaches are somewhat pointless for desktop applications, although a JIT does help.

If time-consuming compile-time adaptation to the hardware is needed, then it should happen at installation time. A better approach is to ship code in a high-level IR and bundle a compiler with the installer.

January 20, 2022

On Thursday, 20 January 2022 at 08:20:58 UTC, Nicholas Wilson wrote:

> Now you've confused me. You can select which implementation to use at runtime with e.g. CPUID or more sophisticated methods. LDC targeting DCompute can produce multiple objects with the same compiler invocation, i.e. you can get CUDA for any set of SM versions as well as OpenCL-compatible SPIR-V; for each GPU you can inspect its hardware characteristics and then select which of your kernels to run.

Yes, so why do you need compile-time features?

My understanding is that the goal of nvc++ is to compile to CPU or GPU based on what pays off more for the actual code, so it will not need any annotations (it is up to the compiler to choose between CPU and GPU?). Bryce suggested that it currently only targets one specific GPU, but that it will target multiple GPUs from the same executable in the future.

The goal for C++ parallelism is to make it fairly transparent to the programmer. Or did I misunderstand what he said?

My viewpoint is that if one is going to take a performance hit by not writing the shaders manually, one needs to get maximum convenience as a payoff.

It should be an alternative for programmers who cannot afford to put in the extra time to support GPU compute manually.

>> If you have to do the unrolling in D, then a lot of the advantage is lost and I might just as well write in a shader language...
>
> D can be your compute shading language for Vulkan and, with a bit of work, whatever you'd use HLSL for; it can also be your compute kernel language, substituting for OpenCL and CUDA.

I still don't understand why you would need static if / static foreach. It seems to me that this is too hardwired; you'd be better off with compiler unrolling hints (C++ has these) if the compiler does the wrong thing.

> The same caveats apply for Metal (it should be pretty easy to do: we need Objective-C support in LDC and Metal bindings).

Use clang to compile the Objective-C code to object files and link with it?