January 14, 2022

On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim Grøstad wrote:

> Are there some performance benchmarks on modest hardware (e.g. a standard MacBook, iMac, or Mac mini)? Benchmarks that compare dcompute to CPU code with auto-vectorization (SIMD)?

Part of the difficulty with that is that it is an apples-to-oranges comparison. Also, I no longer have hardware that can run dcompute, as my old Windows box (Intel x86, OpenCL 2.1, with an Nvidia GPU) died some time ago.

Unfortunately, Macs and dcompute don't work very well together. CUDA requires Nvidia, and OpenCL needs the ability to consume SPIR-V (the clCreateProgramWithIL call), which requires OpenCL 2.x, which Apple does not support. Hence supporting Metal was of some interest. You might in theory be able to use PoCL or an Intel-based OpenCL runtime, but I don't have an Intel Mac anymore and I haven't tried PoCL.

January 14, 2022

On Friday, 14 January 2022 at 01:37:29 UTC, Bruce Carneal wrote:

> WRT OpenCL I don't have much to say. From what I gather people consider OpenCL to be even less hospitable than CUDA, preferring OpenCL mostly (only?) for its non-proprietary status. I'd be interested to hear from OpenCL gurus on this topic.

Not that I'm an OpenCL guru by any stretch of the imagination, but yes: OpenCL as a base API is much less nice than even the CUDA driver API. The foundation is solid, though, and you can abstract and prettify it with D to a level of usability that is at least on par with (and imo exceeds) CUDA's runtime API (the one with the <<<>>>'s), with D kernels.
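For flavour, a bare-bones launch against the raw OpenCL C API looks roughly like the following. The kernel and names are illustrative sketches, not dcompute code:

```cpp
// Raw OpenCL host-side launch boilerplate (illustrative sketch).
// Each argument is set by index with a size and an untyped pointer, so the
// compiler cannot check any of it against the kernel's actual signature.
#include <CL/cl.h>

cl_int launch_scale(cl_command_queue q, cl_kernel scale,
                    cl_mem buf, float k, size_t n)
{
    cl_int err = clSetKernelArg(scale, 0, sizeof(cl_mem), &buf);
    if (err != CL_SUCCESS) return err;
    err = clSetKernelArg(scale, 1, sizeof(float), &k);
    if (err != CL_SUCCESS) return err;

    size_t global_size = n;
    return clEnqueueNDRangeKernel(q, scale, 1, nullptr, &global_size,
                                  nullptr, 0, nullptr, nullptr);
}
```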

That is to say, the selling point of dcompute vs. OpenCL is that you get an API that is just as easy as CUDA's (w.r.t. type safety and tedium) and you get to write your kernels in D, whereas dcompute vs. CUDA is just that you get to write your kernels in D (and the API is no worse).
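And this is the CUDA runtime API style being compared against; again an illustrative sketch rather than anything from dcompute:

```cuda
// CUDA runtime API: the "<<<>>>" launch being referred to. Illustrative only.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void launch_saxpy(float a, const float *d_x, float *d_y, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    // Launch configuration in the chevrons, ordinary arguments after.
    saxpy<<<blocks, threads>>>(a, d_x, d_y, n);
}
```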

January 14, 2022

On Friday, 14 January 2022 at 00:56:32 UTC, Nicholas Wilson wrote:

> If you're thinking of "special compiler support" as what CUDA does with its <<<>>>, then no: dcompute does all of that, but not with special help from the compiler, only with the metaprogramming and reflection available to any other D program. It's D all the way down to the API calls. Obviously there is special compiler support to turn D code into compute kernels.
>
> The main benefit of dcompute is turning kernel launches into type-safe one-liners, as opposed to brittle, type-unsafe paragraphs of code.

Sounds indeed less brittle than a separate language. In my time with CUDA I never got to use <<<>>>.

In OpenCL you'd have to templatize the string kernels quite quickly, and with CUDA you'd also have to make lots of entry points. Plus all the import problems, so I can see how it's better with LDC intrinsics.
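(Roughly this kind of thing, i.e. generating the OpenCL C source at run time by splicing type names into a string, one program build per element type; a hypothetical sketch, not code from the thread:)

```cpp
// Hand-rolled "templating" of an OpenCL string kernel (hypothetical sketch):
// the element type is spliced into the source text and a separate program is
// built for every instantiation.
#include <CL/cl.h>
#include <string>

cl_program build_scale_program(cl_context ctx, cl_device_id dev, const std::string &ty)
{
    std::string src =
        "__kernel void scale(__global " + ty + " *buf, " + ty + " k) {\n"
        "    size_t i = get_global_id(0);\n"
        "    buf[i] *= k;\n"
        "}\n";
    const char *text = src.c_str();
    size_t length = src.size();
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &text, &length, &err);
    if (err != CL_SUCCESS)
        return nullptr;
    if (clBuildProgram(prog, 1, &dev, "", nullptr, nullptr) != CL_SUCCESS)
        return nullptr;
    return prog;
}
```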

January 14, 2022

On Friday, 14 January 2022 at 09:39:58 UTC, Guillaume Piolat wrote:

> > The main benefit of dcompute is turning kernel launches into type-safe one-liners, as opposed to brittle, type-unsafe paragraphs of code.
>
> Sounds indeed less brittle than a separate language. In my time with CUDA I never got to use <<<>>>.

Pity, the <<<>>> is actually quite nice, and not all that brittle, but it is CUDA C/C++ (and maybe Fortran?) only, AMD's attempts at HIP notwithstanding. The main thing that does make it brittle is that if you change the signature of the kernel, you need to remember to change it wherever it is invoked, and the compiler will not tell you that you forgot something.
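The unchecked case is clearest with the CUDA driver API (or, equivalently, raw OpenCL), where the launch arguments cross as an untyped array; a minimal sketch with illustrative names:

```cuda
// CUDA driver API launch: the "paragraph of code" style. Arguments are passed
// as an array of void*, so if the kernel's signature changes this still
// compiles and only fails (or silently misbehaves) at run time.
#include <cuda.h>

CUresult launch_saxpy(CUfunction saxpy, CUstream stream,
                      float a, CUdeviceptr x, CUdeviceptr y, int n)
{
    unsigned threads = 256;
    unsigned blocks  = (n + threads - 1) / threads;
    void *args[] = { &a, &x, &y, &n };  // nothing ties this to the kernel's parameters
    return cuLaunchKernel(saxpy,
                          blocks, 1, 1,    // grid dimensions
                          threads, 1, 1,   // block dimensions
                          0, stream,       // dynamic shared memory, stream
                          args, nullptr);
}
```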

> In OpenCL you'd have to templatize the string kernels quite quickly, and with CUDA you'd also have to make lots of entry points. Plus all the import problems, so I can see how it's better with LDC intrinsics.

I'm not quite sure what you mean here.

January 14, 2022

On Friday, 14 January 2022 at 01:39:32 UTC, Nicholas Wilson wrote:

> On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim Grøstad wrote:
>
> > Are there some performance benchmarks on modest hardware (e.g. a standard MacBook, iMac, or Mac mini)? Benchmarks that compare dcompute to CPU code with auto-vectorization (SIMD)?
>
> Part of the difficulty with that is that it is an apples-to-oranges comparison. Also, I no longer have hardware that can run dcompute, as my old Windows box (Intel x86, OpenCL 2.1, with an Nvidia GPU) died some time ago.
>
> Unfortunately, Macs and dcompute don't work very well together. CUDA requires Nvidia, and OpenCL needs the ability to consume SPIR-V (the clCreateProgramWithIL call), which requires OpenCL 2.x, which Apple does not support. Hence supporting Metal was of some interest. You might in theory be able to use PoCL or an Intel-based OpenCL runtime, but I don't have an Intel Mac anymore and I haven't tried PoCL.

*nods* For a long time we could expect "home computers" to be Intel/AMD, but then the computing environment changed, and maybe Apple is trying to make its own platform stand out as faster than it is by forcing developers to special-case their code for Metal rather than go through a generic API.

I guess FPGAs will be available in entry-level machines at some point as well. So I understand that it will be a challenge to get dcompute to a "ready for the public" stage when there is no multi-person team behind it.

But I am not so sure about the apples-and-oranges aspect of it. The presentation by Bryce was quite explicitly focused on making GPU computation available at the same level as CPU computation (sans function pointers). This should be possible for homogeneous memory systems (GPU and CPU sharing the same memory bus) in a rather transparent manner, and languages that plan for this might be perceived as much more productive and performant if/when this becomes reality. And C++23 isn't far away, if they make the deadline.
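(The C++ direction Bryce's talk represents already looks something like this with C++17 parallel algorithms, which compilers such as nvc++ can offload to a GPU; an illustrative sketch, not taken from the talk:)

```cpp
// C++ standard parallel algorithm: the same source can run on CPU threads or
// be offloaded to a GPU (e.g. by nvc++ -stdpar). Illustrative sketch only.
#include <algorithm>
#include <execution>
#include <vector>

void scale(std::vector<float> &v, float k)
{
    std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                  [k](float &x) { x *= k; });
}
```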

It was also interesting to me that ISO C23 will provide custom bit-width integers and that this would make it easier to compile C code efficiently into tighter FPGA logic. I remember that LLVM used to have that in its IR, but I think it was taken out and limited to more conventional bit sizes? It just shows that being a system-level programming language requires a lot of adaptability over time, and frameworks like dcompute can never be considered truly finished.

January 14, 2022

On Friday, 14 January 2022 at 15:17:59 UTC, Ola Fosheim Grøstad wrote:

> On Friday, 14 January 2022 at 01:39:32 UTC, Nicholas Wilson wrote:
>
> > On Thursday, 13 January 2022 at 22:27:27 UTC, Ola Fosheim Grøstad wrote:
> > ...
>
> The presentation by Bryce was quite explicitly focused on making GPU computation available at the same level as CPU computation (sans function pointers). This should be possible for homogeneous memory systems (GPU and CPU sharing the same memory bus) in a rather transparent manner, and languages that plan for this might be perceived as much more productive and performant if/when this becomes reality. And C++23 isn't far away, if they make the deadline.

Yes. Homogeneous memory accelerators, as found today in game consoles and SoCs, open up some nice possibilities. Scheduling could still be problematic with a centralized resource (unlike per-core SIMD). Distinct instruction formats (GPU vs CPU) also present a challenge to achieving an it-just-works "sans function pointers" level of integration. Surmountable, but a little work to do there.

I'm hopeful that SoCs, with their relatively friendlier accelerator configurations, will be the economic enabler for widespread uptake of dcompute. World beating perf/watt from very readable code deployable on billions of units? I'm up for that!

January 14, 2022

On Friday, 14 January 2022 at 16:57:21 UTC, Bruce Carneal wrote:

> I'm hopeful that SoCs, with their relatively friendlier accelerator configurations, will be the economic enabler for widespread uptake of dcompute.

It is difficult to predict the future, but it is at least possible that the mainstream home-computing market will be dominated by smaller, focused machines with SoCs. If we ignore Apple, then maybe the market will split into something like Chromebooks for non-geek users, something like the Steam Deck/Machine for gamers, and some other SoC with a built-in FPGA or some other tinkering-friendly configuration for Linux enthusiasts. It seems reasonable that only storage will be on discrete chips in the long term. Drops in price levels tend to favour volume markets, so it is reasonable to expect SoCs to win out.

January 14, 2022

On Friday, 14 January 2022 at 17:38:36 UTC, Ola Fosheim Grøstad wrote:

> On Friday, 14 January 2022 at 16:57:21 UTC, Bruce Carneal wrote:
>
> > I'm hopeful that SoCs, with their relatively friendlier accelerator configurations, will be the economic enabler for widespread uptake of dcompute.
>
> It is difficult to predict the future, but it is at least possible that the mainstream home-computing market will be dominated by smaller, focused machines with SoCs. If we ignore Apple, then maybe the market will split into something like Chromebooks for non-geek users, something like the Steam Deck/Machine for gamers, and some other SoC with a built-in FPGA or some other tinkering-friendly configuration for Linux enthusiasts. It seems reasonable that only storage will be on discrete chips in the long term. Drops in price levels tend to favour volume markets, so it is reasonable to expect SoCs to win out.

Yes, I think the rollout of SoCs that you describe could very well occur. I hadn't even considered those! I was thinking of the accelerators in phone SoCs.

Googling just now, I saw an estimate of over 6 billion smartphones worldwide. That seems a little high to me, but the number of accelerator-equipped phone SoCs is certainly in the billions, with the number trending toward saturation in line with the world's population.

Anybody can hook into an accelerator library, and that will be fine for many apps, but with dcompute you'll have the ability to quickly go beyond the canned solutions when those are deficient.

Lots of ways to win with dcompute.

January 15, 2022

On Friday, 14 January 2022 at 15:17:59 UTC, Ola Fosheim Grøstad wrote:

> *nods* For a long time we could expect "home computers" to be Intel/AMD, but then the computing environment changed, and maybe Apple is trying to make its own platform stand out as faster than it is by forcing developers to special-case their code for Metal rather than go through a generic API.
>
> I guess FPGAs will be available in entry-level machines at some point as well. So I understand that it will be a challenge to get dcompute to a "ready for the public" stage when there is no multi-person team behind it.

Maybe, though I suspect not for a while, but that could be wildly wrong. Anyway, I don't think they will be too difficult to support, provided the vendor in question provides an OpenCL implementation. The only thing to do is support pipes.

As for manpower, the reason is that I don't have any particular personal need for dcompute these days. I am happy to do features for people who need something in particular, e.g. Vulkan compute shaders or textures, and PRs are welcome. Though if Bruce makes millions and gives me a job, then that will obviously change ;)

> But I am not so sure about the apples-and-oranges aspect of it.

The apples-to-oranges comment was about doing benchmarks of CPU vs. GPU: there are so many factors that make performance comparisons (more) difficult. Is the GPU discrete? How important is latency vs. throughput? How "powerful" is the GPU compared to the CPU? How well suited to the task is the GPU? The list goes on. It's hard enough to do CPU benchmarks in an unbiased way.

If the intention is to say, "look at the speedup you can get for $TASK using $COMMON_HARDWARE", then yeah, that would be possible. It would certainly be possible to do a benchmark of, say, "ease of implementation with comparable performance" of dcompute vs. CUDA, e.g. LoC, verbosity, brittleness, etc., since the main advantage of D/dcompute (vs. CUDA) is enumeration of kernel designs for performance. That would give a nice measurable goal to improve usability.

> The presentation by Bryce was quite explicitly focused on making GPU computation available at the same level as CPU computation (sans function pointers). This should be possible for homogeneous memory systems (GPU and CPU sharing the same memory bus) in a rather transparent manner, and languages that plan for this might be perceived as much more productive and performant if/when this becomes reality. And C++23 isn't far away, if they make the deadline.

Definitely. Homogeneous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU, without worrying about memory transfer across the PCIe bus. That is something CUDA can't take advantage of, on account of Nvidia GPUs being discrete only. I've no idea how caching works in a system like that, though.

> It was also interesting to me that ISO C23 will provide custom bit-width integers and that this would make it easier to compile C code efficiently into tighter FPGA logic. I remember that LLVM used to have that in its IR, but I think it was taken out and limited to more conventional bit sizes?

Arbitrary-precision integers are still a part of LLVM, and I presume LLVM IR. The problem with that is, as with address-spaced pointers, D has no way to declare such types. I seem to remember Luís Marques doing something crazy like that (maybe in a DConf presentation?), compiling D to Verilog.

> It just shows that being a system-level programming language requires a lot of adaptability over time, and frameworks like dcompute can never be considered truly finished.

Of course.

January 15, 2022

On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson wrote:

> ...
>
> Definitely. Homogeneous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU, without worrying about memory transfer across the PCIe bus. That is something CUDA can't take advantage of, on account of Nvidia GPUs being discrete only. I've no idea how caching works in a system like that, though.
>
> ...

How is this different from unified memory?

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
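(For reference, the unified memory model documented above looks roughly like this from the programmer's side; a minimal sketch, not code from the thread:)

```cuda
// Minimal CUDA unified ("managed") memory sketch: one allocation is visible to
// both CPU and GPU, and the runtime migrates pages on demand (over PCIe on a
// discrete card, which is where it differs from truly shared physical memory).
#include <cuda_runtime.h>

__global__ void scale(float *buf, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= k;
}

int run(int n)
{
    float *buf = nullptr;
    if (cudaMallocManaged(&buf, n * sizeof(float)) != cudaSuccess)
        return -1;
    for (int i = 0; i < n; ++i)          // CPU writes through the same pointer
        buf[i] = float(i);
    scale<<<(n + 255) / 256, 256>>>(buf, 2.0f, n);
    cudaDeviceSynchronize();             // wait before the CPU reads results
    float check = buf[1];                // CPU reads, no cudaMemcpy anywhere
    cudaFree(buf);
    return check == 2.0f ? 0 : 1;
}
```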