January 15, 2022

On Saturday, 15 January 2022 at 08:01:15 UTC, Paulo Pinto wrote:

> On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson wrote:
>
> > ....
> >
> > Definitely. Homogeneous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU, without worrying about memory transfers across the PCI-e bus. Something which CUDA can't take advantage of, on account of Nvidia GPUs being only discrete. I've no idea how caching works in a system like that, though.
> >
> > ...
>
> How is this different from unified memory?
>
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

There's still a PCI-e bus in between. Fundamentally the memory must exist in either the CPU's RAM or the GPU's (V)RAM; from what I understand, unified memory allows the GPU to access the host RAM through the same pointer. This reduces the total memory consumed by the program, but to get to the GPU the data must still cross the PCI-e bus.
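
To make that concrete, here is a minimal CUDA sketch of the "same pointer" behaviour (the kernel and sizes are purely illustrative, not from anyone's code): cudaMallocManaged hands back one pointer that both host and device code can dereference, but on a discrete card the pages still migrate across the PCI-e bus when the GPU first touches them.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative kernel: scales an array in place.
    __global__ void scale(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float));    // one pointer, visible to host and device
        for (int i = 0; i < n; ++i) x[i] = 1.0f;     // touched on the host first
        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // pages migrate to the GPU over PCI-e here
        cudaDeviceSynchronize();                     // needed before the host reads the results
        printf("%f\n", x[0]);
        cudaFree(x);
    }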

January 15, 2022

On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson wrote:

> From what I understand, unified memory allows the GPU to access the host RAM through the same pointer. This reduces the total memory consumed by the program, but to get to the GPU the data must still cross the PCI-e bus.

Exactly. I remember that in 2013 "Unified Memory Access" on NVIDIA was underwhelming, performing worse than pinned transfer + GPU memory access.

January 15, 2022

On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson wrote:

> On Saturday, 15 January 2022 at 08:01:15 UTC, Paulo Pinto wrote:
>
> > On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson wrote:
> >
> > > ....
> > >
> > > Definitely. Homogeneous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU, without worrying about memory transfers across the PCI-e bus. Something which CUDA can't take advantage of, on account of Nvidia GPUs being only discrete. I've no idea how caching works in a system like that, though.
> > >
> > > ...
> >
> > How is this different from unified memory?
> >
> > https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
>
> There's still a PCI-e bus in between. Fundamentally the memory must exist in either the CPU's RAM or the GPU's (V)RAM; from what I understand, unified memory allows the GPU to access the host RAM through the same pointer. This reduces the total memory consumed by the program, but to get to the GPU the data must still cross the PCI-e bus.

Yes. You also gain some simplification from unified memory if your data structures are pointer-heavy.

I've tried to gain an advantage from GPU-side pulls across the bus in the past, but could never win out over explicit async copying utilizing the dedicated copy circuitry. Others, particularly those with high compute-to-load/store ratios, may have had better luck.

For reference, I've only been able to get a little over 80% of the advertised PCI-e peak bandwidth out of the dedicated Nvidia copy HW.
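
For anyone who hasn't played with the copy engines, the pattern I'm describing is roughly the following; this is only a sketch, and the buffer size and timing code are illustrative rather than my actual benchmark.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = size_t(1) << 28;            // 256 MiB, illustrative
        float *h = nullptr, *d = nullptr;
        cudaHostAlloc(&h, bytes, cudaHostAllocDefault);  // pinned, so the DMA engine can use it
        cudaMalloc(&d, bytes);

        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0, s);
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s); // queued on the copy engine
        cudaEventRecord(t1, s);
        cudaEventSynchronize(t1);                                // wait for the transfer to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("H2D: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3)); // compare against the PCI-e peak

        cudaFree(d);
        cudaFreeHost(h);
        cudaStreamDestroy(s);
        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }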

January 15, 2022

On Saturday, 15 January 2022 at 10:35:29 UTC, Guillaume Piolat wrote:

> On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson wrote:
>
> > From what I understand, unified memory allows the GPU to access the host RAM through the same pointer. This reduces the total memory consumed by the program, but to get to the GPU the data must still cross the PCI-e bus.
>
> Exactly. I remember that in 2013 "Unified Memory Access" on NVIDIA was underwhelming, performing worse than pinned transfer + GPU memory access.

Exactly++. Pinned buffers + async HW copies always won out for me.

I imagine there could be scenarios where programmatic peeking/poking from either side wins, but I've not seen them, probably because if your data flows are small enough for that to win, you'd just fire up SIMD and call it a day.
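
A sketch of one form of that peeking/poking, using CUDA's mapped (zero-copy) pinned memory; the kernel and sizes are illustrative only.

    #include <cuda_runtime.h>

    // Each access from the kernel crosses the bus; no bulk copy is issued.
    __global__ void incr(int *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;
    }

    int main() {
        const int n = 1 << 16;
        int *h = nullptr, *d = nullptr;
        cudaHostAlloc(&h, n * sizeof(int), cudaHostAllocMapped); // pinned and mapped into the GPU's address space
        cudaHostGetDevicePointer(&d, h, 0);                      // device-side alias of the host buffer
        incr<<<(n + 255) / 256, 256>>>(d, n);                    // GPU reads/writes host RAM directly
        cudaDeviceSynchronize();                                 // after this, h holds the updated values
        cudaFreeHost(h);
    }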

January 15, 2022

On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson wrote:

> As for manpower, the reason is that I don't have any particular personal need for dcompute these days. I am happy to do features for people that need something in particular, e.g. Vulkan compute shaders or textures, and PRs are welcome. Though if Bruce makes millions and gives me a job, then that will obviously change ;)

He can put me on the application list as well… This sounds like lots of fun!!!

> How important is latency vs. throughput? How "powerful" is the GPU compared to the CPU? How well suited to the task is the GPU? The list goes on. It's hard enough to do CPU benchmarks in an unbiased way.

I don't think people would expect benchmarks to be unbiased. It could be 3-4 short benchmarks, some showcasing where it is beneficial, some showcasing where data dependencies (or other challenges) make it less suitable.

E.g.

  1. compute autocorrelation over many different lags
  2. multiply and take the square root of two long arrays (see the sketch after this list)
  3. compute a simple IIR filter (I assume a recursive filter would be a worst case?)
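
For instance, a hypothetical CUDA kernel for benchmark 2 could be as small as this (names and launch configuration are just for illustration):

    // One independent element per thread: the kind of workload GPUs like.
    __global__ void mulSqrt(const float *a, const float *b, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = sqrtf(a[i] * b[i]);
    }

    // launched as, e.g.: mulSqrt<<<(n + 255) / 256, 256>>>(a, b, out, n);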

> If the intention is to say, "look at the speedup you can get for $TASK using $COMMON_HARDWARE", then yeah, that would be possible. It would certainly be possible to do a benchmark of, say, "ease of implementation with comparable performance" of dcompute vs. CUDA, e.g. LoC, verbosity, brittleness etc., since the main advantage of D/dcompute (vs. CUDA) is enumeration of kernel designs for performance. That would give a nice measurable goal to improve usability.

Yes, but I think of it as inspiration, with a tutorial on how to get the benchmarks to run. For instance, like you, I have no need for this at the moment, and my current computer isn't really a good showcase for GPU computation either, but I have one long-term hobby project where I might use GPU computations eventually.

I suspect many think of GPU computation as something requiring a significant amount of time to get into. Even though they may be interested, that threshold alone is enough to put it in the "interesting, but I'll look at it later" box.

If you can tease people into playing with it for fun, then I think there is a larger chance of them using it at a later stage (or even thinking about the possibility of using it) when they see a need in some heavy computational problem they are working on.

There is a lower threshold for getting started with something new if you already have a tiny toy project, written by yourself, that you can cut and paste from.

Also, updated benchmarks could generate new interest in the announce forum thread. Lurking forum readers probably only read the forums on occasion, so you have to make several posts to make people aware of it.

> Definitely. Homogeneous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU, without worrying about memory transfers across the PCI-e bus. Something which CUDA can't take advantage of, on account of Nvidia GPUs being only discrete. I've no idea how caching works in a system like that, though.

I don't know, but the Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip, at least that is what I've read. Maybe there will be more technical info available on how that works at the hardware level later, or maybe it is already on AMD's website?

If someone reading this thread has more info on this, it would be nice if they would share what they have found out! :-)

January 15, 2022

On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim Grøstad wrote:

> ...
>
> I don't know, but the Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip, at least that is what I've read. Maybe there will be more technical info available on how that works at the hardware level later, or maybe it is already on AMD's website?
>
> If someone reading this thread has more info on this, it would be nice if they would share what they have found out! :-)

According to the public documentation, you can expect it to be similar to an AMD Ryzen 7 3750H with Radeon RX Vega 10 Graphics (16 GB).

https://partner.steamgames.com/doc/steamdeck/testing#5

January 15, 2022

On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim Grøstad wrote:

> > Definitely. Homogeneous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU, without worrying about memory transfers across the PCI-e bus. Something which CUDA can't take advantage of, on account of Nvidia GPUs being only discrete.
>
> The Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip.

Related: has anyone here seen an actual measured performance gain from a co-located CPU and GPU on the same chip? I used to test with OpenCL on an Intel SoC and, again, it was underwhelming and not faster. I'd be happy to know about other experiences.

January 15, 2022

On Saturday, 15 January 2022 at 17:29:35 UTC, Guillaume Piolat wrote:

> On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim Grøstad wrote:
>
> > > Definitely. Homogeneous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU, without worrying about memory transfers across the PCI-e bus. Something which CUDA can't take advantage of, on account of Nvidia GPUs being only discrete.
> >
> > The Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip.
>
> Related: has anyone here seen an actual measured performance gain from a co-located CPU and GPU on the same chip? I used to test with OpenCL on an Intel SoC and, again, it was underwhelming and not faster. I'd be happy to know about other experiences.

Well, console memory systems are basically built around this idea. On the assumption that you mean a consumer chip with integrated graphics, any gain you see from sharing memory is going to be weighed against the chip being intended for people who were actually going to use the integrated graphics. For compute especially, it seems like this is very dependent on what access patterns you actually want to use on the memory.

The new Apple chips have a unified memory architecture, and a really fast one too. I don't know what GPGPU is like on them, but it's one of the reasons why they absolutely fly on normal code.

January 15, 2022

On Saturday, 15 January 2022 at 17:29:35 UTC, Guillaume Piolat wrote:

> On Saturday, 15 January 2022 at 12:21:37 UTC, Ola Fosheim Grøstad wrote:
>
> > > Definitely. Homogeneous memory is interesting for the ability to make GPUs do the things GPUs are good at and leave the rest to the CPU, without worrying about memory transfers across the PCI-e bus. Something which CUDA can't take advantage of, on account of Nvidia GPUs being only discrete.
> >
> > The Steam Deck, which appears to come out next month, seems to run under Linux and has an "AMD APU" with a modern GPU and CPU integrated on the same chip.
>
> Related: has anyone here seen an actual measured performance gain from a co-located CPU and GPU on the same chip? I used to test with OpenCL on an Intel SoC and, again, it was underwhelming and not faster. I'd be happy to know about other experiences.

The link below, on the vkpolybench software, includes graphs for integrated GPUs, among others, and shows significant (more than SIMD-width) speedups relative to a single CPU core for many of the benchmarks, but also break-even or worse on a few. Reports on real-world experiences with the integrated accelerators would be better.

https://github.com/ElsevierSoftwareX/SOFTX_2020_86

On paper, at least, it looks like SoC GPU performance will be severely impacted by the working set size, but whose isn't? Currently it also looks like the dcompute/SoC-GPU version will beat out my SIMD variant, but it'll be at least a few months before I have hard data to share.

Anyone out there have real-world data now?

January 15, 2022

On Thursday, 13 January 2022 at 23:28:01 UTC, Guillaume Piolat wrote:

> As a former GPGPU guy: can you explain in what ways dcompute improves life over using CUDA and OpenCL through DerelictCL/DerelictCUDA (I used to maintain them, and I think nobody ever used them)? Using the API directly seems to offer the most control to me, and needs no special compiler support.

Forgot to respond to this. This probably does not apply to dcompute, but Bryce pointed out in his presentation how you could step through your "GPU code" on the CPU using a regular debugger, since the parallel code is just regular C++. I'm not exactly sure how that works, but I would imagine that they provide functions that match the GPU?

That sounds like a massive productivity advantage to me if you want to write complicated "shaders".
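
My guess is that this is the C++ standard parallel algorithms (the -stdpar work Bryce has been presenting): the source is ordinary C++, so a CPU build can be stepped through in a normal debugger, while a compiler like nvc++ with -stdpar=gpu offloads the same loop to the GPU. A sketch of what such code looks like, not taken from his slides:

    #include <algorithm>
    #include <cmath>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<float> v(1'000'000, 2.0f);
        // Plain C++: breakpoints inside the lambda work in a CPU build,
        // and nvc++ -stdpar=gpu can run the same loop on the GPU.
        std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                      [](float &x) { x = std::sqrt(x) * 3.0f; });
    }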