Thread overview
ldc/dcompute and shared (programmer managed cache) access
Mar 04, 2021: Bruce Carneal
Mar 05, 2021: kinke
Mar 05, 2021: Bruce Carneal
Mar 06, 2021: Imperatorn
Mar 06, 2021: Johan Engelen
March 04, 2021
ldc doesn't handle CUDA/Nvidia/PTX shared memory declarations, but shared memory can be very useful when tuning block (subgroup) cooperative algorithms.

Turns out that you can manually (or programmatically) inject a .shared declaration into the .ptx output file and, thereafter, obtain the shared memory pointer with a three-instruction sequence.
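
For the programmatic variant, the injection can be a trivial post-processing step over the generated .ptx.  A minimal sketch in D (the declaration shape and the insertion point are assumptions on my part, adjust to whatever the toolchain actually accepts):

```d
import std.conv : to;
import std.file : readText, write;
import std.string : indexOf;

// Splice a module-scope .shared declaration into an existing .ptx file.
// The declaration form and the insertion point are assumptions; adjust to
// match what your driver/linker accepts.
void injectShared(string ptxPath, string name, size_t bytes)
{
    auto ptx  = readText(ptxPath);
    auto decl = ".shared .align 16 .b8 " ~ name ~ "[" ~ bytes.to!string ~ "];\n";

    auto at = ptx.indexOf(".visible .entry");      // before the first kernel entry
    if (at < 0) at = cast(ptrdiff_t) ptx.length;   // or append if none is found
    write(ptxPath, ptx[0 .. at] ~ decl ~ ptx[at .. $]);
}
```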

As a slightly cleaner alternative I'll next look at using ldc/dcompute as a .o and .ptx generator while punting the fatbin/linking stuff to clang or nvcc in a build script.  The current simplicity of single-ptx-file is very nice but foregoing shared memory performance boosts is not nice so ...  I'm pretty sure that we'll need to move beyond the single ptx file model if we want to embrace shared cleanly.

I left a comment on the open dcompute GitHub issue regarding shared but saw no response there.  If you have guidance to give on this topic, please speak up.

Finally, after locking down the shared workaround I'll look at incorporating all the CUDA intrinsics that clang supports.  Turns out that the very clean irEx hack that Johan provided, which works for clz, apparently only works on a smallish subset of the intrinsics (that, or my ignorance is showing again).  More on that later.
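
For reference, the irEx approach for clz looks roughly like this (a sketch assuming the current ldc.llvmasm.__irEx signature; the wrapper name is mine, llvm.ctlz.i32 is the standard LLVM intrinsic):

```d
import ldc.llvmasm : __irEx;

// Count leading zeros via inline LLVM IR.  The wrapper name is just for
// illustration; llvm.ctlz.i32 is a standard LLVM intrinsic.
uint clz(uint x)
{
    return __irEx!(
        `declare i32 @llvm.ctlz.i32(i32, i1)`,           // module-level declaration
        `%r = call i32 @llvm.ctlz.i32(i32 %0, i1 false)
         ret i32 %r`,                                     // body IR (i1 false: defined result for 0)
        ``,                                               // no epilogue
        uint, uint)(x);
}
```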

March 05, 2021
On Thursday, 4 March 2021 at 04:52:52 UTC, Bruce Carneal wrote:
> As a slightly cleaner alternative I'll next look at using ldc/dcompute as a .o and .ptx generator while punting the fatbin/linking stuff to clang or nvcc in a build script.  The current simplicity of single-ptx-file is very nice but foregoing shared memory performance boosts is not nice so ...  I'm pretty sure that we'll need to move beyond the single ptx file model if we want to embrace shared cleanly.

Not sure if it's of any help, but using something like `-mtriple=nvptx64 -mcpu=sm_50` might suffice: https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
March 05, 2021
On Friday, 5 March 2021 at 00:39:29 UTC, kinke wrote:
> On Thursday, 4 March 2021 at 04:52:52 UTC, Bruce Carneal wrote:
>> As a slightly cleaner alternative I'll next look at using ldc/dcompute as a .o and .ptx generator while punting the fatbin/linking stuff to clang or nvcc in a build script.  The current simplicity of single-ptx-file is very nice but foregoing shared memory performance boosts is not nice so ...  I'm pretty sure that we'll need to move beyond the single ptx file model if we want to embrace shared cleanly.
>
> Not sure if it's of any help, but using something like `-mtriple=nvptx64 -mcpu=sm_50` might suffice: https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886

Per the other thread that you've seen, and helped with, the intrinsics are mostly good to go.  Thanks.

Unfortunately, CUDA __shared__ is not, AFAICT, addressed by the pull request that you cited.  We need some way to get a __shared__ declaration emitted to the PTX file.

The code generation through standalone (uninitialized) __shared__ pointers, which we *can* declare, already emits the correct PTX instructions (ld.shared, st.shared), but we have no way, at least none that I've found, to declare __shared__ address space storage that we can point to.  (clang CUDA can do this, nvcc obviously can, and we currently can do this only by manually modifying the .ptx file.)
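
To make the gap concrete, here's roughly what is and isn't expressible today (a sketch assuming the ldc.dcompute Pointer aliases; I haven't compiled this exact snippet):

```d
@compute(CompileFor.deviceOnly) module shmem_demo;
import ldc.dcompute;

// Loads and stores through a shared-address-space pointer already lower to
// ld.shared / st.shared.  What's missing is any way to declare the
// __shared__ storage for such a pointer to refer to.
@kernel void demo(GlobalPointer!float gdata)
{
    SharedPointer!float tile = void;   // declarable, but nothing to point it at
    tile[0] = gdata[0];                // st.shared (after the global load)
    gdata[0] = tile[0] + 1.0f;         // ld.shared (before the global store)
}
```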

The newly available shuffle intrinsics provide for efficient cooperative processing at the warp (32 lane) level, which helps a lot.  The __shared__ address space capability would allow more efficient cooperation across up to 1024 lanes (the current Nvidia per-block upper limit, IIUC).
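
For example, a full warp sum needs nothing but registers (a sketch; `shflDown` below is a stand-in name for whichever shuffle wrapper is actually used, not its real spelling):

```d
// Hypothetical binding to the warp shuffle-down intrinsic; name is made up.
float shflDown(float val, uint delta);

// 32-lane sum reduction using only warp shuffles: each step folds the upper
// half of the remaining lanes onto the lower half.
float warpSum(float v)
{
    for (uint offset = 16; offset > 0; offset >>= 1)
        v += shflDown(v, offset);
    return v;   // lane 0 ends up holding the sum of all 32 lanes
}
```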

As I understand it, the L1 caches on Nvidia cards are now write-back, so with the very relaxed memory model in play on GPUs an explicit programmer-managed scratchpad (__shared__) is not as important as it used to be for many apps, but in some situations it still matters.  It would be great if we could find a way to provide the capability somewhere down the road.



March 06, 2021
On Friday, 5 March 2021 at 04:49:56 UTC, Bruce Carneal wrote:
> On Friday, 5 March 2021 at 00:39:29 UTC, kinke wrote:
>> [...]
>
> Per the other thread that you've seen, and helped with, the intrinsics are mostly good to go.  Thanks.
>
> [...]

(Just wanted to say it's nice to see someone showing interest in dcompute; I hope it gets some love)
March 06, 2021
On Friday, 5 March 2021 at 04:49:56 UTC, Bruce Carneal wrote:
>
> The code generation through standalone (uninitialized) __shared__ pointers, which we *can* declare, already emits the correct PTX instructions (ld.shared, st.shared), but we have no way, at least none that I've found, to declare __shared__ address space storage that we can point to.  (clang CUDA can do this, nvcc obviously can, and we currently can do this only by manually modifying the .ptx file.)

I'm not very familiar with DCompute, but it sounds to me like all that is needed is a proposed syntax for what you want. The compiler support is probably pretty easy to add.
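
Something along these lines, perhaps (purely a hypothetical sketch of a possible surface syntax, nothing that exists today; `SharedStorage` is a made-up name):

```d
@compute(CompileFor.deviceOnly) module proposal;
import ldc.dcompute;

@kernel void blockSum(GlobalPointer!float gdata)
{
    // Hypothetical: the compiler would lower this to a .shared PTX variable
    // and let you index it with ordinary D syntax.
    SharedStorage!(float, 256) tile;

    // ... cooperative loads into `tile`, barrier, block-level reduction ...
}
```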

-Johan