Jump to page: 1 2
Thread overview
ldc nvvm GPU intrinsics good news
Mar 05, 2021
Bruce Carneal
Mar 05, 2021
kinke
Mar 05, 2021
Bruce Carneal
Mar 05, 2021
kinke
Mar 05, 2021
Bruce Carneal
Mar 05, 2021
kinke
Mar 05, 2021
Bruce Carneal
Mar 05, 2021
kinke
Mar 05, 2021
kinke
Mar 05, 2021
Bruce Carneal
Mar 05, 2021
Bruce Carneal
Apr 25, 2021
Johan Engelen
Apr 26, 2021
Bruce Carneal
Apr 27, 2021
Imperatorn
March 05, 2021
After updating the first line to '@compute(CompileFor.hostAndDevice) module ...' and adding an 'import ldc.dcompute;' line, the runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC build apparently gives access to all manner of GPU intrinsics.

I've only tried it out on __syncthreads and __nvvm_shfl_down_i32 but both "invocations" of those LDC_intrinsics resulted in the expected single ptx instruction in the dcompute .ptx output.

There are over 600 pragma(LDC_intrinsic, "llvm.nvvm.xxxxxx") builtins in the gcc_builtins_nvvm.di file so, of course, I've not hand tested them all but it looks very promising.

If you're working with dcompute on an OpenCL device, I'd love to hear if something similar works for your use cases of if you've found another way forward.

March 05, 2021
On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
> After updating the first line to '@compute(CompileFor.hostAndDevice) module ...' and adding an 'import ldc.dcompute;' line, the runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC build apparently gives access to all manner of GPU intrinsics.

Sorry, I assumed you had already seen it (plus ldc/gccbuiltins_amdgcn.di); they've been introduced with LDC v1.22. These files are auto-generated from the LLVM .td file using a tiny program [1]. It's currently limited to intrinsics with a GCCBuiltinName, so if you'd like to make more of these available in D, see processRecord().

[1] https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp
March 05, 2021
On Friday, 5 March 2021 at 00:16:41 UTC, kinke wrote:
> On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
>> After updating the first line to '@compute(CompileFor.hostAndDevice) module ...' and adding an 'import ldc.dcompute;' line, the runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC build apparently gives access to all manner of GPU intrinsics.
>
> Sorry, I assumed you had already seen it (plus ldc/gccbuiltins_amdgcn.di); they've been introduced with LDC v1.22. These files are auto-generated from the LLVM .td file using a tiny program [1]. It's currently limited to intrinsics with a GCCBuiltinName, so if you'd like to make more of these available in D, see processRecord().
>
> [1] https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp

So what is the right/clean way to get @compute access to these builtins?  Should practitioners do a manual copy edit or is there a better way?

March 05, 2021
On Friday, 5 March 2021 at 00:32:03 UTC, Bruce Carneal wrote:
> So what is the right/clean way to get @compute access to these builtins?  Should practitioners do a manual copy edit or is there a better way?

AFAICT, the only challenge is to find a suited signature in D for the LLVM intrinsic, incl. some unique name if there's no gcc builtin name - as a GitHub PR for that tool, so that all future .di files contain these extra intrinsics. To get there, I'd suggest adding some temp logging to inspect the records from the parsed .td file.
March 05, 2021
On Friday, 5 March 2021 at 00:16:41 UTC, kinke wrote:
> On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
> ... These files are auto-generated from the LLVM .td file using a tiny program [1]. It's currently limited to intrinsics with a GCCBuiltinName, so if you'd like to make more of these available in D, see processRecord().
>
> [1] https://github.com/ldc-developers/ldc/blob/master/utils/gen_gccbuiltins.cpp

Good to know.  __nvvm_vote_ballot and __nvvm_vote_ballot_sync both show up in the .td file with GCCBuiltin names but do not show up in the .di file.

I've found a workaround for those but I'd like to understand what kept them from being included in the .di file.  I'm guessing it is the dtype(...) call from within processRecord(...) that did not know what to do with an i1 type?

If it's as simple as that I'd greatly appreciate your augmenting the if/else ladder in dtype() to include the branch for i1.  The ballot/vote instruction may be better than my workaround and has other uses.

I tried to work around the .di file issue by declaring my own pragma:
pragma(LDC_intrinsic, "llvm.nvvm.vote.ballot") int myBallot(bool);
with no luck:
 LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.vote.ballot

If you know how I can manually define the ballot pragmas correctly before any dtype(...) upgrade you might release, please let me know.

Finally and again, many thanks for your efforts maintaining LDC.

March 05, 2021
On Friday, 5 March 2021 at 00:48:26 UTC, kinke wrote:
> On Friday, 5 March 2021 at 00:32:03 UTC, Bruce Carneal wrote:
>> So what is the right/clean way to get @compute access to these builtins?  Should practitioners do a manual copy edit or is there a better way?
>
> AFAICT, the only challenge is to find a suited signature in D for the LLVM intrinsic, incl. some unique name if there's no gcc builtin name - as a GitHub PR for that tool, so that all future .di files contain these extra intrinsics. To get there, I'd suggest adding some temp logging to inspect the records from the parsed .td file.

To be clear, I'm not asking for how a new builtin might be added cleanly, I'm asking if there is a better way to access all the builtins present in the current .di file.

As noted above, I had to copy and edit the .di file to avoid this error from ldc:
...: Error: can only call functions from other @compute modules in @compute code

Once I did that, everything I have tried from the edited .di file appears to work as intended.  Amazing!  I just thought that copying/editing could be avoided.

March 05, 2021
On Friday, 5 March 2021 at 01:59:35 UTC, Bruce Carneal wrote:
> To be clear, I'm not asking for how a new builtin might be added cleanly, I'm asking if there is a better way to access all the builtins present in the current .di file.
>
> As noted above, I had to copy and edit the .di file to avoid this error from ldc:
> ...: Error: can only call functions from other @compute modules in @compute code
>
> Once I did that, everything I have tried from the edited .di file appears to work as intended.  Amazing!  I just thought that copying/editing could be avoided.

The dcompute parts in the compiler could definitely be improved (but it's nowhere on my priorities list). What I tried to get at by linking my comment [1] is that you can generally target NVPTX/AMDGCN directly, without any @compute stuff, and in that case importing and using intrinsics from that .di works without any issues.

[1] https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886
March 05, 2021
On Friday, 5 March 2021 at 11:28:31 UTC, kinke wrote:
> On Friday, 5 March 2021 at 01:59:35 UTC, Bruce Carneal wrote:
[...]
>
> The dcompute parts in the compiler could definitely be improved (but it's nowhere on my priorities list). What I tried to get at by linking my comment [1] is that you can generally target NVPTX/AMDGCN directly, without any @compute stuff, and in that case importing and using intrinsics from that .di works without any issues.
>
> [1] https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886

Okay.  I'll do the gccbuiltins_nvvm.di copy/edit for use in dcompute as part of my build script.

A copy/edit at build time will give access to many of the nvvm intrinsics and Johan's irEx suggestion can be used for access to others.

That leaves the intrinsics in the .td file that have llvm_i1_ty parameters and gcc builtin names (like vote/ballot).  IIUC these would all be handled by the gen_gccbuiltins.cpp program if it knew how to deal with llvm_i1_ty parameters.

Unfortunately, the "obvious" pragma(LDC_intrinsic, ...) forms that I've tried for llvm_i1_ty gccbuiltin intrinsics have not worked, at least not for the vote/ballot test cases.  I'll keep investigating.  Clang and nvcc are both based on LLVM and deal with these somehow so I'll start looking there.

I expect that you're very busy but if you already know or suspect what the llvm_i1_ty problem/fix is, a brief sketch here would be much appreciated.  Regardless, thanks again for your efforts.  A working LDC/dcompute is *much* nicer than C++/CUDA.

March 05, 2021
On Friday, 5 March 2021 at 13:30:52 UTC, Bruce Carneal wrote:
> Unfortunately, the "obvious" pragma(LDC_intrinsic, ...) forms that I've tried for llvm_i1_ty gccbuiltin intrinsics have not worked, at least not for the vote/ballot test cases.  I'll keep investigating.

For __nvvm_vote_ballot (no idea what it does), this seems to suffice (bool for i1):

pragma(LDC_intrinsic, "llvm.nvvm.vote.ballot")
int __nvvm_vote_ballot(bool);

void foo(bool a) { __nvvm_vote_ballot(a); }

https://run.dlang.io/is/z0k6l9
March 05, 2021
On Friday, 5 March 2021 at 16:34:07 UTC, kinke wrote:
> https://run.dlang.io/is/z0k6l9

Small correction (IR generation was fine, compilation wasn't): that intrinsic requires a newer target shader model (-mcpu=sm_70) [and -betterC to avoid ModuleInfo etc.]. Then the asm can be generated and inspected too.
« First   ‹ Prev
1 2