Thread overview
ldc/dcompute nvptx intrinsics
Feb 19, 2021
Bruce Carneal
Feb 20, 2021
Johan Engelen
Feb 20, 2021
Bruce Carneal
Feb 21, 2021
Bruce Carneal
Feb 23, 2021
Johan
Feb 23, 2021
Bruce Carneal
Mar 05, 2021
Bruce Carneal
Feb 23, 2021
Imperatorn
February 19, 2021
LLVM appears to "know" about most if not all of the CUDA intrinsics as can be seen here: https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXIntrinsics.td

I've successfully mimicked dcompute's pragma(LDC_intrinsic "llvm.nvvm...") idiom for some things but have not succeeded in implementing "ballot" or other important CUDA intrinsics that are commented out in dcompute.std.warp.d.  Perhaps I'm just not uttering the correct incantations...

I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve.  Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?

I'm considering trying to merge a hand written PTX file in at build time and falling back to C++/CUDA if all else fails.  Anything that will help me avoid a trip to the dark side would be much appreciated.

February 20, 2021
On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:
>
> I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve.  Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?

Hi Bruce,
  You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics.
Very quickly, something like this:

```
alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*);
```

-Johan

February 20, 2021
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:
> On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:
>>
>> I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve.  Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?
>
> Hi Bruce,
>   You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics.
> Very quickly, something like this:
>
> ```
> alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*);
> ```
>
> -Johan

This looks like a very good way to enable intrinsics.  I'll give it a try and report back.  If it's clean enough I'll also submit a dcompute PR.

Thanks Johan!

February 21, 2021
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:
> On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:
>>
>> I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve.  Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?
>
> Hi Bruce,
>   You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics.
> Very quickly, something like this:
>
> ```
> alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*);
> ```
>
> -Johan

Here's what I get so far when testing the above out against llvm.nvvm.barrier0.popc (something that works as an LDC_intrinsic):

"can only call functions from other @compute modules in @compute code"

I'll keep digging.

February 23, 2021
On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:
> On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:
>>
>> I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve.  Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?
>
> Hi Bruce,
>   You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics.
> Very quickly, something like this:
>
> ```
> alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*);
> ```
>
> -Johan

Wat, didn't even know about that
February 23, 2021
On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal wrote:
> On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:
>> On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:
>>>
>>> I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve.  Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?
>>
>> Hi Bruce,
>>   You can write LLVM IR inside D code using ldc.llvmasm.__irEx. Perhaps that works as a quick workaround for defining calls to intrinsics.
>> Very quickly, something like this:
>>
>> ```
>> alias someFunction = __irEx!("forward declaration of intrinsic", "call intrinsic(%0,%1,%2)", int, long, void*);
>> ```
>>
>> -Johan
>
> Here's what I get so far when testing the above out against llvm.nvvm.barrier0.popc (something that works as an LDC_intrinsic):
>
> "can only call functions from other @compute modules in @compute code"
>
> I'll keep digging.

Hi Bruce,
  I played around a bit and have a full working example for you:

```
@compute(CompileFor.deviceOnly) module dcompute;

import ldc.dcompute;

// Copied from ldc.llvmasm, because indeed it needs to be defined in @compute code.
pragma(LDC_inline_ir)
    R __irEx(string prefix, string code, string suffix, R, P...)(P) @trusted nothrow @nogc;

alias someFunction = __irEx!("declare i32 @llvm.nvvm.barrier0.popc(i32)", "%i = call i32 @llvm.nvvm.barrier0.popc(i32 %0)\nret i32 %i", "", int, int);


void foo(GlobalPointer!float x_in) {
    auto i = someFunction(2);
}
```

Compile with `bin/ldc2 -c -mdcompute-targets=cuda-350 -mdcompute-file-prefix=dcompute_kernel  dcompute.d -O3` gives PTX output:

```
❯ cat dcompute_kernel_cuda350_64.ptx
//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_35
.address_size 64

        // .globl       _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv

.visible .func _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv(
        .param .b64 _D8dcompute3fooFS3ldcQt__T7PointerVEQtQBk9AddrSpacei1TfZQBeZv_param_0
)
{
        .reg .b32       %r<3>;

        mov.u32         %r1, 2;
        {
        .reg .pred      %p1;
        setp.ne.u32     %p1, %r1, 0;
        bar.red.popc.u32        %r2, 0, %p1;
        }
        ret;

}
```

If this indeed will fit your usecase, you have a good argument for including `__irEx` into ldc.dcompute. Please file bugs/features on github!

cheers,
  Johan


February 23, 2021
On Tuesday, 23 February 2021 at 18:04:52 UTC, Johan wrote:
> On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal wrote:
>> On Saturday, 20 February 2021 at 12:38:35 UTC, Johan Engelen wrote:
>>> On Friday, 19 February 2021 at 20:02:29 UTC, Bruce Carneal wrote:
>>>>
>>>> I'd love to bring up some additional functionality but I'm new to LDC/LLVM so it's slow going at this point of the learning curve.  Does anyone have additional LDC/dcompute CUDA intrinsics working or have some pointers for bringing up more CUDA/warp intrinsics generally?
>>>

>
> Hi Bruce,
>   I played around a bit and have a full working example for you:
>
> ```
> @compute(CompileFor.deviceOnly) module dcompute;
>
> import ldc.dcompute;
>
> [... working example ...]
>
> If this indeed will fit your usecase, you have a good argument for including `__irEx` into ldc.dcompute. Please file bugs/features on github!
>
> cheers,
>   Johan

Success!

After verifying that your latest example worked I plugged in llvm.nvvm.clz.i from the .td file.  That generated the hoped for single instruction function body:
  clz.b32  %r2, %r1
This clz intrinsic alone saves me a couple dozen instructions in a hot section of code where I "call" clz twice.

I will expand the set of intrinsics enabled via the __irEx method over the next few days and then try to contact Nicholas W. and/or John C. via email or beerconf to get their take on the capabilities (they may suggest a more easily supported way to go about things, or have cuda naming suggestions, or want to rationalize these with OCL, or ...).  Assuming that goes well I'll file with ldc and dcompute.

Thank you Johan.


March 05, 2021
On Tuesday, 23 February 2021 at 23:36:53 UTC, Bruce Carneal wrote:
> On Tuesday, 23 February 2021 at 18:04:52 UTC, Johan wrote:
>> On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal wrote:
[...]
>>
>> Hi Bruce,
>>   I played around a bit and have a full working example for you:
>>
>> ```
>> @compute(CompileFor.deviceOnly) module dcompute;
>>
>> import ldc.dcompute;
>>
>> [... working example ...]
>>
>> If this indeed will fit your usecase, you have a good argument for including `__irEx` into ldc.dcompute. Please file bugs/features on github!
>>
>> cheers,
>>   Johan
>
> Success!
>
> After verifying that your latest example worked I plugged in llvm.nvvm.clz.i from the .td file.  That generated the hoped for single instruction function body:
>   clz.b32  %r2, %r1
> This clz intrinsic alone saves me a couple dozen instructions in a hot section of code where I "call" clz twice.
>
> I will expand the set of intrinsics enabled via the __irEx method over the next few days and then try to contact Nicholas W. and/or John C. via email or beerconf to get their take on the capabilities (they may suggest a more easily supported way to go about things, or have cuda naming suggestions, or want to rationalize these with OCL, or ...).  Assuming that goes well I'll file with ldc and dcompute.
>
> Thank you Johan.

Turns out that almost all intrinsics are available after a tiny bit of .di file massaging (see the "ldc nvvm GPU intrinsics good news" thread).  The irEx capability Johan pointed out seems to cover the rest.

The irEx accessible set is defined at the top of the intrinsics .td file and includes the popc intrinsic, the clz initrinsic, and about a dozen others, most very useful.  The .di file instrinsics number over 500, some of which appear useful.