Jump to page: 1 2
Thread overview
May 18
I just caught up on the dconf material. It was a really interesting year, and I'm super bummed I missed it!

So I'm just starting out on some GPGPU work, and I love the direction we're going here.

Given this scenario;
I want to write CUDA kernels in D, and execute them in an existing C++ CUDA
app.
Can anyone point me at the getting started material, how do I achieve this
seemingly simple goal?
Has anyone started on this content? Where are the toolchains, how to I
build a CUDA kernel, how do I link/load it in my C++ app?

If this doesn't exist, there needs to be a wiki with getting started guide covering various usage environments. I don't have a lot of time to spare on detail, so success will depend on my ability to hit the ground running, and that's really just a matter of simple clear reference material.

How far are we from integration into LDC without using forked compilers?


May 18
On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
> I just caught up on the dconf material. It was a really interesting year, and I'm super bummed I missed it!
>
> So I'm just starting out on some GPGPU work, and I love the direction we're going here.
>
> Given this scenario;
> I want to write CUDA kernels in D, and execute them in an existing C++ CUDA
> app.

Can do already although not all the intrinsics are there yet, but they are very simple to add. See https://github.com/libmir/dcompute/blob/master/source/dcompute/std/cuda/index.d for how easy it is (PRs welcome!)

> Can anyone point me at the getting started material, how do I achieve this
> seemingly simple goal?

If you're only interested in CUDA I think any LLVM 3.9.1 or 4.0 should work, If you want SPIRV get my LLVM from https://github.com/thewilsonator/llvm/tree/compute

Get the dcompute branch of LDC ( https://github.com/ldc-developers/ldc/tree/dcompute) and its submodules (i.e. druntime, I don't think it points to a branch anymore because Johan deleted that when I merged that into the regular ldc druntime but it will still work.) The dcompute branch of LDC is in the process of being merged into master so expect some breakage when updating to that.

> Has anyone started on this content? Where are the toolchains, how to I
> build a CUDA kernel, how do I link/load it in my C++ app?
>

There is some info on the dcompute wiki (https://github.com/libmir/dcompute/wiki) and I intend make this into "production quality" documentation (again PRs welcome or I can give you write access to the wiki), but I'm very busy at the moment.

When ldc runs you will get a kernels_cudaxxx_yy.ptx (where xxx is the CUDA compute capability specified on the command line and yy is 32 or 64 for 32 or 64bit) which should fit somewhere into your existing C++ pipeline.

> If this doesn't exist, there needs to be a wiki with getting started guide covering various usage environments. I don't have a lot of time to spare on detail, so success will depend on my ability to hit the ground running, and that's really just a matter of simple clear reference material.
>
> How far are we from integration into LDC without using forked compilers?

Soon™, it the first thing I'm going to do (if I don't get it done before) after I hand in my honours thesis at the start of July. I'm stupidly busy this semester so don't count on it done beforehand.

If you have any questions about anything probably the fastest way is the libmir or ldc gitter channels. (https://gitter.im/ldc-developers/main , https://gitter.im/libmir/public).

Good luck! I look forward to your experience report.
Nic

May 18
On Thursday, 18 May 2017 at 09:07:38 UTC, Nicholas Wilson wrote:
> When ldc runs you will get a kernels_cudaxxx_yy.ptx (where xxx is the CUDA compute capability specified on the command line and yy is 32 or 64 for 32 or 64bit) which should fit somewhere into your existing C++ pipeline.

Whoops, that assumes you have a CUDA driver API pipeline in your C++ code, which if you're asking I'm not sure that you have.
If you're using the `kernel<<<...>>>(args)` form to launch you kernels then you are going to have a lot more work to do in D because you'll need to use the driver API (http://docs.nvidia.com/cuda/cuda-driver-api/#axzz4hQLA0Zdm)
You'll need to:
*get a device
*create a context from it
*get a stream on that context
*load the ptx module (possibly linking it with other modules, to resolve missing symbols).
*compile it for the device
*then launch a kernel from that module on that device, by name passing the arguments in a void*[].

The sad thing is that its still nice than OpenCL because in OpenCL you have to pass the runtime args (with sizes) one by one to a function.

Hence why I want to automate as much of that shit as is possible.
 I hope to have that done ASAP, but I don't have hardware set up to test CUDA at the moment (I have one on my windows box but I don't have dev set up there) and I'll be working on OpenCL at the same time (and theres only so much horrible API I can take in a day).
 I'll be working on dcompute part-part-time next semester though so I should be able to get a fair bit done and quite a few others are interested so that'll speed thing up a bit.
May 30
On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
> How far are we from integration into LDC without using forked compilers?

The future is now!

https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh@forum.dlang.org

https://github.com/ldc-developers/ldc/commit/69ad69e872f53c14c101e2c029c4757c4073f487
is the final commit from the stuff I've done prior to dconf.
May 30
On 30 May 2017 at 17:33, Nicholas Wilson via Digitalmars-d < digitalmars-d@puremagic.com> wrote:

> On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
>
>> How far are we from integration into LDC without using forked compilers?
>>
>
> The future is now!
>
> https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh@forum.dlang.org
>
> https://github.com/ldc-developers/ldc/commit/69ad69e872f53c1
> 4c101e2c029c4757c4073f487
> is the final commit from the stuff I've done prior to dconf.
>

Awesome stuff! That was fast :)

You're right, I'm using kernel<<<...>>>, and it's very convenient.
I looked briefly and realised that I had a lot of work to get running (as
you describe), so I stuck with my current setup for the moment :(

Is a <<<...>>> equivalent going to be possible in D, with kernel object fragments built into the binary together with the CPU code?

I'm definitely looking forward to action in this space, and the wiki to come online :)


May 30
On Tuesday, 30 May 2017 at 08:14:16 UTC, Manu wrote:
> On 30 May 2017 at 17:33, Nicholas Wilson via Digitalmars-d < digitalmars-d@puremagic.com> wrote:
>
>> On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
>>
>>> How far are we from integration into LDC without using forked compilers?
>>>
>>
>> The future is now!
>>
>> https://forum.dlang.org/thread/zcfqujlgnultnqfksbjh@forum.dlang.org
>>
>> https://github.com/ldc-developers/ldc/commit/69ad69e872f53c1
>> 4c101e2c029c4757c4073f487
>> is the final commit from the stuff I've done prior to dconf.
>>
>
> Awesome stuff! That was fast :)
>
> You're right, I'm using kernel<<<...>>>, and it's very convenient.

Yep, thats (one of the reasons) why CUDA is more successful than OpenCL and therefore one of the more powerful draws for those poor sods using OpenCL.

> I looked briefly and realised that I had a lot of work to get running (as
> you describe), so I stuck with my current setup for the moment :(
>

I figured, I'll get you using it eventually.

> Is a <<<...>>> equivalent going to be possible in D, with kernel object fragments built into the binary together with the CPU code?

As I explained in my dconf presentation: the idea is to have
Queue q = ... ; // the equivalent of a CUDA stream

    q.enqueue!kernel(sizes)(kernel_arguments);

where q.enqueue returns a callable that you then call with the arguments. It was modelled directly after CUDAs <<<...>>>

as for embedding in the binary a post build step that does

ubyte[] ptx_code = import("kernels_cuda620_64.ptx");

should be doable as should invoking ptxas and doing the same.
Then proving a consistent naming convention is used the code can do its magic.
Or the files could just be read from disk.

>
> I'm definitely looking forward to action in this space, and the wiki to come online :)

Yeah once my thesis is done thing should start moving. Any input with your expertise with CUDA will be much appriciated.

June 03
On 30 May 2017 at 19:54, Nicholas Wilson via Digitalmars-d < digitalmars-d@puremagic.com> wrote:

>
> as for embedding in the binary a post build step that does
>
> ubyte[] ptx_code = import("kernels_cuda620_64.ptx");
>
> should be doable as should invoking ptxas and doing the same.
> Then proving a consistent naming convention is used the code can do its
> magic.
> Or the files could just be read from disk.


Is it possible to convince the compiler to emit code built for the backend
target directly into the same object file as the host code?
I feel like this should be possible, along the lines of
__attribute__((target(...)) to convince the compiler to generate code for a
few functions with different targets than the module?

Using solutions as you suggest above introduces dependent build sequencing into the build script. Different build systems might prove to be more or less difficult to integrate cleanly, and many people use build-script generators which might need to learn a few new tricks.

Any input with your expertise with CUDA will be much appriciated.
>

'Expertise' is possibly not the word I'd suggest ;)
But I'll have some established software by that time that I'd love to
attempt to port, we can work through rough edges together when you're
available. No rush.


June 03
On Saturday, 3 June 2017 at 11:58:14 UTC, Manu wrote:
> Is it possible to convince the compiler to emit code built for the backend target directly into the same object file as the host code?
> I feel like this should be possible, along the lines of
> __attribute__((target(...)) to convince the compiler to generate code for a few functions with different targets than the module?
>

Alas no. __attribute__((target(...)) works because it targeting different targets of the _same_ backend, this targets different backends.

> Using solutions as you suggest above introduces dependent build sequencing into the build script. Different build systems might prove to be more or less difficult to integrate cleanly, and many people use build-script generators which might need to learn a few new tricks.
>

Hence why I made reading from disk (with a predictable output name) a viable option.
Any build script trickery would simply just feed in slightly later down the pipeline of.

read file from disk -> in memory -> compute API -> binary -> execute

Also thats how using other kernels not written in D would work, just plug somewhere into the pipeline.

> 'Expertise' is possibly not the word I'd suggest ;)
> But I'll have some established software by that time that I'd love to attempt to port, we can work through rough edges together when you're available. No rush.

Great. Heh, still more experience than I have. Looking forward to it, earliest would probably be August.

June 03
On Saturday, 3 June 2017 at 12:13:41 UTC, Nicholas Wilson wrote:
> Alas no. __attribute__((target(...)) works because it targeting different targets of the _same_ backend, this targets different backends.

But surely you could just take the buffer with the object code emitted by the compute target backend and emit it as constant global data into the host llvm::Module?

 — David
June 03
On Saturday, 3 June 2017 at 14:12:28 UTC, David Nadlinger wrote:
> On Saturday, 3 June 2017 at 12:13:41 UTC, Nicholas Wilson wrote:
>> Alas no. __attribute__((target(...)) works because it targeting different targets of the _same_ backend, this targets different backends.
>
> But surely you could just take the buffer with the object code emitted by the compute target backend and emit it as constant global data into the host llvm::Module?
>
>  — David

Good idea, name of global needs to be known in advance though.
« First   ‹ Prev
1 2