January 20, 2022

On Thursday, 20 January 2022 at 04:01:09 UTC, Araq wrote:

>

On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson wrote:

>

I mean there are parametric attributes of the hardware, say for example cache size (or available registers for GPUs), that have a direct effect on how many times you can unroll the inner loop, say for a windowing function, and you want to ship optimised code for multiple configurations of hardware.

You can much more easily create multiple copies for different cache sizes (or register availability) in D than you can in C++, because static foreach and static if >>> if constexpr.

And you can do that even more easily with an AST macro system. Which Julia has...

Given this endorsement I started reading up on Julia/GPU... Here are a few things that I found:
A gentle tutorial: https://nextjournal.com/sdanisch/julia-gpu-programming
Another, more concise: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

For those that are video oriented, here's a recent workshop:
https://www.youtube.com/watch?v=Hz9IMJuW5hU

While I admit to just skimming that very long video, I was impressed by the tooling on display and the friendly presentation.

In short, I found a lot to like about Julia from the above and other writings, but the material on Julia AST macros specifically was ... underwhelming. AST macros look like an inferior tool in this low-level setting. They are slightly less readable to me than the dcompute alternatives without offering any compensating gain in performance.

January 20, 2022

On Thursday, 20 January 2022 at 08:36:32 UTC, Ola Fosheim Grøstad wrote:

>

On Thursday, 20 January 2022 at 08:20:58 UTC, Nicholas Wilson wrote:

>

Now you've confused me. You can select which implementation to use at runtime with e.g. CPUID or more sophisticated methods. LDC targeting DCompute can produce multiple objects with the same compiler invocation, i.e. you can get CUDA for any set of SM versions plus OpenCL-compatible SPIR-V; then, per GPU, you can inspect its hardware characteristics and select which of your kernels to run.

Yes, so why do you need compile time features?

Because compilers are not sufficiently advanced to extract all the performance that is available on their own.

A good example of where the automated/simple approach was not good enough is CUB (CUDA unbound), a high performance CUDA library found here https://github.com/NVIDIA/cub/tree/main/cub

I'd recommend taking a look at the specializations that occur in CUB in the name of performance.

D compile time features can help reduce this kind of mess, both in extreme performance libraries and extreme performance code.
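
For a flavour of what that factoring can look like, here is a minimal sketch using `static if`/`static foreach` to stamp out unroll variants from a single definition. The `Config` values, the unroll factors and the helper names are invented for illustration; only the language mechanics are the point.

```d
// Hypothetical sketch: the Config values and unroll factors are made up,
// the static if / static foreach mechanics are what matters.
enum Config { smallCache, largeCache }

// One generic implementation; the unroll factor is a compile-time parameter.
void windowed(alias op, size_t unroll)(float[] data)
{
    size_t i = 0;
    for (; i + unroll <= data.length; i += unroll)
    {
        // Fully unrolled at compile time.
        static foreach (j; 0 .. unroll)
            data[i + j] = op(data[i + j]);
    }
    for (; i < data.length; ++i)   // remainder loop
        data[i] = op(data[i]);
}

// Pick an unroll factor per configuration with static if, no runtime branching.
void windowedFor(Config cfg, alias op)(float[] data)
{
    static if (cfg == Config.largeCache)
        windowed!(op, 8)(data);
    else
        windowed!(op, 2)(data);
}

void main()
{
    auto buf = new float[100];
    buf[] = 1.0f;
    windowedFor!(Config.largeCache, (float x) => x * 2.0f)(buf);
    assert(buf[0] == 2.0f && buf[99] == 2.0f);
}
```

The same pattern scales up to selecting whole kernel variants per SM version or per register budget: the variants are generated from one definition instead of being copy-pasted and maintained separately.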

>

My understanding is that the goal of nvc++ is to compile to CPU or GPU based on what pays off more for the actual code. So it will not need any annotations (it is up to the compiler to choose between CPU and GPU?). Bryce suggested that it currently only targets one specific GPU, but that it will target multiple GPUs for the same executable in the future.

The goal for C++ parallelism is to make it fairly transparent to the programmer. Or did I misunderstand what he said?

I think that is an entirely reasonable goal, but such transparency may cost performance, and any such cost will be unacceptable to some.

>

My viewpoint is that if one is going to take a performance hit by not writing the shaders manually, one needs to get maximum convenience as a payoff.

It should be an alternative for programmers that cannot afford to put in the extra time to support GPU compute manually.

Yes. Always good to have alternatives. Fully automated is one option, hinted is a second alternative, meta-programming assisted manual is a third.

> > >

If you have to do the unrolling in D, then a lot of the advantage is lost and I might just as well write in a shader language...

D can be your compute shading language for Vulkan and, with a bit of work, whatever you'd use HLSL for; it can also be your compute kernel language, substituting for OpenCL and CUDA.

I still don't understand why you would need static if/static for-loops. Seems to me that this is too hardwired; you'd be better off with compiler unrolling hints (C++ has these) if the compiler does the wrong thing.

If you can achieve your performance objectives with automated or hinted solutions, great! But what if you can't? Most people will not have to go as hardcore as the CUB authors did to get the performance they need, but I find myself wanting more than the compiler can easily give me quite a bit of the time. I'm very happy to have the metaprogramming tools to factor/reduce these "manual" programming tasks.

> >

Same caveats apply for metal (should be pretty easy to do: need Objective-C support in LDC, need Metal bindings).

Use clang to compile the Objective-C code to object files and link with it?

January 20, 2022

On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal wrote:

>

Because compilers are not sufficiently advanced to extract all the performance that is available on their own.

Well, but D developers cannot test on all available CPU/GPU combinations either, so you don't know whether SIMD would perform better than the GPU.

Something automated has to be present, at least on install; otherwise you risk performance degradation compared to a pure SIMD implementation. And then it is better (and cheaper) to just avoid the GPU altogether.

>

A good example of where the automated/simple approach was not good enough is CUB (CUDA unbound), a high performance CUDA library found here https://github.com/NVIDIA/cub/tree/main/cub

I'd recommend taking a look at the specializations that occur in CUB in the name of performance.

I am sure you are right, but I didn't find anything special when I browsed through the repo?

>

If you can achieve your performance objectives with automated or hinted solutions, great! But what if you can't?

Well, my gut instinct is that if you want maximal performance for a specific GPU then you would be better off using Metal/Vulkan/etc directly?

But I have no experience with that as it is quite time consuming to go that route. Right now basic SIMD is time consuming enough… (but OK)

January 20, 2022

On Thursday, 20 January 2022 at 13:29:26 UTC, Ola Fosheim Grøstad wrote:

>

On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal wrote:

>

Because compilers are not sufficiently advanced to extract all the performance that is available on their own.

Well, but D developers cannot test on all available CPU/GPU combinations either, so you don't know whether SIMD would perform better than the GPU.

It can be very expensive to write and test all the permutations, yes, but often you'll understand the bottlenecks of your algorithms sufficiently to be able to correctly filter out the work up front. Restating here, these are a few of the traditional ways to look at it: Throughput or latency limited? Operand/memory or arithmetic limited? Power (watts) preferred or other performance?

It's possible, for instance, that you can know, from first principles, that you'll never meet objective X if forced to use platform Y. In general, though, you'll just have a sense of the order in which things should be evaluated.

>

Something automated has to be present, at least on install; otherwise you risk performance degradation compared to a pure SIMD implementation. And then it is better (and cheaper) to just avoid the GPU altogether.

Yes, SIMD can be the better performance choice sometimes. I think that many people will choose to do a SIMD implementation as a performance, correctness testing and portability baseline regardless of the accelerator possibilities.

> >

A good example of where the automated/simple approach was not good enough is CUB (CUDA unbound), a high performance CUDA library found here https://github.com/NVIDIA/cub/tree/main/cub

I'd recommend taking a look at the specializations that occur in CUB in the name of performance.

I am sure you are right, but I didn't find anything special when I browsed through the repo?

The key thing to note is how much effort the authors put into specialization wrt the HW x SW cross product. There are entire subdirectories devoted to specialization.

At least some of this complexity, this programming burden, can be factored out with better language support.

> >

If you can achieve your performance objectives with automated or hinted solutions, great! But what if you can't?

Well, my gut instinct is that if you want maximal performance for a specific GPU then you would be better off using Metal/Vulkan/etc directly?

That's what seems reasonable, yes, but fortunately I don't think it's correct. By analogy, you can get maximum performance from assembly level programming, if you have all the compiler back-end knowledge in your head, but if your language allows you to communicate all relevant information (mainly dependencies and operand localities but also "intrinsics") then the compiler can do at least as well as the assembly level programmer. Add language support for inline and factored specialization and the lower level alternatives become even less attractive.

>

But I have no experience with that as it is quite time consuming to go that route. Right now basic SIMD is time consuming enough… (but OK)

Indeed. I'm currently working on the SIMD variant of something I partially prototyped earlier on a 2080 and it has been slow going compared to either that GPU implementation or the scalar/serial variant.

There are some very nice assists from D for SIMD programming: the __vector typing, __vector arithmetic, unaligned vector loads/stores via static array operations, static foreach to enable portable expression of single-instruction SIMD functions like min, max, select, various shuffles, masks, ... but, yes, SIMD programming is definitely a slog compared to either scalar or SIMT GPU programming.
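
To give one concrete flavour of those assists, here is a small sketch using core.simd's float4. The vmax helper and the sample data are invented for the example; whether the static foreach collapses to a single SIMD max instruction is, of course, up to the backend.

```d
import core.simd : float4;

// Element-wise max written portably with static foreach; a good backend can
// lower this to a single SIMD max instruction where the target has one.
float4 vmax(float4 a, float4 b)
{
    float4 r;
    static foreach (i; 0 .. 4)
        r.array[i] = a.array[i] > b.array[i] ? a.array[i] : b.array[i];
    return r;
}

void main()
{
    float[8] data = [1, 5, 3, 7, 2, 6, 4, 8];

    // Unaligned loads via static array operations rather than pointer casts.
    float4 x, y;
    x.array[] = data[0 .. 4];
    y.array[] = data[4 .. 8];

    float4 sum = x + y;        // __vector arithmetic
    float4 m   = vmax(x, y);

    assert(sum.array == [3.0f, 11.0f, 7.0f, 15.0f]);
    assert(m.array   == [2.0f,  6.0f, 4.0f,  8.0f]);
}
```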

January 20, 2022

On Thursday, 20 January 2022 at 17:43:22 UTC, Bruce Carneal wrote:

>

It's possible, for instance, that you can know, from first principles, that you'll never meet objective X if forced to use platform Y. In general, though, you'll just have a sense of the order in which things should be evaluated.

This doesn't change the desire to do performance testing at install or bootup IMO. Even a "narrow" platform like Mac is quite broad at this point. PCs are even broader.

>

Yes, SIMD can be the better performance choice sometimes. I think that many people will choose to do a SIMD implementation as a performance, correctness testing and portability baseline regardless of the accelerator possibilities.

My understanding is that the presentation Bryce made suggested that you would just write "fairly normal" C++ code and let the compiler generate CPU or GPU instructions transparently, so you should not have to write SIMD code. SIMD would be the fallback option.

I think that the point of having parallel support built into the language is not to get the absolute maximum performance, but to make writing more performant code more accessible and cheaper.

If you end up having to handwrite SIMD to get decent performance, then that pretty much makes parallel support a fringe feature, e.g. it won't be of much use outside HPC with expensive equipment.

So in my mind this feature does require hardware vendors to focus on CPU/GPU integration, and it also requires a rather "intelligent" compiler and runtime setup in order to pay off the debt of the "abstraction overhead".

I don't think just translating a language AST to an existing shared backend will be sufficient. If that was sufficient Nvidia wouldn't need to invest in nvc++?

But, it remains to be seen who will pull this off, besides Nvidia.

January 20, 2022

On Thursday, 20 January 2022 at 19:57:54 UTC, Ola Fosheim Grøstad wrote:

>

On Thursday, 20 January 2022 at 17:43:22 UTC, Bruce Carneal wrote:

>

It's possible, for instance, that you can know, from first principles, that you'll never meet objective X if forced to use platform Y. In general, though, you'll just have a sense of the order in which things should be evaluated.

This doesn't change the desire to do performance testing at install or bootup IMO. Even a "narrow" platform like Mac is quite broad at this point. PCs are even broader.

Never meant to say that it did. Just pointed out that you can factor some of the work.

> >

Yes, SIMD can be the better performance choice sometimes. I think that many people will choose to do a SIMD implementation as a performance, correctness testing and portability baseline regardless of the accelerator possibilities.

My understanding is that the presentation Bryce made suggested that you would just write "fairly normal" C++ code and let the compiler generate CPU or GPU instructions transparently, so you should not have to write SIMD code. SIMD would be the fallback option.

The dream, for decades, has been that "the compiler" will just "do the right thing" when provided dead simple code, that it will achieve near-or-better-than-human-tuned levels of performance in all scenarios that matter. It is a dream worth pursuing.

>

I think that the point of having parallel support built into the language is not to get the absolute maximum performance, but to make writing more performant code more accessible and cheaper.

If accessibility comes at the cost of performance then you, as a language designer, have a choice. I think it's a false choice, but if forced to choose, mine would bias toward performance, "system language" and all that. Others, if forced to choose, would pick accessibility.

>

If you end up having to handwrite SIMD to get decent performance, then that pretty much makes parallel support a fringe feature, e.g. it won't be of much use outside HPC with expensive equipment.

I disagree but can't see how pursuing it further would be useful. We can just leave it to the market.

>

So in my mind this feature does require hardware vendors to focus on CPU/GPU integration, and it also requires a rather "intelligent" compiler and runtime setup in order to pay off the debt of the "abstraction overhead".

I put more faith in efforts that cleanly reveal low-level capabilities to the community, and that are composable, than I do in future hardware vendor efforts.

>

I don't think just translating a language AST to an existing shared backend will be sufficient. If that was sufficient Nvidia wouldn't need to invest in nvc++?

Well, at least for current dcompute users, it already is sufficient. The Julia efforts in this area also appear to be successful. Sean Baxter's "circle" offshoot of C++ is another. I imagine there are or will be other instances where relatively small manpower inputs successfully co-opt backends to provide nice access and great performance for their respective language communities.

>

But, it remains to be seen who will pull this off, besides Nvidia.

I don't think there is much that remains to be seen here. The rate and scope of adoption are still interesting questions but the "can we provide something very useful to our language community?" question has been answered in the affirmative.

People choose dcompute, circle, Julia-GPU over or in addition to CUDA/OpenCL today. Others await more progress from the C++/SYCL movement. Meaningful choice is good.

January 21, 2022

On Thursday, 20 January 2022 at 08:36:32 UTC, Ola Fosheim Grøstad wrote:

>

Yes, so why do you need compile time features?

My understanding is that the goal of nvc++ is to compile to CPU or GPU based on what pays off more for the actual code. So it will not need any annotations (it is up to the compiler to choose between CPU and GPU?). Bryce suggested that it currently only targets one specific GPU, but that it will target multiple GPUs for the same executable in the future.

There are two major advantages to compile-time features: for the host, and for the device (e.g. the GPU).

On the host side, D metaprogramming allows DCompute to do what CUDA does with its <<<>>> kernel launch syntax, in terms of type safety and convenience, with regular D code. This is the feature that makes CUDA nice to use and whose absence makes OpenCL quite horrible to use, turning any change of kernel signature into a refactoring exercise unto itself.
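
To illustrate the mechanism only (this is a hypothetical sketch, not the actual DCompute API; `saxpy` and `launch` are invented stand-ins), a variadic template can check the launch arguments against the kernel's signature at compile time, which is the convenience <<<>>> gives you and which OpenCL's untyped argument-setting calls lack:

```d
// Hypothetical sketch, not the DCompute API. A stand-in for a device kernel:
void saxpy(float a, float[] x, float[] y) {}

// A launch wrapper that type-checks its arguments against the kernel's
// signature at compile time, using nothing but regular D templates.
void launch(alias kernel, Args...)(size_t globalSize, Args args)
{
    static assert(is(typeof(kernel(args))),
        "arguments do not match the signature of " ~ __traits(identifier, kernel));
    // ... hand the kernel symbol and the checked arguments to the driver here ...
}

void main()
{
    auto x = new float[1024];
    auto y = new float[1024];
    launch!saxpy(1024, 2.0f, x, y);     // OK
    // launch!saxpy(1024, x, 2.0f, y);  // compile-time error, not a runtime one
}
```

With something along these lines, changing a kernel's signature breaks the call sites at compile time rather than silently breaking a sequence of clSetKernelArg calls.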

On the device side, I'm sure Bruce can give you some concrete examples.

>

The goal for C++ parallelism is to make it fairly transparent to the programmer. Or did I misunderstand what he said?

You want it to be transparent, not invisible.

> >

Same caveats apply for metal (should be pretty easy to do: need Objective-C support in LDC, need Metal bindings).

Use clang to compile the Objective-C code to object files and link with it?

Won't work; D needs to be able to call the Objective-C.
I mean, you could use a C or C++ shim, but that would be pretty ugly.

January 21, 2022

On Friday, 21 January 2022 at 03:23:59 UTC, Nicholas Wilson wrote:

>

There are two major advantages to compile-time features: for the host, and for the device (e.g. the GPU).

Are these resolved at compile time (before the executable is installed on the computer) or are they resolved at runtime?

I guess there might be instances where you might want to consider changing the entire data layout to fit the hardware, but then you are, to some extent, outside of what most D programmers would be willing to do.

> >

The goal for C++ parallelism is to make it fairly transparent to the programmer. Or did I misunderstand what he said?

You want it to be transparent, not invisible.

The goal is to make it look like a regular C++ library, no extra syntax.

>

Won't work; D needs to be able to call the Objective-C.
I mean, you could use a C or C++ shim, but that would be pretty ugly.

Just write the whole runtime in Objective-C++. Why would it be ugly?

January 21, 2022

On Friday, 21 January 2022 at 08:56:22 UTC, Ola Fosheim Grøstad wrote:

>

On Friday, 21 January 2022 at 03:23:59 UTC, Nicholas Wilson wrote:

>

There are two major advantages to compile-time features: for the host, and for the device (e.g. the GPU).

Are these resolved at compile time (before the executable is installed on the computer) or are they resolved at runtime?

Before. But with SPIR-V there is an additional compilation/optimisation step where it is converted into whatever format the hardware uses; you could also set specialisation constants at that point, if I ever get around to supporting those. I think the same probably happens with PTX (which is an assembly-like format) on the way to whatever the binary format is.

>

I guess there might be instances where you might want to consider changing the entire data layout to fit the hardware, but then you are, to some extent, outside of what most D programmers would be willing to do.

Indeed.

> >

You want it to be transparent, not invisible.

The goal is to make it look like a regular C++ library, no extra syntax.

There is an important difference between it looking like regular C++ (i.e. function calls, not <<<>>>) and the compiler doing auto-GPU-isation; I'm not sure which one you're referring to here. I'm all for the former, which is what DCompute does. The latter falls too far into "sufficiently advanced compiler" territory and would necessarily have to determine what to send to the GPU and when, which could seriously impact performance.

>

Just write the whole runtime in Objective-C++. Why would it be ugly?

Just. I mean, it would be doable, but I'd rather not spend my time doing that.

January 21, 2022

On Friday, 21 January 2022 at 09:45:32 UTC, Nicholas Wilson wrote:

>

Just. I mean, it would be doable, but I'd rather not spend my time doing that.

:-D This is where you need more than one person for the project…

I might do it, if I found a use case for it. I am sure some contributor other than yourself could do it if Metal support were in.
