August 17, 2013
On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
> You can't mix cpu and gpu code, they must be separate.

H'okay, let's be clear here. When you say 'mix CPU and GPU code', you mean you can't mix them physically in the compiled executable for all currently extant cases. They aren't the same. I agree with that. That said, this doesn't preclude having CUDA-like behavior where small functions could be written that don't violate the constraints of GPU code and simultaneously have semantics that could be executed on the CPU, and where such small functions are then allowed to be called from both CPU and GPU code.

> However this still has problems of the cpu having to generate CPU code from the contents of gpu{} code blocks, as the GPU is unable to allocate memory, so for example ,
>
> gpu{
>     auto resultGPU = dot(c, cGPU);
> }
>
> likely either won't work, or generates an array allocation in cpu code before the gpu block is otherwise ran.

I wouldn't be so negative with the 'won't work' bit, 'cuz frankly the 'or' you wrote there is semantically like what OpenCL and CUDA do anyway.

> Also how does that dot product function know the correct index range to run on?, are we assuming it knows based on the length of a?, while the syntax,
>
> c[] = a[] * b[];
>
> is safe for this sort of call, a function is less safe todo this with, with function calls the range needs to be told to the function, and you would call this function without the gpu{} block as the function itself is marked.
>
> auto resultGPU = dot$(0 .. returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);

I think it was mentioned earlier that there should be, much like in OpenCL or CUDA, builtins or otherwise available symbols for getting the global identifier of each work-item, the work-group size, global size, etc.

> Remember with gpu's you don't send instructions, you send whole programs, and the whole program must finish before you can move onto the next cpu instruction.

I disagree with the assumption that the CPU must wait for the GPU while the GPU is executing. Perhaps by default the behavior could be helpful for sequencing global memory in the GPU with CPU operations, but it's not a necessary behavior (see OpenCL and its, in my opinion, really nice queuing mechanism).

=== Another thing...

I'm with luminousone's suggestion for some manner of function attribute, to the tune of several metric tonnes of chimes. Wind chimes. I'm supporting this suggestion with at least a metric tonne of wind chimes.

*This* (and some small number of helpers) is what I'd want, rather than straight-up dumping a new keyword and block type into the language. I really don't think D *needs* to have this any lower level than a library based solution, because it already has the tools to make it ridiculously more convenient than C/C++ (not necessarily as much as CUDA's totally separate program nvcc does, but a huge amount).

ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
  // brings in the kernel keywords and whatnot depending on __FUNCTION__
  // (because mixins eval where they're mixed in)
  mixin KernelDefs;
  // ^ and that's just about all the syntactic noise, the rest uses mixed-in
  //   keywords and the glbmem object to define several expressions that
  //   effectively record the operations to be performed into the return type

  // assignment into global memory recovers the expression type in the glbmem.
  glbmem[glbid] += 4;

  // This assigns the *expression* glbmem[glbid] to val.
  auto val = glbmem[glbid];

  // Ignoring that this has a data race, this exemplifies recapturing the
  // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
  glbmem[glbid+1] = val;

  return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code string (for example) by passing a heavily metaprogrammed type in as BufferT. The call ends up looking like this:


auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and the promisedFutureResult is some asynchronous object that implements concurrent programming's future (or something to that extent). For convenience, let's say that it blocks on any read other than some special poll/checking mechanism.

The constraints imposed on the kernel functions are generalizable to even execute the code on the CPU, as the launching call ( Gpu.call!myFun(buffer) ) can, instead of using an expression-buffer, just pass a normal array in and have the proper result pop out given some interaction between the identifiers mixed in by KernelDefs and the launching caller (ex. using a loop).
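
To make that loop interaction a bit more concrete, here's a rough sketch of the CPU fallback. Cpu.call and addFour are names I'm inventing on the spot, and the real thing would get glbid from the KernelDefs mixin rather than as a parameter; this is just the shape of it:

void addFour(T)(T[] glbmem, size_t glbid)
{
  glbmem[glbid] += 4;
}

struct Cpu
{
  static T[] call(alias kernelFun, T)(T[] buffer)
  {
    // On the CPU the "global id" is just a loop index, so the kernel body
    // simply runs once per element.
    foreach (glbid; 0 .. buffer.length)
      kernelFun(buffer, glbid);
    return buffer;
  }
}

void main()
{
  auto data = new int[](16);
  auto result = Cpu.call!addFour(data); // every element of result is now 4
}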

With CTFE, this method *I think* can also generate the code at compile time given the proper kind of expression-type-recording-BufferT.

Again, though, this requires a significant amount of metaprogramming, heavy abuse of auto, and... did I mention a significant amount of metaprogramming? It's roughly the same method I used to embed OpenCL code in a C++ project of mine without writing a single line of OpenCL code, however, so I *know* it's doable, likely even more so, in D.
August 17, 2013
On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
> You can't mix cpu and gpu code, they must be separate.

H'okay, let's be clear here. When you say 'mix CPU and GPU code', you mean you can't mix them physically in the compiled executable for all currently extant cases. They aren't the same. I agree with that.

That said, this doesn't preclude having CUDA-like behavior where small functions could be written that don't violate the constraints of GPU code and simultaneously have semantics that could be executed on the CPU, and where such small functions are then allowed to be called from both CPU and GPU code.

> However this still has problems of the cpu having to generate CPU code from the contents of gpu{} code blocks, as the GPU is unable to allocate memory, so for example ,
>
> gpu{
>     auto resultGPU = dot(c, cGPU);
> }
>
> likely either won't work, or generates an array allocation in cpu code before the gpu block is otherwise ran.

I'm fine with an array allocation. I'd 'prolly have to do it anyway.

> Also how does that dot product function know the correct index range to run on?, are we assuming it knows based on the length of a?, while the syntax,
>
> c[] = a[] * b[];
>
> is safe for this sort of call, a function is less safe todo this with, with function calls the range needs to be told to the function, and you would call this function without the gpu{} block as the function itself is marked.
>
> auto resultGPU = dot$(0 .. returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);

'Dat's a point.

> Remember with gpu's you don't send instructions, you send whole programs, and the whole program must finish before you can move onto the next cpu instruction.

I disagree with the assumption that the CPU must wait for the GPU while the GPU is executing. Perhaps by default the behavior could be helpful for sequencing global memory in the GPU with CPU operations, but it's not a *necessary* behavior.

Well, I disagree with the assumption assuming said assumption is being made and I'm not just misreading that bit. :-P

=== Another thing...

I'm with luminousone's suggestion for some manner of function attribute, to the tune of several metric tonnes of chimes. Wind chimes. I'm supporting this suggestion with at least a metric tonne of wind chimes.

I'd prefer this (and some small number of helpers) rather than straight-up dumping a new keyword and block type into the language. I really don't think D *needs* to have this any lower level than a library based solution, because it already has the tools to make it ridiculously more convenient than C/C++ (not necessarily as much as CUDA's totally separate program nvcc does, but a huge amount).

ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
  // brings in the kernel keywords and whatnot depending on __FUNCTION__
  // (because mixins eval where they're mixed in)
  mixin KernelDefs;
  // ^ and that's just about all the syntactic noise, the rest uses mixed-in
  //   keywords and the glbmem object to define several expressions that
  //   effectively record the operations to be performed into the return type

  // assignment into global memory recovers the expression type in the glbmem.
  glbmem[glbid] += 4;

  // This assigns the *expression* glbmem[glbid] to val.
  auto val = glbmem[glbid];

  // Ignoring that this has a data race, this exemplifies recapturing the
  // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
  glbmem[glbid+1] = val;

  return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code string (for example) by passing a heavily metaprogrammed type in as BufferT. The call ends up looking like this:


auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and the promisedFutureResult is some asynchronous object that implements concurrent programming's future (or something to that extent). For convenience, let's say that it blocks on any read other than some special poll/checking mechanism.
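
If it helps pin down the behavior I mean, std.parallelism's Task already has that block-on-first-read shape on the host side. enqueueMyFun below is a made-up stand-in for whatever would actually submit the compiled kernel and wait on the device, not a real API:

import std.parallelism : task;

// Made-up stand-in for submitting the kernel and waiting on the device.
int[] enqueueMyFun(int[] buffer)
{
  foreach (ref x; buffer) x += 4; // pretend the device did this
  return buffer;
}

void main()
{
  auto buffer = new int[](16);

  auto pending = task!enqueueMyFun(buffer);
  pending.executeInNewThread(); // launching doesn't block the CPU

  // ...unrelated CPU work can happen here...

  auto result = pending.yieldForce(); // the first real read blocks until it's done
}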

The constraints imposed on the kernel functions are generalizable to even execute the code on the CPU, as the launching call ( Gpu.call!myFun(buffer) ) can, instead of using an expression-buffer, just pass a normal array in and have the proper result pop out given some interaction between the identifiers mixed in by KernelDefs and the launching caller (ex. using a loop).

As an alternative to returning the captured expressions, the argument glbmem could have been passed ref, and the same sort of expression capturing could occur. Heck, more arguments could've been passed, too; this doesn't require there to be one single argument representing global memory.

With CTFE, this method *I think* can also generate the code at compile time given the proper kind of expression-type-recording-BufferT.
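
To make 'expression-type-recording-BufferT' a little less hand-wavy, here's a crude sketch of the general trick. Every name is invented, and a real version would record proper expression trees rather than raw strings, but the operator overloads are where the recording happens:

struct RecordingBuffer
{
  string[] code; // accumulated OpenCL-ish statements

  // glbmem[idx] += 4;  records "glbmem[idx] += 4;"
  void opIndexOpAssign(string op)(int rhs, string idx)
  {
    import std.conv : to;
    code ~= "glbmem[" ~ idx ~ "] " ~ op ~ "= " ~ rhs.to!string ~ ";";
  }

  // glbmem[idx] = <captured expression>;  records the assignment
  void opIndexAssign(string rhsExpr, string idx)
  {
    code ~= "glbmem[" ~ idx ~ "] = " ~ rhsExpr ~ ";";
  }

  // auto val = glbmem[idx];  just captures the expression's text
  string opIndex(string idx)
  {
    return "glbmem[" ~ idx ~ "]";
  }
}

unittest
{
  RecordingBuffer glbmem;
  auto glbid = "get_global_id(0)"; // stand-in for the mixed-in identifier
  glbmem[glbid] += 4;
  auto val = glbmem[glbid];
  glbmem[glbid ~ "+1"] = val;
  // glbmem.code now holds three statements ready to be spliced into a kernel string.
}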

Again, though, all this requires a significant amount of metaprogramming, heavy abuse of auto, and... did I mention a significant amount of metaprogramming? It's roughly the same method I used to embed OpenCL code in a C++ project of mine without writing a single line of OpenCL code, however, so I *know* it's doable, likely even more so, in D.
August 17, 2013
On Saturday, 17 August 2013 at 06:09:53 UTC, Atash wrote:
> On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
>> You can't mix cpu and gpu code, they must be separate.
>
> H'okay, let's be clear here. When you say 'mix CPU and GPU code', you mean you can't mix them physically in the compiled executable for all currently extant cases. They aren't the same. I agree with that.
>
> That said, this doesn't preclude having CUDA-like behavior where small functions could be written that don't violate the constraints of GPU code and simultaneously has semantics that could be executed on the CPU, and where such small functions are then allowed to be called from both CPU and GPU code.
>
>> However this still has problems of the cpu having to generate CPU code from the contents of gpu{} code blocks, as the GPU is unable to allocate memory, so for example ,
>>
>> gpu{
>>    auto resultGPU = dot(c, cGPU);
>> }
>>
>> likely either won't work, or generates an array allocation in cpu code before the gpu block is otherwise ran.
>
> I'm fine with an array allocation. I'd 'prolly have to do it anyway.
>
>> Also how does that dot product function know the correct index range to run on?, are we assuming it knows based on the length of a?, while the syntax,
>>
>> c[] = a[] * b[];
>>
>> is safe for this sort of call, a function is less safe todo this with, with function calls the range needs to be told to the function, and you would call this function without the gpu{} block as the function itself is marked.
>>
>> auto resultGPU = dot$(0 .. returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);
>
> 'Dat's a point.
>
>> Remember with gpu's you don't send instructions, you send whole programs, and the whole program must finish before you can move onto the next cpu instruction.
>
> I disagree with the assumption that the CPU must wait for the GPU while the GPU is executing. Perhaps by default the behavior could be helpful for sequencing global memory in the GPU with CPU operations, but it's not a *necessary* behavior.
>
> Well, I disagree with the assumption assuming said assumption is being made and I'm not just misreading that bit. :-P
>
> === Another thing...
>
> I'm with luminousone's suggestion for some manner of function attribute, to the tune of several metric tonnes of chimes. Wind chimes. I'm supporting this suggestion with at least a metric tonne of wind chimes.
>
> I'd prefer this (and some small number of helpers) rather than straight-up dumping a new keyword and block type into the language. I really don't think D *needs* to have this any lower level than a library based solution, because it already has the tools to make it ridiculously more convenient than C/C++ (not necessarily as much as CUDA's totally separate program nvcc does, but a huge amount).
>
> ex.
>
>
> @kernel auto myFun(BufferT)(BufferT glbmem)
> {
>   // brings in the kernel keywords and whatnot depending __FUNCTION__
>   // (because mixins eval where they're mixed in)
>   mixin KernelDefs;
>   // ^ and that's just about all the syntactic noise, the rest uses mixed-in
>   //   keywords and the glbmem object to define several expressions that
>   //   effectively record the operations to be performed into the return type
>
>   // assignment into global memory recovers the expression type in the glbmem.
>   glbmem[glbid] += 4;
>
>   // This assigns the *expression* glbmem[glbid] to val.
>   auto val = glbmem[glbid];
>
>   // Ignoring that this has a data race, this exemplifies recapturing the
>   // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
>   glbmem[glbid+1] = val;
>
>   return glbmem; ///< I lied about the syntactic noise. This is the last bit.
> }
>
>
> Now if you want to, you can at runtime create an OpenCL-code string (for example) by passing a heavily metaprogrammed type in as BufferT. The call ends up looking like this:
>
>
> auto promisedFutureResult = Gpu.call!myFun(buffer);
>
>
> The kernel compilation (assuming OpenCL) is memoized, and the promisedFutureResult is some asynchronous object that implements concurrent programming's future (or something to that extent). For convenience, let's say that it blocks on any read other than some special poll/checking mechanism.
>
> The constraints imposed on the kernel functions is generalizable to even execute the code on the CPU, as the launching call ( Gpu.call!myFun(buffer) ) can, instead of using an expression-buffer, just pass a normal array in and have the proper result pop out given some interaction between the identifiers mixed in by KernelDefs and the launching caller (ex. using a loop).
>
> Alternatively to returning the captured expressions, the argument glbmem could have been passed ref, and the same sort of expression capturing could occur. Heck, more arguments could've been passed, too, this doesn't require there to be one single argument representing global memory.
>
> With CTFE, this method *I think* can also generate the code at compile time given the proper kind of expression-type-recording-BufferT.
>
> Again, though, all this requires a significant amount of metaprogramming, heavy abuse of auto, and... did I mention a significant amount of metaprogramming? It's roughly the same method I used to embed OpenCL code in a C++ project of mine without writing a single line of OpenCL code, however, so I *know* it's doable, likely even moreso, in D.

When programmers are first introduced to gpu programming, they often imagine gpu instructions as being part of the instruction stream the cpu receives, completely missing the point of what makes the entire scheme so useful.

The gpu might better be imagined as a wholly separate computer that happens to be networked via the system bus. Every interaction between the cpu and the gpu has to travel across this comparatively expensive, high-latency divide, so the goal is a design that makes it easy to avoid interaction between the two separate entities as much as possible, while still getting the maximum control and performance from them.

OpenCL may have picked the term __kernel based on the idea that the gpu program in fact represents the device's operating system for the duration of that function call.

Single-statement code operations on the GPU, in this vein, represent a horridly bad idea, so ...

gpu{
   c[] = a[] * b[];
}

seems like very bad design to me.

In fact, being able to have random gpu {} code blocks seems like a bad idea in this vein. Each line in such a block would very likely end up being a separate gpu __kernel function, creating excessive amounts of cpu/gpu interaction, as each line may have different ranges.

The foreach loop type actually fits the model of microthreading very nicely. It has a clearly defined range, there is no dependency on the order in which the code operates on the arrays used in the loop, you have an implicit index that is unique for each value in the range, and you can't change the size of the range mid-execution (at least I haven't seen anyone do it so far).
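
As a plain D illustration of the shape I mean (ordinary host code, nothing gpu specific), every iteration below only ever touches its own index:

float[] a = [1.0f, 2.0f, 3.0f, 4.0f];
float[] b = [5.0f, 6.0f, 7.0f, 8.0f];
auto c = new float[](a.length);

// The range (0 .. c.length) is fixed up front, i is the implicit unique key,
// and no iteration depends on any other, which is exactly the shape that maps
// onto one gpu work-item per index.
foreach (i; 0 .. c.length)
  c[i] = a[i] * b[i];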

August 17, 2013
On 8/13/2013 12:27 PM, Russel Winder wrote:
> The era of GPGPUs for Bitcoin mining are now over, they moved to ASICs. The new market for GPGPUs is likely the banks, and other "Big Data" folk. True many of the banks are already doing some GPGPU usage, but it is not big as yet. But it is coming.
> 
> Most of the banks are either reinforcing their JVM commitment, via Scala, or are re-architecting to C++ and Python. True there is some C#/F# but it is all for terminals not for strategic computing, and it is diminishing (despite what you might hear from .NET oriented training companies).
> 
> Currently GPGPU tooling means C.

There is some interesting work in that regard in .NET:

http://research.microsoft.com/en-us/projects/Accelerator/

Obviously, it uses DirectX, but what's nice about it is that it's normal C# that doesn't look like it was written by aliens.
August 17, 2013
On Friday, 16 August 2013 at 19:53:44 UTC, Atash wrote:
> I'm iffy on the assumption that the future holds unified memory for heterogeneous devices. Even relatively recent products such as the Intel Xeon Phi have totally separate memory. I'm not aware of any general-computation-oriented products that don't have separate memory.
>

Many laptop and mobile devices (they have it in the hardware, but the APIs don't permit making use of it most of the time), and the next-gen consoles (PS4, Xbox One).

nVidia is pushing against it, AMD/ATI is pushing for it, and they are right on that one.

> I'm also of the opinion that as long as people want to have devices that can scale in size, there will be modular devices. Because they're modular, there's some sort of a spacing between them and the machine, ex. PCIe (and, somewhat importantly, a physical distance between the added device and the CPU-stuff). Because of that, they're likely to have their own memory. Therefore, I'm personally not willing to bank on anything short of targeting the least common denominator here (non-uniform access memory) specifically because it looks like a necessity for scaling a physical collection of heterogeneous devices up in size, which in turn I *think* is a necessity for people trying to deal with growing data sets in the real world.
>

You can have 2 sockets on the motherboard.

> Annnnnnnnnnnndddddd because heterogeneous compute devices aren't *just* GPUs (ex. Intel Xeon Phi), I'd strongly suggest picking a more general name, like 'accelerators' or 'apu' (except AMD totally ran away with that acronym in marketing and I sort of hate them for it) or '<something-I-can't-think-of-because-words-are-hard>'.
>
> That said, I'm no expert, so go ahead and rip 'mah opinions apart. :-D

Unified memory has too many benefits for it to be ignored. The open questions are about cache coherency, direct communication between chips, identical performance throughout the address space, and so on. But unified memory will remain. Even when memory is physically separated, you'll see a unified memory model emerge, with disparate performance depending on the address space.
August 17, 2013
On Saturday, 17 August 2013 at 15:37:58 UTC, deadalnix wrote:
> Unified memory have too much benefice for it to be ignored. The open questions are about cache coherency, direct communication between chips, identical performances through the address space, and so on. But the unified memory will remains. Even when memory is physically separated, you'll see an unified memory model emerge, with disparate performances depending on address space.

I'm not saying 'ignore it', I'm saying that it's not the least common denominator among popular devices, and that in all likelihood it won't be the least common denominator among compute devices ever. AMD/ATi being 'right' doesn't mean that they'll dominate the whole market. Having two slots on your mobo is more limiting than having the ability to just chuck more computers in a line hidden behind some thin wrapper around some code built to deal with non-uniform memory access.

Additionally, in another post, I tried to demonstrate a way for it to target the least common denominator, and in my (obviously biased) opinion, it didn't look half bad.

Unlike uniform memory access, non-uniform memory access will *always* be a thing. Uniform memory access is cool n'all, but it isn't popular enough to be here now, and it isn't like non-uniform memory access which has a long history of being here and looks like it has a permanent stay in computing.

Pragmatism dictates to me here that any tool we want to be 'awesome', eliciting 'wowzers' from all the folk of the land, should target the widest variety of devices while still being pleasant to work with. *That* tells me that it is paramount to *not* brush off non-uniform access, and that because non-uniform access is the least common denominator, that should be what is targeted.

On the other hand, if we want to start up some sort of thing where one lib handles the paradigm of uniform memory access in as convenient a way as possible, and another lib handles non-uniform memory access, that's fine too. Except that the first lib really would just be a specialization of the second alongside some more 'convenience'-functions.
August 17, 2013
On Saturday, 17 August 2013 at 20:17:17 UTC, Atash wrote:
> On Saturday, 17 August 2013 at 15:37:58 UTC, deadalnix wrote:
>> Unified memory have too much benefice for it to be ignored. The open questions are about cache coherency, direct communication between chips, identical performances through the address space, and so on. But the unified memory will remains. Even when memory is physically separated, you'll see an unified memory model emerge, with disparate performances depending on address space.
>
> I'm not saying 'ignore it', I'm saying that it's not the least common denominator among popular devices, and that in all likelihood it won't be the least common denominator among compute devices ever. AMD/ATi being 'right' doesn't mean that they'll dominate the whole market. Having two slots on your mobo is more limiting than having the ability to just chuck more computers in a line hidden behind some thin wrapper around some code built to deal with non-uniform memory access.
>
> Additionally, in another post, I tried to demonstrate a way for it to target the least common denominator, and in my (obviously biased) opinion, it didn't look half bad.
>
> Unlike uniform memory access, non-uniform memory access will *always* be a thing. Uniform memory access is cool n'all, but it isn't popular enough to be here now, and it isn't like non-uniform memory access which has a long history of being here and looks like it has a permanent stay in computing.
>
> Pragmatism dictates to me here that any tool we want to be 'awesome', eliciting 'wowzers' from all the folk of the land, should target the widest variety of devices while still being pleasant to work with. *That* tells me that it is paramount to *not* brush off non-uniform access, and that because non-uniform access is the least common denominator, that should be what is targeted.
>
> On the other hand, if we want to start up some sort of thing where one lib handles the paradigm of uniform memory access in as convenient a way as possible, and another lib handles non-uniform memory access, that's fine too. Except that the first lib really would just be a specialization of the second alongside some more 'convenience'-functions.

There are two major things: unified memory, and unified virtual address space.

Unified virtual addressing will be universal within a few years, and the user-space application will no longer manage the cpu/gpu copies; this will instead be handled by the gpu system library, the hMMU, and the operating system.

Even non-uniform memory will be uniform from the perspective of the application writer.
August 18, 2013
We basically have to follow these rules,

1. The range must be none prior to execution of a gpu code block
2. The range cannot be changed during execution of a gpu code block
3. Code blocks can only receive a single range; it can, however, be multidimensional
4. index keys used in a code block are immutable
5. Code blocks can only use a single key (the gpu executes many instances in parallel, each with their own unique key)
6. indices are always an unsigned integer type
7. OpenCL and CUDA have no access to global state
8. gpu code blocks cannot allocate memory
9. gpu code blocks cannot call cpu functions
10. atomics, though available on the gpu, are many times slower than on the cpu
11. separate running instances of the same code block on the gpu cannot have any interdependency on each other.

Now if we are talking about HSA, or another similar setup, then a few of those rules don't apply or become fuzzy.

HSA does have limited access to global state, it can call cpu functions that are pure, and of course, because in HSA the cpu and gpu share the same virtual address space, most of memory is open for access.

HSA also manages memory via the hMMU, and there is no need for gpu memory management functions, as that is managed by the operating system and video card drivers.

Basically, D would either need to opt out of legacy APIs such as OpenCL, CUDA, etc. (these are mostly tied to C/C++ anyway, and generally have ugly-as-sin syntax), or D would have to go the route of a full and safe gpu subset of features.

I don't think such a setup can be implemented as simply a library, as the GPU needs compiled source.

If D were to implement GPGPU features, I would actually suggest starting by simply adding a microthreading function syntax, for example...

void example( aggregate in float a[] ; key , in float b[], out float c[]) {
	c[key] = a[key] + b[key];
}

By adding an aggregate keyword to the function, we can infer the range simply from the length of a[] without adding an extra set of brackets or something similar.

This would make access to the gpu more generic and, more importantly, because LLVM will support HSA, it removes the need for writing the more complex support into dmd that OpenCL and CUDA would require; a few hints for the LLVM backend would be enough to generate the dual-bytecode ELF executables.
August 18, 2013
Unified virtual address-space I can accept, fine. Ignoring that it is, in fact, a totally different address-space where memory latency is *entirely different*, though, is something I'm far *far* more iffy about.

> We basically have to follow these rules,
>
> 1. The range must be none prior to execution of a gpu code block
> 2. The range can not be changed during execution of a gpu code block
> 3. Code blocks can only receive a single range, it can however be multidimensional
> 4. index keys used in a code block are immutable
> 5. Code blocks can only use a single key(the gpu executes many instances in parallel each with their own unique key)
> 6. index's are always an unsigned integer type
> 7. openCL,CUDA have no access to global state
> 8. gpu code blocks can not allocate memory
> 9. gpu code blocks can not call cpu functions
> 10. atomics tho available on the gpu are many times slower then on the cpu
> 11. separate running instances of the same code block on the gpu can not have any interdependency on each other.

Please explain point 1 (specifically the use of the word 'none'), and why you added in point 3?

Additionally, point 11 doesn't make any sense to me. There is research out there showing how to use cooperative warp-scans, for example, to have multiple work-items cooperate over some local block of memory and perform sorting in blocks. There are even tutorials out there for OpenCL and CUDA that show how to do this, specifically to create better performing code. This statement is in direct contradiction with what exists.

> Now if we are talking about HSA, or other similar setup, then a few of those rules don't apply or become fuzzy.
>
> HSA, does have limited access to global state, HSA can call cpu functions that are pure, and of course because in HSA the cpu and gpu share the same virtual address space most of memory is open for access.
>
> HSA also manages memory, via the hMMU, and their is no need for gpu memory management functions, as that is managed by the operating system and video card drivers.

Good for HSA. Now why are we latching onto this particular construction that, as far as I can tell, is missing the support of at least two highly relevant giants (Intel and NVidia)?

> Basically, D would either need to opt out of legacy api's such as openCL, CUDA, etc, these are mostly tied to c/c++ anyway, and generally have ugly as sin syntax; or D would have go the route of a full and safe gpu subset of features.

Wrappers do a lot to change the appearance of a program. Raw OpenCL may look ugly, but so do BLAS and LAPACK routines. The use of wrappers and expression templates does a lot to clean up code (ex. look at the way Eigen 3 or any other linear algebra library does expression templates in C++; something D can do even better).

> I don't think such a setup can be implemented as simply a library, as the GPU needs compiled source.

This doesn't make sense. Your claim is contingent on opting out of OpenCL or any other mechanism that provides for the application to carry abstract instructions which are then compiled on the fly. If you're okay with creating kernel code on the fly, this can be implemented as a library, beyond any reasonable doubt.

> If D where to implement gpgpu features, I would actually suggest starting by simply adding a microthreading function syntax, for example...
>
> void example( aggregate in float a[] ; key , in float b[], out float c[]) {
> 	c[key] = a[key] + b[key];
> }
>
> By adding an aggregate keyword to the function, we can assume the range simply using the length of a[] without adding an extra set of brackets or something similar.
>
> This would make access to the gpu more generic, and more importantly, because llvm will support HSA, removes the needs for writing more complex support into dmd as openCL and CUDA would require, a few hints for the llvm backend would be enough to generate the dual bytecode ELF executables.

1) If you wanted to have that 'key' nonsense in there, I'm thinking you'd need to add several additional parameters: global size, group size, group count, and maybe group-local memory access (requires allowing multiple aggregates?). I mean, I get the gist of what you're saying; this isn't me pointing out a problem, just trying to get a clarification on it (maybe give 'key' some additional structure, or something; I've sketched one possibility at the end of this post).

2) ... I kind of like this idea. I disagree with how you led up to it, but I like the idea.

3) How do you envision *calling* microthreaded code? Just the usual syntax?

4) How would this handle working on subranges?

ex. Let's say I'm coding up a radix sort using something like this:

https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0

What's the high-level program organization with this syntax if we can only use one range at a time? How many work-items get fired off? What's the gpu-code launch procedure?
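
(Re: point 1 above, this is roughly the kind of additional structure I'd imagine hanging off 'key'. The names are invented here and just mirror OpenCL's get_global_id/get_local_id family:)

struct Key
{
  size_t globalId;   // unique index across the entire launch
  size_t localId;    // index within the work-group
  size_t groupId;    // which work-group this work-item belongs to
  size_t globalSize; // total number of work-items in the launch
  size_t localSize;  // work-items per group
}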
August 18, 2013
On Sunday, 18 August 2013 at 01:43:33 UTC, Atash wrote:
> Unified virtual address-space I can accept, fine. Ignoring that it is, in fact, in a totally different address-space where memory latency is *entirely different*, I'm far *far* more iffy about.
>
>> We basically have to follow these rules,
>>
>> 1. The range must be none prior to execution of a gpu code block
>> 2. The range can not be changed during execution of a gpu code block
>> 3. Code blocks can only receive a single range, it can however be multidimensional
>> 4. index keys used in a code block are immutable
>> 5. Code blocks can only use a single key(the gpu executes many instances in parallel each with their own unique key)
>> 6. index's are always an unsigned integer type
>> 7. openCL,CUDA have no access to global state
>> 8. gpu code blocks can not allocate memory
>> 9. gpu code blocks can not call cpu functions
>> 10. atomics tho available on the gpu are many times slower then on the cpu
>> 11. separate running instances of the same code block on the gpu can not have any interdependency on each other.
>
> Please explain point 1 (specifically the use of the word 'none'), and why you added in point 3?
>
> Additionally, point 11 doesn't make any sense to me. There is research out there showing how to use cooperative warp-scans, for example, to have multiple work-items cooperate over some local block of memory and perform sorting in blocks. There are even tutorials out there for OpenCL and CUDA that shows how to do this, specifically to create better performing code. This statement is in direct contradiction with what exists.
>
>> Now if we are talking about HSA, or other similar setup, then a few of those rules don't apply or become fuzzy.
>>
>> HSA, does have limited access to global state, HSA can call cpu functions that are pure, and of course because in HSA the cpu and gpu share the same virtual address space most of memory is open for access.
>>
>> HSA also manages memory, via the hMMU, and their is no need for gpu memory management functions, as that is managed by the operating system and video card drivers.
>
> Good for HSA. Now why are we latching onto this particular construction that, as far as I can tell, is missing the support of at least two highly relevant giants (Intel and NVidia)?
>
>> Basically, D would either need to opt out of legacy api's such as openCL, CUDA, etc, these are mostly tied to c/c++ anyway, and generally have ugly as sin syntax; or D would have go the route of a full and safe gpu subset of features.
>
> Wrappers do a lot to change the appearance of a program. Raw OpenCL may look ugly, but so do BLAS and LAPACK routines. The use of wrappers and expression templates does a lot to clean up code (ex. look at the way Eigen 3 or any other linear algebra library does expression templates in C++; something D can do even better).
>
>> I don't think such a setup can be implemented as simply a library, as the GPU needs compiled source.
>
> This doesn't make sense. Your claim is contingent on opting out of OpenCL or any other mechanism that provides for the application to carry abstract instructions which are then compiled on the fly. If you're okay with creating kernel code on the fly, this can be implemented as a library, beyond any reasonable doubt.
>
>> If D where to implement gpgpu features, I would actually suggest starting by simply adding a microthreading function syntax, for example...
>>
>> void example( aggregate in float a[] ; key , in float b[], out float c[]) {
>> 	c[key] = a[key] + b[key];
>> }
>>
>> By adding an aggregate keyword to the function, we can assume the range simply using the length of a[] without adding an extra set of brackets or something similar.
>>
>> This would make access to the gpu more generic, and more importantly, because llvm will support HSA, removes the needs for writing more complex support into dmd as openCL and CUDA would require, a few hints for the llvm backend would be enough to generate the dual bytecode ELF executables.
>
> 1) If you wanted to have that 'key' nonsense in there, I'm thinking you'd need to add several additional parameters: global size, group size, group count, and maybe group-local memory access (requires allowing multiple aggregates?). I mean, I get the gist of what you're saying, this isn't me pointing out a problem, just trying to get a clarification on it (maybe give 'key' some additional structure, or something).
>
> 2) ... I kind of like this idea. I disagree with how you led up to it, but I like the idea.
>
> 3) How do you envision *calling* microthreaded code? Just the usual syntax?
>
> 4) How would this handle working on subranges?
>
> ex. Let's say I'm coding up a radix sort using something like this:
>
> https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0
>
> What's the high-level program organization with this syntax if we can only use one range at a time? How many work-items get fired off? What's the gpu-code launch procedure?

Sorry, typo: I meant 'known'.