August 18, 2013 Re: GPGPUs
Posted in reply to Atash

On Sunday, 18 August 2013 at 01:43:33 UTC, Atash wrote:
> Unified virtual address-space I can accept, fine. Ignoring that it is, in fact, in a totally different address-space where memory latency is *entirely different*, I'm far *far* more iffy about.
>
>> We basically have to follow these rules,
>>
>> 1. The range must be none prior to execution of a gpu code block
>> 2. The range can not be changed during execution of a gpu code block
>> 3. Code blocks can only receive a single range, it can however be multidimensional
>> 4. index keys used in a code block are immutable
>> 5. Code blocks can only use a single key (the gpu executes many instances in parallel, each with their own unique key)
>> 6. index's are always an unsigned integer type
>> 7. openCL, CUDA have no access to global state
>> 8. gpu code blocks can not allocate memory
>> 9. gpu code blocks can not call cpu functions
>> 10. atomics, though available on the gpu, are many times slower than on the cpu
>> 11. separate running instances of the same code block on the gpu can not have any interdependency on each other.
>
> Please explain point 1 (specifically the use of the word 'none'), and why you added in point 3?
>
> Additionally, point 11 doesn't make any sense to me. There is research out there showing how to use cooperative warp-scans, for example, to have multiple work-items cooperate over some local block of memory and perform sorting in blocks. There are even tutorials out there for OpenCL and CUDA that show how to do this, specifically to create better performing code. This statement is in direct contradiction with what exists.

You do have limited atomics, but you don't really have any sort of complex messages, or anything like that.

>> Now if we are talking about HSA, or other similar setup, then a few of those rules don't apply or become fuzzy.
>>
>> HSA does have limited access to global state, HSA can call cpu functions that are pure, and of course because in HSA the cpu and gpu share the same virtual address space most of memory is open for access.
>>
>> HSA also manages memory via the hMMU, and there is no need for gpu memory management functions, as that is managed by the operating system and video card drivers.
>
> Good for HSA. Now why are we latching onto this particular construction that, as far as I can tell, is missing the support of at least two highly relevant giants (Intel and NVidia)?

Intel doesn't have a dog in this race, so there is no way to know what they plan on doing, if anything at all.

The reason to point out HSA is that it is really easy to add support for; it is not a giant task like OpenCL would be. A few changes to the front-end compiler are all that is needed, LLVM's backend does the rest.

>> Basically, D would either need to opt out of legacy APIs such as OpenCL, CUDA, etc. (these are mostly tied to C/C++ anyway, and generally have ugly-as-sin syntax), or D would have to go the route of a full and safe gpu subset of features.
>
> Wrappers do a lot to change the appearance of a program. Raw OpenCL may look ugly, but so do BLAS and LAPACK routines. The use of wrappers and expression templates does a lot to clean up code (ex. look at the way Eigen 3 or any other linear algebra library does expression templates in C++; something D can do even better).
>
>> I don't think such a setup can be implemented as simply a library, as the GPU needs compiled source.
>
> This doesn't make sense. Your claim is contingent on opting out of OpenCL or any other mechanism that provides for the application to carry abstract instructions which are then compiled on the fly. If you're okay with creating kernel code on the fly, this can be implemented as a library, beyond any reasonable doubt.

OpenCL isn't just a library; it is a language extension that is run through a preprocessor which compiles the embedded __KERNEL and __DEVICE functions into usable code, and then outputs .c/.cpp files for the C compiler to deal with.

>> If D were to implement gpgpu features, I would actually suggest starting by simply adding a microthreading function syntax, for example...
>>
>> void example( aggregate in float a[] ; key , in float b[], out float c[]) {
>>     c[key] = a[key] + b[key];
>> }
>>
>> By adding an aggregate keyword to the function, we can assume the range simply using the length of a[] without adding an extra set of brackets or something similar.
>>
>> This would make access to the gpu more generic, and more importantly, because llvm will support HSA, removes the need for writing more complex support into dmd as openCL and CUDA would require; a few hints for the llvm backend would be enough to generate the dual bytecode ELF executables.
>
> 1) If you wanted to have that 'key' nonsense in there, I'm thinking you'd need to add several additional parameters: global size, group size, group count, and maybe group-local memory access (requires allowing multiple aggregates?). I mean, I get the gist of what you're saying, this isn't me pointing out a problem, just trying to get a clarification on it (maybe give 'key' some additional structure, or something).

Those are all platform specific; they change based on the whim and fancy of NVIDIA and AMD with each and every new chip released: the size and configuration of CUDA clusters, or compute clusters, or EUs, or whatever the hell x chip maker feels like using at the moment.

Long term this will all be managed by the underlying support software in the video drivers and operating system kernel. Putting any effort into this is a waste of time.

> 2) ... I kind of like this idea. I disagree with how you led up to it, but I like the idea.
>
> 3) How do you envision *calling* microthreaded code? Just the usual syntax?

void example( aggregate in float a[] ; key , in float b[], out float c[]) {
    c[key] = a[key] + b[key];
}

example(a,b,c);

In the function declaration you can think of the aggregate basically having the reverse order of the items in a foreach statement.

int a[100] = [ ... ];
int b[100];
foreach( v, k ; a ) { b = a[k]; }

int a[100] = [ ... ];
int b[100];

void example2( aggregate in float A[] ; k, out float B[] ) { B[k] = A[k]; }

example2(a,b);

> 4) How would this handle working on subranges?
>
> ex. Let's say I'm coding up a radix sort using something like this:
>
> https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0
>
> What's the high-level program organization with this syntax if we can only use one range at a time? How many work-items get fired off? What's the gpu-code launch procedure?

I am pretty sure they are simply multiplying the index value by the unit size they desire to work on:

int a[100] = [ ... ];
int b[100];
void example3( aggregate in range r ; k, in float a[], float b[]){
    b[k] = a[k];
    b[k+1] = a[k+1];
}

example3( 0 .. 50 , a,b);

Then likely they are simply executing multiple __KERNEL functions in sequence, would be my guess.
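As a rough CPU-side sketch of what the proposed kernel above amounts to - written with today's std.parallelism, with the name exampleCpu chosen purely for illustration - the semantics are just "one independent body execution per key":

import std.parallelism : parallel;
import std.range : iota;

void exampleCpu(const(float)[] a, const(float)[] b, float[] c)
{
    // one independent work-item per element; no cross-item dependencies (rule 11)
    foreach (key; parallel(iota(a.length)))
        c[key] = a[key] + b[key];
}

exampleCpu(a, b, c) then does on CPU threads what example(a,b,c) would be expected to do on the gpu.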
August 18, 2013 Re: GPGPUs
Posted in reply to luminousone

On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
> You do have limited atomics, but you don't really have any sort of complex messages, or anything like that.

I said 'point 11', not 'point 10'. You also dodged points 1 and 3...

> Intel doesn't have a dog in this race, so there is no way to know what they plan on doing, if anything at all.

http://software.intel.com/en-us/vcsource/tools/opencl-sdk
http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

Just based on those, I'm pretty certain they 'have a dog in this race'. The dog happens to be running with MPI and OpenCL across a bridge made of PCIe.

> The reason to point out HSA is that it is really easy to add support for; it is not a giant task like OpenCL would be. A few changes to the front-end compiler are all that is needed, LLVM's backend does the rest.

H'okay. I can accept that.

> OpenCL isn't just a library; it is a language extension that is run through a preprocessor which compiles the embedded __KERNEL and __DEVICE functions into usable code, and then outputs .c/.cpp files for the C compiler to deal with.

But all those extra bits are part of the computing *environment*. Is there something wrong with requiring the proper environment for an executable?

A more objective question: which devices are you trying to target here?

> Those are all platform specific; they change based on the whim and fancy of NVIDIA and AMD with each and every new chip released: the size and configuration of CUDA clusters, or compute clusters, or EUs, or whatever the hell x chip maker feels like using at the moment.
>
> Long term this will all be managed by the underlying support software in the video drivers and operating system kernel. Putting any effort into this is a waste of time.

Yes. And the only way to optimize around them is to *know them*, otherwise you're pinning the developer down the same way OpenMP does. Actually, even worse than the way OpenMP does - at least OpenMP lets you set some hints about how many threads you want.

> void example( aggregate in float a[] ; key , in float b[], out float c[]) {
>     c[key] = a[key] + b[key];
> }
>
> example(a,b,c);
>
> In the function declaration you can think of the aggregate basically having the reverse order of the items in a foreach statement.
>
> int a[100] = [ ... ];
> int b[100];
> foreach( v, k ; a ) { b = a[k]; }
>
> int a[100] = [ ... ];
> int b[100];
>
> void example2( aggregate in float A[] ; k, out float B[] ) { B[k] = A[k]; }
>
> example2(a,b);

Contextually solid. Read my response to the next bit.

> I am pretty sure they are simply multiplying the index value by the unit size they desire to work on:
>
> int a[100] = [ ... ];
> int b[100];
> void example3( aggregate in range r ; k, in float a[], float b[]){
>     b[k] = a[k];
>     b[k+1] = a[k+1];
> }
>
> example3( 0 .. 50 , a,b);
>
> Then likely they are simply executing multiple __KERNEL functions in sequence, would be my guess.

I've implemented this algorithm before in OpenCL already, and what you're saying so far doesn't rhyme with what's needed.

There are at least two ranges, one keeping track of partial summations, the other holding the partial sorts. Three separate kernels are run in cycles to reduce over and scatter the data. The way early exit is implemented isn't mentioned as part of the implementation details, but my implementation of the strategy requires a third range to act as a flag array to be reduced over and read in between kernel invocations.

It isn't just unit size multiplication - there's communication between work-items and *exquisitely* arranged local-group reductions and scans (so-called 'warpscans') that take advantage of the widely accepted concept of a local group of work-items (a parameter you explicitly disregarded) and their shared memory pool. The entire point of the paper is that it's possible to come up with a general algorithm that can be parameterized to fit individual GPU configurations if desired. This kind of algorithm provides opportunities for tuning... which seem to be lost, unnecessarily, in what I've read so far in your descriptions.

My point being, I don't like where this is going by treating coprocessors, which have so far been very *very* different from one another, as the same batch of *whatever*. I also don't like that it's ignoring NVidia, and ignoring Intel's push for general-purpose accelerators such as their Xeon Phi.

But, meh, if HSA is so easy, then it's low-hanging fruit, so whatever, go ahead and push for it.

=== REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:

"A more objective question: which devices are you trying to target here?"

=== AND SOMETHING ELSE:

I feel like we're just on different wavelengths. At what level do you imagine having this support, in terms of support for doing low-level things? Is this something like OpenMP, where threading and such are done at a really (really really really...) high level, or what?
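For readers who haven't met the term: a 'warpscan' is a cooperative, group-local prefix scan. Below is a hedged, serial D sketch of just the operation those work-items compute together through shared memory - it is not the paper's kernel, only the scan being parallelized, and the function name exclusiveScan is mine:

uint[] exclusiveScan(const(uint)[] group)
{
    auto result = new uint[group.length];
    uint running = 0;
    foreach (i, v; group)
    {
        result[i] = running; // each slot gets the sum of everything before it
        running += v;
    }
    return result;
}

On the GPU the same result is built cooperatively through group-local memory, one work-item per slot, which is exactly the kind of cooperation the single-key, no-interdependency model above can't express.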
August 18, 2013 Re: GPGPUs
Posted in reply to Atash

On Sunday, 18 August 2013 at 05:05:48 UTC, Atash wrote:
> On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
>> You do have limited atomics, but you don't really have any sort of complex messages, or anything like that.
>
> I said 'point 11', not 'point 10'. You also dodged points 1 and 3...
>
>> Intel doesn't have a dog in this race, so there is no way to know what they plan on doing, if anything at all.
>
> http://software.intel.com/en-us/vcsource/tools/opencl-sdk
> http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
>
> Just based on those, I'm pretty certain they 'have a dog in this race'. The dog happens to be running with MPI and OpenCL across a bridge made of PCIe.

The Xeon Phi is interesting in so far as taking generic programming to a more parallel environment. However it has some serious limitations that will heavily damage its potential performance.

AVX2 is completely the wrong path to go about improving performance in parallel computing. The SIMD nature of this instruction set means that scalar operations, or even just not being able to fill the giant 256/512-bit registers, waste huge chunks of this thing's peak theoretical performance, and if any rules apply to instruction pairing on this multi-issue pipeline you have yet more potential for wasted cycles.

I haven't seen anything about Intel's micro-thread scheduler, or how these chips handle the mass context switching natural to micro-threaded environments. These two items make a huge difference in performance; comparing Radeon VLIW5/4 to Radeon GCN is a good example, as most of the performance benefit of GCN comes from the ease of scheduling scalar pipelines over more complex pipes with instruction-pairing rules etc.

Frankly Intel has some cool stuff, but they have been caught with their pants down; they have depended on their large fab advantage to carry them over and got lazy.

We likely are watching AMD64 all over again.

>> The reason to point out HSA is that it is really easy to add support for; it is not a giant task like OpenCL would be. A few changes to the front-end compiler are all that is needed, LLVM's backend does the rest.
>
> H'okay. I can accept that.
>
>> OpenCL isn't just a library; it is a language extension that is run through a preprocessor which compiles the embedded __KERNEL and __DEVICE functions into usable code, and then outputs .c/.cpp files for the C compiler to deal with.
>
> But all those extra bits are part of the computing *environment*. Is there something wrong with requiring the proper environment for an executable?
>
> A more objective question: which devices are you trying to target here?

At first, simply a different way of approaching std.parallelism-like functionality, with an eye to gpgpu in the future when easy integration solutions pop up (such as HSA).

>> Those are all platform specific; they change based on the whim and fancy of NVIDIA and AMD with each and every new chip released: the size and configuration of CUDA clusters, or compute clusters, or EUs, or whatever the hell x chip maker feels like using at the moment.
>>
>> Long term this will all be managed by the underlying support software in the video drivers and operating system kernel. Putting any effort into this is a waste of time.
>
> Yes. And the only way to optimize around them is to *know them*, otherwise you're pinning the developer down the same way OpenMP does. Actually, even worse than the way OpenMP does - at least OpenMP lets you set some hints about how many threads you want.

It would be best to wait for a more generic software platform, to find out how this is handled by the next generation of micro-threading tools.

The way OpenCL/CUDA work reminds me too much of someone setting up tomcat to have java code generate php that runs on their apache server, just because they can. I would rather have tighter integration with the core language than a language within a language.

>> void example( aggregate in float a[] ; key , in float b[], out float c[]) {
>>     c[key] = a[key] + b[key];
>> }
>>
>> example(a,b,c);
>>
>> In the function declaration you can think of the aggregate basically having the reverse order of the items in a foreach statement.
>>
>> int a[100] = [ ... ];
>> int b[100];
>> foreach( v, k ; a ) { b = a[k]; }
>>
>> int a[100] = [ ... ];
>> int b[100];
>>
>> void example2( aggregate in float A[] ; k, out float B[] ) { B[k] = A[k]; }
>>
>> example2(a,b);
>
> Contextually solid. Read my response to the next bit.
>
>> I am pretty sure they are simply multiplying the index value by the unit size they desire to work on:
>>
>> int a[100] = [ ... ];
>> int b[100];
>> void example3( aggregate in range r ; k, in float a[], float b[]){
>>     b[k] = a[k];
>>     b[k+1] = a[k+1];
>> }
>>
>> example3( 0 .. 50 , a,b);
>>
>> Then likely they are simply executing multiple __KERNEL functions in sequence, would be my guess.
>
> I've implemented this algorithm before in OpenCL already, and what you're saying so far doesn't rhyme with what's needed.
>
> There are at least two ranges, one keeping track of partial summations, the other holding the partial sorts. Three separate kernels are run in cycles to reduce over and scatter the data. The way early exit is implemented isn't mentioned as part of the implementation details, but my implementation of the strategy requires a third range to act as a flag array to be reduced over and read in between kernel invocations.
>
> It isn't just unit size multiplication - there's communication between work-items and *exquisitely* arranged local-group reductions and scans (so-called 'warpscans') that take advantage of the widely accepted concept of a local group of work-items (a parameter you explicitly disregarded) and their shared memory pool. The entire point of the paper is that it's possible to come up with a general algorithm that can be parameterized to fit individual GPU configurations if desired. This kind of algorithm provides opportunities for tuning... which seem to be lost, unnecessarily, in what I've read so far in your descriptions.
>
> My point being, I don't like where this is going by treating coprocessors, which have so far been very *very* different from one another, as the same batch of *whatever*. I also don't like that it's ignoring NVidia, and ignoring Intel's push for general-purpose accelerators such as their Xeon Phi.
>
> But, meh, if HSA is so easy, then it's low-hanging fruit, so whatever, go ahead and push for it.
>
> === REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:
>
> "A more objective question: which devices are you trying to target here?"
>
> === AND SOMETHING ELSE:
>
> I feel like we're just on different wavelengths. At what level do you imagine having this support, in terms of support for doing low-level things? Is this something like OpenMP, where threading and such are done at a really (really really really...) high level, or what?

Low level optimization is a wonderful thing, but I almost wonder if this will always be something where, in order to do the low level optimization, you will be using the vendor's provided platform for doing it, as no generic tool will be able to match the custom one.

Most of my interaction with the gpu is via shader programs for OpenGL; I have only lightly used CUDA for some image processing software, so I am certainly not the one to give in-depth detail on optimization strategies.

Sorry, on point 1 that was a typo, I meant:

1. The range must be known prior to execution of a gpu code block.

As for:

3. Code blocks can only receive a single range, it can however be multidimensional

int a[100] = [ ... ];
int b[100];
void example3( aggregate in range r ; k, in float a[], float b[]){
    b[k] = a[k];
}
example3( 0 .. 100 , a,b);

This function would be executed 100 times.

int a[10_000] = [ ... ];
int b[10_000];
void example3( aggregate in range r ; kx, aggregate in range r2 ; ky, in float a[], float b[]){
    b[kx+(ky*100)] = a[kx+(ky*100)];
}
example3( 0 .. 100 , 0 .. 100 , a,b);

This function would be executed 10,000 times, the two aggregate ranges being treated as a single 2-dimensional range.

Maybe a better description of the rule would be that multiple ranges are multiplicative, and functionally operate as a single range.
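A plain CPU-side sketch may help make the index math of that multiplicative-range rule explicit; the helper name run2d and its exact signature are hypothetical, chosen only for illustration:

void run2d(size_t width, size_t height, const(float)[] a, float[] b)
{
    foreach (ky; 0 .. height)
        foreach (kx; 0 .. width)
        {
            auto k = kx + ky * width; // the single flattened key
            b[k] = a[k];
        }
}

run2d(100, 100, a, b) touches the same 10,000 keys as the two-range example3 call above, just serially.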
August 18, 2013 Re: GPGPUs
Posted in reply to luminousone

On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
> The Xeon Phi is interesting in so far as taking generic programming to a more parallel environment. However it has some serious limitations that will heavily damage its potential performance.
>
> AVX2 is completely the wrong path to go about improving performance in parallel computing. The SIMD nature of this instruction set means that scalar operations, or even just not being able to fill the giant 256/512-bit registers, waste huge chunks of this thing's peak theoretical performance, and if any rules apply to instruction pairing on this multi-issue pipeline you have yet more potential for wasted cycles.
>
> I haven't seen anything about Intel's micro-thread scheduler, or how these chips handle the mass context switching natural to micro-threaded environments. These two items make a huge difference in performance; comparing Radeon VLIW5/4 to Radeon GCN is a good example, as most of the performance benefit of GCN comes from the ease of scheduling scalar pipelines over more complex pipes with instruction-pairing rules etc.
>
> Frankly Intel has some cool stuff, but they have been caught with their pants down; they have depended on their large fab advantage to carry them over and got lazy.
>
> We likely are watching AMD64 all over again.

Well, I can't argue that one.

> At first, simply a different way of approaching std.parallelism-like functionality, with an eye to gpgpu in the future when easy integration solutions pop up (such as HSA).

I can't argue with that either.

> It would be best to wait for a more generic software platform, to find out how this is handled by the next generation of micro-threading tools.
>
> The way OpenCL/CUDA work reminds me too much of someone setting up tomcat to have java code generate php that runs on their apache server, just because they can. I would rather have tighter integration with the core language than a language within a language.

Fair point. I have my own share of idyllic wants, so I can't argue with those.

> Low level optimization is a wonderful thing, but I almost wonder if this will always be something where, in order to do the low level optimization, you will be using the vendor's provided platform for doing it, as no generic tool will be able to match the custom one.

But OpenCL is by no means a 'custom tool'. CUDA, maybe, but OpenCL just doesn't fit the bill in my opinion. I can see it being possible in the future that it'd be considered 'low-level', but it's a fairly generic solution. A little hackneyed under your earlier metaphors, but still a generic, standard solution.

> Most of my interaction with the gpu is via shader programs for OpenGL; I have only lightly used CUDA for some image processing software, so I am certainly not the one to give in-depth detail on optimization strategies.

There was a *lot* of stuff that opened up when vendors dumped GPGPU out of Pandora's box. If you want to get a feel for some optimization strategies and what they require, check this site out: http://www.bealto.com/gpu-sorting_intro.html (and I hope I'm not insulting your intelligence here, if I am, I truly apologize).

> Sorry, on point 1 that was a typo, I meant:
>
> 1. The range must be known prior to execution of a gpu code block.
>
> As for:
>
> 3. Code blocks can only receive a single range, it can however be multidimensional
>
> int a[100] = [ ... ];
> int b[100];
> void example3( aggregate in range r ; k, in float a[], float b[]){
>     b[k] = a[k];
> }
> example3( 0 .. 100 , a,b);
>
> This function would be executed 100 times.
>
> int a[10_000] = [ ... ];
> int b[10_000];
> void example3( aggregate in range r ; kx, aggregate in range r2 ; ky, in float a[], float b[]){
>     b[kx+(ky*100)] = a[kx+(ky*100)];
> }
> example3( 0 .. 100 , 0 .. 100 , a,b);
>
> This function would be executed 10,000 times, the two aggregate ranges being treated as a single 2-dimensional range.
>
> Maybe a better description of the rule would be that multiple ranges are multiplicative, and functionally operate as a single range.

OH.

I think I was totally misunderstanding you earlier. The 'aggregate' is the range over the *problem space*, not the values being punched into the problem. Is this true or false?

(if true I'm about to feel incredibly sheepish)
August 18, 2013 Re: GPGPUs
Posted in reply to Atash

On Sunday, 18 August 2013 at 07:28:02 UTC, Atash wrote:
> On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
>> The Xeon Phi is interesting in so far as taking generic programming to a more parallel environment. However it has some serious limitations that will heavily damage its potential performance.
>>
>> AVX2 is completely the wrong path to go about improving performance in parallel computing. The SIMD nature of this instruction set means that scalar operations, or even just not being able to fill the giant 256/512-bit registers, waste huge chunks of this thing's peak theoretical performance, and if any rules apply to instruction pairing on this multi-issue pipeline you have yet more potential for wasted cycles.
>>
>> I haven't seen anything about Intel's micro-thread scheduler, or how these chips handle the mass context switching natural to micro-threaded environments. These two items make a huge difference in performance; comparing Radeon VLIW5/4 to Radeon GCN is a good example, as most of the performance benefit of GCN comes from the ease of scheduling scalar pipelines over more complex pipes with instruction-pairing rules etc.
>>
>> Frankly Intel has some cool stuff, but they have been caught with their pants down; they have depended on their large fab advantage to carry them over and got lazy.
>>
>> We likely are watching AMD64 all over again.
>
> Well, I can't argue that one.
>
>> At first, simply a different way of approaching std.parallelism-like functionality, with an eye to gpgpu in the future when easy integration solutions pop up (such as HSA).
>
> I can't argue with that either.
>
>> It would be best to wait for a more generic software platform, to find out how this is handled by the next generation of micro-threading tools.
>>
>> The way OpenCL/CUDA work reminds me too much of someone setting up tomcat to have java code generate php that runs on their apache server, just because they can. I would rather have tighter integration with the core language than a language within a language.
>
> Fair point. I have my own share of idyllic wants, so I can't argue with those.
>
>> Low level optimization is a wonderful thing, but I almost wonder if this will always be something where, in order to do the low level optimization, you will be using the vendor's provided platform for doing it, as no generic tool will be able to match the custom one.
>
> But OpenCL is by no means a 'custom tool'. CUDA, maybe, but OpenCL just doesn't fit the bill in my opinion. I can see it being possible in the future that it'd be considered 'low-level', but it's a fairly generic solution. A little hackneyed under your earlier metaphors, but still a generic, standard solution.

I can agree with that.

>> Most of my interaction with the gpu is via shader programs for OpenGL; I have only lightly used CUDA for some image processing software, so I am certainly not the one to give in-depth detail on optimization strategies.
>
> There was a *lot* of stuff that opened up when vendors dumped GPGPU out of Pandora's box. If you want to get a feel for some optimization strategies and what they require, check this site out: http://www.bealto.com/gpu-sorting_intro.html (and I hope I'm not insulting your intelligence here, if I am, I truly apologize).

I am still learning, and additional links to go over never hurt! I am of the opinion that a good programmer has never finished learning new stuff.

>> Sorry, on point 1 that was a typo, I meant:
>>
>> 1. The range must be known prior to execution of a gpu code block.
>>
>> As for:
>>
>> 3. Code blocks can only receive a single range, it can however be multidimensional
>>
>> int a[100] = [ ... ];
>> int b[100];
>> void example3( aggregate in range r ; k, in float a[], float b[]){
>>     b[k] = a[k];
>> }
>> example3( 0 .. 100 , a,b);
>>
>> This function would be executed 100 times.
>>
>> int a[10_000] = [ ... ];
>> int b[10_000];
>> void example3( aggregate in range r ; kx, aggregate in range r2 ; ky, in float a[], float b[]){
>>     b[kx+(ky*100)] = a[kx+(ky*100)];
>> }
>> example3( 0 .. 100 , 0 .. 100 , a,b);
>>
>> This function would be executed 10,000 times, the two aggregate ranges being treated as a single 2-dimensional range.
>>
>> Maybe a better description of the rule would be that multiple ranges are multiplicative, and functionally operate as a single range.
>
> OH.
>
> I think I was totally misunderstanding you earlier. The 'aggregate' is the range over the *problem space*, not the values being punched into the problem. Is this true or false?
>
> (if true I'm about to feel incredibly sheepish)

Is "problem space" the correct industry term? I am self-taught on much of this, so on occasion I miss out on what the correct terminology for something is. But yes, that is what I meant.
August 18, 2013 Re: GPGPUs
Posted in reply to luminousone

I chose the term aggregate, because it is the term used in the description of the foreach syntax.

foreach( value, key ; aggregate )

aggregate being an array or range, it seems to fit as even when the aggregate is an array, as you still implicitly have a range being "0 .. array.length", and will have a key or index position created by the foreach in addition to the value.

A wrapped function could very easily be similar to the intended initial outcome

void example( ref float a[], float b[], float c[] ) {
    foreach( v, k ; a ) {
        a[k] = b[k] + c[k];
    }
}

is functionally the same as

void example( aggregate ref float a[] ; k, float b[], float c[] ) {
    a[k] = b[k] + c[k];
}

maybe : would make more sense than ; but I am not sure as to the best way to represent that index value.
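For what it's worth, the wrapped form can already be approximated today as a pure library construct, without any new keyword; here is a minimal sketch using std.parallelism, where the helper name eachIndex is hypothetical and only stands in for whatever the real dispatch mechanism (CPU threads now, HSA later) would be:

import std.parallelism : parallel;
import std.range : iota;

// hypothetical helper: run the kernel body once per key in 0 .. n
void eachIndex(alias kernel)(size_t n)
{
    foreach (k; parallel(iota(n)))
        kernel(k);
}

void example(float[] a, const(float)[] b, const(float)[] c)
{
    eachIndex!((size_t k) { a[k] = b[k] + c[k]; })(a.length);
}

What this cannot express is the declaration-site 'aggregate' marker itself - the part that would let a compiler lower the body to a gpu kernel - which is presumably the point of asking for syntax rather than a template.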
August 18, 2013 Re: GPGPUs
Posted in reply to eles

On Tuesday, 13 August 2013 at 18:21:12 UTC, eles wrote:
> On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
>> The entry point would be if D had a way of creating GPGPU kernels that
>> is better than the current C/C++ + tooling.
>
> You mean an alternative to OpenCL language?
>
> Because, I imagine, a library (libopencl) would be easy enough to write/bind.
>
> Who'll gonna standardize this language?
Tra55er did it long ago - look at the cl4d wrapper. I think it is on GitHub.
August 18, 2013 Re: GPGPUs
Posted in reply to Dejan Lekic

On Sunday, 18 August 2013 at 18:19:06 UTC, Dejan Lekic wrote:
> On Tuesday, 13 August 2013 at 18:21:12 UTC, eles wrote:
>> On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
>>> The entry point would be if D had a way of creating GPGPU kernels that
>>> is better than the current C/C++ + tooling.
>>
>> You mean an alternative to OpenCL language?
>>
>> Because, I imagine, a library (libopencl) would be easy enough to write/bind.
>>
>> Who'll gonna standardize this language?
>
> Tra55er did it long ago - look at the cl4d wrapper. I think it is on GitHub.

I had no idea that existed. Thanks :) https://github.com/Trass3r/cl4d
August 18, 2013 Re: GPGPUs
Posted in reply to John Colvin

On Sun, 2013-08-18 at 20:27 +0200, John Colvin wrote:
> On Sunday, 18 August 2013 at 18:19:06 UTC, Dejan Lekic wrote:
> > On Tuesday, 13 August 2013 at 18:21:12 UTC, eles wrote:
> >> On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
> >>> The entry point would be if D had a way of creating GPGPU kernels that
> >>> is better than the current C/C++ + tooling.
> >>
> >> You mean an alternative to OpenCL language?
> >>
> >> Because, I imagine, a library (libopencl) would be easy enough to write/bind.
> >>
> >> Who'll gonna standardize this language?
> >
> > Tra55er did it long ago - look at the cl4d wrapper. I think it is on GitHub.

Thanks for pointing this out, I had completely missed it.

> I had no idea that existed. Thanks :) https://github.com/Trass3r/cl4d

I had missed that as well. Bad Google and GitHub skills on my part, clearly.

I think the path is now obvious: ask if the owner will turn this repository over to a group so that it can become the focus of future work via the repository's wiki and issue tracker. I will fork this repository as is and begin to analyse the status quo wrt the discussion recently on the email list.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
August 18, 2013 Re: GPGPUs
Posted in reply to luminousone

I'm not sure if 'problem space' is the industry standard term (in fact I doubt it), but it's certainly a term I've used over the years by taking a leaf out of math books and whatever my professors had touted. :-D I wish I knew what the standard term was, but for now I'm latching onto that because it seems to describe at a high implementation-agnostic level what's up, and in my personal experience most people seem to 'get it' when I use the term - it has empirically had an accurate connotation.
That all said, I'd like to know what the actual term is, too. -.-'
On Sunday, 18 August 2013 at 08:21:18 UTC, luminousone wrote:
> I chose the term aggregate, because it is the term used in the description of the foreach syntax.
>
> foreach( value, key ; aggregate )
>
> aggregate being an array or range, it seems to fit as even when the aggregate is an array, as you still implicitly have a range being "0 .. array.length", and will have a key or index position created by the foreach in addition to the value.
>
> A wrapped function could very easily be similar to the intended initial outcome
>
> void example( ref float a[], float b[], float c[] ) {
>
> foreach( v, k ; a ) {
> a[k] = b[k] + c[k];
> }
> }
>
> is functionally the same as
>
> void example( aggregate ref float a[] ; k, float b[], float c[] ) {
> a[k] = b[k] + c[k];
> }
>
> maybe : would make more sense than ; but I am not sure as to the best way to represent that index value.
Aye, that makes awesome sense, but I'm left wishing that there was something in that syntax to support access to local/shared memory between work-items. Or, better yet, some way of hinting at desired amounts of memory in the various levels of the non-global memory hierarchy and a way of accessing those requested allocations.
I mean, I haven't *seen* anyone do anything device-wise with more hierarchical levels than just global-shared-private, but it's always bothered me that in OpenCL we could only specify memory allocations on those three levels. What if someone dropped in another hierarchical level? Suddenly it'd open another door to optimization of code, and there'd be no way for OpenCL to access it. Or what if someone scrapped local memory altogether, for whatever reason? The industry may never add/remove such memory levels, but, still, it just feels... kinda wrong that OpenCL doesn't have an immediate way of saying, "A'ight, it's cool that you have this, Mr. XYZ-compute-device, I can deal with it," before proceeding to put on sunglasses and driving away in a Ferrari. Or something like that.
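To make that wish a bit more concrete, here is a small hedged sketch - every name in it is hypothetical, not an existing API - of the kind of level-agnostic memory hint being asked for: the kernel declares how much scratch it wants at each named level, and a device with more (or fewer) levels could still be described without changing the shape of the request.

// hypothetical sketch of a level-agnostic memory hint
enum MemoryLevel { global, groupLocal, workItemPrivate }

struct MemoryRequest
{
    MemoryLevel level;
    size_t      bytes;
}

// e.g. ask for 4 KiB of group-local scratch per work-group at launch time
auto scratch = [MemoryRequest(MemoryLevel.groupLocal, 4096)];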