August 16, 2013 Re: GPGPUs
Posted in reply to Paul Jurczak
On Fri, 2013-08-16 at 12:41 +0200, Paul Jurczak wrote:
[…]
> It seems to me that you are describing something similar to C++ AMP, which is a high level, language specific solution to GPGPU problem.

C++ AMP may be an open specification, but it only targets DirectX. The ideas behind it are very sensible, though: use closures and internal iteration with library support to drive the compiler to construct the required kernels.

Today you have to download the kernel to the attached GPGPU over the bus. In the near future the GPGPU will exist in a single memory address space shared with all the CPUs. At that point separately downloadable kernels become a thing of the past; it becomes a compiler/loader issue to get things right.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
August 16, 2013 Re: GPGPUs
On 08/16/2013 12:04 PM, Russel Winder wrote:
> I guess my question is whether people are interested in std.gpgpu (or
> some more sane name).
Yes, I'd be interested, particularly if it's possible to produce a GPGPU solution that is much more user-friendly than the current C/C++ options.
I think this could have a good deal of importance for scientific simulation and other similarly demanding computational tasks.
August 16, 2013 Re: GPGPUs
Posted in reply to Russel Winder

On Friday, 16 August 2013 at 12:18:49 UTC, Russel Winder wrote:
> On Fri, 2013-08-16 at 12:41 +0200, Paul Jurczak wrote:
> […]
> Today you have to download the kernel to the attached GPGPU over the bus. In the near future the GPGPU will exist in a single memory address space shared with all the CPUs. At this point separately downloadable kernels become a thing of the past, it becomes a compiler/loader issue to get things right.

I'm iffy on the assumption that the future holds unified memory for heterogeneous devices. Even relatively recent products such as the Intel Xeon Phi have totally separate memory. I'm not aware of any general-computation-oriented products that don't have separate memory.

I'm also of the opinion that as long as people want to have devices that can scale in size, there will be modular devices. Because they're modular, there's some sort of a spacing between them and the machine, ex. PCIe (and, somewhat importantly, a physical distance between the added device and the CPU-stuff). Because of that, they're likely to have their own memory. Therefore, I'm personally not willing to bank on anything short of targeting the least common denominator here (non-uniform access memory), specifically because it looks like a necessity for scaling a physical collection of heterogeneous devices up in size, which in turn I *think* is a necessity for people trying to deal with growing data sets in the real world.

Annnnnnnnnnnndddddd because heterogeneous compute devices aren't *just* GPUs (ex. Intel Xeon Phi), I'd strongly suggest picking a more general name, like 'accelerators' or 'apu' (except AMD totally ran away with that acronym in marketing and I sort of hate them for it) or '<something-I-can't-think-of-because-words-are-hard>'.

That said, I'm no expert, so go ahead and rip 'mah opinions apart. :-D
August 16, 2013 Re: GPGPUs
Posted in reply to Russel Winder

> The core (!) point here is that processor chips are rapidly becoming a
> collection of heterogeneous cores. Any programming language that assumes
> a single CPU or a collection of homogeneous CPUs has built-in
> obsolescence.
>
> So the question I am interested in is whether D is the language that can
> allow me to express in a single codebase a program in which parts will
> be executed on one or more GPGPUs and parts on multiple CPUs. D has
> support for the latter, std.parallelism and std.concurrency.
>
> I guess my question is whether people are interested in std.gpgpu (or
> some more sane name).
CUDA works as a preprocessor pass that generates C files from .cu extension files.

In effect, to create a sensible environment for microthreaded programming, they extend the language.

A basic CUDA function looks something like this:
__global__ void add(float* a, float* b, float* c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

add<<<1, 10>>>(ptrA, ptrB, ptrC);
There are built-in variables to handle the index location (threadIdx.x in the above example); these are supplied by the thread scheduler on the video card/APU device.

Generally, calls into this setup have very high latency, so using it for a small handful of items as in the above example makes no sense. The example above would end up using a single execution cluster and leave you prey to the latency of the PCIe bus, the execution time, and the latency costs of the video memory.

It doesn't become effective until you are working with large data sets that can take advantage of a massive number of threads, where the latency problems are secondary to the sheer amount of calculation being done.
As far as D goes, we really only have one built-in microthreading-capable language construct: foreach.
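On the CPU side, std.parallelism already covers this pattern today; here is a minimal sketch of the CUDA add kernel above written with its parallel foreach (array names and contents are just placeholders):

import std.parallelism : parallel;

// CPU analogue of the CUDA add kernel above: each index is an
// independent work item, distributed over the default thread pool.
void add(float[] a, float[] b, float[] c)
{
    foreach (i, ref ci; parallel(c))
        ci = a[i] + b[i];
}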
However, I don't think a library extension similar to std.parallelism would work for GPU-based microthreading.

foreach would need something to tell the compiler to generate GPU bytecode for the code block it uses, and would need instructions on when to use said code block based on data set size.

While it is entirely possible to change functions very little (just add a new @microthreaded property and the built-in variables for the index position(s)), the calling syntax would need changes to support a work range or multidimensional range of some sort.
Perhaps looking something like this:

add$(1 .. 10)(ptrA, ptrB, ptrC);

A templated function would look similar:

add!(float)$(1 .. 10)(ptrA, ptrB, ptrC);
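The declaration side might then look something like the following (completely hypothetical: @microthreaded does not exist, and the built-in index variable, here called threadIndex, is made up for illustration):

// Hypothetical kernel declaration mirroring the CUDA example above.
@microthreaded void add(float* a, float* b, float* c)
{
    auto i = threadIndex.x;
    c[i] = a[i] + b[i];
}

// Invoked over an explicit work range with the proposed call syntax:
// add$(0 .. n)(ptrA, ptrB, ptrC);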
August 16, 2013 Re: GPGPUs
Posted in reply to luminousone

On Friday, 16 August 2013 at 19:55:56 UTC, luminousone wrote:
>> The core (!) point here is that processor chips are rapidly becoming a
>> collection of heterogeneous cores. Any programming language that assumes
>> a single CPU or a collection of homogeneous CPUs has built-in
>> obsolescence.
>>
>> So the question I am interested in is whether D is the language that can
>> allow me to express in a single codebase a program in which parts will
>> be executed on one or more GPGPUs and parts on multiple CPUs. D has
>> support for the latter, std.parallelism and std.concurrency.
>>
>> I guess my question is whether people are interested in std.gpgpu (or
>> some more sane name).
>
> CUDA works as a preprocessor pass that generates C files from .cu extension files.
>
> In effect, to create a sensible environment for microthreaded programming, they extend the language.
>
> A basic CUDA function looks something like this:
>
> __global__ void add(float* a, float* b, float* c) {
>     int i = threadIdx.x;
>     c[i] = a[i] + b[i];
> }
>
> add<<<1, 10>>>(ptrA, ptrB, ptrC);
>
> There are built-in variables to handle the index location (threadIdx.x in the above example); these are supplied by the thread scheduler on the video card/APU device.
>
> Generally, calls into this setup have very high latency, so using it for a small handful of items as in the above example makes no sense. The example above would end up using a single execution cluster and leave you prey to the latency of the PCIe bus, the execution time, and the latency costs of the video memory.
>
> It doesn't become effective until you are working with large data sets that can take advantage of a massive number of threads, where the latency problems are secondary to the sheer amount of calculation being done.
>
> As far as D goes, we really only have one built-in microthreading-capable language construct: foreach.
>
> However, I don't think a library extension similar to std.parallelism would work for GPU-based microthreading.
>
> foreach would need something to tell the compiler to generate GPU bytecode for the code block it uses, and would need instructions on when to use said code block based on data set size.
>
> While it is entirely possible to change functions very little (just add a new @microthreaded property and the built-in variables for the index position(s)), the calling syntax would need changes to support a work range or multidimensional range of some sort.
>
> Perhaps looking something like this:
>
> add$(1 .. 10)(ptrA, ptrB, ptrC);
>
> A templated function would look similar:
>
> add!(float)$(1 .. 10)(ptrA, ptrB, ptrC);
We have a[] = b[] * c[] - 5; etc. which could work very neatly perhaps?
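For anyone unfamiliar, that is the existing array-operation syntax; a minimal compilable example, with arbitrary contents, might look like:

void main()
{
    auto b = [1.0f, 2.0f, 3.0f];
    auto c = [4.0f, 5.0f, 6.0f];
    auto a = new float[3];

    // element-wise multiply, then scalar subtract, in one statement
    a[] = b[] * c[] - 5;
}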
August 16, 2013 Re: GPGPUs
Posted in reply to luminousone

On Friday, 16 August 2013 at 19:55:56 UTC, luminousone wrote:
>> The core (!) point here is that processor chips are rapidly becoming a
>> collection of heterogeneous cores. Any programming language that assumes
>> a single CPU or a collection of homogeneous CPUs has built-in
>> obsolescence.
>>
>> So the question I am interested in is whether D is the language that can
>> allow me to express in a single codebase a program in which parts will
>> be executed on one or more GPGPUs and parts on multiple CPUs. D has
>> support for the latter, std.parallelism and std.concurrency.
>>
>> I guess my question is whether people are interested in std.gpgpu (or
>> some more sane name).
>
> CUDA works as a preprocessor pass that generates C files from .cu extension files.
>
> In effect, to create a sensible environment for microthreaded programming, they extend the language.
>
> A basic CUDA function looks something like this:
>
> __global__ void add(float* a, float* b, float* c) {
>     int i = threadIdx.x;
>     c[i] = a[i] + b[i];
> }
>
> add<<<1, 10>>>(ptrA, ptrB, ptrC);
>
> There are built-in variables to handle the index location (threadIdx.x in the above example); these are supplied by the thread scheduler on the video card/APU device.
>
> Generally, calls into this setup have very high latency, so using it for a small handful of items as in the above example makes no sense. The example above would end up using a single execution cluster and leave you prey to the latency of the PCIe bus, the execution time, and the latency costs of the video memory.
>
> It doesn't become effective until you are working with large data sets that can take advantage of a massive number of threads, where the latency problems are secondary to the sheer amount of calculation being done.
>
> As far as D goes, we really only have one built-in microthreading-capable language construct: foreach.
>
> However, I don't think a library extension similar to std.parallelism would work for GPU-based microthreading.
>
> foreach would need something to tell the compiler to generate GPU bytecode for the code block it uses, and would need instructions on when to use said code block based on data set size.
>
> While it is entirely possible to change functions very little (just add a new @microthreaded property and the built-in variables for the index position(s)), the calling syntax would need changes to support a work range or multidimensional range of some sort.
>
> Perhaps looking something like this:
>
> add$(1 .. 10)(ptrA, ptrB, ptrC);
>
> A templated function would look similar:
>
> add!(float)$(1 .. 10)(ptrA, ptrB, ptrC);
Regarding functionality, @microthreaded is sounding a lot like the __kernel or __global__ keywords in OpenCL and CUDA. Is this intentional?
The more metaphors that can be drawn between extant tools and whatever is come up with, the better, methinks.
August 16, 2013 Re: GPGPUs
Posted in reply to John Colvin

On Friday, 16 August 2013 at 20:07:32 UTC, John Colvin wrote:
> We have a[] = b[] * c[] - 5; etc. which could work very neatly perhaps?
While this could in fact work, given the nature of GPGPU it would not be very effective.

In a non-shared-memory, non-cache-coherent setup, the entirety of all three arrays has to be copied into GPU memory, the statement run as GPU bytecode, and the result copied back to complete the operation.

GPGPU doesn't make sense for small code blocks, both in instruction count and in how memory-bound a particular statement would be.
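To put rough numbers on that, here is a quick back-of-envelope sketch (the bus bandwidth and GPU throughput figures are assumed ballpark values for 2013-era hardware, not measurements):

import std.stdio;

void main()
{
    // Assumed figures: ~6 GB/s effective PCIe bandwidth, ~1 TFLOP/s GPU.
    enum size_t n = 1_000_000;                  // elements per array
    enum double busBytesPerSec = 6e9;
    enum double gpuFlopsPerSec = 1e12;

    double bytesMoved = 3.0 * n * float.sizeof; // b and c down, a back
    double transferSecs = bytesMoved / busBytesPerSec;
    double computeSecs = 2.0 * n / gpuFlopsPerSec; // one multiply + one subtract per element

    writefln("transfer ~%.2g s, compute ~%.2g s", transferSecs, computeSecs);
    // Prints roughly: transfer ~0.002 s, compute ~2e-06 s,
    // i.e. the bus transfer dominates the arithmetic by about 1000x.
}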
The compiler either needs to be explicitly told what can/should be run as a GPU function, or needs some intelligence about what to run, or not run, as a GPU function.

This will get better in the future: APUs using the full HSA implementation will drastically reduce the "buy-in" latency/cycle cost of using a GPGPU function and make it more practical for smaller (in instruction count/memory-boundedness) operations.
August 16, 2013 Re: GPGPUs
Posted in reply to Atash

On Friday, 16 August 2013 at 21:14:12 UTC, Atash wrote:
> Regarding functionality, @microthreaded is sounding a lot like the __kernel or __global__ keywords in OpenCL and CUDA. Is this intentional?
>
> The more metaphors that can be drawn between extant tools and whatever is come up with the better, methinks.
Yes. And that is just a word I pulled out of the air; if another term makes more sense then I am not against it.
August 16, 2013 Re: GPGPUs
Posted in reply to luminousone

On Friday, 16 August 2013 at 22:11:41 UTC, luminousone wrote:
> On Friday, 16 August 2013 at 20:07:32 UTC, John Colvin wrote:
>> We have a[] = b[] * c[] - 5; etc. which could work very neatly perhaps?
>
> While this could in fact work, given the nature of GPGPU it would not be very effective.
>
> In a non-shared-memory, non-cache-coherent setup, the entirety of all three arrays has to be copied into GPU memory, the statement run as GPU bytecode, and the result copied back to complete the operation.
>
> GPGPU doesn't make sense for small code blocks, both in instruction count and in how memory-bound a particular statement would be.
>
> The compiler either needs to be explicitly told what can/should be run as a GPU function, or needs some intelligence about what to run, or not run, as a GPU function.
>
> This will get better in the future: APUs using the full HSA implementation will drastically reduce the "buy-in" latency/cycle cost of using a GPGPU function and make it more practical for smaller (in instruction count/memory-boundedness) operations.

I didn't literally mean automatically inserting GPU code.

I was more imagining this:

void foo(T)(T[] arr)
{
    useArray(arr);
}

auto a = someLongArray;
auto b = someOtherLongArray;

gpu
{
    auto aGPU = toGPUMem(a);
    auto bGPU = toGPUMem(b);

    auto c = GPUArr(a.length);

    c[] = a[] * b[];

    auto cCPU = toCPUMem(c);
    c.foo();

    dot(c, iota(c.length).array().toGPUMem())
        .foo();
}

gpu T dot(T)(T[] a, T[] b)
{
    //gpu dot product
}

with CPU arrays and GPU arrays identified separately in the type system. Automatic conversions could be possible, but of course that would allow carelessness.

Obviously there is some CPU code mixed in with the GPU code there, which should be executed asynchronously if possible. You could also have

onlyGPU
{
    //only code that can all be executed on the GPU.
}

Just ideas off the top of my head. Definitely full of holes and I haven't really considered the detail :)
August 17, 2013 Re: GPGPUs
Posted in reply to John Colvin

On Friday, 16 August 2013 at 23:30:12 UTC, John Colvin wrote:
> On Friday, 16 August 2013 at 22:11:41 UTC, luminousone wrote:
>> On Friday, 16 August 2013 at 20:07:32 UTC, John Colvin wrote:
>>> We have a[] = b[] * c[] - 5; etc. which could work very neatly perhaps?
>>
>> While this could in fact work, given the nature of GPGPU it would not be very effective.
>>
>> In a non-shared-memory, non-cache-coherent setup, the entirety of all three arrays has to be copied into GPU memory, the statement run as GPU bytecode, and the result copied back to complete the operation.
>>
>> GPGPU doesn't make sense for small code blocks, both in instruction count and in how memory-bound a particular statement would be.
>>
>> The compiler either needs to be explicitly told what can/should be run as a GPU function, or needs some intelligence about what to run, or not run, as a GPU function.
>>
>> This will get better in the future: APUs using the full HSA implementation will drastically reduce the "buy-in" latency/cycle cost of using a GPGPU function and make it more practical for smaller (in instruction count/memory-boundedness) operations.
>
> I didn't literally mean automatically inserting GPU code.
>
> I was more imagining this:
>
> void foo(T)(T[] arr)
> {
>     useArray(arr);
> }
>
> auto a = someLongArray;
> auto b = someOtherLongArray;
>
> gpu
> {
>     auto aGPU = toGPUMem(a);
>     auto bGPU = toGPUMem(b);
>
>     auto c = GPUArr(a.length);
>
>     c[] = a[] * b[];
>
>     auto cCPU = toCPUMem(c);
>     c.foo();
>
>     dot(c, iota(c.length).array().toGPUMem())
>         .foo();
> }
>
> gpu T dot(T)(T[] a, T[] b)
> {
>     //gpu dot product
> }
>
> with CPU arrays and GPU arrays identified separately in the type system. Automatic conversions could be possible, but of course that would allow carelessness.
>
> Obviously there is some CPU code mixed in with the GPU code there, which should be executed asynchronously if possible. You could also have
>
> onlyGPU
> {
>     //only code that can all be executed on the GPU.
> }
>
> Just ideas off the top of my head. Definitely full of holes and I haven't really considered the detail :)

You can't mix CPU and GPU code; they must be separate.

auto a = someLongArray;
auto b = someOtherLongArray;

auto aGPU = toGPUMem(a);
auto bGPU = toGPUMem(b);

auto c = GPUArr(a.length);

gpu
{
    // this block is one gpu shader program
    c[] = a[] * b[];
}

auto cCPU = toCPUMem(c);
cCPU.foo();

auto cGPU = toGPUMem(cCPU);
auto dGPU = iota(c.length).array().toGPUMem();

gpu
{
    // this block is another wholly separate shader program
    auto resultGPU = dot(cGPU, dGPU);
}

auto resultCPU = toCPUMem(resultGPU);
resultCPU.foo();

gpu T dot(T)(T[] a, T[] b)
{
    //gpu dot product
}

That is your example rewritten to fit the GPU. However, this still has the problem of the CPU having to generate CPU code from the contents of gpu{} code blocks, as the GPU is unable to allocate memory. So, for example,

gpu
{
    auto resultGPU = dot(c, cGPU);
}

likely either won't work, or generates an array allocation in CPU code before the gpu block is otherwise run.

Also, how does that dot product function know the correct index range to run on? Are we assuming it knows based on the length of a? While the syntax

c[] = a[] * b[];

is safe for this sort of call, a function is less safe to do this with: with function calls the range needs to be told to the function, and you would call such a function without the gpu{} block, as the function itself is marked:

auto resultGPU = dot$(0 .. returnLesser(cGPU.length, dGPU.length))(cGPU, dGPU);

Remember, with GPUs you don't send instructions, you send whole programs, and the whole program must finish before you can move on to the next CPU instruction.