January 06, 2012
> I see. While you design, you need to think about the other features of D :-) Is it possible to mix CPU SIMD with D vector ops?
> 
> __float4[10] a, b, c;
> c[] = a[] + b[];

And generally, if the D compiler receives just D vector ops, what's a good way for the compiler to map them efficiently (even if less efficiently than hand-written SIMD operations) to SIMD ops? Generally you can't ask all D programmers to use __float4; some of them will want to use just D vector ops, even though they are less efficient, because they are simpler to use. So the duty of a good D compiler is to compile those efficiently enough too.

Bye,
bearophile
January 06, 2012
On 6 January 2012 16:06, bearophile <bearophileHUGS@lycos.com> wrote:

> Manu:
>
> > To make it perform float4 math, or double2 math, you either write the pseudo assembly you want directly or, more realistically, you use the __float4 type supplied in the standard library, which will already associate all the float4 related functionality, and try to map it across various architectures as efficiently as possible.
>
> I see. While you design, you need to think about the other features of D :-) Is it possible to mix CPU SIMD with D vector ops?
>
> __float4[10] a, b, c;
> c[] = a[] + b[];
>

I don't see any issue with this. An array of vectors makes perfect sense,
and I see no reason why arrays/slices/etc of hardware vectors should be any
sort of problem.
This particular expression should be just as efficient as if it were an
array of flat floats, especially if the compiler unrolls it.

D's array/slice syntax is something I'm very excited about actually in conjunction with hardware vectors. I could do some really elegant geometry processing with slices from vertex streams.


January 06, 2012
On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
[...]
> I don't see any issue with this. An array of vectors makes perfect sense,
> and I see no reason why arrays/slices/etc of hardware vectors should be any
> sort of problem.
> This particular expression should be just as efficient as if it were an
> array of flat floats, especially if the compiler unrolls it.
> 
> D's array/slice syntax is something I'm very excited about actually in conjunction with hardware vectors. I could do some really elegant geometry processing with slices from vertex streams.

Excuse me for jumping in part way through, apologies if I have the "wrong end of the stick".

As I understand it currently the debate to date has effectively revolved around how to have first class support in D for the SSE (vectorizing) capability of the x86 architecture.  This immediately raises the questions in my mind:

1.  Should x86-specific things be reified in the D language?  Given that ARM and other architectures are increasingly more important than x86, D should not tie itself to x86.

2.  Is there a way of doing something in D so that GPGPU can be described?

Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
addicts) or OpenCL (for Apple addicts and others).  It would be good if
D could just take over this market by being able to manage GPU kernels
easily.  The risk is that PyCUDA and PyOpenCL beat D to market
leadership.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder


January 06, 2012
From what I see in HPC conference papers and webcasts, I think it might already be too late for D
in those scenarios.

"Russel Winder"  wrote in message news:mailman.107.1325862128.16222.digitalmars-d@puremagic.com...
On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
[...]

Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
addicts) or OpenCL (for Apple addicts and others).  It would be good if
D could just take over this market by being able to manage GPU kernels
easily.  The risk is that PyCUDA and PyOpenCL beat D to market
leadership.


January 06, 2012
On 6 January 2012 16:12, bearophile <bearophileHUGS@lycos.com> wrote:

> > I see. While you design, you need to think about the other features of D
> :-) Is it possible to mix CPU SIMD with D vector ops?
> >
> > __float4[10] a, b, c;
> > c[] = a[] + b[];
>
> And generally, if the D compiler receives just D vector ops, what's a good way for the compiler to map them efficiently (even if less efficiently than hand-written SIMD operations) to SIMD ops? Generally you can't ask all D programmers to use __float4; some of them will want to use just D vector ops, even though they are less efficient, because they are simpler to use. So the duty of a good D compiler is to compile those efficiently enough too.
>

I'm not clear what you mean. Are you talking about D arrays of hardware
vectors, as in your example above? (no problem, see my last post)

Or are you talking about programmers who will prefer to use float[4]
instead of __float4? (this is what I think you're getting at?)...
Users who prefer to use float[4] are welcome to do so, but I think you are
mistaken when you assume this will be 'simpler to use'.. The rules for what
they can/can't do efficiently with a float[4] are extremely restrictive,
and it's also very unclear if/when they are violating said rules.
It will almost always be faster to let the float unit do all the work in
this case... Perhaps the compiler COULD apply some SIMD optimisations in
very specific cases, but this would require some
serious sophistication from the compiler to detect.

Some likely problems:
  * float[4] is not aligned; performing unaligned load/stores will require
a long sequence of carefully pipelined vector code to offset/break even on
that cost. If the sequence of ops is short, it will be faster to keep it in
the FPU.
  * float[4] allows component-wise access. This produces transfer of data
between the FPU and the SIMD unit. This may again negate the advantage of
using SIMD opcodes over the FPU directly.
  * loading a vectorised float[4] with floats calculated/stored on the FPU
produces the same hazards as above. SIMD regs should not be loaded with
data taken from the FPU if possible.
  * how do you express logic and comparisons? Chances are people will write
arbitrary component-wise comparisons. This requires flushing the values out
from the SIMD regs back to the FPU for comparisons, again, negating any
advantages of SIMD calculation.

The hazard I refer to almost universally is that of swapping data between
register types. This is a slow process, and breaks any possibility for
efficient pipelining.
FPU pipelines nicely:
  float[4] x; x += 1.0; // This will result in 4 sequential adds to
different registers; there are no data dependencies, so this will pipeline
beautifully, one cycle after another. This is probably only 3 cycles longer
than a SIMD add, plus a small cost for the extra opcodes in the instruction
stream

Any time you need to swap register type, the pipeline is broken, imagine something seemingly harmless, and totally logical like this:

float[4] hardwareVec; // compiler allows use of a hardware vector for
float[4]
hardwareVec[1] = groundHeight; // we want to set Y explicitly, seems
reasonable, perhaps we're snapping a position to a ground plane or something...

This may be achieved in some way that looks something like this:
 * groundHeight must be stored to the stack
 * flush pipeline (wait for the data to arrive) (potentially long time)
 * UNALIGNED load from stack into a vector register (this may require an
additional operation to rotate the vector into the proper position after
loading on some architectures)
 * flush pipeline (wait for data to arrive)
 * loaded float needs to be merged with the existing vector, this can be
done in a variety of ways
   - use a permute operation [only some architectures support arbitrary
permute, VMX is best] (one opcode, but requires pre-loading of a separate
permute control register to describe the appropriate merge, this load may
be expensive, and the data must be available)
   - use a series of shifts (requires 2 shifts for X or W, 3 shifts for Y
or Z), doesn't require any additional loads from memory, but each of the
shifts are dependent operations, and must flush the pipeline between them
   - use a mask and OR the 2 vectors together (since applying masks to both
the source and target vectors can be pipelined in parallel, and only the
final OR requires flushing the pipeline...)
   - [ note: none of these options is ideal, and each may be preferable
based on context in different situations]
 * done

Congratulations, you've now set the Y component. At the cost of an LHS
(load-hit-store) through memory, potentially other loads from memory, and 5-10 flushes of the
pipeline summing hundreds, maybe thousands of wasted cpu cycles..
In this same amount of wasted time, you could have done a LOT of work with
the FPU directly.

Process of same operation using just the FPU:
  * FPU stores groundHeight (already in an FPU reg) to &float[1]
  * done

And if the value is an intermediate and never needs to be stored on the stack, there's a chance the operation will be eliminated entirely, since the value is already in a float reg, ready for use in the next operation :)

I think the take-away I'm trying to illustrate here is:
SIMD work and scalar work do NOT mix... any syntax that allows it is a
mistake. Users won't understand all the details and implications of the
seemingly trivial operations they perform, and shouldn't need to.
Auto-vectorisation of float[4] will be some amazingly sophisticated code,
and very temperamental. If the compiler detects it can make some
optimisation, great, but it will not be reliable from a user point of view,
and it won't be clear what to change to make the compiler do a better job.
It also still implies policy problems, i.e., should float[4] be special-cased
to be aligned(16) when no other array requires this? What about all the
different types? How do you cast between them, and what are the expected results?

I think it's best to forget about float[4] as a candidate for reliable
auto-vectorisation. Perhaps there's an opportunity for some nice little
compiler bonuses, but it should not be the language's window into efficient
use of the hardware.
Anyone using float[4] should accept that they are working with the FPU, and
they probably won't suffer much for it. If they want/need aggressive SIMD
optimisation, then they need to use the appropriate API, and understand, at
least a little bit, how the hardware works... Ideally the well-defined SIMD
API will make it easiest to do the right thing, and they won't need to know
all these hardware details to make good use of it.


January 06, 2012
On Fri, 2012-01-06 at 16:09 +0100, Paulo Pinto wrote:
> From what I see in HPC conference papers and webcasts, I think it might
> already be too late for D
> in those scenarios.

Indeed, for core HPC that is true:  if you aren't using Fortran, C, C++, and Python you are not in the game.  The point is that HPC is really about using computers that cost a significant proportion of the USA national debt.  My thinking is that with Intel especially, looking to use the Moore's Law transistor count mountain to put heterogeneous many core systems on chip, i.e. arrays of CPUs connected to GPGPUs on chip, the programming languages used by the majority of programmers, not just those playing with multi-billion dollar kit, will have to be able to deal with heterogeneous models of computation.   The current model of separate compilation and loading of CPU code and GPGPU kernel is a hack to get things working in a world where tool chains are still about building 1970s single-threaded code.  This represents an opportunity for non C and C++ languages.  Python is beginning to take a stab at trying to deal with all this.  D would be another good candidate.  Java cannot be in this game without some serious updating of the JVM semantics -- an issue we debated a bit on this list a short time ago so no need to rehearse all the points.

It just strikes me as an opportunity to get D front and centre by having it provide a better development experience for these heterogeneous systems that are coming.

Sadly Santa failed to bring me a GPGPU card for Christmas so as to do experiments using C++, Python, OpenCL (and probably CUDA, though OpenCL is the industry standard now).  I will though be buying one for myself in the next couple of weeks.

> "Russel Winder"  wrote in message
> news:mailman.107.1325862128.16222.digitalmars-d@puremagic.com...
> On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
> [...]
> 
> Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA addicts) or OpenCL (for Apple addicts and others).  It would be good if D could just take over this market by being able to manage GPU kernels easily.  The risk is that PyCUDA and PyOpenCL beat D to market leadership.
> 



January 06, 2012
On 6 January 2012 17:01, Russel Winder <russel@russel.org.uk> wrote:

> On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
> [...]
> > I don't see any issue with this. An array of vectors makes perfect sense, and I see no reason why arrays/slices/etc of hardware vectors should be
> any
> > sort of problem.
> > This particular expression should be just as efficient as if it were an
> > array of flat floats, especially if the compiler unrolls it.
> >
> > D's array/slice syntax is something I'm very excited about actually in conjunction with hardware vectors. I could do some really elegant
> geometry
> > processing with slices from vertex streams.
>
> Excuse me for jumping in part way through, apologies if I have the "wrong end of the stick".
>
> As I understand it currently the debate to date has effectively revolved around how to have first class support in D for the SSE (vectorizing) capability of the x86 architecture.


No, I'm talking specifically about NOT making the type x86/SSE specific. Hence all my ramblings about a 'generalised'/typeless v128 type which can be used to express 128-bit SIMD hardware of any architecture.


> This immediately raises the
> questions in my mind:
>
> 1.  Should x86 specific things be reified in the D language.  Given that ARM and other architectures are increasingly more important than x86, D should not tie itself to x86.
>

The opcode intrinsics you use to interact with the generalised type will
be architecture specific, but this isn't the end point of my proposal. The
next step is to produce libraries which will use version() heavily behind
the API to collate different architectures into nice user-friendly vector
types.
Sadly vector units across architectures are too different to expose useful
vector types cleanly in the language directly, so libraries will do this,
making use of compiler defined architecture intrinsics behind lots of
version() statements.


> 2.  Is there a way of doing something in D so that GPGPU can be described?
>

I think this will map neatly to GPGPU. The vector types proposed will apply
to that hardware just fine.
This is a much bigger question though, the real problems are:
  * How do you compile/produce code that will run on the GPU? (do we have a
D->Cg compiler?)
  * How do you express the threading/concurrency aspects of GPGPU usage?
(this is way outside the scope of vector arithmetic)
  * How do you express the data sources available to GPUs? Constant files,
etc... (seems D actually had reasonably good language expressions for this
sort of thing)

Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
> addicts) or OpenCL (for Apple addicts and others).  It would be good if D could just take over this market by being able to manage GPU kernels easily.  The risk is that PyCUDA and PyOpenCL beat D to market leadership.
>

As said, I think these questions are way outside the scope of SIMD vector
libraries ;)
Although this is a fundamental piece of the puzzle, since GPGPU is no use
without SIMD type expression... but I think everything we've discussed here
so far will map perfectly to GPGPU.


January 06, 2012
Please don't start a flame war on this, I am just expressing an opinion.

I think that for heterogeneous computing we are better off with a language
that supports functional programming concepts.

From what I have seen in papers, many imperative languages have the issue
that they are too tied to the old homogeneous computing model we had on the
desktop. That is the main reason why C and C++ are starting to look like
Frankenstein languages, with all the extensions companies are adding to them
to support the new models.

Functional languages have the advantage that their hardware model is more abstract
and as such can be mapped more easily to heterogeneous hardware. This is also an area
where VM-based languages might have some kind of advantage, but I am not sure.

Now, D actually has quite a few tools to explore functional concepts, so I guess it could
take off in this area if enough HPC people got interested in it.

Regarding CUDA, you will surely know this better than me. I read somewhere that in most
research institutes people only care about CUDA, not OpenCL, because it is older
than OpenCL, and because of its C++ support, the available tools, and NVIDIA cards'
performance compared with ATI's in this area. But I don't have any experience here,
so I don't know how much of this is true.

--
Paulo


"Russel Winder"  wrote in message news:mailman.109.1325864213.16222.digitalmars-d@puremagic.com...
On Fri, 2012-01-06 at 16:09 +0100, Paulo Pinto wrote:
> From what I see in HPC conference papers and webcasts, I think it might
> already be too late for D
> in those scenarios.

Indeed, for core HPC that is true:  if you aren't using Fortran, C, C++,
and Python you are not in the game.  The point is that HPC is really
about using computers that cost a significant proportion of the USA
national debt.  My thinking is that with Intel especially, looking to
use the Moore's Law transistor count mountain to put heterogeneous many
core systems on chip, i.e. arrays of CPUs connected to GPGPUs on chip,
the programming languages used by the majority of programmers not just
those playing with multi-billion dollar kit, will have to be able to
deal with heterogeneous models of computation.   The current model of
separate compilation and loading of CPU code and GPGPU kernel is a hack
to get things working in a world where tool chains are still about
building 1970s single threaded code.  This represents an opportunity for
non C and C++ languages.  Python is beginning to take a stab at trying
to deal with all this.  D would be another good candidate.  Java cannot
be in this game without some serious updating of the JVM semantics -- an
issue we debated a bit on this list a short time ago so no need to
rehearse all the points.

It just strikes me as an opportunity to get D front and centre by having
it provide a better development experience for these heterogeneous
systems that are coming.

Sadly Santa failed to bring me a GPGPU card for Christmas so as to do
experiments using C++, Python, OpenCL (and probably CUDA, though OpenCL
is the industry standard now).  I will though be buying one for myself
in the next couple of weeks.

> "Russel Winder"  wrote in message
> news:mailman.107.1325862128.16222.digitalmars-d@puremagic.com...
> On Fri, 2012-01-06 at 16:35 +0200, Manu wrote:
> [...]
>
> Currently GPGPU is dominated by C and C++ using CUDA (for NVIDIA
> addicts) or OpenCL (for Apple addicts and others).  It would be good if
> D could just take over this market by being able to manage GPU kernels
> easily.  The risk is that PyCUDA and PyOpenCL beat D to market
> leadership.
>


January 06, 2012
That CUDA is used more is probably true; OpenCL is fugly C and no fun.

Microsoft's upcoming C++ AMP looks interesting as it lets you write GPU and CPU code in C++.  The spec is open so hopefully it becomes common to implement it in other C++ compilers.

SSE intrinsics in C++ are pretty essential for getting great performance, so I do think D needs something like this.  A problem with intrinsics in C++ has been poor support from compilers, often performing little or no optimization and just blindly issuing instructions as you listed them, causing all kinds of extra loads and stores.

Visual Studio is actually one of the worst C++ compilers for intrinsics; ICC is likely the best.

So even if D does add these new intrinsic functions, it would need to actually optimize around them to produce reasonably fast code.

I agree that the v128 type should be typeless: it is typeless in hardware, and this makes it easier to mix and match instructions.

January 06, 2012
On Fri, 06 Jan 2012 14:44:53 +0100, Manu <turkeyman@gmail.com> wrote:

> On 6 January 2012 14:56, Martin Nowak <dawg@dawgfoto.de> wrote:
>
>> On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright <
>> newshound2@digitalmars.com> wrote:
>>
>>> One caveat is it is typeless; a __v128 could be used as 4 packed ints or
>>> 2 packed doubles. One problem with making it typed is it'll add 10 more
>>> types to the base compiler, instead of one. Maybe we should just bite the
>>> bullet and do the types:
>>>
>>>     __vdouble2
>>>     __vfloat4
>>>     __vlong2
>>>     __vulong2
>>>     __vint4
>>>     __vuint4
>>>     __vshort8
>>>     __vushort8
>>>     __vbyte16
>>>     __vubyte16
>>>
>>
>> Those could be typedefs, i.e. alias this wrapper.
>> Still simdop would not be typesafe.
>>
>
> I think they should by well defined structs with lots of type safety and
> sensible methods. Not just a typedef of the typeless primitive.
>
>
>> As much as this proposal presents a viable solution,
>> why not spending the time to extend inline asm.
>>
>
> I think there are too many risky problems with the inline assembler (as
> raised in my discussion about supporting pseudo registers in inline asm
> blocks).
>   * No way to allow the compiler to assign registers (pseudo registers)
That's what I propose he should do. IMHO it's a huge improvement when
register variables can be used directly in asm.

int a, b;
__vec128 c;

asm (a, b, c)
{
    mov EAX, a;
    add b, EAX;
    movps XMM1, c;
    mulps c, XMM1;
}

The compiler has enough knowledge to do this, and it's the common basic block spilling
scheme that is used here.

There is another benefit.
Consider the following:

__vec128 addps(__vec128 a, __vec128 b) pure
{
    __vec128 res = a;

    if (__ctfe)
    {
        foreach(i; 0 .. 4)
           res[i] += b[i];
    }
    else
    {
        asm (b, res)
        {
            addps res, b;
        }
    }
    return res;
}

>   * Assembly blocks present problems for the optimiser, it's not reliable
> that it can optimise around an inline asm blocks. How bad will it be when
> trying to optimise around 100 small inlined functions each containing its
> own inline asm blocks?
What do you mean by optimizing around? I don't see any apparent reason why that
should perform worse than using intrinsics.

The only implementation issue could be that lots of inlined asm snippets
create plenty of basic blocks, which could slow down certain compiler algorithms.

>   * D's inline assembly syntax has to be carefully translated to GCC's
> inline asm format when using GCC, and this needs to be done
> PER-ARCHITECTURE, which Iain should not be expected to do for all the
> obscure architectures GCC supports.
>
???
This would be needed for opcodes as well. Your initial goal was to directly influence
code gen down to the instruction level; how should that be achieved without platform-specific
extensions? Quite the contrary: with ops and asm he will need two hack paths into gcc's codegen.

What I see here is that we can do much good things to the inline
assembler while achieving the same goal.
With intrinsics on the other hand we're adding a very specialized
maintenance burden.
>
>> What would be needed?
>>  - Implement the asm allocation logic.
>>  - Functions containing asm statements should participate in inlining.
>>  - Determining inline cost of asm statements.
>>
>
> I raised these points in my other thread, these are all far more
> complicated problems I think than exposing opcode intrinsics would be.
> Opcode intrinsics are almost certainly the way to go.
>
> When being used with typedefs for __vubyte16 et al. this would
>> allow a really clean and simple library implementation of intrinsics.
>>
>
> The type safety you're imagining here might actually be annoying when
> working with the raw type and opcodes..
> Consider this common situation and the code that will be built around it:
> __v128 vec = { floatX, floatY, floatZ, /*unsigned int*/ packedColour }; //
That is really not a good idea if the bit pattern of packedColour is a denormal.
How can you even execute a single useful instruction on the floats here?

Also mixing integer and FP instructions on the same register may
cause performance degradation. The registers are indeed typed internally in the CPU.

> pack
> some other useful data in W
> If vec were strongly typed, I would now need to start casting all over the
> place to use various float and uint opcodes on this value?
> I think it's correct when using SIMD at the raw level to express the type
> as it is, typeless... SIMD regs are infact typeless regs, they only gain
> concept of type the moment you perform an opcode on it, and only for the
> duration of that opcode.
>
> You will get your strong type safety when you make use of the float4 types
> which will be created in the libs.