January 06, 2012
SIMD support...
So I've been hassling people about this for a while now, and Walter asked me
to pitch an email detailing a minimal implementation with some initial
thoughts.

The first thing I'd like to say is that a lot of people seem to have this
idea that float[4] should somehow be specialised as a candidate for SIMD
optimisations. It's obviously been discussed, and this general opinion seems
to be shared by a good few people here.
I've had a whole bunch of rants about why I think this is wrong in other
threads, so I won't repeat them here... that said, I'll attempt to detail an
approach based on explicit vector types.

So, what do we need...? A language defined primitive vector type... that's
all.


-- What shall we call it? --

Doesn't really matter... open to suggestions.
Visual C++ calls it __m128, the Xbox 360 compiler calls it __vector4, and GCC
calls it 'vector float' (a name I particularly hate: it doesn't specify a
size, and it tries to tie the register to a specific element type).

I like v128, or something like that; I'll use that for the sake of this
document. I think it is preferable to float4 for a few reasons:
* v128 says what the register intends to be: a general-purpose 128-bit
register that may be used for a variety of SIMD operations that aren't
necessarily type bound.
* float4 implies it is a specific 4-component float type, which is not
what the raw type should be.
* If we use names like float4, it stands to reason that (u)int4,
(u)short8, etc. should also exist, and it also stands to reason that one
might expect math operators and such to be defined...

I suggest initial language definition and implementation of something like
v128, and then types like float4, (u)int4, etc, may be implemented in the
std library with complex behaviour like casting mechanics, and basic math
operators...


-- Alignment --

This type needs to be 16-byte aligned. Unaligned loads/stores are very
expensive, and also tend to produce extremely costly load-hit-store (LHS)
hazards on most architectures when accessing vectors in arrays. If they are
not aligned, they are useless... honestly.

** Does this cause problems with class allocation? Can classes be allocated
with an alignment inherited from an aligned member? ... If not, this might be
the bulk of the work.

There is one other problem I know of that is only of concern on x86.
In the C ABI, passing 16-byte ALIGNED vectors by value is a problem, since
x86 always passes arguments on the stack and has no way to align the stack.
I wonder if D can get creative with its ABI here, passing vectors in
registers, even though that's not conventional on x86... the C ABI was
invented long before these hardware features existed.
In lieu of that, x86 would (sadly) need to silently pass by const ref... and
also do this in the case of register overflow.

Every other architecture (including x64) is fine, since all other
architectures pass in regs, and can align the stack as needed when
overflowing the regs (since stack management is manual and not performed
with special opcodes).
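To make the cost concrete, here's what the aligned/unaligned distinction looks like today in C with SSE intrinsics (plain C rather than D, since the proposed v128 doesn't exist yet; the function and array names are mine):

```c
#include <assert.h>
#include <stdalign.h>
#include <immintrin.h>

/* 16-byte aligned storage: _mm_load_ps requires this alignment and will
 * fault on a misaligned address; _mm_loadu_ps tolerates any address but
 * has historically been much slower -- that's the cost discussed above. */
alignas(16) float quad[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

float sum4_aligned(const float *p) /* p MUST be 16-byte aligned */
{
    __m128 v = _mm_load_ps(p);   /* aligned 128-bit load */
    float lanes[4];
    _mm_storeu_ps(lanes, v);     /* spill the lanes so we can inspect them */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

An aligned v128 member inside a class is only useful if the class itself inherits that 16-byte alignment, which is exactly the allocation question above.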


-- What does this type do? --

The primitive v128 type DOES nothing... it is a type that facilitates the
compiler allocating SIMD registers, managing assignments, loads, and
stores, and allows passing to/from functions BY VALUE in registers.
I.e., the only valid operations would be:
 v128 myVec = someStruct.vecMember; // and vice versa...
 v128 result = someFunc(myVec); // and calling functions, passing by value.

Nice bonus: This alone is enough to allow implementation of fast memcpy
functions that copy 16 bytes at a time... ;)
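As a sketch of that bonus, here is the core of such a copy in C with SSE2 intrinsics (copy16 and the buffers are hypothetical names; a real memcpy would also handle the ragged edges):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <immintrin.h>

/* Copy n bytes 16 at a time. Assumes (as an aligned v128 type would
 * guarantee) that dst and src are 16-byte aligned and n is a multiple
 * of 16. */
void copy16(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i)); /* one 128-bit move */
}

alignas(16) char src_buf[32];
alignas(16) char dst_buf[32];
```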


-- So, it does nothing... so what good is it? --

Initially you could use this type in conjunction with inline asm, or
architecture intrinsics to do useful stuff. This would be using the
hardware totally raw, which is an important feature to have, but I imagine
most of the good stuff would come from libraries built on top of this.
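For a flavour of that "totally raw" usage, here's a four-wide multiply-accumulate via C intrinsics — the kind of primitive a library would wrap (madd4 is a name of my choosing, not an existing API):

```c
#include <assert.h>
#include <immintrin.h>

/* Raw use of the hardware through intrinsics: multiply and add four
 * float lanes at once, instead of four scalar multiply-adds. */
__m128 madd4(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}
```

With a language-defined v128 type, a D library could expose exactly this behind a friendly name, with version()-ed intrinsics per architecture.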


-- Literal assignment --

This is a hairy one. Endian issues appear at two layers here...
Firstly, if you consider the vector to be 4 ints, the ints themselves may be
little or big endian; but in addition, the outer layer (i.e. the order of
x,y,z,w) may also be in reverse order on some architectures... This makes a
single 128-bit hex literal hard to apply.
I'll have a dig and try to confirm this, but I have a suspicion that VMX
defines its components in the reverse order to other architectures... (Note:
not usually a problem in C, because vector code is sooo non-standard in C
that this is ALWAYS ifdef-ed for each platform anyway, and the literal syntax
and order can suit.)

For the primitive v128 type, I generally like the idea of using a huge
128bit hex literal.
 v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)

Since the primitive v128 type is effectively typeless, it makes no sense to
use syntax like this:
 v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f }; // syntax like this should be
reserved for use with a float4 type defined in a library somewhere.

... The problem is, this may not map cleanly to all hardware. If the order of
the components matches the endian, then it is fine...
I suspect VMX orders the components in reverse to match the fact that the
values are big endian, which would be good, but I need to check. And if
not... then literals may need to get a lot more complicated :)

Assignment of literals to the primitive type IS actually important: it's
common to generate bit masks in these registers, and bit masks are
type-independent.
I also expect libraries will need to leverage this primitive assignment
functionality to implement their more complex literal expressions.
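Here's a typical type-independent bit mask in action, sketched in C with SSE intrinsics (abs4 is my name for it): clearing the sign bit of each float lane gives a four-wide fabs().

```c
#include <assert.h>
#include <immintrin.h>

/* The 0x7FFFFFFF mask is built with integer ops and merely reinterpreted
 * as float -- no value conversion happens, the bits pass through as-is. */
__m128 abs4(__m128 v)
{
    const __m128i mask = _mm_set1_epi32(0x7FFFFFFF); /* clear sign bits */
    return _mm_and_ps(v, _mm_castsi128_ps(mask));
}
```

A 128-bit hex literal assigned to a raw v128 would express exactly this kind of pattern directly.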


-- Libraries --

With this type, we can write some useful standard libraries. For a start,
we can consider adding float4, int4, etc, and make them more intelligent...
they would have basic maths operators defined, and probably implement type
conversion when casting between types.

 int4 intVec = floatVec; // perform a type conversion from float to int,
or vice versa... (perhaps we make this require an explicit cast?)

 v128 vec = floatVec; // implicit cast to the raw type is always possible,
and does no type conversion, just a reinterpret
 int4 intVec = vec; // conversely, the primitive type would implicitly
assign to other types.
 int4 intVec = (v128)floatVec; // piping through the primitive v128 makes
it easy to perform a reinterpret between vector types, rather than the
usual type conversion.
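The conversion/reinterpret distinction maps directly onto today's SSE intrinsics; a C sketch (the two function names are hypothetical):

```c
#include <assert.h>
#include <immintrin.h>

/* int4 intVec = floatVec;       -- a value conversion: 1.0f becomes 1 */
__m128i to_int_convert(__m128 f)     { return _mm_cvttps_epi32(f); }

/* int4 intVec = (v128)floatVec; -- a reinterpret: the raw bits of 1.0f
 * (0x3F800000) pass through untouched, no instruction is even emitted */
__m128i to_int_reinterpret(__m128 f) { return _mm_castps_si128(f); }
```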

There is also a truckload of other operations to be fleshed out. For
instance, strongly typed literal assignment, and vector comparisons that can
be used with if() (usually these let you test whether ALL components, or ANY
component, meets a given condition). Conventional logic operators can't be
neatly applied to vectors; you need to do something like this:
 if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...
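Such library functions would likely reduce to a lane-wise compare plus a mask collapse; a C/SSE sketch of the two hypothetical std.simd functions above (names mirrored from the example, not an existing API):

```c
#include <assert.h>
#include <stdbool.h>
#include <immintrin.h>

/* A lane-wise compare produces all-ones or all-zero lanes; movemask
 * collapses the four lane sign bits into a 4-bit integer we can test. */
bool all_greater(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpgt_ps(a, b)) == 0xF; /* every lane set */
}

bool any_less_or_equal(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmple_ps(a, b)) != 0;   /* at least one lane */
}
```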

We can discuss the libraries at a later date, but it's possible that you
might also want to make some advanced functions in the library that are
only supported on particular architectures, std.simd.sse...,
std.simd.vmx..., etc. which may be version()-ed.


-- Exceptions, flags, and error conditions --

SIMD units usually have their own control register for controlling various
behaviours, most importantly NaN policy and exception semantics...
I'm open to input here... what should be default behaviour?
I'll bet the D community opts for strict NaNs and throwing by default... but
it is actually VERY common to disable hardware exceptions when working with
SIMD code:
 * often precision is less important than speed when using SIMD, and some
SIMD units perform faster when these features are disabled.
 * most SIMD algorithms (at least in performance oriented code) are
designed to tolerate '0,0,0,0' as the result of a divide by zero, or some
other error condition.
 * realtime physics tends to suffer error creep and freaky random
explosions, and you can't have those crashing the program :) .. they're not
really 'errors', they're expected behaviour, often producing 0,0,0,0 as a
result, so they're easy to deal with.

I presume it'll end up being NaNs and throw by default, but we do need some
mechanism to change the SIMD unit flags for realtime use... A runtime
function? Perhaps a compiler switch (C does this sort of thing a lot)?
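On x86 that control register is MXCSR, and a "runtime function" for realtime use could be as small as this C sketch (simd_fast_mode is a hypothetical name; note SSE exceptions are already masked by default on x86, so the OR is illustrative):

```c
#include <assert.h>
#include <immintrin.h>

/* Flip the two switches discussed above: flush-to-zero for speed, and
 * mask the divide-by-zero exception so 1/0 quietly yields +inf instead
 * of trapping. */
void simd_fast_mode(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_EXCEPTION_MASK(_MM_GET_EXCEPTION_MASK() | _MM_MASK_DIV_ZERO);
}
```

Whatever mechanism D chooses would need a per-architecture equivalent of this, since every SIMD unit has its own flag layout.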

It's also worth noting that there are numerous SIMD units out there that
DON'T follow strict IEEE float rules, and don't support NaNs or hardware
exceptions at all... others may simply set a divide-by-zero flag, but not
actually trigger a hardware exception, requiring you to explicitly check the
flag if you're interested.
Will it be okay that the language's default behaviour of NaNs and throws is
unsupported on such platforms? What are the implications of this?


-- Future --

AVX now exists; this is a 256-bit SIMD architecture. We simply add a v256
type, and everything else is precisely the same.
I think this is perfectly reasonable... AVX is to SSE exactly as long is to
int, or double is to float. They are different types with different register
allocation and addressing semantics, and deserve a discrete type.
As with v128, libraries may then be created to allow the types to interact.

I know of 2 architectures that support 512bit (4x4 matrix) registers...
same story; implement a primitive type, then using intrinsics, we can build
interesting types in libraries.

We may also consider a v64 type, which would map to the older MMX registers
on x86... there are also other architectures with 64-bit 'vector' registers
(the Nintendo Wii for one), supporting a pair of floats, or 4 shorts, etc...
Same general concept, but only 64 bits wide.


-- Conclusion --

I think that's about it for a start. I don't think it's a particularly large
amount of work; the potential trouble points are 16-byte alignment and
literal expression, plus the potential issues relating to language
guarantees around exception/error conditions...
Go on, tear it apart!

Discuss...
January 06, 2012
Re: SIMD support...
On 1/5/2012 5:42 PM, Manu wrote:
> The first thing I'd like to say is that a lot of people seem to have this idea
> that float[4] should be specialised as a candidate for simd optimisations
> somehow. It's obviously been discussed, and this general opinion seems to be
> shared by a good few people here.
> I've had a whole bunch of rants why I think this is wrong in other threads, so I
> won't repeat them here...

If you could cut&paste them here, I would find it most helpful. I have some 
ideas on making that work, but I need to know everything wrong with it first.
January 06, 2012
Re: SIMD support...
On 1/5/2012 5:42 PM, Manu wrote:
> So I've been hassling about this for a while now, and Walter asked me to pitch
> an email detailing a minimal implementation with some initial thoughts.

Another question:

Is this worth doing for 32 bit code? Or is anyone doing this doing it for 64 bit 
only?

The reason I ask is because 64 bit is 16 byte aligned, but aligning the stack in 
32 bit code is inefficient for everything else.
January 06, 2012
Re: SIMD support...
On 1/5/2012 5:42 PM, Manu wrote:
> -- Alignment --
>
> This type needs to be 16byte aligned. Unaligned loads/stores are very expensive,
> and also tend to produce extremely costly LHS hazards on most architectures when
> accessing vectors in arrays. If they are not aligned, they are useless... honestly.
>
> ** Does this cause problems with class allocation? Are/can classes be allocated
> to an alignment as inherited from an aligned member? ... If not, this might be
> the bulk of the work.

The only real issue with alignment is getting the stack aligned to 16 bytes. 
This is already true of 64 bit code gen, and 32 bit code gen for OS X.
January 06, 2012
Re: SIMD support...
On 6 January 2012 04:12, Walter Bright <newshound2@digitalmars.com> wrote:

> If you could cut&paste them here, I would find it most helpful. I have
> some ideas on making that work, but I need to know everything wrong with it
> first.
>

On 5 January 2012 11:02, Manu <turkeyman@gmail.com> wrote:

> On 5 January 2012 02:42, bearophile <bearophileHUGS@lycos.com> wrote:
>
>> Think about future CPU evolution with SIMD registers 128, then 256, then
>> 512, then 1024 bits long. In theory a good compiler is able to use them
>> with no changes in the D code that uses vector operations.
>>
>
> These are all fundamentally different types, like int and long.. float and
> double... and I certainly want a keyword to identify each of them. Even if
> the compiler is trying to make auto vector optimisations, you can't deny
> programmers explicit control to the hardware when they want/need it.
> Look at x86 compilers, been TRYING to perform automatic SSE optimisations
> for 10 years, with basically no success... do you really think you can do
> better than all that work by Microsoft and GCC?
> In my experience, I've even run into a lot of VC's auto-SSE-ed code that
> is SLOWER than the original float code.
> Let's not even mention architectures that receive much less love than x86,
> and are arguably more important (ARM; slower, simpler processors with more
> demand to perform well, and not waste power)
>

...

> Vector ops and SIMD ops are different things. float[4] (or more
> realistically, float[3]) should NOT be a candidate for automatic SIMD
> implementation, likewise, simd_type should not have its components
> individually accessible. These are operations the hardware can not actually
> perform. So no syntax to worry about, just a type.
>
>
>> I think the good Hara will be able to implement those syntax fixes in a
>> matter of just one day or very few days if a consensus is reached about
>> what actually is to be fixed in D vector ops syntax.
>>
>
>
>> Instead of discussing about *adding* something (register intrinsics) I
>> suggest to discuss about what to fix about the *already present* vector op
>> syntax. This is not a request to just you Manu, but to this whole newsgroup.
>>
>
> And I think this is exactly the wrong approach. A vector is NOT an array
> of 4 (actually, usually 3) floats. It should not appear as one. This is
> overly complicated and ultimately wrong way to engage this hardware.
> Imagine the complexity in the compiler to try and force float[4]
> operations into vector arithmetic vs adding a 'v128' type which actually
> does what people want anyway... What about when when you actually WANT a
> float[4] array, and NOT a vector?
>
> SIMD units are not float units, they should not appear like an aggregation
> of float units. They have:
>  * Different error semantics, exception handling rules, sometimes
> different precision...
>  * Special alignment rules.
>  * Special literal expression/assignment.
>  * You can NOT access individual components at will.
>  * May be reinterpreted at any time as float[1] float[4] double[2]
> short[8] char[16], etc... (up to the architecture intrinsics)
>  * Can not be involved in conventional comparison logic (array of floats
> would make you think they could)
>  *** Can NOT interact with the regular 'float' unit... Vectors as an array
> of floats certainly suggests that you can interact with scalar floats...
>
> I will use architecture intrinsics to operate on these regs, and put that
> nice and neatly behind a hardware vector type with version()'s for each
> architecture, and an API with a whole lot of sugar to make them nice and
> friendly to use.
>
> My argument is that even IF the compiler some day attempts to make vector
> optimisations to float[4] arrays, the raw hardware should be exposed first,
> and allow programmers to use it directly. This starts with a language
> defined (platform independent) v128 type.
>

...

Other rants have been on IRC.
January 06, 2012
Re: SIMD support...
On 6 January 2012 04:16, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/5/2012 5:42 PM, Manu wrote:
>
>> -- Alignment --
>>
>> This type needs to be 16byte aligned. Unaligned loads/stores are very
>> expensive,
>> and also tend to produce extremely costly LHS hazards on most
>> architectures when
>> accessing vectors in arrays. If they are not aligned, they are useless...
>> honestly.
>>
>> ** Does this cause problems with class allocation? Are/can classes be
>> allocated
>> to an alignment as inherited from an aligned member? ... If not, this
>> might be
>> the bulk of the work.
>>
>
> The only real issue with alignment is getting the stack aligned to 16
> bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS
> X.
>

It's important for all implementations of SIMD units: x32, x64, and others.
As I said, if aligning the x32 stack is too much trouble, I suggest silently
passing by const ref on x86.

Are you talking about for parameter passing, or for local variable
assignment on the stack?
For parameter passing, I understand the x32 problems with aligning the
arguments (I think it's possible to work around though), but there should
be no problem with aligning the stack for allocating local variables.
January 06, 2012
Re: SIMD support...
>
> The reason I ask is because 64 bit is 16 byte aligned, but aligning the
> stack in 32 bit code is inefficient for everything else.
>

Note: you only need to align the stack when a vector is actually stored on
it by value. Probably very rare, more rare than you think.
January 06, 2012
Re: SIMD support...
On 1/5/2012 6:25 PM, Manu wrote:
> Are you talking about for parameter passing, or for local variable assignment on
> the stack?
> For parameter passing, I understand the x32 problems with aligning the arguments
> (I think it's possible to work around though), but there should be no problem
> with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for 
how to do it efficiently.
January 06, 2012
Re: SIMD support...
On 6 January 2012 05:22, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/5/2012 6:25 PM, Manu wrote:
>
>> Are you talking about for parameter passing, or for local variable
>> assignment on
>> the stack?
>> For parameter passing, I understand the x32 problems with aligning the
>> arguments
>> (I think it's possible to work around though), but there should be no
>> problem
>> with aligning the stack for allocating local variables.
>>
>
> Aligning the stack. Before I say anything, I want to hear your suggestion
> for how to do it efficiently.
>

Perhaps I misunderstand, I can't see the problem?
In the function preamble, you just align it... something like:
 mov reg, esp ; take a backup of the stack pointer
 and esp, -16 ; align it

... function

 mov esp, reg ; restore the stack pointer
 ret 0
January 06, 2012
Re: SIMD support...
On 6 January 2012 05:42, Manu <turkeyman@gmail.com> wrote:

> On 6 January 2012 05:22, Walter Bright <newshound2@digitalmars.com> wrote:
>
>> On 1/5/2012 6:25 PM, Manu wrote:
>>
>>> Are you talking about for parameter passing, or for local variable
>>> assignment on
>>> the stack?
>>> For parameter passing, I understand the x32 problems with aligning the
>>> arguments
>>> (I think it's possible to work around though), but there should be no
>>> problem
>>> with aligning the stack for allocating local variables.
>>>
>>
>> Aligning the stack. Before I say anything, I want to hear your suggestion
>> for how to do it efficiently.
>>
>
> Perhaps I misunderstand, I can't see the problem?
> In the function preamble, you just align it... something like:
>   mov reg, esp ; take a backup of the stack pointer
>   and esp, -16 ; align it
>
> ... function
>
>   mov esp, reg ; restore the stack pointer
>   ret 0
>

That said, most of the time values used in smaller functions will only ever
exist in regs, and won't ever be written to the stack... in this case,
there's no problem.