So I've been hassling people about this for a while now, and Walter asked me to put together an email detailing a minimal implementation, along with some initial thoughts.

The first thing I'd like to say is that a lot of people seem to have this idea that float[4] should somehow be specialised as a candidate for SIMD optimisations. It's obviously been discussed, and this general opinion seems to be shared by a good few people here.
I've had a whole bunch of rants about why I think this is wrong in other threads, so I won't repeat them here... instead, I'll attempt to detail an approach based on explicit vector types.

So, what do we need...? A language defined primitive vector type... that's all.


-- What shall we call it? --

Doesn't really matter... open to suggestions.
VisualC calls it __m128, the Xbox 360 calls it __vector4, and GCC calls it 'vector float' (a name I particularly hate, since it specifies no size and tries to associate the register with a specific element type).

I like v128, or something like that. I'll use that for the sake of this document. I think it is preferable to float4 for a few reasons:
 * v128 says what the type really is: a general purpose 128-bit register that may be used for a variety of SIMD operations that aren't necessarily type bound.
 * float4 implies it is a specific 4-component float type, which is not what the raw type should be.
 * If we use names like float4, it stands to reason that (u)int4, (u)short8, etc should also exist, and one might then expect math operators and such to be defined on them...

I suggest an initial language definition and implementation of something like v128; types like float4, (u)int4, etc, can then be implemented in the standard library, with more complex behaviour like casting mechanics and basic math operators...
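
As a rough sketch of the layering I have in mind (simd.add() here is a hypothetical per-architecture intrinsic standing in for whatever the compiler exposes, not something that exists yet):
  struct float4
  {
    v128 v; // the raw register

    float4 opBinary(string op : "+")(float4 rhs)
    {
      return float4(simd.add(v, rhs.v)); // forwards to whatever the target architecture provides
    }
    // ... likewise for "-", "*", "/", comparisons, casts, etc.
  }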


-- Alignment --

This type needs to be 16-byte aligned. Unaligned loads/stores are very expensive, and also tend to produce extremely costly LHS (load-hit-store) hazards on most architectures when accessing vectors in arrays. If they are not aligned, they are useless... honestly.

** Does this cause problems with class allocation? Can classes be allocated with an alignment inherited from an aligned member? ... If not, this might be the bulk of the work.
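
To illustrate the concern (hypothetical types, obviously):
  struct Particle
  {
    v128 position; // must land on a 16-byte boundary
    v128 velocity;
    float life;
  }
  Particle[1000] particles; // the aggregate, and any class or array containing it, must inherit the 16-byte alignment, or every vector access above becomes an unaligned load/store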

There is one other problem I know of that is only of concern on x86.
In the C ABI, passing 16-byte ALIGNED vectors by value is a problem, since x86 ALWAYS uses the stack to pass arguments, and has no way to guarantee the stack's alignment.
I wonder if D can get creative with its ABI here and pass vectors in registers, even though that's not conventional on x86... the C ABI was invented long before these hardware features existed.
Failing that, x86 would (sadly) need to silently pass by const ref... and also do this in the case of register overflow.
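
To illustrate that fallback (a purely hypothetical lowering, not an existing ABI rule):
  v128 transform(v128 v);             // what you write
  v128 transform(ref const(v128) v);  // what the compiler would silently call on x86 under the hood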

Every other architecture (including x64) is fine, since they all pass vectors in registers, and can align the stack as needed when overflowing into memory (stack management there is manual, not performed with special opcodes).


-- What does this type do? --

The primitive v128 type DOES nothing... it is a type that facilitates the compiler allocating SIMD registers, managing assignments, loads, and stores, and allows passing to/from functions BY VALUE in registers.
Ie, the only valid operations would be:
  v128 myVec = someStruct.vecMember; // and vice versa...
  v128 result = someFunc(myVec); // and calling functions, passing by value.

Nice bonus: This alone is enough to allow implementation of fast memcpy functions that copy 16 bytes at a time... ;)
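
Something like this (a minimal sketch, assuming the hypothetical v128 type above, 16-byte-aligned pointers, and a size that's a multiple of 16):
  void copy16(void* dst, const(void)* src, size_t bytes)
  {
    auto d = cast(v128*)dst;
    auto s = cast(const(v128)*)src;
    foreach (i; 0 .. bytes / 16)
      d[i] = s[i]; // each assignment is one aligned SIMD load + one aligned SIMD store
  }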


-- So, it does nothing... so what good is it? --

Initially you could use this type in conjunction with inline asm or architecture intrinsics to do useful stuff. This would be using the hardware totally raw, which is an important feature to have, but I imagine most of the good stuff would come from libraries built on top of this.
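
For example, something like this (a rough, untested sketch using DMD-style x86 inline asm and the hypothetical v128 type; obviously per-architecture):
  v128 addFloats(v128 a, v128 b)
  {
    v128 r;
    asm
    {
      movaps XMM0, a; // aligned 16-byte load into an SSE register
      addps  XMM0, b; // component-wise single-precision add
      movaps r, XMM0; // aligned store of the result
    }
    return r;
  }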


-- Literal assignment --

This is a hairy one. Endian issues appear in 2 layers here...
Firstly, if you consider the vector to be 4 ints, the ints themselves may be little or big endian; but in addition, the outer layer (i.e. the order of x,y,z,w) may also be in reverse order on some architectures... This makes a single 128-bit hex literal hard to apply.
I'll have a dig and try to confirm this, but I have a suspicion that VMX defines its components in reverse order to other architectures... (Note: this is not usually a problem in C, because vector code is sooo non-standard in C that it is ALWAYS ifdef-ed for each platform anyway, and the literal syntax and order can suit each one.)

For the primitive v128 type, I generally like the idea of using a huge 128-bit hex literal.
  v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)

Since the primitive v128 type is effectively typeless, it makes no sense to use syntax like this:
  v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f }; // syntax like this should be reserved for use with a float4 type defined in a library somewhere.

... The problem is, this may not be directly applicable to all hardware. If the order of the components matches the endianness, then it is fine...
I suspect VMX orders the components in reverse to match the fact that the values are big endian, which would be good, but I need to check. And if not... then literals may need to get a lot more complicated :)

Assignment of literals to the primitive type IS actually important; it's common to generate type-independent bit masks in these registers. I also expect libraries will need to leverage this primitive assignment functionality to implement their own more complex literal expressions.
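
For example, the classic abs() mask (using the proposed literal syntax above; andMask() stands in for a hypothetical bitwise-AND intrinsic):
  v128 absMask = 0x7FFFFFFF_7FFFFFFF_7FFFFFFF_7FFFFFFF; // clear each lane's sign bit
  float4 absVec = andMask(absMask, someVec); // a library abs() is just one AND against this mask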


-- Libraries --

With this type, we can write some useful standard libraries. For a start, we can consider adding float4, int4, etc, and make them more intelligent... they would have basic maths operators defined, and probably implement type conversion when casting between types.

  int4 intVec = floatVec; // performs a type conversion from float to int... or vice versa... (perhaps we make this require an explicit cast?)

  v128 vec = floatVec; // implicit cast to the raw type is always possible, and does no type conversion, just a reinterpret
  int4 intVec = vec; // conversely, the primitive type would implicitly assign to other types.
  int4 intVec = cast(v128)floatVec; // piping through the primitive v128 makes it easy to reinterpret between vector types, rather than performing the usual type conversion.

There are also a truckload of other operations that would need to be fleshed out. For instance, strongly typed literal assignment, and vector comparisons that can be used with if() (usually these let you test whether ALL components, or ANY component, meets a given condition). Conventional logic operators can't be neatly applied to vectors; you need to do something like this:
  if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...
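
For reference, a sketch of how such helpers might look in the library (cmpGreaterMask() is a hypothetical intrinsic returning a 4-bit lane mask, e.g. via SSE's movmskps):
  bool allGreater(float4 a, float4 b)
  {
    return cmpGreaterMask(a, b) == 0b1111; // every lane satisfied a > b
  }
  bool anyLessOrEqual(float4 a, float4 b)
  {
    return cmpGreaterMask(a, b) != 0b1111; // at least one lane failed a > b
  }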

We can discuss the libraries at a later date, but it's possible that you might also want to add some advanced functions to the library that are only supported on particular architectures, std.simd.sse..., std.simd.vmx..., etc., which may be version()-ed.


-- Exceptions, flags, and error conditions --

SIMD units usually have their own control register for controlling various behaviours, most importantly NaN policy and exception semantics...
I'm open to input here... what should the default behaviour be?
I'll bet the D community opts for strict NaNs and throwing by default... but it is actually VERY common to disable hardware exceptions when working with SIMD code:
  * often precision is less important than speed when using SIMD, and some SIMD units perform faster when these features are disabled.
  * most SIMD algorithms (at least in performance oriented code) are designed to tolerate '0,0,0,0' as the result of a divide by zero, or some other error condition.
  * realtime physics tends to suffer error creep and freaky random explosions, and you can't have those crashing the program :) ... they're not really 'errors', they're expected behaviour, often producing 0,0,0,0 as a result, so they're easy to deal with.

I presume it'll end up being NaNs and throw by default, but we do need some mechanism to change the SIMD unit flags for realtime use... A runtime function? Perhaps a compiler switch (C compilers do this sort of thing a lot)?
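
For example, on x86 the relevant state lives in SSE's MXCSR register; a runtime helper might look something like this (untested sketch, DMD-style inline asm, and obviously every architecture needs its own version):
  void setFastSimdMode()
  {
    uint csr;
    asm { stmxcsr csr; } // read the current control/status word
    csr |= 0x1F80;       // mask all floating point exceptions
    csr |= 0x8040;       // flush-to-zero (bit 15) and denormals-are-zero (bit 6)
    asm { ldmxcsr csr; } // write it back
  }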

It's also worth noting that there are numerous SIMD units out there that DON'T follow strict IEEE float rules, and don't support NaNs or hardware exceptions at all... others may simply set a divide-by-zero flag, but not actually trigger a hardware exception, requiring you to explicitly check the flag if you're interested.
Will it be okay that the language's default behaviour of NaNs and throws is unsupported on such platforms? What are the implications of this?


-- Future --

AVX now exists; this is a 256-bit SIMD architecture. We simply add a v256 type, and everything else is precisely the same.
I think this is perfectly reasonable... AVX is to SSE exactly as long is to int, or double is to float. They are different types with different register allocation and addressing semantics, and deserve a discrete type.
As with v128, libraries may then be created to allow the types to interact.

I know of 2 architectures that support 512-bit (4x4 matrix) registers... same story: implement a primitive type, and then, using intrinsics, we can build interesting types in libraries.

We may also consider a v64 type, which would map to the older MMX registers on x86... there are also other architectures with 64-bit 'vector' registers (the Nintendo Wii for one), supporting a pair of floats, or 4 shorts, etc...
Same general concept, but only 64 bits wide.


-- Conclusion --

I think that's about it for a start. I don't think it's a particularly large amount of work; the potential trouble points are 16-byte alignment and literal expression, plus potential issues relating to language guarantees around exception/error conditions...
Go on, tear it apart!

Discuss...