Thread overview
SIMD support...
January 06, 2012
So I've been hassling about this for a while now, and Walter asked me to pitch an email detailing a minimal implementation with some initial thoughts.

The first thing I'd like to say is that a lot of people seem to have the
idea that float[4] should somehow be specialised as a candidate for SIMD
optimisations. It's obviously been discussed, and this opinion seems to be
shared by a good few people here.
I've ranted about why I think this is wrong in other threads, so I won't
repeat it all here... that said, I'll attempt to detail an approach based on
explicit vector types.

So, what do we need...? A language defined primitive vector type... that's all.


-- What shall we call it? --

Doesn't really matter... open to suggestions.
VisualC calls it __m128, Xbox 360 calls it __vector4, GCC calls it 'vector
float' (a name I particularly hate: it doesn't specify a size, and it ties
the register to a specific element type).

I like v128, or something like that. I'll use that for the sake of this
document. I think it is preferable to float4 for a few reasons:
 * v128 says what the register intends to be: a general purpose 128-bit
register that may be used for a variety of SIMD operations that aren't
necessarily type bound.
 * float4 implies it is a specific 4-component float type, which is not
what the raw type should be.
 * If we use names like float4, it stands to reason that (u)int4,
(u)short8, etc. should also exist, and that one might expect math operators
and such to be defined...

I suggest an initial language definition and implementation of something like v128; types like float4, (u)int4, etc. may then be implemented in the std library with more complex behaviour, like casting mechanics and basic math operators...


-- Alignment --

This type needs to be 16-byte aligned. Unaligned loads/stores are very expensive, and also tend to produce extremely costly LHS (load-hit-store) hazards on most architectures when accessing vectors in arrays. If they are not aligned, they are useless... honestly.

** Does this cause problems with class allocation? Can classes be allocated to an alignment inherited from an aligned member? ... If not, this might be the bulk of the work.
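To make the alignment requirement concrete, here's a small sketch in C with SSE intrinsics (the proposed v128 type doesn't exist yet, so C's __m128 stands in for it); the `v128_storage` wrapper and `sum_lanes` helper are hypothetical names for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical stand-in for a v128 member: a 16-byte aligned backing
 * store, as the proposed type would require. */
typedef struct {
    _Alignas(16) float lanes[4];
} v128_storage;

/* Aligned loads (movaps) are the fast path; unaligned loads (movups)
 * historically cost extra cycles and invite load-hit-store hazards. */
static float sum_lanes(const v128_storage *s) {
    assert(((uintptr_t)s->lanes % 16) == 0);  /* the alignment guarantee */
    __m128 v = _mm_load_ps(s->lanes);         /* requires 16-byte alignment */
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
}
```

If the compiler couldn't guarantee that alignment for members and locals, every load would have to fall back to the unaligned form, which defeats the point.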

There is one other problem I know of that is only a concern on x86.
In the C ABI, passing 16-byte ALIGNED vectors by value is a problem,
since x86 ALWAYS uses the stack to pass arguments, and has no way to align
the stack.
I wonder if D can get creative with its ABI here, passing vectors in
registers, even though that's not conventional on x86... the C ABI was
invented long before these hardware features existed.
In lieu of that, x86 would (sadly) need to silently pass by const ref...
and also do this in the case of register overflow.

Every other architecture (including x64) is fine, since all other architectures pass in regs, and can align the stack as needed when overflowing the regs (since stack management is manual and not performed with special opcodes).


-- What does this type do? --

The primitive v128 type DOES nothing... it is a type that facilitates the
compiler allocating SIMD registers, managing assignments, loads, and
stores, and allows passing to/from functions BY VALUE in registers.
I.e., the only valid operations would be:
  v128 myVec = someStruct.vecMember; // and vice versa...
  v128 result = someFunc(myVec); // calling functions, passing by value.

Nice bonus: This alone is enough to allow implementation of fast memcpy functions that copy 16 bytes at a time... ;)
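As a sketch of that bonus, here's what a 16-bytes-at-a-time copy looks like in C with SSE2 intrinsics; `memcpy16` is a hypothetical name, and this version assumes 16-byte aligned pointers and a length that's a multiple of 16 (a real version needs head/tail handling):

```c
#include <stddef.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2: 128-bit integer loads/stores */

/* Copy len bytes, 16 at a time, through a 128-bit register.
 * Assumes: dst and src are 16-byte aligned, len % 16 == 0. */
static void memcpy16(void *dst, const void *src, size_t len) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}
```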


-- So, it does nothing... so what good is it? --

Initially you could use this type in conjunction with inline asm, or architecture intrinsics to do useful stuff. This would be using the hardware totally raw, which is an important feature to have, but I imagine most of the good stuff would come from libraries built on top of this.
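For example, this "raw type plus intrinsics" style looks like the following in C today (a sketch; `madd` and `first_lane` are hypothetical helper names):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* The raw type does nothing by itself; the intrinsics supply the
 * operations. A multiply-add built from two SSE intrinsics: */
static __m128 madd(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);  /* a*b + c, 4 lanes at once */
}

/* Extract lane 0 as a scalar float. */
static float first_lane(__m128 v) {
    return _mm_cvtss_f32(v);
}
```

Libraries like the float4 described below would wrap exactly this kind of code behind friendly operators, version()-ed per architecture.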


-- Literal assignment --

This is a hairy one. Endian issues appear in two layers here...
Firstly, if you consider the vector to be 4 ints, the ints themselves may
be little or big endian, but in addition, the outer layer (i.e. the order of
x,y,z,w) may also be in reverse order on some architectures... This makes a
single 128-bit hex literal hard to apply.
I'll have a dig and try to confirm this, but I have a suspicion that VMX
defines its components in reverse order to other architectures... (Note: not
usually a problem in C, because vector code is sooo non-standard in C that
this is ALWAYS ifdef-ed for each platform anyway, and the literal syntax and
order can suit.)

For the primitive v128 type, I generally like the idea of using a huge
128-bit hex literal:
  v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)

Since the primitive v128 type is effectively typeless, it makes no sense to
use syntax like this:
  v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f }; // syntax like this should be
reserved for use with a float4 type defined in a library somewhere.

... The problem is, this may not be linearly applicable to all hardware. If
the order of the components match the endian, then it is fine...
I suspect VMX orders the components reverse to match the fact the values
are big endian, which would be good, but I need to check. And if not...
then literals may need to get a lot more complicated :)

Assignment of literals to the primitive type IS actually important; it's common to generate bit masks in these registers, which are type-independent. I also expect libraries will need to leverage this primitive assignment functionality to implement their more complex literal expressions.
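A concrete example of such a type-independent bit mask, sketched in C with SSE intrinsics: a 4-wide fabs() implemented by clearing the sign bits, i.e. and-ing with the literal 0x7FFFFFFF_7FFFFFFF_7FFFFFFF_7FFFFFFF (the `vec_abs` name is hypothetical):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Absolute value of 4 floats at once via a bit mask, not arithmetic.
 * -0.0f is 0x80000000 in each lane; andnot clears exactly that bit. */
static __m128 vec_abs(__m128 v) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f); /* 0x80000000 per lane */
    return _mm_andnot_ps(sign_mask, v);          /* v & 0x7FFFFFFF...   */
}
```

Note the mask is just bits in the register; whether the lanes are "floats" or "ints" is irrelevant, which is exactly why the raw type needs literal assignment.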


-- Libraries --

With this type, we can write some useful standard libraries. For a start, we can consider adding float4, int4, etc, and make them more intelligent... they would have basic maths operators defined, and probably implement type conversion when casting between types.

  int4 intVec = floatVec;       // type conversion from float to int, or vice
                                // versa... (perhaps we make this require an
                                // explicit cast?)

  v128 vec = floatVec;          // implicit cast to the raw type is always
                                // possible; no type conversion, just a reinterpret
  int4 intVec = vec;            // conversely, the primitive type implicitly
                                // assigns to other types
  int4 intVec = (v128)floatVec; // piping through the primitive v128 performs a
                                // reinterpret between vector types, rather than
                                // the usual type conversion

There are also a truckload of other operations to be fleshed out. For
instance: strongly typed literal assignment, and vector comparisons that
can be used with if() (usually these let you test whether ALL components,
or ANY component, meets a given condition). Conventional logic operators
can't be neatly applied to vectors; you need to do something like this:
  if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...
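Sketched in C with SSE intrinsics, such comparison helpers reduce the per-lane comparison masks to an ordinary int via movemask (the function names mirror the hypothetical std.simd ones above):

```c
#include <stdbool.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* SSE comparisons produce an all-ones/all-zeros mask per lane;
 * _mm_movemask_ps collapses the four sign bits into a 4-bit int
 * that ordinary branch logic can test. */
static bool all_greater(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmpgt_ps(a, b)) == 0xF;  /* all 4 lanes */
}

static bool any_less_or_equal(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmple_ps(a, b)) != 0;    /* any lane */
}
```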

We can discuss the libraries at a later date, but it's possible that you might also want some advanced functions in the library that are only supported on particular architectures (std.simd.sse..., std.simd.vmx..., etc.), which may be version()-ed.


-- Exceptions, flags, and error conditions --

SIMD units usually have their own control register for controlling various
behaviours, most importantly NaN policy and exception semantics...
I'm open to input here... what should be default behaviour?
I'll bet the D community opts for strict NaNs and throwing by default... but
it is actually VERY common to disable hardware exceptions when working with
SIMD code:
  * often precision is less important than speed when using SIMD, and some
SIMD units perform faster when these features are disabled.
  * most SIMD algorithms (at least in performance oriented code) are
designed to tolerate '0,0,0,0' as the result of a divide by zero, or some
other error condition.
  * realtime physics tends to suffer error creep and freaky random
explosions, and you can't have those crashing the program :) .. they're not
really 'errors', they're expected behaviour, often producing 0,0,0,0 as a
result, so they're easy to deal with.

I presume it'll end up being NaNs and throw by default, but we do need some mechanism to change the SIMD unit flags for realtime use... A runtime function? Perhaps a compiler switch (C does this sort of thing a lot)?

It's also worth noting that there are numerous SIMD units out there that
DON'T follow strict ieee float rules, and don't support NaNs or hardware
exceptions at all... others may simply set a divide-by-zero flag, but not
actually trigger a hardware exception, requiring you to explicitly check
the flag if you're interested.
Will it be okay that the language's default behaviour of NaNs and throws is
unsupported on such platforms? What are the implications of this?
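On SSE specifically, these flags live in the MXCSR control/status register, and changing them is just a runtime call. A sketch in C of the "realtime" configuration described above (the function name is hypothetical; the bit values are the documented MXCSR layout):

```c
#include <xmmintrin.h>  /* SSE intrinsics, incl. _mm_getcsr/_mm_setcsr */

/* Typical realtime/games setup: mask all SSE floating-point exceptions
 * so error conditions produce values (not traps), and flush denormal
 * results to zero for speed. */
static void set_realtime_simd_mode(void) {
    unsigned csr = _mm_getcsr();
    csr |= 0x1F80;  /* bits 7-12: mask all six exception types */
    csr |= 0x8000;  /* bit 15: FTZ, flush denormal results to zero */
    _mm_setcsr(csr);
}
```

Whether D exposes this as a runtime function, a compiler switch, or both is exactly the open question.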


-- Future --

AVX now exists; this is a 256-bit SIMD architecture. We simply add a v256
type, and everything else is precisely the same.
I think this is perfectly reasonable... AVX is to SSE exactly as long is to
int, or double is to float. They are different types with different
register allocation and addressing semantics, and deserve a discrete type.
As with v128, libraries may then be created to allow the types to interact.

I know of 2 architectures that support 512-bit (4x4 matrix) registers... same story; implement a primitive type, then, using intrinsics, we can build interesting types in libraries.

We may also consider a v64 type, which would map to the older MMX registers on x86... there are also other architectures with 64-bit 'vector' registers (the Nintendo Wii for one), supporting a pair of floats, or 4 shorts, etc... Same general concept, but only 64 bits wide.


-- Conclusion --

I think that's about it for a start. I don't think it's a particularly large
amount of work; the potential trouble points are 16-byte alignment, literal
expression, and the potential issues relating to language guarantees of
exception/error conditions...
Go on, tear it apart!

Discuss...


January 06, 2012
On 1/5/2012 5:42 PM, Manu wrote:
> The first thing I'd like to say is that a lot of people seem to have this idea
> that float[4] should be specialised as a candidate for simd optimisations
> somehow. It's obviously been discussed, and this general opinion seems to be
> shared by a good few people here.
> I've had a whole bunch of rants why I think this is wrong in other threads, so I
> won't repeat them here...

If you could cut&paste them here, I would find it most helpful. I have some ideas on making that work, but I need to know everything wrong with it first.
January 06, 2012
On 1/5/2012 5:42 PM, Manu wrote:
> So I've been hassling about this for a while now, and Walter asked me to pitch
> an email detailing a minimal implementation with some initial thoughts.

Another question:

Is this worth doing for 32 bit code? Or is anyone doing this doing it for 64 bit only?

The reason I ask is that the 64-bit ABI already keeps the stack 16-byte aligned, but aligning the stack in 32-bit code is inefficient for everything else.
January 06, 2012
On 1/5/2012 5:42 PM, Manu wrote:
> -- Alignment --
>
> This type needs to be 16byte aligned. Unaligned loads/stores are very expensive,
> and also tend to produce extremely costly LHS hazards on most architectures when
> accessing vectors in arrays. If they are not aligned, they are useless... honestly.
>
> ** Does this cause problems with class allocation? Are/can classes be allocated
> to an alignment as inherited from an aligned member? ... If not, this might be
> the bulk of the work.

The only real issue with alignment is getting the stack aligned to 16 bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS X.
January 06, 2012
On 6 January 2012 04:12, Walter Bright <newshound2@digitalmars.com> wrote:

> If you could cut&paste them here, I would find it most helpful. I have some ideas on making that work, but I need to know everything wrong with it first.
>

On 5 January 2012 11:02, Manu <turkeyman@gmail.com> wrote:

> On 5 January 2012 02:42, bearophile <bearophileHUGS@lycos.com> wrote:
>
>> Think about future CPU evolution with SIMD registers 128, then 256, then 512, then 1024 bits long. In theory a good compiler is able to use them with no changes in the D code that uses vector operations.
>>
>
> These are all fundamentally different types, like int and long.. float and
> double... and I certainly want a keyword to identify each of them. Even if
> the compiler is trying to make auto vector optimisations, you can't deny
> programmers explicit control to the hardware when they want/need it.
> Look at x86 compilers, been TRYING to perform automatic SSE optimisations
> for 10 years, with basically no success... do you really think you can do
> better than all that work by Microsoft and GCC?
> In my experience, I've even run into a lot of VC's auto-SSE-ed code that
> is SLOWER than the original float code.
> Let's not even mention architectures that receive much less love than x86,
> and are arguably more important (ARM; slower, simpler processors with more
> demand to perform well, and not waste power)
>

...

> Vector ops and SIMD ops are different things. float[4] (or more
> realistically, float[3]) should NOT be a candidate for automatic SIMD implementation, likewise, simd_type should not have its components individually accessible. These are operations the hardware can not actually perform. So no syntax to worry about, just a type.
>
>
>> I think the good Hara will be able to implement those syntax fixes in a matter of just one day or very few days if a consensus is reached about what actually is to be fixed in D vector ops syntax.
>>
>
>
>> Instead of discussing about *adding* something (register intrinsics) I suggest to discuss about what to fix about the *already present* vector op syntax. This is not a request to just you Manu, but to this whole newsgroup.
>>
>
> And I think this is exactly the wrong approach. A vector is NOT an array of 4 (actually, usually 3) floats. It should not appear as one. This is an overly complicated and ultimately wrong way to engage this hardware. Imagine the complexity in the compiler to try and force float[4] operations into vector arithmetic vs adding a 'v128' type which actually does what people want anyway... What about when you actually WANT a float[4] array, and NOT a vector?
>
> SIMD units are not float units, they should not appear like an aggregation
> of float units. They have:
>  * Different error semantics, exception handling rules, sometimes
> different precision...
>  * Special alignment rules.
>  * Special literal expression/assignment.
>  * You can NOT access individual components at will.
>  * May be reinterpreted at any time as float[1] float[4] double[2]
> short[8] char[16], etc... (up to the architecture intrinsics)
>  * Can not be involved in conventional comparison logic (array of floats
> would make you think they could)
>  *** Can NOT interact with the regular 'float' unit... Vectors as an array
> of floats certainly suggests that you can interact with scalar floats...
>
> I will use architecture intrinsics to operate on these regs, and put that nice and neatly behind a hardware vector type with version()'s for each architecture, and an API with a whole lot of sugar to make them nice and friendly to use.
>
> My argument is that even IF the compiler some day attempts to make vector optimisations to float[4] arrays, the raw hardware should be exposed first, allowing programmers to use it directly. This starts with a language-defined (platform-independent) v128 type.
>

...

 Other rants have been on IRC.


January 06, 2012
On 6 January 2012 04:16, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/5/2012 5:42 PM, Manu wrote:
>
>> -- Alignment --
>>
>> This type needs to be 16byte aligned. Unaligned loads/stores are very
>> expensive,
>> and also tend to produce extremely costly LHS hazards on most
>> architectures when
>> accessing vectors in arrays. If they are not aligned, they are useless...
>> honestly.
>>
>> ** Does this cause problems with class allocation? Are/can classes be
>> allocated
>> to an alignment as inherited from an aligned member? ... If not, this
>> might be
>> the bulk of the work.
>>
>
> The only real issue with alignment is getting the stack aligned to 16 bytes. This is already true of 64 bit code gen, and 32 bit code gen for OS X.
>

It's important for all implementations of SIMD units: x32, x64, and others. As I said, if aligning the x32 stack is too much trouble, I suggest silently passing by const ref on x86.

Are you talking about for parameter passing, or for local variable
assignment on the stack?
For parameter passing, I understand the x32 problems with aligning the
arguments (I think it's possible to work around though), but there should
be no problem with aligning the stack for allocating local variables.


January 06, 2012
>
> The reason I ask is because 64 bit is 16 byte aligned, but aligning the stack in 32 bit code is inefficient for everything else.
>

Note: you only need to align the stack when a vector is actually stored on it by value. Probably very rare, more rare than you think.


January 06, 2012
On 1/5/2012 6:25 PM, Manu wrote:
> Are you talking about for parameter passing, or for local variable assignment on
> the stack?
> For parameter passing, I understand the x32 problems with aligning the arguments
> (I think it's possible to work around though), but there should be no problem
> with aligning the stack for allocating local variables.

Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
January 06, 2012
On 6 January 2012 05:22, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/5/2012 6:25 PM, Manu wrote:
>
>> Are you talking about for parameter passing, or for local variable
>> assignment on
>> the stack?
>> For parameter passing, I understand the x32 problems with aligning the
>> arguments
>> (I think it's possible to work around though), but there should be no
>> problem
>> with aligning the stack for allocating local variables.
>>
>
> Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
>

Perhaps I misunderstand, I can't see the problem?
In the function preamble, you just align it... something like:
  mov reg, esp ; take a backup of the stack pointer
  and esp, -16 ; align it

... function

  mov esp, reg ; restore the stack pointer
  ret 0


January 06, 2012
On 6 January 2012 05:42, Manu <turkeyman@gmail.com> wrote:

> On 6 January 2012 05:22, Walter Bright <newshound2@digitalmars.com> wrote:
>
>> On 1/5/2012 6:25 PM, Manu wrote:
>>
>>> Are you talking about for parameter passing, or for local variable
>>> assignment on
>>> the stack?
>>> For parameter passing, I understand the x32 problems with aligning the
>>> arguments
>>> (I think it's possible to work around though), but there should be no
>>> problem
>>> with aligning the stack for allocating local variables.
>>>
>>
>> Aligning the stack. Before I say anything, I want to hear your suggestion for how to do it efficiently.
>>
>
> Perhaps I misunderstand, I can't see the problem?
> In the function preamble, you just align it... something like:
>   mov reg, esp ; take a backup of the stack pointer
>   and esp, -16 ; align it
>
> ... function
>
>   mov esp, reg ; restore the stack pointer
>   ret 0
>

That said, most of the time values used in smaller functions will only ever exist in regs, and won't ever be written to the stack... in this case, there's no problem.

