January 08, 2012
On 8 January 2012 02:54, Peter Alexander <peter.alexander.au@gmail.com> wrote:

> I agree with Manu that we should just have a single type like __m128 in MSVC. The other types and their conversions should be solvable in a library with something like strong typedefs.
>

Walter put in a reasonable effort to sway me to his side of the fence last night. I'm still not entirely sold that implementation inside the language is necessary to achieve these details, but I don't have enough background info to argue, and I'm not the one that has to maintain the code :)

Here are some points we discussed... how do we do these (efficiently) in a library?

** Literal syntax.. and constant folding:

Constants and literals also need to be aligned. If we use array syntax to express literals, this will be a problem.

 int4 v = [ 1,2,3,4 ] + [ 5,6,7,8 ];

Any constant expressions need to be simplified at compile time:

 int4 vec = [ 6,8,10,12 ];

Perhaps this is possible with CTFE? Or will it be automatic if you express literals as if they were arrays?

** Expression interpretation/simplification:

 float4 v = -b + a;

Obviously, this should be simplified to 'a - b'.

 float4 v = a*b + c;

This should use a multiply-accumulate opcode on most architectures: FMADDPS v, a, b, c

** Typed debug info

In a debugger it's nice to inspect variables in their supposed type.
Can probably use unions to do this... probably wouldn't be as nice though.

** God knows what other optimisations

float4 v = [ 0,0,0,0 ]; // XOR v
etc...


I don't know what amount of this is achievable with libraries, but Walter seems to think this will all work much better in the language... I'm inclined to trust his judgement.


January 08, 2012
On 1/7/2012 4:54 PM, Peter Alexander wrote:
> I think it simply requires a lot of work in the compiler.

Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.

January 08, 2012
On 8 January 2012 03:44, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/7/2012 4:54 PM, Peter Alexander wrote:
>
>> I think it simply requires a lot of work in the compiler.
>>
>
> Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.
>

What is this previous work you speak of? Is there already XMM stuff in there somewhere?


January 08, 2012
On 8/01/12 1:32 AM, Manu wrote:
> On 8 January 2012 02:54, Peter Alexander <peter.alexander.au@gmail.com> wrote:
>
>     I agree with Manu that we should just have a single type like __m128
>     in MSVC. The other types and their conversions should be solvable in
>     a library with something like strong typedefs.
>
>
> Walter put in a reasonable effort to sway me to his side of the fence
> last night. I'm still not entirely sold that implementation inside the
> language is necessary to achieve these details, but I don't have enough
> background info to argue, and I'm not the one that has to maintain the
> code :)
>
> Here are some points we discussed... how do we do these (efficiently) in
> a library?

Just to be clear, it was only the types and conversions that I thought would be suitable for a library. Operations, along with their optimisations, are best left to the compiler.
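
A minimal sketch of the "single raw type plus strong typedefs" idea in library code, for illustration only. `V128`, `Typed`, `Float4`, `Int4` and `reinterpret` are hypothetical names, not existing library or compiler types, and the storage is a plain byte array rather than a real vector register type:

```d
// Hypothetical untyped 128-bit storage, standing in for something like __m128.
struct V128
{
    align(16) ubyte[16] bytes;
}

// Hypothetical strong typedef: same bits, distinct static type per element type.
struct Typed(Elem)
{
    enum lanes = 16 / Elem.sizeof;
    V128 raw;

    this(const Elem[lanes] values)
    {
        // Copy the 16 bytes of the element array into the raw storage.
        raw.bytes[] = cast(const(ubyte)[]) values[];
    }

    Elem[lanes] toArray() const
    {
        Elem[lanes] result;
        (cast(ubyte[]) result[])[] = raw.bytes[];
        return result;
    }
}

alias Float4 = Typed!float;
alias Int4   = Typed!int;

// Conversions must be requested explicitly; the bits are left unchanged.
To reinterpret(To, From)(From v)
{
    To r;
    r.raw = v.raw;
    return r;
}

// The two wrappers are different types and do not convert implicitly.
static assert(!is(Float4 == Int4));
static assert(!__traits(compiles, { Int4 i = Float4.init; }));

unittest
{
    auto f = Float4([1.0f, 2.0f, 3.0f, 4.0f]);
    auto i = reinterpret!Int4(f);
    auto bits = i.toArray();
    assert(bits[0] == 0x3F800000); // IEEE 754 bit pattern of 1.0f
}
```

This only covers typing and conversion; it says nothing about register allocation or code generation, which is the part argued to belong in the compiler.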


> ** Literal syntax.. and constant folding:
>
> Constants and literals also need to be aligned. If we use array syntax
> to express literals, this will be a problem.
>
>   int4 v = [ 1,2,3,4 ] + [ 5,6,7,8 ];
>
> Any constant expressions need to be simplified at compile time: int4 vec
> = [ 6,8,10,12 ];
> Perhaps this is possible with CTFE? Or will it be automatic if you
> express literals as if they were arrays?

You could use array syntax for vector literals, as long as they are stored directly into vector variables. e.g.

immutable int4 a = [1, 2, 3, 4];
immutable int4 b = [5, 6, 7, 8];
int4 v = a + b;

Constant folding can be done by the compiler, although I don't think this is a priority.
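
On the CTFE question, here is a small sketch of how a library could fold such a constant at compile time using plain static arrays. `add4` is a hypothetical helper, `int[4]` stands in for the proposed `int4`, and alignment of the resulting constant is not addressed here:

```d
// Hypothetical element-wise add over plain static arrays (not a vector type).
int[4] add4(const int[4] a, const int[4] b)
{
    int[4] r;
    foreach (i; 0 .. 4)
        r[i] = a[i] + b[i];
    return r;
}

// Initialising an `enum` forces the call to be evaluated at compile time (CTFE),
// so only the folded constant remains.
enum int[4] folded = add4([1, 2, 3, 4], [5, 6, 7, 8]);

// Checked by the compiler, not at run time.
static assert(folded == [6, 8, 10, 12]);
```

The same function still works on run-time values, so one definition serves both cases.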


> ** Expression interpretation/simplification:
>
>   float4 v = -b + a;
>
> Obviously, this should be simplified to 'a - b'.
>
>   float4 v = a*b + c;
>
> This should use a multiply-accumulate opcode on most architectures:
> FMADDPS v, a, b, c

The compiler should make these decisions, just like it does with int/float etc. In some cases these kinds of simplifications can affect the result due to numeric issues.

You can use expression templates for this sort of thing as well, but they are a horrible mess, so I don't think I'd like to see them.
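
To illustrate the numeric issue with one concrete case: reassociating a floating-point sum, which an optimiser might be tempted to do, can change the answer. The function names below are made up for the example, and the inputs are chosen so that the difference shows up regardless of the intermediate precision used:

```d
// Same three operands, two groupings.
float sumLeft(float a, float b, float c)
{
    return (a + b) + c;
}

float sumRight(float a, float b, float c)
{
    return a + (b + c);
}

unittest
{
    // 1e30f and -1e30f cancel exactly, so the 1.0f survives.
    assert(sumLeft(1e30f, -1e30f, 1.0f) == 1.0f);

    // Here 1.0f is absorbed into -1e30f first, so the sum collapses to zero.
    assert(sumRight(1e30f, -1e30f, 1.0f) == 0.0f);
}
```

A fused multiply-add raises a similar concern, since it rounds once where a separate multiply and add round twice.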


> ** Typed debug info
>
> In a debugger it's nice to inspect variables in their supposed type.
> Can probably use unions to do this... probably wouldn't be as nice though.

Good point. I'm not an expert on this, but I suspect that a union would be good enough?


> ** God knows what other optimisations
>
> float4 v = [ 0,0,0,0 ]; // XOR v
> etc...

Again, I think you could use expression templates for this, but it's so much simpler to leave this optimisation to the compiler.

Even if the compiler doesn't do it, it's not difficult to do it manually when you really need it:

float4 v = void;
asm { pxor v, v; }


Honestly, I'm not too bothered with these types of optimisations. As long as the compiler does the register allocation and instruction scheduling for me, I would be 99% happy because those things are the most tedious when trying to write structured code. I can easily enough change (-b + a) to (a - b) if that's faster, or insert specific instructions for generating vector constants, or do constant folding manually.

Of course, it would be nice if the compiler did them, but that's just icing on the cake. The meat of the problem is register allocation.


> I don't know what amount of this is achievable with libraries, but
> Walter seems to think this will all work much better in the language...
> I'm inclined to trust his judgement.

I agree.

January 08, 2012
On 8/01/12 1:48 AM, Manu wrote:
> On 8 January 2012 03:44, Walter Bright <newshound2@digitalmars.com> wrote:
>
>     On 1/7/2012 4:54 PM, Peter Alexander wrote:
>
>         I think it simply requires a lot of work in the compiler.
>
>
>     Not that much work. Most of it segues nicely into the previous work
>     I did supporting the XMM floating point code gen.
>
>
> What is this previous work you speak of? Is there already XMM stuff in
> there somewhere?

On 64-bit, floats are stored in XMM registers (just as single scalars). I don't think it does any vectorization yet though. It does mean that the register allocation of those registers is already complete though.

January 08, 2012
On 1/7/2012 5:32 PM, Manu wrote:
> Here are some points we discussed... how do we do these (efficiently) in a library?

Another issue - matching the name mangling and parameter passing/return conventions of how other C/C++ compilers deal with vector types. That is currently not doable with a library type.

January 08, 2012
On 1/7/2012 6:32 PM, Peter Alexander wrote:
> On 64-bit, floats are stored in XMM registers (just as single scalars).

Yes.

> I don't think it does any vectorization yet though.

Right. It doesn't do that.

> It does mean that the register
> allocation of those registers is already complete though.

Yup. Does a nice job of it, too :-)

January 08, 2012
On Sunday, 8 January 2012 at 01:48:34 UTC, Manu wrote:
> On 8 January 2012 03:44, Walter Bright <newshound2@digitalmars.com> wrote:
>
>> On 1/7/2012 4:54 PM, Peter Alexander wrote:
>>
>>> I think it simply requires a lot of work in the compiler.
>>>
>>
>> Not that much work. Most of it segues nicely into the previous work I did
>> supporting the XMM floating point code gen.
>>
>
> What is this previous work you speak of? Is there already XMM stuff in
> there somewhere?

DMD (at least 64 bit on linux, I'm not sure about 32 bit) now uses XMM registers and instructions that work on them (addss, addsd, mulsd...) for scalar floating point operations.

January 08, 2012
On 8 January 2012 11:56, a <a@a.com> wrote:

> On Sunday, 8 January 2012 at 01:48:34 UTC, Manu wrote:
>
>> On 8 January 2012 03:44, Walter Bright <newshound2@digitalmars.com> wrote:
>>
>>> On 1/7/2012 4:54 PM, Peter Alexander wrote:
>>>
>>>> I think it simply requires a lot of work in the compiler.
>>>>
>>>
>>> Not that much work. Most of it segues nicely into the previous work I did supporting the XMM floating point code gen.
>>>
>>
>> What is this previous work you speak of? Is there already XMM stuff in there somewhere?
>>
>
> DMD (at least 64 bit on linux, I'm not sure about 32 bit) now uses XMM
> registers and instructions that work on them (addss, addsd, mulsd...) for
> scalar floating point operations.
>

Yeah of course! >_<
I forgot that they did that in x64 (I never work with x64), but I recall
thinking that was the single most awesome change to the architecture! :)


January 08, 2012
simdop will need more overloads, e.g. some instructions need immediate bytes:

z = simdop(SHUFPS, x, y, 0);

How about this:
__v128 simdop(T...)(SIMD op, T args);
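
A rough sketch of how that variadic signature could accept both the two-operand form and the form with an immediate. Everything here is a stand-in for illustration: `V128` replaces the proposed `__v128`, the `SIMD` enum has only two members, and the bodies emulate the instructions in scalar code rather than emitting them:

```d
// Hypothetical opcode enum and vector stand-in.
enum SIMD { ADDPS, SHUFPS }

struct V128
{
    float[4] lanes;
}

// One variadic template covers every operand count; static if selects the form.
V128 simdop(T...)(SIMD op, T args)
{
    V128 r;
    static if (T.length == 2)
    {
        // Two vector operands; only ADDPS is modelled in this sketch.
        assert(op == SIMD.ADDPS, "two-operand form only models ADDPS here");
        foreach (i; 0 .. 4)
            r.lanes[i] = args[0].lanes[i] + args[1].lanes[i];
    }
    else static if (T.length == 3)
    {
        // Two vector operands plus an immediate byte; only SHUFPS is modelled.
        assert(op == SIMD.SHUFPS, "three-operand form only models SHUFPS here");
        immutable imm = args[2];
        // SHUFPS: low two lanes come from the first operand, high two from the second.
        r.lanes[0] = args[0].lanes[imm & 3];
        r.lanes[1] = args[0].lanes[(imm >> 2) & 3];
        r.lanes[2] = args[1].lanes[(imm >> 4) & 3];
        r.lanes[3] = args[1].lanes[(imm >> 6) & 3];
    }
    else
        static assert(0, "unsupported operand count");
    return r;
}

unittest
{
    auto x = V128([1f, 2f, 3f, 4f]);
    auto y = V128([5f, 6f, 7f, 8f]);
    auto z = simdop(SIMD.SHUFPS, x, y, 0); // immediate 0 selects lane 0 everywhere
    assert(z.lanes == [1f, 1f, 5f, 5f]);
}
```

One limitation of this shape is that the immediate arrives as an ordinary run-time argument, whereas the real instruction needs it as a compile-time constant; passing it as a template argument would be one way to enforce that.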