January 06, 2012
On 1/6/2012 12:43 AM, Walter Bright wrote:
> Declare one new basic type:
> 
>     __v128
> 
> which represents the 16-byte-aligned 128-bit vector type. The only operations defined to work on it would be construction and assignment. The __ prefix signals that it is non-portable.
> 
> Then, have:
> 
>    import core.simd;
> 
> which provides two functions:
> 
>    __v128 simdop(operator, __v128 op1);
>    __v128 simdop(operator, __v128 op1, __v128 op2);
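In C terms, the proposed interface might look roughly like the sketch below. The opcode enum, the `simdop2` name, and the switch dispatch are all hypothetical stand-ins (in the actual proposal the compiler would recognise the call directly, not dispatch at runtime); SSE intrinsics are used purely for illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics (x86) */

/* Hypothetical sketch of the proposed core.simd interface: one opaque
   128-bit type plus a generic "simdop" dispatcher. A real implementation
   would be a compiler intrinsic, not a runtime switch. */
typedef __m128 v128;

enum simd_opcode { OP_ADDPS, OP_MULPS };

static v128 simdop2(enum simd_opcode op, v128 op1, v128 op2)
{
    switch (op) {
    case OP_ADDPS: return _mm_add_ps(op1, op2); /* 4 x float add */
    case OP_MULPS: return _mm_mul_ps(op1, op2); /* 4 x float multiply */
    }
    return op1;
}
```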

How is making __v128 a builtin type better than defining it as:

align(16) struct __v128
{
    ubyte[16] data;
}

January 06, 2012
On 1/6/2012 6:05 AM, Manu wrote:
> On 6 January 2012 08:22, Walter Bright <newshound2@digitalmars.com> wrote:
>
>     On 1/5/2012 7:42 PM, Manu wrote:
>
>         Perhaps I misunderstand, I can't see the problem?
>         In the function preamble, you just align it... something like:
>            mov reg, esp ; take a backup of the stack pointer
>            and esp, -16 ; align it
>
>         ... function
>
>            mov esp, reg ; restore the stack pointer
>            ret 0
>
>
>     And now you cannot access the function's parameters anymore, because the
>     stack offset for them is now variable rather than fixed.
>
>
> Hehe, true, but not insurmountable. Scheduling of parameter pops before you
> perform the alignment may solve that straight up, or else don't align esp
> itself; store the vector to the stack through some other aligned reg copied from
> esp...
>
> I just wrote some test functions using __m128 in VisualC, it seems to do
> something in between the simplicity of my initial suggestion, and my refined
> ideas one above :)
> If you have VisualC, check out what it does, it's very simple, looks pretty
> good, and I'm sure it's optimal (MS have enough R&D money to assure this)
>
> I can paste some disassemblies if you don't have VC...

I don't have VC. I had thought of using an extra level of indirection for all the aligned stuff, essentially rewrite:

    v128 v;
    v = x;

with:

    v128 v; // goes in aligned stack
    v128 *pv = &v;  // pv is in regular stack
    *pv = x;

but there are still complexities with it, like spilling aligned temps to the stack.
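A rough C rendering of that rewrite may make it concrete (the manual rounding-up and names are illustrative, not Walter's actual codegen): the vector's storage is aligned by hand, while the pointer to it lives at a fixed offset in the ordinary, possibly unaligned frame.

```c
#include <stdint.h>
#include <string.h>

typedef struct { float f[4]; } v128;  /* stand-in for __v128 */

/* Returns 1 if the hand-aligned copy worked. pv itself sits at a fixed
   frame offset, so parameter offsets stay fixed; only the pointee of pv
   is forced to a 16-byte boundary. */
int demo(const v128 *x)
{
    char raw[sizeof(v128) + 15];               /* over-allocate */
    v128 *pv = (v128 *)(((uintptr_t)raw + 15) & ~(uintptr_t)15);

    memcpy(pv, x, sizeof *pv);                 /* plays the role of *pv = x; */
    return ((uintptr_t)pv & 15) == 0 && pv->f[0] == x->f[0];
}
```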
January 06, 2012
On 6 January 2012 20:17, Martin Nowak <dawg@dawgfoto.de> wrote:

> There is another benefit.
> Consider the following:
>
> __vec128 addps(__vec128 a, __vec128 b) pure
> {
>    __vec128 res = a;
>
>    if (__ctfe)
>    {
>        foreach(i; 0 .. 4)
>           res[i] += b[i];
>    }
>    else
>    {
>        asm (b, res)
>        {
>            addps res, b;
>        }
>    }
>    return res;
>
> }
>

You don't need to use inline ASM to be able to do this; it will work the
same with intrinsics.
I've detailed numerous problems with using inline asm, and complications
with extending the inline assembler to support this.
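For comparison, here is the intrinsic form of the same addps wrapper, sketched in C with SSE intrinsics (an assumption standing in for the D version): the compiler sees an operation it fully understands, so inlining, scheduling, and register allocation all work as usual.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Intrinsic equivalent of the asm-based addps example above: there is no
   opaque asm block, so the optimiser can reschedule and allocate
   registers around this call freely. */
static inline __m128 addps(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}
```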

>> * Assembly blocks present problems for the optimiser; it's not reliable
>> that it can optimise around inline asm blocks. How bad will it be when
>> trying to optimise around 100 small inlined functions, each containing
>> its own inline asm block?
>>
> What do you mean by optimizing around? I don't see any apparent reason why
> that
> should perform worse than using intrinsics.
>

Most compilers can't reschedule code around inline asm blocks. There are a
lot of reasons for this; google can help you.
The main reason is that a COMPILER doesn't attempt to understand the
assembly it's being asked to insert inline. The information that it might
use for optimisation is never present, so it can't do its job.


> The only implementation issue could be that lots of inlined asm snippets make plenty of basic blocks, which could slow down certain compiler algorithms.


Same problem as above. The compiler would need to understand enough about assembly to perform optimisation on the assembly itself to clean this up. Using intrinsics, all the register allocation, load/store code, etc. is in the regular realm of compiling the language, and the code generation and optimisation all work as usual.

>> * D's inline assembly syntax has to be carefully translated to GCC's
>> inline asm format when using GCC, and this needs to be done
>> PER-ARCHITECTURE, which Iain should not be expected to do for all the
>> obscure architectures GCC supports.
>
> ???
> This would be needed for opcodes as well. Your initial goal was to
> directly influence code gen down to the instruction level; how should
> that be achieved without platform-specific extensions? Quite the
> contrary: with ops and asm he will need two hack paths into gcc's
> codegen.


> What I see here is that we can do many good things for the inline
> assembler while achieving the same goal.
> With intrinsics, on the other hand, we're adding a very specialized
> maintenance burden.


You need to understand how the inline assembler works in GCC to understand
the problems with this.
GCC basically receives a string containing assembly code. It does not
attempt to understand it; it just pastes it into the .s file verbatim.
This means you can support any architecture without any additional work...
you just type the appropriate architecture's asm in your program and it's
fine... but now, if we want to perform pseudo-register assignment or
parameter substitution, we need a front end that parses the D asm
expressions and generates a valid asm string for GCC. It can't generate
that string without detailed knowledge of the architecture it's targeting,
and it's not feasible to implement that support for all the architectures
GCC supports.
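GCC's extended-asm form makes this concrete. The template really is an opaque string pasted into the output, and only the hand-written constraint letters (here "x", GCC's SSE-register class) tell the compiler anything about it; a D front end would have to synthesise both the string and the constraints for each target. This C sketch is an illustration, not part of the proposal.

```c
#include <xmmintrin.h>

/* GCC extended asm: the "addps %1, %0" template is emitted verbatim into
   the .s file and never parsed by GCC. The constraints carry all the
   information: "+x" = read-write SSE register, "x" = input SSE register. */
static inline __m128 addps_asm(__m128 a, __m128 b)
{
    __asm__("addps %1, %0" : "+x"(a) : "x"(b));
    return a;
}
```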

Even after all that, it's still not ideal. Inline asm reduces the ability of the compiler to perform many optimisations.

>> Consider this common situation and the code that will be built around it:
>>
>> __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //
>> pack some other useful data in W
>>
> Such is really not a good idea if the bit pattern of packedColour is a
> denormal.
> How can you even execute a single useful command on the floats here?
>
> Also, mixing integer and FP instructions on the same register may cause
> performance degradation. The registers are indeed typed CPU-internally.


It's a very good idea: I am saving memory, and also saving memory accesses.

This leads back to the point in my OP where I said that most games
programmers turn NaN, Den, and FP exceptions off.
As I've also raised before, most vectors are actually float[3]s; W is
usually ignored and contains rubbish.
It's conventional to stash some 32-bit value in W to fill the otherwise
wasted space, and also to get the load for free alongside the position.

The typical program flow, in this case:
  * the colour will be copied out into a separate register where it will be
reinterpreted as a uint, and have an unpack process applied to it.
  * XYZ will then be used to perform maths, ignoring W, which will continue
to accumulate rubbish values... it doesn't matter, all FP exceptions and
such are disabled.
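A C sketch of that convention (hypothetical helper names, SSE intrinsics assumed): the colour's bits ride in the W lane without ever being interpreted as a float, and are copied back out and reinterpreted as a uint when needed.

```c
#include <xmmintrin.h>
#include <stdint.h>
#include <string.h>

/* Pack XYZ position into lanes 0-2 and bit-copy a 32-bit colour into
   lane 3 (W). The W lane only carries bits; no FP math touches it. */
static __m128 pack_pos_colour(float x, float y, float z, uint32_t colour)
{
    float w;
    memcpy(&w, &colour, sizeof w);     /* reinterpret bits, don't convert */
    return _mm_set_ps(w, z, y, x);     /* _mm_set_ps takes lanes in w..x order */
}

/* The unpack step described above: copy W out, reinterpret as a uint. */
static uint32_t unpack_colour(__m128 v)
{
    float lanes[4];
    uint32_t c;
    _mm_storeu_ps(lanes, v);
    memcpy(&c, &lanes[3], sizeof c);
    return c;
}
```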


January 06, 2012
On 1/6/2012 7:36 AM, Russel Winder wrote:
> It just strikes me as an opportunity to get D front and centre by having
> it provide a better development experience for these heterogeneous
> systems that are coming.

At the moment, I have no idea what such support might look like :-(
January 06, 2012
On 6 January 2012 20:25, Brad Roberts <braddr@puremagic.com> wrote:

> How is making __v128 a builtin type better than defining it as:
>
> align(16) struct __v128
> {
>    ubyte[16] data;
> }
>

Where in that code is the compiler informed that your structure should occupy a SIMD register, and that SIMD ABI conventions should apply?


January 06, 2012
On 6 January 2012 20:53, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/6/2012 6:05 AM, Manu wrote:
>
>> [...]
>
> I don't have VC. I had thought of using an extra level of indirection for all the aligned stuff, essentially rewrite:
>
>    v128 v;
>    v = x;
>
> with:
>
>    v128 v; // goes in aligned stack
>    v128 *pv = &v;  // pv is in regular stack
>    *pv = x;
>
> but there are still complexities with it, like spilling aligned temps to the stack.
>

I think we should take this conversation to IRC, or a separate thread?
I'll generate some examples from VC for you in various situations. If you
can write me a short list of trouble cases as you see them, I'll make sure
to address them specifically...
Have you tested the code that GCC produces? I'm sure it'll be identical to
VC...

That said, how do you currently support ANY aligned type? I thought align(n) was a defined keyword in D?


January 06, 2012
On 1/6/2012 10:25 AM, Brad Roberts wrote:
> How is making __v128 a builtin type better than defining it as:
>
> align(16) struct __v128
> {
>      ubyte[16] data;
> }

Then the back end knows it should be mapped onto the XMM registers rather than the usual arithmetic set.

January 06, 2012
On 1/6/2012 11:06 AM, Manu wrote:
> On 6 January 2012 20:25, Brad Roberts <braddr@puremagic.com <mailto:braddr@puremagic.com>> wrote:
> 
>     How is making __v128 a builtin type better than defining it as:
> 
>     align(16) struct __v128
>     {
>        ubyte[16] data;
>     }
> 
> 
> Where in that code is the compiler informed that your structure should occupy a SIMD register, and apply SIMD ABI conventions?

Good point, those rules would need to be added.  I'd argue that it's not unreasonable to allow any properly aligned and sized type to occupy those registers.  Though that's likely not optimal for cases that won't actually use the operations that modify them.  As a counterexample, however, it'd be a lot easier under this theoretical model to write a memcpy routine that uses them without having to resort to asm code.
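As a sketch of that memcpy point, here is what such a routine looks like in C with SSE intrinsics (unaligned load/store variants so it works on arbitrary buffers; an illustration, not a tuned implementation):

```c
#include <xmmintrin.h>
#include <stddef.h>
#include <string.h>

/* Copy 16 bytes at a time through an XMM register, then finish the tail
   byte by byte. With first-class vector types this needs no asm at all. */
static void memcpy16(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    while (n >= 16) {
        _mm_storeu_ps((float *)d, _mm_loadu_ps((const float *)s));
        d += 16; s += 16; n -= 16;
    }
    while (n--)
        *d++ = *s++;
}
```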

January 06, 2012
On 1/6/2012 5:44 AM, Manu wrote:
> The type safety you're imagining here might actually be annoying when working
> with the raw type and opcodes..
> Consider this common situation and the code that will be built around it:
> __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; // pack some
> other useful data in W
> If vec were strongly typed, I would now need to start casting all over the place
> to use various float and uint opcodes on this value?
> I think it's correct when using SIMD at the raw level to express the type as it
> is, typeless... SIMD regs are infact typeless regs, they only gain concept of
> type the moment you perform an opcode on it, and only for the duration of that
> opcode.
>
> You will get your strong type safety when you make use of the float4 types which
> will be created in the libs.

Consider an analogy with the EAX register. It's untyped. But we don't find it convenient to make it untyped in a high level language, we paint the fiction of a type onto it, and that works very well.

To me, the advantage of making the SIMD types typed are:

1. the language does typechecking, for example, trying to add a vector of 4 floats to 16 bytes would be (and should be) an error.

2. Some of the SIMD operations do map nicely onto the operators, so one could write:

   a = b + c + -d;

and the correct SIMD opcodes will be generated based on the types. I think that would be one hell of a lot nicer than using function syntax. Of course, this will only be for those SIMD ops that do map, for the rest you're stuck with the functions.

3. A lot of the SIMD opcodes have 10 variants, one for each of the 10 types. The user would only need to remember the operation, not the variants, and let the usual overloading rules apply.


And, of course, casting would be allowed and would be zero cost.
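GCC and Clang's vector extensions already demonstrate this model in C (an analogy for the proposal, not the proposed D feature): the element type drives both type checking and instruction selection, operator syntax works as described, and casts between vector types are free bit-reinterpretations.

```c
/* A typed 4 x float vector; the compiler maps it onto an XMM register
   and picks addps/subps based on the element type. */
typedef float float4 __attribute__((vector_size(16)));

static float4 eval(float4 b, float4 c, float4 d)
{
    return b + c + -d;   /* operator syntax, compiles to SIMD ops */
}
```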

I've been thinking about this a lot since last night, and I think that since the back end already supports XMM registers, most of the hard work is done, and doing it this way would fit in well. (At least for 64-bit code, where the alignment issue is solved, but that's an orthogonal issue.)
January 06, 2012
On 1/6/2012 11:08 AM, Manu wrote:
> I think we should take this conversation to IRC, or a separate thread?
> I'll generate some examples from VC for you in various situations. If you can
> write me a short list of trouble cases as you see them, I'll make sure to
> address them specifically...
> Have you tested the code that GCC produces? I'm sure it'll be identical to VC...

What I'm going to do is make the SIMD stuff work on 64 bits for now. The alignment problem is solved for it, and is an orthogonal issue.


> That said, how do you currently support ANY aligned type? I thought align(n) was
> a defined keyword in D?

Yes, but the alignment is only as good as the alignment underlying it. For example, anything in segments can be aligned to 16 bytes or less, because the segments are aligned to 16 bytes. Anything allocated with new can be aligned to 16 bytes or less.

The stack, however, is aligned to 4, so trying to align things on the stack by 8 or 16 will not work.
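When the stack only guarantees 4-byte alignment, the usual workaround is to take the storage from the heap with an aligned allocator instead. A C11 sketch (aligned_alloc is C11; posix_memalign is the POSIX equivalent):

```c
#include <stdlib.h>

/* Heap storage sidesteps the stack's alignment limit entirely:
   aligned_alloc can grant any power-of-two alignment. Note that C11
   requires the size to be a multiple of the alignment (16 here). */
float *alloc_vec16(void)
{
    return aligned_alloc(16, 4 * sizeof(float));
}
```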