January 06, 2012
On 6 January 2012 21:21, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/6/2012 5:44 AM, Manu wrote:
>
>> The type safety you're imagining here might actually be annoying when
>> working
>> with the raw type and opcodes..
>> Consider this common situation and the code that will be built around it:
>> __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //
>> pack some
>
>
> Consider an analogy with the EAX register. It's untyped. But we don't find it convenient to make it untyped in a high level language, we paint the fiction of a type onto it, and that works very well.
>

Damn it, I thought we had already reached agreement; why are you having second
thoughts? :)
Your analogy to EAX is not really valid. EAX may hold some data that is not
an int, but it is incapable of performing a float operation on that data.
SIMD registers are capable of performing operations of any type at any time
to any register, I think this is the key distinction that justifies them as
inherently 'typeless' registers.


> To me, the advantage of making the SIMD types typed are:
>
> 1. the language does typechecking, for example, trying to add a vector of 4 floats to 16 bytes would be (and should be) an error.


The language WILL do that checking as soon as we create the strongly typed libraries. And people will use those libraries, they'll never touch the primitive type.

The primitive type however must not inhibit the programmer from being able
to perform any operation that the hardware is technically capable of.
The primitive type will be used behind the scenes for building said
libraries... nobody will use it in front-end code. It's not really a useful
type, it doesn't do anything. It just allows the ABI and register semantics
to be expressed in the language.


> 2. Some of the SIMD operations do map nicely onto the operators, so one could write:
>
>   a = b + c + -d;
>

This is not even true, as you said yourself in a previous post.
SIMD int ops may wrap, or saturate... which is it?
Don't try and express this at the language level. Let the libraries do it,
and if they fail, or are revealed to be poorly defined, they can be
updated/changed.

> 3. A lot of the SIMD opcodes have 10 variants, one for each of the 10
> types. The user would only need to remember the operation, not the variants, and let the usual overloading rules apply.
>

Correct, and they will be hidden behind the api of their strongly typed library counterparts. The user will never need to be aware of the opcodes, or their variants.


> And, of course, casting would be allowed and would be zero cost.
>

Zero cost? You're suggesting all casts would be reinterprets? Surely
float4 fVec = cast(float4)intVec; should perform a type conversion?
Again, this is detail that can/should be discussed when implementing the
standard library, leave this sort of problem out of the language.


Your earlier email detailing your simple API with an enum of opcodes sounded fine... whatever's easiest really. The hard part will be implementing the alignment, and the literal syntax.


January 06, 2012
On 6 January 2012 21:34, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/6/2012 11:08 AM, Manu wrote:
>
>> I think we should take this conversation to IRC, or a separate thread?
>> I'll generate some examples from VC for you in various situations. If you
>> can
>> write me a short list of trouble cases as you see them, I'll make sure to
>> address them specifically...
>> Have you tested the code that GCC produces? I'm sure it'll be identical
>> to VC...
>>
>
> What I'm going to do is make the SIMD stuff work on 64 bits for now. The alignment problem is solved for it, and is an orthogonal issue.


...I'm using DMD on Windows... 32-bit x86. So this isn't ideal ;)
Although with this change, Iain should be able to expose the vector types
in GDC, and I can work from there, and hopefully even build an ARM/PPC
toolchain to experiment with the library in a cross platform environment.

>> That said, how do you currently support ANY aligned type? I thought
>> align(n) was a defined keyword in D?
>>
>>
>
> Yes, but the alignment is only as good as the alignment underlying it. For example, anything in segments can be aligned to 16 bytes or less, because the segments are aligned to 16 bytes. Anything allocated with new can be aligned to 16 bytes or less.
>
> The stack, however, is aligned to 4, so trying to align things on the stack by 8 or 16 will not work.
>

... this sounds bad. Shall I start another thread? ;)
So you're saying it's impossible to align a stack based buffer to, say, 128
bytes... ?
This is another fairly important daily requirement of mine (that I assumed
was currently supported). Aligning buffers to cache lines is common, and is
required for many optimisations.

Hopefully the work you do to support 16-byte alignment on x86 will also
support arbitrary alignment of any buffer...
Will arbitrary alignment be supported on x64?
What about GCC? Will/does it support arbitrary alignment?


January 06, 2012
On 1/6/2012 11:53 AM, Manu wrote:
> ... this sounds bad. Shall I start another thread? ;)
> So you're saying it's impossible to align a stack based buffer to, say, 128
> bytes... ?

No, it's not impossible. Here's what you can do now:

char[128+127] buf;
char* pbuf = cast(char*)((cast(size_t)buf.ptr + 127) & ~127);

and now pbuf points to 128 bytes, aligned, on the stack.


> Hopefully the work you do to support 16-byte alignment on x86 will also support
> arbitrary alignment of any buffer...
> Will arbitrary alignment be supported on x64?

Aligning to non-powers of 2 will never work. As for other alignments, they only will work if the underlying storage is aligned to that or greater. Otherwise, you'll have to resort to the method outlined above.


> What about GCC? Will/does it support arbitrary alignment?

Don't know about gcc.
January 06, 2012
On 1/6/2012 11:16 AM, Brad Roberts wrote:
> However, as a counter-example, it'd be a lot easier to write a memcpy routine
> that uses them without having to resort to asm code under this theoretical model.

I would seriously argue that individuals not attempt to write their own memcpy.

Why? Because the C one has had probably thousands of programmers looking at it for the last 30 years. You're not going to spend 5 minutes, or even 5 days, and make it faster.

January 06, 2012
On 1/6/2012 11:41 AM, Manu wrote:
> On 6 January 2012 21:21, Walter Bright <newshound2@digitalmars.com
> <mailto:newshound2@digitalmars.com>> wrote:
>
>     On 1/6/2012 5:44 AM, Manu wrote:
>
>         The type safety you're imagining here might actually be annoying when
>         working
>         with the raw type and opcodes..
>         Consider this common situation and the code that will be built around it:
>         __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //
>         pack some
>
>
>     Consider an analogy with the EAX register. It's untyped. But we don't find
>     it convenient to make it untyped in a high level language, we paint the
>     fiction of a type onto it, and that works very well.
>
>
> Damn it, I thought we had already reached agreement; why are you having second
> thoughts? :)
> Your analogy to EAX is not really valid. EAX may hold some data that is not an
> int, but it is incapable of performing a float operation on that data.
> SIMD registers are capable of performing operations of any type at any time to
> any register, I think this is the key distinction that justifies them as
> inherently 'typeless' registers.

I strongly disagree with this. EAX can (and is) at various times used as byte, ubyte, short, ushort, int, uint, pointer, and yes, even floats! Anything that fits in it, actually. It is typeless. The types used on them are a fiction perpetrated by the language, but a very very useful fiction.


>     To me, the advantage of making the SIMD types typed are:
>
>     1. the language does typechecking, for example, trying to add a vector of 4
>     floats to 16 bytes would be (and should be) an error.
>
>
> The language WILL do that checking as soon as we create the strongly typed
> libraries. And people will use those libraries, they'll never touch the
> primitive type.

I'm not so sure this will work out satisfactorily.


> The primitive type however must not inhibit the programmer from being able to
> perform any operation that the hardware is technically capable of.
> The primitive type will be used behind the scenes for building said libraries...
> nobody will use it in front-end code. It's not really a useful type, it doesn't
> do anything. It just allows the ABI and register semantics to be expressed in
> the language.
>
>     2. Some of the SIMD operations do map nicely onto the operators, so one
>     could write:
>
>        a = b + c + -d;
>
>
> This is not even true, as you said yourself in a previous post.
> SIMD int ops may wrap, or saturate... which is it?

It would only be for those ops that actually do map onto the D operators. (This is already done by the library implementation of the array arithmetic operations.) The saturated int ops would not be usable this way.


> Don't try and express this at the language level. Let the libraries do it, and
> if they fail, or are revealed to be poorly defined, they can be updated/changed.

Doing it as a library type pretty much prevents certain optimizations, for example, the fused operations, from being expressed using infix operators.


>     3. A lot of the SIMD opcodes have 10 variants, one for each of the 10 types.
>     The user would only need to remember the operation, not the variants, and
>     let the usual overloading rules apply.
>
>
> Correct, and they will be hidden behind the api of their strongly typed library
> counterparts. The user will never need to be aware of the opcodes, or their
> variants.
>
>     And, of course, casting would be allowed and would be zero cost.
>
>
> Zero cost? You're suggesting all casts would be reinterprets? Surely
> float4 fVec = cast(float4)intVec; should perform a type conversion?
> Again, this is detail that can/should be discussed when implementing the
> standard library, leave this sort of problem out of the language.

Painting a new type (i.e. a reinterpret cast) does have zero runtime cost. I don't think it's a real problem - we do it all the time when, for example, we want to retype an int as a uint:

   int i;
   uint u = cast(uint)i;


> Your earlier email detailing your simple API with an enum of opcodes sounded
> fine... whatever's easiest really. The hard part will be implementing the
> alignment, and the literal syntax.

January 06, 2012
On Fri, 6 Jan 2012, Walter Bright wrote:

> On 1/6/2012 11:16 AM, Brad Roberts wrote:
> > However, as a counter-example, it'd be a lot easier to write a memcpy
> > routine that uses them without having to resort to asm code under this
> > theoretical model.
> 
> I would seriously argue that individuals not attempt to write their own memcpy.
> 
> Why? Because the C one has had probably thousands of programmers looking at it for the last 30 years. You're not going to spend 5 minutes, or even 5 days, and make it faster.

Oh, I completely agree.  Intel has people that work on that as their primary job.  There's a constant trickle of changes going into glibc's mem{cpy,cmp} type routines to specialize for each of the ever evolving set of platforms out there.  No way should that effort be duplicated.  All I was pondering was how much cleaner much of that could be if it was expressed in higher level representations.  But you'd still wind up playing serious tweaking and validation games that would largely if not completely invalidate the utility of being expressed in higher level forms.  Probably.

Later,
Brad
January 06, 2012
On Fri, 06 Jan 2012 20:00:15 +0100, Manu <turkeyman@gmail.com> wrote:

> On 6 January 2012 20:17, Martin Nowak <dawg@dawgfoto.de> wrote:
>
>> There is another benefit.
>> Consider the following:
>>
>> __vec128 addps(__vec128 a, __vec128 b) pure
>> {
>>    __vec128 res = a;
>>
>>    if (__ctfe)
>>    {
>>        foreach(i; 0 .. 4)
>>           res[i] += b[i];
>>    }
>>    else
>>    {
>>        asm (res, b)
>>        {
>>            addps res, b;
>>        }
>>    }
>>    return res;
>>
>> }
>>
>
> You don't need to use inline ASM to be able to do this, it will work the
> same with intrinsics.
> I've detailed numerous problems with using inline asm, and complications
> with extending the inline assembler to support this.
>
Don't get me wrong here. The idea is to find out whether intrinsics
can be built with the help of inlineable asm functions.
The CTFE support is one good reason to go with a library solution.

>  * Assembly blocks present problems for the optimiser, it's not reliable
>>> that it can optimise around an inline asm blocks. How bad will it be when
>>> trying to optimise around 100 small inlined functions each containing its
>>> own inline asm blocks?
>>>
>> What do you mean by optimizing around? I don't see any apparent reason why
>> that
>> should perform worse than using intrinsics.
>>
>
> Most compilers can't reschedule code around inline asm blocks. There are a
> lot of reasons for this, google can help you.
> The main reason is that a COMPILER doesn't attempt to understand the
> assembly it's being asked to insert inline. The information that it may use
> for optimisation is never present, so it can't do its job.

It doesn't have to understand the assembly.
Wrapping these in functions creates an IR expression with inputs and outputs.
Declaring them as pure gives the compiler free rein to apply whatever
optimizations it normally performs on an IR tree:
common subexpression elimination, removing dead expressions...
>
>
>> The only implementation issue could be that lots of inlined asm snippets
>> make plenty basic blocks which could slow down certain compiler algorithms.
>
>
> Same problem as above. The compiler would need to understand enough about
> assembly to perform optimisation on the assembly itself to clean this up.
> Using intrinsics, all the register allocation, load/store code, etc, is all
> in the regular realm of compiling the language, and the code generation and
> optimisation will all work as usual.
>
There is no informational difference between the intrinsic

__m128 _mm_add_ps(__m128 a, __m128 b);

and an inline assembler version

__m128 _mm_add_ps(__m128 a, __m128 b)
{
    asm
    {
         addps a, b;
    }
}

>  * D's inline assembly syntax has to be carefully translated to GCC's
>>> inline asm format when using GCC, and this needs to be done
>>> PER-ARCHITECTURE, which Iain should not be expected to do for all the
>>> obscure architectures GCC supports.
>>>
>>>  ???
>> This would be needed for opcodes as well. You initial goal was to directly
>> influence
>> code gen up to instruction level, how should that be achieved without
>> platform specific
>> extension. Quite contrary with ops and asm he will need two hack paths
>> into gcc's codegen.
>
>
>> What I see here is that we can do much good things to the inline
>> assembler while achieving the same goal.
>> With intrinsics on the other hand we're adding a very specialized
>> maintenance burden.
>
>
> You need to understand how the inline assembler works in GCC to understand
> the problems with this.
> GCC basically receives a string containing assembly code. It does not
> attempt to understand it, it just pastes it in the .s file verbatim.
> This means, you can support any architecture without any additional work...
> you just type the appropriate architectures asm in your program and it's
> fine... but now if we want to perform pseudo-register assignment, or
> parameter substitution, we need a front end that parses the D asm
> expressions, and generated a valid asm string for GCC.. It can't generate
> that string without detailed knowledge of the architecture its targeting,
> and it's not feasible to implement that support for all the architectures
> GCC supports.
>
So the argument here is that intrinsics in D can be mapped more easily
to existing intrinsics in GCC?
I do understand that this will be pretty difficult for GDC
to implement.
Reminds me that Walter has stated several times how much
better an internal assembler can integrate with the language.

> Even after all that, It's still not ideal.. Inline asm reduces the ability
> of the compiler to perform many optimisations.
>
> Consider this common situation and the code that will be built around it:
>>>
>>> __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //
>>>
>> Such is really not a good idea if the bit pattern of packedColour is a
>> denormal.
>> How can you even execute a single useful command on the floats here?
>>
>> Also mixing integer and FP instructions on the same register may
>> cause performance degradation. The registers are indeed typed CPU
>> internally.
>
>
> It's a very good idea; I am saving memory, and also saving memory
> accesses.
>
> This leads back to the point in my OP where I said that most games
> programmers turn NaN, Den, and FP exceptions off.
> As I've also raised before, most vectors are actually float[3]'s, W is
> usually ignored and contains rubbish.
> It's conventional to stash some 32bit value in the W to fill the otherwise
> wasted space, and also get the load for free alongside the position.
>
> The typical program flow, in this case:
>   * the colour will be copied out into a separate register where it will be
> reinterpreted as a uint, and have an unpack process applied to it.
>   * XYZ will then be used to perform maths, ignoring W, which will continue
> to accumulate rubbish values... it doesn't matter, all FP exceptions and
> such are disabled.

Putting the uint in the front slot would make your life simpler then:
only a MOVD, no unpacking :).
January 06, 2012
On 6 January 2012 21:21, Walter Bright <newshound2@digitalmars.com> wrote:

> 1. the language does typechecking, for example, trying to add a vector of 4 floats to 16 bytes would be (and should be) an error.
>

I want to sell you on the 'primitive SIMD regs are truly typeless' point.
(which I thought you had already agreed with) :)

Here are some examples of tight interaction between int and float, and of
operating ON floats with int operations...
Naturally the examples I present will be wrapped as useful functions in
libraries, but the primitive type shouldn't try and make this more annoying
by trying to enforce pointless type safety errors like you seem to be
suggesting.

In computer graphics it's common to work with float16's, a type not
natively supported by SIMD units. Pack/unpack code involves detailed
float/int interaction.
You might take a register of floats, then mask the exponent and then
perform integer arithmetic on the exponent to shift it into the float16
exponent range... then you will mask the bottom of the mantissa and shift
them into place.
Unpacking is the same process in reverse.

Other tricks involve the float sign bits: you can make everything negative
by or-ing 1's into the top bits, or gather the signs using various
techniques... useful for identifying the cell in a quad-tree, for instance.
Integer manipulation of floats is surprisingly common.


January 06, 2012
On Fri, 06 Jan 2012 21:16:40 +0100, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/6/2012 11:53 AM, Manu wrote:
>> ... this sounds bad. Shall I start another thread? ;)
>> So you're saying it's impossible to align a stack based buffer to, say, 128
>> bytes... ?
>
> No, it's not impossible. Here's what you can do now:
>
> char[128+127] buf;
> char* pbuf = cast(char*)((cast(size_t)buf.ptr + 127) & ~127);
>
> and now pbuf points to 128 bytes, aligned, on the stack.
>
>
>> Hopefully the work you do to support 16-byte alignment on x86 will also support
>> arbitrary alignment of any buffer...
>> Will arbitrary alignment be supported on x64?
>
> Aligning to non-powers of 2 will never work. As for other alignments, they only will work if the underlying storage is aligned to that or greater. Otherwise, you'll have to resort to the method outlined above.
>
>
>> What about GCC? Will/does it support arbitrary alignment?
>
> Don't know about gcc.

Only recently (4.6 I think).
January 06, 2012
On 01/06/12 20:53, Manu wrote:
> What about GCC? Will/does it support arbitrary alignment?

For sane "arbitrary" values (i.e. powers of two) it looks like this:

--------------------------------
import std.stdio;
struct S { align(65536) ubyte[64] bs; alias bs this; }

pragma(attribute, noinline) void f(ref S s) { s[2] = 42; }

void main(string[] argv) {
  S s = void;
  f(s);
  writeln(s.ptr);
}
---------------------------------

turns into:

---------------------------------
 804ae40:       55                      push   %ebp
 804ae41:       89 e5                   mov    %esp,%ebp
 804ae43:       66 bc 00 00             mov    $0x0,%sp
 804ae47:       81 ec 00 00 01 00       sub    $0x10000,%esp
 804ae4d:       89 e0                   mov    %esp,%eax
 804ae4f:       e8 2c 0e 00 00          call   804bc80 <void align.f(ref align.S)>
 804ae54:       89 e0                   mov    %esp,%eax
 804ae56:       e8 c5 ff ff ff          call   804ae20 <void std.stdio.writeln!(ubyte*).writeln(ubyte*).2084>
 804ae5b:       31 c0                   xor    %eax,%eax
 804ae5d:       c9                      leave
 804ae5e:       c3                      ret
 804ae5f:       90                      nop
---------------------------------

specifying a more sane alignment of 64 gives:

---------------------------------
0804ae40 <_Dmain>:
 804ae40:       55                      push   %ebp
 804ae41:       89 e5                   mov    %esp,%ebp
 804ae43:       83 e4 c0                and    $0xffffffc0,%esp
 804ae46:       83 ec 40                sub    $0x40,%esp
 804ae49:       89 e0                   mov    %esp,%eax
 804ae4b:       e8 30 0e 00 00          call   804bc80 <void align.f(ref align.S)>
 804ae50:       89 e0                   mov    %esp,%eax
 804ae52:       e8 c9 ff ff ff          call   804ae20 <void std.stdio.writeln!(ubyte*).writeln(ubyte*).2084>
 804ae57:       31 c0                   xor    %eax,%eax
 804ae59:       c9                      leave
 804ae5a:       c3                      ret
---------------------------------