January 08, 2012
On 8/01/12 5:02 PM, Martin Nowak wrote:
> simdop will need more overloads, e.g. some
> instructions need immediate bytes.
> z = simdop(SHUFPS, x, y, 0);
>
> How about this:
> __v128 simdop(T...)(SIMD op, T args);

These don't make a lot of sense to return as value, e.g.

__v128 a, b;
a = simdop(movhlps, b); // ???

movhlps moves the top 64-bits of b into the bottom 64-bits of a. Can't be done as an expression like this.

Would make more sense to just write the instructions like they appear in asm:

simdop(movhlps, a, b);
simdop(addps, a, b);
etc.

The difference between this and inline asm would be:

1. Registers are automatically allocated.
2. Loads/stores are inserted when we spill to stack.
3. Instructions can be scheduled and optimised by the compiler.
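
For comparison, here's roughly what the same add looks like with D's inline assembler, where registers and moves are fixed by hand and nothing can be rescheduled (a rough sketch; operand details may vary):

void addps_by_hand(float* a, const(float)* b)
{
    asm
    {
        mov    EAX, a;        // registers picked by hand
        mov    ECX, b;
        movups XMM0, [EAX];   // loads written out explicitly
        movups XMM1, [ECX];
        addps  XMM0, XMM1;
        movups [EAX], XMM0;   // store written out explicitly
    }
}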

We could then extend this with user-defined types:

struct float4
{
  union
  {
     __v128 v;
     float[4] for_debugging;
  }

  float4 opBinary(string op:"+")(float4 rhs) @forceinline
  {
    __v128 result = v;
    simdop(addps, result, rhs.v);
    return float4(result);
  }
}
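
For illustration, using the wrapper would then look like this (just a sketch, relying on the hypothetical simdop/@forceinline pieces above):

float4 a, b;
float4 c = a + b;   // calls float4.opBinary!"+"; ideally ends up as a single addps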

We'd need a strong guarantee of inlining and removal of redundant loads/stores for this to work well, though. We'd also need a guarantee that float4 gets the same treatment as __v128, since the vector is effectively its only member.
January 08, 2012
On 8 January 2012 19:56, Peter Alexander <peter.alexander.au@gmail.com> wrote:

> These don't make a lot of sense to return as value, e.g.
>
> __v128 a, b;
> a = simdop(movhlps, b); // ???
>
> movhlps moves the top 64-bits of b into the bottom 64-bits of a. Can't be done as an expression like this.
>

The conventional way is to write it like this:
  r = simdop(movhlps, a, b);

This allows you to chain the functions together, i.e. passing the result of one call as an argument to the next.
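
For example (a sketch, using the same hypothetical simdop):

__v128 a, b, c;
__v128 r = simdop(addps, simdop(movhlps, a, b), c); // result feeds straight into the next op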


January 08, 2012
On Sun, 08 Jan 2012 18:56:04 +0100, Peter Alexander <peter.alexander.au@gmail.com> wrote:

> On 8/01/12 5:02 PM, Martin Nowak wrote:
>> simdop will need more overloads, e.g. some
>> instructions need immediate bytes.
>> z = simdop(SHUFPS, x, y, 0);
>>
>> How about this:
>> __v128 simdop(T...)(SIMD op, T args);
>
> These don't make a lot of sense to return as value, e.g.
>
> __v128 a, b;
> a = simdop(movhlps, b); // ???
>
> movhlps moves the top 64-bits of b into the bottom 64-bits of a. Can't be done as an expression like this.
>
> Would make more sense to just write the instructions like they appear in asm:
>
> simdop(movhlps, a, b);
> simdop(addps, a, b);
> etc.
>
Yeah, I also thought of this. Having a copy as the default would
require the compiler to eliminate those copies again.

> The difference between this and inline asm would be:
>

> 1. Registers are automatically allocated.
See asm pseudo-registers.

> 2. Loads/stores are inserted when we spill to stack.
There are sequencing points before and after asm blocks.

> 3. Instructions can be scheduled and optimised by the compiler.
Optimization can be done at the IR level.
Scheduling is done after all code is emitted.
January 11, 2012
Trass3r wrote:
> On Friday, 6 January 2012 at 19:53:52 UTC, Manu wrote:
>> Iain should be able to expose the vector types in GDC,
>> and I can work from there, and hopefully even build an ARM/PPC
>> toolchain to experiment with the library in a cross platform environment.
>
> On Windoze? You're a masochist ^^

Windows 8 will support ARM. I hope that D will too.
January 11, 2012
I was rather under the impression that only the new HTML5 API would be available under Windows 8 ARM - that they were doing an iOS walled-garden type thing with it. If true, this could make things difficult...

On Wed, Jan 11, 2012 at 9:42 PM, Piotr Szturmaj <bncrbme@jadamspam.pl> wrote:

> Trass3r wrote:
>
>> On Friday, 6 January 2012 at 19:53:52 UTC, Manu wrote:
>>
>>> Iain should be able to expose the vector types in GDC,
>>> and I can work from there, and hopefully even build an ARM/PPC
>>> toolchain to experiment with the library in a cross platform environment.
>>>
>>
>> On Windoze? You're a masochist ^^
>>
>
> Windows 8 will support ARM. I hope that D will too.
>


January 11, 2012
Danni Coy wrote:
> I was rather under the impression that only the new HTML5 API would be
> available under Windows 8 ARM - that they were doing an iOS walled-garden
> type thing with it. If true, this could make things difficult...

http://www.microsoft.com/presspass/exec/ssinofsky/2011/09-13BUILD.mspx?rss_fdn=Custom

"[...] And you have your choice of world-class development tools and languages. JavaScript, C#, VB, C++, C, HTML, CSS, XAML, all for X86-64 and ARM.

This is an extremely important point: If you go and build your Metro style app in JavaScript and HTML, in C# or in XAML, that app will just run when there's ARM hardware available. So, you don’t have to worry about that. Just write your application in HTML5, JavaScript and C# and XAML and your application runs across all the hardware that Windows 8 supports. (Applause.)

And if you want to write native code, we're going to help you do that as well and make it so that you can cross-compile into the other platforms as well. So, full platform support with these Metro style applications."

It means Win8 ARM will be limited to Metro apps only, but you will be able to choose HTML/CSS/JS, .NET or native code.
January 11, 2012
On 11-01-2012 13:23, Piotr Szturmaj wrote:
> Danni Coy wrote:
>> I was rather under the impression that only the new HTML5 API would be
>> available under Windows 8 ARM - that they were doing an iOS walled-garden
>> type thing with it. If true, this could make things difficult...
>
> http://www.microsoft.com/presspass/exec/ssinofsky/2011/09-13BUILD.mspx?rss_fdn=Custom
>
>
> "[...] And you have your choice of world-class development tools and
> languages. JavaScript, C#, VB, C++, C, HTML, CSS, XAML, all for X86-64
> and ARM.
>
> This is an extremely important point: If you go and build your Metro
> style app in JavaScript and HTML, in C# or in XAML, that app will just
> run when there's ARM hardware available. So, you don’t have to worry
> about that. Just write your application in HTML5, JavaScript and C# and
> XAML and your application runs across all the hardware that Windows 8
> supports. (Applause.)
>
> And if you want to write native code, we're going to help you do that as
> well and make it so that you can cross-compile into the other platforms
> as well. So, full platform support with these Metro style applications."
>
> It means Win8 ARM will be limited to Metro apps only, but you will be
> able to choose HTML/CSS/JS, .NET or native code.

If they have ported the Common Language Runtime to ARM, I doubt they would put some arbitrary limitation on what apps can run on that hardware. All things considered, AArch32/64 are coming soon.

Besides, Windows running on ARM is not a new thing; see Windows Mobile and Windows Phone 7. By now, their ARM support should be as good as their x86 support.

- Alex
January 12, 2012
On 06.01.2012 02:42, Manu wrote:
> I like v128, or something like that. I'll use that for the sake of this
> document. I think it is preferable to float4 for a few reasons...

I do not agree at all. That way, the type loses all semantic information. This not only breaks with C/C++/D philosophy but actually *hides* an essential hardware detail on Intel SSE:

An SSE register is 128 bits wide, but the processor actually cares about the semantics of its content:

There are different instructions for loading two doubles, four singles, or integers into a register. They all load the same 128 bits from memory into the same register. However, the specs warn about a performance penalty when a register is loaded as one type and then used as another. I do not know the internals of the processor, but my understanding is that the CPU splits the floats into mantissa, exponent and sign already at the moment of loading, and has to drop that information when you reinterpret the bit pattern stored in the register.

A type v128 would not provide the necessary information for the compiler to produce the correct mov statements.

There definitely must be float4 and double2 types to express these semantics. For integers, I am not quite sure; I believe integer SSE instructions can be mixed more freely, so a single 128-bit type would be sufficient.

Considering these hardware details of the SSE architecture alone, I fear that portable low-level support for SIMD is very hard to achieve. If you want to offer access to the raw power of each architecture, it might be simpler to have machine-specific language extensions for SIMD and leave the portability for a wrapper library with a common front-end and various back-ends for the different architectures.
January 12, 2012
On 1/12/2012 12:13 PM, Norbert Nemec wrote:
> A type v128 would not provide the necessary information for the compiler to
> produce the correct mov statements.
>
> There definitely must be float4 and double2 types to express these semantics.
> For integers, I am not quite sure; I believe integer SSE instructions can be
> mixed more freely, so a single 128-bit type would be sufficient.
>
> Considering these hardware details of the SSE architecture alone, I fear that
> portable low-level support for SIMD is very hard to achieve. If you want to
> offer access to the raw power of each architecture, it might be simpler to have
> machine-specific language extensions for SIMD and leave the portability for a
> wrapper library with a common front-end and various back-ends for the different
> architectures.

That's what we're doing for D's SIMD support.

Although the syntax will support any vector type, the semantics will constrain it to what works for the target hardware. Manu has convinced me that emulating vector types that don't have hardware support is a very bad idea, because naive users will assume they're getting hardware performance, but in reality will get truly execrable performance.

Note that gcc does do the emulation for unsupported ops (like some of the multiplies). Take a gander at the code generated - instead of one instruction, it's a page of them. I think this will be an unwelcome surprise to the performance-minded vector programmer.

Note that explicit emulation will be possible, using D's general purpose vector syntax:

    a[] = b[] + c[];
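
For instance, a fully explicit version (no vector types involved, just static arrays) is a small sketch like this:

    float[4] b = [1, 2, 3, 4];
    float[4] c = [5, 6, 7, 8];
    float[4] a;
    a[] = b[] + c[];   // element-wise add; whether SIMD is used underneath is up to the compiler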
January 12, 2012
On 12/01/12 8:13 PM, Norbert Nemec wrote:
> Considering these hardware details of the SSE architecture alone, I fear
> that portable low-level support for SIMD is very hard to achieve. If you
> want to offer access to the raw power of each architecture, it might be
> simpler to have machine-specific language extensions for SIMD and leave
> the portability for a wrapper library with a common front-end and
> various back-ends for the different architectures.

You are right, but don't forget that the same is true of operations already in the language. For example, (1 << x) is a very slow operation on PPUs (it's micro-coded).

It's simply not possible to be portable and achieve maximum performance for every language feature, not just vectors. Algorithms must be tuned for specific architectures in version statements. However, you can get a decent baseline by providing the lowest common denominator in functionality. This v128 type (or whatever it will be called) does that.
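
For example, that per-architecture tuning might be structured like this (just a skeleton; the tuned bodies are placeholders and the version identifiers are only illustrative):

void blend(float[] dst, const(float)[] a, const(float)[] b)
{
    version (X86_64)
    {
        // SSE-tuned path would go here (e.g. via simdop/intrinsics)
        dst[] = a[] + b[];
    }
    else version (ARM)
    {
        // NEON-tuned path would go here
        dst[] = a[] + b[];
    }
    else
    {
        // portable baseline
        dst[] = a[] + b[];
    }
}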