January 16, 2012
On 01/16/2012 05:59 PM, Manu wrote:
> On 16 January 2012 18:48, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>     On 1/16/12 10:46 AM, Manu wrote:
>
>         A function using float arrays and a function using hardware vectors
>         should certainly not be the same speed.
>
>
>     My point was that the version using float arrays should
>     opportunistically use hardware ops whenever possible.
>
>
> I think this is a mistake, because such a piece of code never exists
> outside of some context. If the context it exists within is all FPU code
> (and it is, it's a float array), then swapping between FPU and SIMD
> execution units will probably result in the function being slower than
> the original (also the float array is unaligned). The SIMD version
> however must exist within a SIMD context, since the API can't implicitly
> interact with floats, this guarantees that the context of each function
> matches that within which it lives.
> This is fundamental to fast vector performance. Using SIMD is an all or
> nothing decision, you can't just mix it in here and there.
> You don't go casting back and forth between floats and ints on every
> other line... obviously it's imprecise, but it's also a major
> performance hazard. There is no difference here, except the performance
> hazard is much worse.

I think DMD now uses XMM registers for scalar floating point arithmetic on x86_64.
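
A minimal sketch of the two styles under discussion, assuming core.simd's float4 (both function names are hypothetical):

   import core.simd;

   // Scalar version: element-wise ops on a float[4], which is only
   // guaranteed 4-byte alignment.
   float[4] scaleScalar(float[4] v, float s)
   {
       foreach (i; 0 .. 4)
           v[i] *= s;
       return v;
   }

   // SIMD version: float4 is 16-byte aligned and the multiply is a
   // single vector op, so callers stay in the SIMD execution domain.
   float4 scaleSimd(float4 v, float s)
   {
       float4 sv = s;   // splat the scalar across all four lanes
       return v * sv;
   }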
January 16, 2012
On 16 January 2012 19:01, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 01/16/2012 05:59 PM, Manu wrote:
>
>> On 16 January 2012 18:48, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> wrote:
>>
>>    On 1/16/12 10:46 AM, Manu wrote:
>>
>>        A function using float arrays and a function using hardware vectors
>>        should certainly not be the same speed.
>>
>>
>>    My point was that the version using float arrays should
>>    opportunistically use hardware ops whenever possible.
>>
>>
>> I think this is a mistake, because such a piece of code never exists
>> outside of some context. If the context it exists within is all FPU code
>> (and it is, it's a float array), then swapping between FPU and SIMD
>> execution units will probably result in the function being slower than
>> the original (also the float array is unaligned). The SIMD version
>> however must exist within a SIMD context, since the API can't implicitly
>> interact with floats, this guarantees that the context of each function
>> matches that within which it lives.
>> This is fundamental to fast vector performance. Using SIMD is an all or
>> nothing decision, you can't just mix it in here and there.
>> You don't go casting back and forth between floats and ints on every
>> other line... obviously it's imprecise, but it's also a major
>> performance hazard. There is no difference here, except the performance
>> hazard is much worse.
>>
>
> I think DMD now uses XMM registers for scalar floating point arithmetic on x86_64.
>

x64 can do the swapping too with no penalty, but it is the only architecture that can. So it might be a viable optimisation, but only for x64 codegen, which means any logic to detect and apply the optimisation should live in the back end, not in the front end as a higher-level semantic.


January 16, 2012
On 2012-01-16 16:59:44 +0000, Manu <turkeyman@gmail.com> said:

> 
> On 16 January 2012 18:48, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> On 1/16/12 10:46 AM, Manu wrote:
>> 
>>> A function using float arrays and a function using hardware vectors
>>> should certainly not be the same speed.
>> 
>> My point was that the version using float arrays should opportunistically
>> use hardware ops whenever possible.
> 
> I think this is a mistake, because such a piece of code never exists
> outside of some context. If the context it exists within is all FPU code
> (and it is, it's a float array), then swapping between FPU and SIMD
> execution units will probably result in the function being slower than the
> original (also the float array is unaligned). The SIMD version however must
> exist within a SIMD context, since the API can't implicitly interact with
> floats, this guarantees that the context of each function matches that
> within which it lives.
> This is fundamental to fast vector performance. Using SIMD is an all or
> nothing decision, you can't just mix it in here and there.
> You don't go casting back and forth between floats and ints on every other
> line... obviously it's imprecise, but it's also a major performance hazard.
> There is no difference here, except the performance hazard is much worse.

Andrei's idea could be valid as an optimization when the compiler can see that all the operations can be performed with SIMD ops. In this particular case: if test1a(a) is inlined. But it can't work if the float[4] value crosses a function's boundary.

Or instead, the optimization could be performed at the semantic level, like this: try to change the type of a float[4] variable to float4, and if it still compiles, use that instead. So if you have the same function working with a float[4] and a float4, and if all the functions you call on a given variable support float4, it'll go for float4. But doing that at the semantic level would be rather messy, not counting the combinatorial explosion when multiple variables are at play.
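
A rough sketch of the substitution being described, assuming core.simd and a hypothetical pair of test1a overloads:

   import core.simd;

   // Hypothetical overload pair: the substitution would retype a
   // float[4] variable as float4 whenever every operation applied to
   // it also has a float4 overload available.
   float[4] test1a(float[4] v) { v[] *= 2; return v; }
   float4   test1a(float4 v)   { float4 two = 2.0f; return v * two; }

   void caller()
   {
       float[4] a = [1, 2, 3, 4];
       a = test1a(a);   // every use of `a` also compiles as float4,
                        // so `a` could in principle be retyped
   }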


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 16, 2012
On 1/16/12 11:32 AM, Michel Fortin wrote:
> Andrei's idea could be valid as an optimization when the compiler can
> see that all the operations can be performed with SIMD ops. In this
> particular case: if test1a(a) is inlined. But it can't work if the
> float[4] value crosses a function's boundary.

In this case it's the exact contrary: the float[4] and the operation are both local to the function. So it all depends on the inlining of the dummy functions that follow. No?


Andrei
January 16, 2012
On 1/16/2012 8:48 AM, Andrei Alexandrescu wrote:
> My point was that the version using float arrays should opportunistically use
> hardware ops whenever possible.

Yes, you're right. The compiler can opportunistically convert a number of vector operations on static arrays to SIMD instructions.

Now that the basics are there, there are many, many opportunities to improve the code generation. Even for things like:

  int i,j;
  i *= 3;
  foo();
  j *= 3;

the two multiplies can be combined. Also, if operations on a particular integer variable are a subset that is supported by SIMD, that variable could be enregistered in an XMM register, instead of a GP register.
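
A hand-written sketch of the combined form (what such codegen might produce, not what DMD emits today; whether int4 multiply is accepted depends on the compiler and target, since it needs SSE4.1's pmulld on x86):

   import core.simd;

   void combined(ref int i, ref int j)
   {
       int4 v = 0;
       v.array[0] = i;   // pack both scalars into one vector
       v.array[1] = j;
       int4 three = 3;
       v *= three;       // one SIMD multiply covers both variables
       i = v.array[0];
       j = v.array[1];
   }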

But don't worry, I'm not planning on working on that at the moment :-)
January 16, 2012
On 1/16/2012 8:59 AM, Manu wrote:
> (also the float array is
> unaligned).

Currently, it is 4-byte aligned. But the compiler could align freestanding static arrays on 16 bytes without breaking anything. It just cannot align:

   struct S
   {
        int a;
        float[4] b;
   }

b on a 16-byte boundary, as that would break the ABI. Even worse,

   struct S
   {
       int a;
       char[16] s;
   }

can't be aligned on 16 bytes as that is a common "small string optimization".
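
A sketch of the distinction, assuming D's align attribute (the wrapper struct is hypothetical):

   // A freestanding static array can be wrapped to get 16-byte
   // alignment without affecting anyone else's layout:
   align(16) struct AlignedFloat4
   {
       float[4] data;
   }

   struct S
   {
       int a;        // offset 0
       float[4] b;   // offset 4; padding b out to a 16-byte boundary
                     // would change S's size and layout, which is the
                     // ABI break described above
   }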
January 16, 2012
On 1/16/2012 9:21 AM, Manu wrote:
> x64 can do the swapping too with no penalty, but that is the only architecture
> that can.

Ah, that is a crucial bit of information.
January 16, 2012
On 16 January 2012 18:59, Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/16/2012 8:48 AM, Andrei Alexandrescu wrote:
>>
>> My point was that the version using float arrays should opportunistically
>> use
>> hardware ops whenever possible.
>
>
> Yes, you're right. The compiler can opportunistically convert a number of vector operations on static arrays to SIMD instructions.
>
> Now that the basics are there, there are many, many opportunities to improve the code generation. Even for things like:
>
>  int i,j;
>  i *= 3;
>  foo();
>  j *= 3;
>
> the two multiplies can be combined. Also, if operations on a particular integer variable are a subset that is supported by SIMD, that variable could be enregistered in an XMM register, instead of a GP register.
>
> But don't worry, I'm not planning on working on that at the moment :-)

Leave that sort of optimisation for the backend to handle please. ;-)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
January 16, 2012
On 1/16/2012 11:16 AM, Iain Buclaw wrote:
>> But don't worry, I'm not planning on working on that at the moment :-)
> Leave that sort of optimisation for the backend to handle please. ;-)

Of course.

I suspect Intel's compiler does that one; does gcc?

January 16, 2012
On 2012-01-16 17:57:14 +0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/16/12 11:32 AM, Michel Fortin wrote:
>> Andrei's idea could be valid as an optimization when the compiler can
>> see that all the operations can be performed with SIMD ops. In this
>> particular case: if test1a(a) is inlined. But it can't work if the
>> float[4] value crosses a function's boundary.
> 
> In this case it's the exact contrary: the float[4] and the operation are both local to the function. So it all depends on the inlining of the dummy functions that follows. No?

That's exactly what I meant: if everything is local to the function, you might be able to optimize. In this particular case, if test1a(a) is inlined, everything is local.

But the current example has too much isolation for it to be meaningful. If you returned the result as a float[4], then the optimization wouldn't work. If you took an argument as a float[4] it probably wouldn't work either (depending on what you do with the argument). So I don't think it's an optimization you should count on very much.

In fact, the optimization I'd expect the compiler to do in this case is to just wipe out all the code, as it does nothing other than put a value in a local variable that is never read again.
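
A sketch of both points, with a hypothetical test1a:

   float[4] test1a(float[4] v) { v[] *= 2; return v; }  // hypothetical

   void noEscape()
   {
       float[4] a = [1, 2, 3, 4];
       a = test1a(a);    // `a` is never read again, so the whole body
                         // is dead code once test1a is inlined
   }

   float[4] escapes()
   {
       float[4] a = [1, 2, 3, 4];
       return test1a(a); // the float[4] crosses the function boundary,
                         // so the float4 substitution can't apply
   }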

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/