On 16 January 2012 19:01, Timon Gehr <timon.gehr@gmx.ch> wrote:
On 01/16/2012 05:59 PM, Manu wrote:
On 16 January 2012 18:48, Andrei Alexandrescu
<SeeWebsiteForEmail@erdani.org> wrote:

   On 1/16/12 10:46 AM, Manu wrote:

       A function using float arrays and a function using hardware vectors
       should certainly not be the same speed.


   My point was that the version using float arrays should
   opportunistically use hardware ops whenever possible.


I think this is a mistake, because such a piece of code never exists
outside of some context. If the context it exists within is all FPU code
(and it is: it's a float array), then swapping between the FPU and SIMD
execution units will probably make the function slower than the
original (and the float array is unaligned, too). The SIMD version,
however, must exist within a SIMD context; since the API can't implicitly
interact with floats, this guarantees that each function's context
matches the one it lives in.
This is fundamental to fast vector performance. Using SIMD is an
all-or-nothing decision; you can't just mix it in here and there.
You don't go casting back and forth between floats and ints on every
other line... obviously it's imprecise, but it's also a major
performance hazard. There is no difference here, except that the
performance hazard is much worse.
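
To make the context argument concrete, here is a minimal sketch in C with
SSE intrinsics (illustrative only; the function names are hypothetical and
this is not any proposed D API). A function taking a plain float array gets
no alignment guarantee and is called from scalar code, so an
"opportunistic" vector version pays for unaligned loads and for moving
values between the scalar and SIMD contexts. A function with a vector-typed
signature pushes both the alignment and the SIMD context onto the caller:

    #include <xmmintrin.h>

    /* scale_floats: takes an ordinary float array. No alignment is
     * guaranteed and callers are scalar code, so an opportunistic
     * SIMD rewrite needs unaligned loads and hands its results
     * straight back into scalar context. */
    void scale_floats(float *a, int n, float s)
    {
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }

    /* scale_vecs: the __m128 parameter type guarantees 16-byte
     * alignment and means the caller is already working with
     * vectors, so everything stays in the XMM register file. */
    void scale_vecs(__m128 *a, int n, __m128 s)
    {
        for (int i = 0; i < n; ++i)
            a[i] = _mm_mul_ps(a[i], s);
    }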

I think DMD now uses XMM registers for scalar floating point arithmetic on x86_64.

x64 can do the swapping with no penalty, but it is the only architecture that can. So it might be a viable optimisation, but only for x64 codegen, which means any tech to detect and apply it should live in the back end, not in the front end as a higher-level semantic.
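
For illustration, a hedged sketch of why the mix is free on x64 and not
elsewhere (C with SSE intrinsics again; this shows typical codegen
behaviour, not actual DMD output). On x86_64, scalar float arithmetic
already compiles to scalar SSE instructions (mulss and friends) operating
on XMM registers, the same register file the vector ops use; on a 32-bit
x87 target, the scalar result lives in the FPU stack and must take a round
trip through memory to reach an XMM register:

    #include <xmmintrin.h>

    /* On x86_64, x * y compiles to mulss in an XMM register, the
     * same register file _mm_mul_ps operates on, so mixing scalar
     * and vector code here costs nothing. On an x87 target the
     * product sits in the FPU stack, and feeding it to the SSE
     * multiply forces a store/reload through memory. */
    float scalar_scale(float x, float y)
    {
        return x * y;
    }

    __m128 mixed(__m128 v, float x, float y)
    {
        float s = scalar_scale(x, y);           /* scalar unit */
        return _mm_mul_ps(v, _mm_set1_ps(s));   /* vector unit */
    }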