May 31, 2013
On Friday, 31 May 2013 at 16:17:28 UTC, Ali Çehreli wrote:
> On 05/31/2013 04:28 AM, Shriramana Sharma wrote:
>
> > On Fri, May 31, 2013 at 4:31 PM, Timon Gehr
> <timon.gehr@gmx.ch> wrote:
> >>
> >> If double uses xmm registers and real uses the fpu registers
> (as is standard
> >> on x64), then double multiplication has twice the throughput
> of real
> >> multiplication on recent intel microarchitectures.
> >
> > Hi can you clarify that? I'm interested because I'm running a
> 64 bit
> > system. What does twice the throughput mean? double is faster?
>
> I am interested in the answer too.

int, long, float, double etc. are all of size 2^n. That means they can be neatly packed into SIMD registers (e.g. XMM0-XMM15 on x86_64) and can be operated on in parallel, providing very large speedups in some cases. real, on the other hand, is 80 bits on x86-64 and as such doesn't fit into the same scheme.

There are also differences to do with calling conventions that can make different types slower/faster, but this is very dependent on the exact usage and normally only significantly affects small functions that can't be inlined, IME.
May 31, 2013
On 05/31/2013 01:28 PM, Shriramana Sharma wrote:
> On Fri, May 31, 2013 at 4:31 PM, Timon Gehr <timon.gehr@gmx.ch> wrote:
>>
>> If double uses xmm registers and real uses the fpu registers (as is standard
>> on x64), then double multiplication has twice the throughput of real
>> multiplication on recent intel microarchitectures.
>
> Hi can you clarify that? I'm interested because I'm running a 64 bit
> system. What does twice the throughput mean? double is faster?
>

Depends. Two useful numbers to classify performance characteristics of machine instructions are latency and reciprocal throughput.

Modern out-of-order processors are pipelined. I.e. instructions may take multiple cycles to complete, and multiple instructions may run through different stages of the pipeline at the same time.

Latency: The time taken from the point where all inputs are available to the point where all outputs are available.

Reciprocal throughput: The minimum delay between the start of two instructions of the same kind.

Multiplying doubles in an xmm register has latency 5 and reciprocal throughput 1 (on recent Intel microarchitectures). Multiplying 'reals' in an fpu register has latency 5 and reciprocal throughput 2.


Therefore, doubles allow more instruction-level parallelism (ILP). However, if you have, e.g., a computation like this one:

b = a*b*c*d;

Then there will not be a difference in runtime, since every multiplication depends on the result of the previous one.

On the other hand, if you reassociate the expression as follows:


b = (a*b)*(c*d);

Then double will be one cycle faster, since the second mult can be started one cycle earlier, and hence the third one can also start one cycle earlier.
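The two forms side by side, as plain C functions (names made up for this sketch):

```c
/* The two expressions from the post. Mathematically equal, but the
   second exposes more instruction-level parallelism: a*b and c*d
   share no dependency, so they can be in flight at the same time. */

static double chained(double a, double b, double c, double d)
{
    return a * b * c * d;        /* ((a*b)*c)*d: three dependent muls */
}

static double reassociated(double a, double b, double c, double d)
{
    return (a * b) * (c * d);    /* a*b and c*d can overlap */
}
```

Note that for general inputs the two can round differently (floating-point multiplication is not associative), which is why a compiler won't reassociate like this on its own without something like -ffast-math.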


If you are interested, more information is available here:

http://agner.org/optimize/#manuals


May 31, 2013
On 05/31/2013 09:08 PM, John Colvin wrote:
> On Friday, 31 May 2013 at 16:17:28 UTC, Ali Çehreli wrote:
>> On 05/31/2013 04:28 AM, Shriramana Sharma wrote:
>>
>> > On Fri, May 31, 2013 at 4:31 PM, Timon Gehr
>> <timon.gehr@gmx.ch> wrote:
>> >>
>> >> If double uses xmm registers and real uses the fpu registers
>> (as is standard
>> >> on x64), then double multiplication has twice the throughput
>> of real
>> >> multiplication on recent intel microarchitectures.
>> >
>> > Hi can you clarify that? I'm interested because I'm running a
>> 64 bit
>> > system. What does twice the throughput mean? double is faster?
>>
>> I am interested in the answer too.
>
> int, long, float, double etc. are all of size 2^n. That means they can
> be neatly packed into SIMD registers (e.g. XMM0-XMM15 on x86_64) and
> can be operated on in parallel, providing very large speedups in some
> cases. real, on the other hand, is 80 bits on x86-64 and as such
> doesn't fit into the same scheme.
>

What I was talking about also holds for scalar code without usage of any packed instructions.

> There are also differences to do with calling conventions that can make
> different types slower/faster, but this is very dependant on the exact
> usage and normally only significantly affects small functions that can't
> be inlined IME.
