September 23, 2008
bearophile Wrote:

> Denis Koroskin:
> > New hardware - new results (-O -release -inline):
> > length=1
> > _add Time elapsed: 91.212/0 ticks OK
> > _sub Time elapsed: 89.8235/0 ticks OK
> > add Time elapsed: 82.6135/0 ticks OK
> > sub Time elapsed: 84.2032/0 ticks OK
> > mul Time elapsed: 82.8325/0 ticks OK
> > div Time elapsed: 99.2924/0 ticks OK
> ...
> 
> I don't understand how to read the timings, are the timings before the slash of the current implementation, and the timings after the slash of your new code?
> 
> Regarding your code, lines as:
> foreach (i; 0 .. 10_000_000) {
> Probably need to be rewritten as:
> for (int i; i < 10_000_000; i++) {
> 
> Because array operations are present in D1 too, and it may be better to write code that works on both D1/D2 if possible (so Walter has less code differences to manage).
> 
> Bye,
> bearophile


timings shown per implementation for aligned/unaligned data. name of function with underscore it's default/current implementation without new optimized version

regarding rewriting loops, may be You right, I didn't checked if it even compiled by D 1.0, however major code differcence will be in benchmark function wich is not intended for inclusion in librariy



September 23, 2008
Eugene Pelekhay wrote:
> I'm finished optimized version of array operations, using SSE2 instructions.

Good work. Note that your code will actually work better for floats (using SSE) than with SSE2.

As far as I can tell, X87 code may actually be faster for the unaligned case.

Comparing x87 code

fld mem
fadd mem
fstp mem

with SSE code

movapd/movupd  reg, mem
addpd reg, mem
movapd/movupd mem, reg

On all CPUs below the x87 code takes 3uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:

CPU        aligned  unaligned
PentiumM   6         14
Core2      3         14
AMD K8     6         11
AMD K10    4         5

(AMD K7 is the same as K8, except doesn't have SSE2).

Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.

Theoretical conclusion: Don't assume SSE is always faster!

The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).
September 23, 2008
Eugene Pelekhay:
> however major code differcence will be in benchmark function wich is not intended for inclusion in librariy

In my libs I keep the tidy benchmark code too (and few timing tables too), because it's part of the performance benchmarking of the code, so it completes the unittests, spotting performance problems, etc.

Bye,
bearophile
September 23, 2008
Eugene Pelekhay schrieb:
> Hoenir Wrote:
> 
>> Eugene Pelekhay schrieb:
>>> I'm finished optimized version of array operations, using SSE2 instructions.
>> Shouldn't this at line 446
>>    else
>>    {
>>      NoSEE2();
>>    }
>>
>> be NoSSE2()?
> 
> No, If we have no inline asm we have noting to do

I meant the spelling. you wrote NoSEE2
September 23, 2008
Hoenir Wrote:

> Eugene Pelekhay schrieb:
> > Hoenir Wrote:
> > 
> >> Eugene Pelekhay schrieb:
> >>> I'm finished optimized version of array operations, using SSE2 instructions.
> >> Shouldn't this at line 446
> >>    else
> >>    {
> >>      NoSEE2();
> >>    }
> >>
> >> be NoSSE2()?
> > 
> > No, If we have no inline asm we have noting to do
> 
> I meant the spelling. you wrote NoSEE2

Sorry, yes you right
September 24, 2008
Don Wrote:

> Eugene Pelekhay wrote:
> > I'm finished optimized version of array operations, using SSE2 instructions.
> 
> Good work. Note that your code will actually work better for floats (using SSE) than with SSE2.
> 
> As far as I can tell, X87 code may actually be faster for the unaligned case.
> 
> Comparing x87 code
> 
> fld mem
> fadd mem
> fstp mem
> 
> with SSE code
> 
> movapd/movupd  reg, mem
> addpd reg, mem
> movapd/movupd mem, reg
> 
> On all CPUs below the x87 code takes 3uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:
> 
> CPU        aligned  unaligned
> PentiumM   6         14
> Core2      3         14
> AMD K8     6         11
> AMD K10    4         5
> 
> (AMD K7 is the same as K8, except doesn't have SSE2).
> 
> Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.
> 
> Theoretical conclusion: Don't assume SSE is always faster!
> 
> The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).

Thanks for advise, I'll try to improve it. Actualy I not used assembler 7 years and my knowledge is a bit outdated.
September 25, 2008
"Eugene Pelekhay" <pelekhay@gmail.com> wrote in message news:gbeaf0$g28$1@digitalmars.com...
> Don Wrote:
>
>> Eugene Pelekhay wrote:
>> > I'm finished optimized version of array operations, using SSE2 instructions.
>>
>> Good work. Note that your code will actually work better for floats
>> (using SSE) than with SSE2.
>>
>> As far as I can tell, X87 code may actually be faster for the unaligned case.
>>
>> Comparing x87 code
>>
>> fld mem
>> fadd mem
>> fstp mem
>>
>> with SSE code
>>
>> movapd/movupd  reg, mem
>> addpd reg, mem
>> movapd/movupd mem, reg
>>
>> On all CPUs below the x87 code takes 3uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:
>>
>> CPU        aligned  unaligned
>> PentiumM   6         14
>> Core2      3         14
>> AMD K8     6         11
>> AMD K10    4         5
>>
>> (AMD K7 is the same as K8, except doesn't have SSE2).
>>
>> Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.
>>
>> Theoretical conclusion: Don't assume SSE is always faster!
>>
>> The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).
>
> Thanks for advise, I'll try to improve it. Actualy I not used assembler 7 years and my knowledge is a bit outdated.

If you are doing unaligned memory acesses it's actualy faster to do this..

MOVLPS    XMM0,[address]
MOVHPS   XMM0,[address+8]

Than it is to do

MOVUPS  XMM0,[address]

The reason being that (on almost all but a very latest chips) SSE ops are actualy split into 2 64 bit ops. So the former code actualy works out a lot faster.

Also, unaligned loads are a whole lot quicker than unaligned stores. 2 or 3 times faster IIRC. So the best method is bend over backwards to get your writes aligned.





September 26, 2008
Jb Wrote:

> If you are doing unaligned memory acesses it's actualy faster to do this..
> 
> MOVLPS    XMM0,[address]
> MOVHPS   XMM0,[address+8]
> 
> Than it is to do
> 
> MOVUPS  XMM0,[address]
> 
> The reason being that (on almost all but a very latest chips) SSE ops are actualy split into 2 64 bit ops. So the former code actualy works out a lot faster.
> 
> Also, unaligned loads are a whole lot quicker than unaligned stores. 2 or 3 times faster IIRC. So the best method is bend over backwards to get your writes aligned.
> 

Thanks, I'll check this way too.
Meanwile can anybody test new version on other systems, I implemented operations for unaligned case by x87 instructions and my benchamrc show that it works much slower then SSE2 version. This means that Don's theory wrong or I having unusual Pentium-M or I have bad x87 code.


September 26, 2008
Eugene Pelekhay wrote:
> Jb Wrote:
> 
>> If you are doing unaligned memory acesses it's actualy faster to do this..
>>
>> MOVLPS    XMM0,[address]
>> MOVHPS   XMM0,[address+8]
>>
>> Than it is to do
>>
>> MOVUPS  XMM0,[address]
>>
>> The reason being that (on almost all but a very latest chips) SSE ops are actualy split into 2 64 bit ops. So the former code actualy works out a lot faster.
>>
>> Also, unaligned loads are a whole lot quicker than unaligned stores. 2 or 3 times faster IIRC. So the best method is bend over backwards to get your writes aligned.
>>
> 
> Thanks, I'll check this way too.
> Meanwile can anybody test new version on other systems, I implemented operations for unaligned case by x87 instructions and my benchamrc show that it works much slower then SSE2 version. This means that Don's theory wrong or I having unusual Pentium-M or I have bad x87 code.
You have way too many indexing operations. Also, unrolling by 8 makes the code so big that you probably get limited by instruction decoding.
The whole loop can be reduced to something like (not tested):

// EAX=length.
// count UP from -length
 lea EDX, [EDX + 8*EAX];
 lea EDI, [EDI + 8*EAX];
 lea ESI, [ESI + 8*EAX];
 neg EAX;
start:
 fld dword ptr [EDX+8*EAX];
 fadd dword ptr [ESI+8*EAX];
 fstp dword ptr [EDI+8*EAX];
 add EAX, 1;
 jnz start;

There are 5 fused uops in the loop. Every instruction is 1 uop, so decoding is not a bottleneck.
There are two memory loads per loop (execution unit p2), one store (p3), add EAX uses p0 or p1, jnz uses p1, fadd uses p0 or p1. Since Pentium M can do 3uops per clock as long as they're in different units, the best case would be two clocks per loop.
Loop unrolling _might_ be necessary to get it to schedule the instructions correctly, but otherwise it's unhelpful.

On PentiumM there's a bug which means it keeps trying to do two fadds at once, even though it only has one FADD execution unit. So one keeps getting stalled, so it probably won't be as fast as it should be. Sometimes you can fix that by moving the add EAX above the store, or above the fadd. On Core2 you should get 2 clocks per iteration without any trouble.
1 2
Next ›   Last »