optimized array operations (page 2) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Announce » optimized array operations (page 2)

September 23, 2008

Re: optimized array operations

Posted by Eugene Pelekhay
in reply to bearophile

Eugene Pelekhay

Posted in reply to bearophile

bearophile Wrote:

> Denis Koroskin:
> > New hardware - new results (-O -release -inline):
> > length=1
> > _add Time elapsed: 91.212/0 ticks OK
> > _sub Time elapsed: 89.8235/0 ticks OK
> > add Time elapsed: 82.6135/0 ticks OK
> > sub Time elapsed: 84.2032/0 ticks OK
> > mul Time elapsed: 82.8325/0 ticks OK
> > div Time elapsed: 99.2924/0 ticks OK
> ...
> 
> I don't understand how to read the timings, are the timings before the slash of the current implementation, and the timings after the slash of your new code?
> 
> Regarding your code, lines as:
> foreach (i; 0 .. 10_000_000) {
> Probably need to be rewritten as:
> for (int i; i < 10_000_000; i++) {
> 
> Because array operations are present in D1 too, and it may be better to write code that works on both D1/D2 if possible (so Walter has less code differences to manage).
> 
> Bye,
> bearophile


timings shown per implementation for aligned/unaligned data. name of function with underscore it's default/current implementation without new optimized version

regarding rewriting loops, may be You right, I didn't checked if it even compiled by D 1.0, however major code differcence will be in benchmark function wich is not intended for inclusion in librariy

September 23, 2008

Re: optimized array operations

Posted by Don
in reply to Eugene Pelekhay

Don

Posted in reply to Eugene Pelekhay

Eugene Pelekhay wrote:
> I'm finished optimized version of array operations, using SSE2 instructions.

Good work. Note that your code will actually work better for floats (using SSE) than with SSE2.

As far as I can tell, X87 code may actually be faster for the unaligned case.

Comparing x87 code

fld mem
fadd mem
fstp mem

with SSE code

movapd/movupd  reg, mem
addpd reg, mem
movapd/movupd mem, reg

On all CPUs below the x87 code takes 3uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:

CPU        aligned  unaligned
PentiumM   6         14
Core2      3         14
AMD K8     6         11
AMD K10    4         5

(AMD K7 is the same as K8, except doesn't have SSE2).

Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.

Theoretical conclusion: Don't assume SSE is always faster!

The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).

September 23, 2008

Re: optimized array operations

Posted by bearophile
in reply to Eugene Pelekhay

bearophile

Posted in reply to Eugene Pelekhay

Eugene Pelekhay:
> however major code differcence will be in benchmark function wich is not intended for inclusion in librariy

In my libs I keep the tidy benchmark code too (and few timing tables too), because it's part of the performance benchmarking of the code, so it completes the unittests, spotting performance problems, etc.

Bye,
bearophile

September 23, 2008

Re: optimized array operations

Posted by Hoenir
in reply to Eugene Pelekhay

Hoenir

Posted in reply to Eugene Pelekhay

Eugene Pelekhay schrieb:
> Hoenir Wrote:
> 
>> Eugene Pelekhay schrieb:
>>> I'm finished optimized version of array operations, using SSE2 instructions.
>> Shouldn't this at line 446
>>    else
>>    {
>>      NoSEE2();
>>    }
>>
>> be NoSSE2()?
> 
> No, If we have no inline asm we have noting to do

I meant the spelling. you wrote NoSEE2

September 23, 2008

Re: optimized array operations

Posted by Eugene Pelekhay
in reply to Hoenir

Eugene Pelekhay

Posted in reply to Hoenir

Hoenir Wrote:

> Eugene Pelekhay schrieb:
> > Hoenir Wrote:
> > 
> >> Eugene Pelekhay schrieb:
> >>> I'm finished optimized version of array operations, using SSE2 instructions.
> >> Shouldn't this at line 446
> >>    else
> >>    {
> >>      NoSEE2();
> >>    }
> >>
> >> be NoSSE2()?
> > 
> > No, If we have no inline asm we have noting to do
> 
> I meant the spelling. you wrote NoSEE2

Sorry, yes you right

September 24, 2008

Re: optimized array operations

Posted by Eugene Pelekhay
in reply to Don

Eugene Pelekhay

Posted in reply to Don

Don Wrote:

> Eugene Pelekhay wrote:
> > I'm finished optimized version of array operations, using SSE2 instructions.
> 
> Good work. Note that your code will actually work better for floats (using SSE) than with SSE2.
> 
> As far as I can tell, X87 code may actually be faster for the unaligned case.
> 
> Comparing x87 code
> 
> fld mem
> fadd mem
> fstp mem
> 
> with SSE code
> 
> movapd/movupd  reg, mem
> addpd reg, mem
> movapd/movupd mem, reg
> 
> On all CPUs below the x87 code takes 3uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:
> 
> CPU        aligned  unaligned
> PentiumM   6         14
> Core2      3         14
> AMD K8     6         11
> AMD K10    4         5
> 
> (AMD K7 is the same as K8, except doesn't have SSE2).
> 
> Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.
> 
> Theoretical conclusion: Don't assume SSE is always faster!
> 
> The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).

Thanks for advise, I'll try to improve it. Actualy I not used assembler 7 years and my knowledge is a bit outdated.

September 25, 2008

Re: optimized array operations

Posted by Jb
in reply to Eugene Pelekhay

Jb

Posted in reply to Eugene Pelekhay

"Eugene Pelekhay" <pelekhay@gmail.com> wrote in message news:gbeaf0$g28$1@digitalmars.com...
> Don Wrote:
>
>> Eugene Pelekhay wrote:
>> > I'm finished optimized version of array operations, using SSE2 instructions.
>>
>> Good work. Note that your code will actually work better for floats
>> (using SSE) than with SSE2.
>>
>> As far as I can tell, X87 code may actually be faster for the unaligned case.
>>
>> Comparing x87 code
>>
>> fld mem
>> fadd mem
>> fstp mem
>>
>> with SSE code
>>
>> movapd/movupd  reg, mem
>> addpd reg, mem
>> movapd/movupd mem, reg
>>
>> On all CPUs below the x87 code takes 3uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:
>>
>> CPU        aligned  unaligned
>> PentiumM   6         14
>> Core2      3         14
>> AMD K8     6         11
>> AMD K10    4         5
>>
>> (AMD K7 is the same as K8, except doesn't have SSE2).
>>
>> Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.
>>
>> Theoretical conclusion: Don't assume SSE is always faster!
>>
>> The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).
>
> Thanks for advise, I'll try to improve it. Actualy I not used assembler 7 years and my knowledge is a bit outdated.

If you are doing unaligned memory acesses it's actualy faster to do this..

MOVLPS    XMM0,[address]
MOVHPS   XMM0,[address+8]

Than it is to do

MOVUPS  XMM0,[address]

The reason being that (on almost all but a very latest chips) SSE ops are actualy split into 2 64 bit ops. So the former code actualy works out a lot faster.

Also, unaligned loads are a whole lot quicker than unaligned stores. 2 or 3 times faster IIRC. So the best method is bend over backwards to get your writes aligned.

September 26, 2008

Re: optimized array operations

Posted by Eugene Pelekhay
in reply to Jb

Eugene Pelekhay

Posted in reply to Jb

Attachments:

arrayOfDouble.d

Jb Wrote:

> If you are doing unaligned memory acesses it's actualy faster to do this..
> 
> MOVLPS    XMM0,[address]
> MOVHPS   XMM0,[address+8]
> 
> Than it is to do
> 
> MOVUPS  XMM0,[address]
> 
> The reason being that (on almost all but a very latest chips) SSE ops are actualy split into 2 64 bit ops. So the former code actualy works out a lot faster.
> 
> Also, unaligned loads are a whole lot quicker than unaligned stores. 2 or 3 times faster IIRC. So the best method is bend over backwards to get your writes aligned.
> 

Thanks, I'll check this way too.
Meanwile can anybody test new version on other systems, I implemented operations for unaligned case by x87 instructions and my benchamrc show that it works much slower then SSE2 version. This means that Don's theory wrong or I having unusual Pentium-M or I have bad x87 code.

September 26, 2008

Re: optimized array operations

Posted by Don
in reply to Eugene Pelekhay

Don

Posted in reply to Eugene Pelekhay

Eugene Pelekhay wrote:
> Jb Wrote:
> 
>> If you are doing unaligned memory acesses it's actualy faster to do this..
>>
>> MOVLPS    XMM0,[address]
>> MOVHPS   XMM0,[address+8]
>>
>> Than it is to do
>>
>> MOVUPS  XMM0,[address]
>>
>> The reason being that (on almost all but a very latest chips) SSE ops are actualy split into 2 64 bit ops. So the former code actualy works out a lot faster.
>>
>> Also, unaligned loads are a whole lot quicker than unaligned stores. 2 or 3 times faster IIRC. So the best method is bend over backwards to get your writes aligned.
>>
> 
> Thanks, I'll check this way too.
> Meanwile can anybody test new version on other systems, I implemented operations for unaligned case by x87 instructions and my benchamrc show that it works much slower then SSE2 version. This means that Don's theory wrong or I having unusual Pentium-M or I have bad x87 code.
You have way too many indexing operations. Also, unrolling by 8 makes the code so big that you probably get limited by instruction decoding.
The whole loop can be reduced to something like (not tested):

// EAX=length.
// count UP from -length
 lea EDX, [EDX + 8*EAX];
 lea EDI, [EDI + 8*EAX];
 lea ESI, [ESI + 8*EAX];
 neg EAX;
start:
 fld dword ptr [EDX+8*EAX];
 fadd dword ptr [ESI+8*EAX];
 fstp dword ptr [EDI+8*EAX];
 add EAX, 1;
 jnz start;

There are 5 fused uops in the loop. Every instruction is 1 uop, so decoding is not a bottleneck.
There are two memory loads per loop (execution unit p2), one store (p3), add EAX uses p0 or p1, jnz uses p1, fadd uses p0 or p1. Since Pentium M can do 3uops per clock as long as they're in different units, the best case would be two clocks per loop.
Loop unrolling _might_ be necessary to get it to schedule the instructions correctly, but otherwise it's unhelpful.

On PentiumM there's a bug which means it keeps trying to do two fadds at once, even though it only has one FADD execution unit. So one keeps getting stalled, so it probably won't be as fast as it should be. Sometimes you can fix that by moving the add EAX above the store, or above the fadd. On Core2 you should get 2 clocks per iteration without any trouble.

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation