April 19, 2007 Re: summing large arrays - or - performance of tight loops

Posted in reply to Dave

> I think the "trust the compiler" philosophy is for stability (20 years
> old), not optimizations.
Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it.
-Craig
April 19, 2007 Re: summing large arrays - or - performance of tight loops

Posted in reply to Craig Black

Craig Black Wrote:
> > I think the "trust the compiler" philosophy is for stability (20 years
> > old), not optimizations.
>
> Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it.
>
> -Craig
I'm not going to change any foreach code over. I'm going to wait until someone in Walter's group fixes it. : )
April 19, 2007 Re: summing large arrays - or - performance of tight loops

Posted in reply to Craig Black

Craig Black wrote:
>> I think the "trust the compiler" philosophy is for stability (20 years old), not optimizations.
>
> Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it.
>
> -Craig
>
>
Sorry - I misunderstood you, and hadn't seen that documentation.
Thanks,
- Dave
April 19, 2007 Re: summing large arrays - or - performance of tight loops

Posted in reply to Dan

Dan wrote:
> Craig Black Wrote:
>
>>> I think the "trust the compiler" philosophy is for stability (20 years old), not optimizations.
>> Nope. Read the documentation on foreach. It says specifically that foreach is preferred over other looping mechanisms because the compiler will optimize it.
>>
>> -Craig
>
> I'm not going to change any foreach code over. I'm going to wait until someone in Walter's group fixes it. : )
>
You might want to try it out first before making any judgments anyhow. I tried a couple of the benchmark tests and 'foreach' performed the same as 'array', for example. Mine were run on a P4 chip and an AMD64 chip (original was Turion).
April 19, 2007 Re: summing large arrays - or - performance of tight loops

Posted in reply to Dave

Dave wrote
> Mine were run on a P4 chip and an AMD64 chip (original was
> Turion).
It is disgusting to see that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD 1.012.
Quotients (foreach/other):
Thomas: 200%
Dave: 100%
me: 85%
What causes those huge differences?
Here is my main:
import std.stdio, std.perf;

void main()
{
    const int f = 256;                    // scale factor: array size up, repetitions down
    int[] a = new int[1024 * 1024 * f];
    auto c  = new PerformanceCounter;
    auto c2 = new PerformanceCounter;

    c.start;
    for (int i = 1; i <= 5120 / f; i++)   // 5120/f repetitions keeps total work constant
        volatile sum_foreach(a);
    c.stop;
    volatile auto t = c.microseconds;

    c2.start;
    for (int i = 1; i <= 5120 / f; i++)
        volatile sum_array(a);
    c2.stop;

    writefln("%s %s %s"
        , cast(real) t / c2.microseconds  // the foreach/other quotient
        , t
        , c2.microseconds
        );
}
-manfred
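Manfred's main calls sum_foreach and sum_array without showing them; presumably they look something like this minimal sketch (compare the templated versions Dave posts further down):

// Sketch only: the two summing routines Manfred's main presumably uses.
int sum_foreach(int[] data)
{
    int result = 0;
    foreach (element; data)     // the loop form under discussion
        result += element;
    return result;
}

int sum_array(int[] data)
{
    int result = 0;
    for (size_t i = 0; i < data.length; i++)
        result += data[i];
    return result;
}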
April 19, 2007 Re: summing large arrays - or - performance of tight loops
Posted in reply to Manfred Nowak

Manfred Nowak wrote:
> Dave wrote
>> Mine were run on a P4 chip and an AMD64 chip (original was
>> Turion).
>
> It is disgusting to see that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD 1.012.
>
> Quotients (foreach/other):
> Thomas: 200%
> Dave: 100%
> me: 85%
>
> What causes those huge differences?
>
> Here is my main:

<snip>

> -manfred

I was using a simple 1024 x 1024 size array before.

Using a 1024 x 1024 x 64 array, I got:

P4: 97% (linux32 FC5)
AMD64: 92% (WinXP32)

So, the array size seems to make some difference, at least on AMD machines.

Here's all of the code I tested with:

import std.date, std.perf, std.stdio;

void main()
{
    int[] arr = new int[1024*1024*64];
    int result1 = 0, result2 = 0;

    // initialize with values that won't cause an integer overflow
    foreach(i, ref e; arr) e = i % 5 + 1;

    version(perftime)
    {
        auto t1 = new PerformanceCounter;
        t1.start;
        for(auto i = 0; i < 10; i++) result1 = sum_foreach(arr);
        t1.stop;

        auto t2 = new PerformanceCounter;
        t2.start;
        for(auto i = 0; i < 10; i++) result2 = sum_array(arr);
        t2.stop;

        writefln("%s %s %s %s %s"
            , cast(real)t1.microseconds/t2.microseconds
            , t1.microseconds
            , t2.microseconds
            , result1
            , result2
            );
    }
    else
    {
        d_time s1 = getUTCtime;
        for(auto i = 0; i < 10; i++) result1 = sum_foreach(arr);
        d_time e1 = getUTCtime;

        d_time s2 = getUTCtime;
        for(auto i = 0; i < 10; i++) result2 = sum_array(arr);
        d_time e2 = getUTCtime;

        writefln("%s %s %s %s %s"
            , cast(real)(e1-s1)/(e2-s2)
            , e1-s1
            , e2-s2
            , result1
            , result2
            );
    }
}

T sum_foreach(T)(T[] data)
{
    T result = 0;
    foreach(element; data)
    {
        result += element;
    }
    return result;
}

T sum_array(T)(T[] data)
{
    T result = 0;
    size_t index = 0;
    while(index < data.length)
    {
        result += data[index];
        index++;
    }
    return result;
}
April 23, 2007 Re: summing large arrays - or - performance of tight loops

Posted in reply to Dan

Dan wrote on 2007-04-18:
>
> Indeed, this is clearly an optimization problem. I agree that we should put our faith in the compiler because it's bad to undermine it to try to write optimal code when the compiler may still be improved to solve the problem *properly*
>
> That said, DMDScript undermines associative arrays by completely re-implementing them with a tweak - instead of using opIndex and opIndexAssign. : p
>
> But yes, let's work on getting the compiler to Do The Right Thing(tm) so we don't have to kludge our code, mm?
Originally I needed to find the fastest way to iterate repeatedly over large arrays and apply trivial operations to each element. I've posted the results only so that others can have a look too, not to prioritise compiler tweaks over bug fixing.
Thomas
April 23, 2007 Re: summing large arrays - or - performance of tight loops
Posted in reply to Dave

Dave wrote on 2007-04-19:
> Manfred Nowak wrote:
>> Dave wrote
>>> Mine were run on a P4 chip and an AMD64 chip (original was
>>> Turion).
>>
>> It is disgusting to see that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD 1.012.
>>
>> Quotients (foreach/other):
>> Thomas: 200%
>> Dave: 100%
>> me: 85%
>>
>> What causes those huge differences?
>>
>> Here is my main:

<snip>

> I was using a simple 1024 x 1024 size array before.
>
> Using a 1024 x 1024 x 64 array, I got:
>
> P4: 97% (linux32 FC5)
> AMD64: 92% (WinXP32)
>
> So, the array size seems to make some difference, at least on AMD machines.

The results strongly depend on the memory architecture and to a lesser extent on the element values. I've put an updated version online that contains results for byte, short, int, long, float and double.

The benchmarking was done as follows:

dmd sum.html -O -release -inline -version=benchmark
./sum 2> /dev/null

Thomas
April 23, 2007 Re: summing large arrays - or - performance of tight loops

Posted in reply to Thomas Kuehne

> > Using a 1024 x 1024 x 64 array, I got:
> >
> > P4: 97% (linux32 FC5)
> > AMD64: 92% (WinXP32)
> >
> > So, the array size seems to make some difference, at least on AMD machines.
>
> The results strongly depend on the memory architecture and to a lesser extent on the element values. I've put an updated version online that contains results for byte, short, int, long, float and double.
Actually, the size of the data type doesn't matter at all for a properly implemented algorithm - as a general rule, you implement a Duff's device to align and then use the largest-sized instruction you can fit. Right now the SSE instruction "movaps" is quite effective for copying memory.
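(For illustration, a rough sketch of the unrolling Dan describes - sum_unrolled4 is a made-up name, not code from this thread, and a literal Duff's device would jump into the middle of the unrolled body rather than peel the remainder first:)

// Sketch: peel off the non-multiple-of-four remainder, then sum four
// elements per iteration. Illustrative only; real Duff's-device or SSE
// code would go further, but the "align, then take the widest step"
// idea is the same.
int sum_unrolled4(int[] data)
{
    int result = 0;
    size_t i = 0;

    // handle the leftover elements first so the main loop runs in blocks of four
    for (; i < (data.length & 3); i++)
        result += data[i];

    // main loop: four additions per iteration
    for (; i < data.length; i += 4)
        result += data[i] + data[i+1] + data[i+2] + data[i+3];

    return result;
}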
Also, each operating system implements page tracking differently. Some do it by inserting some metadata onto the page itself. For that reason, 4 KB arrays can perform really well on some OSes, and very, very poorly on others (they take two pages when you think they're taking one, so cache and page misses screw you up).
For that reason, it can (very rarely, but occasionally) actually improve performance to *not* use a power-of-two array size, but something just short of it.
Sincerely,
Dan
April 23, 2007 Re: summing large arrays - or - performance of tight loops
Posted in reply to Thomas Kuehne

Thomas Kuehne wrote:
> Dave wrote on 2007-04-19:
>> Manfred Nowak wrote:
>>> Dave wrote
>>>> Mine were run on a P4 chip and an AMD64 chip (original was
>>>> Turion).
>>> It is disgusting to see that three random benchmarkers all get different results. Mine were run on an AMD64 X2 under WinXP32 and compiled with DMD 1.012.
>>>
>>> Quotients (foreach/other):
>>> Thomas: 200%
>>> Dave: 100%
>>> me: 85%
>>>
>>> What causes those huge differences?
>>>
>>> Here is my main:
>
> <snip>
>
>> I was using a simple 1024 x 1024 size array before.
>>
>> Using a 1024 x 1024 x 64 array, I got:
>>
>> P4: 97% (linux32 FC5)
>> AMD64: 92% (WinXP32)
>>
>> So, the array size seems to make some difference, at least on AMD machines.
>
> The results strongly depend on the memory architecture and to a lesser

Yea, with "...the array size seems to make some difference...", I was referring to slight differences between tests on my machines.

> extent on the element values. I've put an updated version online that
> contains results for byte, short, int, long, float and double.

I agree, I think the big (relative foreach/other) difference between machines is probably because of different cache sizes, or it could be alignment issues I guess. All this seems to indicate that the code generated for foreach is more sensitive to mem. arch. than the other loops.

That said, three of the four architectures spoken of in this thread had foreach perform as well as or better than the common while-loop anyway.

> The benchmarking was done as follows:
>
> dmd sum.html -O -release -inline -version=benchmark
> ./sum 2> /dev/null
>
> Thomas
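(One rough way to probe the cache-size hypothesis, sketched here rather than taken from the thread: time the same two loops over a range of array sizes, reusing the templated sum_foreach/sum_array from Dave's post above. The sizes and repetition counts are arbitrary assumptions.)

// Sketch only: time sum_foreach vs sum_array over growing array sizes to
// see where the foreach/while ratio changes (i.e. where cache effects kick in).
// Assumes the templated sum_foreach/sum_array from Dave's post are in scope.
import std.perf, std.stdio;

void main()
{
    const size_t total = 1024 * 1024 * 64;      // keep total work roughly constant
    for (size_t n = 1024; n <= total; n *= 8)
    {
        int[] data = new int[n];
        foreach (i, ref e; data)
            e = cast(int)(i % 5 + 1);           // small values, as in Dave's code

        int reps = cast(int)(total / n);
        int r1 = 0, r2 = 0;

        auto t1 = new PerformanceCounter;
        t1.start;
        for (int r = 0; r < reps; r++)
            r1 += sum_foreach(data);
        t1.stop;

        auto t2 = new PerformanceCounter;
        t2.start;
        for (int r = 0; r < reps; r++)
            r2 += sum_array(data);
        t2.stop;

        writefln("%s elements: foreach/while = %s  (sums: %s %s)",
                 n, cast(real)t1.microseconds / t2.microseconds, r1, r2);

        delete data;                            // keep memory use bounded across sizes
    }
}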