Thread overview | |||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
May 14, 2010 Loop optimization | ||||
---|---|---|---|---|
| ||||
Hello, I was evaluating using D for some numerical stuff. However I was surprised to find that looping & array indexing was not very speedy compared to alternatives (gcc et al). I was using the DMD2 compiler on mac and windows, with -O -release. Here is a boiled down test case: void main (string[] args) { double [] foo = new double [cast(int)1e6]; for (int i=0;i<1e3;i++) { for (int j=0;j<1e6-1;j++) { foo[j]=foo[j]+foo[j+1]; } } } Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for your help. |
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to kai | On Fri, 14 May 2010 02:38:40 +0000, kai wrote: > Hello, > > I was evaluating using D for some numerical stuff. However I was surprised to find that looping & array indexing was not very speedy compared to alternatives (gcc et al). I was using the DMD2 compiler on mac and windows, with -O -release. Here is a boiled down test case: > > void main (string[] args) > { > double [] foo = new double [cast(int)1e6]; for (int i=0;i<1e3;i++) > { > for (int j=0;j<1e6-1;j++) > { > foo[j]=foo[j]+foo[j+1]; > } > } > } > > Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for your help. Two suggestions: 1. Have you tried the -noboundscheck compiler switch? Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch. 2. Can you use vector operations? If the example you gave is representative of your specific problem, then you can't because you are adding overlapping parts of the array. But if you are doing operations on separate arrays, then array operations will be *much* faster. http://www.digitalmars.com/d/2.0/arrays.html#array-operations As an example, compare the run time of the following code with the example you gave: void main () { double[] foo = new double [cast(int)1e6]; double[] slice1 = foo[0 .. 999_998]; double[] slice2 = foo[1 .. 999_999]; for (int i=0;i<1e3;i++) { // BAD, BAD, BAD. DON'T DO THIS even though // it's pretty awesome: slice1[] += slice2[]; } } Note that this is very bad code, since slice1 and slice2 are overlapping arrays, and there is no guarantee as to which order the array elements are computed -- it may even occur in parallel. It was just an example of the speed gains you may expect from designing your code with array operations in mind. -Lars |
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to kai | On Fri, 14 May 2010 02:38:40 +0000, kai wrote: > Hello, > > I was evaluating using D for some numerical stuff. However I was surprised to find that looping & array indexing was not very speedy compared to alternatives (gcc et al). I was using the DMD2 compiler on mac and windows, with -O -release. Here is a boiled down test case: > > void main (string[] args) > { > double [] foo = new double [cast(int)1e6]; for (int i=0;i<1e3;i++) > { > for (int j=0;j<1e6-1;j++) > { > foo[j]=foo[j]+foo[j+1]; > } > } > } > > Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for your help. Two suggestions: 1. Have you tried the -noboundscheck compiler switch? Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch. 2. Can you use vector operations? If the example you gave is representative of your specific problem, then you can't because you are adding overlapping parts of the array. But if you are doing operations on separate arrays, then array operations will be *much* faster. http://www.digitalmars.com/d/2.0/arrays.html#array-operations As an example, compare the run time of the following code with the example you gave: void main () { double[] foo = new double [cast(int)1e6]; double[] slice1 = foo[0 .. 999_998]; double[] slice2 = foo[1 .. 999_999]; for (int i=0;i<1e3;i++) { // BAD, BAD, BAD. DON'T DO THIS even though // it's pretty awesome: slice1[] += slice2[]; } } Note that this is very bad code, since slice1 and slice2 are overlapping arrays, and there is no guarantee as to which order the array elements are computed -- it may even occur in parallel. It was just an example of the speed gains you may expect from designing your code with array operations in mind. -Lars |
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to Lars T. Kyllingstad | On Fri, 14 May 2010 06:31:29 +0000, Lars T. Kyllingstad wrote:
> void main ()
> {
> double[] foo = new double [cast(int)1e6]; double[] slice1 =
> foo[0 .. 999_998];
> double[] slice2 = foo[1 .. 999_999];
>
> for (int i=0;i<1e3;i++)
> {
> // BAD, BAD, BAD. DON'T DO THIS even though // it's pretty
> awesome:
> slice1[] += slice2[];
> }
> }
Hmm.. something very strange is going on with the line breaking here.
-Lars
|
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to kai | kai: > I was evaluating using D for some numerical stuff. For that evaluation you probably have to use the LDC compiler, that is able to optimize better. > void main (string[] args) > { > double [] foo = new double [cast(int)1e6]; > for (int i=0;i<1e3;i++) > { > for (int j=0;j<1e6-1;j++) > { > foo[j]=foo[j]+foo[j+1]; > } > } > } Using floating point for indexes and lengths is not a good practice. In D large numbers are written like 1_000_000. Use -release too. > Any ideas? Am I somehow not hitting a vital compiler optimization? DMD compiler doesn't perform many optimizations, especially on floating point computations. But the bigger problem in your code is that you are performing operations on NaNs (that's the default initalization of FP values in D), and operations on NaNs are usually quite slower. Your code in C: #include "stdio.h" #include "stdlib.h" #define N 1000000 int main() { double *foo = calloc(N, sizeof(double)); // malloc suffices here int i, j; for (j = 0; j < N; j++) foo[j] = 1.0; for (i = 0; i < 1000; i++) for (j = 0; j < N-1; j++) foo[j] = foo[j] + foo[j + 1]; printf("%f", foo[N-1]); return 0; } /* gcc -O3 -s -Wall test.c -o test Timings, outer loop=1_000 times: 7.72 ------------------ gcc -Wall -O3 -fomit-frame-pointer -msse3 -march=native test.c -o test (Running on a VirtualBox) Timings, outer loop=1_000 times: 7.69 s Just the inner loop: .L7: fldl 8(%edx) fadd %st, %st(1) fxch %st(1) fstpl (%edx) addl $8, %edx cmpl %ecx, %edx jne .L7 */ -------------------- Your code in D1: version (Tango) import tango.stdc.stdio: printf; else import std.c.stdio: printf; void main() { const int N = 1_000_000; double[] foo = new double[N]; foo[] = 1.0; for (int i = 0; i < 1_000; i++) for (int j = 0; j < N-1; j++) foo[j] = foo[j] + foo[j + 1]; printf("%f", foo[N-1]); } /* dmd -O -release -inline test.d (Not running on a VirtualBox) Timings, outer loop=1_000 times: 9.35 s Just the inner loop: L34: fld qword ptr 8[EDX*8][ECX] fadd qword ptr [EDX*8][ECX] fstp qword ptr [EDX*8][ECX] inc EDX cmp EDX,0F423Fh jb L34 ----------------------- ldc -O3 -release -inline test.d (Running on a VirtualBox) Timings, outer loop=1_000 times: 7.87 s Just the inner loop: .LBB1_2: movsd (%eax,%ecx,8), %xmm0 addsd 8(%eax,%ecx,8), %xmm0 movsd %xmm0, (%eax,%ecx,8) incl %ecx cmpl $999999, %ecx jne .LBB1_2 ----------------------- ldc -unroll-allow-partial -O3 -release -inline test.d (Running on a VirtualBox) Timings, outer loop=1_000 times: 7.75 s Just the inner loop: .LBB1_2: movsd (%eax,%ecx,8), %xmm0 addsd 8(%eax,%ecx,8), %xmm0 movsd %xmm0, (%eax,%ecx,8) movsd 8(%eax,%ecx,8), %xmm0 addsd 16(%eax,%ecx,8), %xmm0 movsd %xmm0, 8(%eax,%ecx,8) movsd 16(%eax,%ecx,8), %xmm0 addsd 24(%eax,%ecx,8), %xmm0 movsd %xmm0, 16(%eax,%ecx,8) movsd 24(%eax,%ecx,8), %xmm0 addsd 32(%eax,%ecx,8), %xmm0 movsd %xmm0, 24(%eax,%ecx,8) movsd 32(%eax,%ecx,8), %xmm0 addsd 40(%eax,%ecx,8), %xmm0 movsd %xmm0, 32(%eax,%ecx,8) movsd 40(%eax,%ecx,8), %xmm0 addsd 48(%eax,%ecx,8), %xmm0 movsd %xmm0, 40(%eax,%ecx,8) movsd 48(%eax,%ecx,8), %xmm0 addsd 56(%eax,%ecx,8), %xmm0 movsd %xmm0, 48(%eax,%ecx,8) movsd 56(%eax,%ecx,8), %xmm0 addsd 64(%eax,%ecx,8), %xmm0 movsd %xmm0, 56(%eax,%ecx,8) movsd 64(%eax,%ecx,8), %xmm0 addsd 72(%eax,%ecx,8), %xmm0 movsd %xmm0, 64(%eax,%ecx,8) addl $9, %ecx cmpl $999999, %ecx jne .LBB1_2 */ As you see the code generated by ldc is about as good as the one generated by gcc. There are of course other ways to optimize this code... Bye, bearophile |
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to Lars T. Kyllingstad | On Fri, 14 May 2010 02:31:29 -0400, Lars T. Kyllingstad <public@kyllingen.nospamnet> wrote: > On Fri, 14 May 2010 02:38:40 +0000, kai wrote: >> I was using the DMD2 compiler on >> mac and windows, with -O -release. > > 1. Have you tried the -noboundscheck compiler switch? Unlike C, D checks > that you do not try to read/write beyond the end of an array, but you can > turn those checks off with said switch. -release implies -noboundscheck (in fact, I did not know there was a noboundscheck flag, I thought you had to use -release). -Steve |
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Fri, 14 May 2010 07:32:54 -0400, Steven Schveighoffer wrote:
> On Fri, 14 May 2010 02:31:29 -0400, Lars T. Kyllingstad <public@kyllingen.nospamnet> wrote:
>
>> On Fri, 14 May 2010 02:38:40 +0000, kai wrote:
>
>
>>> I was using the DMD2 compiler on
>>> mac and windows, with -O -release.
>>
>> 1. Have you tried the -noboundscheck compiler switch? Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch.
>
> -release implies -noboundscheck (in fact, I did not know there was a noboundscheck flag, I thought you had to use -release).
>
> -Steve
You are right, just checked it now. But it's strange, I thought the whole point of the -noboundscheck switch was that it would be independent of -release. But perhaps I remember wrongly (or perhaps Walter just hasn't gotten around to it yet).
Anyway, sorry for the misinformation.
-Lars
|
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to kai | On Thu, 13 May 2010 22:38:40 -0400, kai <kai@nospam.zzz> wrote:
> Hello,
>
> I was evaluating using D for some numerical stuff. However I was surprised to
> find that looping & array indexing was not very speedy compared to
> alternatives (gcc et al). I was using the DMD2 compiler on mac and windows,
> with -O -release. Here is a boiled down test case:
>
> void main (string[] args)
> {
> double [] foo = new double [cast(int)1e6];
> for (int i=0;i<1e3;i++)
> {
> for (int j=0;j<1e6-1;j++)
> {
> foo[j]=foo[j]+foo[j+1];
> }
> }
> }
>
> Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for
> your help.
I figured it out.
in D, the default value for doubles is nan, so you are adding countless scores of nan's which is costly for some reason (not a big floating point guy, so I'm not sure about this).
In C/C++, the default value for doubles is 0.
BTW, without any initialization of the array, what are you expecting the code to do? In the C++ version, I suspect you are simply adding a bunch of 0s together.
Equivalent D code which first initializes the array to 0s:
void main (string[] args)
{
double [] foo = new double [cast(int)1e6];
foo[] = 0; // probably want to change this to something more meaningful
for (int i=0;i<cast(int)1e3;i++)
{
for (int j=0;j<cast(int)1e6-1;j++)
{
foo[j]+=foo[j+1];
}
}
}
On my PC, it runs almost exactly at the same speed as the C++ version.
-Steve
|
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | Thanks for the help all! > 2. Can you use vector operations? If the example you gave is representative of your specific problem, then you can't because you are adding overlapping parts of the array. But if you are doing operations on separate arrays, then array operations will be *much* faster. Unfortunately, I don't think I will be able to. The actual code is computing norms of a sequence of points and then updating their values as needed (MLE smoothing/prediction). > For that evaluation you probably have to use the LDC compiler, that is able to optimize better. I was scared off by the warning that D 2.0 support is experimental. I realize D 2 itself is still non-production, but for academic interests industrial-strength isnt all that important if it usually works :). > Using floating point for indexes and lengths is not a good practice. In D large numbers are written like 1_000_000. Use -release too. Good to know, thanks (thats actually a great feature for scientists!). > DMD compiler doesn't perform many optimizations, especially on floating point computations. But the bigger problem in your code is that you are performing operations on NaNs (that's the default initalization of FP values in D), and operations on NaNs are usually quite slower. > in D, the default value for doubles is nan, so you are adding countless scores of nan's which is costly for some reason (not a big floating point guy, so I'm not sure about this). Ah ha, that was it-- serves me right for trying to boil down a test case and failing miserably. I'll head back to my code now and try to find the real problem :-) At some point I removed the initialization data obviously. |
May 14, 2010 Re: Loop optimization | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | == Quote from bearophile (bearophileHUGS@lycos.com)'s article
> But the bigger problem in your code is that you are performing operations on
NaNs (that's the default initalization of FP values in D), and operations on NaNs are usually quite slower.
I didn't know that. Is it the same for inf?
I used it as a null for structs.
|
Copyright © 1999-2021 by the D Language Foundation