Jump to page: 1 24  
Page
Thread overview
Loop optimization
May 14, 2010
kai
May 14, 2010
bearophile
May 14, 2010
strtr
May 15, 2010
Don
May 15, 2010
strtr
May 16, 2010
Don
May 15, 2010
Don
May 16, 2010
Walter Bright
May 17, 2010
bearophile
May 17, 2010
Walter Bright
May 17, 2010
bearophile
May 18, 2010
Walter Bright
May 17, 2010
Don
May 17, 2010
BCS
May 16, 2010
Walter Bright
May 17, 2010
bearophile
May 17, 2010
Brad Roberts
May 19, 2010
Joseph Wakeling
May 14, 2010
kai
May 14, 2010
bearophile
May 14, 2010
Jérôme M. Berger
May 15, 2010
div0
May 15, 2010
Jérôme M. Berger
May 16, 2010
div0
May 16, 2010
Jouko Koski
May 16, 2010
Jérôme M. Berger
May 16, 2010
Ali Çehreli
May 16, 2010
Simen kjaeraas
May 16, 2010
Ali Çehreli
May 16, 2010
bearophile
May 22, 2010
Walter Bright
May 22, 2010
bearophile
May 14, 2010
Hello,

I was evaluating using D for some numerical stuff. However I was surprised to find that looping & array indexing was not very speedy compared to alternatives (gcc et al). I was using the DMD2 compiler on mac and windows, with -O -release. Here is a boiled down test case:

	void main (string[] args)
	{
		double [] foo = new double [cast(int)1e6];
		for (int i=0;i<1e3;i++)
		{
			for (int j=0;j<1e6-1;j++)
			{
				foo[j]=foo[j]+foo[j+1];
			}
		}
	}

Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for your help.
May 14, 2010
On Fri, 14 May 2010 02:38:40 +0000, kai wrote:

> Hello,
> 
> I was evaluating using D for some numerical stuff. However I was surprised to find that looping & array indexing was not very speedy compared to alternatives (gcc et al). I was using the DMD2 compiler on mac and windows, with -O -release. Here is a boiled down test case:
> 
> 	void main (string[] args)
> 	{
> 		double [] foo = new double [cast(int)1e6]; for (int
i=0;i<1e3;i++)
> 		{
> 			for (int j=0;j<1e6-1;j++)
> 			{
> 				foo[j]=foo[j]+foo[j+1];
> 			}
> 		}
> 	}
> 
> Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for your help.


Two suggestions:


1. Have you tried the -noboundscheck compiler switch?  Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch.


2. Can you use vector operations?  If the example you gave is representative of your specific problem, then you can't because you are adding overlapping parts of the array.  But if you are doing operations on separate arrays, then array operations will be *much* faster.

    http://www.digitalmars.com/d/2.0/arrays.html#array-operations

As an example, compare the run time of the following code with the example you gave:

    void main ()
    {
        double[] foo = new double [cast(int)1e6];
        double[] slice1 = foo[0 .. 999_998];
        double[] slice2 = foo[1 .. 999_999];

        for (int i=0;i<1e3;i++)
        {
            // BAD, BAD, BAD.  DON'T DO THIS even though
            // it's pretty awesome:
            slice1[] += slice2[];
        }
    }

Note that this is very bad code, since slice1 and slice2 are overlapping arrays, and there is no guarantee as to which order the array elements are computed -- it may even occur in parallel.  It was just an example of the speed gains you may expect from designing your code with array operations in mind.

-Lars
May 14, 2010
On Fri, 14 May 2010 02:38:40 +0000, kai wrote:

> Hello,
> 
> I was evaluating using D for some numerical stuff. However I was surprised to find that looping & array indexing was not very speedy compared to alternatives (gcc et al). I was using the DMD2 compiler on mac and windows, with -O -release. Here is a boiled down test case:
> 
> 	void main (string[] args)
> 	{
> 		double [] foo = new double [cast(int)1e6]; for (int
i=0;i<1e3;i++)
> 		{
> 			for (int j=0;j<1e6-1;j++)
> 			{
> 				foo[j]=foo[j]+foo[j+1];
> 			}
> 		}
> 	}
> 
> Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for your help.

Two suggestions:


1. Have you tried the -noboundscheck compiler switch?  Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch.


2. Can you use vector operations?  If the example you gave is representative of your specific problem, then you can't because you are adding overlapping parts of the array.  But if you are doing operations on separate arrays, then array operations will be *much* faster.

    http://www.digitalmars.com/d/2.0/arrays.html#array-operations

As an example, compare the run time of the following code with the example you gave:

    void main ()
    {
        double[] foo = new double [cast(int)1e6];
        double[] slice1 = foo[0 .. 999_998];
        double[] slice2 = foo[1 .. 999_999];

        for (int i=0;i<1e3;i++)
        {
            // BAD, BAD, BAD.  DON'T DO THIS even though
            // it's pretty awesome:
            slice1[] += slice2[];
        }
    }

Note that this is very bad code, since slice1 and slice2 are overlapping arrays, and there is no guarantee as to which order the array elements are computed -- it may even occur in parallel.  It was just an example of the speed gains you may expect from designing your code with array operations in mind.

-Lars
May 14, 2010
On Fri, 14 May 2010 06:31:29 +0000, Lars T. Kyllingstad wrote:
>     void main ()
>     {
>         double[] foo = new double [cast(int)1e6]; double[] slice1 =
>         foo[0 .. 999_998];
>         double[] slice2 = foo[1 .. 999_999];
> 
>         for (int i=0;i<1e3;i++)
>         {
>             // BAD, BAD, BAD.  DON'T DO THIS even though // it's pretty
>             awesome:
>             slice1[] += slice2[];
>         }
>     }

Hmm.. something very strange is going on with the line breaking here.

-Lars
May 14, 2010
kai:

> I was evaluating using D for some numerical stuff.

For that evaluation you probably have to use the LDC compiler, that is able to optimize better.


> 	void main (string[] args)
> 	{
> 		double [] foo = new double [cast(int)1e6];
> 		for (int i=0;i<1e3;i++)
> 		{
> 			for (int j=0;j<1e6-1;j++)
> 			{
> 				foo[j]=foo[j]+foo[j+1];
> 			}
> 		}
> 	}

Using floating point for indexes and lengths is not a good practice. In D large numbers are written like 1_000_000. Use -release too.


> Any ideas? Am I somehow not hitting a vital compiler optimization?

DMD compiler doesn't perform many optimizations, especially on floating point computations.
But the bigger problem in your code is that you are performing operations on NaNs (that's the default initalization of FP values in D), and operations on NaNs are usually quite slower.


Your code in C:

#include "stdio.h"
#include "stdlib.h"
#define N 1000000

int main() {
    double *foo = calloc(N, sizeof(double)); // malloc suffices here
    int i, j;
    for (j = 0; j < N; j++)
        foo[j] = 1.0;

    for (i = 0; i < 1000; i++)
        for (j = 0; j < N-1; j++)
            foo[j] = foo[j] + foo[j + 1];

    printf("%f", foo[N-1]);
    return 0;
}

/*
gcc -O3 -s -Wall test.c -o test
Timings, outer loop=1_000 times: 7.72

------------------

gcc -Wall -O3 -fomit-frame-pointer -msse3 -march=native test.c -o test
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.69 s
Just the inner loop:
.L7:
    fldl    8(%edx)
    fadd    %st, %st(1)
    fxch    %st(1)
    fstpl   (%edx)
    addl    $8, %edx
    cmpl    %ecx, %edx
    jne .L7
*/

--------------------

Your code in D1:

version (Tango)
    import tango.stdc.stdio: printf;
else
    import std.c.stdio: printf;

void main() {
    const int N = 1_000_000;
    double[] foo = new double[N];
    foo[] = 1.0;

    for (int i = 0; i < 1_000; i++)
        for (int j = 0; j < N-1; j++)
            foo[j] = foo[j] + foo[j + 1];

    printf("%f", foo[N-1]);
}


/*
dmd -O -release -inline test.d
(Not running on a VirtualBox)
Timings, outer loop=1_000 times: 9.35 s
Just the inner loop:
L34:    fld qword ptr 8[EDX*8][ECX]
        fadd    qword ptr [EDX*8][ECX]
        fstp    qword ptr [EDX*8][ECX]
        inc EDX
        cmp EDX,0F423Fh
        jb  L34

-----------------------

ldc -O3 -release -inline test.d
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.87 s
Just the inner loop:
.LBB1_2:
    movsd   (%eax,%ecx,8), %xmm0
    addsd   8(%eax,%ecx,8), %xmm0
    movsd   %xmm0, (%eax,%ecx,8)
    incl    %ecx
    cmpl    $999999, %ecx
    jne .LBB1_2

-----------------------

ldc -unroll-allow-partial -O3 -release -inline test.d
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.75 s
Just the inner loop:
.LBB1_2:
    movsd   (%eax,%ecx,8), %xmm0
    addsd   8(%eax,%ecx,8), %xmm0
    movsd   %xmm0, (%eax,%ecx,8)
    movsd   8(%eax,%ecx,8), %xmm0
    addsd   16(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 8(%eax,%ecx,8)
    movsd   16(%eax,%ecx,8), %xmm0
    addsd   24(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 16(%eax,%ecx,8)
    movsd   24(%eax,%ecx,8), %xmm0
    addsd   32(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 24(%eax,%ecx,8)
    movsd   32(%eax,%ecx,8), %xmm0
    addsd   40(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 32(%eax,%ecx,8)
    movsd   40(%eax,%ecx,8), %xmm0
    addsd   48(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 40(%eax,%ecx,8)
    movsd   48(%eax,%ecx,8), %xmm0
    addsd   56(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 48(%eax,%ecx,8)
    movsd   56(%eax,%ecx,8), %xmm0
    addsd   64(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 56(%eax,%ecx,8)
    movsd   64(%eax,%ecx,8), %xmm0
    addsd   72(%eax,%ecx,8), %xmm0
    movsd   %xmm0, 64(%eax,%ecx,8)
    addl    $9, %ecx
    cmpl    $999999, %ecx
    jne .LBB1_2
*/

As you see the code generated by ldc is about as good as the one generated by gcc. There are of course other ways to optimize this code...

Bye,
bearophile
May 14, 2010
On Fri, 14 May 2010 02:31:29 -0400, Lars T. Kyllingstad <public@kyllingen.nospamnet> wrote:

> On Fri, 14 May 2010 02:38:40 +0000, kai wrote:


>> I was using the DMD2 compiler on
>> mac and windows, with -O -release.
>
> 1. Have you tried the -noboundscheck compiler switch?  Unlike C, D checks
> that you do not try to read/write beyond the end of an array, but you can
> turn those checks off with said switch.

-release implies -noboundscheck (in fact, I did not know there was a noboundscheck flag, I thought you had to use -release).

-Steve
May 14, 2010
On Fri, 14 May 2010 07:32:54 -0400, Steven Schveighoffer wrote:

> On Fri, 14 May 2010 02:31:29 -0400, Lars T. Kyllingstad <public@kyllingen.nospamnet> wrote:
> 
>> On Fri, 14 May 2010 02:38:40 +0000, kai wrote:
> 
> 
>>> I was using the DMD2 compiler on
>>> mac and windows, with -O -release.
>>
>> 1. Have you tried the -noboundscheck compiler switch?  Unlike C, D checks that you do not try to read/write beyond the end of an array, but you can turn those checks off with said switch.
> 
> -release implies -noboundscheck (in fact, I did not know there was a noboundscheck flag, I thought you had to use -release).
> 
> -Steve


You are right, just checked it now.  But it's strange, I thought the whole point of the -noboundscheck switch was that it would be independent of -release.  But perhaps I remember wrongly (or perhaps Walter just hasn't gotten around to it yet).

Anyway, sorry for the misinformation.

-Lars
May 14, 2010
On Thu, 13 May 2010 22:38:40 -0400, kai <kai@nospam.zzz> wrote:

> Hello,
>
> I was evaluating using D for some numerical stuff. However I was surprised to
> find that looping & array indexing was not very speedy compared to
> alternatives (gcc et al). I was using the DMD2 compiler on mac and windows,
> with -O -release. Here is a boiled down test case:
>
> 	void main (string[] args)
> 	{
> 		double [] foo = new double [cast(int)1e6];
> 		for (int i=0;i<1e3;i++)
> 		{
> 			for (int j=0;j<1e6-1;j++)
> 			{
> 				foo[j]=foo[j]+foo[j+1];
> 			}
> 		}
> 	}
>
> Any ideas? Am I somehow not hitting a vital compiler optimization? Thanks for
> your help.

I figured it out.

in D, the default value for doubles is nan, so you are adding countless scores of nan's which is costly for some reason (not a big floating point guy, so I'm not sure about this).

In C/C++, the default value for doubles is 0.

BTW, without any initialization of the array, what are you expecting the code to do?  In the C++ version, I suspect you are simply adding a bunch of 0s together.

Equivalent D code which first initializes the array to 0s:

void main (string[] args)
{
    double [] foo = new double [cast(int)1e6];
    foo[] = 0; // probably want to change this to something more meaningful
    for (int i=0;i<cast(int)1e3;i++)
    {
        for (int j=0;j<cast(int)1e6-1;j++)
        {
            foo[j]+=foo[j+1];
        }
    }
}

On my PC, it runs almost exactly at the same speed as the C++ version.

-Steve
May 14, 2010
Thanks for the help all!

> 2. Can you use vector operations?  If the example you gave is representative of your specific problem, then you can't because you are adding overlapping parts of the array.  But if you are doing operations on separate arrays, then array operations will be *much* faster.

Unfortunately, I don't think I will be able to. The actual code is computing norms of a sequence of points and then updating their values as needed (MLE smoothing/prediction).

> For that evaluation you probably have to use the LDC compiler, that is able to optimize better.

I was scared off by the warning that D 2.0 support is experimental. I realize D 2 itself is still non-production, but for academic interests industrial-strength isnt all that important if it usually works :).

> Using floating point for indexes and lengths is not a good practice. In D large numbers are written like 1_000_000. Use -release too.

Good to know, thanks (thats actually a great feature for scientists!).

> DMD compiler doesn't perform many optimizations, especially on floating point computations. But the bigger problem in your code is that you are performing operations on NaNs (that's the default initalization of FP values in D), and operations on NaNs are usually quite slower.

> in D, the default value for doubles is nan, so you are adding countless scores of nan's which is costly for some reason (not a big floating point guy, so I'm not sure about this).

Ah ha, that was it-- serves me right for trying to boil down a test case and failing miserably. I'll head back to my code now and try to find the real problem :-) At some point I removed the initialization data obviously.
May 14, 2010
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> But the bigger problem in your code is that you are performing operations on
NaNs (that's the default initalization of FP values in D), and operations on NaNs are usually quite slower.

I didn't know that. Is it the same for inf?
I used it as a null for structs.

« First   ‹ Prev
1 2 3 4