Thread overview
Small part of a program : d and c versions performances diff.
Jul 09, 2014
Larry
Jul 09, 2014
NCrashed
Jul 09, 2014
bearophile
Jul 09, 2014
Larry
Jul 09, 2014
bearophile
Jul 09, 2014
John Colvin
Jul 09, 2014
Larry
Jul 09, 2014
Larry
Jul 09, 2014
bearophile
Jul 09, 2014
Larry
Jul 09, 2014
bearophile
Jul 09, 2014
John Colvin
Jul 09, 2014
Larry
Jul 09, 2014
Chris
Jul 09, 2014
John Colvin
Jul 09, 2014
Larry
Jul 09, 2014
Larry
Jul 10, 2014
Kapps
Jul 10, 2014
Larry
Jul 09, 2014
Ali Çehreli
Jul 09, 2014
Larry
Jul 09, 2014
Ali Çehreli
Jul 10, 2014
Larry
Jul 10, 2014
Kapps
July 09, 2014
Hello,

I extracted a part of my code, written in C.
It is deliberately useless here, but I would like to understand the different techniques for optimizing this kind of code with the gdc compiler.

It currently runs in under a microsecond.

Constraint: the way the code is expressed cannot be changed much. We need the double loop because other operations are involved in the outer loop's scope.

main.c:
[code]
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "jol.h"
#include <time.h>
#include <sys/time.h>
int main(void)
{

    struct timeval s,e;
    gettimeofday(&s,NULL);

    int pol = 5;
    tes(&pol);


    int arr[] = {9,16,458,2,68,5452,98,32,4,565,78,985,3215};
    int len = 13-1;
    int g = 0;

    for (int x = 36; x >= 0 ; --x ){
        // some code here erased for the test
        for(int y = len ; y >= 0; --y){
            //some other code here
            ++g;
            arr[y] +=1;

        }

    }
    gettimeofday(&e,NULL);

    printf("so ? %d %lu %d %d %d",g,e.tv_usec - s.tv_usec, arr[4],arr[9],pol);
    return 0;
}
[/code]

jol.c
[code]
void tes(int * restrict a){

    *a = 9;

}
[/code]

and jol.h
[code]
#ifndef JOL_H
#define JOL_H
void tes(int * restrict a);
#endif // JOL_H
[/code]


Now, the D counterpart:

[code]
module main;

import std.stdio;
import std.datetime;
import jol;
int main(string[] args)
{


    auto currentTime = Clock.currTime();

    int pol = 5;
    tes(pol);
    pol = 8;

    int arr[] = [9,16,458,2,68,5452,98,32,4,565,78,985,3215];
    int len = 13-1;
    int g = 0;

    for (int x = 31; x >= 0 ; --x ){

        for(int y = len ; y >= 0; --y){

            ++g;
            arr[y] +=1;

        }

    }
    auto currentTime2 = Clock.currTime();
    writefln("Hello World %d %s %d %d\n",g, (currentTime2 - currentTime),arr[4],arr[9]);

    return 0;
}
[/code]

and

[code]
module jol;
final void tes(ref int a){

    a = 9;

}
[/code]


OK, the compilation options:
gdc hello.d jol.d -O3 -frelease -ftree-loop-optimize

gcc -march=native -std=c11 -O2 main.c jol.c

Now the performance:
D: 12 µs
C: < 1 µs

Where does the difference come from? Is there a way to optimize the D version?

Again, I am absolutely new to D and these are my very first lines of code with it.

Thanks
July 09, 2014
On Wednesday, 9 July 2014 at 10:57:33 UTC, Larry wrote:
> Now the performance:
> D: 12 µs
> C: < 1 µs
>
> Where does the difference come from? Is there a way to optimize the D version?
>
> Again, I am absolutely new to D and these are my very first lines of code with it.
>
> Thanks

Clock.currTime isn't an accurate benchmarking tool. Try std.datetime.benchmark:
```
module main;

import std.stdio;
import std.datetime;

void tes(ref int a)
{
    a = 9;
}

int[] arr = [9,16,458,2,68,5452,98,32,4,565,78,985,3215];

void foo()
{
    int pol = 5;
    tes(pol);
    pol = 8;
    int g = 0;

    foreach_reverse(x; 0..32) // x runs from 31 down to 0, like the original D loop
    {
        foreach_reverse(ref a; arr)
        {
            ++g;
            a += 1;
        }
    }
}

void main()
{
    auto res = benchmark!foo(1000); // total time for 1000 calls of foo
    writeln(res[0].usecs / 1000.0, " us per call ", arr[4], " ", arr[9]);
}
```

Dmd time: 1 us
Gcc time: <= 1 us
July 09, 2014
Larry:

> Now the performance:
> D: 12 µs
> C: < 1 µs
>
> Where does the difference come from? Is there a way to optimize the D version?
>
> Again, I am absolutely new to D and these are my very first lines of code with it.

Your C code is not equivalent to the D code; there are small differences, and even the output is different. So I've cleaned up your C and D code:

------------------------

// C code.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include "jol.h"

int main() {
    struct timeval s, e;
    gettimeofday(&s, NULL);

    int pol = 5;
    tes(&pol);

    int arr[] = {9, 16, 458, 2, 68, 5452, 98, 32, 4, 565, 78, 985, 3215};
    int len = 13 - 1;
    int g = 0;

    for (int x = 36; x >= 0; --x) {
        for (int y = len; y >= 0; --y) {
            ++g;
            arr[y]++;
        }
    }

    gettimeofday(&e, NULL);
    printf("C: %d %lu %d %d %d\n",
           g, e.tv_usec - s.tv_usec, arr[4], arr[9], pol);

    return 0;
}

------------------------

D code ("final" has little meaning on a free function, since it only matters for class methods, but the D compiler is very sloppy and doesn't complain):


module jol;

void tes(ref int a) {
    a = 9;
}


---------

module maind;

void main() {
    import std.stdio;
    import std.datetime;
    import jol;

    StopWatch sw;
    sw.start;

    int pol = 5;
    tes(pol);

    int[] arr = [9, 16, 458, 2, 68, 5452, 98, 32, 4, 565, 78, 985, 3215];
    int len = 13 - 1;
    int g = 0;

    for (int x = 36; x >= 0; --x) {
        // Some code here erased for the test.
        for (int y = len; y >= 0; --y) {
            // Some other code here.
            ++g;
            arr[y]++;
        }
    }

    sw.stop;
    writefln("D: %d %d %d %d %d",
             g, sw.peek.nsecs, arr[4], arr[9], pol);
}

----------------

That D code is not fully idiomatic; this is closer to idiomatic D code:


module jol2;

void test(ref int x) pure nothrow @safe {
    x = 9;
}



module maind;

void main() {
    import std.stdio, std.datetime;
    import jol2;

    StopWatch sw;
    sw.start;

    int pol = 5;
    test(pol);

    int[13] arr = [9, 16, 458, 2, 68, 5452, 98, 32, 4, 565, 78, 985, 3215];
    uint count = 0;

    foreach_reverse (immutable _; 0 .. 37) {
        foreach_reverse (ref ai; arr) {
            count++;
            ai++;
        }
    }

    sw.stop;
    writefln("D: %d %d %d %d %d",
             count, sw.peek.nsecs, arr[4], arr[9], pol);
}

----------------

In my benchmarks I haven't used the more idiomatic D code; I used the C-like code. But the run time is essentially the same.

I compile the C and D code with (on 32-bit Windows):

gcc -march=native -std=c11 -O2 main.c jol.c -o main

ldmd2 -wi -O -release -inline -noboundscheck maind.d jol.d
strip maind.exe

For the D code I've used the latest ldc2 compiler (V. 0.13.0, based on DMD v2.064 and LLVM 3.4.2), GCC is V.4.8.0 (rubenvb-4.8.0).

----------------

The C code gives as output:

C: 481 0 105 602 9


The D code gives as output:

D: 481 6076 105 602 9

----------------------

If I slow the CPU down to half speed, the C code runs in about 0.05 seconds and the D code in about 0.07 seconds.

Such run times are far too small for a meaningful comparison. You need a run time of about 2 seconds to get meaningful timings.
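
One way to get a longer, more stable measurement is to repeat the kernel many times and divide. The following is only a minimal sketch, assuming a POSIX system with `clock_gettime`; the helper names `elapsed_ns`, `kernel` and `bench_kernel` are hypothetical and not from the code above. It uses both the seconds and the sub-second fields of the timestamps, because subtracting only the sub-second part (as with `e.tv_usec - s.tv_usec`) gives a wrong value when a second boundary is crossed. Note that `len` here is the element count (13), not the last index.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under -std=c11 */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds between two timestamps, using both struct fields. */
static int64_t elapsed_ns(struct timespec start, struct timespec end) {
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* The loop nest from the thread: 37 outer passes over `len` elements.
   The array is unsigned so that many repetitions cannot overflow into
   undefined behaviour. Returns the iteration count. */
static int kernel(unsigned *arr, int len) {
    int g = 0;
    for (int x = 36; x >= 0; --x) {
        for (int y = len - 1; y >= 0; --y) {
            ++g;
            arr[y]++;
        }
    }
    return g;
}

/* Average nanoseconds per kernel call over `reps` repetitions. */
static double bench_kernel(long reps) {
    assert(reps > 0);
    unsigned arr[] = {9, 16, 458, 2, 68, 5452, 98, 32, 4, 565, 78, 985, 3215};
    volatile unsigned sink = 0;  /* results are stored so the work is observable */
    struct timespec s, e;

    clock_gettime(CLOCK_MONOTONIC, &s);
    for (long i = 0; i < reps; ++i)
        sink += (unsigned)kernel(arr, 13);
    sink += arr[4];
    clock_gettime(CLOCK_MONOTONIC, &e);

    return (double)elapsed_ns(s, e) / (double)reps;
}
```

Choosing `reps` so that the whole run takes a second or two makes the per-call average much less sensitive to timer resolution. An optimizer may still simplify the loop nest, so the figure is best read as an upper bound on the cost of the work that remains.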

The difference between 0.05 and 0.07 seconds comes from initializing the D runtime (including the D GC). It takes about 0.015 seconds on my system with the CPU at full speed, and it's a constant cost.

Bye,
bearophile
July 09, 2014
On Wednesday, 9 July 2014 at 12:25:40 UTC, bearophile wrote:
> Your C code is not equivalent to the D code; there are small differences, and even the output is different. So I've cleaned up your C and D code:
>
>
> If I slow the CPU down to half speed, the C code runs in about 0.05 seconds and the D code in about 0.07 seconds.
>
> Such run times are far too small for a meaningful comparison. You need a run time of about 2 seconds to get meaningful timings.
>
> The difference between 0.05 and 0.07 seconds comes from initializing the D runtime (including the D GC). It takes about 0.015 seconds on my system with the CPU at full speed, and it's a constant cost.
>
> Bye,
> bearophile

You are definitely right, I messed up while translating!

I ran the corrected code (the version I meant to provide :S) and on a slow MacBook I end up with:
C: 2
D: 15994

Of course, on very high-end machines this difference is almost nonexistent, but we want to run on very low-powered hardware.

OK, even with longer code there will always be a launch penalty for D, so I cannot use it for very high-performance loops.

Shame for us..
:)

Thanks and bye

July 09, 2014
Larry:

> Of course, on very high-end machines this difference is almost nonexistent, but we want to run on very low-powered hardware.
>
> OK, even with longer code there will always be a launch penalty for D, so I cannot use it for very high-performance loops.

If you run it on very low-powered hardware then you may not need the GC. So if you disable the runtime (stubbing out the GC), the start-up time of the D code will be smaller.

I think people like you here are really too quick to dismiss D :-)

Bye,
bearophile
July 09, 2014
On Wednesday, 9 July 2014 at 13:18:00 UTC, Larry wrote:
> You are definitely right, I messed up while translating!
>
> I ran the corrected code (the version I meant to provide :S) and on a slow MacBook I end up with:
> C: 2
> D: 15994
>
> Of course, on very high-end machines this difference is almost nonexistent, but we want to run on very low-powered hardware.
>
> OK, even with longer code there will always be a launch penalty for D, so I cannot use it for very high-performance loops.
>
> Shame for us..
> :)
>
> Thanks and bye

Could you provide the exact code you are using for that benchmark? Once the program has started up you should be able to obtain performance parity between C and D. Situations where this isn't true are problems we would like to know about.

For the amount of work you are doing in the test program (almost nothing), the total runtime is probably dominated by the program load time etc. even when using C.
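
To see how much of a run is launch cost rather than loop work, the whole process can be timed from the outside and compared with the in-program timing. This is a rough POSIX-only sketch, assuming `fork`, `execvp` and `waitpid` are available (so it does not apply to the Windows setup mentioned earlier); `time_process_ns` is a hypothetical helper name, not something used elsewhere in this thread.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under -std=c11 */
#endif
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Wall-clock nanoseconds needed to launch argv[0], let it run, and reap it.
   Returns -1 if the child could not be started or did not exit normally. */
static int64_t time_process_ns(char *const argv[]) {
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* only reached when exec itself failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    clock_gettime(CLOCK_MONOTONIC, &e);

    if (!WIFEXITED(status) || WEXITSTATUS(status) == 127)
        return -1;
    return (int64_t)(e.tv_sec - s.tv_sec) * 1000000000LL
         + (int64_t)(e.tv_nsec - s.tv_nsec);
}
```

Calling it with the C binary and then with the D binary gives the total cost of each launch. Subtracting the time each program reports for its own loop leaves an estimate of the load and runtime start-up overhead.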
July 09, 2014
Yes, you are perfectly right, but our need is to run the fastest code on the lowest-powered machines: not servers but embedded systems.

That is why I am only testing the overall structure.

The rest of the code is numerical, so it will not change much the fact that D cannot make up for the huge launch time. At the microsecond level (even nanosecond) it counts, because of power consumption, hardware size, heat and so on.

It is definitely not something most people care about, and I cannot disclose the full code for licensing reasons (yeah, I know I suck and am making a fuss over nothing, but.. I just execute).

But D may be useful to us for non-critical code, to replace some Python here and there. It is definitely a good piece of engineering, and it will help save money.
July 09, 2014
On Wednesday, 9 July 2014 at 13:46:59 UTC, Larry wrote:
> Yes, you are perfectly right, but our need is to run the fastest code on the lowest-powered machines: not servers but embedded systems.

@John Colvin:
Hmm, did you mean the sample code or the real code? If the former, it is the one corrected by bearophile.
My apologies
July 09, 2014
Larry:

> The rest of the code is numerical, so it will not change much the fact that D cannot make up for the huge launch time. At the microsecond level (even nanosecond) it counts, because of power consumption, hardware size, heat and so on.

Have you benchmarked the D code without starting the current D runtime (without the GC)?

Is a start-up time of around 0.015 seconds on an old PC really a huge one? I don't think anyone has put much work into reducing this tiny time. If you care about it, D being open source, you can take a look at the runtime start-up code.

Bye,
bearophile
July 09, 2014
@bearophile: just tried. No dramatic change.

import core.memory;

void main() {
    GC.disable; // stops collections only; the D runtime is still initialized before main runs
    ...
}