July 09, 2014 Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Hello,

I extracted a part of my code written in C. It is deliberately useless here, but I would like to understand the different techniques for optimizing this kind of code with the gdc compiler.

It currently runs in under a microsecond.

Constraint: the way the code is expressed cannot be changed much. We need that double loop because there are other operations involved in the first loop scope.

main.c :
[code]
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "jol.h"
#include <time.h>
#include <sys/time.h>
int main(void)
{
    struct timeval s,e;
    gettimeofday(&s,NULL);

    int pol = 5;
    tes(&pol);

    int arr[] = {9,16,458,2,68,5452,98,32,4,565,78,985,3215};
    int len = 13-1;
    int g = 0;

    for (int x = 36; x >= 0 ; --x ){
        // some code here erased for the test
        for(int y = len ; y >= 0; --y){
            //some other code here
            ++g;
            arr[y] +=1;
        }
    }
    gettimeofday(&e,NULL);

    printf("so ? %d %lu %d %d %d",g,e.tv_usec - s.tv_usec, arr[4],arr[9],pol);
    return 0;
}
[/code]

jol.c
[code]
void tes(int * restrict a){
    *a = 9;
}
[/code]

and jol.h

#ifndef JOL_H
#define JOL_H
void tes(int * restrict a);
#endif // JOL_H

Now, the D counterpart:

module main;

import std.stdio;
import std.datetime;
import jol;
int main(string[] args)
{
    auto currentTime = Clock.currTime();

    int pol = 5;
    tes(pol);
    pol = 8;

    int arr[] = [9,16,458,2,68,5452,98,32,4,565,78,985,3215];
    int len = 13-1;
    int g = 0;

    for (int x = 31; x >= 0 ; --x ){
        for(int y = len ; y >= 0; --y){
            ++g;
            arr[y] +=1;
        }
    }
    auto currentTime2 = Clock.currTime();
    writefln("Hello World %d %s %d %d\n",g, (currentTime2 - currentTime),arr[4],arr[9]);

    return 0;
}

and

module jol;
final void tes(ref int a){
    a = 9;
}

OK, the compilation options:

gdc hello.d jol.d -O3 -frelease -ftree-loop-optimize

gcc -march=native -std=c11 -O2 main.c jol.c

Now the performance:
D : 12 µs
C : < 1µs

Where does the difference come from? Is there a way to optimize the D version?

Again, I am absolutely new to D and these are my very first lines of code with it.

Thanks
|
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Larry | On Wednesday, 9 July 2014 at 10:57:33 UTC, Larry wrote:
> [...]
> Now the performance :
> D : 12 µs
> C : < 1µs
> [...]

Clock isn't an accurate benchmark instrument. Try std.datetime.benchmark:
```
module main;

import std.stdio;
import std.datetime;

void tes(ref int a) {
    a = 9;
}

int[] arr = [9,16,458,2,68,5452,98,32,4,565,78,985,3215];

void foo() {
    int pol = 5;
    tes(pol);
    pol = 8;
    int g = 0;
    foreach_reverse(x; 0..31) {
        foreach_reverse(ref a; arr) {
            ++g;
            a += 1;
        }
    }
}

void main() {
    // run foo 1000 times; res[0] holds the total time for all runs
    auto res = benchmark!foo(1000);
    // total milliseconds over 1000 runs is numerically the mean in microseconds per run
    writeln(res[0].msecs, " ", arr[4], " ", arr[9]);
}
```
Dmd time: 1 us
Gcc time: <= 1 us
|
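For comparison, the same "run it many times and take the mean" idea can be sketched on the C side. This is only an illustration: `kernel` and `mean_ns_per_call` are hypothetical names that do not appear in the thread, the loop bounds follow the C listing (37 outer passes) rather than the 31 used in the D snippet, and `clock_gettime` is assumed to be available (POSIX).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Array from the thread, kept at file scope so repeated calls keep updating it. */
static int arr[13] = {9, 16, 458, 2, 68, 5452, 98, 32, 4, 565, 78, 985, 3215};

/* The double loop from the C listing: 37 outer passes over 13 elements.
   Returns the number of inner iterations performed. */
int kernel(void) {
    int g = 0;
    for (int x = 36; x >= 0; --x) {
        for (int y = 12; y >= 0; --y) {
            ++g;
            arr[y] += 1;
        }
    }
    return g;
}

/* Hypothetical helper: calls fn `runs` times and returns the mean
   wall-clock time per call in nanoseconds. */
double mean_ns_per_call(int (*fn)(void), int runs) {
    assert(runs > 0);
    struct timespec s, e;
    volatile int sink = 0; /* keeps the calls from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &s);
    for (int i = 0; i < runs; ++i) {
        sink += fn();
    }
    clock_gettime(CLOCK_MONOTONIC, &e);
    (void)sink;

    double total_ns = (double)(e.tv_sec - s.tv_sec) * 1e9
                    + (double)(e.tv_nsec - s.tv_nsec);
    return total_ns / runs;
}
```

A caller would use something like `mean_ns_per_call(kernel, 1000)` and print the result; the measurement happens entirely inside the running process, so program start-up is not part of the figure.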
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Larry | Larry:
> Now the performance :
> D : 12 µs
> C : < 1µs
>
> Where does the diff comes from ? Is there a way to optimize the d version ?
>
> Again, I am absolutely new to D and those are my very first line of code with it.

Your C code is not equivalent to the D code: there are small differences, and even the output is different. So I've cleaned up your C and D code:

------------------------

// C code.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include "jol.h"

int main() {
    struct timeval s, e;
    gettimeofday(&s, NULL);

    int pol = 5;
    tes(&pol);

    int arr[] = {9, 16, 458, 2, 68, 5452, 98, 32, 4, 565, 78, 985, 3215};
    int len = 13 - 1;
    int g = 0;

    for (int x = 36; x >= 0; --x) {
        for (int y = len; y >= 0; --y) {
            ++g;
            arr[y]++;
        }
    }

    gettimeofday(&e, NULL);
    printf("C: %d %lu %d %d %d\n",
           g, e.tv_usec - s.tv_usec, arr[4], arr[9], pol);

    return 0;
}

------------------------

D code ("final" functions have little meaning here, but the D compiler is very sloppy and doesn't complain):

module jol;

void tes(ref int a) {
    a = 9;
}

---------

module maind;

void main() {
    import std.stdio;
    import std.datetime;
    import jol;

    StopWatch sw;
    sw.start;

    int pol = 5;
    tes(pol);

    int[] arr = [9, 16, 458, 2, 68, 5452, 98, 32, 4, 565, 78, 985, 3215];
    int len = 13 - 1;
    int g = 0;

    for (int x = 36; x >= 0; --x) {
        // Some code here erased for the test.
        for (int y = len; y >= 0; --y) {
            // Some other code here.
            ++g;
            arr[y]++;
        }
    }

    sw.stop;
    writefln("D: %d %d %d %d %d",
             g, sw.peek.nsecs, arr[4], arr[9], pol);
}

----------------

That D code is not fully idiomatic; this is closer to idiomatic D code:

module jol2;

void test(ref int x) pure nothrow @safe {
    x = 9;
}

module maind;

void main() {
    import std.stdio, std.datetime;
    import jol2;

    StopWatch sw;
    sw.start;

    int pol = 5;
    test(pol);

    int[13] arr = [9, 16, 458, 2, 68, 5452, 98, 32, 4, 565, 78, 985, 3215];
    uint count = 0;

    foreach_reverse (immutable _; 0 .. 37) {
        foreach_reverse (ref ai; arr) {
            count++;
            ai++;
        }
    }

    sw.stop;
    writefln("D: %d %d %d %d %d",
             count, sw.peek.nsecs, arr[4], arr[9], pol);
}

----------------

In my benchmarks I have not used the more idiomatic D code; I have used the C-like code. But the run-time is essentially the same.

I compile the C and D code with (on a 32 bit Windows):

gcc -march=native -std=c11 -O2 main.c jol.c -o main

ldmd2 -wi -O -release -inline -noboundscheck maind.d jol.d
strip maind.exe

For the D code I've used the latest ldc2 compiler (V. 0.13.0, based on DMD v2.064 and LLVM 3.4.2), GCC is V.4.8.0 (rubenvb-4.8.0).

----------------

The C code gives as output:

C: 481 0 105 602 9

The D code gives as output:

D: 481 6076 105 602 9

----------------------

If I slow the CPU down to half speed, the C code runs in about 0.05 seconds and the D code runs in about 0.07 seconds.

Such run times are too small for a sufficiently meaningful comparison. You need a run-time of about 2 seconds to get meaningful timings.

The difference between 0.05 and 0.07 is caused by initializing the D runtime (like the D GC). It takes about 0.015 seconds on my system with the CPU at full speed to initialize the D runtime, and it's a constant time.

Bye,
bearophile
|
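A side note on the C timing above: the listing prints `e.tv_usec - s.tv_usec`, which uses only the microsecond field of `struct timeval` and ignores `tv_sec`. If the two samples fall on either side of a second boundary, that difference is negative. A minimal sketch of a full elapsed-time computation follows; `elapsed_usec` is a hypothetical helper name, not something from the thread.

```c
#include <assert.h>
#include <sys/time.h>

/* Hypothetical helper: full elapsed time in microseconds between two
   gettimeofday() samples, combining the seconds and microseconds fields. */
long long elapsed_usec(struct timeval s, struct timeval e) {
    assert(e.tv_sec >= s.tv_sec); /* the end sample must not precede the start */
    return (long long)(e.tv_sec - s.tv_sec) * 1000000LL
         + (long long)(e.tv_usec - s.tv_usec);
}
```

For example, a start sample of 10 s + 999990 µs and an end sample of 11 s + 5 µs are 15 µs apart, while the microsecond fields alone differ by -999985.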
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | On Wednesday, 9 July 2014 at 12:25:40 UTC, bearophile wrote:
> Your C code is not equivalent to the D code, there are small differences, even the output is different. So I've cleaned up your C and D code:
> [...]
You are definitely right, I did mess up while translating!
I ran the corrected code (the version I meant to provide :S) and on a slow MacBook I end up with:
C : 2
D : 15994
Of course, when run on very high-end machines this difference is almost nonexistent, but we want to run on very low-powered hardware.
OK, even with longer code there will always be a launch penalty for D, so I cannot use it for very high performance loops.
Shame for us..
:)
Thanks and bye
|
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Larry | Larry:
> Of course when run on very high end machines, this diff is almost non existent but we want to run on very low powered hardware.
>
> Ok, even with a longer code, there will always be a launch penalty for d. So I cannot use it for very high performance loops.
If you run it on very low powered hardware then you may not need the GC. So if you disable the run-time (stubbing out the GC) the start-up time of the D code will be smaller.
I think people here like you are really too quick at dismissing D :-)
Bye,
bearophile
|
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Larry | On Wednesday, 9 July 2014 at 13:18:00 UTC, Larry wrote:
> You are definitely right, I did mess up while translating !
> [...]
> C : 2
> D : 15994
> [...]
> Ok, even with a longer code, there will always be a launch penalty for d. So I cannot use it for very high performance loops.
Could you provide the exact code you are using for that benchmark? Once the program has started up you should be able to obtain performance parity between C and D. Situations where this isn't true are problems we would like to know about.
For the amount of work you are doing in the test program (almost nothing), the total runtime is probably dominated by the program load time etc. even when using C.
|
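The point about fixed start-up cost can be made concrete with a simple model. It assumes, as bearophile reports earlier in the thread, that runtime initialization is a roughly constant cost $T_0$, and that the measured work costs $t$ per repetition over $n$ repetitions. The 1% threshold is an arbitrary choice for illustration.

\[
T(n) = T_0 + n\,t, \qquad \frac{T_0}{T(n)} \le 0.01 \iff T(n) \ge 100\,T_0
\]

With the $T_0 \approx 0.015$ s quoted for runtime initialization, the start-up share drops below 1% once the whole run lasts at least $100 \times 0.015 = 1.5$ s, which is consistent with the advice to aim for a run-time of about 2 seconds.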
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to John Colvin | Yes, you are perfectly right, but our need is to run the fastest code on the lowest-powered machines. Not servers but embedded systems.

That is why I just test the overall structures.

The rest of the code is numerical, so it will not change by much the fact that D cannot make up for the huge launch time. At the microsecond level (even nanosecond) it counts because of electrical consumption, size of hardware, heat and so on.

It is definitely not something most people care about, and I cannot disclose the full code for license reasons (yeah I know I suck and generate some fuss for nothing but.. I just execute.)

But D may be of use to us for non-critical code, to replace some Python here and there. It is definitely a good piece of engineering. And it will help save money.
|
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Larry | On Wednesday, 9 July 2014 at 13:46:59 UTC, Larry wrote:
> [...]
@John Colvin:
Hmm, did you mean the sample code or the real code? If the former, it is the one corrected by bearophile.
My apologies
|
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Larry | Larry:
> The rest of the code is numerical so it will not change by much the fact that d cannot get back the huge launching time. At the microsecond level(even nano) it counts because of electrical consumption, size of hardware, heat and so on.
Have you benchmarked the D code without starting the current d-runtime (without GC)?
Is a start-up time of around 0.015 seconds on an old PC a huge one? I think no one has worked much on decreasing this tiny time. If you care about such a time, D being open source, you can take a look at the runtime start-up code.
Bye,
bearophile
|
July 09, 2014 Re: Small part of a program : d and c versions performances diff. | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | @Bearophile: just tried. No dramatic change.

import core.memory;

void main() {
    GC.disable;
    ...
}
|