February 21, 2016  Simple performance question from a newcomer
I've been vaguely aware of D for many years, but the recent addition of std.experimental.ndslice finally inspired me to give it a try, since my main expertise lies in the domain of scientific computing and I primarily use Python/Julia/C++, where multidimensional arrays can be handled with a great deal of expressiveness and flexibility. Before writing anything serious, I wanted to get a sense for the kind of code I would have to write to get the best performance for numerical calculations, so I wrote a trivial summation benchmark. The following code gave me slightly surprising results:

import std.stdio;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.range;
import std.experimental.ndslice;

void main() {
    int N = 1000;
    int Q = 20;
    int times = 1_000;
    double[] res1 = uninitializedArray!(double[])(N);
    double[] res2 = uninitializedArray!(double[])(N);
    double[] res3 = uninitializedArray!(double[])(N);
    auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
    StopWatch sw;
    double t0, t1, t2;
    sw.start();
    foreach (unused; 0..times) {
        for (int i=0; i<N; ++i) {
            res1[i] = sumtest1(f[i]);
        }
    }
    sw.stop();
    t0 = sw.peek().msecs;
    sw.reset();
    sw.start();
    foreach (unused; 0..times) {
        for (int i=0; i<N; ++i) {
            res2[i] = sumtest2(f[i]);
        }
    }
    sw.stop();
    t1 = sw.peek().msecs;
    sw.reset();
    sw.start();
    foreach (unused; 0..times) {
        sumtest3(f, res3, N, Q);
    }
    t2 = sw.peek().msecs;
    writeln(t0, " ms");
    writeln(t1, " ms");
    writeln(t2, " ms");
    assert( res1 == res2 );
    assert( res2 == res3 );
}

auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
    return range.sum;
}

auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
    double retval = 0.0;
    foreach (double f_ ; f) {
        retval += f_;
    }
    return retval;
}

auto sumtest3(Range)(Range f, double[] retval, double N, double Q) @safe pure nothrow @nogc {
    for (int i=0; i<N; ++i) {
        for (int j=1; j<Q; ++j) {
            retval[i] += f[i,j];
        }
    }
}

When I compiled it using dmd -release -inline -O -noboundscheck ../src/main.d, I got the following timings:

1268 ms
312 ms
271 ms

I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons. Nevertheless, the two explicit loop functions gave me an improvement by a factor of 4+. Furthermore, the difference between sumtest2 and sumtest3 seems to indicate that function calls have a significant overhead. I also tried using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, but that yielded even worse performance. I did not try the GDC/LDC compilers yet, since they don't seem to be up to date on the standard library and don't include the ndslice package last I checked.

Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
>
> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?
I didn't look at your code that thoroughly, but it is generally recommended that, if you're concerned about performance, you compile with gdc or ldc. dmd has fast compile times, but does not produce code that is as fast. You might want to check whether the performance differential is still large with one of those.
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

You can use -profile to see what is causing it.

  Num          Tree         Func         Per
  Calls        Time         Time         Call

  23000000   550799875    550243765        23   pure nothrow @nogc @safe double std.algorithm.iteration.sumPairwise!(double, std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice).sumPairwise(std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice)

On 21.2.2016 at 15:32, dextorious via Digitalmars-d-learn wrote:
> [...]
February 21, 2016  Re: Simple performance question from a newcomer
So I guess pairwise summation is the one to blame here.

On 21.2.2016 at 16:56, Daniel Kozak wrote:
> You can use -profile to see what is causing it.
>
> Num Tree Func Per
> Calls Time Time Call
>
> 23000000 550799875 550243765 23 pure nothrow @nogc @safe double std.algorithm.iteration.sumPairwise!(double, std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice).sumPairwise(std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice)
>
> On 21.2.2016 at 15:32, dextorious via Digitalmars-d-learn wrote:
>> I've been vaguely aware of D for many years, but the recent addition of std.experimental.ndslice finally inspired me to give it a try, since my main expertise lies in the domain of scientific computing and I primarily use Python/Julia/C++, where multidimensional arrays can be handled with a great deal of expressiveness and flexibility. Before writing anything serious, I wanted to get a sense for the kind of code I would have to write to get the best performance for numerical calculations, so I wrote a trivial summation benchmark. The following code gave me slightly surprising results:
>>
>> import std.stdio;
>> import std.array : array;
>> import std.algorithm;
>> import std.datetime;
>> import std.range;
>> import std.experimental.ndslice;
>>
>> void main() {
>> int N = 1000;
>> int Q = 20;
>> int times = 1_000;
>> double[] res1 = uninitializedArray!(double[])(N);
>> double[] res2 = uninitializedArray!(double[])(N);
>> double[] res3 = uninitializedArray!(double[])(N);
>> auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
>> StopWatch sw;
>> double t0, t1, t2;
>> sw.start();
>> foreach (unused; 0..times) {
>> for (int i=0; i<N; ++i) {
>> res1[i] = sumtest1(f[i]);
>> }
>> }
>> sw.stop();
>> t0 = sw.peek().msecs;
>> sw.reset();
>> sw.start();
>> foreach (unused; 0..times) {
>> for (int i=0; i<N; ++i) {
>> res2[i] = sumtest2(f[i]);
>> }
>> }
>> sw.stop();
>> t1 = sw.peek().msecs;
>> sw.reset();
>> sw.start();
>> foreach (unused; 0..times) {
>> sumtest3(f, res3, N, Q);
>> }
>> t2 = sw.peek().msecs;
>> writeln(t0, " ms");
>> writeln(t1, " ms");
>> writeln(t2, " ms");
>> assert( res1 == res2 );
>> assert( res2 == res3 );
>> }
>>
>> auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
>> return range.sum;
>> }
>>
>> auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
>> double retval = 0.0;
>> foreach (double f_ ; f) {
>> retval += f_;
>> }
>> return retval;
>> }
>>
>> auto sumtest3(Range)(Range f, double[] retval, double N, double Q)
>> @safe pure nothrow @nogc {
>> for (int i=0; i<N; ++i) {
>> for (int j=1; j<Q; ++j) {
>> retval[i] += f[i,j];
>> }
>> }
>> }
>>
>> When I compiled it using dmd -release -inline -O -noboundscheck
>> ../src/main.d, I got the following timings:
>> 1268 ms
>> 312 ms
>> 271 ms
>>
>> I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons. Nevertheless, the two explicit loop functions gave me an improvement by a factor of 4+. Furthermore, the difference between sumtest2 and sumtest3 seems to indicate that function calls have a significant overhead. I also tried using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, but that yielded even worse performance. I did not try the GDC/LDC compilers yet, since they don't seem to be up to date on the standard library and don't include the ndslice package last I checked.
>>
>> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?
>
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?

First off, you should really be using GDC or LDC if you want speed. On how to do that, see my blog post here: http://jackstouffer.com/blog/nd_slice.html, specifically the section titled "Getting Hands On".

Secondly, both of your other sum examples use naive element-by-element summation rather than the more accurate pairwise summation, which sum uses with random-access floating-point ranges, so you're not really comparing apples to apples here. Since Phobos' pairwise summation is recursive, it's very likely that DMD isn't doing all the optimizations that LDC or GDC can, such as inlining or tail-call optimization. I haven't compiled your code, so I can't check myself.

Also, attributes are inferred for templates, so there's no reason to include @safe, nothrow, etc. on templated functions.
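For reference, a minimal sketch of pairwise summation over a random-access range looks roughly like the code below. This is only an illustration of the recursive halving described above, not Phobos' actual std.algorithm.iteration.sum implementation; the pairwiseSum name and the base-case threshold of 8 are arbitrary choices for the example.

import std.range.primitives : ElementType, hasLength, hasSlicing;

// Illustrative pairwise summation: split the range in half, sum each half
// recursively, and add the partial sums. Rounding error grows like O(log n)
// instead of O(n), at the cost of a function call per level of recursion.
auto pairwiseSum(R)(R r)
    if (hasSlicing!R && hasLength!R)
{
    alias T = ElementType!R;
    if (r.length <= 8)                  // small base case: plain running sum
    {
        T acc = 0;
        foreach (x; r)
            acc += x;
        return acc;
    }
    immutable mid = r.length / 2;       // halve and recurse
    return pairwiseSum(r[0 .. mid]) + pairwiseSum(r[mid .. r.length]);
}

This recursive structure is exactly the kind of code that LDC and GDC tend to inline and flatten more aggressively than DMD, which is consistent with the timings reported in this thread.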
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons.

First, a minor point: the D community is usually pretty careful not to frown on a particular coding style (unlike some communities), so if you are comfortable writing loops and it gives you the fastest code, you should do so.

On the performance issue, you can see this related post about performance with reduce:
http://forum.dlang.org/post/mailman.4829.1434623275.7663.digitalmars-d@puremagic.com

This was Walter's response:
http://forum.dlang.org/post/mlvb40$1tdf$1@digitalmars.com

And this shows that LDC flat out does a better job of optimization in this case:
http://forum.dlang.org/post/mailman.4899.1434779705.7663.digitalmars-d@puremagic.com
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> [...]
> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?

The problem is not with ranges, but with the particular algorithm used for summing. If you look at the docs (http://dlang.org/phobos-prerelease/std_algorithm_iteration.html#.sum) you'll see that if the range has random access, `sum` will use the pair-wise algorithm.

About the second and third tests, the problem is with DMD, which should not be used when measuring performance (but only for development, because it has fast compile times). These are the results that I get with LDC:

Pair-wise (sumtest1):
415 ms
21 ms
20 ms

And if I use the Kahan algorithm:
106 ms
36 ms
31 ms

The second two results are probably larger due to noise.

And if I increase N to 100_000:

Pair-wise (sumtest1):
29557 ms
2061 ms
1990 ms

Kahan:
4566 ms
2067 ms
1990 ms

According to `dub --verbose`, my command line was roughly this:

ldc2 -ofapp -release -O5 -singleobj -w source/app.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/internal.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/iteration.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/package.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/selection.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/slice.d
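For readers who have not seen it, the Kahan (compensated) summation mentioned above works roughly as in the following sketch. This is only an illustrative hand-written version for a range of doubles; the kahanSum name is an invention for the example, and it is not necessarily the exact code behind the timings above.

// Kahan summation: carry a small correction term that recovers the low-order
// bits lost each time a value is added to the much larger running sum.
double kahanSum(R)(R r)
{
    double sum = 0.0;
    double c   = 0.0;               // running compensation
    foreach (double x; r)
    {
        immutable y = x - c;        // apply the pending correction
        immutable t = sum + y;      // low-order bits of y may be lost here
        c = (t - sum) - y;          // capture what was lost in that addition
        sum = t;
    }
    return sum;
}

The extra accuracy comes with a strictly sequential data dependence, so a plain or pairwise loop that the compiler can vectorize may still be faster.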
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to ZombineDev

On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
> ...
> And if I use the Kahan algorithm:
> 106 ms
> 36 ms
> 31 ms
> The second two results are probably larger due to noise.
I did some more testing and clearly the larger times for N=1000 were just noise:
[LDC Kahan N=1000]
106 ms
36 ms
31 ms

59 ms
24 ms
22 ms

46 ms
21 ms
20 ms

45 ms
21 ms
20 ms

45 ms
21 ms
20 ms

59 ms
24 ms
21 ms

102 ms
35 ms
30 ms

104 ms
37 ms
29 ms

107 ms
36 ms
31 ms

46 ms
21 ms
20 ms
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to ZombineDev

On Sunday, 21 February 2016 at 16:36:22 UTC, ZombineDev wrote:
> On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
>> ...
>> And if I use the Kahan algorithm:
>> 106 ms
>> 36 ms
>> 31 ms
>> The second two results are probably larger due to noise.
>
> I did some more testing and clearly the larger times for N=1000 were just noise:
> [LDC Kahan N=1000]
> 106 ms
> 36 ms
> 31 ms
>
> 59 ms
> 24 ms
> 22 ms
>
> ...
Just for the record, with `DMD -release -O -inline`, Kahan, N=1000 I get:
325 ms
217 ms
165 ms

231 ms
117 ms
58 ms

131 ms
109 ms
58 ms

131 ms
109 ms
57 ms

131 ms
112 ms
57 ms

125 ms
106 ms
55 ms

125 ms
104 ms
55 ms

125 ms
105 ms
55 ms

125 ms
104 ms
55 ms

230 ms
115 ms
58 ms

131 ms
112 ms
58 ms

131 ms
109 ms
57 ms
February 22, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

If you do want to test the differences between the range approach and the loop approach, something like:

auto sumtest4(Range)(Range range) @safe pure {
    return range.reduce!((a, b) => a + b);
}

is a fairer comparison. I get results within 15% of sumtest2 with this using dmd. I think with ldc this would be identical, but the version in homebrew is too old to compile this.
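A possible variant of the same idea, shown purely as an illustration (the sumtest5 name and the explicit 0.0 seed are not from the thread): reduce also accepts a seed value, which pins the accumulator type to double, returns 0.0 for an empty row, and leaves the attributes to inference, as suggested earlier in the thread.

auto sumtest5(Range)(Range range)
{
    import std.algorithm.iteration : reduce;
    // Seeded form: reduce!fun(seed, range). Attributes (@safe, pure, ...)
    // are inferred because this is a template.
    return reduce!((a, b) => a + b)(0.0, range);
}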