February 21, 2016  Simple performance question from a newcomer
I've been vaguely aware of D for many years, but the recent addition of std.experimental.ndslice finally inspired me to give it a try, since my main expertise lies in the domain of scientific computing and I primarily use Python/Julia/C++, where multidimensional arrays can be handled with a great deal of expressiveness and flexibility. Before writing anything serious, I wanted to get a sense for the kind of code I would have to write to get the best performance for numerical calculations, so I wrote a trivial summation benchmark. The following code gave me slightly surprising results:

import std.stdio;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.range;
import std.experimental.ndslice;

void main() {
    int N = 1000;
    int Q = 20;
    int times = 1_000;
    double[] res1 = uninitializedArray!(double[])(N);
    double[] res2 = uninitializedArray!(double[])(N);
    double[] res3 = uninitializedArray!(double[])(N);
    auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
    StopWatch sw;
    double t0, t1, t2;
    sw.start();
    foreach (unused; 0..times) {
        for (int i=0; i<N; ++i) {
            res1[i] = sumtest1(f[i]);
        }
    }
    sw.stop();
    t0 = sw.peek().msecs;
    sw.reset();
    sw.start();
    foreach (unused; 0..times) {
        for (int i=0; i<N; ++i) {
            res2[i] = sumtest2(f[i]);
        }
    }
    sw.stop();
    t1 = sw.peek().msecs;
    sw.reset();
    sw.start();
    foreach (unused; 0..times) {
        sumtest3(f, res3, N, Q);
    }
    t2 = sw.peek().msecs;
    writeln(t0, " ms");
    writeln(t1, " ms");
    writeln(t2, " ms");
    assert( res1 == res2 );
    assert( res2 == res3 );
}

auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
    return range.sum;
}

auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
    double retval = 0.0;
    foreach (double f_ ; f) {
        retval += f_;
    }
    return retval;
}

auto sumtest3(Range)(Range f, double[] retval, double N, double Q) @safe pure nothrow @nogc {
    for (int i=0; i<N; ++i) {
        for (int j=1; j<Q; ++j) {
            retval[i] += f[i,j];
        }
    }
}

When I compiled it using dmd -release -inline -O -noboundscheck ../src/main.d, I got the following timings:

1268 ms
312 ms
271 ms

I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons. Nevertheless, the two explicit loop functions gave me an improvement by a factor of 4+. Furthermore, the difference between sumtest2 and sumtest3 seems to indicate that function calls have a significant overhead. I also tried using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, but that yielded even worse performance. I did not try the GDC/LDC compilers yet, since they don't seem to be up to date on the standard library and don't include the ndslice package last I checked.

Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
>
> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?
I didn't look at your code that thoroughly, but it is generally recommended that, if you're concerned about performance, you compile with gdc or ldc. dmd has fast compile times, but does not produce code that is as fast. You might want to check whether the performance differential is still large with one of those.
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

You can use -profile to see what is causing it.

  Num          Tree         Func         Per
  Calls        Time         Time         Call

  23000000   550799875    550243765        23   pure nothrow @nogc @safe double std.algorithm.iteration.sumPairwise!(double, std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice).sumPairwise(std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice)

On 21.2.2016 at 15:32, dextorious via Digitalmars-d-learn wrote:
> [...]
February 21, 2016  Re: Simple performance question from a newcomer
So I guess pairwise summation is the one to blame here.

On 21.2.2016 at 16:56, Daniel Kozak wrote:
> You can use -profile to see what is causing it.
>
> Num Tree Func Per
> Calls Time Time Call
>
> 23000000 550799875 550243765 23 pure nothrow @nogc @safe double std.algorithm.iteration.sumPairwise!(double, std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice).sumPairwise(std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice)
>
> On 21.2.2016 at 15:32, dextorious via Digitalmars-d-learn wrote:
>> I've been vaguely aware of D for many years, but the recent addition of std.experimental.ndslice finally inspired me to give it a try, since my main expertise lies in the domain of scientific computing and I primarily use Python/Julia/C++, where multidimensional arrays can be handled with a great deal of expressiveness and flexibility. Before writing anything serious, I wanted to get a sense for the kind of code I would have to write to get the best performance for numerical calculations, so I wrote a trivial summation benchmark. The following code gave me slightly surprising results:
>>
>> import std.stdio;
>> import std.array : array;
>> import std.algorithm;
>> import std.datetime;
>> import std.range;
>> import std.experimental.ndslice;
>>
>> void main() {
>> int N = 1000;
>> int Q = 20;
>> int times = 1_000;
>> double[] res1 = uninitializedArray!(double[])(N);
>> double[] res2 = uninitializedArray!(double[])(N);
>> double[] res3 = uninitializedArray!(double[])(N);
>> auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
>> StopWatch sw;
>> double t0, t1, t2;
>> sw.start();
>> foreach (unused; 0..times) {
>> for (int i=0; i<N; ++i) {
>> res1[i] = sumtest1(f[i]);
>> }
>> }
>> sw.stop();
>> t0 = sw.peek().msecs;
>> sw.reset();
>> sw.start();
>> foreach (unused; 0..times) {
>> for (int i=0; i<N; ++i) {
>> res2[i] = sumtest2(f[i]);
>> }
>> }
>> sw.stop();
>> t1 = sw.peek().msecs;
>> sw.reset();
>> sw.start();
>> foreach (unused; 0..times) {
>> sumtest3(f, res3, N, Q);
>> }
>> t2 = sw.peek().msecs;
>> writeln(t0, " ms");
>> writeln(t1, " ms");
>> writeln(t2, " ms");
>> assert( res1 == res2 );
>> assert( res2 == res3 );
>> }
>>
>> auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
>> return range.sum;
>> }
>>
>> auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
>> double retval = 0.0;
>> foreach (double f_ ; f) {
>> retval += f_;
>> }
>> return retval;
>> }
>>
>> auto sumtest3(Range)(Range f, double[] retval, double N, double Q)
>> @safe pure nothrow @nogc {
>> for (int i=0; i<N; ++i) {
>> for (int j=1; j<Q; ++j) {
>> retval[i] += f[i,j];
>> }
>> }
>> }
>>
>> When I compiled it using dmd -release -inline -O -noboundscheck
>> ../src/main.d, I got the following timings:
>> 1268 ms
>> 312 ms
>> 271 ms
>>
>> I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons. Nevertheless, the two explicit loop functions gave me an improvement by a factor of 4+. Furthermore, the difference between sumtest2 and sumtest3 seems to indicate that function calls have a significant overhead. I also tried using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, but that yielded even worse performance. I did not try the GDC/LDC compilers yet, since they don't seem to be up to date on the standard library and don't include the ndslice package last I checked.
>>
>> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?
>
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?

First off, you should really be using GDC or LDC if you want speed. On how to do that, see my blog post here: http://jackstouffer.com/blog/nd_slice.html, specifically the section titled "Getting Hands On".

Secondly, both of your other sum examples use naive element-by-element summation rather than the more accurate pairwise summation, which sum uses with random-access floating-point ranges, so you're not really comparing apples to apples here. Since Phobos' pairwise summation is recursive, it's very likely that DMD isn't doing all the optimizations that LDC or GDC can, such as inlining or tail-call optimization. I haven't compiled your code, so I can't check myself.

Also, attributes are inferred for templates, so there's no reason to include @safe, nothrow, etc. on templated functions.
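For reference, a minimal sketch of pairwise summation over a random-access range looks roughly like the code below. This is only an illustration of the recursive halving described above, not Phobos' actual std.algorithm.iteration.sum implementation; the pairwiseSum name and the base-case threshold of 8 are arbitrary choices for the example.

import std.range.primitives : ElementType, hasLength, hasSlicing;

// Illustrative pairwise summation: split the range in half, sum each half
// recursively, and add the partial sums. Rounding error grows like O(log n)
// instead of O(n), at the cost of a function call per level of recursion.
auto pairwiseSum(R)(R r)
    if (hasSlicing!R && hasLength!R)
{
    alias T = ElementType!R;
    if (r.length <= 8)                  // small base case: plain running sum
    {
        T acc = 0;
        foreach (x; r)
            acc += x;
        return acc;
    }
    immutable mid = r.length / 2;       // halve and recurse
    return pairwiseSum(r[0 .. mid]) + pairwiseSum(r[mid .. r.length]);
}

This recursive structure is exactly the kind of code that LDC and GDC tend to inline and flatten more aggressively than DMD, which is consistent with the timings reported in this thread.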
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons.

First, a minor point: the D community is usually pretty careful not to frown on a particular coding style (unlike some communities), so if you are comfortable writing loops and it gives you the fastest code, you should do so.

On the performance issue, you can see this related post about performance with reduce:
http://forum.dlang.org/post/mailman.4829.1434623275.7663.digitalmars-d@puremagic.com

This was Walter's response:
http://forum.dlang.org/post/mlvb40$1tdf$1@digitalmars.com

And this shows that LDC flat out does a better job of optimization in this case:
http://forum.dlang.org/post/mailman.4899.1434779705.7663.digitalmars-d@puremagic.com
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> [...]
> Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?

The problem is not with ranges, but with the particular algorithm used for summing. If you look at the docs (http://dlang.org/phobos-prerelease/std_algorithm_iteration.html#.sum) you'll see that if the range has random access, `sum` will use the pair-wise algorithm.

About the second and third tests, the problem is with DMD, which should not be used when measuring performance (but only for development, because it has fast compile times). These are the results that I get with LDC:

Pair-wise (sumtest1):
415 ms
21 ms
20 ms

And if I use the Kahan algorithm:
106 ms
36 ms
31 ms

The second two results are probably larger due to noise.

And if I increase N to 100_000:

Pair-wise (sumtest1):
29557 ms
2061 ms
1990 ms

Kahan:
4566 ms
2067 ms
1990 ms

According to `dub --verbose`, my command line was roughly this:

ldc2 -ofapp -release -O5 -singleobj -w source/app.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/internal.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/iteration.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/package.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/selection.d ../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/slice.d
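For readers who have not seen it, the Kahan (compensated) summation mentioned above works roughly as in the following sketch. This is only an illustrative hand-written version for a range of doubles; the kahanSum name is an invention for the example, and it is not necessarily the exact code behind the timings above.

// Kahan summation: carry a small correction term that recovers the low-order
// bits lost each time a value is added to the much larger running sum.
double kahanSum(R)(R r)
{
    double sum = 0.0;
    double c   = 0.0;               // running compensation
    foreach (double x; r)
    {
        immutable y = x - c;        // apply the pending correction
        immutable t = sum + y;      // low-order bits of y may be lost here
        c = (t - sum) - y;          // capture what was lost in that addition
        sum = t;
    }
    return sum;
}

The extra accuracy comes with a strictly sequential data dependence, so a plain or pairwise loop that the compiler can vectorize may still be faster.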
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to ZombineDev

On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
> ...
> And if I use the Kahan algorithm:
> 106 ms
> 36 ms
> 31 ms
> The second two results are probably larger due to noise.
I did some more testing and clearly the larger times for N=1000 were just noise:
[LDC Kahan N=1000]
106 ms
36 ms
31 ms

59 ms
24 ms
22 ms

46 ms
21 ms
20 ms

45 ms
21 ms
20 ms

45 ms
21 ms
20 ms

59 ms
24 ms
21 ms

102 ms
35 ms
30 ms

104 ms
37 ms
29 ms

107 ms
36 ms
31 ms

46 ms
21 ms
20 ms
February 21, 2016  Re: Simple performance question from a newcomer
Posted in reply to ZombineDev

On Sunday, 21 February 2016 at 16:36:22 UTC, ZombineDev wrote:
> On Sunday, 21 February 2016 at 16:29:26 UTC, ZombineDev wrote:
>> ...
>> And if I use the Kahan algorithm:
>> 106 ms
>> 36 ms
>> 31 ms
>> The second two results are probably larger due to noise.
>
> I did some more testing and clearly the larger times for N=1000 were just noise:
> [LDC Kahan N=1000]
> 106 ms
> 36 ms
> 31 ms
>
> 59 ms
> 24 ms
> 22 ms
>
> ...
Just for the record, with `DMD -release -O -inline`, Kahan, N=1000 I get:
325 ms
217 ms
165 ms

231 ms
117 ms
58 ms

131 ms
109 ms
58 ms

131 ms
109 ms
57 ms

131 ms
112 ms
57 ms

125 ms
106 ms
55 ms

125 ms
104 ms
55 ms

125 ms
105 ms
55 ms

125 ms
104 ms
55 ms

230 ms
115 ms
58 ms

131 ms
112 ms
58 ms

131 ms
109 ms
57 ms
February 22, 2016  Re: Simple performance question from a newcomer
Posted in reply to dextorious

If you do want to test the differences between the range approach and the loop approach, something like:

auto sumtest4(Range)(Range range) @safe pure {
    return range.reduce!((a, b) => a + b);
}

is a fairer comparison. I get results within 15% of sumtest2 with this using dmd. I think with ldc this would be identical, but the version in homebrew is too old to compile this.
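A possible variant of the same idea, shown purely as an illustration (the sumtest5 name and the explicit 0.0 seed are not from the thread): reduce also accepts a seed value, which pins the accumulator type to double, returns 0.0 for an empty row, and leaves the attributes to inference, as suggested earlier in the thread.

auto sumtest5(Range)(Range range)
{
    import std.algorithm.iteration : reduce;
    // Seeded form: reduce!fun(seed, range). Attributes (@safe, pure, ...)
    // are inferred because this is a template.
    return reduce!((a, b) => a + b)(0.0, range);
}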