May 28, 2019
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

I'm confused by the measurements coming out in steps of roughly 16-17 ms: 17ms, 34ms, 75ms, 98ms, 128ms... That looks like timer granularity rather than real differences.
SSE(2) should give the same numbers across implementations (most of the compilers are based on LLVM).
May 28, 2019
On Tuesday, 28 May 2019 at 05:54:07 UTC, Marco de Wild wrote:
> On Tuesday, 28 May 2019 at 05:20:14 UTC, Uknown wrote:
>> On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
>>> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
>>
>> I tested 3 D variants :
>>
>> ---ver1.d
>> double sum = 0.0;
>> for (int i = 0; i < values.length; i++)
>> {
>> 	double v = values[i] * values[i];
>> 	sum += v;
>> }
>>
>>
>> ---ver2.d
>> double sum = 0.0;
>> foreach (v; values)
>>         sum += v * v;
>> return sum;
>>
>>
>> ---ver3.d
>> import std.algorithm : sum;
>> double[] squares = new double[values.length];
>> squares[] = values[] * values[];
>> return squares.sum;
>>
>> All 3 were the exact same with LDC. https://run.dlang.io/is/6pjEud
>
> When the blog post was released I wrote a few benchmarks. Surprisingly, using
>
> values.map!(x => x*x).sum
>
> was the fastest (faster than v1). It took around 20 us on my machine.

Should have been 20 ms of course.

https://run.dlang.io/is/Fpg8Iw

21 ms, 387 μs, and 7 hnsecs (map)
32 ms, 191 μs, and 1 hnsec (foreach)
32 ms, 183 μs, and 8 hnsecs (for)

However, recompiling it with LDC (to reproduce the exact compile flags) gives exactly the opposite result *facepalm*, bumping the map to 40 ms:

41 ms, 792 μs, and 7 hnsecs
30 ms and 893 μs
31 ms, 76 μs, and 6 hnsecs
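For anyone wanting to reproduce numbers like these, a minimal harness using std.datetime.stopwatch's benchmark (the variant bodies mirror the snippets above; the setup values are illustrative) could look like:

```d
import std.algorithm : map, sum;
import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

void main()
{
    auto values = new double[1_000_000];
    values[] = 2.0;

    double r1, r2;

    // benchmark runs each delegate the given number of times
    // and returns the total elapsed Duration for each.
    auto results = benchmark!(
        () { r1 = values.map!(x => x * x).sum; },
        () { double s = 0.0; foreach (v; values) s += v * v; r2 = s; },
    )(100);

    writeln("map:     ", results[0]);
    writeln("foreach: ", results[1]);
    writeln(r1, " ", r2); // use the results so the loops aren't eliminated
}
```

Printing the accumulated sums at the end matters: if the results are discarded, the optimizer may delete the loops entirely, which skews exactly this kind of comparison.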
May 28, 2019
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

By using a D driver to call extern(C) implementations, I get 27 ms for everything here:

https://github.com/atilaneves/blog-obvious

Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
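For reference, the driver approach relies on D's C ABI interop: declare the C function with extern(C) and pass a pointer/length pair. A minimal self-contained sketch (the function name is illustrative, not taken from the linked repo; here the "C implementation" is written in D with C linkage so the example compiles on its own):

```d
import std.stdio : writeln;

// The "C side": a plain loop exported with C linkage,
// exactly as a separately compiled C file would provide it.
extern(C) double sum_squares(const(double)* values, size_t n)
{
    double sum = 0.0;
    foreach (i; 0 .. n)
        sum += values[i] * values[i];
    return sum;
}

void main()
{
    double[] values = [1.0, 2.0, 3.0];
    // The D driver just hands over ptr + length, matching the C ABI.
    writeln(sum_squares(values.ptr, values.length)); // prints 14
}
```

In the real repo the implementation lives in a .c file compiled with a C compiler and linked in; only the extern(C) declaration appears on the D side.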
May 28, 2019
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

Full results for:
https://pastebin.com/j0T0MRmA (small changes to Marco de Wild's code)
Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 714 ╬╝s, and 9 hnsecs         r=10922666154674544967680
t2=42 ms and 614 ╬╝s                    r=10922666154674544967680
t3=0 hnsecs     r=0
t4=42 ms, 474 ╬╝s, and 8 hnsecs         r=10922666154674544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=141 ms, 263 ╬╝s, and 5 hnsecs        r=10922666154673907433000
t2=143 ms, 128 ╬╝s, and 9 hnsecs        r=10922666154673907433000
t3=1 hnsec      r=0
t4=491 ms, 829 ╬╝s, and 9 hnsecs        r=10922666154673907433000

1) DMD and LDC give different sums (probably fast-math, I don't know)
2) t3=0 for d_with_sum. Let's look at the LDC assembly (-output-s):
	.def	 _D6times210d_with_sumFNaNbNfAdZd;
	.scl	2;
	.type	32;
	.endef
	.section	.text,"xr",discard,_D6times210d_with_sumFNaNbNfAdZd
	.globl	_D6times210d_with_sumFNaNbNfAdZd
	.p2align	4, 0x90
_D6times210d_with_sumFNaNbNfAdZd:
	vxorps	%xmm0, %xmm0, %xmm0
	retq	// so this just means "return 0"? Cool optimization.
3) on Windows it's better to print "us" instead of "μs" (the console garbles it to "╬╝s" with /SUBSYSTEM:CONSOLE)

May 28, 2019
> https://pastebin.com/j0T0MRmA
> Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1
> 2) t3=0 for d_with_sum. lets see assembler for LDC (-output-s):

People explained why the code is being thrown away:
https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej@forum.dlang.org

I now get the following results:

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 929 ╬╝s, and 1 hnsec          r=109226661546_74544967680
t2=42 ms and 578 ╬╝s                    r=109226661546_74544967680
t3=333 ms, 539 ╬╝s, and 3 hnsecs        r=109226661546_66672259072
t4=42 ms, 631 ╬╝s, and 9 hnsecs         r=109226661546_74544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe

core.exception.OutOfMemoryError@src\core\exception.d(702): Memory allocation failed
----------------

I have 16 GB RAM, 8 GB free (according to Task Manager).
double[32M].sizeof is only 256 MB.
Removing the attributes from d_with_sum and main changed nothing.
A little strange.
May 28, 2019
> When the blog post released I wrote a few benchmarks. Surprisingly, using
>
> values.map!(x => x*x).sum
>
> was the fastest (faster than v1). It got around to 20 us on my machine.

All of the code was skipped because its results were unused.
Code that uses the results: https://pastebin.com/j0T0MRmA

And d_with_sum was skipped because of this:
https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej@forum.dlang.org

With those issues fixed, all the results come out at XXms.
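The dead-code elimination is easy to reproduce: if the result of a computation is never used, LDC is free to delete the whole loop, which is exactly why the d_with_sum timing came out as zero. A minimal sketch of keeping the result alive (function name is illustrative):

```d
import std.algorithm : map, sum;
import std.stdio : writeln;

double squaresSum(double[] values)
{
    return values.map!(x => x * x).sum;
}

void main()
{
    auto values = new double[1000];
    values[] = 2.0;

    // If this call's result were discarded, the optimizer could
    // legally remove the entire computation -- timing it would
    // then measure nothing at all.
    double r = squaresSum(values);
    writeln(r); // printing the result keeps the work observable
}
```

Accumulating each iteration's result into a printed total, as the pastebin does, serves the same purpose.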
May 28, 2019
On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
>
> Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.

This really isn't _that_ surprising.

Once properly optimized, native code runs at the same speed regardless of the input language.
C, C++, D and Rust all share a "no room below" ethic in most cases, so you end up with the very same performance, barring anomalies like bounds checks or integer overflow checks.

Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
May 28, 2019
On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
> On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
>>
>> Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
>
> This really isn't _that_ surprising.
>
> Once properly optimized, native code is the same speed for every input language.
> C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds check or integer overflow checks.
>
> Comparisons of backends would be much more interesting, but drive less interest on Internet forums.

Indeed, the only things that usually have any effect are aliasing rules and the occasional convincing of the code generator to do non-temporal ops. The REAL power of the frontend language is making the optimiser aware that the code is redundant, because there's no faster code than no code at all. That's why I'm really excited to see what we could use MLIR[1] for in LDC.

[1]: https://github.com/tensorflow/mlir/
May 28, 2019
On 5/28/2019 2:49 AM, Atila Neves wrote:
> Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.

I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.
May 28, 2019
On Tue, May 28, 2019 at 10:11:43AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/28/2019 2:49 AM, Atila Neves wrote:
> > Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
> 
> I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.

Does dmd unroll loops yet? That appeared to be a major cause of suboptimal codegen in dmd the last time I checked. It would be nice to improve this.


T

-- 
Some ideas are so stupid that only intellectuals could believe them. -- George Orwell