May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Andrei Alexandrescu

On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

I'm confused by the measurements landing in steps of roughly 16/17 ms: 17 ms, 34 ms, 75 ms, 98 ms, 128 ms... SSE(2) should give the same numbers everywhere, since most of the compilers are LLVM-based.
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Marco de Wild

On Tuesday, 28 May 2019 at 05:54:07 UTC, Marco de Wild wrote:
> On Tuesday, 28 May 2019 at 05:20:14 UTC, Uknown wrote:
>> On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
>>> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
>>
>> I tested 3 D variants:
>>
>> ---ver1.d
>> double sum = 0.0;
>> for (int i = 0; i < values.length; i++)
>> {
>>     double v = values[i] * values[i];
>>     sum += v;
>> }
>>
>> ---ver2.d
>> double sum = 0.0;
>> foreach (v; values)
>>     sum += v * v;
>> return sum;
>>
>> ---ver3.d
>> import std.algorithm : sum;
>> double[] squares;
>> squares[] = values[] * values[];
>> return squares.sum;
>>
>> All 3 were the exact same with LDC. https://run.dlang.io/is/6pjEud
>
> When the blog post was released I wrote a few benchmarks. Surprisingly, using
>
>     values.map!(x => x*x).sum
>
> was the fastest (faster than v1). It got to around 20 us on my machine.

Should have been 20 ms, of course. https://run.dlang.io/is/Fpg8Iw

21 ms, 387 μs, and 7 hnsecs (map)
32 ms, 191 μs, and 1 hnsec (foreach)
32 ms, 183 μs, and 8 hnsecs (for)

However, recompiling it with LDC (to reproduce the exact compile flags) gives exactly the opposite result *facepalm*, bumping the map version to 40 ms:

41 ms, 792 μs, and 7 hnsecs (map)
30 ms and 893 μs (foreach)
31 ms, 76 μs, and 6 hnsecs (for)
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Andrei Alexandrescu

On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

By using a D driver to call extern C implementations, I get 27 ms for everything here: https://github.com/atilaneves/blog-obvious

Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Andrei Alexandrescu

On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

Full code: https://pastebin.com/j0T0MRmA (small changes to Marco de Wild's code).

Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 714 μs, and 9 hnsecs  r=10922666154674544967680
t2=42 ms and 614 μs             r=10922666154674544967680
t3=0 hnsecs                     r=0
t4=42 ms, 474 μs, and 8 hnsecs  r=10922666154674544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=141 ms, 263 μs, and 5 hnsecs r=10922666154673907433000
t2=143 ms, 128 μs, and 9 hnsecs r=10922666154673907433000
t3=1 hnsec                      r=0
t4=491 ms, 829 μs, and 9 hnsecs r=10922666154673907433000

1) DMD and LDC give different sums (probably fast-math, I don't know).

2) t3=0 for d_with_sum. Let's look at the assembler from LDC (-output-s):

.def     _D6times210d_with_sumFNaNbNfAdZd;
.scl     2;
.type    32;
.endef
.section .text,"xr",discard,_D6times210d_with_sumFNaNbNfAdZd
.globl   _D6times210d_with_sumFNaNbNfAdZd
.p2align 4, 0x90
_D6times210d_with_sumFNaNbNfAdZd:
        vxorps  %xmm0, %xmm0, %xmm0
        retq            # this means "return 0"? cool optimization

3) On Windows it would be better to print "us" instead of "μs": the console mangles it to "╬╝s" (when /SUBSYSTEM:CONSOLE).
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to KnightMare

> https://pastebin.com/j0T0MRmA
> Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1
> 2) t3=0 for d_with_sum. lets see assembler for LDC (-output-s):

People explained why the code was being thrown away:
https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej@forum.dlang.org

I have these results now:

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 929 μs, and 1 hnsec   r=109226661546_74544967680
t2=42 ms and 578 μs             r=109226661546_74544967680
t3=333 ms, 539 μs, and 3 hnsecs r=109226661546_66672259072
t4=42 ms, 631 μs, and 9 hnsecs  r=109226661546_74544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
core.exception.OutOfMemoryError@src\core\exception.d(702): Memory allocation failed
----------------

I have 16 GB RAM, 8 GB free (per Task Manager), and double[32M].sizeof = 256 MB.
Removing the attributes from d_with_sum and main changed nothing. A little strange.
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Marco de Wild

> When the blog post released I wrote a few benchmarks. Surprisingly, using
>
>     values.map!(x => x*x).sum
>
> was the fastest (faster than v1). It got around to 20 us on my machine.

All that code was skipped because its results were unused; code that uses the results is at https://pastebin.com/j0T0MRmA. And d_with_sum was skipped for the reason explained here:
https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej@forum.dlang.org

With both accounted for, all the results come out at XXms.
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Atila Neves

On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
>
> Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
This really isn't _that_ surprising.
Once properly optimized, native code is the same speed for every input language.
C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance, barring anomalies like bounds checks or integer overflow checks.
Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Guillaume Piolat

On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
> On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
>> Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
>
> This really isn't _that_ surprising.
>
> Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance, barring anomalies like bounds checks or integer overflow checks.
>
> Comparisons of backends would be much more interesting, but drive less interest on Internet forums.

Indeed. The only things that usually have any effect are the aliasing rules, plus the occasional need to convince the code generator to emit non-temporal ops. The REAL power of the frontend language is making the optimiser aware that code is redundant, because there's no faster code than no code at all. That's why I'm really excited to see what we could use MLIR[1] for in LDC.

[1]: https://github.com/tensorflow/mlir/
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Atila Neves

On 5/28/2019 2:49 AM, Atila Neves wrote:
> Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.
May 28, 2019 Re: I wonder how fast we'd do
Posted in reply to Walter Bright

On Tue, May 28, 2019 at 10:11:43AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/28/2019 2:49 AM, Atila Neves wrote:
> > Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
>
> I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.

Does dmd unroll loops yet? That appears to be a major cause of suboptimal codegen in dmd, last time I checked. Would be nice to improve this.

T

--
Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
Copyright © 1999-2021 by the D Language Foundation