August 24, 2014 Performance problem in reverse algorithm | ||||
---|---|---|---|---|
| ||||
Below there is a snippet of D code followed by a close C++ translation.

clang++ seems to have some problems optimizing the C++ version if my_reverse is not attributed "noinline". Unfortunately, I was not able to convince ldc to produce a fast executable. g++ and gdc both produce good results; annotation is not required here.

The same effect is visible in the D version if one replaces my_reverse with reverse from std.algorithm. Can someone figure out what the problem is?

clang++ / inline: 5.7s
ldc / inline: 5.7s
clang++ / noinline: 1.4s
ldc / noinline: 5.8s
g++: 1.2s
gdc: 1.2s

Test environment: Arch Linux x86_64

clang version 3.4.2
LDC - the LLVM D compiler (0.14.0)
g++ (GCC) 4.9.1
gdc (GCC) 4.9.1

clang++ -std=c++11 -march=native -O3 -DNOINLINE
ldc2 -O3 -mcpu=native -release -disable-boundscheck
g++ -std=c++11 -march=native -O3
gdc -O3 -march=native -frelease -fno-bounds-check

Moreover, if I do not specify '-disable-boundscheck' ldc2 produces the following error message:

/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
(.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): undefined reference to `curl_version_info'
/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
(.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): undefined reference to `curl_global_cleanup'
/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl13__shared_ctorZ':
(.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference to `curl_global_init'
collect2: error: ld returned 1 exit status
Error: /usr/bin/gcc failed with status: 1

////
// reverse.d

import std.algorithm, std.range;
import ldc.attribute;

@attribute("noinline")
void my_reverse(int* b, int* e)
{
    auto steps = (e - b) / 2;
    if (steps) {
        auto l = b;
        auto r = e - 1;
        do {
            swap(*l, *r);
            ++l;
            --r;
        } while (--steps);
    }
}

void main(string[] args)
{
    immutable N = 2000;
    immutable K = 10000;
    auto a = iota(N).array;
    for (auto n = 0; n <= N; ++n) {
        for (auto k = 0; k <= K; ++k) {
            my_reverse(&a[0], &a[0] + n);
        }
    }
}

////
// reverse.cpp

#include <numeric>
#include <vector>

#ifdef NOINLINE
__inline__ __attribute__((noinline))
#endif
void my_reverse(int* b, int* e)
{
    auto steps = (e - b) / 2;
    if (steps) {
        auto l = b;
        auto r = e - 1;
        do {
            std::swap(*l, *r);
            ++l;
            --r;
        } while (--steps);
    }
}

int main()
{
    const auto N = 2000;
    const auto K = 10000;
    std::vector<int> a(N);
    auto b = std::begin(a);
    auto e = std::end(a);
    std::iota(b, e, 0);
    for (auto n = 0; n <= N; ++n) {
        for (auto k = 0; k <= K; ++k) {
            my_reverse(&a[0], &a[0] + n);
        }
    }
}
|
August 25, 2014 Re: Performance problem in reverse algorithm | ||||
---|---|---|---|---|
| ||||
Posted in reply to Fool | Hi Fool!

On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
> Below there is a snippet of D code followed by a close C++ translation.
>
> clang++ seems to have some problems optimizing the C++ version if my_reverse is not attributed "noinline". Unfortunately, I was not able to convince ldc to produce a fast executable. g++ and gdc both produce good results; annotation is not required here.

The speed of binaries produced by ldc is roughly the same as of binaries from clang++. That is no wonder as both use the same LLVM machinery.
You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes).
Unfortunately, general attributes are not yet implemented in LDC (see issue #561, https://github.com/ldc-developers/ldc/issues/561).

> The same effect is visible in the D version if one replaces my_reverse with reverse from std.algorithm. Can someone figure out what the problem is?
>
> clang++ / inline: 5.7s
> ldc / inline: 5.7s
> clang++ / noinline: 1.4s
> ldc / noinline: 5.8s
> g++: 1.2s
> gdc: 1.2s
>
> Test environment: Arch Linux x86_64
>
> clang version 3.4.2
> LDC - the LLVM D compiler (0.14.0)
> g++ (GCC) 4.9.1
> gdc (GCC) 4.9.1
>
> clang++ -std=c++11 -march=native -O3 -DNOINLINE
> ldc2 -O3 -mcpu=native -release -disable-boundscheck
> g++ -std=c++11 -march=native -O3
> gdc -O3 -march=native -frelease -fno-bounds-check
>
> Moreover, if I do not specify '-disable-boundscheck' ldc2 produces the following error message:

That is an instance of issue #683, https://github.com/ldc-developers/ldc/issues/683. Sorry for that.

>
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
> (.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): undefined reference to `curl_version_info'
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
> (.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): undefined reference to `curl_global_cleanup'
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl13__shared_ctorZ':
> (.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference to `curl_global_init'
> collect2: error: ld returned 1 exit status
> Error: /usr/bin/gcc failed with status: 1
>
>
> ////
> // reverse.d
>
> import std.algorithm, std.range;
> import ldc.attribute;
>
> @attribute("noinline")
> void my_reverse(int* b, int* e)
> {
>     auto steps = (e - b) / 2;
>     if (steps) {
>         auto l = b;
>         auto r = e - 1;
>         do {
>             swap(*l, *r);
>             ++l;
>             --r;
>         } while (--steps);
>     }
> }
>
> void main(string[] args)
> {
>     immutable N = 2000;
>     immutable K = 10000;
>     auto a = iota(N).array;
>     for (auto n = 0; n <= N; ++n) {
>         for (auto k = 0; k <= K; ++k) {
>             my_reverse(&a[0], &a[0] + n);
>         }
>     }
> }
>
>
> ////
> // reverse.cpp
>
> #include <numeric>
> #include <vector>
>
> #ifdef NOINLINE
> __inline__ __attribute__((noinline))
> #endif
> void my_reverse(int* b, int* e)
> {
>     auto steps = (e - b) / 2;
>     if (steps) {
>         auto l = b;
>         auto r = e - 1;
>         do {
>             std::swap(*l, *r);
>             ++l;
>             --r;
>         } while (--steps);
>     }
> }
>
> int main()
> {
>     const auto N = 2000;
>     const auto K = 10000;
>     std::vector<int> a(N);
>     auto b = std::begin(a);
>     auto e = std::end(a);
>     std::iota(b, e, 0);
>     for (auto n = 0; n <= N; ++n) {
>         for (auto k = 0; k <= K; ++k) {
>             my_reverse(&a[0], &a[0] + n);
>         }
>     }
> }

I will try to check this but it will take some time as I am busy with some other D related tasks.

Regards,
Kai
|
August 25, 2014 Re: Performance problem in reverse algorithm | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kai Nacke | Hi Kai!

On Monday, 25 August 2014 at 16:49:34 UTC, Kai Nacke wrote:
> On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
> The speed of binaries produced by ldc is roughly the same as of binaries from clang++. That is no wonder as both use the same LLVM machinery.

Yes, that's what I thought. Probably, there is something strange going on in the depths of LLVM.

> You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes).
> Unfortunately, general attributes are not yet implemented in LDC (see issue #561, https://github.com/ldc-developers/ldc/issues/561).

Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.

> That is an instance of issue #683, https://github.com/ldc-developers/ldc/issues/683. Sorry for that.

No problem, thanks for the info!

> I will try to check this but it will take some time as I am busy with some other D related tasks.

Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-IR representations.

Kind regards,
Fool
|
August 31, 2014 Re: Performance problem in reverse algorithm | ||||
---|---|---|---|---|
| ||||
Posted in reply to Fool | On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
> Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-I representations.
>
> Kind regards,
> Fool
I tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations on the functions, and got a decent speedup for ldc. I currently don't have gdc installed, so I can't check whether these changes make gdc even faster or whether they are part of the difference.
Regards,
Daniel N
|
August 31, 2014 Re: Performance problem in reverse algorithm | ||||
---|---|---|---|---|
| ||||
Posted in reply to Daniel N | On Sunday, 31 August 2014 at 03:47:04 UTC, Daniel N wrote:
> I tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations on the functions, got a decent speedup for ldc, I currently don't have gdc installed, so I can't check if these changes make gdc even faster or if it's part of the difference.
>
> Regards,
> Daniel N
Hmm, sorry: the execution time varies too wildly on my system, so the speedup I reported was a measurement error.
|
October 11, 2014 Re: Performance problem in reverse algorithm | ||||
---|---|---|---|---|
| ||||
Posted in reply to Fool | On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
> Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.
I just tested this. On the machine I used, the code LDC produces with inlining disabled is just as fast as the code produced by Clang.
---
$ ldc2 -version
LDC - the LLVM D compiler (c8b9f6):
based on DMD v2.066 and LLVM 3.5.0
Default target: x86_64-unknown-linux-gnu
Host CPU: corei7-avx
[…]
$ clang++ --version
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
---
David
|
October 11, 2014 Re: Performance problem in reverse algorithm | ||||
---|---|---|---|---|
| ||||
Posted in reply to David Nadlinger | Hi David,
thank you for checking this case.
Currently, I'm using
LDC - the LLVM D compiler (0.14.0):
based on DMD v2.065 and LLVM 3.4.2
Default target: x86_64-unknown-linux-gnu
Host CPU: core-avx2
If waiting for the next version of ldc is all I have to do, that's good news. :-)
Thank you again, kind greetings,
Fool
On Saturday, 11 October 2014 at 20:04:59 UTC, David Nadlinger wrote:
> ---
> $ ldc2 -version
> LDC - the LLVM D compiler (c8b9f6):
> based on DMD v2.066 and LLVM 3.5.0
> Default target: x86_64-unknown-linux-gnu
> Host CPU: corei7-avx
> […]
>
> $ clang++ --version
> clang version 3.5.0 (tags/RELEASE_350/final)
> Target: x86_64-unknown-linux-gnu
> Thread model: posix
> ---
|