August 24, 2014 Performance problem in reverse algorithm | ||||
|---|---|---|---|---|
| ||||
Below there is a snippet of D code followed by a close C++ translation.
clang++ seems to have some problems optimizing the C++ version if my_reverse is not attributed "noinline". Unfortunately, I was not able to convince ldc to produce a fast executable. g++ and gdc both produce good results; annotation is not required here.
The same effect is visible in the D version if one replaces my_reverse with reverse from std.algorithm. Can someone figure out what the problem is?
clang++ / inline: 5.7s
ldc / inline: 5.7s
clang++ / noinline: 1.4s
ldc / noinline: 5.8s
g++: 1.2s
gdc: 1.2s
Test environment: Arch Linux x86_64
clang version 3.4.2
LDC - the LLVM D compiler (0.14.0)
g++ (GCC) 4.9.1
gdc (GCC) 4.9.1
clang++ -std=c++11 -march=native -O3 -DNOINLINE
ldc2 -O3 -mcpu=native -release -disable-boundscheck
g++ -std=c++11 -march=native -O3
gdc -O3 -march=native -frelease -fno-bounds-check
Moreover, if I do not specify '-disable-boundscheck' ldc2 produces the following error message:
/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
(.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): undefined reference to `curl_version_info'
/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
(.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): undefined reference to `curl_global_cleanup'
/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl13__shared_ctorZ':
(.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference to `curl_global_init'
collect2: error: ld returned 1 exit status
Error: /usr/bin/gcc failed with status: 1
////
// reverse.d
import std.algorithm, std.range;
import ldc.attribute;

@attribute("noinline")
void my_reverse(int* b, int* e)
{
    auto steps = (e - b) / 2;
    if (steps) {
        auto l = b;
        auto r = e - 1;
        do {
            swap(*l, *r);
            ++l;
            --r;
        } while (--steps);
    }
}

void main(string[] args)
{
    immutable N = 2000;
    immutable K = 10000;
    auto a = iota(N).array;
    for (auto n = 0; n <= N; ++n) {
        for (auto k = 0; k <= K; ++k) {
            my_reverse(&a[0], &a[0] + n);
        }
    }
}
////
// reverse.cpp
#include <numeric>
#include <vector>

#ifdef NOINLINE
__inline__ __attribute__((noinline))
#endif
void my_reverse(int* b, int* e)
{
    auto steps = (e - b) / 2;
    if (steps) {
        auto l = b;
        auto r = e - 1;
        do {
            std::swap(*l, *r);
            ++l;
            --r;
        } while (--steps);
    }
}

int main()
{
    const auto N = 2000;
    const auto K = 10000;
    std::vector<int> a(N);
    auto b = std::begin(a);
    auto e = std::end(a);
    std::iota(b, e, 0);
    for (auto n = 0; n <= N; ++n) {
        for (auto k = 0; k <= K; ++k) {
            my_reverse(&a[0], &a[0] + n);
        }
    }
}
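To compare both behaviours in a single binary rather than by rebuilding with -DNOINLINE, one could sketch two wrappers around the same loop. This is only an illustration: the names SKETCH_NOINLINE, reverse_loop, reverse_inlinable and reverse_noinline are made up here, and the attribute spelling is the GCC/Clang one already used in reverse.cpp. On other compilers the macro is assumed to expand to nothing, so both variants would behave alike.

```cpp
#include <cassert>
#include <utility>

// Assumption: GCC and Clang accept this attribute spelling; elsewhere the
// macro is empty and the two wrappers below are not distinguished.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

// Shared loop, same steps as my_reverse in the post.
static inline void reverse_loop(int* b, int* e)
{
    assert(b <= e);  // precondition added for this sketch
    auto steps = (e - b) / 2;
    if (steps) {
        auto l = b;
        auto r = e - 1;
        do {
            std::swap(*l, *r);
            ++l;
            --r;
        } while (--steps);
    }
}

// Variant the optimizer is free to inline into its caller.
inline void reverse_inlinable(int* b, int* e) { reverse_loop(b, e); }

// Variant that is requested to stay an out-of-line call (GCC/Clang).
SKETCH_NOINLINE void reverse_noinline(int* b, int* e) { reverse_loop(b, e); }
```

Calling each wrapper from the same pair of n/k loops as in main would then allow timing both in one run; whether that reproduces the numbers above is not something this sketch can promise.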
| ||||
August 25, 2014 Re: Performance problem in reverse algorithm | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Fool | Hi Fool!
On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
> Below there is a snippet of D code followed by a close C++ translation.
>
> clang++ seems to have some problems optimizing the C++ version if my_reverse is not attributed "noinline". Unfortunately, I was not able to convince ldc to produce a fast executable. g++ and gdc both produce good results; annotation is not required here.
The speed of binaries produced by ldc is roughly the same as that of binaries from clang++. That is no wonder as both use the same LLVM machinery.
You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes).
Unfortunately, general attributes are not yet implemented in LDC (see issue #561, https://github.com/ldc-developers/ldc/issues/561).
> The same effect is visible in the D version if one replaces my_reverse with reverse from std.algorithm. Can someone figure out what the problem is?
>
> clang++ / inline: 5.7s
> ldc / inline: 5.7s
> clang++ / noinline: 1.4s
> ldc / noinline: 5.8s
> g++: 1.2s
> gdc: 1.2s
>
> Test environment: Arch Linux x86_64
>
> clang version 3.4.2
> LDC - the LLVM D compiler (0.14.0)
> g++ (GCC) 4.9.1
> gdc (GCC) 4.9.1
>
> clang++ -std=c++11 -march=native -O3 -DNOINLINE
> ldc2 -O3 -mcpu=native -release -disable-boundscheck
> g++ -std=c++11 -march=native -O3
> gdc -O3 -march=native -frelease -fno-bounds-check
>
> Moreover, if I do not specify '-disable-boundscheck' ldc2 produces the following error message:
That is an instance of issue #683, https://github.com/ldc-developers/ldc/issues/683. Sorry for that.
>
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
> (.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): undefined reference to `curl_version_info'
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
> (.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): undefined reference to `curl_global_cleanup'
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl13__shared_ctorZ':
> (.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference to `curl_global_init'
> collect2: error: ld returned 1 exit status
> Error: /usr/bin/gcc failed with status: 1
>
>
> ////
> // reverse.d
>
> import std.algorithm, std.range;
> import ldc.attribute;
>
> @attribute("noinline")
> void my_reverse(int* b, int* e)
> {
>     auto steps = (e - b) / 2;
>     if (steps) {
>         auto l = b;
>         auto r = e - 1;
>         do {
>             swap(*l, *r);
>             ++l;
>             --r;
>         } while (--steps);
>     }
> }
>
> void main(string[] args)
> {
>     immutable N = 2000;
>     immutable K = 10000;
>     auto a = iota(N).array;
>     for (auto n = 0; n <= N; ++n) {
>         for (auto k = 0; k <= K; ++k) {
>             my_reverse(&a[0], &a[0] + n);
>         }
>     }
> }
>
>
> ////
> // reverse.cpp
>
> #include <numeric>
> #include <vector>
>
> #ifdef NOINLINE
> __inline__ __attribute__((noinline))
> #endif
> void my_reverse(int* b, int* e)
> {
>     auto steps = (e - b) / 2;
>     if (steps) {
>         auto l = b;
>         auto r = e - 1;
>         do {
>             std::swap(*l, *r);
>             ++l;
>             --r;
>         } while (--steps);
>     }
> }
>
> int main()
> {
>     const auto N = 2000;
>     const auto K = 10000;
>     std::vector<int> a(N);
>     auto b = std::begin(a);
>     auto e = std::end(a);
>     std::iota(b, e, 0);
>     for (auto n = 0; n <= N; ++n) {
>         for (auto k = 0; k <= K; ++k) {
>             my_reverse(&a[0], &a[0] + n);
>         }
>     }
> }
I will try to check this but it will take some time as I am busy with some other D related tasks.
Regards,
Kai
| |||
August 25, 2014 Re: Performance problem in reverse algorithm | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Kai Nacke | Hi Kai!
On Monday, 25 August 2014 at 16:49:34 UTC, Kai Nacke wrote:
> On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
> The speed of binaries produced by ldc is roughly the same as that of binaries from clang++. That is no wonder as both use the same LLVM machinery.
Yes, that's what I thought. Probably, there is something strange going on in the depths of LLVM.
> You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes).
> Unfortunately, general attributes are not yet implemented in LDC (see issue #561, https://github.com/ldc-developers/ldc/issues/561).
Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.
> That is an instance of issue #683, https://github.com/ldc-developers/ldc/issues/683. Sorry for that.
No problem, thanks for the info!
> I will try to check this but it will take some time as I am busy with some other D related tasks.
Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-IR representations.
Kind regards,
Fool
| |||
August 31, 2014 Re: Performance problem in reverse algorithm | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Fool | On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
> Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-IR representations.
>
> Kind regards,
> Fool
I tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations to the functions, and got a decent speedup for ldc. I currently don't have gdc installed, so I can't check whether these changes make gdc even faster or whether they account for part of the difference.
Regards,
Daniel N
| |||
August 31, 2014 Re: Performance problem in reverse algorithm | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Daniel N | On Sunday, 31 August 2014 at 03:47:04 UTC, Daniel N wrote:
> I tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations on the functions, got a decent speedup for ldc, I currently don't have gdc installed, so I can't check if these changes make gdc even faster or if it's part of the difference.
>
> Regards,
> Daniel N
Hmm, sorry: the execution time varies too wildly on my system, so the speedup I reported was a measuring error.
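Run-to-run variation of this kind is commonly reduced by repeating the measurement and reporting a robust statistic instead of a single figure. A minimal C++ sketch, assuming in-process wall-clock timing with std::chrono; time_runs and median are hypothetical helper names and were not used by anyone in this thread:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: calls `work` `repeats` times and returns the
// wall-clock duration of each call in seconds.
template <typename F>
std::vector<double> time_runs(F work, std::size_t repeats)
{
    std::vector<double> seconds;
    seconds.reserve(repeats);
    for (std::size_t i = 0; i < repeats; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}

// Median of a non-empty sample (the upper median for even sizes).
// It is less sensitive to a single slow outlier than the mean.
double median(std::vector<double> v)
{
    assert(!v.empty());
    auto mid = v.begin() + v.size() / 2;
    std::nth_element(v.begin(), mid, v.end());
    return *mid;
}
```

For instance, passing a lambda that wraps the two benchmark loops to time_runs with a small repeat count (the count is arbitrary) and taking median of the result gives one figure per compiler setting. The optimizer may remove work whose result is never used, so the reversed array should be consumed somehow after timing.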
| |||
October 11, 2014 Re: Performance problem in reverse algorithm | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Fool | On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
> Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.
I just tested this, and on the machine I used, the code LDC produces with inlining disabled is just as fast as the code produced by Clang.
---
$ ldc2 -version
LDC - the LLVM D compiler (c8b9f6):
based on DMD v2.066 and LLVM 3.5.0
Default target: x86_64-unknown-linux-gnu
Host CPU: corei7-avx
[…]
$ clang++ --version
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
---
David
| |||
October 11, 2014 Re: Performance problem in reverse algorithm | ||||
|---|---|---|---|---|
| ||||
Posted in reply to David Nadlinger | Hi David,
thank you for checking this case.
Currently, I'm using
LDC - the LLVM D compiler (0.14.0):
based on DMD v2.065 and LLVM 3.4.2
Default target: x86_64-unknown-linux-gnu
Host CPU: core-avx2
If waiting for the next version of ldc is all I have to do, that's good news. :-)
Thank you again, kind greetings,
Fool
On Saturday, 11 October 2014 at 20:04:59 UTC, David Nadlinger wrote:
> ---
> $ ldc2 -version
> LDC - the LLVM D compiler (c8b9f6):
> based on DMD v2.066 and LLVM 3.5.0
> Default target: x86_64-unknown-linux-gnu
> Host CPU: corei7-avx
> […]
>
> $ clang++ --version
> clang version 3.5.0 (tags/RELEASE_350/final)
> Target: x86_64-unknown-linux-gnu
> Thread model: posix
> ---
| |||