Thread overview
Performance problem in reverse algorithm
Aug 24, 2014
Fool
Aug 25, 2014
Kai Nacke
Aug 25, 2014
Fool
Aug 31, 2014
Daniel N
Aug 31, 2014
Daniel N
Oct 11, 2014
David Nadlinger
Oct 11, 2014
David Nadlinger
Oct 11, 2014
Fool
August 24, 2014
Below there is a snippet of D code followed by a close C++ translation.

clang++ seems to have some problems optimizing the C++ version if my_reverse is not attributed "noinline". Unfortunately, I was not able to convince ldc to produce a fast executable. g++ and gdc both produce good results; annotation is not required here.

The same effect is visible in the D version if one replaces my_reverse with reverse from std.algorithm. Can someone figure out what the problem is?

clang++ /   inline: 5.7s
    ldc /   inline: 5.7s
clang++ / noinline: 1.4s
    ldc / noinline: 5.8s
               g++: 1.2s
               gdc: 1.2s

Test environment: Arch Linux x86_64

clang version 3.4.2
LDC - the LLVM D compiler (0.14.0)
g++ (GCC) 4.9.1
gdc (GCC) 4.9.1

clang++ -std=c++11 -march=native -O3 -DNOINLINE
ldc2 -O3 -mcpu=native -release -disable-boundscheck
g++ -std=c++11 -march=native -O3
gdc -O3 -march=native -frelease -fno-bounds-check

Moreover, if I do not specify '-disable-boundscheck' ldc2 produces the following error message:

/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
(.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): undefined reference to `curl_version_info'
/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
(.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): undefined reference to `curl_global_cleanup'
/usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl13__shared_ctorZ':
(.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference to `curl_global_init'
collect2: error: ld returned 1 exit status
Error: /usr/bin/gcc failed with status: 1


////
// reverse.d

import std.algorithm, std.range;
import ldc.attribute;

@attribute("noinline")
void my_reverse(int* b, int* e)
{
   auto steps = (e - b) / 2;
   if (steps) {
      auto l = b;
      auto r = e - 1;
      do {
         swap(*l, *r);
         ++l;
         --r;
      } while (--steps);
   }
}

void main(string[] args)
{
   immutable N = 2000;
   immutable K = 10000;
   auto a = iota(N).array;
   for (auto n = 0; n <= N; ++n) {
      for (auto k = 0; k <= K; ++k) {
         my_reverse(&a[0], &a[0] + n);
      }
   }
}


////
// reverse.cpp

#include <numeric>
#include <vector>

#ifdef NOINLINE
__inline__ __attribute__((noinline))
#endif
void my_reverse(int* b, int* e)
{
   auto steps = (e - b) / 2;
   if (steps) {
      auto l = b;
      auto r = e - 1;
      do {
         std::swap(*l, *r);
         ++l;
         --r;
      } while (--steps);
   }
}

int main()
{
   const auto N = 2000;
   const auto K = 10000;
   std::vector<int> a(N);
   auto b = std::begin(a);
   auto e = std::end(a);
   std::iota(b, e, 0);
   for (auto n = 0; n <= N; ++n) {
      for (auto k = 0; k <= K; ++k) {
         my_reverse(&a[0], &a[0] + n);
      }
   }
}
August 25, 2014
Hi Fool!

On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
> Below there is a snippet of D code followed by a close C++ translation.
>
> clang++ seems to have some problems optimizing the C++ version if my_reverse is not attributed "noinline". Unfortunately, I was not able to convince ldc to produce a fast executable. g++ and gdc both produce good results; annotation is not required here.

The speed of binaries produced by ldc is roughly the same as of binaries from clang++. That is no wonder as both use the same LLVM machinery.
You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes).
Unfortunately, general attributes are not yet implemented in LDC (see issue #561, https://github.com/ldc-developers/ldc/issues/561).

> The same effect is visible in the D version if one replaces my_reverse with reverse from std.algorithm. Can someone figure out what the problem is?
>
> clang++ /   inline: 5.7s
>     ldc /   inline: 5.7s
> clang++ / noinline: 1.4s
>     ldc / noinline: 5.8s
>                g++: 1.2s
>                gdc: 1.2s
>
> Test environment: Arch Linux x86_64
>
> clang version 3.4.2
> LDC - the LLVM D compiler (0.14.0)
> g++ (GCC) 4.9.1
> gdc (GCC) 4.9.1
>
> clang++ -std=c++11 -march=native -O3 -DNOINLINE
> ldc2 -O3 -mcpu=native -release -disable-boundscheck
> g++ -std=c++11 -march=native -O3
> gdc -O3 -march=native -frelease -fno-bounds-check
>
> Moreover, if I do not specify '-disable-boundscheck' ldc2 produces the following error message:

That is an instance of issue #683, https://github.com/ldc-developers/ldc/issues/683. Sorry for that.

>
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4HTTP18_sharedStaticCtor1FZv':
> (.text._D3std3net4curl4HTTP18_sharedStaticCtor1FZv+0x10): undefined reference to `curl_version_info'
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl4Curl18_sharedStaticDtor3FZv':
> (.text._D3std3net4curl4Curl18_sharedStaticDtor3FZv+0x1): undefined reference to `curl_global_cleanup'
> /usr/lib/liblphobos2.a(curl.o): In function `_D3std3net4curl13__shared_ctorZ':
> (.text._D3std3net4curl13__shared_ctorZ+0x10): undefined reference to `curl_global_init'
> collect2: error: ld returned 1 exit status
> Error: /usr/bin/gcc failed with status: 1
>
>
> ////
> // reverse.d
>
> import std.algorithm, std.range;
> import ldc.attribute;
>
> @attribute("noinline")
> void my_reverse(int* b, int* e)
> {
>    auto steps = (e - b) / 2;
>    if (steps) {
>       auto l = b;
>       auto r = e - 1;
>       do {
>          swap(*l, *r);
>          ++l;
>          --r;
>       } while (--steps);
>    }
> }
>
> void main(string[] args)
> {
>    immutable N = 2000;
>    immutable K = 10000;
>    auto a = iota(N).array;
>    for (auto n = 0; n <= N; ++n) {
>       for (auto k = 0; k <= K; ++k) {
>          my_reverse(&a[0], &a[0] + n);
>       }
>    }
> }
>
>
> ////
> // reverse.cpp
>
> #include <numeric>
> #include <vector>
>
> #ifdef NOINLINE
> __inline__ __attribute__((noinline))
> #endif
> void my_reverse(int* b, int* e)
> {
>    auto steps = (e - b) / 2;
>    if (steps) {
>       auto l = b;
>       auto r = e - 1;
>       do {
>          std::swap(*l, *r);
>          ++l;
>          --r;
>       } while (--steps);
>    }
> }
>
> int main()
> {
>    const auto N = 2000;
>    const auto K = 10000;
>    std::vector<int> a(N);
>    auto b = std::begin(a);
>    auto e = std::end(a);
>    std::iota(b, e, 0);
>    for (auto n = 0; n <= N; ++n) {
>       for (auto k = 0; k <= K; ++k) {
>          my_reverse(&a[0], &a[0] + n);
>       }
>    }
> }

I will try to check this but it will take some time as I am busy with some other D related tasks.

Regards,
Kai
August 25, 2014
Hi Kai!

On Monday, 25 August 2014 at 16:49:34 UTC, Kai Nacke wrote:
> On Sunday, 24 August 2014 at 16:45:12 UTC, Fool wrote:
> The speed of binaries produced by ldc is roughly the same as of binaries from clang++. That is no wonder as both use the same LLVM machinery.

Yes, that's what I thought. Probably, there is something strange going on in the depths of LLVM.


> You could try the LDC_never_inline pragma (http://wiki.dlang.org/LDC-specific_language_changes).
> Unfortunately, general attributes are not yet implemented in LDC (see issue #561, https://github.com/ldc-developers/ldc/issues/561).

Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.


> That is an instance of issue #683, https://github.com/ldc-developers/ldc/issues/683. Sorry for that.

No problem, thanks for the info!


> I will try to check this but it will take some time as I am busy with some other D related tasks.

Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-I representations.

Kind regards,
Fool
August 31, 2014
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
> Thanks, there is no time pressure. I compared ASM and LLVM-IR of the fast and slow versions using clang++. Unfortunately, my foo in this area is non-existent. The only thing I noticed is that the fast version has significantly longer ASM and LLVM-I representations.
>
> Kind regards,
> Fool

I tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations on the functions, got a decent speedup for ldc, I currently don't have gdc installed, so I can't check if these changes make gdc even faster or if it's part of the difference.

Regards,
Daniel N
August 31, 2014
On Sunday, 31 August 2014 at 03:47:04 UTC, Daniel N wrote:
> I tried changing 'immutable' to 'enum' and adding 'pure nothrow' annotations on the functions, got a decent speedup for ldc, I currently don't have gdc installed, so I can't check if these changes make gdc even faster or if it's part of the difference.
>
> Regards,
> Daniel N

Hmm, sorry the execution time varies too wildly on my system, measuring error.
October 11, 2014
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
> Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.

I just tested this, and on the machine I ran this on the code LDC produces with inlining disabled is just as fast as the one produced by Clang.

---
$ ldc2 -version
LDC - the LLVM D compiler (c8b9f6):
  based on DMD v2.066 and LLVM 3.5.0
  Default target: x86_64-unknown-linux-gnu
  Host CPU: corei7-avx
[…]

$ clang++ --version
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
---

David
October 11, 2014
On Monday, 25 August 2014 at 18:04:02 UTC, Fool wrote:
> Introducing pragma(LDC_never_inline) slightly improves execution time of the ldc result to 5.4s. Still, this is more than three times slower than the fast version produced by clang++.

I just tested this, and on the machine I ran this on the code LDC produces with inlining disabled is just as fast as the one produced by Clang.

---
$ ldc2 -version
LDC - the LLVM D compiler (c8b9f6):
  based on DMD v2.066 and LLVM 3.5.0
  Default target: x86_64-unknown-linux-gnu
  Host CPU: corei7-avx
[…]

$ clang++ --version
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
---

David
October 11, 2014
Hi David,

thank you for checking this case.

Currently, I'm using

LDC - the LLVM D compiler (0.14.0):
  based on DMD v2.065 and LLVM 3.4.2
  Default target: x86_64-unknown-linux-gnu
  Host CPU: core-avx2

If waiting for the next version of ldc is all I have to do that's good news. :-)

Thank you again, kind greetings,
Fool

On Saturday, 11 October 2014 at 20:04:59 UTC, David Nadlinger wrote:
> ---
> $ ldc2 -version
> LDC - the LLVM D compiler (c8b9f6):
>   based on DMD v2.066 and LLVM 3.5.0
>   Default target: x86_64-unknown-linux-gnu
>   Host CPU: corei7-avx
> […]
>
> $ clang++ --version
> clang version 3.5.0 (tags/RELEASE_350/final)
> Target: x86_64-unknown-linux-gnu
> Thread model: posix
> ---