Why is this code slow?
March 24

I know that benchmarks are always controversial and depend on a lot of factors. So far, I have read that D performs as well as, if not better than, C in benchmarks.

I wrote a little program that approximates PI using the Leibniz formula. I implemented the same thing in C, D and Python, all of them execute 1,000,000 iterations 20 times and display the average time elapsed.

Here are the results:

C: 0.04s
Python: 0.33s
D: 0.73s

What the hell? D slower than Python? This cannot be real. I am sure I am making a mistake here. I'm sharing all 3 programs here:

C: https://pastebin.com/s7e2HFyL
D: https://pastebin.com/fuURdupc
Python: https://pastebin.com/zcXAkSEf

As you can see the function that does the job is exactly the same in C and D.
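
For reference, the core of the D version looks like this (reconstructed here from the expression quoted in the replies below; the pastebins contain the full programs):

import std.math : pow;

double leibniz(int iter) {
    double n = 1.0;                                // first term of the series
    for (int i = 2; i < iter; i++)
        n += pow(-1.0, i - 1.0) / (i * 2.0 - 1.0); // alternating Leibniz terms
    return n * 4.0;                                // the series sums to pi/4
}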

Here are the compile/run commands used:

C: gcc leibniz.c -lm -oleibc
D: gdc leibniz.d -frelease -oleibd
Python: python3 leibniz.py

PS. my CPU is AMD A8-5500B and my OS is Ubuntu Linux, if that matters.

March 24
On Sunday, 24 March 2024 at 19:31:19 UTC, Csaba wrote:
> ...
>
> Here are the results:
>
> C: 0.04s
> Python: 0.33s
> D: 0.73s
> 
> ...

I think a few things could be going on, but one way to go is to try optimization flags like "-O2" and run again.

But anyway, looking at the generated assembly:

C: https://godbolt.org/z/45Kn1W93b
D: https://godbolt.org/z/Ghr3fqaTW

The Leibniz functions are very close to each other, except for one thing: the "pow" function on the D side. It's a template, so maybe you should start from there; in fact, I'd try the pow from C to see what happens.
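
In D that experiment could look like this (a sketch; swapping the import makes the same call resolve to libm's pow):

import core.stdc.math : pow;  // C's double pow(double, double) from libm

// the loop body stays exactly the same:
// n += pow(-1.0, i - 1.0) / (i * 2.0 - 1.0);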

Matheus.
March 24

On Sunday, 24 March 2024 at 19:31:19 UTC, Csaba wrote:

> As you can see the function that does the job is exactly the same in C and D.

Not really..

The speed of the Leibniz algorithm is mostly the same across languages. You can check the code in this benchmark, for example: https://github.com/niklas-heer/speed-comparison

What you could fix in your code:

  • you can use enum for BENCHMARKS and ITERATIONS
  • use pow from core.stdc.math
  • use sw.reset() in a loop

So the main part could look like this:

auto sw = StopWatch(AutoStart.no);
sw.start();
foreach (i; 0..BENCHMARKS) {
    result += leibniz(ITERATIONS);
    total_time += sw.peek.total!"nsecs";
    sw.reset();
}
sw.stop();
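
Put together, the three fixes might look like this (a sketch, not the exact benchmark code; the leibniz body follows the expression quoted elsewhere in the thread):

import std.stdio : writefln;
import std.datetime.stopwatch : AutoStart, StopWatch;
import core.stdc.math : pow;            // fix 2: C's pow instead of the template

enum ITERATIONS = 1_000_000;            // fix 1: compile-time constants
enum BENCHMARKS = 20;

double leibniz(int iter) {
    double n = 1.0;
    for (int i = 2; i < iter; i++)
        n += pow(-1.0, i - 1.0) / (i * 2.0 - 1.0);
    return n * 4.0;
}

void main() {
    double result = 0.0;
    long total_time = 0;

    auto sw = StopWatch(AutoStart.no);
    sw.start();
    foreach (i; 0 .. BENCHMARKS) {
        result += leibniz(ITERATIONS);
        total_time += sw.peek.total!"nsecs";
        sw.reset();                     // fix 3: reset inside the loop
    }
    sw.stop();

    writefln("%0.21f", result / BENCHMARKS);
    writefln("Avg execution time: %f", total_time / BENCHMARKS / 1e9);
}
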
March 24

On Sunday, 24 March 2024 at 19:31:19 UTC, Csaba wrote:

> I know that benchmarks are always controversial and depend on a lot of factors. So far, I have read that D performs as well as, if not better than, C in benchmarks.
>
> I wrote a little program that approximates PI using the Leibniz formula. I implemented the same thing in C, D and Python, all of them execute 1,000,000 iterations 20 times and display the average time elapsed.
>
> Here are the results:
>
> C: 0.04s
> Python: 0.33s
> D: 0.73s
>
> What the hell? D slower than Python? This cannot be real. I am sure I am making a mistake here. I'm sharing all 3 programs here:
>
> C: https://pastebin.com/s7e2HFyL
> D: https://pastebin.com/fuURdupc
> Python: https://pastebin.com/zcXAkSEf

Usually you do not translate mathematical expressions directly into code:

   n += pow(-1.0, i - 1.0) / (i * 2.0 - 1.0);

The term containing the pow invocation computes the alternating sequence -1, 1, -1, ..., which can be replaced by e.g.

   immutable int [2] sign = [-1, 1];
   n += sign [i & 1] / (i * 2.0 - 1.0);

This saves the expensive call to the pow function.
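
Spelled out as a complete function (assuming the loop starts at i = 2 with n = 1.0, matching the sequence above):

double leibniz(int iter) {
    immutable int[2] sign = [-1, 1];        // lookup table instead of pow
    double n = 1.0;
    for (int i = 2; i < iter; i++)
        n += sign[i & 1] / (i * 2.0 - 1.0); // sign[0] = -1 for even i
    return n * 4.0;
}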

March 24
On Sunday, 24 March 2024 at 21:21:13 UTC, kdevel wrote:

> The term containing the pow invocation computes the alternating sequence -1, 1, -1, ..., which can be replaced by e.g.
>
>    immutable int [2] sign = [-1, 1];
>    n += sign [i & 1] / (i * 2.0 - 1.0);
>
> This saves the expensive call to the pow function.

I used the loop:

	for (int i = 1; i < iter; i++)
		n += ((i%2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);

in both C and D, with gcc and gdc and got average execution times:

--- C -----
original   loop replacement   -O2
0.009989   0.003198           0.001335

--- D -----
original   loop replacement   -O2
0.230346   0.003083           0.001309

Almost no difference between C and D.

But the D binary is much larger on my Linux:
4600920 bytes instead of 15504 bytes for the C version.

Are there some simple switches / settings to get a smaller binary?

March 24

On Sunday, 24 March 2024 at 22:16:06 UTC, rkompass wrote:

> Are there some simple switches / settings to get a smaller binary?

  1. If possible you can use "betterC" - to disable runtime
  2. otherwise
--release --O3 --flto=full -fvisibility=hidden -defaultlib=phobos2-ldc-lto,druntime-ldc-lto -L=-dead_strip -L=-x -L=-S -L=-lz
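
These look like LDC flags; on one command line that might read (a sketch, assuming ldc2 and a linker that understands -dead_strip, which is macOS-specific):

ldc2 --release --O3 --flto=full -fvisibility=hidden -defaultlib=phobos2-ldc-lto,druntime-ldc-lto -L=-dead_strip -L=-x -L=-S -L=-lz leibniz.d -of=leibd
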
March 25

On Sunday, 24 March 2024 at 21:21:13 UTC, kdevel wrote:

> The term containing the pow invocation computes the alternating sequence -1, 1, -1, ..., which can be replaced by e.g.
>
>    immutable int [2] sign = [-1, 1];
>    n += sign [i & 1] / (i * 2.0 - 1.0);
>
> This saves the expensive call to the pow function.

I also used this code:

import std.stdio : writefln;
import std.datetime.stopwatch;

enum ITERATIONS = 1_000_000;
enum BENCHMARKS = 20;

auto leibniz(bool speed = true)(int iter) {
  double n = 1.0;

  static if(speed) const sign = [-1, 1];

  for(int i = 2; i < iter; i++) {
    static if(speed) {
      const m = i << 1;
      n += sign [i & 1] / (m - 1.0);
    } else {
      n += pow(-1, i - 1) / (i * 2.0 - 1.0);
    }
  }
  return n * 4.0;
}

auto pow(F, G)(F x, G n) @nogc @trusted pure nothrow {
    import std.traits : Unsigned, Unqual;

    real p = 1.0, v = void;
    Unsigned!(Unqual!G) m = n;

    if(n < 0) {
        if(n == -1) return 1 / x;
        m = cast(typeof(m))(0 - n);
        v = p / x;
    } else {
        switch(n) {
          case 0: return 1.0;
          case 1: return x;
          case 2: return x * x;
          default:
        }
        v = x;
    }
    while(true) {
        if(m & 1) p *= v;
        m >>= 1;
        if(!m) break;
        v *= v;
    }
    return p;
}

void main()
{
    double result;
    long total_time = 0;

    for(int i = 0; i < BENCHMARKS; i++)
    {
        auto sw = StopWatch(AutoStart.no);
        sw.start();

        result = ITERATIONS.leibniz;//!false;

        sw.stop();
        total_time += sw.peek.total!"nsecs";
    }

    result.writefln!"%0.21f";
    writefln("Avg execution time: %f\n", total_time / BENCHMARKS / 1e9);
}

and results:

dmd -run "leibnizTest.d"
3.141594653593692054727
Avg execution time: 0.002005

If I compile with leibniz!false(ITERATIONS), the average execution time increases noticeably:

Avg execution time: 0.044435

However, note that this version never calls into an external library: it uses a power function that works with integer exponents. Normally, the following library function would be called:

Unqual!(Largest!(F, G)) pow(F, G)(F x, G y) @nogc @trusted pure nothrow
if (isFloatingPoint!(F) && isFloatingPoint!(G))
...
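
The exponent's type selects the overload; a tiny illustration (hypothetical values; Phobos implements the integral-exponent overload with exponentiation by squaring, while the floating-point one above goes through the general, slower path):

import std.math : pow;
import std.stdio : writeln;

void main() {
    int i = 4;
    writeln(pow(-1.0, i - 1));    // integral exponent: fast squaring overload
    writeln(pow(-1.0, i - 1.0));  // floating-point exponent: the overload quoted above
}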

Now the person asking the question will rightly ask why D is slow even though we use exactly the same code as in C. You may think that the more watermelons you carry in your arms, the slower you naturally become. I think the important thing is not to drop the watermelons :)

SDB@79

March 25

On Sunday, 24 March 2024 at 23:02:19 UTC, Sergey wrote:

> On Sunday, 24 March 2024 at 22:16:06 UTC, rkompass wrote:
>
> > Are there some simple switches / settings to get a smaller binary?
>
>   1. If possible you can use "betterC" - to disable runtime
>   2. otherwise
> --release --O3 --flto=full -fvisibility=hidden -defaultlib=phobos2-ldc-lto,druntime-ldc-lto -L=-dead_strip -L=-x -L=-S -L=-lz

Thank you. I succeeded with gdc -Wall -O2 -frelease -shared-libphobos

A little remark:
The approximation to pi converges slowly and oscillates up and down around the limit much more than its average does. So averaging 2 consecutive steps gives many more precise digits. We can simulate this by doing a last step with half the size:

double leibniz(int it) {
  double n = 1.0;
  for (int i = 1; i < it; i++)
    n += ((i%2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);
  n += 0.5*((it%2) ? -1.0 : 1.0) / (it * 2.0 + 1.0);
  return n * 4.0;
}

Of course you may also combine the up(+) and down(-) step into one:

1/i - 1/(i+2) = 2/(i*(i+2))

double leibniz(int iter) {
  double n = 0.0;
  for (int i = 1; i < iter; i+=4)
    n += 2.0 / (i * (i+2.0));
  return n * 4.0;
}

or even combine both approaches. But of course, mathematically much more is possible. This was not about approximating pi as fast as possible...

The first approach above still runs at the original speed; it only makes the result a little bit nicer.

March 26

On Sunday, 24 March 2024 at 21:21:13 UTC, kdevel wrote:

> Usually you do not translate mathematical expressions directly into code:
>
>    n += pow(-1.0, i - 1.0) / (i * 2.0 - 1.0);
>
> The term containing the pow invocation computes the alternating sequence -1, 1, -1, ..., which can be replaced by e.g.
>
>    immutable int [2] sign = [-1, 1];
>    n += sign [i & 1] / (i * 2.0 - 1.0);
>
> This saves the expensive call to the pow function.

I know that the code can be simplified/optimized, I just wanted to compare the same expression in C and D.

March 26

On Monday, 25 March 2024 at 14:02:08 UTC, rkompass wrote:

> Of course you may also combine the up(+) and down(-) step into one:
>
> 1/i - 1/(i+2) = 2/(i*(i+2))
>
> double leibniz(int iter) {
>   double n = 0.0;
>   for (int i = 1; i < iter; i+=4)
>     n += 2.0 / (i * (i+2.0));
>   return n * 4.0;
> }
>
> or even combine both approaches. But of course, mathematically much more is possible. This was not about approximating pi as fast as possible...
>
> The first approach above still runs at the original speed; it only makes the result a little bit nicer.

It's obvious that you are a good mathematician. You used sequence A005563. First of all, I must apologize to the questioner for digressing from the topic. But I saw that there is a difference in the computed value between real and double. My goal was to see whether there would be a change in speed. For example, with 250 million cycles (iter/4) I got the following results:

3.14159265158976691 (250 million with real)
3.14159264457621568 (250 million with double)
3.14159265358979324 (std.math.constants.PI)

My question is: why do we see this calculation error with double? Could the changes I made to the algorithm have caused it? Here's an executable code snippet:

enum step = 4;
enum loop = 250_000_000;

auto leibniz(T)(int iter)
{
  T n = 2/3.0;
  for(int i = 5; i < iter; i += step)
  {
    T a = (2.0 + i) * i; // https://oeis.org/A005563
    n += 2/a;
  }
  return n * step;
}

import std.stdio : writefln;

void main()
{
  enum iter = loop * step-10;
  
  65358979323.writefln!"Compare.%s";

  iter.leibniz!double.writefln!"%.17f (double)";
  iter.leibniz!real.writefln!"%.17f (real)";

  imported!"std.math".PI.writefln!"%.17f (enum)";
} /* Prints:

Compare.65358979323
3.14159264457621568 (double)
3.14159265158976689 (real)
3.14159265358979324 (enum)
*/

In fact, there are algorithms that compute pi accurately to 12 decimal places with far fewer cycles (e.g. 9999).
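
For comparison, a sketch of one classical fast method (Machin's formula of 1706; not necessarily the algorithm meant above), which reaches full double precision with a handful of series terms:

import std.stdio : writefln;

// Taylor series atan(x) = x - x^3/3 + x^5/5 - ...; converges fast for small x
double atanSeries(double x, int terms) {
    double sum = 0.0, p = x;
    for (int k = 0; k < terms; k++) {
        sum += (k & 1 ? -p : p) / (2 * k + 1);
        p *= x * x;
    }
    return sum;
}

void main() {
    // Machin: pi = 16*atan(1/5) - 4*atan(1/239)
    double pi = 16.0 * atanSeries(1.0 / 5.0, 12) - 4.0 * atanSeries(1.0 / 239.0, 6);
    pi.writefln!"%.17f";  // agrees with std.math.PI to about 16 digits
}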

SDB@79
