March 26

On Sunday, 24 March 2024 at 19:31:19 UTC, Csaba wrote:

>

I know that benchmarks are always controversial and depend on a lot of factors. So far, I have read that D performs very well in benchmarks, as well as, if not better than, C.

I wrote a little program that approximates PI using the Leibniz formula. I implemented the same thing in C, D and Python, all of them execute 1,000,000 iterations 20 times and display the average time elapsed.

Here are the results:

C: 0.04s
Python: 0.33s
D: 0.73s

What the hell? D slower than Python? This cannot be real. I am sure I am making a mistake here. I'm sharing all 3 programs here:

C: https://pastebin.com/s7e2HFyL
D: https://pastebin.com/fuURdupc
Python: https://pastebin.com/zcXAkSEf

As you can see the function that does the job is exactly the same in C and D.

Here are the compile/run commands used:

C: gcc leibniz.c -lm -oleibc
D: gdc leibniz.d -frelease -oleibd
Python: python3 leibniz.py

PS. my CPU is AMD A8-5500B and my OS is Ubuntu Linux, if that matters.

As others suggested, pow is the problem. I've noticed that the C versions are often much faster than their D counterparts. (And I don't view that as a problem, since both come built in; my only thought is that the D version should simply call the C version.)

Changing

import std.math:pow;

to

import core.stdc.math: pow;

and leaving everything else unchanged, I get

C: Avg execution time: 0.007918
D (original): Avg execution time: 0.102612
D (using core.stdc.math): Avg execution time: 0.008134

So more or less the exact same numbers if you use core.stdc.math.
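
For reference, here is a minimal self-contained sketch of the kind of loop involved (the pastebin sources are not reproduced above, so the exact loop body is an assumption, not Csaba's code); the only line that differs between the slow and the fast variant is the import:

// import std.math: pow;       // the slower variant discussed above
import core.stdc.math: pow;    // calls the C runtime's pow directly
import std.stdio: writefln;

double leibniz(int iters) {
    double n = 0.0;
    foreach (i; 0 .. iters)
        n += pow(-1.0, i) / (i * 2.0 + 1.0);
    return n * 4.0;
}

void main() {
    writefln("%.16f", leibniz(1_000_000));
}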

March 26

On Tuesday, 26 March 2024 at 14:25:53 UTC, Lance Bachmeier wrote:

>

On Sunday, 24 March 2024 at 19:31:19 UTC, Csaba wrote:

[...]

As others suggested, pow is the problem. I've noticed that the C versions are often much faster than their D counterparts. (And I don't view that as a problem, since both come built in; my only thought is that the D version should simply call the C version.)

Changing

import std.math:pow;

to

import core.stdc.math: pow;

and leaving everything else unchanged, I get

C: Avg execution time: 0.007918
D (original): Avg execution time: 0.102612
D (using core.stdc.math): Avg execution time: 0.008134

So more or less the exact same numbers if you use core.stdc.math.

And then the other thing is changing

const int BENCHMARKS = 20;

to

enum BENCHMARKS = 20;

which allows the constant to be substituted directly into the rest of the program. That gives

Avg execution time: 0.007564

On my Ubuntu 22.04 machine, therefore, the LDC binary with no flags is slightly faster than the C code compiled with your flags.
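
For readers less familiar with D, a small illustration (not from the benchmark code) of what that change means: enum declares a manifest constant whose value is substituted at compile time, while a const int is a read-only variable:

enum BENCHMARKS = 20;       // manifest constant: no storage, the literal 20
                            // is pasted in wherever BENCHMARKS is used
const int benchmarks = 20;  // read-only variable: has an address and may be
                            // loaded from memory at run time

void main() {
    int[BENCHMARKS] a;       // an enum value is always usable at compile time
    auto p = &benchmarks;    // a const variable has an address...
    // auto q = &BENCHMARKS; // ...a manifest constant does not (compile error)
}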

March 27

I apologize for digressing a little bit further; this is just to share insights with other learners.

I wondered why my binary was so big (> 4M) and discovered the gdc -Wall -O2 -frelease -shared-libphobos options (now > 200K).
Then I tried to avoid the GC and learnt this: the GC in the Leibniz code is pulled in only for the writeln. With a change to (again, standard C) printf, the @nogc modifier can be applied, and the binary then gets down to ~17K, a size comparable to the C counterpart.

Another observation regarding precision: the iteration proceeds in the wrong order. Adding the small contributions first and the bigger ones last causes less loss, because the small parts are not rounded away below the LSB limit of an already large real/double sum.
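
A tiny sketch (not part of the benchmark code) of the effect: the same partial sum computed forward (big terms first) and backward (small terms first) can differ in the last digits, because a tiny term added to an already large sum is partly rounded away:

import std.stdio: writefln;

enum N = 1_000_000;

double forward() {               // big contributions first
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += ((i % 2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);
    return s * 4.0;
}

double backward() {              // small contributions first
    double s = 0.0;
    for (int i = N - 1; i >= 0; i--)
        s += ((i % 2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);
    return s * 4.0;
}

void main() {
    writefln("forward : %.17f", forward());
    writefln("backward: %.17f", backward());  // any difference is only in the last digits
}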

So I'm now at this code (dropping the average over 20 iterations as unnecessary):

// import std.stdio;  // writeln would pull in the garbage collector
import core.stdc.stdio: printf;
import std.datetime.stopwatch;

const int ITERATIONS = 1_000_000_000;

@nogc pure double leibniz(int it) {  // sum up the small values first
    double n = 0.5 * ((it % 2) ? -1.0 : 1.0) / (it * 2.0 + 1.0);
    for (int i = it - 1; i >= 0; i--)
        n += ((i % 2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);
    return n * 4.0;
}

@nogc void main() {
    double result;
    double total_time = 0;
    auto sw = StopWatch(AutoStart.yes);
    result = leibniz(ITERATIONS);
    sw.stop();
    total_time = sw.peek.total!"nsecs";
    printf("%.16f\n", result);
    printf("Execution time: %f\n", total_time / 1e9);
}

result:

3.1415926535897931
Execution time: 1.068111

March 28

On Wednesday, 27 March 2024 at 08:22:42 UTC, rkompass wrote:

>

I apologize for digressing a little bit further; this is just to share insights with other learners.

Good thing you're digressing; I am 45 years old and I still cannot say that I am finished as a student! For me this is version 4 and it looks like we don't need a 3rd variable other than the function parameter and return value:

auto leibniz_v4(int i) @nogc pure {
  double n = 0.5*((i%2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);

  while(--i >= 0)
    n += ((i%2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);

  return n * 4.0;
} /*
3.1415926535892931
3.141592653589 793238462643383279502884197169399375105
3.141593653590774200000 (v1)
Avg execution time: 0.000033
*/

SDB@79

March 28

On Thursday, 28 March 2024 at 01:09:34 UTC, Salih Dincer wrote:

>

Good thing you're digressing; I am 45 years old and I still cannot say that I am finished as a student! For me this is version 4 and it looks like we don't need a 3rd variable other than the function parameter and return value:

So we continue with another digression. I discovered std.parallelism's parallel, and also avoided the extra variable, as Salih suggested:

import std.range;
import std.parallelism;
import core.stdc.stdio: printf;
import std.datetime.stopwatch;

enum ITERS = 1_000_000_000;
enum STEPS = 31; // 5 is fine; even numbers (e.g. 10) may give bad precision (for some math reason?)

pure double leibniz(int i) {  // sums every STEPS-th term, small values first
    double r = (i == ITERS) ? 0.5 * ((i % 2) ? -1.0 : 1.0) / (i * 2.0 + 1.0) : 0.0;
    for (--i; i >= 0; i -= STEPS)
        r += ((i % 2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);
    return r * 4.0;
}

void main() {
    auto start = iota(ITERS, ITERS - STEPS, -1).array;  // the STEPS interleaved starting indices
    auto sw = StopWatch(AutoStart.yes);
    double result = 0.0;
    foreach (s; start.parallel)
        result += leibniz(s);
    double total_time = sw.peek.total!"nsecs";
    printf("%.16f\n", result);
    printf("Execution time: %f\n", total_time / 1e9);
}

gives:

3.1415926535897931
Execution time: 0.211667

My laptop has 6 cores and obviously 5 are used in parallel by this.
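
One detail worth flagging: result += leibniz(s) in the parallel foreach is executed by several worker threads on the same variable, which is formally a data race (with only 31 quick iterations it apparently does not bite here). A race-free sketch would let std.parallelism do the reduction itself, along these lines (same leibniz and constants assumed):

import std.algorithm: map;
import std.parallelism: taskPool;
import std.range: iota;

// inside main(), replacing the foreach accumulation:
double result = taskPool.reduce!"a + b"(
    iota(ITERS, ITERS - STEPS, -1).map!(s => leibniz(s)));

The partitioning into STEPS interleaved sub-sums stays exactly as above; only the accumulation across tasks changes.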

The original question related to a comparison between C, D and Python.
Turning back to this: are there similarly simple libraries for C that allow for parallel computation?

March 28

On Thursday, 28 March 2024 at 11:50:38 UTC, rkompass wrote:

>

Turning back to this: are there similarly simple libraries for C that allow for parallel computation?

You can achieve parallelism in C using libraries such as OpenMP, which provides a set of compiler directives and runtime library routines for parallel programming.

Here’s an example of how you might modify the code to use OpenMP for parallel processing:

#include <stdio.h>
#include <time.h>
#include <omp.h>

#define ITERS 1000000000
#define STEPS 31

double leibniz(int i) {
  double r = (i == ITERS) ? 0.5 * ((i % 2) ? -1.0 : 1.0) / (i * 2.0 + 1.0) : 0.0;
  for (--i; i >= 0; i -= STEPS)
    r += ((i % 2) ? -1.0 : 1.0) / (i * 2.0 + 1.0);
  return r * 4.0;
}

int main() {
  double start_time = omp_get_wtime();

  double result = 0.0;

  #pragma omp parallel for reduction(+:result)
  for (int s = ITERS; s >= 0; s -= STEPS) {
    result += leibniz(s);
  }

  // Calculate the time taken
  double time_taken = omp_get_wtime() - start_time;

  printf("%.16f\n", result);
  printf("%f (seconds)\n", time_taken);

  return 0;
}

To compile this code with OpenMP support, you would use a command like gcc -fopenmp your_program.c. This tells the GCC compiler to enable OpenMP directives. The #pragma omp parallel for directive tells the compiler to parallelize the loop, and the reduction clause is used to safely accumulate the result variable across multiple threads.

SDB@79

March 28

On Thursday, 28 March 2024 at 14:07:43 UTC, Salih Dincer wrote:

>

On Thursday, 28 March 2024 at 11:50:38 UTC, rkompass wrote:

>

Turning back to this: are there similarly simple libraries for C that allow for parallel computation?

You can achieve parallelism in C using libraries such as OpenMP, which provides a set of compiler directives and runtime library routines for parallel programming.

Here’s an example of how you might modify the code to use OpenMP for parallel processing:

 . . .

  #pragma omp parallel for reduction(+:result)
  for (int s = ITERS; s >= 0; s -= STEPS) {
    result += leibniz(s);
  }
 . . .
To compile this code with OpenMP support, you would use a command like gcc -fopenmp your_program.c. This tells the GCC compiler to enable OpenMP directives. The #pragma omp parallel for directive tells the compiler to parallelize the loop, and the reduction clause is used to safely accumulate the result variable across multiple threads.

SDB@79

Nice, thank you.
It ran endlessly until I saw that I had to correct the for loop to
for (int s = ITERS; s > ITERS-STEPS; s--)
Now the result is:

3.1415926535897936
Execution time: 0.212483 (seconds).

This result is sooo similar!

I didn't know that OpenMP programming could be that easy.
The binary size is 16K, the same order of magnitude, although somewhat smaller.
The D advantage is gone here, I would say.

March 28

On Thursday, 28 March 2024 at 20:18:10 UTC, rkompass wrote:

>

The D advantage is gone here, I would say.

It's hard to compare, actually.
std.parallelism has somewhat different mechanics and is, I think, easier to use. The syntax is nicer.

OpenMP is a well-known and widely adopted tool, which is also quite flexible, but it is usually applied to initially sequential code, and its syntax is not very intuitive.

Interesting point from Dr Russel here: https://forum.dlang.org/thread/qvksmhwkaxbrnggsvtxe@forum.dlang.org

However, OpenMP has also seen development and improvement since 2012, and the HPC world is pretty conservative, so it remains one of the most popular tools in the area, along with MPI: https://www.openmp.org/wp-content/uploads/sc23-openmp-popularity-mattson.pdf
But with the AI and GPU revolution the balance will probably shift a bit towards CUDA-like technologies.

March 28

On Thursday, 28 March 2024 at 20:18:10 UTC, rkompass wrote:

>

I didn't know that OpenMP programming could be that easy.
The binary size is 16K, the same order of magnitude, although somewhat smaller.
The D advantage is gone here, I would say.

There is no such thing as parallel programming in D anyway. At least it has the modules, but I haven't seen them work. Whenever I use the toys built on foreach() it always ends in disappointment :)

SDB@79

March 29

On Thursday, 28 March 2024 at 23:15:26 UTC, Salih Dincer wrote:

>

There is no such thing as parallel programming in D anyway. At least it has the modules, but I haven't seen them work. Whenever I use the toys built on foreach() it always ends in disappointment

I think it just works :)
Which issues did you have with it?