benchmark on binary trees
December 04, 2015
Hi everybody,
this is going to be a learning-by-doing benchmark test post.

Found this "game" on the web
http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=binarytrees
and wanted to experiment myself, so I tried to reimplement some of the code in D.

This:
http://dpaste.dzfl.pl/4d8e0ceb867c
I wrote by reimplementing the C# version,
and this:
http://dpaste.dzfl.pl/2fd3841d85ef
by reimplementing the C++ version.

Questions about improving both of them are welcome as well; they are important, but only because I would... maybe... at some point... upload my code, for some DLang advertising effect ;)

The actual learning-effect questions are the following:

1. I wrote the C++-inspired version after the C#-inspired one, hoping it would be faster. This was not the case. Why?
This is a big question, as the reason could be anything from silly mistakes of the kind
"do not use this call, use that one, and your program will be twice as fast."
to mistakes of a more general kind like
"why were you so foolish as not to write the whole code in just one line?"

2. I failed trying to implement some kind of parallelism in the second version. I tried something like

auto depthind = iota(min_depth, max_depth+1, 2);
foreach(dummy_i, d; taskPool.parallel(depthind))

for the for-loop in the main function, but then the program never ends. Do you have a hint as to what went wrong?

3. The compilation was done by:
dmd -O -release -boundscheck=off [filename.d]
Is there anything else to improve performance significantly?

4. This is a somewhat vague question. I come originally from the C# corner, so I naturally think in terms of the first approach. Can one make the general statement that in D one of the approaches above will always be faster than the other?
December 04, 2015
On Friday, 4 December 2015 at 14:06:26 UTC, Alex wrote:
> Hi everybody,
> this is going to be a learning by doing a benchmark test - post.
>
clip
>
> 3. The compilation was done by:
> dmd -O -release -boundscheck=off [filename.d]
> Is there anything else to improve performance significantly?
>

If you have access to LDC or GDC they typically produce significantly faster code.


December 04, 2015
On 04.12.2015 15:06, Alex wrote:
> 1. I wrote the C++ inspired version after the C# inspired, hoping it
> would be faster. This was not the case. Why?
[...]

Why did you expect the C++ inspired version to be faster? Just because the original was written in C++?

From a quick skim the two versions seem to be pretty much identical, aside from names and struct vs class.

Names don't make a difference of course. It would be easier to compare the two versions if you used the same names, though.

The differences between structs on the heap and classes are subtle. It's not surprising that they don't have an impact here.

If there are substantial differences between the two versions, please point them out.

> 2. I failed trying to implement some kind of parallelism in the second
> version. I tried something like
>
> auto depthind = iota(min_depth, max_depth+1, 2);
> foreach(dummy_i, d; taskPool.parallel(depthind))
>
> for the for-loop in the main function, but then, the program never ends.
> Do you have a hint what was wrong?

Works for me. Maybe show the exact full program you tried, and tell what compiler version you're using.
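
For reference, a minimal complete program of that shape compiles and terminates here (the loop body is just a placeholder, not your actual code):

import std.parallelism, std.range, std.stdio;

void main()
{
    int min_depth = 4;
    int max_depth = 16;
    auto depthind = iota(min_depth, max_depth + 1, 2);
    foreach (dummy_i, d; taskPool.parallel(depthind))
    {
        // per-depth work goes here
        writefln("handled depth %s", d);
    }
}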

> 3. The compilation was done by:
> dmd -O -release -boundscheck=off [filename.d]
> Is there anything else to improve performance significantly?

The other compilers, ldc and gdc, usually produce faster code than dmd.

> 4. This is a somewhat vague question. I come originally from the C#
> corner, so I naturally think in terms of the first approach. Can one
> make the general statement that in D one of the approaches above will
> always be faster than the other?

Just reiterating what I said re the first question: I don't really see a difference. If you think there is, please point it out. Or if you're not sure, feel free to ask about specific parts.
December 04, 2015
On Friday, 4 December 2015 at 19:31:22 UTC, anonymous wrote:
> Why did you expect the C++ inspired version to be faster? Just because the original was written in C++?
>
> From a quick skim the two versions seem to be pretty much identical, aside from names and struct vs class.
>
> Names don't make a difference of course. It would be easier to compare the two versions if you used the same names, though.
>
> The differences between structs on the heap and classes are subtle. It's not surprising that they don't have an impact here.
>
> If there are substantial differences between the two versions, please point them out.
>

Yes, I missed this, sorry. The main part of the question was probably about the class vs. struct difference. I thought working with structs and pointers would be faster than with classes.

>> 2. auto depthind = iota(min_depth, max_depth+1, 2);
>> foreach(dummy_i, d; taskPool.parallel(depthind))
>
> Works for me. Maybe show the exact full program you tried, and tell what compiler version you're using.
>

Ok, this was strange, but I found the crux. The correct question is:
why is the parallel version slower than the sequential one?
If you set
int n = 14 in the main function,
the parallel version is MUCH slower than the sequential one. On my machine, 7x slower. Shouldn't it be the other way round?

>> 3. The compilation was done by:
>> dmd -O -release -boundscheck=off [filename.d]
>> Is there anything else to improve performance significantly?
>
> The other compilers, ldc and gdc, usually produce faster code than dmd.

Thanks for the hint!
As ldc doesn't have the experimental part of the library, I compared on the first version. The result: the program compiled with ldc2, same params, was 5% slower... nothing crucial, I think...

>
> Just reiterating what I said re the first question: I don't really see a difference. If you think there is, please point it out. Or if you're not sure, feel free to ask about specific parts.

Yeah... so the answer here, for me, is that I can stay with my C#-style way of thinking. :)
December 05, 2015
On 04.12.2015 21:30, Alex wrote:
> Yes, I missed this, sorry. The main part of the question was probably
> about the class vs. struct difference. I thought working with structs
> and pointers would be faster than with classes.

When you use a struct directly, without going through a pointer, that can be significantly faster than using a class. But structs through pointers are pretty much the same as classes, performance-wise.
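
A rough illustration (the types here are made up for the example, not taken from your code):

struct NodeS { int item; NodeS* left, right; }
class  NodeC { int item; NodeC left, right; }

void main()
{
    NodeS  s = NodeS(1);      // direct struct value: no allocation, no indirection
    NodeS* p = new NodeS(1);  // struct on the heap: one GC allocation, one indirection
    NodeC  c = new NodeC;     // class instance: also one GC allocation, one indirection
}

The first form is where structs really win; the last two are essentially equivalent.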

[...]
> Why is the parallel version slower than the sequential one?
> If you set
> int n = 14 in the main function,
> the parallel version is MUCH slower than the sequential one. On my
> machine, 7x slower. Shouldn't it be the other way round?

I don't know what's going on here. You're allocating a lot of `TreeNode`s, though. That's probably not very parallelizable. The GC could also play a role.

[...]
> As ldc doesn't have the experimental part of the library, I compared on
> the first version. The result: the program compiled with ldc2, same
> params, was 5% slower... nothing crucial, I think...

"The experimental part" is std.experimental.allocator, right? I may be wrong here, but I think the default allocator is essentially just `new`. So that wouldn't give you any performance boost.

[...]
> Yeah... so the answer here, for me, is that I can stay with my C#-style
> way of thinking. :)

In this case, I think you're fine. Generally, be aware that D doesn't shine when you create lots of throw-away objects. The GC can't compete with those of C# or Java, so when you translate code from those languages too closely, performance may be worse.
December 05, 2015
On 04.12.2015 15:06, Alex wrote:
> 3. The compilation was done by:
> dmd -O -release -boundscheck=off [filename.d]
> Is there anything else to improve performance significantly?

You forgot -inline.

By the way, I'm not a fan of using -boundscheck=off like a general optimization flag. It undermines @safe, and as the documentation says, it should be used with caution.

http://dlang.org/dmd-windows.html#switch-boundscheck
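
To illustrate what I mean (a made-up snippet): with bounds checking on, the out-of-range access below throws a RangeError; with -boundscheck=off it silently reads past the array, even though the code is marked @safe.

void main() @safe
{
    int[3] a = [1, 2, 3];
    size_t i = 3;
    auto x = a[i]; // RangeError normally; out-of-bounds read with -boundscheck=off
}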
December 05, 2015
On Friday, 4 December 2015 at 23:23:37 UTC, anonymous wrote:
>> Why is the parallel version slower than the sequential one?
>> If you set
>> int n = 14 in the main function,
>> the parallel version is MUCH slower than the sequential one. On my
>> machine, 7x slower. Shouldn't it be the other way round?
>
> I don't know what's going on here. You're allocating a lot of `TreeNode`s, though. That's probably not very parallelizable. The GC could also play a role.
>

I found and tried out the -vgc option...
Is there a way to deactivate the GC, if it stands in the way?

> In this case, I think you're fine. Generally, be aware that D doesn't shine when you create lots of throw-away objects. The GC can't compete with those of C# or Java, so when you translate code from those languages too closely, performance may be worse.

Yes, I thought along the same lines. That's why I tried to reimplement the C++ version. The idea was: as I can't compete with the GC of C#, I could try to compete by applying another approach. I'm not trying to write something which competes with C++ either (I would have to use C++ for that, then? ;) ), but something which clearly outperforms the languages with a virtual machine...

I also tried the -inline option... no time gain...
December 05, 2015
On 05.12.2015 01:40, Alex wrote:
> found and tried out the -vgc option...
> Is there a way to deactivate the GC, if it stands in way?

You can call core.memory.GC.disable to disable automatic collections, and GC.enable to turn them on again.

http://dlang.org/phobos/core_memory.html#.GC
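
Roughly like this (just a sketch of the idea):

import core.memory;

void main()
{
    GC.disable();            // no automatic collections from here on
    scope(exit) GC.enable();

    // ... allocation-heavy work ...

    GC.collect();            // you can still trigger a collection manually
}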

> Yes, I thought along the same lines. That's why I tried to reimplement
> the C++ version. The idea was: as I can't compete with the GC of C#, I
> could try to compete by applying another approach. I'm not trying to
> write something which competes with C++ either (I would have to use C++
> for that, then? ;) ), but something which clearly outperforms the
> languages with a virtual machine...

Your C++-inspired version still allocates via the GC, though. If that eats performance, then you'd have to mirror more closely what the C++ version actually does. It most probably doesn't use a GC.

I presume this is the C++ version you took as inspiration:
http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6

That uses a boost::object_pool for the nodes. Assuming that that's being used for a reason, you could probably get a performance boost by doing something similar. I'm not really familiar with std.experimental.allocator, maybe there's something like object_pool in there. Otherwise, you'd have to implement it yourself.
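
Going by the docs, a pool/region-style allocation might look roughly like this (untested sketch; the TreeNode here is a simplified stand-in for yours):

import std.experimental.allocator : make;
import std.experimental.allocator.building_blocks.region : Region;
import std.experimental.allocator.mallocator : Mallocator;

struct TreeNode { TreeNode* left, right; int item; }

void main()
{
    // grab one big chunk up front; nodes are carved out of it, bypassing the GC
    auto pool = Region!Mallocator(64 * 1024 * 1024);

    auto node = pool.make!TreeNode();
    node.item = 42;

    // the whole chunk is handed back to Mallocator when the Region is destroyed
}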

Generally, I think most of the time you can write a program in D that's as fast as one written in C++. But you may have to give up on some convenience, and the libraries may not be there to support you.
December 05, 2015
On Friday, 4 December 2015 at 14:06:26 UTC, Alex wrote:
[...]
> hoping it would be faster. This was not the case. Why?
[...]
> Is there anything else to improve performance significantly?

Profile. Profile profile profile. Callgrind. Find bottlenecks instead of guessing them.
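
For example (exact invocations may differ on your setup):

dmd -O -release -profile main.d && ./main    (writes trace.log next to the binary)
valgrind --tool=callgrind ./main             (inspect callgrind.out.<pid> with kcachegrind)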


December 06, 2015
Ok, let's conclude this post for now. I did some comparison runs with the original C++ code, which I had missed at the beginning.
The results so far are as follows:

Here is the original code in C++.
http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6

With modifications to get it to run on my Mac OS machine, this results in the code available here:
http://pastebin.com/G5cYESdX
compilation done with
g++ -c -pipe -O3 -fomit-frame-pointer -march=native -fopenmp main.cpp -o main.o && g++ main.o -o main.run -fopenmp -lboost_system

Here you can find the last version of my D code:
http://dpaste.dzfl.pl/8c8ab00699b5
compilation done with
dmd -release -O -boundscheck=off -inline main.d

time comparison, just with
time ./main
yields

for C++
real 0m8.255s
user 0m7.342s
sys 0m0.905s

for D
real 0m35.356s
user 0m35.149s
sys 0m0.154s

so the overall factor works out to roughly 4.3 in wall-clock time (closer to 5 in user time).

Thanks to everyone for commenting! If anybody has further ideas, all of them would be appreciated :)
The original site is not interested in having any further languages tested, so my experiment ends here for now...