benchmark on binary trees (page 2)

On 06/12/15 9:35 PM, Alex wrote: > Ok, lets conclude this post for now. Did some comparison runs with the > original C++ code. Missed this at the beginning. > The results so far are as following: > > Here is the original code in C++. > http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6 > > > With modifications to be able to run it on my mac os machine this > results in the code available here: > http://pastebin.com/G5cYESdX > compilation done with > g++ -c -pipe -O3 -fomit-frame-pointer -march=native -fopenmp main.cpp -o > main.o && g++ main.o -o main.run -fopenmp -lboost_system > > Here you can find the last version of my D code: > http://dpaste.dzfl.pl/8c8ab00699b5 > compilation done with > dmd -release -O -boundscheck=off -inline main.d > > time comparison, just with > time ./main > yields > > for C++ > real 0m8.255s > user 0m7.342s > sys 0m0.905s > > for D > real 0m35.356s > user 0m35.149s > sys 0m0.154s > > so the overall factor is round about 5. > > Thanks for commenting to everyone! If anybody has further ideas - all of > them would be appreciated :) > The original site is not interested in any further languages to be > tested, so my experiment ends here for now... Why is TreeNode not final? Also yours does not use threads in any way. If you cannot add multithreading on D, remove it from c++ before comparing. They are not equal.

On Sunday, 6 December 2015 at 08:35:33 UTC, Alex wrote: > Thanks for commenting to everyone! If anybody has further ideas - all of them would be appreciated :) > The original site is not interested in any further languages to be tested, so my experiment ends here for now... Hello, interesting exercise for me to learn about allocators :-) i managed to parallelize the code reaching similar performance, in terms of speed, as the non parallel version : http://dpaste.dzfl.pl/6c3e6edcff59 BUT it consumes insane memory, don't try with argument more than 17 !!! so either i'm doing something stupid (in parallel code, i guess) or the FreeList allocator is totally not the right tool ... (both ?) some light-shedding would be really appreciated, Thanks

On Sunday, 6 December 2015 at 08:45:10 UTC, Rikki Cattermole wrote: > Why is TreeNode not final? This is an interesting hint! Just after adding final the program takes two seconds less... This is roughly 5%. Do you have another hints of this kind? ;) > Also yours does not use threads in any way. > > If you cannot add multithreading on D, remove it from c++ before comparing. They are not equal. Yes... I'm aware of this discrepancy, and I tried how the time statistics differ with and without the #pragma statement. With the statement it is half a second slower then without. This seems strange to me, and I don't have a clue why it is so. But as this doesn't change very much for the D/C++ comparison and I think the correct goal is to have a parallel version in D I let the #pragma statement inside.

On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote: > Hello, interesting exercise for me to learn about allocators :-) Nice to know, a novice can inspire someone :) > i managed to parallelize the code reaching similar performance, in terms of speed, as the non parallel version : > http://dpaste.dzfl.pl/6c3e6edcff59 Cool! This is what I looked for! > BUT it consumes insane memory, don't try with argument more than 17 !!! > > so either i'm doing something stupid (in parallel code, i guess) or the FreeList allocator is totally not the right tool ... (both ?) I assume, the allocator itself is something, that is not really needed in this case. Maybe, there is a more straight forward access to the problem. Even a simpler then in all the versions on the benchgame site, but I don't see it right now. And with the allocator attempt I had a chance to experiment with the experimental module and to write a very quick copy of a program, which I want to have... But your solution is really impressive, it reduces the factor from 5 to 3 immediately :) And I'm going to read your code carefully... > some light-shedding would be really appreciated, Thanks

December 07, 2015

Re: benchmark on binary trees

Posted by visitor
in reply to Alex

Permalink

visitor

Posted in reply to Alex

Permalink

On Monday, 7 December 2015 at 10:55:25 UTC, Alex wrote:
> On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote:
>> Hello, interesting exercise for me to learn about allocators :-)
> Nice to know, a novice can inspire someone :)
>
>> i managed to parallelize the code reaching similar performance, in terms of speed, as the non parallel version :
>> http://dpaste.dzfl.pl/6c3e6edcff59
> Cool! This is what I looked for!
>
>> BUT it consumes insane memory, don't try with argument more than 17 !!!
>>
> I assume, the allocator itself is something, that is not really needed in this case. Maybe, there is a more straight forward access to the problem. Even a simpler then in all the versions on the benchgame site, but I don't see it right now.
> And with the allocator attempt I had a chance to experiment with the experimental module and to write a very quick copy of a program, which I want to have...

i've got more speed improvement with "taskPool.parallel(depthind, 2)" in the foreach parallel loop : second argument are workUnits (2 for me, on a quad core gave best results)
Also using directly "FreeList!(Mallocator, Tree_node.sizeof)" without wrapping it in an allocatorObject gives speed improvement (with changes to makeTree method)

i took inspiration from the C code, they use a memory pool management, like anonymous already pointed in c++ version, which i think could (must?) be achieved with allocators, to gain speed i think it's a key point, no GC !! FreeList allocator appears (to me) as a good candidate for this.

but as i'm new to this, i'm sure to not doing it the right way !

i tried the exact same FreeList allocator but backed with the GCAllocator (not the Mallocator used in my code), then memory consumption is very good but of course it"s slow !

i tried a lot of other allocators, variations on the presented code, but memory management is awful :(

using Apache Portable Runtime(APR) like in the C version : http://dpaste.dzfl.pl/6ca8b5ffd6dc works like a charm, 2.061s on my machine ! if file name is binarytrees.d dmd -w -inline -O -release -I/usr/include/apr-1.0 -L/usr/lib/x86_64-linux-gnu/libapr-1.so -of"binarytrees" "binarytrees.d" measured with "time ./binarytrees 20" on command line: real 0m2.061s user 0m8.156s sys 0m0.181s C version gives : real 0m1.892s user 0m9.087s sys 0m0.188s Not bad !!! :-D Still i would like to know what i'm doing wrong with the allocator's version, besides the fact that apr tools are doing more sophisticated work than my naive approach :-D ?

version with apr (like in c version) http://dpaste.dzfl.pl/68c0157225e7 compiled with ldc it's indeed a bit faster on average : real 0m1.999s user 0m9.810s sys 0m0.148 btw Rust version is even faster than my little bit outdated gcc (4.9) latest try with allocators : http://dpaste.dzfl.pl/86b9b3c4ad71 swallows huge memory, slow ! what am i doing wrong ? (concurrency problem ?)

ok, i have a working version (memory is nice, twice the speed as non parallel) ; http://dpaste.dzfl.pl/504a652c6c47 real 0m14.427s user 1m19.347s sys 0m0.124s i've got similar performances, without Allocators, using directly malloc and free i had to recursively deallocate ... though, still far from competing with the fastest, any advice ? i guess FreeList allocator is not the better tool here ?

Forums