December 06, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to Alex | On 06/12/15 9:35 PM, Alex wrote:
> Ok, lets conclude this post for now. Did some comparison runs with the
> original C++ code. Missed this at the beginning.
> The results so far are as following:
>
> Here is the original code in C++.
> http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6
>
>
> With modifications to be able to run it on my mac os machine this
> results in the code available here:
> http://pastebin.com/G5cYESdX
> compilation done with
> g++ -c -pipe -O3 -fomit-frame-pointer -march=native -fopenmp main.cpp -o
> main.o && g++ main.o -o main.run -fopenmp -lboost_system
>
> Here you can find the last version of my D code:
> http://dpaste.dzfl.pl/8c8ab00699b5
> compilation done with
> dmd -release -O -boundscheck=off -inline main.d
>
> time comparison, just with
> time ./main
> yields
>
> for C++
> real 0m8.255s
> user 0m7.342s
> sys 0m0.905s
>
> for D
> real 0m35.356s
> user 0m35.149s
> sys 0m0.154s
>
> so the overall factor is round about 5.
>
> Thanks for commenting to everyone! If anybody has further ideas - all of
> them would be appreciated :)
> The original site is not interested in any further languages to be
> tested, so my experiment ends here for now...
Why is TreeNode not final?
Also yours does not use threads in any way.
If you cannot add multithreading on D, remove it from c++ before comparing. They are not equal.
|
December 06, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to Alex | On Sunday, 6 December 2015 at 08:35:33 UTC, Alex wrote: > Thanks for commenting to everyone! If anybody has further ideas - all of them would be appreciated :) > The original site is not interested in any further languages to be tested, so my experiment ends here for now... Hello, interesting exercise for me to learn about allocators :-) i managed to parallelize the code reaching similar performance, in terms of speed, as the non parallel version : http://dpaste.dzfl.pl/6c3e6edcff59 BUT it consumes insane memory, don't try with argument more than 17 !!! so either i'm doing something stupid (in parallel code, i guess) or the FreeList allocator is totally not the right tool ... (both ?) some light-shedding would be really appreciated, Thanks |
December 07, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to Rikki Cattermole | On Sunday, 6 December 2015 at 08:45:10 UTC, Rikki Cattermole wrote: > Why is TreeNode not final? This is an interesting hint! Just after adding final the program takes two seconds less... This is roughly 5%. Do you have another hints of this kind? ;) > Also yours does not use threads in any way. > > If you cannot add multithreading on D, remove it from c++ before comparing. They are not equal. Yes... I'm aware of this discrepancy, and I tried how the time statistics differ with and without the #pragma statement. With the statement it is half a second slower then without. This seems strange to me, and I don't have a clue why it is so. But as this doesn't change very much for the D/C++ comparison and I think the correct goal is to have a parallel version in D I let the #pragma statement inside. |
December 07, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to visitor | On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote: > Hello, interesting exercise for me to learn about allocators :-) Nice to know, a novice can inspire someone :) > i managed to parallelize the code reaching similar performance, in terms of speed, as the non parallel version : > http://dpaste.dzfl.pl/6c3e6edcff59 Cool! This is what I looked for! > BUT it consumes insane memory, don't try with argument more than 17 !!! > > so either i'm doing something stupid (in parallel code, i guess) or the FreeList allocator is totally not the right tool ... (both ?) I assume, the allocator itself is something, that is not really needed in this case. Maybe, there is a more straight forward access to the problem. Even a simpler then in all the versions on the benchgame site, but I don't see it right now. And with the allocator attempt I had a chance to experiment with the experimental module and to write a very quick copy of a program, which I want to have... But your solution is really impressive, it reduces the factor from 5 to 3 immediately :) And I'm going to read your code carefully... > some light-shedding would be really appreciated, Thanks |
December 07, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to Alex | On Monday, 7 December 2015 at 10:55:25 UTC, Alex wrote:
> On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote:
>> Hello, interesting exercise for me to learn about allocators :-)
> Nice to know, a novice can inspire someone :)
>
>> i managed to parallelize the code reaching similar performance, in terms of speed, as the non parallel version :
>> http://dpaste.dzfl.pl/6c3e6edcff59
> Cool! This is what I looked for!
>
>> BUT it consumes insane memory, don't try with argument more than 17 !!!
>>
> I assume, the allocator itself is something, that is not really needed in this case. Maybe, there is a more straight forward access to the problem. Even a simpler then in all the versions on the benchgame site, but I don't see it right now.
> And with the allocator attempt I had a chance to experiment with the experimental module and to write a very quick copy of a program, which I want to have...
i've got more speed improvement with "taskPool.parallel(depthind, 2)" in the foreach parallel loop : second argument are workUnits (2 for me, on a quad core gave best results)
Also using directly "FreeList!(Mallocator, Tree_node.sizeof)" without wrapping it in an allocatorObject gives speed improvement (with changes to makeTree method)
i took inspiration from the C code, they use a memory pool management, like anonymous already pointed in c++ version, which i think could (must?) be achieved with allocators, to gain speed i think it's a key point, no GC !! FreeList allocator appears (to me) as a good candidate for this.
but as i'm new to this, i'm sure to not doing it the right way !
i tried the exact same FreeList allocator but backed with the GCAllocator (not the Mallocator used in my code), then memory consumption is very good but of course it"s slow !
i tried a lot of other allocators, variations on the presented code, but memory management is awful :(
|
December 08, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to visitor | using Apache Portable Runtime(APR) like in the C version : http://dpaste.dzfl.pl/6ca8b5ffd6dc works like a charm, 2.061s on my machine ! if file name is binarytrees.d dmd -w -inline -O -release -I/usr/include/apr-1.0 -L/usr/lib/x86_64-linux-gnu/libapr-1.so -of"binarytrees" "binarytrees.d" measured with "time ./binarytrees 20" on command line: real 0m2.061s user 0m8.156s sys 0m0.181s C version gives : real 0m1.892s user 0m9.087s sys 0m0.188s Not bad !!! :-D Still i would like to know what i'm doing wrong with the allocator's version, besides the fact that apr tools are doing more sophisticated work than my naive approach :-D ? |
December 08, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to visitor | C++ version : real 0m3.587s user 0m9.211s sys 0m7.341s |
December 09, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to visitor | version with apr (like in c version) http://dpaste.dzfl.pl/68c0157225e7 compiled with ldc it's indeed a bit faster on average : real 0m1.999s user 0m9.810s sys 0m0.148 btw Rust version is even faster than my little bit outdated gcc (4.9) latest try with allocators : http://dpaste.dzfl.pl/86b9b3c4ad71 swallows huge memory, slow ! what am i doing wrong ? (concurrency problem ?) |
December 11, 2015 Re: benchmark on binary trees | ||||
---|---|---|---|---|
| ||||
Posted in reply to visitor | ok, i have a working version (memory is nice, twice the speed as non parallel) ; http://dpaste.dzfl.pl/504a652c6c47 real 0m14.427s user 1m19.347s sys 0m0.124s i've got similar performances, without Allocators, using directly malloc and free i had to recursively deallocate ... though, still far from competing with the fastest, any advice ? i guess FreeList allocator is not the better tool here ? |
Copyright © 1999-2021 by the D Language Foundation