August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Attachments:
| this one is even faster than c++: http://ideone.com/TRDsOo On Sun, Aug 13, 2017 at 10:00 AM, Daniel Kozak <kozzi11@gmail.com> wrote: > my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same > > On Sun, Aug 13, 2017 at 9:51 AM, amfvcg via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote: > >> On Sunday, 13 August 2017 at 07:30:32 UTC, Daniel Kozak wrote: >> >>> Here is more D idiomatic way: >>> >>> import std.stdio : writeln; >>> import std.algorithm.comparison: min; >>> import std.algorithm.iteration: sum; >>> import core.time: MonoTime, Duration; >>> >>> >>> auto sum_subranges(T)(T input, uint range) >>> { >>> import std.array : array; >>> import std.range : chunks, ElementType; >>> import std.algorithm : map; >>> >>> if (range == 0) >>> { >>> return ElementType!(T)[].init; >>> } >>> return input.chunks(range).map!(sum).array; >>> } >>> >>> unittest >>> { >>> assert(sum_subranges([1,1,1], 2) == [2, 1]); >>> assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]); >>> assert(sum_subranges([], 2) == []); >>> assert(sum_subranges([1], 2) == [1]); >>> assert(sum_subranges([1], 0) == []); >>> } >>> >>> >>> int main() >>> { >>> import std.range : iota, array; >>> auto v = iota(0,1000000); >>> int sum; >>> MonoTime beg = MonoTime.currTime; >>> for (int i=0; i < 100; i++) >>> sum += cast(int)sum_subranges(v,2).length; >>> MonoTime end = MonoTime.currTime; >>> writeln(end-beg); >>> writeln(sum); >>> return sum; >>> } >>> >>> On Sun, Aug 13, 2017 at 9:13 AM, Daniel Kozak <kozzi11@gmail.com> wrote: >>> >>> this works ok for me with ldc compiler, gdc does not work on my arch >>>> machine so I can not do comparsion to your c++ versin (clang does not work >>>> with your c++ code) >>>> >>>> import std.stdio : writeln; >>>> import std.algorithm.comparison: min; >>>> import std.algorithm.iteration: sum; >>>> import core.time: MonoTime, Duration; >>>> >>>> >>>> T[] sum_subranges(T)(T[] input, uint range) >>>> { >>>> import std.array : appender; >>>> auto app = appender!(T[])(); >>>> if (range == 0) >>>> { >>>> return app.data; >>>> } >>>> for (uint i; i < input.length; i=min(i+range, input.length)) >>>> { >>>> app.put(sum(input[i..min(i+range, input.length)])); >>>> } >>>> return app.data; >>>> } >>>> >>>> unittest >>>> { >>>> assert(sum_subranges([1,1,1], 2) == [2, 1]); >>>> assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]); >>>> assert(sum_subranges([], 2) == []); >>>> assert(sum_subranges([1], 2) == [1]); >>>> assert(sum_subranges([1], 0) == []); >>>> } >>>> >>>> >>>> int main() >>>> { >>>> import std.range : iota, array; >>>> auto v = iota(0,1000000).array; >>>> int sum; >>>> MonoTime beg = MonoTime.currTime; >>>> for (int i=0; i < 100; i++) >>>> sum += cast(int)sum_subranges(v,2).length; >>>> MonoTime end = MonoTime.currTime; >>>> writeln(end-beg); >>>> writeln(sum); >>>> return sum; >>>> } >>>> >>>> On Sun, Aug 13, 2017 at 9:03 AM, Neia Neutuladh via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote: >>>> >>>> [...] >>>>> >>>> >> Thank you all for the replies. Good to know the community is alive in d :) >> >> Let's settle the playground: >> D : http://ideone.com/h4fnsD >> C++: http://ideone.com/X1pyXG >> >> Both using GCC under the hood. >> C++ in 112 ms; >> D in : >> - 2.5 sec with original source; >> - 2.5 sec with Daniel's 1st version; >> - 5 sec timeout exceeded with Daniel's 2nd version; >> - 1.8 sec with Neia-like preallocation; >> >> So still it's not that neaty. >> >> (What's interesting C++ code generates 2KLOC of assembly, and Dlang @ ldc 12KLOC - checked at godbolt). >> >> P.S. For C++ version to work under clang, the function which takes >> (BinaryOp) must go before the other one (my bad). >> > > |
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Posted in reply to amfvcg | On Sunday, 13 August 2017 at 08:13:56 UTC, amfvcg wrote:
> On Sunday, 13 August 2017 at 08:00:53 UTC, Daniel Kozak wrote:
>> my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same
>>
>
> Ok, on ideone (ldc 1.1.0) it timeouts, on dpaste (ldc 0.12.0) it gets killed.
> What version are you using?
>
> Either way, if that'd be the case - that's slick. (and ldc would be the compiler of choice for real use cases).
Here are my results:
$ uname -sri
Linux 4.10.0-28-generic x86_64
$ lscpu | grep 'Model name'
Model name: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
$ ldc2 --version | head -n5
LDC - the LLVM D compiler (1.3.0):
based on DMD v2.073.2 and LLVM 4.0.0
built with LDC - the LLVM D compiler (1.3.0)
Default target: x86_64-unknown-linux-gnu
Host CPU: ivybridge
$ g++ --version | head -n1
g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406
$ ldc2 -O3 --release sum_subranges.d
$ ./sum_subranges
378 ms, 556 μs, and 9 hnsecs
50000000
$ g++ -O5 sum_subranges.cpp -o sum_subranges
$ ./sum_subranges
237135
50000000
|
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Posted in reply to ikod | > Gives me
>
> 5 μs and 2 hnsecs
> 50000000
> 3 secs, 228 ms, 837 μs, and 4 hnsecs
> 50000000
And you've compiled it with?
Btw. clang for c++ version works worse than gcc (for this case [112ms vs 180ms]).
|
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Petar Kirov [ZombineDev] | On Sunday, 13 August 2017 at 08:29:30 UTC, Petar Kirov [ZombineDev] wrote: > On Sunday, 13 August 2017 at 08:13:56 UTC, amfvcg wrote: >> On Sunday, 13 August 2017 at 08:00:53 UTC, Daniel Kozak wrote: >>> my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same >>> >> >> Ok, on ideone (ldc 1.1.0) it timeouts, on dpaste (ldc 0.12.0) it gets killed. >> What version are you using? >> >> Either way, if that'd be the case - that's slick. (and ldc would be the compiler of choice for real use cases). > > Here are my results: > > $ uname -sri > Linux 4.10.0-28-generic x86_64 > > $ lscpu | grep 'Model name' > Model name: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz > > $ ldc2 --version | head -n5 > LDC - the LLVM D compiler (1.3.0): > based on DMD v2.073.2 and LLVM 4.0.0 > built with LDC - the LLVM D compiler (1.3.0) > Default target: x86_64-unknown-linux-gnu > Host CPU: ivybridge > > $ g++ --version | head -n1 > g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406 > > $ ldc2 -O3 --release sum_subranges.d > $ ./sum_subranges > 378 ms, 556 μs, and 9 hnsecs > 50000000 > > $ g++ -O5 sum_subranges.cpp -o sum_subranges > $ ./sum_subranges > 237135 > 50000000 With Daniel's latest version ( http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d-learn@puremagic.com ) $ ldc2 -O3 --release sum_subranges2.d $ ./sum_subranges2 210 ms, 838 μs, and 8 hnsecs 50000000 |
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Attachments:
| And this one is awesome :P http://ideone.com/muehUw On Sun, Aug 13, 2017 at 10:27 AM, Daniel Kozak <kozzi11@gmail.com> wrote: > this one is even faster than c++: > http://ideone.com/TRDsOo > > On Sun, Aug 13, 2017 at 10:00 AM, Daniel Kozak <kozzi11@gmail.com> wrote: > >> my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same >> >> On Sun, Aug 13, 2017 at 9:51 AM, amfvcg via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote: >> >>> On Sunday, 13 August 2017 at 07:30:32 UTC, Daniel Kozak wrote: >>> >>>> Here is more D idiomatic way: >>>> >>>> import std.stdio : writeln; >>>> import std.algorithm.comparison: min; >>>> import std.algorithm.iteration: sum; >>>> import core.time: MonoTime, Duration; >>>> >>>> >>>> auto sum_subranges(T)(T input, uint range) >>>> { >>>> import std.array : array; >>>> import std.range : chunks, ElementType; >>>> import std.algorithm : map; >>>> >>>> if (range == 0) >>>> { >>>> return ElementType!(T)[].init; >>>> } >>>> return input.chunks(range).map!(sum).array; >>>> } >>>> >>>> unittest >>>> { >>>> assert(sum_subranges([1,1,1], 2) == [2, 1]); >>>> assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]); >>>> assert(sum_subranges([], 2) == []); >>>> assert(sum_subranges([1], 2) == [1]); >>>> assert(sum_subranges([1], 0) == []); >>>> } >>>> >>>> >>>> int main() >>>> { >>>> import std.range : iota, array; >>>> auto v = iota(0,1000000); >>>> int sum; >>>> MonoTime beg = MonoTime.currTime; >>>> for (int i=0; i < 100; i++) >>>> sum += cast(int)sum_subranges(v,2).length; >>>> MonoTime end = MonoTime.currTime; >>>> writeln(end-beg); >>>> writeln(sum); >>>> return sum; >>>> } >>>> >>>> On Sun, Aug 13, 2017 at 9:13 AM, Daniel Kozak <kozzi11@gmail.com> wrote: >>>> >>>> this works ok for me with ldc compiler, gdc does not work on my arch >>>>> machine so I can not do comparsion to your c++ versin (clang does not work >>>>> with your c++ code) >>>>> >>>>> import std.stdio : writeln; >>>>> import std.algorithm.comparison: min; >>>>> import std.algorithm.iteration: sum; >>>>> import core.time: MonoTime, Duration; >>>>> >>>>> >>>>> T[] sum_subranges(T)(T[] input, uint range) >>>>> { >>>>> import std.array : appender; >>>>> auto app = appender!(T[])(); >>>>> if (range == 0) >>>>> { >>>>> return app.data; >>>>> } >>>>> for (uint i; i < input.length; i=min(i+range, input.length)) >>>>> { >>>>> app.put(sum(input[i..min(i+range, input.length)])); >>>>> } >>>>> return app.data; >>>>> } >>>>> >>>>> unittest >>>>> { >>>>> assert(sum_subranges([1,1,1], 2) == [2, 1]); >>>>> assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]); >>>>> assert(sum_subranges([], 2) == []); >>>>> assert(sum_subranges([1], 2) == [1]); >>>>> assert(sum_subranges([1], 0) == []); >>>>> } >>>>> >>>>> >>>>> int main() >>>>> { >>>>> import std.range : iota, array; >>>>> auto v = iota(0,1000000).array; >>>>> int sum; >>>>> MonoTime beg = MonoTime.currTime; >>>>> for (int i=0; i < 100; i++) >>>>> sum += cast(int)sum_subranges(v,2).length; >>>>> MonoTime end = MonoTime.currTime; >>>>> writeln(end-beg); >>>>> writeln(sum); >>>>> return sum; >>>>> } >>>>> >>>>> On Sun, Aug 13, 2017 at 9:03 AM, Neia Neutuladh via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote: >>>>> >>>>> [...] >>>>>> >>>>> >>> Thank you all for the replies. Good to know the community is alive in d :) >>> >>> Let's settle the playground: >>> D : http://ideone.com/h4fnsD >>> C++: http://ideone.com/X1pyXG >>> >>> Both using GCC under the hood. >>> C++ in 112 ms; >>> D in : >>> - 2.5 sec with original source; >>> - 2.5 sec with Daniel's 1st version; >>> - 5 sec timeout exceeded with Daniel's 2nd version; >>> - 1.8 sec with Neia-like preallocation; >>> >>> So still it's not that neaty. >>> >>> (What's interesting C++ code generates 2KLOC of assembly, and Dlang @ ldc 12KLOC - checked at godbolt). >>> >>> P.S. For C++ version to work under clang, the function which takes >>> (BinaryOp) must go before the other one (my bad). >>> >> >> > |
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Posted in reply to amfvcg | On Sunday, 13 August 2017 at 08:32:50 UTC, amfvcg wrote:
>> Gives me
>>
>> 5 μs and 2 hnsecs
>> 50000000
>> 3 secs, 228 ms, 837 μs, and 4 hnsecs
>> 50000000
>
> And you've compiled it with?
>
> Btw. clang for c++ version works worse than gcc (for this case [112ms vs 180ms]).
DMD64 D Compiler v2.074.1
|
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Petar Kirov [ZombineDev] | On Sunday, 13 August 2017 at 08:33:53 UTC, Petar Kirov [ZombineDev] wrote:
>
> With Daniel's latest version (
> http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d-learn@puremagic.com )
>
> $ ldc2 -O3 --release sum_subranges2.d
> $ ./sum_subranges2
> 210 ms, 838 μs, and 8 hnsecs
> 50000000
Great!!! And that's what I was hoping for.
So the conclusion is:
use the latest ldc that's out there.
Thank you Petar, thank you Daniel. (I cannot change the subject to SOLVED, can I?)
Btw. the idiomatic version of this d sample looks how I imagined it should!
|
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Posted in reply to amfvcg | On Sunday, 13 August 2017 at 08:43:29 UTC, amfvcg wrote: > On Sunday, 13 August 2017 at 08:33:53 UTC, Petar Kirov [ZombineDev] wrote: >> >> With Daniel's latest version ( >> http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d-learn@puremagic.com ) >> >> $ ldc2 -O3 --release sum_subranges2.d >> $ ./sum_subranges2 >> 210 ms, 838 μs, and 8 hnsecs >> 50000000 > > Great!!! And that's what I was hoping for. > > So the conclusion is: > > use the latest ldc that's out there. > > Thank you Petar, thank you Daniel. (I cannot change the subject to SOLVED, can I?) > > Btw. the idiomatic version of this d sample looks how I imagined it should! There's one especially interesting result: This instantiation: sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint) of the following function: auto sum_subranges(T)(T input, uint range) { import std.range : chunks, ElementType, array; import std.algorithm : map; return input.chunks(range).map!(sum); } gets optimized with LDC to: push rax test edi, edi je .LBB2_2 mov edx, edi mov rax, rsi pop rcx ret .LBB2_2: lea rsi, [rip + .L.str.3] lea rcx, [rip + .L.str] mov edi, 45 mov edx, 89 mov r8d, 6779 call _d_assert_msg@PLT I.e. the compiler turned a O(n) algorithm to O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization: sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint): push rbp mov rbp,rsp sub rsp,0x30 mov DWORD PTR [rbp-0x8],edi mov r9d,DWORD PTR [rbp-0x8] test r9,r9 jne 41 mov r8d,0x1b67 mov ecx,0x0 mov eax,0x61 mov rdx,rax mov QWORD PTR [rbp-0x28],rdx mov edx,0x0 mov edi,0x2d mov rsi,rdx mov rdx,QWORD PTR [rbp-0x28] call 41 41: mov QWORD PTR [rbp-0x20],rsi mov QWORD PTR [rbp-0x18],r9 mov rdx,QWORD PTR [rbp-0x18] mov rax,QWORD PTR [rbp-0x20] mov rsp,rbp algorithms a pop rbp ret Moral of the story: templates + ranges are an awesome combination. |
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Petar Kirov [ZombineDev] | On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov [ZombineDev] wrote:
>
> There's one especially interesting result:
>
> This instantiation:
>
> sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint)
>
> of the following function:
>
> auto sum_subranges(T)(T input, uint range)
> {
> import std.range : chunks, ElementType, array;
> import std.algorithm : map;
> return input.chunks(range).map!(sum);
> }
>
> gets optimized with LDC to:
> push rax
> test edi, edi
> je .LBB2_2
> mov edx, edi
> mov rax, rsi
> pop rcx
> ret
> .LBB2_2:
> lea rsi, [rip + .L.str.3]
> lea rcx, [rip + .L.str]
> mov edi, 45
> mov edx, 89
> mov r8d, 6779
> call _d_assert_msg@PLT
>
> I.e. the compiler turned a O(n) algorithm to O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization:
>
> sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint):
> push rbp
> mov rbp,rsp
> sub rsp,0x30
> mov DWORD PTR [rbp-0x8],edi
> mov r9d,DWORD PTR [rbp-0x8]
> test r9,r9
> jne 41
> mov r8d,0x1b67
> mov ecx,0x0
> mov eax,0x61
> mov rdx,rax
> mov QWORD PTR [rbp-0x28],rdx
> mov edx,0x0
> mov edi,0x2d
> mov rsi,rdx
> mov rdx,QWORD PTR [rbp-0x28]
> call 41
> 41: mov QWORD PTR [rbp-0x20],rsi
> mov QWORD PTR [rbp-0x18],r9
> mov rdx,QWORD PTR [rbp-0x18]
> mov rax,QWORD PTR [rbp-0x20]
> mov rsp,rbp algorithms a
> pop rbp
> ret
>
> Moral of the story: templates + ranges are an awesome combination.
Change the parameter for this array size to be taken from stdin and I assume that these optimizations will go away.
|
August 13, 2017 Re: D outperformed by C++, what am I doing wrong? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Petar Kirov [ZombineDev] | On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov [ZombineDev] wrote: > > This instantiation: > > sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint) > > of the following function: > > auto sum_subranges(T)(T input, uint range) > { > import std.range : chunks, ElementType, array; > import std.algorithm : map; > return input.chunks(range).map!(sum); > } > > gets optimized with LDC to: > [snip] > I.e. the compiler turned a O(n) algorithm to O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization: > [snip] Execution of sum_subranges is already O(1), because the calculation of the sum is delayed: the return type of the function is not `uint`, it is `MapResult!(sum, <range>)` which does a lazy evaluation of the sum. - Johan |
Copyright © 1999-2021 by the D Language Foundation