August 13, 2017
this one is even faster than c++:
http://ideone.com/TRDsOo

On Sun, Aug 13, 2017 at 10:00 AM, Daniel Kozak <kozzi11@gmail.com> wrote:

> my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same
>
> On Sun, Aug 13, 2017 at 9:51 AM, amfvcg via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:
>
>> On Sunday, 13 August 2017 at 07:30:32 UTC, Daniel Kozak wrote:
>>
>>> Here is more D idiomatic way:
>>>
>>> import std.stdio : writeln;
>>> import std.algorithm.comparison: min;
>>> import std.algorithm.iteration: sum;
>>> import core.time: MonoTime, Duration;
>>>
>>>
>>> auto sum_subranges(T)(T input, uint range)
>>> {
>>>     import std.array : array;
>>>     import std.range : chunks, ElementType;
>>>     import std.algorithm : map;
>>>
>>>     if (range == 0)
>>>     {
>>>         return ElementType!(T)[].init;
>>>     }
>>>     return input.chunks(range).map!(sum).array;
>>> }
>>>
>>> unittest
>>> {
>>>     assert(sum_subranges([1,1,1], 2) == [2, 1]);
>>>     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
>>>     assert(sum_subranges([], 2) == []);
>>>     assert(sum_subranges([1], 2) == [1]);
>>>     assert(sum_subranges([1], 0) == []);
>>> }
>>>
>>>
>>> int main()
>>> {
>>>     import std.range : iota, array;
>>>     auto v = iota(0,1000000);
>>>     int sum;
>>>     MonoTime beg = MonoTime.currTime;
>>>     for (int i=0; i < 100; i++)
>>>         sum += cast(int)sum_subranges(v,2).length;
>>>     MonoTime end = MonoTime.currTime;
>>>     writeln(end-beg);
>>>     writeln(sum);
>>>     return sum;
>>> }
>>>
>>> On Sun, Aug 13, 2017 at 9:13 AM, Daniel Kozak <kozzi11@gmail.com> wrote:
>>>
>>> this works ok for me with ldc compiler, gdc does not work on my arch
>>>> machine so I can not do comparsion to your c++ versin (clang does not work
>>>> with your c++ code)
>>>>
>>>> import std.stdio : writeln;
>>>> import std.algorithm.comparison: min;
>>>> import std.algorithm.iteration: sum;
>>>> import core.time: MonoTime, Duration;
>>>>
>>>>
>>>> T[] sum_subranges(T)(T[] input, uint range)
>>>> {
>>>>     import std.array : appender;
>>>>     auto app = appender!(T[])();
>>>>     if (range == 0)
>>>>     {
>>>>         return app.data;
>>>>     }
>>>>     for (uint i; i < input.length; i=min(i+range, input.length))
>>>>     {
>>>>         app.put(sum(input[i..min(i+range, input.length)]));
>>>>     }
>>>>     return app.data;
>>>> }
>>>>
>>>> unittest
>>>> {
>>>>     assert(sum_subranges([1,1,1], 2) == [2, 1]);
>>>>     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
>>>>     assert(sum_subranges([], 2) == []);
>>>>     assert(sum_subranges([1], 2) == [1]);
>>>>     assert(sum_subranges([1], 0) == []);
>>>> }
>>>>
>>>>
>>>> int main()
>>>> {
>>>>     import std.range : iota, array;
>>>>     auto v = iota(0,1000000).array;
>>>>     int sum;
>>>>     MonoTime beg = MonoTime.currTime;
>>>>     for (int i=0; i < 100; i++)
>>>>         sum += cast(int)sum_subranges(v,2).length;
>>>>     MonoTime end = MonoTime.currTime;
>>>>     writeln(end-beg);
>>>>     writeln(sum);
>>>>     return sum;
>>>> }
>>>>
>>>> On Sun, Aug 13, 2017 at 9:03 AM, Neia Neutuladh via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:
>>>>
>>>> [...]
>>>>>
>>>>
>> Thank you all for the replies. Good to know the community is alive in d :)
>>
>> Let's settle the playground:
>> D  : http://ideone.com/h4fnsD
>> C++: http://ideone.com/X1pyXG
>>
>> Both using GCC under the hood.
>> C++ in 112 ms;
>> D in :
>> - 2.5 sec with original source;
>> - 2.5 sec with Daniel's 1st version;
>> - 5 sec timeout exceeded with Daniel's 2nd version;
>> - 1.8 sec with Neia-like preallocation;
>>
>> So still it's not that neaty.
>>
>> (What's interesting C++ code generates 2KLOC of assembly, and Dlang @ ldc 12KLOC - checked at godbolt).
>>
>> P.S. For C++ version to work under clang, the function which takes
>> (BinaryOp) must go before the other one (my bad).
>>
>
>


August 13, 2017
On Sunday, 13 August 2017 at 08:13:56 UTC, amfvcg wrote:
> On Sunday, 13 August 2017 at 08:00:53 UTC, Daniel Kozak wrote:
>> my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same
>>
>
> Ok, on ideone (ldc 1.1.0) it timeouts, on dpaste (ldc 0.12.0) it gets killed.
> What version are you using?
>
> Either way, if that'd be the case - that's slick. (and ldc would be the compiler of choice for real use cases).

Here are my results:

$ uname -sri
Linux 4.10.0-28-generic x86_64

$ lscpu | grep 'Model name'
Model name:            Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz

$ ldc2 --version | head -n5
LDC - the LLVM D compiler (1.3.0):
  based on DMD v2.073.2 and LLVM 4.0.0
  built with LDC - the LLVM D compiler (1.3.0)
  Default target: x86_64-unknown-linux-gnu
  Host CPU: ivybridge

$ g++ --version | head -n1
g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406

$ ldc2 -O3 --release sum_subranges.d
$ ./sum_subranges
378 ms, 556 μs, and 9 hnsecs
50000000

$ g++ -O5 sum_subranges.cpp -o sum_subranges
$ ./sum_subranges
237135
50000000
August 13, 2017
> Gives me
>
> 5 μs and 2 hnsecs
> 50000000
> 3 secs, 228 ms, 837 μs, and 4 hnsecs
> 50000000

And you've compiled it with?

Btw. clang for c++ version works worse than gcc (for this case [112ms vs 180ms]).
August 13, 2017
On Sunday, 13 August 2017 at 08:29:30 UTC, Petar Kirov [ZombineDev] wrote:
> On Sunday, 13 August 2017 at 08:13:56 UTC, amfvcg wrote:
>> On Sunday, 13 August 2017 at 08:00:53 UTC, Daniel Kozak wrote:
>>> my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same
>>>
>>
>> Ok, on ideone (ldc 1.1.0) it timeouts, on dpaste (ldc 0.12.0) it gets killed.
>> What version are you using?
>>
>> Either way, if that'd be the case - that's slick. (and ldc would be the compiler of choice for real use cases).
>
> Here are my results:
>
> $ uname -sri
> Linux 4.10.0-28-generic x86_64
>
> $ lscpu | grep 'Model name'
> Model name:            Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
>
> $ ldc2 --version | head -n5
> LDC - the LLVM D compiler (1.3.0):
>   based on DMD v2.073.2 and LLVM 4.0.0
>   built with LDC - the LLVM D compiler (1.3.0)
>   Default target: x86_64-unknown-linux-gnu
>   Host CPU: ivybridge
>
> $ g++ --version | head -n1
> g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406
>
> $ ldc2 -O3 --release sum_subranges.d
> $ ./sum_subranges
> 378 ms, 556 μs, and 9 hnsecs
> 50000000
>
> $ g++ -O5 sum_subranges.cpp -o sum_subranges
> $ ./sum_subranges
> 237135
> 50000000

With Daniel's latest version (
http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d-learn@puremagic.com )

$ ldc2 -O3 --release sum_subranges2.d
$ ./sum_subranges2
210 ms, 838 μs, and 8 hnsecs
50000000
August 13, 2017
And this one is awesome :P
http://ideone.com/muehUw

On Sun, Aug 13, 2017 at 10:27 AM, Daniel Kozak <kozzi11@gmail.com> wrote:

> this one is even faster than c++:
> http://ideone.com/TRDsOo
>
> On Sun, Aug 13, 2017 at 10:00 AM, Daniel Kozak <kozzi11@gmail.com> wrote:
>
>> my second version on ldc takes 380ms and c++ version on same compiler (clang), takes 350ms, so it seems to be almost same
>>
>> On Sun, Aug 13, 2017 at 9:51 AM, amfvcg via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:
>>
>>> On Sunday, 13 August 2017 at 07:30:32 UTC, Daniel Kozak wrote:
>>>
>>>> Here is more D idiomatic way:
>>>>
>>>> import std.stdio : writeln;
>>>> import std.algorithm.comparison: min;
>>>> import std.algorithm.iteration: sum;
>>>> import core.time: MonoTime, Duration;
>>>>
>>>>
>>>> auto sum_subranges(T)(T input, uint range)
>>>> {
>>>>     import std.array : array;
>>>>     import std.range : chunks, ElementType;
>>>>     import std.algorithm : map;
>>>>
>>>>     if (range == 0)
>>>>     {
>>>>         return ElementType!(T)[].init;
>>>>     }
>>>>     return input.chunks(range).map!(sum).array;
>>>> }
>>>>
>>>> unittest
>>>> {
>>>>     assert(sum_subranges([1,1,1], 2) == [2, 1]);
>>>>     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
>>>>     assert(sum_subranges([], 2) == []);
>>>>     assert(sum_subranges([1], 2) == [1]);
>>>>     assert(sum_subranges([1], 0) == []);
>>>> }
>>>>
>>>>
>>>> int main()
>>>> {
>>>>     import std.range : iota, array;
>>>>     auto v = iota(0,1000000);
>>>>     int sum;
>>>>     MonoTime beg = MonoTime.currTime;
>>>>     for (int i=0; i < 100; i++)
>>>>         sum += cast(int)sum_subranges(v,2).length;
>>>>     MonoTime end = MonoTime.currTime;
>>>>     writeln(end-beg);
>>>>     writeln(sum);
>>>>     return sum;
>>>> }
>>>>
>>>> On Sun, Aug 13, 2017 at 9:13 AM, Daniel Kozak <kozzi11@gmail.com> wrote:
>>>>
>>>> this works ok for me with ldc compiler, gdc does not work on my arch
>>>>> machine so I can not do comparsion to your c++ versin (clang does not work
>>>>> with your c++ code)
>>>>>
>>>>> import std.stdio : writeln;
>>>>> import std.algorithm.comparison: min;
>>>>> import std.algorithm.iteration: sum;
>>>>> import core.time: MonoTime, Duration;
>>>>>
>>>>>
>>>>> T[] sum_subranges(T)(T[] input, uint range)
>>>>> {
>>>>>     import std.array : appender;
>>>>>     auto app = appender!(T[])();
>>>>>     if (range == 0)
>>>>>     {
>>>>>         return app.data;
>>>>>     }
>>>>>     for (uint i; i < input.length; i=min(i+range, input.length))
>>>>>     {
>>>>>         app.put(sum(input[i..min(i+range, input.length)]));
>>>>>     }
>>>>>     return app.data;
>>>>> }
>>>>>
>>>>> unittest
>>>>> {
>>>>>     assert(sum_subranges([1,1,1], 2) == [2, 1]);
>>>>>     assert(sum_subranges([1,1,1,2,3,3], 2) == [2, 3, 6]);
>>>>>     assert(sum_subranges([], 2) == []);
>>>>>     assert(sum_subranges([1], 2) == [1]);
>>>>>     assert(sum_subranges([1], 0) == []);
>>>>> }
>>>>>
>>>>>
>>>>> int main()
>>>>> {
>>>>>     import std.range : iota, array;
>>>>>     auto v = iota(0,1000000).array;
>>>>>     int sum;
>>>>>     MonoTime beg = MonoTime.currTime;
>>>>>     for (int i=0; i < 100; i++)
>>>>>         sum += cast(int)sum_subranges(v,2).length;
>>>>>     MonoTime end = MonoTime.currTime;
>>>>>     writeln(end-beg);
>>>>>     writeln(sum);
>>>>>     return sum;
>>>>> }
>>>>>
>>>>> On Sun, Aug 13, 2017 at 9:03 AM, Neia Neutuladh via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:
>>>>>
>>>>> [...]
>>>>>>
>>>>>
>>> Thank you all for the replies. Good to know the community is alive in d :)
>>>
>>> Let's settle the playground:
>>> D  : http://ideone.com/h4fnsD
>>> C++: http://ideone.com/X1pyXG
>>>
>>> Both using GCC under the hood.
>>> C++ in 112 ms;
>>> D in :
>>> - 2.5 sec with original source;
>>> - 2.5 sec with Daniel's 1st version;
>>> - 5 sec timeout exceeded with Daniel's 2nd version;
>>> - 1.8 sec with Neia-like preallocation;
>>>
>>> So still it's not that neaty.
>>>
>>> (What's interesting C++ code generates 2KLOC of assembly, and Dlang @ ldc 12KLOC - checked at godbolt).
>>>
>>> P.S. For C++ version to work under clang, the function which takes
>>> (BinaryOp) must go before the other one (my bad).
>>>
>>
>>
>


August 13, 2017
On Sunday, 13 August 2017 at 08:32:50 UTC, amfvcg wrote:
>> Gives me
>>
>> 5 μs and 2 hnsecs
>> 50000000
>> 3 secs, 228 ms, 837 μs, and 4 hnsecs
>> 50000000
>
> And you've compiled it with?
>
> Btw. clang for c++ version works worse than gcc (for this case [112ms vs 180ms]).

DMD64 D Compiler v2.074.1
August 13, 2017
On Sunday, 13 August 2017 at 08:33:53 UTC, Petar Kirov [ZombineDev] wrote:
>
> With Daniel's latest version (
> http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d-learn@puremagic.com )
>
> $ ldc2 -O3 --release sum_subranges2.d
> $ ./sum_subranges2
> 210 ms, 838 μs, and 8 hnsecs
> 50000000

Great!!! And that's what I was hoping for.

So the conclusion is:

use the latest ldc that's out there.

Thank you Petar, thank you Daniel. (I cannot change the subject to SOLVED, can I?)

Btw. the idiomatic version of this d sample looks how I imagined it should!
August 13, 2017
On Sunday, 13 August 2017 at 08:43:29 UTC, amfvcg wrote:
> On Sunday, 13 August 2017 at 08:33:53 UTC, Petar Kirov [ZombineDev] wrote:
>>
>> With Daniel's latest version (
>> http://forum.dlang.org/post/mailman.5963.1502612885.31550.digitalmars-d-learn@puremagic.com )
>>
>> $ ldc2 -O3 --release sum_subranges2.d
>> $ ./sum_subranges2
>> 210 ms, 838 μs, and 8 hnsecs
>> 50000000
>
> Great!!! And that's what I was hoping for.
>
> So the conclusion is:
>
> use the latest ldc that's out there.
>
> Thank you Petar, thank you Daniel. (I cannot change the subject to SOLVED, can I?)
>
> Btw. the idiomatic version of this d sample looks how I imagined it should!

There's one especially interesting result:

This instantiation:

sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint)

of the following function:

auto sum_subranges(T)(T input, uint range)
{
    import std.range : chunks, ElementType, array;
    import std.algorithm : map;
    return input.chunks(range).map!(sum);
}

gets optimized with LDC to:
  push rax
  test edi, edi
  je .LBB2_2
  mov edx, edi
  mov rax, rsi
  pop rcx
  ret
.LBB2_2:
  lea rsi, [rip + .L.str.3]
  lea rcx, [rip + .L.str]
  mov edi, 45
  mov edx, 89
  mov r8d, 6779
  call _d_assert_msg@PLT

I.e. the compiler turned a O(n) algorithm to O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization:

sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint):
    push   rbp
    mov    rbp,rsp
    sub    rsp,0x30
    mov    DWORD PTR [rbp-0x8],edi
    mov    r9d,DWORD PTR [rbp-0x8]
    test   r9,r9
    jne    41
    mov    r8d,0x1b67
    mov    ecx,0x0
    mov    eax,0x61
    mov    rdx,rax
    mov    QWORD PTR [rbp-0x28],rdx
    mov    edx,0x0
    mov    edi,0x2d
    mov    rsi,rdx
    mov    rdx,QWORD PTR [rbp-0x28]
    call   41
41: mov    QWORD PTR [rbp-0x20],rsi
    mov    QWORD PTR [rbp-0x18],r9
    mov    rdx,QWORD PTR [rbp-0x18]
    mov    rax,QWORD PTR [rbp-0x20]
    mov    rsp,rbp algorithms a
    pop    rbp
    ret

Moral of the story: templates + ranges are an awesome combination.
August 13, 2017
On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov [ZombineDev] wrote:
>
> There's one especially interesting result:
>
> This instantiation:
>
> sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint)
>
> of the following function:
>
> auto sum_subranges(T)(T input, uint range)
> {
>     import std.range : chunks, ElementType, array;
>     import std.algorithm : map;
>     return input.chunks(range).map!(sum);
> }
>
> gets optimized with LDC to:
>   push rax
>   test edi, edi
>   je .LBB2_2
>   mov edx, edi
>   mov rax, rsi
>   pop rcx
>   ret
> .LBB2_2:
>   lea rsi, [rip + .L.str.3]
>   lea rcx, [rip + .L.str]
>   mov edi, 45
>   mov edx, 89
>   mov r8d, 6779
>   call _d_assert_msg@PLT
>
> I.e. the compiler turned a O(n) algorithm to O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization:
>
> sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint):
>     push   rbp
>     mov    rbp,rsp
>     sub    rsp,0x30
>     mov    DWORD PTR [rbp-0x8],edi
>     mov    r9d,DWORD PTR [rbp-0x8]
>     test   r9,r9
>     jne    41
>     mov    r8d,0x1b67
>     mov    ecx,0x0
>     mov    eax,0x61
>     mov    rdx,rax
>     mov    QWORD PTR [rbp-0x28],rdx
>     mov    edx,0x0
>     mov    edi,0x2d
>     mov    rsi,rdx
>     mov    rdx,QWORD PTR [rbp-0x28]
>     call   41
> 41: mov    QWORD PTR [rbp-0x20],rsi
>     mov    QWORD PTR [rbp-0x18],r9
>     mov    rdx,QWORD PTR [rbp-0x18]
>     mov    rax,QWORD PTR [rbp-0x20]
>     mov    rsp,rbp algorithms a
>     pop    rbp
>     ret
>
> Moral of the story: templates + ranges are an awesome combination.

Change the parameter for this array size to be taken from stdin and I assume that these optimizations will go away.
August 13, 2017
On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov [ZombineDev] wrote:
>
> This instantiation:
>
> sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint)
>
> of the following function:
>
> auto sum_subranges(T)(T input, uint range)
> {
>     import std.range : chunks, ElementType, array;
>     import std.algorithm : map;
>     return input.chunks(range).map!(sum);
> }
>
> gets optimized with LDC to:
>  [snip]

> I.e. the compiler turned a O(n) algorithm to O(1), which is quite neat. It is also quite surprising to me that it looks like even dmd managed to do a similar optimization:
>  [snip]

Execution of sum_subranges is already O(1), because the calculation of the sum is delayed: the return type of the function is not `uint`, it is `MapResult!(sum, <range>)` which does a lazy evaluation of the sum.

- Johan