May 14, 2014
On Tuesday, 13 May 2014 at 17:32:21 UTC, H. S. Teoh via Digitalmars-d wrote:
> Lastly, since the range API is an *abstraction*, it should not dictate
> any concrete implementation details such as whether .empty can do
> non-trivial initialization work. Properly-written range-based code
> should be able to handle all possible implementations of the range API,
> including those that do non-trivial work in .empty.

An API is a "user" interface. It should be intuitive.

Besides, D ranges will never perform as well as an optimized explicit loop, so you might as well aim for usability over speed. I learned this over 15 years ago when I was fed up with STL begin/end and implemented my own inlined iterators that work like D ranges (for a scanline renderer).
May 14, 2014
On Wednesday, 14 May 2014 at 06:00:33 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 13 May 2014 at 17:32:21 UTC, H. S. Teoh via Digitalmars-d wrote:
>> Lastly, since the range API is an *abstraction*, it should not dictate
>> any concrete implementation details such as whether .empty can do
>> non-trivial initialization work. Properly-written range-based code
>> should be able to handle all possible implementations of the range API,
>> including those that do non-trivial work in .empty.
>
> An API is a "user" interface. It should be intuitive.
>
> Besides, D ranges will never perform as well as an optimized explicit loop, so you might as well aim for usability over speed.

I wouldn't say never, because optimizers have become damn good. I believe the additional overhead of lazy initialization is mostly optimized away where it matters, i.e. in inner loops, because the compiler can see that the initialization has already been performed in the first iteration. All it requires is unrolling the first iteration of the loop and treating it specially.
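
Roughly, the transformation I have in mind looks like this (a hand-written sketch with a made-up lazily-initialized range, not actual compiler output):

struct Lazy {
    int front;
    int limit;
    private bool initialized;

    @property bool empty() {
        if (!initialized) {      // stand-in for whatever one-time setup empty() has to do
            front = 0;
            initialized = true;
        }
        return front >= limit;
    }

    void popFront() { ++front; }
}

int consume(Lazy r) {
    int sum;
    // Peeled first iteration: this empty() call pays for the setup.
    if (!r.empty) {
        sum += r.front;
        r.popFront();
        // In the remaining iterations the optimizer can (in principle) see
        // that `initialized` stays true and fold the check in empty() away.
        while (!r.empty) {
            sum += r.front;
            r.popFront();
        }
    }
    return sum;
}

void main() {
    import std.stdio : writeln;
    writeln(consume(Lazy(0, 5)));   // prints 10
}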

But of course, this makes your conclusion even more true, because you get the usability without having to give up speed.
May 14, 2014
On Wednesday, 14 May 2014 at 09:47:56 UTC, Marc Schütz wrote:
> I wouldn't say never, because optimizers have become damn good.

Well, "never" without a proof is of course a too strong claim. :-) I agree with that, but I don't think optimizers are that good yet even though some of my troubles have been solved.

> I believe the additional overhead of lazy initialization is mostly optimized away where it matters, i.e. in inner loops, because the compiler can see that the initialization has already been performed in the first iteration. All it requires is unrolling the first iteration of the loop and treating it specially.

You can do this for some generators/iterators, but there are different ways of structuring loops, generators and consumers, and those choices affect what is most efficient, how many registers you use, etc.

Abstractions hide what is going on, so the consumer might miss a more efficient way of structuring the algorithm.

Example: if you know that the end comes after a 0, then you only need to test empty() after you have read a 0. You don't need a separate test for 0 either; the flags from generating the value let you branch straight back to the top of the loop. Quasi ASM:

    jump start              ; enter at the generator step
loop:
    write(x)
start:
    x = generate_next()     ; sets the zero flag
    branch_if_not_zero loop ; non-zero value: write it and keep going
    test( empty() )
    branch_if_zero loop     ; embedded 0, not the end: keep going
    write(0)                ; the terminating 0
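
In D-ish form the same hand-written loop would look roughly like this (a sketch; the names and the toy zero-terminated source are made up for the example):

// The source is known to end with a 0, but may also contain embedded
// zeros, so empty() is only consulted after a 0 has been read.
int[] copyZeroTerminated(S)(ref S src)
{
    int[] sink;
    for (;;)
    {
        immutable x = src.generateNext();
        if (x != 0)
        {
            sink ~= x;          // hot path: no empty() check at all
            continue;
        }
        sink ~= 0;
        if (src.empty)          // only reached after reading a 0
            return sink;        // that was the terminating 0
        // otherwise it was an embedded 0: keep going
    }
}

// A toy source with an embedded 0 followed by the terminating 0.
struct ZeroTerminated
{
    int[] data = [3, 1, 0, 2, 0];
    size_t i;
    int generateNext() { return data[i++]; }
    @property bool empty() { return i >= data.length; }
}

void main()
{
    import std.stdio : writeln;
    auto src = ZeroTerminated();
    writeln(copyZeroTerminated(src));   // [3, 1, 0, 2, 0]
}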

It is easy to construct examples where naive ranges will be suboptimal.

For more complex iterators you also need to deal with sparse data structures, skip lists, run-length encoded data, SIMD alignment, guard-protected stream ends… meta information you exploit when you need faster algorithms.
May 14, 2014
On Wednesday, 14 May 2014 at 06:00:33 UTC, Ola Fosheim Grøstad wrote:
> Besides, D ranges will never perform as well as an optimized explicit loop, so you might as well aim for usability over speed.

There is not a single reason why this should be true. Actually, ranges of medium complexity are already very close to explicit loops in D when you use something like LDC.

Consider a sample program like this:

  1 void main()
  2 {
  3     import std.stdio;
  4     int max;
  5     readf("%d", &max);
  6     assert(foo1(max) == foo2(max));
  7 }
  8
  9 int foo1(int max)
 10 {
 11     int sum = 0;
 12     for (auto i = 0; i < max; ++i)
 13     {
 14         sum += i*2;
 15     }
 16     return sum;
 17 }
 18
 19 int foo2(int max)
 20 {
 21     import std.algorithm : reduce, map;
 22     import std.range : iota;
 23     return reduce!((a, b) => a + b)(0, iota(0, max).map!(a => a*2));
 24 }

Disassembly (ldmd2 -O -release -inline):

0000000000402df0 <_D4test4foo1FiZi>:
  402df0:   31 c0                   xor    %eax,%eax
  402df2:   85 ff                   test   %edi,%edi
  402df4:   7e 10                   jle    402e06 <_D4test4foo1FiZi+0x16>
  402df6:   8d 47 ff                lea    -0x1(%rdi),%eax
  402df9:   8d 4f fe                lea    -0x2(%rdi),%ecx
  402dfc:   0f af c8                imul   %eax,%ecx
  402dff:   83 e1 fe                and    $0xfffffffe,%ecx
  402e02:   8d 44 79 fe             lea    -0x2(%rcx,%rdi,2),%eax
  402e06:   c3                      retq
  402e07:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  402e0e:   00 00

0000000000402e10 <_D4test4foo2FiZi>:
  402e10:   31 c0                   xor    %eax,%eax
  402e12:   85 ff                   test   %edi,%edi
  402e14:   0f 48 f8                cmovs  %eax,%edi
  402e17:   85 ff                   test   %edi,%edi
  402e19:   74 10                   je     402e2b <_D4test4foo2FiZi+0x1b>
  402e1b:   8d 47 ff                lea    -0x1(%rdi),%eax
  402e1e:   8d 4f fe                lea    -0x2(%rdi),%ecx
  402e21:   0f af c8                imul   %eax,%ecx
  402e24:   83 e1 fe                and    $0xfffffffe,%ecx
  402e27:   8d 44 79 fe             lea    -0x2(%rcx,%rdi,2),%eax
  402e2b:   c3                      retq
  402e2c:   0f 1f 40 00             nopl   0x0(%rax)

It is almost identical. I fully expect it to be 100% identical at a certain point of compiler maturity.
May 14, 2014
On Wednesday, 14 May 2014 at 13:29:35 UTC, Dicebot wrote:
>   9 int foo1(int max)
>  10 {
>  11     int sum = 0;
>  12     for (auto i = 0; i < max; ++i)
>  13     {
>  14         sum += i*2;
>  15     }
>  16     return sum;
>  17 }

This isn't really a loop; it reduces to a closed-form expression, n*(n+1) with n = max - 1, which the optimizer computes directly.

Try something more realistic, such as alpha-blending into a buffer, or DSP.
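
For instance, something along these lines (a rough sketch; the names and the 8-bit blend formula are just my illustration):

// Hypothetical example: blend an RGBA source over an opaque destination,
// one byte channel at a time, using the per-pixel source alpha.
void blendOver(ubyte[] dst, const(ubyte)[] src)
{
    assert(dst.length == src.length && dst.length % 4 == 0);
    for (size_t i = 0; i < dst.length; i += 4)
    {
        immutable uint a = src[i + 3];
        foreach (c; 0 .. 3)
        {
            // dst = (src*a + dst*(255 - a)) / 255, using the usual
            // exact divide-by-255 trick with +128 rounding.
            immutable uint v = src[i + c] * a + dst[i + c] * (255 - a) + 128;
            dst[i + c] = cast(ubyte)((v + (v >> 8)) >> 8);
        }
        dst[i + 3] = 255;   // destination stays opaque
    }
}

void main()
{
    import std.stdio : writeln;
    ubyte[] dst = [10, 20, 30, 255,   40, 50, 60, 255];
    ubyte[] src = [200, 100, 50, 128,  0,  0,  0,   0];
    blendOver(dst, src);
    writeln(dst);
}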
May 14, 2014
On Wednesday, 14 May 2014 at 11:23:16 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 14 May 2014 at 09:47:56 UTC, Marc Schütz wrote:
>> I wouldn't say never, because optimizers have become damn good.
>
> Well, "never" without a proof is of course a too strong claim. :-) I agree with that, but I don't think optimizers are that good yet even though some of my troubles have been solved.
>
>> I believe the additional overhead of lazy initialization is mostly optimized away where it matters, i.e. inner loops, because  the compiler can see that the initialization has already been performed in the first iteration. All it requires is unrolling the first iteration of the loop and treating it specially.

As a demonstration of what I mean:

///////////////////////////////////
import std.functional;

struct Filter(alias predicate, Range) {
        Range r;

private:
        // set lazily on the first call to empty()/front(), reset by popFront()
        bool is_initialized;

public:
        void initialize() {
                alias pred = unaryFun!predicate;
                if(is_initialized)
                        return;
                while(!r.empty && !pred(r.front))
                        r.popFront();
                is_initialized = true;
        }

        @property bool empty() {
                initialize();
                return r.empty;
        }

        @property auto front() {
                initialize();
                return r.front;
        }

        void popFront() {
                r.popFront();
                // force the filtering step to run again on the next access
                is_initialized = false;
        }

        auto sum() {
                typeof(r.front) result = 0;
                foreach(v; this)
                        result += v;
                return result;
        }
}

auto filter(alias predicate, Range)(Range r) {
        return Filter!(predicate, Range)(r);
}

extern bool myPredicate(int); // defined elsewhere, opaque to the optimizer

struct MyGen {
        int front;
        int max;

        @property bool empty() {
                return front > max;
        }

        void popFront() {
                front++;
        }
}

void main() {
        auto gen = MyGen(1, 6);
        gen.filter!myPredicate.sum();
        gen.filter!"a % 2".sum();
        gen.filter!(a => true).sum();
}
///////////////////////////////////

Compile this with:

    ldc2 -O3 -release -output-s demo.d

and have a look at the generated assembly code for the Filter.sum() functions. All of them have been inlined as much as possible, and in particular the variable `is_initialized` has disappeared, even in the version that uses an external (unknown to the compiler) predicate.

Of course, if you don't access the range in a tight loop, these optimizations usually aren't possible. But arguably, in such cases an extra variable and an extra check for initialization aren't too bad. This means that we can have lazy implementations even without a guaranteed call to `empty`, and still have comparable performance to eagerly initialized ranges where it matters most.
May 14, 2014
On Wednesday, 14 May 2014 at 19:32:57 UTC, Marc Schütz wrote:
> Compile this with:
>
>     ldc2 -O3 -release -output-s demo.d
>
> and have a look at the generated assembly code for the Filter.sum() functions. All of them have been inlined as much as possible, and in particular the variable `is_initialized` has disappeared, even in the version that uses an external (unknown to the compiler) predicate.

I only have DMD installed, but I'll take your word for it. However, I don't feel confident that compilers will ever catch up, even for simple loops.

Take floating point, for instance. Floating point math is inaccurate, which means the compiler would have to guess what kind of accuracy you are happy with… It can't, so it can't optimize really well, even when unrolling loops. Why is that? Well, because even if it can emit SIMD instructions, it cannot decouple the dependencies between consecutive elements without introducing drift between the calculations over time. It has to assume the worst case.

If you have a simple generator like this:

sample[n] = f( sample[n-1] )

you could in theory do

sample[n] = f(f(f(f( sample[n-4] ))))
sample[n+1] = f(f(f(f( sample[n-3] ))))
sample[n+2] = f(f(f(f( sample[n-2] ))))
sample[n+3] = f(f(f(f( sample[n-1] ))))

But because of floating point inaccuracies you would then risk that sample[BIGNUMBER] and sample[BIGNUMBER+1] end up completely disconnected, which could be a disaster. So only hand optimization, backed by analysis of the math and its stability, will get you the SIMD speed-up.
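
A rough sketch of that restructuring in D (the generator step f and the constants are made up; whether the four decoupled chains stay acceptably close to the serial result is exactly the numerical question the compiler cannot answer for you):

import std.stdio;

// Hypothetical generator step; the constants are arbitrary.
float f(float x) { return 0.5f * x + 1.0f; }

// Serial version: a chain of dependent operations, one element at a time.
void serial(float[] sample)
{
    foreach (n; 1 .. sample.length)
        sample[n] = f(sample[n - 1]);
}

// Hand-restructured version: four independent chains, one per SIMD lane.
// To actually profit from SIMD, the composed step f(f(f(f(x)))) would have
// to be collapsed algebraically, and that changes the rounding behaviour;
// deciding whether that is acceptable takes analysis the compiler can't do.
void fourChains(float[] sample)
{
    foreach (n; 4 .. sample.length)
        sample[n] = f(f(f(f(sample[n - 4]))));
}

void main()
{
    auto a = new float[](12);
    a[] = 0;
    a[0] = 1.0f;
    serial(a[0 .. 4]);   // seed the first four elements
    fourChains(a);
    writeln(a);
}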

> This means that we can have lazy implementations even without a guaranteed call to `empty`, and still have comparable performance to eagerly initialized ranges where it matters most.

Maybe you can. :-) I will have to get my hands on LDC and try to create a counter-example… Hm.
May 15, 2014
On Wednesday, 14 May 2014 at 14:00:18 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 14 May 2014 at 13:29:35 UTC, Dicebot wrote:
>>  9 int foo1(int max)
>> 10 {
>> 11     int sum = 0;
>> 12     for (auto i = 0; i < max; ++i)
>> 13     {
>> 14         sum += i*2;
>> 15     }
>> 16     return sum;
>> 17 }
>
> This isn't really a loop; it reduces to a closed-form expression, n*(n+1) with n = max - 1, which the optimizer computes directly.
>
> Try something more realistic, such as alpha-blending into a buffer, or DSP.

I think it is your turn to make a counter-example ;) It does not matter that much what the loop actually does. It was effectively an expression here, but the compiler was able to reduce the range version to pretty much the same expression despite all the intermediate code. Of course there are still many ways to fool existing optimizers, but declaring a fundamental inability to do so? No way.
May 15, 2014
On Thursday, 15 May 2014 at 12:34:49 UTC, Dicebot wrote:
> I think it is your turn to make a counter-example ;)

I've given two examples that no compiler can get at… because compilers lack the contextual knowledge. ;)

Basically, anything that involves guards at the end is an issue, most things that involve floating point are an issue, and there are others too, I think…

I'll get back to counter-examples later when I have time to look closer at LDC.
(not this week)
May 15, 2014
On Thursday, 15 May 2014 at 13:01:34 UTC, Ola Fosheim Grøstad wrote:
> On Thursday, 15 May 2014 at 12:34:49 UTC, Dicebot wrote:
>> I think it is your turn to make a counter-example ;)
>
> I've given 2 examples that no compiler can get at… because they lack contextual knowledge. ;)

If the compiler lacks contextual knowledge, that only means the range is not actually semantically equivalent to the loop. Which is quite possible once you start considering something like SIMD, this is true.

What is wrong is the assumption that such loops are anything but a tiny minority, and that this warrants the generic "ranges can never be as fast as loops" statement.

Btw,

> Floating point math is inaccurate, which means the compiler would have to guess what kind of accuracy you are happy with…

AFAIK that is actually not true. The floating point standard defines basic precision guarantees, and compilers are only free to make adjustments outside of those basics. In your example you need to verify the generated assembly anyway, loop or range.