April 27, 2009
Derek Parnell:
> [...] I still find D2 templates as clear as mud. The syntax is butt-ugly, counter intuitive, and reader-hostile, IMHO.

I find templates in D1 easy to use in most situations, and the template constraints of D2 are easy too.
What kind of syntax/semantics would you suggest to improve the current situation? If you have ideas, I am interested.
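For reference, a minimal sketch of a D2 template constraint (the twice function and the use of std.traits.isIntegral are just my illustration here, not code from this thread):

import std.traits: isIntegral;

// The "if (...)" clause after the parameter lists is the template constraint:
// the template matches only when the condition is true at compile time.
T twice(T)(T x) if (isIntegral!T) {
    return cast(T)(x * 2);
}

void main() {
    assert(twice(21) == 42);  // fine: int is integral
    // twice(1.5);            // rejected at compile time by the constraint
}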

Bye,
bearophile
April 27, 2009
Georg Wrede wrote:
> There's an illusion. And that illusion comes from the D newsgroups having "wrong" names.
> 
> The D2 newsgroup should have a name like "D2 discussion -- for D language development folks, enter at your own risk". And a *D1* newsgroup should then be for anybody who actually uses the language for something. Currently, actually, the D.learn newsgroup has sort-of assumed this functionality.

Yes, well put. I think it would be great to define a digitalmars.d2 or digitalmars.d2-design newsgroup.

Andrei
April 27, 2009
dsimcha:
> Rule number 1 of all performance discussions:  Always measure if possible.

Very good, thank you. I have run your code, with a few changes:

import std.stdio: writeln;
import std.perf: PerformanceCounter;

enum uint N = 1_000_000_000;

// Range interface on a struct: non-virtual front/popFront/empty.
struct Direct {
    uint n;
    uint front() { return n; }
    void popFront() { n++; }
    bool empty() { return n >= N; }
}


// Internal iteration: foreach calls opApply, which makes one delegate call per element.
struct Apply {
    int opApply(int delegate(ref uint) dg) {
        int res;
        foreach (i; 0U .. N) {
            res = dg(i);
            if (res)
                break;
        }
        return res;
    }
}


// Same range interface as Direct, but on a class, so the calls are virtual.
class Virtual {
    uint n;
    uint front() { return n; }
    void popFront() { n++; }
    bool empty() { return n >= N; }
}


void main() {
    scope pc = new PerformanceCounter;
    pc.start;
    foreach (elem; Direct.init) {}
    pc.stop;
    writeln("Direct:         ", pc.milliseconds);

    pc.start;
    foreach (elem; Apply.init) {}
    pc.stop;
    writeln("opApply:        ", pc.milliseconds);

    pc.start;
    auto v = new Virtual;
    foreach (elem; v) {}
    pc.stop;
    writeln("Virtual:        ", pc.milliseconds);

    pc.start;
    scope v2 = new Virtual;
    foreach (elem; v2) {}
    pc.stop;
    writeln("Scoped virtual: ", pc.milliseconds);
}

As you can see, I have added a version with a scoped class at the bottom, hoping to improve performance a bit, but the result is the opposite. Can you explain why?

Direct:         2699
opApply:        3520
Virtual:        7543
Scoped virtual: 8550

Run on a Core2 2 GHz, on WinXP.

Bye,
bearophile
April 27, 2009
dsimcha wrote:
> Output:
> Direct:  2343
> Virtual:  5695
> opApply:  3014
> 
> Bottom line is that these pretty much come in the order you would expect them to,
> but there are no particularly drastic differences between any of them.  To put
> these timings in perspective, 5700 ms for 1 billion iterations is roughly (on a
> 2.7 GHz machine) 15 clock cycles per iteration.  How often does anyone really have
> code that is performance critical *and* where the contents of the loop don't take
> long enough to dwarf the 15 clock cycles per iteration loop overhead *and* you
> need the iteration to be polymorphic?

Thanks for the numbers. The way I'd read them is not to look at the raw milliseconds but at the ratios. Loops with opApply have roughly 28% extra overhead here ((3014 - 2343) / 2343 ≈ 0.286), and I don't think that makes them look very good. Bearophile's numbers show about a 30% overhead ((3520 - 2699) / 2699 ≈ 0.304).

I have measurements that show more dramatic pessimization, but I never keep those around. I know from my early tests for std.algorithm that, had I used opApply in it, I would have ended up essentially never using an algorithm if I could help it.


Andrei
April 27, 2009
> Michel Fortin:
> > Isn't there room for optimisation on the compiler side though? I mean, the compiler could inline opApply, and while doing so it could notice that the delegate is constant and inline it too. I suspect that adding this optimisation would put opApply at the same performance level as ranges.

Daniel Keep:
> Inlining does not automatically make things faster.

That's true in general: if you inline too much code, you can end up with too much code being moved in and out of the caches, reducing performance.
I agree with Michel: if done wisely by the compiler, some inlining may help with loops like these.

Bye,
bearophile
April 27, 2009
On Sun, 26 Apr 2009 21:20:32 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> Michel Fortin wrote:
>> On 2009-04-26 11:46:51 -0400, dsimcha <dsimcha@yahoo.com> said:
>>
>>> == Quote from Michel Fortin (michel.fortin@michelf.com)'s article
>>>> Hum, are you saying opApply superior when it comes to iterating in a
>>>> tree? It seems that with opApply you could use the stack by recursively
>>>> calling opApply with the given delegate on each one of the branches.
>>>
>>> Exactly.  On the other hand, you lose a lot of flexibility with opApply.  If you
>>> tried to port most of std.range to operate on the opApply interface instead of the
>>> forward range interface, I doubt you'd get very far.
>>>
>>> IMHO, though, opApply should *not* be deprecated.  opApply and ranges attempt to
>>> solve similar problems, but not the same problem.  Ranges attempt to solve the
>>> problem of iterating over some object with maximum flexibility from the point of
>>> view of the user of the object.  opApply attempts to solve the problem of
>>> iterating with maximum flexibility from the point of view of the implementer of
>>> the object.  In some cases, like the one you just described, the latter can be
>>> better.
>>  Indeed. I certainly agree that both ranges and opApply have their place.
>>  So what the implementer can do with opApply is write an optimized iteration algorithm for use with foreach. Which may mean that when both opApply and ranges are available for generating foreach's code, the compiler should favor opApply. Currently, I believe it's the reverse.
>
> I think, all other things being equal, that opApply tends to be slower than iterators.

It depends.  The range must have either inline-able functions, or must NOT call virtual functions.  Otherwise, it could be worse.  There is also the factor that you are using an inner function with opApply, so if you regularly access variables outside the foreach loop, those accesses add up.

Observe that in opApply style, you do one delegate call per loop iteration.
With range style, you do three calls on the range per loop iteration (empty, front, popFront).

If those range calls make virtual calls, or ARE virtual calls themselves, then ranges lose.  Of course, the difference might be washed out if you do lots of accesses to an outer variable in your foreach loop, which require a pointer dereference vs. a stack dereference.

We should quantify this and see what the actual performance is.
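To make the call-count difference concrete, here is a minimal, self-contained sketch (CountRange and Counter are made-up names, and this is only roughly what foreach expands to, not the compiler's exact lowering):

import std.stdio: writeln;

// Range style: foreach expands into empty/front/popFront, three calls per iteration.
struct CountRange {
    uint n, limit;
    bool empty() { return n >= limit; }
    uint front() { return n; }
    void popFront() { n++; }
}

// opApply style: one delegate call per iteration; the foreach body becomes a
// nested function, so outer locals are reached through its frame pointer.
struct Counter {
    uint limit;
    int opApply(int delegate(ref uint) dg) {
        for (uint i = 0; i < limit; i++)
            if (auto r = dg(i))
                return r;
        return 0;
    }
}

void main() {
    uint sum1, sum2;

    // Roughly what "foreach (x; CountRange(0, 10)) sum1 += x;" turns into:
    for (auto r = CountRange(0, 10); !r.empty(); r.popFront())
        sum1 += r.front();

    // Roughly what "foreach (x; Counter(10)) sum2 += x;" turns into:
    Counter(10).opApply((ref uint x) { sum2 += x; return 0; });

    writeln(sum1, " ", sum2);  // both 45
}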

-Steve
April 27, 2009
Michel Fortin wrote:
> On 2009-04-27 00:50:23 -0400, dsimcha <dsimcha@yahoo.com> said:
> 
>> Output:
>> Direct:  2343
>> Virtual:  5695
>> opApply:  3014
> 
> Nice.
> 
> Isn't there room for optimisation on the compiler side though? I mean, the compiler could inline opApply, and while doing so it could notice that the delegate is constant and inline it too.

For non-virtual opApply this is exactly what LDC does[1] at the higher optimization levels :).
(Assuming both opApply and the foreach body are deemed inline-worthy by the inliner)

[1]: Since trunk r1219, which was about 11 days ago.


> I suspect that adding
> this optimisation would put opApply at the same performance level as
> ranges.

I haven't actually measured to see if this is true, but there should indeed be very little difference since this optimization essentially turns opApply into a regular loop (and LDC does this before any loop optimizations run).
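For reference, a minimal sketch of what the transformation amounts to (Each, sumViaOpApply and sumInlined are made-up names; this is hand-written to show the idea, not LDC's actual output):

enum uint N = 1_000;  // kept small so the assert below runs quickly

struct Each {
    int opApply(int delegate(ref uint) dg) {
        foreach (i; 0U .. N)
            if (auto r = dg(i))
                return r;
        return 0;
    }
}

// What the user writes: an indirect delegate call per element.
uint sumViaOpApply() {
    uint sum;
    Each e;
    foreach (i; e)
        sum += i;
    return sum;
}

// Roughly what remains after opApply and the now-constant delegate are both
// inlined: a plain counted loop with no indirect calls, which later loop
// optimizations can work on.
uint sumInlined() {
    uint sum;
    for (uint i = 0; i < N; i++)
        sum += i;
    return sum;
}

void main() {
    assert(sumViaOpApply() == sumInlined());
}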
April 27, 2009
Frits van Bommel:
> I haven't actually measured to see if this is true, but there should indeed be very little difference since this optimization essentially turns opApply into a regular loop (and LDC does this before any loop optimizations run).

When you find a bit of free time, please time it and show the timings :-)

Bye,
bearophile
April 27, 2009
== Quote from Robert Fraser (fraserofthenight@gmail.com)'s article
> Did you compile that with optimization/inlining? If not, it's not really a fair test... and if so why isn't the compiler getting rid of those pointless loops?

Yes, -O -inline -release.  I didn't bother stating that explicitly b/c it's the standard way of doing benchmarks.
April 27, 2009
Steven Schveighoffer wrote:
> On Sun, 26 Apr 2009 21:20:32 -0400, Andrei Alexandrescu 
>> I think, all other things being equal, that opApply tends to be slower than iterators.
> 
> It depends.  The range must have either inline-able functions, or must NOT call virtual functions.  Otherwise, it could be worse.  There is also the factor that you are using an inner function with opApply, so if you regularly access variables outside the foreach loop, those accesses add up.
> 
> Observe that in opApply style, you do one delegate call per loop iteration.
> With range style, you do three calls on the range per loop iteration (empty, front, popFront).
> 
> If those range calls make virtual calls, or ARE virtual calls themselves, then ranges lose.  Of course, the difference might be washed out if you do lots of accesses to an outer variable in your foreach loop, which require a pointer dereference vs. a stack dereference.
> 
> We should quantify this and see what the actual performance is.

This is a very good point. If a range offers virtual empty/popFront/front, then opApply is the faster proposition. That makes a good case for making opApply the preferred way of iteration when it is present (if you just write foreach and opApply is present, use it).
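A minimal sketch of the trade-off (Numbers is a made-up class for illustration; the per-iteration counts in the comments follow Steven's reasoning above):

class Numbers {
    uint n, limit;
    this(uint limit) { this.limit = limit; }

    // Range interface on a class: three virtual calls per iteration.
    bool empty() { return n >= limit; }
    uint front() { return n; }
    void popFront() { n++; }

    // opApply: one virtual call up front, then one indirect (but non-virtual)
    // delegate call per iteration.
    int opApply(int delegate(ref uint) dg) {
        for (uint i = 0; i < limit; i++)
            if (auto r = dg(i))
                return r;
        return 0;
    }
}

void main() {
    uint sum;
    foreach (i; new Numbers(10))  // which interface foreach picks here is exactly the policy question above
        sum += i;
    assert(sum == 45);            // same result either way; the cost per iteration differs
}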

Andrei