Ranges, Performance, and opApply
October 30, 2011
As it stands right now, ranges are quite inefficient. Phobos is designed with an emphasis on ranges, including in performance-sensitive code such as std.algorithm. No matter how optimized the range is, it will always be slower than the equivalent opApply function (and, as far as I can tell, offers no benefit for just iteration).

The fastest way of iterating over anything currently is foreach over an array. It beats even `for (size_t i = 0; i < length; i++)` by a decent bit. The lowest level of almost every range chain is an array (that is, if you have a range wrapping a range wrapping a range, the innermost one is likely an array or backed by one). By implementing an opApply for these ranges that calls opApply on the wrapped range or wrapped array (or other), you can roughly double iteration performance (for many iterations over a small body). The compiler-generated iteration cannot make this optimization itself, as the compiler does not know what the underlying data for the range is.

Creating an opApply can be done on a per-range basis for ranges that are used in performance-sensitive code (such as countUntil). If the manual opApply foreach's over the input range, and the input range has a similar manual opApply, there are fair performance benefits. If the underlying range does not have a manual opApply, there will still be some performance benefit, and no code breaks in any way. If all ranges being used have opApply and the innermost range is an array, the performance benefits are very significant.

As an example, I made a test of 5 different ways of implementing std.algorithm.filter, plus two tests of another filter on top of the output from the first one, one with a manual opApply and one without. Basically, a manual opApply was over twice as fast. The code and results can be found at http://pastie.org/2781723. Ultimately, nothing is as fast as just doing a foreach yourself, but that's to be expected. The difference can, however, easily be made less extreme than it currently is.

So, should Phobos ranges provide a manual opApply where feasible, for use in performance-sensitive code?
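The proposal above can be sketched roughly as follows. This is a minimal illustration, not the actual std.algorithm.filter implementation: `FilterApply` and its layout are made up for the example. The point is that the manual opApply lets the innermost foreach run directly over the wrapped array, so foreach over the wrapper compiles down to a plain array loop plus the predicate and delegate calls.

```d
import std.stdio;

// Hypothetical filter wrapper with a manual opApply (illustrative only,
// not the Phobos implementation).
struct FilterApply(alias pred)
{
    int[] source;

    int opApply(scope int delegate(ref int) dg)
    {
        // Plain array foreach on the inside: no empty/front/popFront calls.
        foreach (ref e; source)
        {
            if (pred(e))
                if (auto r = dg(e))  // nonzero means break/return in caller
                    return r;
        }
        return 0;
    }
}

void main()
{
    auto evens = FilterApply!(a => a % 2 == 0)([1, 2, 3, 4, 5, 6]);
    int sum;
    foreach (e; evens)  // dispatches to opApply
        sum += e;
    writeln(sum);       // prints 12
}
```

A chained wrapper would forward in the same way: its opApply would foreach over the inner wrapper, which in turn foreach's over the array.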
October 30, 2011
On 10/30/11 3:02 AM, Kapps wrote:
> As it stands right now, ranges are quite inefficient. Phobos is
> designed with an emphasis on ranges, including in
> performance-sensitive code such as std.algorithm. No matter how
> optimized the range is, it will always be slower than the equivalent
> opApply function (and, as far as I can tell, offers no benefit for
> just iteration).
>

The test is flawed in a subtle way and no conclusion can be drawn from it.

The test filters a checkerboard distribution (one element satisfies/the
next doesn't satisfy the filter). As such, the loop on lines 26-29 will
always test exactly two times to take one step. In contrast, the loop at
lines 75-81 is integrated and only makes the minimum amount of tests.
Calling the delegate is expensive, but there are only half as many calls
and the saved tests compensate for that.

I believe there is nothing inherent in the design of ranges that makes
them generally less efficient than various alternatives. Yes, there will be the corner cases where internal iteration will be better, but overall I don't think they form a large fraction of all cases.


Andrei
October 30, 2011
On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 10/30/11 3:02 AM, Kapps wrote:
>> As it stands right now, ranges are quite inefficient. Phobos is
>> designed with an emphasis on ranges, including in
>> performance-sensitive code such as std.algorithm. No matter how
>> optimized the range is, it will always be slower than the equivalent
>> opApply function (and, as far as I can tell, offers no benefit for
>> just iteration).
>>
>
> The test is flawed in a subtle way and no conclusion can be drawn from it.
>
> The test filters a checkerboard distribution (one element satisfies/the
> next doesn't satisfy the filter). As such, the loop on lines 26-29 will
> always test exactly two times to take one step. In contrast, the loop at
> lines 75-81 is integrated and only makes the minimum amount of tests.
> Calling the delegate is expensive, but there are only half as many calls
> and the saved tests compensate for that.
>
No, it does not.
In fact the opApply version does ONE additional check at loop start, which
doesn't matter for 2.5M iterations.

> I believe there is nothing inherent in the design of ranges that makes
> them generally less efficient than various alternatives. Yes, there will be the corner cases where internal iteration will be better, but overall I don't think they form a large fraction of all cases.
>
>
> Andrei

The test still doesn't measure anything but spurious effects of different inlining decisions.
I get a ~10% hit for range foreach over opApply.
When looping without a predicate, popFront is inlined and the result is a ~10% hit for opApply.
As the opApply body can never be inlined, it is a worse choice in general.

martin
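The two iteration protocols being compared can be sketched side by side. `Counter` below is illustrative (not the thread's benchmark code): the range primitives are small non-virtual functions that the inliner can flatten, whereas opApply receives the loop body as a delegate, which the 2011-era dmd inliner mostly cannot inline. Note that when a type provides both, foreach prefers opApply.

```d
import std.stdio;

// Minimal sketch of both iteration protocols on one type (illustrative).
struct Counter
{
    int i, n;

    // Range primitives: ordinary small calls, good inlining candidates.
    @property bool empty() const { return i >= n; }
    @property int front() const { return i; }
    void popFront() { ++i; }

    // Internal iteration: the loop body arrives as a delegate.
    int opApply(scope int delegate(int) dg)
    {
        for (int k = i; k < n; ++k)
            if (auto r = dg(k))
                return r;
        return 0;
    }
}

void main()
{
    int sumRange, sumApply;

    // Range protocol, spelled out by hand:
    for (auto c = Counter(0, 5); !c.empty; c.popFront())
        sumRange += c.front;

    // foreach picks opApply when it is present:
    foreach (k; Counter(0, 5))
        sumApply += k;

    writeln(sumRange, " ", sumApply); // prints 10 10
}
```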
October 30, 2011
On 10/30/11 11:45 AM, Martin Nowak wrote:
> On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 10/30/11 3:02 AM, Kapps wrote:
>>> As it stands right now, ranges are quite inefficient. Phobos is
>>> designed with an emphasis on ranges, including in
>>> performance-sensitive code such as std.algorithm. No matter how
>>> optimized the range is, it will always be slower than the equivalent
>>> opApply function (and, as far as I can tell, offers no benefit for
>>> just iteration).
>>>
>>
>> The test is flawed in a subtle way and no conclusion can be drawn from
>> it.
>>
>> The test filters a checkerboard distribution (one element satisfies/the
>> next doesn't satisfy the filter). As such, the loop on lines 26-29 will
>> always test exactly two times to take one step. In contrast, the loop at
>> lines 75-81 is integrated and only makes the minimum amount of tests.
>> Calling the delegate is expensive, but there are only half as many calls
>> and the saved tests compensate for that.
>>
> No it does not.

Indeed you're right. In the quiescent state both implementations call the support functions the same number of times. Thanks.

> In fact the opApply version does ONE additional check at loop start, which
> doesn't matter for 2.5M iterations.
>
>> I believe there is nothing inherent in the design of ranges that makes
>> them generally less efficient than various alternatives. Yes, there
>> will be the corner cases where internal iteration will be better, but
>> overall I don't think they form a large fraction of all cases.
>>
>>
>> Andrei
>
> The test still doesn't measure anything but spurious effects of
> different inline decisions.
> I get a ~10% hit for range foreach over opApply.
> When looping without predicate than popFront is inlined and the result
> is a ~10% hit for opApply.
> As the opApply body can never be inlined it is a worse choice in general.

That was my intuition too.


Andrei
October 30, 2011
On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
>> As the opApply body can never be inlined it is a worse choice in general.
>
> That was my intuition too.
>
>
> Andrei

Just for future reference, LDC **DOES** sometimes inline opApply bodies.
October 30, 2011
On Sun, 30 Oct 2011 20:23:59 +0100, dsimcha <dsimcha@yahoo.com> wrote:

> On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
>>> As the opApply body can never be inlined it is a worse choice in general.
>>
>> That was my intuition too.
>>
>>
>> Andrei
>
> Just for future reference, LDC **DOES** sometimes inline opApply bodies.

Right, it should have been 'is mostly not inlined'.
October 31, 2011
On Sun, 30 Oct 2011 15:23:59 -0400, dsimcha <dsimcha@yahoo.com> wrote:

> On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
>>> As the opApply body can never be inlined it is a worse choice in general.
>>
>> That was my intuition too.
>>
>>
>> Andrei
>
> Just for future reference, LDC **DOES** sometimes inline opApply bodies.

The compiler should almost always be able to inline opApply, as the code for the opApply body is always available.

There are a few exceptions, such as when opApply is not final, or when it's recursive.  I wonder if even in these cases some sort of "virtual inlining", such as pushing the code onto the stack and avoiding function calls, would be possible (speaking from naivety; maybe this does nothing).  Being able to exploit the fact that a delegate literal is always fully available would be nice.

Indeed, opApply should beat the pants off of ranges when the range primitives are virtual functions.  But ranges should win when inlining is possible in some cases.

There are always going to be use cases for opApply that ranges cannot duplicate (such as recursion), and vice versa (such as parallel iteration).

-Steve
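The recursion point can be made concrete with a small sketch. In-order traversal of a binary tree is natural with a recursive opApply but awkward to express as a range without maintaining an explicit stack. The `Node` type here is illustrative, not from the thread.

```d
import std.stdio;

// Illustrative binary tree with a recursive opApply (in-order traversal).
class Node
{
    int value;
    Node left, right;

    this(int v, Node l = null, Node r = null)
    {
        value = v; left = l; right = r;
    }

    int opApply(scope int delegate(ref int) dg)
    {
        // Recurse left, visit this node, recurse right; any nonzero
        // return from the delegate propagates up to stop iteration.
        if (left)
            if (auto r = left.opApply(dg)) return r;
        if (auto r = dg(value)) return r;
        if (right)
            if (auto r = right.opApply(dg)) return r;
        return 0;
    }
}

void main()
{
    auto root = new Node(2, new Node(1), new Node(3));
    foreach (v; root)
        write(v, ' ');  // in-order: 1 2 3
    writeln();
}
```

A range over the same tree would have to carry the traversal stack in its own state so that popFront can resume where the recursion left off.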
October 31, 2011
On Mon, 31 Oct 2011 13:56:21 +0100, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> On Sun, 30 Oct 2011 15:23:59 -0400, dsimcha <dsimcha@yahoo.com> wrote:
>
>> On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
>>>> As the opApply body can never be inlined it is a worse choice in general.
>>>
>>> That was my intuition too.
>>>
>>>
>>> Andrei
>>
>> Just for future reference, LDC **DOES** sometimes inline opApply bodies.
>
> The compiler should almost always be able to inline opApply, as the code for the opApply body is always available.
>
> There are few exceptions, such as when opApply is not final, or when it's recursive.  I wonder if even in these cases, some sort of "virtual inlining" such as pushing the code onto the stack and avoiding using function calls would be possible (speaking from naivety, maybe this does nothing).  Being able to exploit the fact that a delegate literal is always fully available would be nice.
>
> Indeed, opApply should beat the pants off of ranges when the range primitives are virtual functions.  But ranges should win when inlining is possible in some cases.
>
> There are always going to be use cases for opApply that ranges cannot duplicate (such as recursion), and vice versa (such as parallel iteration).
>
> -Steve

Inlining the body would necessitate one parameterized function per caller. That is quite a lot of code, and it doesn't even have defined mangling, does it?
It's a different matter when both the opApply function and the body are inlined at the call site.
Also, when inlining the body it is much more difficult to assign registers to variables from the calling stack, unless you can prove that nobody else can refer to them.

I think a better approach is to revive the templated 'opApply(alias fbody)'
idea as an alternative to the delegate-based one.
But there are currently too many issues in the way of doing this.

Constructing similar functionality today requires quite some C++-ish gymnastics, partly due to bugs and partly for the reasons discussed here:
http://d.puremagic.com/issues/show_bug.cgi?id=5710.
----
import std.range, std.stdio;

struct Context
{
    // doesn't even work as static method
    alias Context_forEach forEach;
    uint[] data;
}

// doesn't work at module scope when erroneously prefixed with static ???
/*static*/
void Context_forEach(alias fbody)(ref Context self)
{
    foreach(e; self.data)
        fbody(e);
}

void main()
{
    size_t sum;
    auto ctx = Context(array(iota(0U, 500U)));
    Context.forEach!((a) {
        sum += a;
    })(ctx);
    writeln(sum);
}

martin
November 01, 2011
Thanks for the explanation, guys. That definitely makes me feel better about using ranges in my code (though for plain iteration opApply does seem easier to implement, since it requires only a single operator rather than conforming to a specific naming convention). When replacing the modulus of 2 with a modulus of 100, the results are indeed very close, showing that the performance issue is not inherent to ranges vs. opApply. Which is rather surprising to me, actually. Optimization ftw, I guess. :P

On 30/10/2011 10:45 AM, Martin Nowak wrote:
> On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 10/30/11 3:02 AM, Kapps wrote:
>>> As it stands right now, ranges are quite inefficient. Phobos is
>>> designed with an emphasis on ranges, including in
>>> performance-sensitive code such as std.algorithm. No matter how
>>> optimized the range is, it will always be slower than the equivalent
>>> opApply function (and, as far as I can tell, offers no benefit for
>>> just iteration).
>>>
>>
>> The test is flawed in a subtle way and no conclusion can be drawn from
>> it.
>>
>> The test filters a checkerboard distribution (one element satisfies/the
>> next doesn't satisfy the filter). As such, the loop on lines 26-29 will
>> always test exactly two times to take one step. In contrast, the loop at
>> lines 75-81 is integrated and only makes the minimum amount of tests.
>> Calling the delegate is expensive, but there are only half as many calls
>> and the saved tests compensate for that.
>>
> No it does not.
> In fact the opApply version does ONE additional check at loop start, which
> doesn't matter for 2.5M iterations.
>
>> I believe there is nothing inherent in the design of ranges that makes
>> them generally less efficient than various alternatives. Yes, there
>> will be the corner cases where internal iteration will be better, but
>> overall I don't think they form a large fraction of all cases.
>>
>>
>> Andrei
>
> The test still doesn't measure anything but spurious effects of
> different inline decisions.
> I get a ~10% hit for range foreach over opApply.
> When looping without predicate than popFront is inlined and the result
> is a ~10% hit for opApply.
> As the opApply body can never be inlined it is a worse choice in general.
>
> martin