Thread: Performance of ranges versus direct code
December 11, 2013
Has anyone done any work comparing the performance of ranges vs. direct, straightforward code (optimized by hand in the same way that ranges work)?

e.g., how does filter compare to a simple loop over the elements with an explicit comparison? How does a chain of UFCS calls compare to doing the same in a single loop?
December 11, 2013
Am Wed, 11 Dec 2013 08:29:38 +0100
schrieb "Frustrated" <c1514843@drdrb.com>:

> 
> Has anyone done any work comparing the performance of ranges vs. direct, straightforward code (optimized by hand in the same way that ranges work)?
> 
> e.g., how does filter compare to a simple loop over the elements with an explicit comparison? How does a chain of UFCS calls compare to doing the same in a single loop?

Just use hand-written loops if you are worried about this, or benchmark on a per-case basis; there is no one-size-fits-all answer. Not all code executes a million times per second, though, so feel free to use ranges now and then where concise code matters more than premature optimization. ;)

-- 
Marco
December 11, 2013
On Wednesday, 11 December 2013 at 07:29:39 UTC, Frustrated wrote:
>
> Has anyone done any work comparing the performance of ranges vs. direct, straightforward code (optimized by hand in the same way that ranges work)?
>
> e.g., how does filter compare to a simple loop over the elements with an explicit comparison? How does a chain of UFCS calls compare to doing the same in a single loop?

In my very unscientific experiments:
range-based, functional-style D code runs at roughly 75-100% of the speed of the equivalent explicit iterative version.

A lot of the performance loss is down to missed optimisations, in particular inlining. gdc/ldc are good at this; dmd is not so good, so you'll likely see bigger performance gaps with dmd than with the other compilers.


General advice: use ranges, std.range and std.algorithm wherever possible. Profile the resulting code and write out the hotspots in an imperative style to see if you can squeeze out some extra speed.
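As a concrete illustration of the two styles being compared (a minimal sketch; the pipeline, bounds, and predicate are made up for illustration, not taken from anyone's benchmark):

```d
import std.algorithm : filter, map, sum;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // Range-based, functional style: a lazy pipeline, no intermediate arrays.
    auto a = iota(0, 1000)
        .filter!(x => x % 3 == 0)
        .map!(x => x * x)
        .sum();

    // Equivalent hand-written imperative version.
    long b = 0;
    foreach (x; 0 .. 1000)
    {
        if (x % 3 == 0)
            b += x * x;
    }

    assert(a == b);  // same result; only the style differs
    writeln(a);
}
```

Whether the two compile to comparable machine code depends largely on how well the compiler inlines the lambdas and range primitives, which is exactly the point made above.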
December 11, 2013
On 11/12/13 11:10, John Colvin wrote:
> A lot of the performance loss is down to missed optimisations, in particular
> inlining.

Simple example:

    foreach(i; iota(0, 10)) { ... }

should be as fast as

    foreach(i; 0 .. 10) { ... }

but isn't.  I remember Andrei noting that this ought to be easily fixable [*], but I don't know if that actually happened, because I haven't compared the two recently.

[* If I recall right, it's achievable by special-casing iota when the increment is 1, but don't quote me on that.]

> General advice: use ranges, std.range and std.algorithm wherever possible.
> Profile the resulting code and write out the hotspots in an imperative style to
> see if you can squeeze out some extra speed.

Yes -- use ranges as much as possible because the resulting code will be so much easier to read and maintain.  Switching to imperative style is only worth it if there's a speed problem (read: "The speed is unsatisfactory for your use case and/or inferior to rival programs doing the same thing.").
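The iota-vs-slice comparison above can be measured with std.datetime.benchmark, the same facility used later in this thread (a sketch; the iteration counts are arbitrary and timings will vary with compiler and flags):

```d
import std.datetime : benchmark;
import std.range : iota;
import std.stdio : writefln;

int sink;  // accumulate into a global so the loops aren't optimised away

void viaIota()
{
    foreach (i; iota(0, 10_000)) sink += i;
}

void viaSlice()
{
    foreach (i; 0 .. 10_000) sink += i;
}

void main()
{
    // Run each candidate 1000 times and report elapsed milliseconds.
    auto r = benchmark!(viaIota, viaSlice)(1_000);
    writefln("iota:  %s ms", r[0].msecs);
    writefln("slice: %s ms", r[1].msecs);
}
```

With good inlining the two loops should produce identical timings; a gap indicates the iota range primitives were not fully optimised away.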

December 12, 2013
Am Wed, 11 Dec 2013 17:40:39 +0100
schrieb Joseph Rushton Wakeling <joseph.wakeling@webdrake.net>:

> On 11/12/13 11:10, John Colvin wrote:
> > A lot of the performance loss is down to missed optimisations, in particular inlining.
> 
> Simple example:
> 
>      foreach(i; iota(0, 10)) { ... }
> 
> should be as fast as
> 
>      foreach(i; 0 .. 10) { ... }
> 
> but isn't.  I remember Andrei noting that this ought to be easily fixable [*], but I don't know if that actually happened, because I haven't compared the two recently.

As far as I remember it *is* the same speed using GDC or LDC. But as long as this issue remains with DMD, I'm confident that foreach_reverse(i; 0 .. 10) will stay :)

-- 
Marco

December 12, 2013
On Wednesday, 11 December 2013 at 16:40:48 UTC, Joseph Rushton Wakeling wrote:
> [* If I recall right, it's achievable by special-casing iota when the increment is 1, but don't quote me on that.]

That shouldn't be necessary if the iota operations are inlined (and if they're not, that is a much greater concern!). The increment is then known to be a particular compile-time constant and code is generated appropriately.

Also, on x86 the gcc and llvm backends both prefer to use add 1 instead of inc, so you wouldn't even expect to see any difference at that level.
December 13, 2013
Am Thu, 12 Dec 2013 16:02:28 +0100
schrieb "John Colvin" <john.loughran.colvin@gmail.com>:

> On Wednesday, 11 December 2013 at 16:40:48 UTC, Joseph Rushton Wakeling wrote:
> > [* If I recall right, it's achievable by special-casing iota when the increment is 1, but don't quote me on that.]
> 
> That shouldn't be necessary if the iota operations are inlined (which, if their not, is a much greater concern!). The increment is then known to be a particular compile-time constant and code generated appropriately.
> 
> Also, on x86 the gcc and llvm backends both prefer to use add 1 instead of inc, so you wouldn't even expect to see any difference at that level.

This doesn't seem to be the general case... http://stackoverflow.com/questions/12163610/why-inc-and-add-1-have-different-performances

"For the x86 architecture, INC updates on a subset of the condition codes, whereas ADD updates the entire set of condition codes. (Other architectures have different rules so this discussion may or may not apply).

So an INC instruction must wait for other previous instructions that update the condition code bits to finish, before it can modify that previous value to produce its final condition code result.

ADD can produce final condition code bits without regard to previous values of the condition codes, so it doesn't need to wait for previous instructions to finish computing their value of the condition codes."

- and -

"On x86 architectures, with variable-length instruction encoding, there could be occasions where either one is preferable over the other. If the shorter one would fit in a cache line or decode block where the larger wouldn't, then it will come out ahead. If the shorter one would leave half of the next instruction in the window, and the remaining half in the next window, the larger one might be better by aligning its successor nicely."

-- 
Marco

December 13, 2013
I found that the performance of ranges is miserable in many cases. You should not use them if there is any chance of big/critical computations. Actually, I don't like having two subsets of the language: one with good performance, and another with good maintainability/readability. I have a couple of thoughts about this situation. Maybe we can use CTFE and mixin to inline some range-based code at the library level?

FYI One of my benchmark (calculate sum of two simple ranges):
#!/usr/bin/rdmd

module zip_test;

import std.range;
import std.datetime;
import std.getopt;
import std.stdio;
import std.algorithm;


struct Range{
        uint count = 0;
        @property bool empty(){
                return count >= limit;
        }
        @property int front(){
                return count;
        }
        // The range protocol expects popFront to return void.
        void popFront(){
                count++;
        }
}

uint testForEach(){
        Range r1, r2;
        uint rs = 0;
        for(int i=0; !r1.empty && !r2.empty; i++){
                rs += r1.front + r2.front;
                r1.popFront();
                r2.popFront();
        }
        writefln("foreach: %s", rs);
        return rs;
}
uint testZip(){
        Range r1, r2;
        auto tmpRange = step1(r1, r2);
        auto rs = reduce!(q{ a + b })(0U, tmpRange);
//      auto rs = reduce!((a,b) => a + b)(0U, tmpRange);
        writefln("zip: %s", rs);
        return rs;
}

auto step1(Range r1, Range r2){
        //return zip(r1, r2).map!(a => a[0]+a[1]);
        return zip(r1, r2).map!(q{a[0]+a[1]});
}

uint limit = 10_000;

void main(string[] args){
        getopt( args, "limit", &limit);
        auto r = benchmark!(testForEach, testZip)(10);
        foreach(idx, rs; r){
                writefln("%s - %s", idx, rs.msecs);
        }
}

Results:
~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline zip_test.d
./zip_test --limit 10000000
…........
0 - 0
1 - 764
December 13, 2013
On Friday, 13 December 2013 at 18:03:28 UTC, Nikolay wrote:
>
> Results:
> ~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline zip_test.d
> ./zip_test --limit 10000000
> …........
> 0 - 0
> 1 - 764

And now try -inline-threshold=1024 ;-)
Reading -help-hidden sometimes helps.
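Putting the suggestion together with the invocation from the post above, the full command would presumably be something like (paths as in Nikolay's post; the flag is a hidden LLVM inlining threshold exposed via -help-hidden):

```
~/ldc2-0.12.0-linux-x86_64/bin/ldmd2 -O -release -inline -inline-threshold=1024 zip_test.d
```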
December 14, 2013
> And now try -inline-threshold=1024 ;-)
> Reading -help-hidden sometimes helps.

Thanks. It is a real game changer. I take back my words about ranges.