June 09, 2017
On Friday, 9 June 2017 at 21:11:50 UTC, Steven Schveighoffer wrote:
> Just to show you what I meant, I changed your code to eliminate the functors completely, the main function now looks like this:
>
>
>     foreach (i;  0 .. N)
>     {
>         insertionSort!((a, b) => lt(a, b))(v);
>         insertionSort!((a, b) => lt(b, a))(v);
>     }
>
> I'm sure there's also a way to reduce the initialization of the array to a few lines (or maybe even one?), but didn't have time to think about it.

Thanks for your hints. I'm sure there are many things to improve (also in the C++ version). It should be pretty obvious that my knowledge of D is lacking.


> Well, D is pretty fast, as fast as C++ or C. What I mean is that there is no inherent overhead -- both can produce exactly the same code.

I agree.


> However, there are some parts of C/C++ that have been optimized to death. It's one of those things where D's version of rotate probably hasn't had as much scrutiny as C++'s version. We are always looking to improve such things, and more investigation and suggestions are most welcome! It's why I filed the bug report.

Thank you for filing the bug!

bringToFrontImpl does not seem to exploit bidirectional or random access properties.


> Try to find something besides insertion sort to test with I think ;) D is pretty fast at a lot of things, and pleasant to write. You just found one test case that isn't (and we will fix it, I'm sure).

I guess that benchmarking comparison of string tuples will not result in happy faces unless a single-pass version of the comparison function is used.
June 09, 2017
On Friday, 9 June 2017 at 20:27:58 UTC, Ali Çehreli wrote:
> On 06/09/2017 12:29 PM, Honey wrote:
>
> > I think, I should not rely on standard library facilities.
>
> I think you hit a Phobos function with particularly bad performance (which can be improved). Phobos is not a slow library in general.

What I meant to say is: the comparison would have been less biased if I had written the complete algorithm without relying on different standard library abstractions (iterators vs. ranges).
June 09, 2017
On 06/09/2017 03:46 PM, Honey wrote:
> On Friday, 9 June 2017 at 20:27:58 UTC, Ali Çehreli wrote:
>> On 06/09/2017 12:29 PM, Honey wrote:
>>
>> > I think, I should not rely on standard library facilities.
>>
>> I think you hit a Phobos function with particularly bad performance
>> (which can be improved). Phobos is not a slow library in general.
>
> What I meant to say is: the comparison would have been less biased if I
> had written the complete algorithm without relying on different standard
> library abstractions (iterators vs. ranges).

You would get the exact performance if you implemented e.g. with pointers. Your test has been very valuable for exposing an embarrassing performance issue. :)

Ali

June 10, 2017
On Friday, 9 June 2017 at 23:10:28 UTC, Ali Çehreli wrote:
> You would get the exact performance if you implemented e.g. with pointers. Your test has been very valuable for exposing an embarrassing performance issue. :)

I can confirm that, after changing implementation to the following, performance is on par. It is not immediatetly clear to me how

 r[r[0 .. i].getTransitionIndex!Less(r[i]) .. i + 1]

would look like if written idiomatically.


size_t getTransitionIndex(alias test, Range, V)(Range r, V v)
if (isRandomAccessRange!Range && hasLength!Range)
{
    size_t first = 0;
    size_t count = r.length;
    while (count > 0)
    {
        immutable step = count / 2;
        immutable it = first + step;
        if (!test(v, r[it]))
        {
            first = it + 1;
            count -= step + 1;
        }
        else
        {
            count = step;
        }
    }
    return first;
}

void rotateRight(Range)(Range r)
if (hasLength!Range && isRandomAccessRange!Range && hasSlicing!Range)
{
   auto t = r[$ - 1];
   foreach_reverse (i; 0 .. r.length - 1)
   {
      r[i + 1] = r[i];
   }
   r[0] = t;
}

void insertionSort(alias Less, Range)(Range r)
if (hasLength!Range && isRandomAccessRange!Range && hasSlicing!Range)
{
   foreach (i; 1 .. r.length)
   {
      rotateRight(r[r[0 .. i].getTransitionIndex!Less(r[i]) .. i + 1]);
   }
}
June 10, 2017
On 6/10/17 5:00 AM, Honey wrote:
> On Friday, 9 June 2017 at 23:10:28 UTC, Ali Çehreli wrote:
>> You would get the exact performance if you implemented e.g. with
>> pointers. Your test has been very valuable for exposing an
>> embarrassing performance issue. :)
>
> I can confirm that, after changing implementation to the following,
> performance is on par. It is not immediatetly clear to me how
>
>  r[r[0 .. i].getTransitionIndex!Less(r[i]) .. i + 1]
>
> would look like if written idiomatically.

I wrote it like this, which confirms that it's indeed bringToFront (I tried your getTransitionIndex function, but that didn't change timings at all):

void insertionSort(alias Less, Range)(Range r)
if (hasLength!Range && isRandomAccessRange!Range && hasSlicing!Range)
{
    foreach (immutable i; 1 .. r.length)
    {
        auto ubElem = i - r[0 .. i].assumeSorted!Less.upperBound(r[i]).length;
        r[ubElem .. i+1].rotateRight;
    }
}

On my mac, it's completing in about 4.4 seconds, whereas the C++ version completes in around 3.

Optimized your rotateRight a bit to get better performance when we can use memmove:

void rotateRight(Range)(Range r)
    if (hasLength!Range && isRandomAccessRange!Range && hasSlicing!Range)
{
    auto t = r[$ - 1];
    static if(isDynamicArray!Range)
    {
        import core.stdc.string;
        memmove(r.ptr + 1, r.ptr, (r.length - 1) * r[0].sizeof);
    }
    else
    {
        foreach_reverse (i; 0 .. r.length - 1)
        {
            r[i + 1] = r[i];
        }
    }
    r[0] = t;
}


Brings it to exactly c++ performance :)

I'll update the bug report.

-Steve
June 10, 2017
On Friday, 9 June 2017 at 16:21:22 UTC, Honey wrote:
> What seems particularly strange to me is that -boundscheck=off leads to a performance decrease.

Strange indeed.
`-release` should be synonymous with `-release -boundscheck=off`.
Investigating...

- Johan

June 10, 2017
On Saturday, 10 June 2017 at 11:43:06 UTC, Johan Engelen wrote:
> On Friday, 9 June 2017 at 16:21:22 UTC, Honey wrote:
>> What seems particularly strange to me is that -boundscheck=off leads to a performance decrease.
>
> Strange indeed.
> `-release` should be synonymous with `-release -boundscheck=off`.

Nope it's not.
http://www.digitalmars.com/d/archives/digitalmars/D/What_s_the_deal_with_-boundscheck_260237.html
June 10, 2017
On Saturday, 10 June 2017 at 10:59:24 UTC, Steven Schveighoffer wrote:
> I wrote it like this, which confirms that it's indeed bringToFront (I tried your getTransitionIndex function, but that didn't change timings at all):
>
> void insertionSort(alias Less, Range)(Range r)
> if (hasLength!Range && isRandomAccessRange!Range && hasSlicing!Range)
> {
>     foreach (immutable i; 1 .. r.length)
>     {
>         auto ubElem = i - r[0 .. i].assumeSorted!Less.upperBound(r[i]).length;
>         r[ubElem .. i+1].rotateRight;
>     }
> }

Taking the length of upperBound is a great move! ;-)


> On my mac, it's completing in about 4.4 seconds, whereas the C++ version completes in around 3.

On my machine, I am getting the same performance even without resorting to memmove. Using getTransitionIndex directly, however, leads to a neglectable improvement (about 2.7 vs. 2.8 seconds), here.
June 10, 2017
On Saturday, 10 June 2017 at 11:53:44 UTC, Johan Engelen wrote:
>> `-release` should be synonymous with `-release -boundscheck=off`.
>
> Nope it's not.
> http://www.digitalmars.com/d/archives/digitalmars/D/What_s_the_deal_with_-boundscheck_260237.html

Thank you for clarifying that point. Is it expected that turning off bounds checking can lead to a performance decrease?
June 10, 2017
On Saturday, 10 June 2017 at 12:16:34 UTC, Honey wrote:
> On Saturday, 10 June 2017 at 11:53:44 UTC, Johan Engelen wrote:
>>> `-release` should be synonymous with `-release -boundscheck=off`.
>>
>> Nope it's not.
>> http://www.digitalmars.com/d/archives/digitalmars/D/What_s_the_deal_with_-boundscheck_260237.html
>
> Thank you for clarifying that point. Is it expected that turning off bounds checking can lead to a performance decrease?

Yes, with it on you are doing an "is the index <= the length" for every array access. Now some of them can be elided by the complier when it can prove that the index is always in bounds but it is generally dangerous to do so as it opens up the possibility of buffer overflow.