June 04, 2018
On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:
> On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:
>> Note, it's not going to necessarily be as efficient, but it's likely to be close.
>>
>> -Steve
>
> I've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow.
> https://run.dlang.io/is/BoTflQ
>
> 50 MB array, type = byte, stride = 3, compiler = LDC -O4 -release
> For-loop  18 ms
> Fill(0)   33 ms
> each!     33 ms
>
> With stride = 13:
> For-loop  7.3 ms
> Fill(0)   7.5 ms
> each!     7.8 ms


This is why I wanted to make sure! I would be using it for a stride of 2, and it seems it might have doubled the cost for no other reason than using ranges. Ranges are great, but one can't reason about what is happening in them as easily as with a direct loop, so I wanted to be sure. Thanks for running the test!
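For context, the three variants being compared presumably look something like this (a sketch with illustrative names; the exact benchmark is in the linked gist):

```d
import std.algorithm : each, fill;
import std.range : stride;

void main()
{
    auto arr = new byte[](50_000_000);
    enum strideAmount = 3;

    // 1. plain for-loop
    for (size_t i = 0; i < arr.length; i += strideAmount)
        arr[i] = 0;

    // 2. range-based fill
    arr.stride(strideAmount).fill(cast(byte) 0);

    // 3. range-based each!
    arr.stride(strideAmount).each!((ref byte b) => b = 0);
}
```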



June 04, 2018
On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:
> BTW, do you have cross-module inlining on?

Just to drive this point home.

https://run.dlang.io/is/nrdzb0

Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged.

17 ms, 891 μs, and 6 hnsecs
15 ms, 694 μs, and 1 hnsec
15 ms, 570 μs, and 9 hnsecs

My new stride outperformed both std.range's stride and the manual for-loop. And because the third test uses the new stride, it also benefited. Interestingly, it runs ever so slightly faster...
June 05, 2018
On Monday, 4 June 2018 at 23:08:17 UTC, Ethan wrote:
> On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:
>> BTW, do you have cross-module inlining on?
>
> Just to drive this point home.
>
> https://run.dlang.io/is/nrdzb0
>
> Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged.
>
> 17 ms, 891 μs, and 6 hnsecs
> 15 ms, 694 μs, and 1 hnsec
> 15 ms, 570 μs, and 9 hnsecs
>
> My new stride outperformed both std.range's stride and the manual for-loop. And because the third test uses the new stride, it also benefited. Interestingly, it runs ever so slightly faster...

Just as an aside:

    ...
    pragma( inline ) @property length() const { return range.length / strideCount; }
    pragma( inline ) @property empty() const { return currFront >= range.length; }
    pragma( inline ) @property ref Elem front() { return range[ currFront ]; }
    pragma( inline ) void popFront() { currFront += strideCount; }
    ...

    pragma( inline ) auto stride( Range )( Range r, int a )
    ...

    pragma( inline ) auto fill( Range, Value )( Range r, Value v )
    ...

pragma(inline), without any argument, does not force inlining. It actually does nothing; it just specifies that the "implementation's default behaviour" should be used. You have to annotate with pragma(inline, true) to force inlining (https://dlang.org/spec/pragma.html#inline).
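A minimal runnable sketch of the difference (illustrative names, not Ethan's actual code):

```d
struct Stride(Range)
{
    Range range;
    size_t strideCount;
    size_t currFront;

    // pragma(inline, true) *forces* inlining (the compiler errors out if
    // it cannot comply); a bare pragma(inline) merely restores the
    // implementation's default behaviour.
    pragma(inline, true) @property bool empty() const { return currFront >= range.length; }
    pragma(inline, true) @property auto ref front() { return range[currFront]; }
    pragma(inline, true) void popFront() { currFront += strideCount; }
}

auto stride(Range)(Range r, size_t n) { return Stride!Range(r, n); }

void main()
{
    auto a = [1, 2, 3, 4, 5, 6];
    foreach (ref x; a.stride(2))
        x = 0;
    assert(a == [0, 2, 0, 4, 0, 6]);
}
```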

When I change all the pragma(inline) to pragma(inline, true), there is a non-trivial speedup:

14 ms, 517 μs, and 9 hnsecs
13 ms, 110 μs, and 1 hnsec
13 ms, 199 μs, and 9 hnsecs

There's further reductions using ldc-beta:

14 ms, 520 μs, and 4 hnsecs
13 ms, 87 μs, and 2 hnsecs
12 ms, 938 μs, and 8 hnsecs
June 05, 2018
On Tuesday, 5 June 2018 at 03:13:05 UTC, Meta wrote:
>
> 14 ms, 520 μs, and 4 hnsecs
> 13 ms, 87 μs, and 2 hnsecs
> 12 ms, 938 μs, and 8 hnsecs

When using `dmd -inline -O -release` with an extra simd benchmark I get:

for loop:    21 ms, 291 μs, and 6 hnsecs
stride/fill: 64 ms, 927 μs, and 9 hnsecs
stride/each: 52 ms, 740 μs, and 8 hnsecs
simd &=:     6 ms, 900 μs, and 8 hnsecs

https://run.dlang.io/gist/5fe73cbf9943aa57be1101e597bb25e4?args=-inline%20-O%20-release

Though the simd version does not work in ldc...
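The "simd &=" approach presumably works by AND-ing the array 16 bytes at a time against a zeroing mask. For the stride-2 case it might be sketched like this (illustrative only, not the gist's code; core.simd vector operations are platform- and compiler-dependent):

```d
import core.simd;

// Zero every second byte by AND-ing 16-byte chunks with a fixed mask.
// Handles only stride 2 and lengths that are multiples of 16; a stride
// like 3 needs a rotating set of masks (period lcm(3, 16) = 48 bytes).
void clearEverySecond(ubyte[] data)
{
    ubyte16 mask = [0, 0xFF, 0, 0xFF, 0, 0xFF, 0, 0xFF,
                    0, 0xFF, 0, 0xFF, 0, 0xFF, 0, 0xFF];
    foreach (i; 0 .. data.length / 16)
    {
        auto chunk = cast(ubyte16*)(data.ptr + i * 16);
        *chunk &= mask;
    }
}
```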
June 05, 2018
On Tuesday, 5 June 2018 at 05:05:47 UTC, David Bennett wrote:
> On Tuesday, 5 June 2018 at 03:13:05 UTC, Meta wrote:
>>
>> 14 ms, 520 μs, and 4 hnsecs
>> 13 ms, 87 μs, and 2 hnsecs
>> 12 ms, 938 μs, and 8 hnsecs
>
> When using `dmd -inline -O -release` with an extra simd benchmark I get:
>
> for loop:    21 ms, 291 μs, and 6 hnsecs
> stride/fill: 64 ms, 927 μs, and 9 hnsecs
> stride/each: 52 ms, 740 μs, and 8 hnsecs
> simd &=:     6 ms, 900 μs, and 8 hnsecs
>
> https://run.dlang.io/gist/5fe73cbf9943aa57be1101e597bb25e4?args=-inline%20-O%20-release
>
> Though the simd version does not work in ldc...

Here's a version that works in ldc:

https://run.dlang.io/gist/1d4bb542427fb82cc455fe9dc30185d7?compiler=ldc&args=-inline%20-O4%20-release

for loop:    16 ms, 594 μs, and 1 hnsec
stride/fill: 14 ms, 918 μs, and 9 hnsecs
stride/each: 14 ms and 813 μs
simd &=:     7 ms, 153 μs, and 6 hnsecs

June 05, 2018
On 6/4/18 5:52 PM, DigitalDesigns wrote:
> On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:
>> On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:
>>> Note, it's not going to necessarily be as efficient, but it's likely to be close.
>>>
>>> -Steve
>>
>> I've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow.
>> https://run.dlang.io/is/BoTflQ
>>
>> 50 MB array, type = byte, stride = 3, compiler = LDC -O4 -release
>> For-loop  18 ms
>> Fill(0)   33 ms
>> each!     33 ms
>>
>> With stride = 13:
>> For-loop  7.3 ms
>> Fill(0)   7.5 ms
>> each!     7.8 ms
> 
> 
> This is why I wanted to make sure! I would be using it for a stride of 2, and it seems it might have doubled the cost for no other reason than using ranges. Ranges are great, but one can't reason about what is happening in them as easily as with a direct loop, so I wanted to be sure. Thanks for running the test!

See later postings from Ethan and others. It's a matter of optimization being able to see the "whole thing". This is why for loops are sometimes better. It's not inherent with ranges, but if you use the right optimization flags, it's done as fast as if you hand-wrote it.

What I've found with D (and especially LDC) is that when you give the compiler everything to work with, it can do some seemingly magic things.

-Steve
June 05, 2018
On 06/04/2018 07:08 PM, Ethan wrote:
> On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:
>> BTW, do you have cross-module inlining on?
> 
> Just to drive this point home.
> 
> https://run.dlang.io/is/nrdzb0
> 
> Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged.
> 
> 17 ms, 891 μs, and 6 hnsecs
> 15 ms, 694 μs, and 1 hnsec
> 15 ms, 570 μs, and 9 hnsecs
> 
> My new stride outperformed both std.range's stride and the manual for-loop. And because the third test uses the new stride, it also benefited. Interestingly, it runs ever so slightly faster...

BTW I've had this thought for a long time to implement stride with a compile-time step... never got around to implementing it. It would easily generalize the existing code without too much work. Essentially the step would be a template parameter; if that is 0, then use a run-time stride. Most of the code works unchanged.
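That idea might be sketched roughly as follows (a hypothetical outline, not std.range's actual implementation):

```d
// Sketch: the step as a template parameter; step == 0 selects a
// run-time stride, so most of the existing code works unchanged.
struct Stride(size_t step, Range)
{
    Range range;
    static if (step == 0)
        size_t runtimeStep;   // chosen at run time
    size_t currFront;

    private size_t actualStep() const
    {
        static if (step == 0) return runtimeStep;
        else return step;     // a compile-time constant the optimizer can exploit
    }

    @property bool empty() const { return currFront >= range.length; }
    @property auto ref front() { return range[currFront]; }
    void popFront() { currFront += actualStep(); }
}

auto stride(size_t step = 0, Range)(Range r, size_t runtimeStep = 0)
{
    static if (step == 0)
        return Stride!(0, Range)(r, runtimeStep);
    else
        return Stride!(step, Range)(r);
}
```

With this, `arr.stride!2` bakes the step into the type, while `arr.stride(2)` keeps the current run-time behaviour.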
June 05, 2018
On Tuesday, 5 June 2018 at 13:05:56 UTC, Steven Schveighoffer wrote:
> On 6/4/18 5:52 PM, DigitalDesigns wrote:
>> On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:
>>> On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:
>>>> Note, it's not going to necessarily be as efficient, but it's likely to be close.
>>>>
>>>> -Steve
>>>
>>> I've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow.
>>> https://run.dlang.io/is/BoTflQ
>>>
>>> 50 MB array, type = byte, stride = 3, compiler = LDC -O4 -release
>>> For-loop  18 ms
>>> Fill(0)   33 ms
>>> each!     33 ms
>>>
>>> With stride = 13:
>>> For-loop  7.3 ms
>>> Fill(0)   7.5 ms
>>> each!     7.8 ms
>> 
>> 
>> This is why I wanted to make sure! I would be using it for a stride of 2, and it seems it might have doubled the cost for no other reason than using ranges. Ranges are great, but one can't reason about what is happening in them as easily as with a direct loop, so I wanted to be sure. Thanks for running the test!
>
> See later postings from Ethan and others. It's a matter of optimization being able to see the "whole thing". This is why for loops are sometimes better. It's not inherent with ranges, but if you use the right optimization flags, it's done as fast as if you hand-wrote it.
>
> What I've found with D (and especially LDC) is that when you give the compiler everything to work with, it can do some seemingly magic things.
>
> -Steve

It would be nice if testing could be done. Maybe even profiling in unit tests to make sure ranges are within some margin of error (10%). One of the main reasons I don't use ranges is that I simply don't have faith they will be as fast as direct coding. While they might offer slightly easier syntax, I don't know what is going on under the hood, so I can't reason about it (unless I look up the source). With a for loop, it is pretty much a wrapper on internal CPU logic, so it will be nearly as fast as possible.

I suppose in the long run ranges do have the potential to outperform, since they do abstract, but there is no guarantee they will even come close. Having some "proof" that they are working well would ease my mind. As this thread shows, ranges have some major issues. Imagine having some code on your machine that is very performant, but on another machine in slightly different circumstances it runs poorly. Now, say it is the stride issue... One normally would not think of that being an issue, so one will look in other areas and could waste time. At least with direct loops you pretty much get what you see. It is very easy for ranges to be slow but more difficult for them to be fast.
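The profiling-in-unit-tests idea could look something like this sketch (the 10% threshold and names are illustrative; timing-based asserts are inherently flaky, so treat this as a smoke test at best):

```d
import std.algorithm : each;
import std.datetime.stopwatch : benchmark;
import std.range : stride;

unittest
{
    auto a = new ubyte[](1 << 20);
    auto b = new ubyte[](1 << 20);

    // Time the hand-written loop against the range pipeline.
    auto results = benchmark!(
        { for (size_t i = 0; i < a.length; i += 2) a[i] = 0; },
        { b.stride(2).each!((ref ubyte x) => x = 0); }
    )(50);

    // Flag a regression if the range version is more than 10% slower.
    assert(results[1] <= results[0] * 11 / 10);
}
```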

June 05, 2018
On 6/5/18 12:50 PM, DigitalDesigns wrote:

> I suppose in the long run ranges do have the potential to outperform since they do abstract but there is no guarantee they will even come close. Having some "proof" that they are working well would ease my mind. As this thread shows, ranges have some major issues. 

Just to point it out again, Ethan's numbers:

17 ms, 891 μs, and 6 hnsecs // for loop
15 ms, 694 μs, and 1 hnsec  // fill
15 ms, 570 μs, and 9 hnsecs // each

I think ranges are doing just fine. You just need cross-module inlining turned on.

> Imagine having some code on your machine that is very performant but on another machine in a slightly different circumstances it runs poorly. Now, say it is the stride issue... One normally would not think of that being an issue so one will look in other areas and could waste times. At least with direct loops you pretty much get what you see. It is very easy for ranges to be slow but more difficult for them to be fast.

The same can be said even with direct loops. There are no guarantees of course that one type of programming style is going to outperform another on all platforms and all compilers. My experience is that things that seem like they should be slower can sometimes be faster, and vice versa. It depends on so many factors, and the best thing to do is test and observe in each situation.

-Steve
June 05, 2018
On 05.06.2018 18:50, DigitalDesigns wrote:
> With a for loop, it is pretty much a wrapper on internal CPU logic, so it will be nearly as fast as possible.

This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right.

Also, it is often not necessary to be "as fast as possible". It is usually more helpful to figure out where the bottleneck is for your code and concentrate optimization effort there, which you can do more effectively if you can save time and effort for the remaining parts of your program by writing simple and obviously correct range-based code, which often will be fast as well.