protocol for using InputRanges (page 4) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » protocol for using InputRanges (page 4)

March 25, 2014

Re: protocol for using InputRanges

Posted by Andrei Alexandrescu
in reply to Walter Bright

Andrei Alexandrescu

Posted in reply to Walter Bright

On 3/25/14, 1:15 PM, Walter Bright wrote:
> Next, priming requires a buffer in the range.

No, front requires a buffer in the range. -- Andrei

March 25, 2014

Re: protocol for using InputRanges

Posted by Andrei Alexandrescu
in reply to Walter Bright

Andrei Alexandrescu

Posted in reply to Walter Bright

I thought I had only 1-2 comments, but I have a few more.

On 3/25/14, 1:15 PM, Walter Bright wrote:
> 1. the protocol is COMPLETELY undocumented and undefined.

s/COMPLETELY/loosely/

> Since ranges+algorithms are central to D, this is a very serious
> problem.

Agreed.

> We want to appeal to the high performance coders. To maximize
> performance, ranges should be optimized to the inner loop case, which is:
>
>      while (!r.empty) { auto e = r.front; ... e ...; r.popFront(); }

Actually generic code often prefers:

while (!r.empty) { ... r.front ...; r.popFront(); }

That saves copying r.front if it returns by ref. A bunch of algos in std do that.

> Since front is guaranteed to succeed if !empty, this puts a requirement
> on many ranges that they have a non-trivial constructor that 'primes the
> pump'. Of course, priming may fail, and so construction may throw, which
> is not good design.

Again, I disagree with this assertion.

> Next, priming requires a buffer in the range.

Priming has nothing do to with the range being buffered. The entire design of ranges is a one-element buffer.

> Buffers add size, making them slower to copy, and will often require
> another field saying if the buffer has data in it or not, further
> bumping the size.

That's an argument for adding an unbuffered range abstraction.

> All this saves for the user is one call to empty for the entire
> algorithm, at a cost incurred with every iteration. I.e. it selects O(n)
> to save O(1).

I don't understand how that O() reasoning works.

> B) I want to be able to call front multiple times in a row, and expect
> to get the same value.

Correct.

> This can force the range to incorporate a one element buffer and a flag
> indicating the status of that buffer.

It may, but many ranges naturally work that way (e.g. arrays).

> The range instance gets bigger and
> more expensive to copy, and the cost of manipulating the flag and the
> buffer is added to every loop iteration. Note that the user of a range
> can trivially use:
>       auto e = r.front;
>       ... using e multiple times ...
> instead.

That would pessimize code using arrays of large structs.

> The big problem with (A) and (B) is these costs become present in every
> range, despite being rarely needed and having simple alternatives. This
> is the wrong place to put cost and complexity. The user should be making
> these decisions based on his needs.
>
> Hence, I propose that the protocol for using input ranges be defined as:
>
>      while (!r.empty) { auto e = r.front; ... e ...; r.popFront(); }
>
> This makes it possible to build pipelines that are firehoses with no
> kinks or constrictions in them. It optimizes for the inner loop case,
> not boundary cases.

The proposed protocol pessimizes arrays of large structs and will trip the unwary if calling r.front again returns something else. I don't think the proposal is good.


Andrei

March 25, 2014

Re: protocol for using InputRanges

Posted by Walter Bright
in reply to monarch_dodra

Walter Bright

Posted in reply to monarch_dodra

On 3/25/2014 1:56 PM, monarch_dodra wrote:
> http://dlang.org/phobos/std_range.html#isInputRange
>
> The semantics of an input range (not checkable during compilation) are assumed
> to be the following (r is an object of type R):
> r.empty returns false iff there is more data available in the range.
> r.front returns the current element in the range. It may return by value or by
> reference. Calling r.front is allowed only if calling r.empty has, or would
> have, returned false.
> r.popFront advances to the next element in the range. Calling r.popFront is
> allowed only if calling r.empty has, or would have, returned false.

I overlooked that. Thanks.

>> We want to appeal to the high performance coders. To maximize performance,
>> ranges should be optimized to the inner loop case, which is:
>>
>>     while (!r.empty) { auto e = r.front; ... e ...; r.popFront(); }
>
> This makes the assumption that r.front is copy constructible at all.
> It also
> makes the assumption that you want to operate on a copy, rather than the actual
> element in the range.

It's a reasonable requirement. If your range has an issue with this, it can return a pointer to the element, and the element can be a struct with access functions. Then, the pointer will work as well as a copy.

> Finally, it means having to declare a local object: It merely means shifting the
> burden of caching from one context to another. If the object is large, chances
> are you are better off just calling front instead of making a copy. Especially
> if the loop is trivial.

This does come up as an issue, and is solvable by returning a pointer as I described. It's up to the designer of the range.

> If you want high performance, then arguably, just provide a O(1) front, and a
> O(1) empty.

I don't think the issues can be waved away so easily, or I wouldn't have brought this up.

> Also, certain ranges, such as "filter" *must* access the front of the previous
> range more than once.
> Unless you are suggesting we add a field for it to cache
> the result of the previous range?

This is putting the cost where it belongs - when needed on the user of a range, rather than on all ranges. It's the "pay only for what you need" idea rather than "pay regardless".

>> With the additional proviso that ranges are passed to algorithms by value, so
>> they should be cheap to copy. Cheap to copy usually means them being small.
> Unfortunately yes. That said, any range that does anything will have at least
> two fields, one of which is a slice, or comparable to in terms of size, so it's
> going to be big anyways. So you *are* better off passing by ref if you can
> regardless, unless your range is *really* trivial.

Many very useful ranges are trivial. Or at least they should be. An array, for example, is a trivial range.

>> A) I know that my range is not empty, so I can skip calling empty.
>>
>> Since front is guaranteed to succeed if !empty, this puts a requirement on
>> many ranges that they have a non-trivial constructor that 'primes the pump'.
>> Of course, priming may fail, and so construction may throw, which is not good
>> design.
>
> If the prime fails, then the range can simply be marked as empty. Then if you
> decide to skip calling empty anyways, it's your own fault.

Yes, one can add state flags to indicate failed construction, which I'll argue is an ugly design. After all, construction is supposed to construct an object or fail, not leave the 'constructed' object in a zombie state.

>> And lastly, it means that lazy ranges will be required to read the first
>> element, even if the range isn't then subsequently used, which defeats what
>> one would expect from a lazy range.
>
> I'm not yet convinced of adding special code for ranges that aren't used. I've
> heard of these kinds of ranges, but I've observed that when you declare one, it
> almost always ends up being used. I don't think we should waste efforts on this
> rare usecase.

It's not rare - it's the primary way ranges are used in C# Linq. Should we throw this entire category of use cases under the bus just to handle a convenience of not needing to call empty in some cases?

> Evaluating an element "on use" as opposed to "1 instruction before use" doesn't
> make much of a change in this context.

Except that it requires the use to start upon construction, which defeats any hope of separating construction of a pipeline from using it.

> I've found that if you are creative enough, you can usually design the range in
> such a way that it works efficiently, lazilly, and without flags.

That's not been my experience.

> I get where you are coming from, but it's simply not manageable in a generic
> fashion, if you want to be able to preserve all the power and the diversity of
> the ranges we have.
>
> The protocol you are suggesting would prevent us from doing a lot of the lovely
> things that ranges empowers us with.

Please show me such a case. Note that I've shown above that this power and diversity throws an entire use case category under the bus. Secondly, in my experiments with ranges, the power and diversity results in slower pipelines.

March 25, 2014

Re: protocol for using InputRanges

Posted by Timon Gehr
in reply to Andrei Alexandrescu

Timon Gehr

Posted in reply to Andrei Alexandrescu

On 03/25/2014 10:29 PM, Andrei Alexandrescu wrote:
>
>> B) I want to be able to call front multiple times in a row, and expect
>> to get the same value.
>
> Correct.

'map' may fail this criterion. In this case, is the blame put on the user?

March 25, 2014

Re: protocol for using InputRanges

Posted by monarch_dodra
in reply to Timon Gehr

monarch_dodra

Posted in reply to Timon Gehr

On Tuesday, 25 March 2014 at 22:54:14 UTC, Timon Gehr wrote:
> On 03/25/2014 10:29 PM, Andrei Alexandrescu wrote:
>>
>>> B) I want to be able to call front multiple times in a row, and expect
>>> to get the same value.
>>
>> Correct.
>
> 'map' may fail this criterion. In this case, is the blame put on the user?

Depends how you define "same" though :/ But yeah.

Also, shameless plug about my "cache" range adaptor:
https://github.com/D-Programming-Language/phobos/pull/1364

March 25, 2014

Re: protocol for using InputRanges

Posted by Walter Bright
in reply to Andrei Alexandrescu

Walter Bright

Posted in reply to Andrei Alexandrescu

On 3/25/2014 2:29 PM, Andrei Alexandrescu wrote:
>> We want to appeal to the high performance coders. To maximize
>> performance, ranges should be optimized to the inner loop case, which is:
>>
>>      while (!r.empty) { auto e = r.front; ... e ...; r.popFront(); }
>
> Actually generic code often prefers:
>
> while (!r.empty) { ... r.front ...; r.popFront(); }
>
> That saves copying r.front if it returns by ref.

Yes, I agree. The idiom foreach (ref e; r) { ... } must also work.

>> Since front is guaranteed to succeed if !empty, this puts a requirement
>> on many ranges that they have a non-trivial constructor that 'primes the
>> pump'. Of course, priming may fail, and so construction may throw, which
>> is not good design.
>
> Again, I disagree with this assertion.

The throwing part, or the not good design part?

>> Next, priming requires a buffer in the range.
> Priming has nothing do to with the range being buffered. The entire design of
> ranges is a one-element buffer.
>> Buffers add size, making them slower to copy, and will often require
>> another field saying if the buffer has data in it or not, further
>> bumping the size.
> That's an argument for adding an unbuffered range abstraction.

I don't know how that would fit into the ecosystem.

In any case, with the protocol I suggest, the designer of the range is free to distribute the functionality into the three functions, however it makes best sense. With arbitrarily calling any in any order, then all three need to have some sort of signalling mechanism between them.

>> All this saves for the user is one call to empty for the entire
>> algorithm, at a cost incurred with every iteration. I.e. it selects O(n)
>> to save O(1).
>
> I don't understand how that O() reasoning works.

I meant adding extra signalling with every iteration of the loop, rather than requiring the user to call empty before front.

>> B) I want to be able to call front multiple times in a row, and expect
>> to get the same value.
>
> Correct.
>
>> This can force the range to incorporate a one element buffer and a flag
>> indicating the status of that buffer.
>
> It may, but many ranges naturally work that way (e.g. arrays).

Sure, but there are far more range types than arrays.

>> The range instance gets bigger and
>> more expensive to copy, and the cost of manipulating the flag and the
>> buffer is added to every loop iteration. Note that the user of a range
>> can trivially use:
>>       auto e = r.front;
>>       ... using e multiple times ...
>> instead.
>
> That would pessimize code using arrays of large structs.

You're already requiring copying with the buffering requirement. And besides, if I was creating a range of expensive-to-copy objects, I would add a layer of indirection to cheapen it.

> The proposed protocol pessimizes arrays of large structs

Not really more than the existing protocol does (i.e. required buffering).

> and will trip the unwary if calling r.front again returns something else.

I'm not proposing that calling them wrongly would make things unsafe. But I don't think it's unreasonable to expect the protocol to be followed, just like file open/read/close.

> I don't think the proposal is good.

I don't like being in the position of when I need high performance code, I have to implement my own ranges & algorithms, or telling customers they need to do so.

March 26, 2014

Re: protocol for using InputRanges

Posted by Regan Heath
in reply to Walter Bright

Regan Heath

Posted in reply to Walter Bright

On Tue, 25 Mar 2014 23:22:18 -0000, Walter Bright <newshound2@digitalmars.com> wrote:
> On 3/25/2014 2:29 PM, Andrei Alexandrescu wrote:
>>> The range instance gets bigger and
>>> more expensive to copy, and the cost of manipulating the flag and the
>>> buffer is added to every loop iteration. Note that the user of a range
>>> can trivially use:
>>>       auto e = r.front;
>>>       ... using e multiple times ...
>>> instead.
>>
>> That would pessimize code using arrays of large structs.
>
> You're already requiring copying with the buffering requirement. And besides, if I was creating a range of expensive-to-copy objects, I would add a layer of indirection to cheapen it.

Surely you'd simply start with a range of pointers to expensive-to-copy objects?  Or, return them by reference from the underlying range/array/source.  You want to avoid *ever* copying them except explicitly where required.

>> The proposed protocol pessimizes arrays of large structs
>
> Not really more than the existing protocol does (i.e. required buffering).
>
>> and will trip the unwary if calling r.front again returns something else.
>
> I'm not proposing that calling them wrongly would make things unsafe. But I don't think it's unreasonable to expect the protocol to be followed, just like file open/read/close.

My immediate expectation upon encountering ranges was that r.front would return the same item repeatedly until r.popFront was called.  Breaking that guarantee will trip a *lot* of people up.

IMO the rules should be something like:
 - r.empty WILL return false if there is more data available in the range.

 - r.empty MUST be called before r.front, r.front WILL succeed if r.empty returned false.
 - r.front WILL repeatably return the current element in the range. It MAY return by value or by reference.

 - r.empty SHOULD be called before r.popFront, otherwise r.popFront MAY do nothing or throw
   (could also make this one a 'MUST')
 - r.popFront WILL advance to the next element in the range.

 - "lazy" ranges SHOULD delay initialisation until r.empty is called

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

March 26, 2014

Re: protocol for using InputRanges

Posted by Steven Schveighoffer
in reply to Regan Heath

Steven Schveighoffer

Posted in reply to Regan Heath

On Wed, 26 Mar 2014 06:46:26 -0400, Regan Heath <regan@netmail.co.nz> wrote:

>
> On Tue, 25 Mar 2014 23:22:18 -0000, Walter Bright <newshound2@digitalmars.com> wrote:
>> On 3/25/2014 2:29 PM, Andrei Alexandrescu wrote:
>>>> The range instance gets bigger and
>>>> more expensive to copy, and the cost of manipulating the flag and the
>>>> buffer is added to every loop iteration. Note that the user of a range
>>>> can trivially use:
>>>>       auto e = r.front;
>>>>       ... using e multiple times ...
>>>> instead.
>>>
>>> That would pessimize code using arrays of large structs.
>>
>> You're already requiring copying with the buffering requirement. And besides, if I was creating a range of expensive-to-copy objects, I would add a layer of indirection to cheapen it.
>
> Surely you'd simply start with a range of pointers to expensive-to-copy objects?  Or, return them by reference from the underlying range/array/source.  You want to avoid *ever* copying them except explicitly where required.
>
>>> The proposed protocol pessimizes arrays of large structs
>>
>> Not really more than the existing protocol does (i.e. required buffering).
>>
>>> and will trip the unwary if calling r.front again returns something else.
>>
>> I'm not proposing that calling them wrongly would make things unsafe. But I don't think it's unreasonable to expect the protocol to be followed, just like file open/read/close.
>
> My immediate expectation upon encountering ranges was that r.front would return the same item repeatedly until r.popFront was called.  Breaking that guarantee will trip a *lot* of people up.
>
> IMO the rules should be something like:
>   - r.empty WILL return false if there is more data available in the range.
>
>   - r.empty MUST be called before r.front, r.front WILL succeed if r.empty returned false.
>   - r.front WILL repeatably return the current element in the range. It MAY return by value or by reference.
>
>   - r.empty SHOULD be called before r.popFront, otherwise r.popFront MAY do nothing or throw
>     (could also make this one a 'MUST')
>   - r.popFront WILL advance to the next element in the range.

These two rules are not necessary if you know the range is not empty. See the conversation inside this pull: https://github.com/D-Programming-Language/phobos/pull/1987

-Steve

March 26, 2014

Re: protocol for using InputRanges

Posted by Steven Schveighoffer
in reply to Steven Schveighoffer

Steven Schveighoffer

Posted in reply to Steven Schveighoffer

On Wed, 26 Mar 2014 08:29:15 -0400, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> On Wed, 26 Mar 2014 06:46:26 -0400, Regan Heath <regan@netmail.co.nz> wrote:
>> IMO the rules should be something like:
>>   - r.empty WILL return false if there is more data available in the range.
>>
>>   - r.empty MUST be called before r.front, r.front WILL succeed if r.empty returned false.
>>   - r.front WILL repeatably return the current element in the range. It MAY return by value or by reference.
>>
>>   - r.empty SHOULD be called before r.popFront, otherwise r.popFront MAY do nothing or throw
>>     (could also make this one a 'MUST')
>>   - r.popFront WILL advance to the next element in the range.
>
> These two rules are not necessary if you know the range is not empty. See the conversation inside this pull: https://github.com/D-Programming-Language/phobos/pull/1987

Gah, I didn't cut out the right rules. I meant the two rules that empty must be called before others. Those are not necessary.

-Steve

March 26, 2014

Re: protocol for using InputRanges

Posted by Regan Heath
in reply to Steven Schveighoffer

Regan Heath

Posted in reply to Steven Schveighoffer

On Wed, 26 Mar 2014 12:30:53 -0000, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> On Wed, 26 Mar 2014 08:29:15 -0400, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
>
>> On Wed, 26 Mar 2014 06:46:26 -0400, Regan Heath <regan@netmail.co.nz> wrote:
>>> IMO the rules should be something like:
>>>   - r.empty WILL return false if there is more data available in the range.
>>>
>>>   - r.empty MUST be called before r.front, r.front WILL succeed if r.empty returned false.
>>>   - r.front WILL repeatably return the current element in the range. It MAY return by value or by reference.
>>>
>>>   - r.empty SHOULD be called before r.popFront, otherwise r.popFront MAY do nothing or throw
>>>     (could also make this one a 'MUST')
>>>   - r.popFront WILL advance to the next element in the range.
>>
>> These two rules are not necessary if you know the range is not empty. See the conversation inside this pull: https://github.com/D-Programming-Language/phobos/pull/1987
>
> Gah, I didn't cut out the right rules. I meant the two rules that empty must be called before others. Those are not necessary.

I see.  I was thinking we ought to make empty mandatory to give more guaranteed structure for range implementors, so lazy initialisation can be done in one place only, etc etc.

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation