protocol for using InputRanges (page 3)

Settings

Help

Index » General » protocol for using InputRanges (page 3)

March 24, 2014

Re: protocol for using InputRanges

Posted by Joseph Rushton Wakeling
in reply to Dicebot

Permalink

Joseph Rushton Wakeling

Posted in reply to Dicebot

Permalink

On 24/03/14 14:20, Dicebot wrote:
> I think there is one design mistake with current InputRange rules that makes
> usage so inconsistent. We have `empty` but don't have any distinct `not yet
> started` state. If calling `popFront` at least once was required before
> accessing `front`, I can't imagine the case where having any non-trivial code in
> `front` would have been necessary.

I floated some ideas along those lines, for a "first" method that would be automatically called immediately before the first call to front, popFront, etc., whatever came first.

The trouble is that it's so vague _when_ it should be called.  Example from std.random: RandomSample not only has a .front which needs a check as to whether the first value has actually been selected, it also has an .index method which corresponds to the index of the chosen element of the range being sampled from.

Now, how is a "first" method supposed to know which methods it should precede? Any method?  Specified ones?  In any case, even if we could get it to work, it'd be a convenience rather than a solution, and would still in effect result in a non-const .front.

OTOH a requirement that .popFront() be called at least once before .front is accessed won't wash either.  At least in the case of randomSample, the code required to generate the _first_ sample point is different from what is required to .popFront().

March 25, 2014

Re: protocol for using InputRanges

Posted by Chris
in reply to Walter Bright

Permalink

Chris

Posted in reply to Walter Bright

Permalink

On Sunday, 23 March 2014 at 00:50:34 UTC, Walter Bright wrote:
> It's become clear to me that we've underspecified what an InputRange is. The normal way to use it is:
>
>     while (!r.empty) {
>         auto e = r.front;
>         ... do something with e ...
>         r.popFront();
>     }
>
> no argument there. But there are two issues:
>
> 1. If you know the range is not empty, is it allowed to call r.front without calling r.empty first?
>
> If this is true, extra logic will need to be added to r.front in many cases.
>
> 2. Can r.front be called n times in a row? I.e. is calling front() destructive?
>
> If true, this means that r.front will have to cache a copy in many cases.

In many cases it makes sense to put things into front, for example if you read a file line by line, parsing each line to create a dictionary entry. Although it has been said that any logic should go into popFront, because front can be called several times. Since Ranges usually perform some kind of filtering mechanism and return something new, it makes sense that things go into front, or at least should be allowed to go into front. In my experience it is a bit of a "the hen or the egg" problem, to get something you have to call front, but the something should only be done in popFront. Ranges are often simpler and faster to write, when you put things into front, because in most of my use cases front is not called many times in a row, just once for each item. I still haven't found the golden way(s) of writing ranges.

March 25, 2014

Re: protocol for using InputRanges

Posted by Walter Bright
in reply to Jonathan M Davis

Permalink

Walter Bright

Posted in reply to Jonathan M Davis

Permalink

On 3/23/2014 12:53 AM, Jonathan M Davis wrote:
> But again, front and empty should normally function as if they were variables.
> They should be property functions and should be pure (or at least act like
> they're pure).

This is simply not practical. The canonical example is a tty. You cannot tell if a character is available unless you try to read it. That is NOT pure.

> I'm sure that a _lot_ of code will break if that isn't followed.

It seems there have been a lot of undocumented assumptions about the behavior of ranges.

March 25, 2014

Re: protocol for using InputRanges

Posted by Walter Bright
in reply to w0rp

Permalink

Walter Bright

Posted in reply to w0rp

Permalink

On 3/23/2014 2:26 AM, w0rp wrote:
> .front assignment obviously replaces the cached value.

This is another undocumented assumption.

March 25, 2014

Re: protocol for using InputRanges

Posted by Dicebot
in reply to Joseph Rushton Wakeling

Permalink

Dicebot

Posted in reply to Joseph Rushton Wakeling

Permalink

On Monday, 24 March 2014 at 22:26:10 UTC, Joseph Rushton Wakeling wrote:
> On 24/03/14 14:20, Dicebot wrote:
>> I think there is one design mistake with current InputRange rules that makes
>> usage so inconsistent. We have `empty` but don't have any distinct `not yet
>> started` state. If calling `popFront` at least once was required before
>> accessing `front`, I can't imagine the case where having any non-trivial code in
>> `front` would have been necessary.
>
> I floated some ideas along those lines, for a "first" method that would be automatically called immediately before the first call to front, popFront, etc., whatever came first.

I was thinking about something more simple. Current pattern is:

while (!r.empty)
{
    auto useme = r.front;
    r.popFront;
}

And I think this would have been more practical:

r.popFront;

while (!r.empty)
{
   // same
}

So every range is supposed to start in "init" state and provide data only after first popFront. No other changes. It is not a silver bullet but looks like an improvement over current design to me.

March 25, 2014

Re: protocol for using InputRanges

Posted by Andrei Alexandrescu
in reply to Dicebot

Permalink

Andrei Alexandrescu

Posted in reply to Dicebot

Permalink

On 3/25/14, 12:16 PM, Dicebot wrote:
> On Monday, 24 March 2014 at 22:26:10 UTC, Joseph Rushton Wakeling wrote:
>> On 24/03/14 14:20, Dicebot wrote:
>>> I think there is one design mistake with current InputRange rules
>>> that makes
>>> usage so inconsistent. We have `empty` but don't have any distinct
>>> `not yet
>>> started` state. If calling `popFront` at least once was required before
>>> accessing `front`, I can't imagine the case where having any
>>> non-trivial code in
>>> `front` would have been necessary.
>>
>> I floated some ideas along those lines, for a "first" method that
>> would be automatically called immediately before the first call to
>> front, popFront, etc., whatever came first.
>
> I was thinking about something more simple. Current pattern is:
>
> while (!r.empty)
> {
>      auto useme = r.front;
>      r.popFront;
> }
>
> And I think this would have been more practical:
>
> r.popFront;
>
> while (!r.empty)
> {
>     // same
> }
>
> So every range is supposed to start in "init" state and provide data
> only after first popFront. No other changes. It is not a silver bullet
> but looks like an improvement over current design to me.

Suggestion: focusing on what to do within the present context is more productive.

Andrei

March 25, 2014

Re: protocol for using InputRanges

Posted by Dicebot
in reply to Andrei Alexandrescu

Permalink

Dicebot

Posted in reply to Andrei Alexandrescu

Permalink

On Tuesday, 25 March 2014 at 19:44:46 UTC, Andrei Alexan
> Suggestion: focusing on what to do within the present context is more productive.
>
> Andrei

I do work on template argument pack & friends right now, leave me some fun! :P

March 25, 2014

Re: protocol for using InputRanges

Posted by Walter Bright
in reply to Walter Bright

Permalink

Walter Bright

Posted in reply to Walter Bright

Permalink

It's pretty clear that:

1. the protocol is COMPLETELY undocumented and undefined.

2. in this vacuum, people (including me) have made all sorts of assumptions, and assumed those assumptions were universal and shared.

Since ranges+algorithms are central to D, this is a very serious problem. There are two competing schools of thought:

1. maximum flexibility

2. maximum performance

(1) has been explained well in this thread already. (2) not so much, so I'll stand up for that one.

I've been recently involved in writing high performance apps, as anyone following ScopeBuffer knows. With such apps, it's clear that:

1. size matters - the fewer registers an object fits in, the faster it goes
2. every instruction matters - more operations == slower code
3. lazy seems to be faster, mainly because it eliminates the storage management cost of buffers and extra copying into and out of those buffers

We want to appeal to the high performance coders. To maximize performance, ranges should be optimized to the inner loop case, which is:

while (!r.empty) { auto e = r.front; ... e ...; r.popFront(); }

With the additional proviso that ranges are passed to algorithms by value, so they should be cheap to copy. Cheap to copy usually means them being small.

A) I know that my range is not empty, so I can skip calling empty.

Since front is guaranteed to succeed if !empty, this puts a requirement on many ranges that they have a non-trivial constructor that 'primes the pump'. Of course, priming may fail, and so construction may throw, which is not good design. Next, priming requires a buffer in the range. Buffers add size, making them slower to copy, and will often require another field saying if the buffer has data in it or not, further bumping the size. And lastly, it means that lazy ranges will be required to read the first element, even if the range isn't then subsequently used, which defeats what one would expect from a lazy range. By having mere construction of a range initiate reading from source, this makes the useful idiom of separating construction from use impractical. (For example, building a pipeline object, then sending it to another thread for execution.)

All this saves for the user is one call to empty for the entire algorithm, at a cost incurred with every iteration. I.e. it selects O(n) to save O(1).

B) I want to be able to call front multiple times in a row, and expect to get the same value.

This can force the range to incorporate a one element buffer and a flag indicating the status of that buffer. The range instance gets bigger and more expensive to copy, and the cost of manipulating the flag and the buffer is added to every loop iteration. Note that the user of a range can trivially use:
auto e = r.front;
... using e multiple times ...
instead.

The big problem with (A) and (B) is these costs become present in every range, despite being rarely needed and having simple alternatives. This is the wrong place to put cost and complexity. The user should be making these decisions based on his needs.

Hence, I propose that the protocol for using input ranges be defined as:

while (!r.empty) { auto e = r.front; ... e ...; r.popFront(); }

This makes it possible to build pipelines that are firehoses with no kinks or constrictions in them. It optimizes for the inner loop case, not boundary cases.

March 25, 2014

Re: protocol for using InputRanges

Posted by monarch_dodra
in reply to Walter Bright

Permalink

monarch_dodra

Posted in reply to Walter Bright

Permalink

On Tuesday, 25 March 2014 at 20:15:32 UTC, Walter Bright wrote:
> It's pretty clear that:
>
> 1. the protocol is COMPLETELY undocumented and undefined.

http://dlang.org/phobos/std_range.html#isInputRange

The semantics of an input range (not checkable during compilation) are assumed to be the following (r is an object of type R):
r.empty returns false iff there is more data available in the range.
r.front returns the current element in the range. It may return by value or by reference. Calling r.front is allowed only if calling r.empty has, or would have, returned false.
r.popFront advances to the next element in the range. Calling r.popFront is allowed only if calling r.empty has, or would have, returned false.

> We want to appeal to the high performance coders. To maximize performance, ranges should be optimized to the inner loop case, which is:
>
>     while (!r.empty) { auto e = r.front; ... e ...; r.popFront(); }

This makes the assumption that r.front is copy constructible at all. It also makes the assumption that you want to operate on a copy, rather than the actual element in the range.

Finally, it means having to declare a local object: It merely means shifting the burden of caching from one context to another. If the object is large, chances are you are better off just calling front instead of making a copy. Especially if the loop is trivial.

If you want high performance, then arguably, just provide a O(1) front, and a O(1) empty.

Also, certain ranges, such as "filter" *must* access the front of the previous range more than once. Unless you are suggesting we add a field for it to cache the result of the previous range?

> With the additional proviso that ranges are passed to algorithms by value, so they should be cheap to copy. Cheap to copy usually means them being small.

Unfortunately yes. That said, any range that does anything will have at least two fields, one of which is a slice, or comparable to in terms of size, so it's going to be big anyways. So you *are* better off passing by ref if you can regardless, unless your range is *really* trivial.

I agree that range sizes can be a problem.

> A) I know that my range is not empty, so I can skip calling empty.
>
> Since front is guaranteed to succeed if !empty, this puts a requirement on many ranges that they have a non-trivial constructor that 'primes the pump'. Of course, priming may fail, and so construction may throw, which is not good design.

If the prime fails, then the range can simply be marked as empty. Then if you decide to skip calling empty anyways, it's your own fault.

> And lastly, it means that lazy ranges will be required to read the first element, even if the range isn't then subsequently used, which defeats what one would expect from a lazy range.

I'm not yet convinced of adding special code for ranges that aren't used. I've heard of these kinds of ranges, but I've observed that when you declare one, it almost always ends up being used. I don't think we should waste efforts on this rare usecase.

As for "Lazy", in range terms, it mostly only means you calculate things element at once, as you go up the range chain. As opposed to processing the entire input data, one transformation at a time.

Evaluating an element "on use" as opposed to "1 instruction before use" doesn't make much of a change in this context.

> All this saves for the user is one call to empty for the entire algorithm, at a cost incurred with every iteration. I.e. it selects O(n) to save O(1).

If code that was actually meant to *do* something was in popFront() to begin with, then there'd be no extra overhead.

I've found that if you are creative enough, you can usually design the range in such a way that it works efficiently, lazilly, and without flags.

> Hence, I propose that the protocol for using input ranges be defined as:
>
>     while (!r.empty) { auto e = r.front; ... e ...; r.popFront(); }
>
> This makes it possible to build pipelines that are firehoses with no kinks or constrictions in them. It optimizes for the inner loop case, not boundary cases.

I get where you are coming from, but it's simply not manageable in a generic fashion, if you want to be able to preserve all the power and the diversity of the ranges we have.

The protocol you are suggesting would prevent us from doing a lot of the lovely things that ranges empowers us with.

March 25, 2014

Re: protocol for using InputRanges

Posted by Andrei Alexandrescu
in reply to Walter Bright

Permalink

Andrei Alexandrescu

Posted in reply to Walter Bright

Permalink

On 3/25/14, 1:15 PM, Walter Bright wrote:
> Of course, priming may fail, and so construction may throw, which is not
> good design.

This part I disagree with. -- Andrei

Top | Forum index | About this forum

Forums