Why must a bidirectional range also be a forward range?

Why must a bidirectional range also be a forward range?
Sep 19, 2019 Joseph Rushton Wakeling
Sep 19, 2019 Paul Backus
Sep 19, 2019 Jonathan M Davis
Sep 20, 2019 Joseph Rushton Wakeling
Sep 21, 2019 Jonathan M Davis

September 19, 2019

Posted by Joseph Rushton Wakeling

Permalink

Joseph Rushton Wakeling

Permalink

Hello folks,

A question that occurred to me while implementing a new data structure recently, which I'm not sure I've ever seen a reason for.

Why must bidirectional ranges also be forward ranges (as opposed to just input ranges)?

It doesn't seem to me that the `save` property is inherently required to iterate backwards over a range -- just the `back` and `popBack` methods.

It makes sense that, for bidirectionality, the range needs to be deterministic, so that iterating backward gives the exact same elements as iterating forward, just in reverse order. But it seems strange to require the `save` property in order to automatically assume deterministic behaviour.

For context, the use-case I have is a data structure which stores an internal buffer as an array. A robust `save` method would therefore have to duplicate the array (or at least the active subset of its contents). This means a fresh heap allocation per `save`, which has some nasty implications for phobos algorithms that eagerly `.save` when they can.

So, I'd rather not implement `save` in this case. But there is nothing that blocks implementing `back` and `popBack`; yet I can't use these with any of the functionality that requires bidirectionality, because the current `isBidirectionalRange` check requires `save`.

So what gives? Are there some reasons for the `save` requirement on bidirectional ranges that I'm missing? And regardless, any advice on how to handle my particular use-case?

Thanks & best wishes,

-- Joe

September 19, 2019

Re: Why must a bidirectional range also be a forward range?

Posted by Paul Backus
in reply to Joseph Rushton Wakeling

Permalink

Paul Backus

Posted in reply to Joseph Rushton Wakeling

Permalink

On Thursday, 19 September 2019 at 09:31:32 UTC, Joseph Rushton Wakeling wrote:
> For context, the use-case I have is a data structure which stores an internal buffer as an array.  A robust `save` method would therefore have to duplicate the array (or at least the active subset of its contents).  This means a fresh heap allocation per `save`, which has some nasty implications for phobos algorithms that eagerly `.save` when they can.

In this case, it is probably better to separate the range from the data structure it refers to. For example:

struct Container {
    int[] data;

    private static struct Range {
        int[] contents;
        bool empty()    { return contents.length == 0; }
        int front()     { return contents[0]; }
        void popFront() { contents = contents[1 .. $]; }
        int back()      { return contents[$ - 1]; }
        void popBack()  { contents = contents[0 .. $ - 1]; }
        Range save()    { return this; }
    }

    Range getRange() {
        return Range(data[]);
    }
}

September 19, 2019

Re: Why must a bidirectional range also be a forward range?

Posted by Jonathan M Davis
in reply to Joseph Rushton Wakeling

Permalink

Jonathan M Davis

Posted in reply to Joseph Rushton Wakeling

Permalink

On Thursday, September 19, 2019 3:31:32 AM MDT Joseph Rushton Wakeling via Digitalmars-d-learn wrote:
> Hello folks,
>
> A question that occurred to me while implementing a new data structure recently, which I'm not sure I've ever seen a reason for.
>
> Why must bidirectional ranges also be forward ranges (as opposed
> to just input ranges)?
>
> It doesn't seem to me that the `save` property is inherently required to iterate backwards over a range -- just the `back` and `popBack` methods.
>
> It makes sense that, for bidirectionality, the range needs to be deterministic, so that iterating backward gives the exact same elements as iterating forward, just in reverse order.  But it seems strange to require the `save` property in order to automatically assume deterministic behaviour.
>
> For context, the use-case I have is a data structure which stores an internal buffer as an array.  A robust `save` method would therefore have to duplicate the array (or at least the active subset of its contents).  This means a fresh heap allocation per `save`, which has some nasty implications for phobos algorithms that eagerly `.save` when they can.
>
> So, I'd rather not implement `save` in this case.  But there is nothing that blocks implementing `back` and `popBack`; yet I can't use these with any of the functionality that requires bidirectionality, because the current `isBidirectionalRange` check requires `save`.
>
> So what gives?  Are there some reasons for the `save` requirement on bidirectional ranges that I'm missing?  And regardless, any advice on how to handle my particular use-case?
>
> Thanks & best wishes,
>
>        -- Joe

For better or worse, ranges were more or less set up as a linear hierarchy, and it's unlikely that use cases for bidirectional ranges which aren't forward ranges are common. I expect that it's a bit like infinite, bidirectional ranges. In theory, they could be a thing, but the use cases for them are uncommon enough that we don't really support them. Also, I expect that most range-based algorithms which operate on bidirectional ranges would require save anyway. A lot of algorithms do to the point that basic input ranges can be incredibly frustrating to deal with.

Assuming we were redesigning the range API (which may happen if we do indeed end up doing a Phobos v2), then maybe we could make it so that bidirectional ranges don't have to be forward ranges, but honestly _any_ ranges which aren't forward ranges are a bit of a problem. We do need to support them on some level for exactly the kind of reasons that you're looking to avoid save with a bidirectional range, but the semantic differences between what makes sense for a basic input range and a forward range really aren't the same (in particular, it works far better for basic input ranges to be reference types, whereas it works better for forward ranges to be value types).

As it stands, I don't think that we can change isBidirectionalRange, because it's likely that most code using it relies on its check for isForwardRange. So, I think that we're stuck for the moment, but it is food for thought in a possible range API redesign. I'll add it to my notes on the topic. Some aspects of a range API redesign should look like are pretty clear at this point, whereas others are very much an open question.

Ideally, I'd like to force basic input ranges to be reference types, and forward ranges to be value types, but I'm not sure that that's reasonable in practice. It would really clean up some of the semantics of ranges, but it would also likely require allocating a lot more stuff on the heap than would be desirable. Either way, having bidirectional ranges not need to have the equivalent of save would mean treating them as more of an add-on capability (like length) rather than having the kind of hierarchy that we have now. I don't know if that's ultimately a good or a bad thing.

- Jonathan M Davis

September 20, 2019

Re: Why must a bidirectional range also be a forward range?

Posted by Joseph Rushton Wakeling
in reply to Jonathan M Davis

Permalink

Joseph Rushton Wakeling

Posted in reply to Jonathan M Davis

Permalink

On Thursday, 19 September 2019 at 22:55:55 UTC, Jonathan M Davis wrote:
> For better or worse, ranges were more or less set up as a linear hierarchy, and it's unlikely that use cases for bidirectional ranges which aren't forward ranges are common. I expect that it's a bit like infinite, bidirectional ranges. In theory, they could be a thing, but the use cases for them are uncommon enough that we don't really support them. Also, I expect that most range-based algorithms which operate on bidirectional ranges would require save anyway. A lot of algorithms do to the point that basic input ranges can be incredibly frustrating to deal with.
>
> [ ... ]

Thanks for the characteristically thorough description of both the design considerations and the history involved.

On reflection it occurs to me that the problem in my thinking may be the idea that `save` should result in a full deep copy.  If instead we go by how `save` is implemented for dynamic arrays, it's only ever a shallow copy: it's not possible to make valid assumptions of reproducible behaviour if the original copy is modified in any way.

If instead we assume that `save` is only suitable for temporary shallow-copies that are made under the hood of algorithms, then my problems go away.

> Assuming we were redesigning the range API (which may happen if we do indeed end up doing a Phobos v2), then maybe we could make it so that bidirectional ranges don't have to be forward ranges, but honestly _any_ ranges which aren't forward ranges are a bit of a problem. We do need to support them on some level for exactly the kind of reasons that you're looking to avoid save with a bidirectional range, but the semantic differences between what makes sense for a basic input range and a forward range really aren't the same (in particular, it works far better for basic input ranges to be reference types, whereas it works better for forward ranges to be value types).

It occurs to me that the distinction we're missing here might between "true" input ranges (i.e. which really come from IO of some kind), which indeed must be reference types, versus "pure" input ranges (which are deterministic, but which don't necessarily allow algorithms to rely on the ability to save and replay them).

> As it stands, I don't think that we can change isBidirectionalRange, because it's likely that most code using it relies on its check for isForwardRange. So, I think that we're stuck for the moment, but it is food for thought in a possible range API redesign. I'll add it to my notes on the topic. Some aspects of a range API redesign should look like are pretty clear at this point, whereas others are very much an open question.

Oh, I wasn't asking for any changes to the existing definition (at least not without much thought from everyone!).  I was just wanting to understand the reasons for the current situation.  But thanks for putting it on the list of things to consider.

I may have some follow-up to your other remarks but I think at least now I have a way forward with my code.  Thanks!

September 21, 2019

Re: Why must a bidirectional range also be a forward range?

Posted by Jonathan M Davis
in reply to Joseph Rushton Wakeling

Permalink

Jonathan M Davis

Posted in reply to Joseph Rushton Wakeling

Permalink

On Friday, September 20, 2019 7:08:03 AM MDT Joseph Rushton Wakeling via Digitalmars-d-learn wrote:
> On Thursday, 19 September 2019 at 22:55:55 UTC, Jonathan M Davis
>
> wrote:
> > For better or worse, ranges were more or less set up as a linear hierarchy, and it's unlikely that use cases for bidirectional ranges which aren't forward ranges are common. I expect that it's a bit like infinite, bidirectional ranges. In theory, they could be a thing, but the use cases for them are uncommon enough that we don't really support them. Also, I expect that most range-based algorithms which operate on bidirectional ranges would require save anyway. A lot of algorithms do to the point that basic input ranges can be incredibly frustrating to deal with.
> >
> > [ ... ]
>
> Thanks for the characteristically thorough description of both the design considerations and the history involved.
>
> On reflection it occurs to me that the problem in my thinking may be the idea that `save` should result in a full deep copy.  If instead we go by how `save` is implemented for dynamic arrays, it's only ever a shallow copy: it's not possible to make valid assumptions of reproducible behaviour if the original copy is modified in any way.
>
> If instead we assume that `save` is only suitable for temporary shallow-copies that are made under the hood of algorithms, then my problems go away.

save is supposed to result copies that can be independently iterated over. So, code such as

foreach(r; range.save)
{}
auto arr = r.save.array();
assert(equal(arr, r));

should work. How that's implemented underneath the hood doesn't really matter. However, none of that really takes into account mutation of the elements. The range API pretty much assumes that you don't ever modify any of the elements in the range as you iterate over them. So, if you do something like

auto orig = range.save;
range.front = 42;

whether orig.front is then 42 is implementation-dependent. So, if you're modifying elements as you go, then the behavior you get is going to be highly dependent on what you're doing with the ranges, and certainly, if you're only using a range within a very specific context, it can be implemented in a way that works in that context but doesn't work with range-based functions in general. You just run the risk of problems if you then later modify the code to use other range-based functions which don't necessarily work with whatever you've done with the range.

As for temporary, shallow copies, IIRC, isForwardRange requires that save returns exactly the same type as the original. So, while you can certainly have a range referring to a data structure without owning any of the data (after all, that's what happens with dynamic arrays), you can't have a range of one type that owns the data and then have save return a range type which just refers to the data unless the range is written in a way that you can have both within the same type.

One example of avoiding the need to deep-copy with save where one range is at least sort of the owner whereas the others aren't is how dxml's stax parser works. The ranges share a context that keeps track of how far into the range the farthest range is, and popFront only does the validation when the range that popFront is being called on is the farthest of any range. That way, the stuff related to validating end tags didn't need to be deep-copied, but save always returns exactly the same type, and you get exactly the same behavior regardless of which range gets iterated farther first (or even if one is iterated farther and then another is iterated beyond it).

If popFront every throws, then that range becomes invalid, but the others are fine. The validation other than that for matching end tags currently all gets done every time, so all ranges would throw in the same place for errors other than end tags that don't match, but the same is also true for when the end tags don't match, because even though that validation is only done for the farthest range, if it fails, the shared context is left in exactly the same state, and any other ranges that reach that point would then throw like the first range did.

Without realizing that the validation for the end tags didn't have to be done for every instance of the range but only the one which was farthest along, I would have been screwed with regards to save, because deep-copying would have been required. I'm not sure that that particular trick is widely applicable, but it is an example of how save can do something other than a deep copy even though having each range do exactly the same work would have required a deep copy.

> > Assuming we were redesigning the range API (which may happen if we do indeed end up doing a Phobos v2), then maybe we could make it so that bidirectional ranges don't have to be forward ranges, but honestly _any_ ranges which aren't forward ranges are a bit of a problem. We do need to support them on some level for exactly the kind of reasons that you're looking to avoid save with a bidirectional range, but the semantic differences between what makes sense for a basic input range and a forward range really aren't the same (in particular, it works far better for basic input ranges to be reference types, whereas it works better for forward ranges to be value types).
>
> It occurs to me that the distinction we're missing here might between "true" input ranges (i.e. which really come from IO of some kind), which indeed must be reference types, versus "pure" input ranges (which are deterministic, but which don't necessarily allow algorithms to rely on the ability to save and replay them).

Well, in general, the difference between a basic input range and a forward range is whether you can cheaply copy their state to then have an independent copy to iterate over. If you can't cheaply copy their state, then they're fundamentally going to be either reference types or pseudo-reference types. The closest I can think to having a basic input range that was a value type would be something like a bufferless socket which just retained the socket ID (so, it would be just contain an integral value), but since that ID is pointing to the data elsewhere, it's still basically a reference type. I don't see how you could have a basic input range which couldn't be a forward range which was truly a value type.

You could have a situation where you have a range which was a value type but too expensive to copy for it to be a forward range, but at that point, you just wouldn't make it a range, since ranges get copied all over the place, and it really doesn't work for them to be too expensive to copy. You'd pretty much have to create a separate type which referred to it which was the range.

You can certainly have types which are reference types or pseudo-reference types which are forward ranges, so they don't have to be value types, but they do have to have a way to create independent copies to iterate over the same data, or they can't be forward ranges. So, forward ranges don't _have_ to be value types in that sense, but the way they get passed around pretty much assumes that they're value types. Unless a range-based function takes its argument by ref, you pretty much have to assume that range is going to be consumed and call save when passing it if want to continue to use the range. So, really, we'd just be better off if we required that all forward ranges be value types and that copying them was equivalent to save (so, copy constructors would have to be used in lieu of save if the default copy wasn't equivalent to save, and classes would have to be wrapped by structs with copy constructors). That eliminates all of the bugs that we get which relate to save not being called when it should be, because the code assumes that copying is save.

So, I think that what should be done with forward ranges is pretty clear, and as I understand it, Andrei agrees. What's harder is basic input ranges. If we can assume that forward ranges are value types, then that makes it so that the copying semantics of all forward ranges is well-defined (unlike now), but basic input ranges _can't_ have the same copy semantics, or they could be forward ranges. They either are reference types (so, mutating the copy mutates the original), or they're pseudo-reference types (so, mutating the copy only partially affects the original). The result is that you still can't assume the semantics for copying basic input ranges. You also can't assume the semantics of copying a range when the code operates on both basic input ranges and forward ranges.

If basic input ranges had to be reference types, then we _could_ assume the semantics for copying them - but only if the code for operating on basic input ranges is not shared with the code for operating on forward ranges, and with the current hierarchy, it's a forward range is an input range, so their code is pretty much always shared so long as the code can operate on basic input ranges.

That problem there is the main one that stumps me right no with what we should do with basic input ranges if we were to do a range API redesign. We can make the semantics for copying (and presumably assignment as well) be well-defined for forward ranges by requiring that copying be equivalent to save (and getting rid of save), but that still leaves the semantics for copying and assigning basic input ranges as a mess. Requiring that basic input ranges be reference types improves the situation, but it doesn't completely fix it, and it would require that some basic input ranges be allocated on the heap when that wouldn't currently be necessary.

In some respects, it does make perfect sense to treat forward ranges as being basic input ranges with more capabilities, but in other respects, they're very different beasts to the point that it seems wrong to treat them as being related.

LOL. And given how hard it is to work with basic input ranges, since you can't save their state, in some respects, I'm tempted to say that we'd be better off if they weren't even a thing, but some basic stuff like sockets and other I/O don't work well at all with forward ranges because of the buffering that's required. Honestly, even though ranges are basically streams, they seem to work bizarrely badly for I/O. It seems to me that ranges are a fantastic tool for the cases where forward ranges work, and their API works reasonably well for basic iteration when you can't have a forward range, but going from something that really just works with basic iteration and going to where that then works with the larger range API really doesn't work well. It seems like we need a different solution to bridge that gap, and I really don't know that solution should look like. I should probably study what Steven did with iopipe and see how that fits into things.

In any case, that's probably enough rambling on the subject. I do sure wish that we could find a better way to deal with the gap between basic input ranges and forward ranges though as well as manage to fix it so that the semantics of the copying and assignment of ranges in general can be properly specified and relied upon.

- Jonathan M Davis

Top | Forum index | About this forum

Forums