Jump to page: 1 27  
Page
Thread overview
January 20
I've been thinking about this for a while now, but with the next version of Phobos which is in the early planning stages, we really should do some redesigning of ranges. Most of their API is just fine how it is, but there are some aspects of it which really should be changed if we want them to be better (the most obvious one being the removal of auto-decoding). But what I'd like to discuss specifically in this thread is fixing - and defining - the semantics of copying and assigning to ranges. Specifically, the semantics of stuff like

auto copy = orig;

and

range1 = range2;

should be defined (which then defines what happens when passing ranges to functions and stuff like foreach).

Right now, the only semantics of copying a range which are defined by the range API are what happens to the copy when you make a copy. Specifically, the copy will have the exact same elements in the exact same order that the original would have had if it had been iterated through prior to the copy being made. If that weren't true, then we couldn't pass ranges around. However, because ranges support a whole range of types with varying copy semantics, you can't rely on a range's copy semantics beyond that (particularly with regards to the state of the original range after the copy has been made). Specifically, a range type can have four different types of copy semantics:

1. The range is a value type, so copying it results in two completely indepedent copies. Iterating through one has no effect on the other.

2. The range is a reference type, so copying it results in two completely dependent copies. They both share all of their state, so iterating through one iterates through the other, and they both always have the same state.

3. The range is a pseudo-reference type, and it has value semantics for its iteration state but reference semantics for some portion of the rest of its state. This means that if you iterate through a copy of the range, it has no effect on the original, so the range is a value type in that respect, but the rest of its state may or may not be copied by value (often - though not always - this means that the iteration state is directly within the range type, but the elements themselves are referred to via pointer, so mutating the actual elements would affect all copies of that range, but each range can be iterated independently). Dynamic arrays are the most common example of this sort of range.

4. The range is a pseudo-reference type, but it does not have value semantics with regards to its iteration state. So, when copying the range, some portion of its state is copied by value, but you don't end up with indendepent copies - and in fact, you don't get dependent copies either, since some of the state is copied by value. A common example of this sort of range which would be one which contains most of its state via a pointer or reference but which caches its front as a member variable within the struct. The result of this is that once you've mutated a copy of the range, you can no longer use the original, because mutating the copy only partially mutates the state of the original, leaving it in an inconsistent / invalid state.

Generic code must support ranges with any of these copy semantics, and since these copy semantics are drastically different from one another, the result is that in fully generic code, once you copy a range, you cannot ever use the original again (since you can't depend on its state). And you can't even assign it a new value and then use it, because the semantics of assignment aren't defined for ranges either (though they're likely to be strongly tied to the copy semantics). Once you copy a range, you have to assume that the orginal range is in a potentially invalid state and can't do anything further with it.

In addition to copying ranges to new variables and passing them to other functions, this also means that you can't actually continue to use a range once it's been passed to foreach, because the lowering that foreach uses copies the range (and it has to or it would break the semantics of using dynamic arrays with foreach). So, you can't do stuff like partially iterate through a range, break out of the loop, and then continue to iterate through it, because the state of the range is implementation-dependent and could be invalid (on top of the fact that in the case of many forward ranges, copying it is equivalent to save, so trying to continue the iteration on the original won't work, whereas it would work with some basic input ranges, so even without any ranges ending up in an invalid state, the semantics depend on the type in a way that does not work with generic code). In fact, someone necro-ed a thread in D.Learn on this issue with foreach just the other day, since it's not obvious and can be pretty surprising.

Of course, in order to try to solve the problem with varying copy semantics, the range API has save for forward ranges. This makes it possible to get an independent copy of a forward range, and any generic code which needs an independent copy needs to use it to get one. However, that does not help at all with the inconsistent semantics of actually copying a range, and it's very common for people to write code that works with dynamic arrays but then doesn't work with other types of ranges, because other types of ranges have differing copy semantics, and they didn't test their code with enough range types to find where they accidentally relied on the copy semantics (or assignment semantics) of dynamic arrays (or on the semantics of whatever range type it was that they tested with).

If we really want to fix these classes of issues, we need ranges to have consistent copy semantics, which means that we need to define what those semantics are.

For forward ranges, this is quite straightfoward. We need to require that copying a forward range results in two copies which can be independently iterated - so essentially, the iteration state must be copied by value, or another way to look at it is that copying a forward range must be equivalent to calling save (though the situation is actually more nuanced than that). We would be requiring that all forward ranges have the same copying semantics as dynamic arrays with regards to their iteration state (but would allow the elements themselves to be copied by value or by reference like we do now). And of course, we would then get rid of save, since it would be unnecessary.

What this would mean is that all forward ranges would be either dynamic arrays or structs. Any code that then wanted a class or a pointer to a struct to be a forward range would need to wrap it in a struct to give it the correct copy semantics. A naive implementation would just duplicate the class whenever the struct was copied, but a smarter implementation could do something like use reference counting to keep track of how many ranges there were pointing to that same class object and then duplicate it whenever a function was called on it which would mutate its iteration state, and the reference count was greater than 1 (and of course not bother to duplicate it if the reference count was 1). It wouldn't need to do more memory management than that (though it could), so that doesn't necessarily require anything @system or @trusted as often gets discussed with reference counting. Of course, allowing such an implementation does technically mean that copying a forward range is not necessarily strictly equivalent to calling save, but the overall iteration semantics would treat each copy as independent like save is supposed to allow, so it wouldn't be a problem.

And with that change, we then have consistent copy semantics with forward ranges. All code operating on forward ranges could treat them the same as they do dynamic arrays when copying them and passing them around, and you wouldn't have to worry about forgetting to call save. I think that we can probably get general agreement on such a change, since it's been discussed off-and-on for years now. However, the problem then is basic input ranges.

Basic input ranges cannot have the same copy semantics as dynamic arrays. If they could, then they could be forward ranges. As things stand, they either have reference semantics, or they have pseudo-reference semantics of the variety where mutating a copy puts the original in an invalid state. The only way that we could make their copy semantics consistent is if we then require that they be reference types. The simplest way to do that would be to require that they be classes or pointers to structs, though that would then make it impossible to do anything like reference counting with basic input ranges, which isn't ideal, but it would have the advantage of allowing us to force reference semantics at compile time.

This would of course mean that basic input ranges and forward ranges would then be defined to have different copy semantics, which would mean that either generic code that relied on the copy semantics would have to work with only one of them, or we would need to give basic input ranges a different API. And if we allowed basic input ranges to be structs that defined reference semantics internally (e.g. by implementing reference counting on a pointer or class reference), then we would have to give basic input ranges a different API, because we wouldn't have save to differentiate them anymore, and if we allowed structs, then we couldn't differentiate them based on a basic input range being a class reference or pointer to a struct, whereas a forward range would be a struct or a dynamic array.

Ultimately, I'm inclined to argue that we should give basic input ranges a new API - not just because it would allow them to use reference counting, but also because the current input range API tends to force basic input ranges to cache the value of front (which if anything, encourages basic input ranges to be pseudo-reference types and could result in an extra layer of indirection in some cases if they're forced to be reference types). It would be annoying in some cases to require that different functions be used for basic input ranges and forward ranges (though overloading could obviously be used to avoid needing different names), but it's already often the case that code isn't really designed to work with both, and overloading on the category of range being used is already fairly common, since different range capabilities allow for different implementations. So, given that it would prevent whole classes of copying bugs as well as potentially remove the requirement to cache front for basic input ranges, I think that a separate API for basic input ranges is warranted. What I would propose for that would be a single function

    auto next();

where next returns a nullable type where the value returned is the next element in the range, with a null value being returned if the range is empty. The return type would then need to emulate a pointer - specifically, when casting it to bool, it would be true if it's non-null and false if it's null, and dereferencing it would give you the actual value if it's non-null (with it being undefined behavior if you dereference null). So, a basic input range of ints might define next as

    int* next();

or alternatively, it could be something like

    Nullable!int next();

though that wouldn't work with Phobos' current Nullable type, since it doesn't support either casting to bool or dereferencing (probably because it originally used alias this). Either way, since we'd just require the specific API for the return type rather than requirng a pointer, the range type would have some flexibliity in what it used.

This would then mean that if you wanted to loop over a basic input range, you'd do something like

    for(auto front = range.next; !front; front = range.next)
    {
        ...
    }

And if we go down this road, then we could also add this API to foreach, allowing for code such as

    foreach(e; basicInputRange)
    {
        ...
    }

to work - and unlike now, you could rely on the copy semantics involved such that you would know that you could then break out of the loop and continue to use the range (whereas right now, you can't safely break out of a foreach and then continue to use the range that you were iterating over). Of course, for forward ranges, breaking out of a foreach still wouldn't let you continue to iterate from where you were, since the foreach would be iterating over a copy, but those would be the semantics for all forward ranges, so you could rely on them.

In addition to these changes to the copy semantics, I propose that we then require that assignment follow the same semantics as copying. So,

    range1 = range2;

on a forward range would result in two independent copies with the same the same elements in the same order (regardless of what range1 pointed to before), just like you'd get with

    auto copy = orig;

whereas for basic input ranges

    range1 = range2;

would result in range1 and range2 being fully dependent copies (regardless of what range1 pointed to before) - just like what you'd expect when assigning one pointer to another.

So, the assignment semantics would differ between basic input ranges and forward ranges, but they would be consistent with their copy semantics, and they would be consistent across all ranges of the same category (and of course, assignment would only be expected to work if the two ranges were of exactly the same type).

The result of all of this is that we could then have consistent, reliable copy and assignment semantics for both basic input ranges and forward ranges, and the two would be easily distinguishable so that you don't accidentally use one as the other and end up with bugs due to the differences in semantics between the two categories of ranges.

The other potential change related to this would be to rename basic input ranges. Right now, forward ranges are treated as an extension of input ranges, which makes sense if you're ignoring the issues surrounding copy semantics, since a forward range is an input range with an extra function that allows you to get an independent copy, but if we're giving them different APIs so that the copy semantics can be cleanly defined and separated, then arguably, it doesn't make sense to call them both ranges. As such, we could switch to calling forward ranges simply ranges (which we mostly do anyway) and give basic input ranges a new name. My initial suggestion would be an input stream (or if we absolutely had to continue to call them ranges, then maybe simple ranges), but there's probably a better name to be had, and obviously, plenty of bikeshedding can be done on that topic. Either way, the names used here aren't the primary concern.

What I'd like to get out of this thread is feedback on how much sense this idea does or doesn't make and what problems I'm missing that someone else is able to see. From what I can see, the main negative is simply that you can't then write code that works on both a basic input range and a forward range (though you can obviously still create function overloads so that the caller can use either), but given the issues surrounding copy semantics, I think that that's probably ultimately a good change (and the number of functions that can operate on basic input ranges is already pretty limited anyway in comparison to forward ranges - particularly with regards to generic algorithms). It will also make it much easier to discuss the separation between basic input ranges and forward ranges, which IMHO can be too easy to lose or confuse as things stand.

- Jonathan M Davis



January 21

On Sunday, 21 January 2024 at 05:00:31 UTC, Jonathan M Davis wrote:

>

I've been thinking about this for a while now, but with the ...

Shall

  • Nullable.opCast to bool
  • Nullable.opUnary *

be added to Phobos' std.typecons?

January 21
On Sunday, January 21, 2024 3:36:37 AM MST Per Nordlöw via Digitalmars-d wrote:
> On Sunday, 21 January 2024 at 05:00:31 UTC, Jonathan M Davis
>
> wrote:
> > I've been thinking about this for a while now, but with the ...
>
> Shall
>
> - `Nullable.opCast` to `bool`
> - `Nullable.opUnary` `*`
>
> be added to Phobos' `std.typecons`?

Maybe. It probably should be, though I don't know how many changes like that we're going to do to the current version of Phobos.

Either way, if we're doing a new major version of Phobos, we'll be reworking a variety of stuff anyway, and so Nullable would presumably be on the list (e.g. I think that it was a big mistake to make it a range, and I'd remove that functionality; slicing it to get a range is one thing, but the fact that it itself is a range definitely causes problems in generic code). And if the new API for basic input ranges uses a pointer API to access its elements, then that's a very good reason to make Nullable support that. As I said in my post, I suspect that the main reason that the current version doesn't is because it originally used alias this - which we removed due to all of the bugs that that caused, but giving Nullable an opCast to bool and * for dereferencing would likely have been a terrible idea when Nullable used alias this, whereas without that, I can't think of any reason why it would be a bad idea - though I'd likely still choose to have get and isNull for those who wanted to make it clear that a Nullable was being used. But I think that it's better to use a pointer API for basic input ranges, because then they can actually use pointers in those cases where that makes sense instead of requiring a Nullable type, much as it also needs to work to use a Nullable type.

- Jonathan M Davis




January 21
On Sunday, 21 January 2024 at 05:00:31 UTC, Jonathan M Davis wrote:
>
> 1. The range is a value type, so copying it results in two completely indepedent copies. Iterating through one has no effect on the other.
>
> 2. The range is a reference type, so copying it results in two completely dependent copies. They both share all of their state, so iterating through one iterates through the other, and they both always have the same state.
>
> 3. The range is a pseudo-reference type, and it has value semantics for its iteration state but reference semantics for some portion of the rest of its state. This means that if you iterate through a copy of the range, it has no effect on the original, so the range is a value type in that respect, but the rest of its state may or may not be copied by value (often - though not always - this means that the iteration state is directly within the range type, but the elements themselves are referred to via pointer, so mutating the actual elements would affect all copies of that range, but each range can be iterated independently). Dynamic arrays are the most common example of this sort of range.
>
> 4. The range is a pseudo-reference type, but it does not have value semantics with regards to its iteration state. So, when copying the range, some portion of its state is copied by value, but you don't end up with indendepent copies - and in fact, you don't get dependent copies either, since some of the state is copied by value. A common example of this sort of range which would be one which contains most of its state via a pointer or reference but which caches its front as a member variable within the struct. The result of this is that once you've mutated a copy of the range, you can no longer use the original, because mutating the copy only partially mutates the state of the original, leaving it in an inconsistent / invalid state.

For all these cases there is `.save` method which seems to be ignored in most of the code and compiler, when it should not be. It explicitly defines that range is copied irrelevant of the type of range you've mntioned. Since `.save` is defined on forward range, this means that pure input range must never be copyable, i.e. copy and post constructors must be disabled.

If this rule is followed, issues with copying would dissappear.

Even more drastic measure would be to disable copying on all types of ranges, and allow copying only through `.save` method, i.e. do explicit copying when it is intended to.

Best regards,
Alexandru.
January 21
On Sunday, 21 January 2024 at 10:58:32 UTC, Jonathan M Davis wrote:
> On Sunday, January 21, 2024 3:36:37 AM MST Per Nordlöw via Digitalmars-d wrote:
>> On Sunday, 21 January 2024 at 05:00:31 UTC, Jonathan M Davis
>>
>> wrote:
>> > I've been thinking about this for a while now, but with the ...
>>
>> Shall
>>
>> - `Nullable.opCast` to `bool`
>> - `Nullable.opUnary` `*`
>>
>> be added to Phobos' `std.typecons`?
>
> Maybe. It probably should be, though I don't know how many changes like that we're going to do to the current version of Phobos.

Imho, it must not be added, as this might conflate with contents of `Nullable!bool` types.
January 21
On Sunday, January 21, 2024 6:22:26 AM MST Alexandru Ermicioi via Digitalmars- d wrote:
> For all these cases there is `.save` method which seems to be ignored in most of the code and compiler, when it should not be. It explicitly defines that range is copied irrelevant of the type of range you've mntioned. Since `.save` is defined on forward range, this means that pure input range must never be copyable, i.e. copy and post constructors must be disabled.
>
> If this rule is followed, issues with copying would dissappear.
>
> Even more drastic measure would be to disable copying on all types of ranges, and allow copying only through `.save` method, i.e. do explicit copying when it is intended to.

And how on earth would you use a range if it can't be copied? They're copied all over the place and have to be to even be used. You copy them when you pass them to any range-based function. You copy them when you return them from a range-based function. You copy them when they get wrapped by another range. You copy them when you pass them to foreach. Disabling copying on ranges would make them borderline useless. And you also can't disable copying on classes or dynamic arrays. We could disallow classes which are ranges (even for basic input ranges) if we really wanted to make it so that they couldn't be copied, but we couldn't do the same with dynamic arrays. And really, ranges need to be copyable, because we need to be able to pass them around around and store them. And while ref could theoretically be used in some of those cases, in many of them, it could not (e.g. you have to copy a range once you wrap it in another type, and the ability to do that is crucial for most range-based algorithms).

Also, forward ranges are supposed to mimic dynamic arrays. C++ iterators are meant to mimic pointers, and D's ranges are meant to mimic dynamic arrays. The only reason that we even have save is to support ranges which are classes. A lot of the problems with range-based code come from code that assumes that a range behaves like a dynamic array when it doesn't. If we required that all forward ranges have the same copy semantics as dynamic arrays, then that problem is solved. Having save means treating forward ranges differently than we treat dynamic arrays, and it's completely unnecessary so long as we're willing to require that classes be wrapped in structs to be forward ranges. The ideal situation here is for ranges in general to have the same semantics as dynamic arrays with regards to copying, assignment, and iteration.

So, we can't disable copying for ranges in general and have them work, and even if we all agreed that disabling copying for ranges in general was a good idea, we still couldn't do that for dynamic arrays. So, we'd continue to have the problem that code would be written to be range-based but only worked with arrays, because that was what it was tested with. By requiring that forward ranges have the same semantics as dynamic arrays and treating basic input ranges as a separate thing, it becomes possible to write code that assumes the same semantics as dynamic arrays and then works with all forward ranges, which would significantly reduce the number of bugs that we get in range-based code.

- Jonathan M Davis



January 21

On Sunday, 21 January 2024 at 05:00:31 UTC, Jonathan M Davis wrote:

>

Ultimately, I'm inclined to argue that we should give basic input ranges a new API - not just because it would allow them to use reference counting, but also because the current input range API tends to force basic input ranges to cache the value of front (which if anything, encourages basic input ranges to be pseudo-reference types and could result in an extra layer of indirection in some cases if they're forced to be reference types). It would be annoying in some cases to require that different functions be used for basic input ranges and forward ranges (though overloading could obviously be used to avoid needing different names), but it's already often the case that code isn't really designed to work with both, and overloading on the category of range being used is already fairly common, since different range capabilities allow for different implementations. So, given that it would prevent whole classes of copying bugs as well as potentially remove the requirement to cache front for basic input ranges, I think that a separate API for basic input ranges is warranted. What I would propose for that would be a single function

auto next();

We could use java Iterator api in this case which has also bool hasNext() function.
Then new input range api can also be propagated to other types of range such as forward range and further.

>

where next returns a nullable type where the value returned is the next element in the range, with a null value being returned if the range is empty. The return type would then need to emulate a pointer - specifically, when casting it to bool, it would be true if it's non-null and false if it's null, and dereferencing it would give you the actual value if it's non-null (with it being undefined behavior if you dereference null). So, a basic input range of ints might define next as

int* next();

or alternatively, it could be something like

Nullable!int next();

The returned type should be a tagged union, that can allow storing of actual value inside, or pointer to it, while it's interface will hide the details, i.e. get would look like this: ref T get().

>

though that wouldn't work with Phobos' current Nullable type, since it doesn't support either casting to bool or dereferencing (probably because it originally used alias this). Either way, since we'd just require the specific API for the return type rather than requirng a pointer, the range type would have some flexibliity in what it used.

This would then mean that if you wanted to loop over a basic input range, you'd do something like

for(auto front = range.next; !front; front = range.next)
{
    ...
}

And if we go down this road, then we could also add this API to foreach, allowing for code such as

foreach(e; basicInputRange)
{
    ...
}

to work - and unlike now, you could rely on the copy semantics involved such that you would know that you could then break out of the loop and continue to use the range (whereas right now, you can't safely break out of a foreach and then continue to use the range that you were iterating over). Of course, for

Input ranges could just be disallowed in foreach statements, that would solve different semantics between them and forward ranges, just like how in Java it is done with Stream api.

>

What I'd like to get out of this thread is feedback on how much sense this idea does or doesn't make and what problems I'm missing that someone else is able to see. From what I can see, the main negative is simply that you can't then write code that works on both a basic input range and a forward range (though you can obviously still create function overloads so that the caller can use either), but given the issues surrounding copy semantics, I think that that's probably ultimately a good change (and the number of functions that can operate on basic input ranges is already pretty limited anyway in comparison to forward ranges - particularly with regards to generic algorithms). It will also make it much easier to discuss the separation between basic input ranges and forward ranges, which IMHO can be too easy to lose or confuse as things stand.

Imho, this proposal is complicated, and unnecessarily complicates construction of ranges, making them less appealing to implement in user code. I'd opt for restricting copying completely, and allow copying through .save only.

The .next method proposal is a good improvement though, with addition of .hasNext method at minimum.

January 21
On Sunday, January 21, 2024 6:26:16 AM MST Alexandru Ermicioi via Digitalmars- d wrote:
> On Sunday, 21 January 2024 at 10:58:32 UTC, Jonathan M Davis
>
> wrote:
> > On Sunday, January 21, 2024 3:36:37 AM MST Per Nordlöw via
> >
> > Digitalmars-d wrote:
> >> On Sunday, 21 January 2024 at 05:00:31 UTC, Jonathan M Davis
> >>
> >> wrote:
> >> > I've been thinking about this for a while now, but with the ...
> >>
> >> Shall
> >>
> >> - `Nullable.opCast` to `bool`
> >> - `Nullable.opUnary` `*`
> >>
> >> be added to Phobos' `std.typecons`?
> >
> > Maybe. It probably should be, though I don't know how many changes like that we're going to do to the current version of Phobos.
>
> Imho, it must not be added, as this might conflate with contents of `Nullable!bool` types.

It doesn't conflate with Nullable!bool at all, because the alias this was removed from Nullable. The only way to get the value out of Nullable is to use its get member. If we added dereferencing to it, then dereferencing it would then also work to get its value, but if we added an opCast to bool, casting would never give the Nullable's value even if the type of its value were bool. It would just be equivalent to negating the result of isNull.

I agree that if we still had alias this on Nullable, then adding opCast to bool to it would be problematic - and that's likely why it doesn't have that opCast, since its original API included alias this, but since the alias this is now gone, that's no longer a problem. There is no ambiguity.

- Jonathan M Davis




January 21
On Sunday, 21 January 2024 at 05:00:31 UTC, Jonathan M Davis wrote:
> From what I can see, the main negative is simply that you can't then write code that works on both a basic input range and a forward range (though you can obviously still create function overloads so that the caller can use either)

With the proposed design, it would be possible to implement next() for forward ranges as a UFCS function:

    auto next(FR)(FR fr)
        if (std.v2.range.isForwardRange!FR)
    {
        if (fr.empty)
        {
            return Nullable!(ElementType!FR).init;
        }
        else
        {
            scope(success) fr.popFront;
            return nullable(fr.front);
        }
    }

So, a function written to operate on basic input ranges would be able to accept any kind of range, same as today.
January 21
Thanks for bringing this up. Improving range semantics is something we should not pass up the opportunity for with Phobos3. (I consider Phobos1 the D1 version, Phobos2 the one we have currently.)

Another thing that should be visited is the ability to lock/unlock a range. For example, a range that represents stdout needs to be lockable.
« First   ‹ Prev
1 2 3 4 5 6 7