Thread overview
Re: Proposed Changes to the Range API for Phobos v3
May 17
On Fri, May 17, 2024 at 03:33:26AM +1200, Richard (Rikki) Andrew Cattermole via Digitalmars-d wrote:
> Two things I suggest we consider, given the write up:
> 
> 1. Automatic boxing of slices into a struct, this was something I was considering for signatures specifically for ranges.
[...]

What's the purpose of this, other than to add friction to use built-in arrays as ranges?  What do we get in return for paying this price?


T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
May 18
On 18/05/2024 2:28 AM, H. S. Teoh wrote:
> On Fri, May 17, 2024 at 03:33:26AM +1200, Richard (Rikki) Andrew Cattermole via Digitalmars-d wrote:
>> Two things I suggest we consider, given the write up:
>>
>> 1. Automatic boxing of slices into a struct, this was something I was
>> considering for signatures specifically for ranges.
> [...]
> 
> What's the purpose of this, other than to add friction to use built-in
> arrays as ranges?  What do we get in return for paying this price?

There would be no friction for users of ranges in PhobosV3 they would all be annotated to turn on the boxing automatically.

User made ones would need annotating however, that could be a bit of a pain.

It is a pain for user made ones already in v2, since you have to make sure you have the right import and don't use it in a way that UFCS cannot work.

So it solves a number of issues like conflicting definitions as described by J.M.D.

However he isn't keen on it, so as a proposal it'll probably end at my suggestion to consider it.
May 18
On Thu, May 16, 2024 at 08:56:55AM -0600, Jonathan M Davis via Digitalmars-d wrote: [...]
> 1. The easy one is that the range API functions for dynamic arrays will not treat arrays of characters as special. A dynamic array of char will be a range of char, and a dynamic array of wchar will be a range of wchar.
> 
> Any code that needs to decode will need to use the phobos v3 replacement for std.utf's decode or decodeFront - or use foreach - to decode the code units to code points (and if it needs to switch encodings, then there will be whatever versions of byUTF, byChar, etc. that the replacement for std.utf will have).

I thought we already have this?  std.string.byRepresentation, std.uni.byCodePoint, std.uni.byGrapheme already fill this need.


[...]
> However, with infinite ranges, there is no such solution. If they cannot be default-initialized, then they either can't be ranges, or they would have to be finite ranges which would just never be empty if they're constructed at runtime (while doing something like the flag trick to make their init value empty). And it's certainly true that the range API doesn't (and can't) guarantee that finite ranges are truly finite, but it's still better if we can define infinite ranges that need to be constructed at runtime as infinite ranges, since then we can get the normal benefits that come from statically knowing that a range is infinite.

Infinite ranges also have the peculiarity that slicing may create a finite range, i.e., the underlying type changes. That's another wrinkle to deal with.


[...]
> 4. All ranges must be either dynamic arrays or structs (and not
> pointers to structs either).

This is not necessarily an ideal solution.  In my own code I've often had to iterate over forward ranges via sub functions, where I expect the iteration state after returning from the sub function to be retained. There are two ways to do this:

1) Have two versions of every iteration function, one taking the range by value (with implicit saving of current iteration state), the other taking the range by reference (retain changes to iteration state upon return), which leads to a lot of code duplication; or:

2) Have a single version of each function and pass the range by reference, usually by passing a pointer to it (since the current API would transparently treat the pointer as the range itself).

Prohibiting pointers eliminates option (2), and leaves me with the
non-ideal situation (1) where I need lots of code duplication.

Although, come to think of it, we could have a .byRef range wrapper that encapsulates a pointer to the range so that changes to iteration state would be preserved.  But then it begs the question, why not just allow pointers in the first place?  Why require jumping through extra hoops?


> 11. Finite random-access ranges are required to implement opDollar, and their opIndex must work with $. Similarly, any ranges which implement slicing must implement opDollar, and slicing must work with $.
> 
> In most cases, this will just be an alias to length, but barring a language change that automatically treats length as opDollar (which has been discussed before but has never materialized and is somewhat controversial given types where it wouldn't make sense to treat length as opDollar), we have to require that opDollar be defined, or generic code won't be able to use $ with indexing or slicing. We probably would have required it years ago except that it would have broken code to add the requirement.
[...]

Will this also require implementing arithmetic operators on the return
type of opDollar? Otherwise things like r[0 .. $-1] still wouldn't
work correctly. Or r[0 .. complicatedMathFunc(($-1)/2)].


--T
May 18

On Saturday, 18 May 2024 at 14:26:18 UTC, H. S. Teoh wrote:

>

On Thu, May 16, 2024 at 08:56:55AM -0600, Jonathan M Davis via Digitalmars-d wrote: [...]

>
  1. The easy one is that the range API functions for dynamic arrays will not treat arrays of characters as special. A dynamic array of char will be a range of char, and a dynamic array of wchar will be a range of wchar.

Any code that needs to decode will need to use the phobos v3 replacement for std.utf's decode or decodeFront - or use foreach - to decode the code units to code points (and if it needs to switch encodings, then there will be whatever versions of byUTF, byChar, etc. that the replacement for std.utf will have).

I thought we already have this? std.string.byRepresentation, std.uni.byCodePoint, std.uni.byGrapheme already fill this need.

Yes, all those would be present, except possibly byRepresentation.

I think the message is that somestr.front will not decode, you need to use decodeFront or whatnot.

>

[...]

>

However, with infinite ranges, there is no such solution. If they cannot be default-initialized, then they either can't be ranges, or they would have to be finite ranges which would just never be empty if they're constructed at runtime (while doing something like the flag trick to make their init value empty). And it's certainly true that the range API doesn't (and can't) guarantee that finite ranges are truly finite, but it's still better if we can define infinite ranges that need to be constructed at runtime as infinite ranges, since then we can get the normal benefits that come from statically knowing that a range is infinite.

Infinite ranges also have the peculiarity that slicing may create a finite range, i.e., the underlying type changes. That's another wrinkle to deal with.

The hasSlicing test is very hard to understand. But it looks like it either it's infinite, or the slice must be the same type as the range itself.

So I think we already are dealing with this, and I wouldn't expect it to change.

Clearly, .init is going to be the same type, and we can exploit that for emptying a range.

I wouldn't expect the rules for hasSlicing to change.

>

[...]

>
  1. All ranges must be either dynamic arrays or structs (and not
    pointers to structs either).
>

Although, come to think of it, we could have a .byRef range wrapper that encapsulates a pointer to the range so that changes to iteration state would be preserved. But then it begs the question, why not just allow pointers in the first place? Why require jumping through extra hoops?

It is impossible to hook the copying of pointers. And in this case, in order to prevent undue aliasing, we must prohibit copying of input ranges (which pointers to ranges are, even if the range itself is a forward range).

The answer, as you said, is to make a generic byRef range wrapper which is non-copyable (and hence, a proper input range).

The point is to remove the requirement for save and use the copyability to determine if a range is a forward range. This makes sense, since most people forget to call save anyway, and just count on copying being the same thing. We should just hook the thing that people use.

> >
  1. Finite random-access ranges are required to implement opDollar, and their opIndex must work with $. Similarly, any ranges which implement slicing must implement opDollar, and slicing must work with $.

In most cases, this will just be an alias to length, but barring a language change that automatically treats length as opDollar (which has been discussed before but has never materialized and is somewhat controversial given types where it wouldn't make sense to treat length as opDollar), we have to require that opDollar be defined, or generic code won't be able to use $ with indexing or slicing. We probably would have required it years ago except that it would have broken code to add the requirement.
[...]

Will this also require implementing arithmetic operators on the return
type of opDollar? Otherwise things like r[0 .. $-1] still wouldn't
work correctly. Or r[0 .. complicatedMathFunc(($-1)/2)].

It appears that the intention in current range code (in hasSlicing at least) is that $ must be size_t. But I'm not sure.

But if that is the intention, obviously size_t has well-defined behavior.

-Steve

May 18
On Saturday, May 18, 2024 2:21:01 PM MDT Steven Schveighoffer via Digitalmars-d wrote:
> The `hasSlicing` test is very hard to understand. But it looks like it either it's infinite, or the slice must be the same type as the range itself.
>
> So I think we already are dealing with this, and I wouldn't expect it to change.
>
> Clearly, `.init` is going to be the same type, and we can exploit that for emptying a range.
>
> I wouldn't expect the rules for hasSlicing to change.

The rules will have to change somewhat, but the way that they will change is to bring them in line with how most folks would likely expect slicing to work. At least with finite ranges, slicing a range should behave like you'd expect with dynamic arrays so that you really don't have to think about it, and code that was written to work with dynamic arrays would then just work with ranges (at least as far as slicing goes).

Right now, what hasSlicing essentially requires is

1. If the range is finite, slicing the range must result in the same type, that type must have length, and length must be size_t (which therefore indirectly requires that a finite range with slicing have length).

2. If the range is infinite, slicing the range with indices must result in a finite range with length, and length must be size_t (though it seems to fail to require that the element type be the same or that the resulting type have slicing).

3. Slicing with indices has to result in a forward range, so that indirectly requires that finite ranges with slicing be forward ranges, but it puts infinite ranges in the bizarre position that they can have slicing as only basic input ranges so long as slicing them with indices somehow is then able to result in a forward range.

4. If slicing the range with an index and $ (e.g. range[0 .. $]) compiles, it must result in a range with the same type, but slicing the range with an index and $ does not actually need to compile.

5. If range[0 .. $ - 1] compiles (whether the range is finite or infinite), then the result must be the same type. There is no requirement that range[0 .. $ - 1] compiles if range[0 .. $] compiles or that range[0 .. $] compiles if range[0 .. $ - 1] compiles.

And actually figuring out that that's what hasSlicing does is annoyingly difficult. It's definitely a mess. As it stands, it wouldn't surprise me if I screwed up something in what I just listed, though I _think_ that I have it right.

Either way, since hasSlicing does not actually require that $ work, we can't actually use it in generic code right now, so the current checks for it are pretty pointless.

What we need is for finite ranges to be sliceable with basically the same semantics as dynamic arrays, and for that to work, not only does $ need to work with slicing, but at least some of the basic arithmetic operations need to work (ideally, no one would really need to think about what they can do with $ and ranges; they'd just do what they do with arrays).

And for infinite ranges, we need $ to work, but it really only makes sense for it to work without arithmetic, since doing arithmetic on infinity is pointless. For infinite ranges, $ is really just a marker for the end of the range and not an index. And just like now, slicing an infinite range should result in a finite range if $ isn't used, and it should result in the same range type if $ is used.

So, the rules become something like

1. Ranges with slicing should be forward ranges (or copyable ranges as
they'll likely be renamed to).

2. Finite ranges should have length, and length should evaluate to size_t.

3. For finite ranges, range[0 .. 0] and range[0 .. $] need to compile and result in the same type. In addition to that, we need the set of arithmetic operations that folks would typically expect to work with dynamic arrays to work, which I think includes subtraction, division, and comparison - e.g. range[0 .. $ - 1], range[0 .. $ / 2], range[0 .. $ == 1 ? 0 : $], and range[0 .. $ < 10 ? $ : 10].

4. For infinite ranges, range[0 .. 0] must compile and result in a finite range with the same element type, and that finite range must also have slicing.

5. For infinite ranges, range[0 .. $] must compile and result in the same range type. No arithmetic or comparison operations will be checked for with $ and infinite ranges.

All in all, I think that that's straightforward and as expected with the main complication being the exact operations that $ supports with finite ranges (which I unfortunately had not thought about in great detail until monkyyy brought it up in this thread). And there are two ways that I see to handle that

1. We figure out the set of arithmetic operations that we want to guarantee work (and I _think_ that that list is subtraction, division, and the comparison operators).

2. We just require that opDollar evaluate to size_t, which would effectively require that it be an alias to length even if it technically wouldn't have to be.

I was thinking that we would need to do #1 to support some edge cases, but since length already has to be size_t, I'm increasingly tempted to just require that $ be size_t as well, since then we don't have to do _any_ work to make sure that $ works with the same operations that $ works with for arrays. And realistically, in the vast majority of cases, opDollar is just going to be an alias of length anyway. It would also have the benefit of making the trait that much shorter and cleaner, so understanding the requirements would be much easier. The only real question is whether there are any reasonable ranges that we would then be disallowing (and I don't think that there would be, but I'm not entirely sure).

Either way, opDollar for infinite ranges would not require any arithmetic and could not be size_t, since that would mean using actual indices for both ends of the range, which would result in a finite range.

And regardless of the exact set of requirements, the updated trait needs to written much more clearly than what we currently have.

Actually, a related issue is whether random-access ranges should require slicing. Slicing can work with forward ranges (presumably because of edge cases where you can reasonably slice a forward range but can't reasonably provide the elements in the middle without iterating through the preceding elements first) even though you'd usually associate slicing with random-access ranges. However, while slicing is typically associated with random-access ranges, isRandomAccessRange does not actually require it. I'm very tempted to add that requirement in order to simplify things, though that would likely result in hasSlicing being used almost not at all (since slicing anything other than a random-access range is definitely an edge case), but I'm not 100% sure that there isn't some weird edge case where it's reasonable to be able to randomly access a range's elements, but it's not reasonable to be able to slice the range.

- Jonathan M Davis



May 18
On Saturday, May 18, 2024 8:26:18 AM MDT H. S. Teoh via Digitalmars-d wrote:
> On Thu, May 16, 2024 at 08:56:55AM -0600, Jonathan M Davis via Digitalmars-d wrote: [...]
>
> > 1. The easy one is that the range API functions for dynamic arrays will
> > not
> > treat arrays of characters as special. A dynamic array of char will be a
> > range of char, and a dynamic array of wchar will be a range of wchar.
> >
> > Any code that needs to decode will need to use the phobos v3 replacement for std.utf's decode or decodeFront - or use foreach - to decode the code units to code points (and if it needs to switch encodings, then there will be whatever versions of byUTF, byChar, etc. that the replacement for std.utf will have).
>
> I thought we already have this?  std.string.byRepresentation, std.uni.byCodePoint, std.uni.byGrapheme already fill this need.

Yes, but Phobos v3 will need its own solutions for those - even if it's simply porting over the existing ones and making them work with the new range API - because nothing in Phobos v3 is going to depend on Phobos v2 at all. Most of Phobos v3 is likely to be a cleaned up version of what's in Phobos v2 rather than something completely different (though in some cases, there will be more drastic changes), but Phobos v3 is intended to ultimately replace Phobos v2 so that you won't need to use it for anything any longer (much as v2 will stick around long term so that old code continues to compile).

The point here is that because auto-decoding will be gone, you'll need to explicitly call the functions for decoding if you want decoding, and if you want graphemes or some other conversion, you'll need to call the appropriate functions for that. The new range API itself does not deal with them at all. We do already have such functions for Phobos v2, but anyone using the range API from Phobos v3 will be calling the Phobos v3 versions of those functions.

>
> [...]
>
> > However, with infinite ranges, there is no such solution. If they cannot be default-initialized, then they either can't be ranges, or they would have to be finite ranges which would just never be empty if they're constructed at runtime (while doing something like the flag trick to make their init value empty). And it's certainly true that the range API doesn't (and can't) guarantee that finite ranges are truly finite, but it's still better if we can define infinite ranges that need to be constructed at runtime as infinite ranges, since then we can get the normal benefits that come from statically knowing that a range is infinite.
>
> Infinite ranges also have the peculiarity that slicing may create a finite range, i.e., the underlying type changes. That's another wrinkle to deal with.

Yes, though I think that the decision that I've gone with of making it so that you're allowed to have finite and infinite ranges disable default initialization while requiring that their init value be valid if they don't disable it (and further requiring that init be empty for finite ranges) avoids any real complications with slicing.

Any infinite range with slicing will have to define a finite range that follows all of the rules for finite ranges, which would include the requirements for initialization, but since the programmer has the full freedom to implement it like they would any other finite range, I don't think there's really much different about such ranges. Worst case, it's similar to a finite range that needs be constructed at runtime to really function properly (since those can still support default initialization by adding a flag to indicate whether they were default-initialized or not and then just have empty be true if they were).

Now, if we were to require that all ranges be default-initializable, since some infinite ranges can't work that way, those ranges would then have to be implemented as finite ranges that were never actually empty, and since slicing has to return the same type for finite ranges, they would end up with a more complex implementation, because the range would presumably have to check internally whether it was the "infinite" version or the truly finite version in portions of its code. So, that approach would be uglier because of that, but it could still be implemented.

If we required that finite ranges be default-initializable but allowed infinite ranges to disable default initialization, then an infinite range with slicing which disabled default initatialization _would_ have to be able to return a finite range which could be default-initialized, and that would then require doing something like using a Nullable to contain the infinite range that it would presumably be wrapping so that the finite range could be default-initialized (as well as being able to determine that it was empty when the infinite range hadn't been initialized). So, that would also be more complicated.

However, in both of those cases, it _would_ still be possible to correctly implement the finite range. It would just be annoying, whereas in the first case, it would be impossible to implement the infinite range as an infinite range if it can't be default-initialized and work correctly. So, that's definitely the larger issue.

In any case, going with the rule where we allow ranges to disable default initialization but require that their init value be valid if they can be default-initialized (and require that the init value be empty for finite ranges) avoids all of those complications. It does have the downside of generic range-based code having to deal with the possibility of ranges not being default-initializable, but that is a general problem with generic code rather than a range-specific one, and the solutions for it are not range-specific.

>
> [...]
>
> > 4. All ranges must be either dynamic arrays or structs (and not
> > pointers to structs either).
>
> This is not necessarily an ideal solution.  In my own code I've often had to iterate over forward ranges via sub functions, where I expect the iteration state after returning from the sub function to be retained. There are two ways to do this:
>
> 1) Have two versions of every iteration function, one taking the range by value (with implicit saving of current iteration state), the other taking the range by reference (retain changes to iteration state upon return), which leads to a lot of code duplication; or:
>
> 2) Have a single version of each function and pass the range by reference, usually by passing a pointer to it (since the current API would transparently treat the pointer as the range itself).
>
> Prohibiting pointers eliminates option (2), and leaves me with the
> non-ideal situation (1) where I need lots of code duplication.
>
> Although, come to think of it, we could have a .byRef range wrapper that encapsulates a pointer to the range so that changes to iteration state would be preserved.  But then it begs the question, why not just allow pointers in the first place?  Why require jumping through extra hoops?

The short answer to this is that we cannot allow forward ranges to be reference types and get rid of save, and the fact that save has been part of the range API has proven to be a big mistake. Along with the removal auto-decoding, it's the #1 change that we've talked about wanting to do for years now. I wrote a really long reply to this before explaining a bunch of details about why locking down the copy semantics matter and deleted it, because it was getting way too long, but ultimately, the fact of the matter is that if we want to get rid of save, we have no choice but to require that copying forward ranges results in an independent copy. If you can make that work with some sort of wrapper type without violating the range API, then have at it (though my experience with RefRange is that it was a big mistake on my part and that attempting such semantics is too error-prone). Either way, save needs to go.

And the reality of the matter is that any code that currently passes a range to a function by value and then does _anything_ with that range after that is relying on the copy semantics of that particular range type as well as the current implementation of the function being called, since generic code cannot rely on the copy semantics of ranges, and if the function was going to make any promises about the state of the range afterwards, it would have taken it by ref. So, unless you control all of the code in question, you're relying on unspecified behavior and are just lucky that it happens to work right now. And if you are in full control of the code, then you can do whatever you want regardless of what the range API says, though Phobos functions will be statically enforcing as much of the range API as they can, so it's true you won't be able to do stuff like use a reference type range with any Phobos code.

But in order get rid of save and lock down the copy and assignment semantics of ranges so that generic code can actually rely on those semantics, reference type ranges have to be banned - or they'll have to be non-copyable ranges. We _could_ make it so that basic input ranges are required to be reference types instead of non-copyable, but that creates a different set of issues, and it wouldn't help you if you want forward ranges which are reference types. Those simply aren't possible if we get rid of save unless we force all forward ranges to be reference types, which would mean having to wrap arrays to make them be ranges as well as significantly increasing how much D code would need to allocate on the heap.

> > 11. Finite random-access ranges are required to implement opDollar, and their opIndex must work with $. Similarly, any ranges which implement slicing must implement opDollar, and slicing must work with $.
> >
> > In most cases, this will just be an alias to length, but barring a language change that automatically treats length as opDollar (which has been discussed before but has never materialized and is somewhat controversial given types where it wouldn't make sense to treat length as opDollar), we have to require that opDollar be defined, or generic code won't be able to use $ with indexing or slicing. We probably would have required it years ago except that it would have broken code to add the requirement.
>
> [...]
>
> Will this also require implementing arithmetic operators on the return
> type of opDollar? Otherwise things like r[0 .. $-1] still wouldn't
> work correctly. Or r[0 .. complicatedMathFunc(($-1)/2)].

I have a detailed reply on this topic in my reply to Steven, but the short answer is that yes, they'll be required for opDollar with finite ranges (they make no sense for infinite ranges), but it's probably going to make sense to just require that opDollar evaluate to size_t, since length already has to be size_t, and I don't think that we want to change that.

- Jonathan M Davis