Jump to page: 1 2
Thread overview
Range length property
Apr 10, 2018
Nordlöw
Apr 10, 2018
Adam D. Ruppe
Apr 10, 2018
Nordlöw
Apr 10, 2018
Jonathan M Davis
Apr 10, 2018
Per Nordlöw
Apr 10, 2018
Cym13
Apr 10, 2018
Jonathan M Davis
Apr 10, 2018
H. S. Teoh
Apr 10, 2018
Jonathan M Davis
April 10, 2018
Should ranges always provide a length property?

If so, in which cases is a length property an advantage or a requirement?
April 10, 2018
On Tuesday, 10 April 2018 at 14:25:52 UTC, Nordlöw wrote:
> Should ranges always provide a length property?

No.

> If so, in which cases is a length property an advantage or a requirement?

Just provide it whenever it is cheap to do so. If you need to do complex calculations or especially loop over contents to figure out the length, do NOT provide it.

But if it is as simple as returning some value, provide it and algorithms can take advantage of it for optimizations etc. as needed.
April 10, 2018
On Tuesday, April 10, 2018 14:25:52 Nordlöw via Digitalmars-d-learn wrote:
> Should ranges always provide a length property?
>
> If so, in which cases is a length property an advantage or a requirement?

Whether a range has a length property or not is primarily dependent on how efficient it is to implement length. https://dlang.org/phobos/std_container.html lists the Big-O complexity of the various functions and properties that are expected for containers, and in the case of length, that also applies to ranges. It lists O(log n) as the Big-O complexity of length, so length should only be implemented if its Big-O complexity is no worse than O(log n). In most cases, that means implementing it only if it's O(1) (I think that it's O(log n) rather than O(1) because of binary trees), and length should certainly never be implemented if it's O(n). Basically, if you can just return the length without calculating it, then it makes sense to implement it, but if you have to calculate it, then in most cases, it shouldn't be there. Range-based functions are going to expect length to be very cheap to call if it is present, and any function that needs to ascertain the length of a range even if it doesn't have length will call walkLength (which returns length if present and iterates the entire range to count it if it's not). As iterating the entire range to count it, is O(n), most algorithms won't do that are more likely to require length if they need to know how many elements there are in the range, but that depends on what they're doing.

A function that requires a length property or which has a path optimized for ranges that have a length property checks for that with std.range.primitives.hasLength. Random access ranges are required to either be infinite or provide length, but for other range types, it's optional and must be checked for if an algorithm is going to use it.

And actually, a large percentage of ranges are lazy, in which case, it's pretty rare that they can have a length property, because you usually have no way of knowing how long they're going to be without actually iterating through the range. So, while it's not uncommon for a range to define length, it's very common that they don't.

- Jonathan M Davis


April 10, 2018
On Tuesday, 10 April 2018 at 14:34:40 UTC, Adam D. Ruppe wrote:
> On Tuesday, 10 April 2018 at 14:25:52 UTC, Nordlöw wrote:
>> Should ranges always provide a length property?
>
> No.
>
>> If so, in which cases is a length property an advantage or a requirement?
>
> Just provide it whenever it is cheap to do so. If you need to do complex calculations or especially loop over contents to figure out the length, do NOT provide it.
>
> But if it is as simple as returning some value, provide it and algorithms can take advantage of it for optimizations etc. as needed.

I'm thinking of my own container Hashmap having its range ByKeyValue requiring one extra word of memory to store the iteration count which, in turn, can be used to calculate the length of the remaining range. Is this motivated?
April 10, 2018
On Tuesday, April 10, 2018 19:47:10 Nordlöw via Digitalmars-d-learn wrote:
> On Tuesday, 10 April 2018 at 14:34:40 UTC, Adam D. Ruppe wrote:
> > On Tuesday, 10 April 2018 at 14:25:52 UTC, Nordlöw wrote:
> >> Should ranges always provide a length property?
> >
> > No.
> >
> >> If so, in which cases is a length property an advantage or a requirement?
> >
> > Just provide it whenever it is cheap to do so. If you need to do complex calculations or especially loop over contents to figure out the length, do NOT provide it.
> >
> > But if it is as simple as returning some value, provide it and algorithms can take advantage of it for optimizations etc. as needed.
>
> I'm thinking of my own container Hashmap having its range ByKeyValue requiring one extra word of memory to store the iteration count which, in turn, can be used to calculate the length of the remaining range. Is this motivated?

That would depend entirely on what you're trying to do, but in general, if a range has length, then some algorithms will be more efficient, and some algorithms do require length. So, if you can provide length, then the range will be more useful, just like a bidirectional range can be more useful than a forward range or a random-access range can be more useful than either. However, if you're not doing anything that ever benefits from it having length, then it doesn't buy you anything. So, it ultimately depends on what you're doing. In a general purpose library, I'd say that it should have length if it can do so in O(1), but if it's just for you, then it may or may not be worth it.

The other thing to consider is what happens when the container is mutated. I don't think that ranges necessarily behave all that well when an underlying container is mutated, but it is something that has to be considered when dealing with a range over a container. Even if mutating the underlying container doesn't necessarily invalidate a range, maintaining the length in the manner that you're suggesting probably makes it so that it would be invalidated in more cases, since if any elements are added or removed in the portion that was already popped off the range, then the iteration count couldn't be used to calculate the length in the same way anymore. Now, with a hash map, the range is probably fully invalidated when anything gets added or removed anyway, since that probably screws with the order of the elements in the range, but how the range is going to behave when the underlying container is mutated and how having the length property does or doesn't affect that is something that you'll need to consider.

- Jonathan M Davis


April 10, 2018
On 4/10/18 4:08 PM, Jonathan M Davis wrote:
> On Tuesday, April 10, 2018 19:47:10 Nordlöw via Digitalmars-d-learn wrote:
>> I'm thinking of my own container Hashmap having its range
>> ByKeyValue requiring one extra word of memory to store the
>> iteration count which, in turn, can be used to calculate the
>> length of the remaining range. Is this motivated?
> 
> That would depend entirely on what you're trying to do, but in general, if a
> range has length, then some algorithms will be more efficient, and some
> algorithms do require length.

e.g. std.array.array is going to pre-allocate an array of the correct length and fill it in, vs. appending each element as it gets them from the range.

Personally, I would store the length because typically a container range is short-lived. It also jives with the container itself which likely has O(1) length.

-Steve
April 10, 2018
On Tuesday, 10 April 2018 at 20:16:18 UTC, Steven Schveighoffer wrote:
> e.g. std.array.array is going to pre-allocate an array of the correct length and fill it in, vs. appending each element as it gets them from the range.
>
> Personally, I would store the length because typically a container range is short-lived. It also jives with the container itself which likely has O(1) length.

Thanks, that's what I thought too.
April 10, 2018
On Tuesday, 10 April 2018 at 20:08:14 UTC, Jonathan M Davis wrote:
> On Tuesday, April 10, 2018 19:47:10 Nordlöw via Digitalmars-d-learn wrote:
>> On Tuesday, 10 April 2018 at 14:34:40 UTC, Adam D. Ruppe wrote:
>> > On Tuesday, 10 April 2018 at 14:25:52 UTC, Nordlöw wrote:
>> >> Should ranges always provide a length property?
>> >
>> > No.
>> >
>> >> If so, in which cases is a length property an advantage or a requirement?
>> >
>> > Just provide it whenever it is cheap to do so. If you need to do complex calculations or especially loop over contents to figure out the length, do NOT provide it.
>> >
>> > But if it is as simple as returning some value, provide it and algorithms can take advantage of it for optimizations etc. as needed.
>>
>> I'm thinking of my own container Hashmap having its range ByKeyValue requiring one extra word of memory to store the iteration count which, in turn, can be used to calculate the length of the remaining range. Is this motivated?
>
> That would depend entirely on what you're trying to do, but in general, if a range has length, then some algorithms will be more efficient, and some algorithms do require length. So, if you can provide length, then the range will be more useful, just like a bidirectional range can be more useful than a forward range or a random-access range can be more useful than either. However, if you're not doing anything that ever benefits from it having length, then it doesn't buy you anything. So, it ultimately depends on what you're doing. In a general purpose library, I'd say that it should have length if it can do so in O(1), but if it's just for you, then it may or may not be worth it.
>
> The other thing to consider is what happens when the container is mutated. I don't think that ranges necessarily behave all that well when an underlying container is mutated, but it is something that has to be considered when dealing with a range over a container. Even if mutating the underlying container doesn't necessarily invalidate a range, maintaining the length in the manner that you're suggesting probably makes it so that it would be invalidated in more cases, since if any elements are added or removed in the portion that was already popped off the range, then the iteration count couldn't be used to calculate the length in the same way anymore. Now, with a hash map, the range is probably fully invalidated when anything gets added or removed anyway, since that probably screws with the order of the elements in the range, but how the range is going to behave when the underlying container is mutated and how having the length property does or doesn't affect that is something that you'll need to consider.
>
> - Jonathan M Davis

I find that discussion very interesting as I had never considered that because of design by introspection having a costly length method would lead to unexpected calls by generic algorithms making it a disadventage if present.

On the other hand I don't think the end user should have to scratch his head to find the length of a range, especially if it's not trivial to get (say, O(log n) kind of case). Therefore exposing a method in any case seems the best from an API perspective.

But to avoid the performance issues mentionned earlier it means it should bear a different name (get/setLength comes to mind). I believe this is the same kind of issue that lead to having "in" for associative arrays but not regular ones. However this also leads to less coherent APIs in contradiction with the principle of least surprise.

In retrospect since only "unexpected" calls to such methods cause the issue I wonder if it wouldn't be best to have an UDA saying "Hey, please, this method is costly, if you're a generic template performing introspection you should probably not call me". And writing that Andrei's work on complexity annotations comes to mind. Anyway, I don't think the user should use different names just to alleviate an issue on the library side but the alternative would be costly to put in place...

Any thoughts?
April 10, 2018
On Tuesday, April 10, 2018 22:07:40 Cym13 via Digitalmars-d-learn wrote:
> On Tuesday, 10 April 2018 at 20:08:14 UTC, Jonathan M Davis wrote:
> > On Tuesday, April 10, 2018 19:47:10 Nordlöw via
> >
> > Digitalmars-d-learn wrote:
> >> On Tuesday, 10 April 2018 at 14:34:40 UTC, Adam D. Ruppe wrote:
> >> > On Tuesday, 10 April 2018 at 14:25:52 UTC, Nordlöw wrote:
> >> >> Should ranges always provide a length property?
> >> >
> >> > No.
> >> >
> >> >> If so, in which cases is a length property an advantage or a requirement?
> >> >
> >> > Just provide it whenever it is cheap to do so. If you need to do complex calculations or especially loop over contents to figure out the length, do NOT provide it.
> >> >
> >> > But if it is as simple as returning some value, provide it and algorithms can take advantage of it for optimizations etc. as needed.
> >>
> >> I'm thinking of my own container Hashmap having its range ByKeyValue requiring one extra word of memory to store the iteration count which, in turn, can be used to calculate the length of the remaining range. Is this motivated?
> >
> > That would depend entirely on what you're trying to do, but in general, if a range has length, then some algorithms will be more efficient, and some algorithms do require length. So, if you can provide length, then the range will be more useful, just like a bidirectional range can be more useful than a forward range or a random-access range can be more useful than either. However, if you're not doing anything that ever benefits from it having length, then it doesn't buy you anything. So, it ultimately depends on what you're doing. In a general purpose library, I'd say that it should have length if it can do so in O(1), but if it's just for you, then it may or may not be worth it.
> >
> > The other thing to consider is what happens when the container is mutated. I don't think that ranges necessarily behave all that well when an underlying container is mutated, but it is something that has to be considered when dealing with a range over a container. Even if mutating the underlying container doesn't necessarily invalidate a range, maintaining the length in the manner that you're suggesting probably makes it so that it would be invalidated in more cases, since if any elements are added or removed in the portion that was already popped off the range, then the iteration count couldn't be used to calculate the length in the same way anymore. Now, with a hash map, the range is probably fully invalidated when anything gets added or removed anyway, since that probably screws with the order of the elements in the range, but how the range is going to behave when the underlying container is mutated and how having the length property does or doesn't affect that is something that you'll need to consider.
> >
> > - Jonathan M Davis
>
> I find that discussion very interesting as I had never considered that because of design by introspection having a costly length method would lead to unexpected calls by generic algorithms making it a disadventage if present.
>
> On the other hand I don't think the end user should have to scratch his head to find the length of a range, especially if it's not trivial to get (say, O(log n) kind of case). Therefore exposing a method in any case seems the best from an API perspective.
>
> But to avoid the performance issues mentionned earlier it means it should bear a different name (get/setLength comes to mind). I believe this is the same kind of issue that lead to having "in" for associative arrays but not regular ones. However this also leads to less coherent APIs in contradiction with the principle of least surprise.
>
> In retrospect since only "unexpected" calls to such methods cause the issue I wonder if it wouldn't be best to have an UDA saying "Hey, please, this method is costly, if you're a generic template performing introspection you should probably not call me". And writing that Andrei's work on complexity annotations comes to mind. Anyway, I don't think the user should use different names just to alleviate an issue on the library side but the alternative would be costly to put in place...
>
> Any thoughts?

In general, if you care about efficient code, it becomes critical that anything that's going to be used in generic code has well-known Big-O complexity, which is why C++ cares about that sort of thing with its containers and why Andrei specifically put the Big-O complexity of all of the generic container operations in std.container. And that extends to ranges. If ranges are implemented without any care about the algorithmic complexity of the operations, then performance is ultimately going to tank. And in the case of ranges, it's really not all that hard to understand the algorithmic complexity, because pretty much all of the range API functions are supposed to be O(1). Certainly, having any of them be O(n) would be a disaster.

Ranges can expose other functions if they want to, but they're not part of the range API and thus would not be useful in any code that wasn't written specifically for that range type. And honestly, at some point, it makes more sense to just allocate a dynamic array and get the full range API than to try and find ways to have inefficient versions of any of the functions in the range API.

In the case of length, we have walkLength. So, anything that can't have an efficient length can have its length obtained using walkLength, and any code that doesn't care whether getting the length is efficient can call walkLength rather than ever dealing with length, since walkLength will use length if it's present. So, I don't think that it makes any sense to try and implement alternate function names for length. That problem has already been solved.

As for in, canFind and find solve the problem.

- Jonathan M Davis


April 10, 2018
On Tue, Apr 10, 2018 at 10:07:40PM +0000, Cym13 via Digitalmars-d-learn wrote: [...]
> On the other hand I don't think the end user should have to scratch his head to find the length of a range, especially if it's not trivial to get (say, O(log n) kind of case). Therefore exposing a method in any case seems the best from an API perspective.
> 
> But to avoid the performance issues mentionned earlier it means it should bear a different name (get/setLength comes to mind). I believe this is the same kind of issue that lead to having "in" for associative arrays but not regular ones. However this also leads to less coherent APIs in contradiction with the principle of least surprise.
[...]

I've run into this in my own code, and I've been using `walkLength` as the name of the method, just for consistency with Phobos' walkLength. I'm not 100% sure this is a good idea, since overloading Phobos names can sometimes lead to annoying symbol conflict situations.  But the one good thing is that you won't forget what it's called because it's so familiar.


T

-- 
An elephant: A mouse built to government specifications. -- Robert Heinlein
« First   ‹ Prev
1 2