How is chunkBy supposed to behave on copy
March 18, 2020
```d
void main()
{
    import std;

    auto chunks = [14, 16, 18, 20, 22, 20, 18, 16]
        .chunkBy!((a, b) => a / 10 == b / 10);

    // [[14, 16, 18], [20, 22, 20], [18, 16]]
    chunks
        .map!(chunk => chunk.array)
        .array
        .writeln;

    // []
    chunks
        .map!(chunk => chunk.array)
        .array
        .writeln;
}
```

This is how chunkBy[1] currently behaves when copied. It essentially behaves like a reference range: it saves its state only when `save` is explicitly called, not otherwise, even if the chunked source range is a forward range with value semantics.

As most Phobos ranges preserve the value semantics of their source ranges, this is an odd exception, especially as the documentation says nothing about it.

So the question is, how should it behave? I have looked at the implementation; it would be trivial to have it use value semantics. On the other hand, someone may be relying on the present behaviour.

[1] https://dlang.org/phobos/std_algorithm_iteration.html#.chunkBy
March 18, 2020
On Wednesday, 18 March 2020 at 14:57:41 UTC, Dukc wrote:
> So the question is, how should it behave? I have looked at the implementation; it would be trivial to have it use value semantics. On the other hand, someone may be relying on the present behaviour.
>
> [1] https://dlang.org/phobos/std_algorithm_iteration.html#.chunkBy

As I understand it, the result of using a range after copying it is unspecified, and relying on any particular behavior in that case is a bug. So changing the behavior will only break code that is (in some sense) "already broken."

Of course, given the above, the question of how it "should" behave is rendered moot. If you're copying a range, you must either use .save, or only use the copy and not the original from that point forward. In fact, if you want to catch as many errors as possible at compile time, the strictest approach would be to mark the range as non-copyable (@disable this(this)) and require the use of either .save or move to copy it.
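
A minimal sketch of that strict approach (the `NoCopy` wrapper is hypothetical, not part of Phobos): disabling the postblit turns every implicit copy into a compile-time error, leaving `.save` and `move` as the only ways to duplicate or hand off the range.

```d
import std.range;
import std.algorithm.mutation : move;

// Hypothetical wrapper, not part of Phobos: implicit copies are a
// compile-time error, so duplication is always explicit.
struct NoCopy(R) if (isForwardRange!R)
{
    R inner;

    @disable this(this);  // forbid implicit postblit copies

    bool empty() { return inner.empty; }
    auto front() { return inner.front; }
    void popFront() { inner.popFront(); }
    auto save() { return NoCopy(inner.save); }  // the only way to duplicate
}

void main()
{
    auto r = NoCopy!(int[])([1, 2, 3]);
    // auto oops = r;     // compile error: NoCopy is not copyable
    auto saved = r.save;  // OK: explicit, well-defined duplicate
    saved.popFront();
    assert(r.front == 1 && saved.front == 2);

    auto moved = move(r); // OK: ownership transfer instead of a copy
    assert(moved.front == 1);
}
```

The commented-out line fails to compile, which is exactly the point: an accidental copy is caught before it can silently share or desync state.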
March 18, 2020
On 3/18/20 10:57 AM, Dukc wrote:
> ```
> void main()
> {    import std;
>      auto chunks = [14, 16, 18, 20, 22, 20, 18, 16]
>      .chunkBy!((a, b) => a / 10 == b / 10);
> 
>      //[[14, 16, 18], [20, 22, 20], [18, 16]]
>      chunks
>      .map!(chunk => chunk.array)
>      .array
>      .writeln;
> 
>      //[]
>      chunks
>      .map!(chunk => chunk.array)
>      .array
>      .writeln;
> }
> ```
> 
> This is how chunkBy[1] currently behaves when copied. It essentially behaves like a reference range: it saves its state only when `save` is explicitly called, not otherwise, even if the chunked source range is a forward range with value semantics.
> 
> As most Phobos ranges preserve the value semantics of their source ranges, this is an odd exception, especially as the documentation says nothing about it.
> 
> So the question is, how should it behave? I have looked at the implementation; it would be trivial to have it use value semantics. On the other hand, someone may be relying on the present behaviour.
> 
> [1] https://dlang.org/phobos/std_algorithm_iteration.html#.chunkBy

According to the range spec, save must be called if you wish to preserve state.

In practice, this is seldom done, because most of the time, save just does `return this;`, so copying works just the same.

The short answer is, use save.
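
Applied to the opening example, a sketch of that advice (assuming `.save` decouples the shared state, as the forward-range spec requires): take an explicit `.save` copy for the first traversal, and the second traversal still sees every chunk.

```d
import std;

void main()
{
    auto chunks = [14, 16, 18, 20, 22, 20, 18, 16]
        .chunkBy!((a, b) => a / 10 == b / 10);

    // Traverse an explicit copy; the original's state is untouched.
    auto first = chunks.save
        .map!(chunk => chunk.array)
        .array;
    assert(first == [[14, 16, 18], [20, 22, 20], [18, 16]]);

    // The original still yields every chunk.
    auto second = chunks
        .map!(chunk => chunk.array)
        .array;
    assert(second == first);
}
```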

The long answer is, chunkBy uses reference counting internally to store the range data. I'm not sure why this is, as it seems possible to do without this mechanism. It seems there is some optimization around pushing along the range without iterating it twice. But I don't know whether that's a worthwhile optimization, especially if the allocation and reference counting are more expensive than the iteration.

I would think you could get away with simply storing the range twice, and using .save to make sure you don't have problems.

-Steve
March 18, 2020
On Wed, Mar 18, 2020 at 02:57:41PM +0000, Dukc via Digitalmars-d wrote: [...]
> This is how chunkBy[1] currently behaves when copied. It essentially behaves like a reference range: it saves its state only when `save` is explicitly called, not otherwise, even if the chunked source range is a forward range with value semantics.

The current range API design, which Andrei himself admitted was not the best design in retrospect, does not specify behaviour on copying a range.  IOW, it's the application-level equivalent of undefined behaviour.  Generally, this is not a problem because usually you're working with your own range types which you already know the semantics of.  But in generic code, assumptions of this sort are a no-no, precisely because of breakages of this sort when different ranges behave differently on copy.

tl;dr: never copy a range directly, always use .save.  Never assume a range remains unchanged after iterating a copy; always assume the worst and use .save whenever you wish the original range to remain unchanged afterwards.


> As most Phobos ranges preserve the value semantics of their source ranges,
[...]

This is a dangerous assumption.  My personal advice is, if you expect a range to retain its contents after iteration, call .save explicitly. Don't assume anything not explicitly required by the range API.


T

-- 
Verbing weirds language. -- Calvin (& Hobbes)
March 18, 2020
On Wed, Mar 18, 2020 at 12:18:04PM -0400, Steven Schveighoffer via Digitalmars-d wrote: [...]
> The long answer is, chunkBy uses reference counting internally to store the range data. I'm not sure why this is,

Because Andrei demanded it when I submitted the PR. The main reason is to address the issue of what happens when you iterate the outer range while retaining a copy of an inner range. In my original implementation, advancing the inner range also advances the outer range, and vice versa, but only when it's an input range. When it's a forward range, it takes advantage of .save to decouple iteration on an inner range from iteration on the outer range.  But that meant that iterating over the entire thing required traversing the original range twice: once in each inner range, and once on the outer range.  Andrei didn't like this, so we hashed out several solutions and eventually settled on the reference counting one.


> as it seems possible to do without this mechanism. It seems there is some optimization surrounding pushing along the range without iterating it twice. But I don't know if that's a worthwhile optimization, especially if the allocation and reference counting are more expensive than the iteration.

That's up for debate, but yes, the whole point of the reference counting thing is to avoid traversing a forward range twice when iterating over subranges.  It really depends on what the original range does, IMO.  If it's generating values on-the-fly with an expensive calculation, you could be saving quite a bit of work. But for simple ranges, yeah, reference-counting inner ranges is kinda overkill, probably with a performance hit.


> I would think you could get away with simply storing the range twice, and using .save to make sure you don't have problems.
[...]

This is what the original implementation did, but Andrei did not like it.


T

-- 
Старый друг лучше новых двух.
March 18, 2020
On 3/18/20 12:37 PM, H. S. Teoh wrote:
> On Wed, Mar 18, 2020 at 12:18:04PM -0400, Steven Schveighoffer via Digitalmars-d wrote:

>> as it seems possible to do without this mechanism. It seems there is
>> some optimization surrounding pushing along the range without
>> iterating it twice. But I don't know if that's a worthwhile
>> optimization, especially if the allocation and reference counting are
>> more expensive than the iteration.
> 
> That's up for debate, but yes, the whole point of the reference counting
> thing is to avoid traversing a forward range twice when iterating over
> subranges.  It really depends on what the original range does, IMO.  If
> it's generating values on-the-fly with an expensive calculation, you
> could be saving quite a bit of work. But for simple ranges, yeah,
> reference-counting inner ranges is kinda overkill, probably with a
> performance hit.

Ugh, I think this is way overkill in most cases, and depends heavily on where the performance hit is.

Not only that, but you are only seeing a benefit if you iterate a chunk completely (I think).

For instance, an array has nearly zero cost to iterate elements, but the predicate for checking the chunking is probably way more expensive. The algorithm would be nicer if you simply iterated the array until you found the boundary, and then returned a slice as the element. Only one iteration through everything is necessary (you pre-calculate the "next" range).
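
That slice-returning strategy might look like this eager sketch (the name `chunkBySlices` is hypothetical; a lazy version would precompute the next boundary in `popFront` instead of building the whole result up front):

```d
import std.stdio;

// Hypothetical slice-based chunking for arrays: one pass over the data,
// one predicate call per adjacent pair, and each element is a cheap slice.
T[][] chunkBySlices(alias pred, T)(T[] arr)
{
    T[][] result;
    size_t start = 0;
    foreach (i; 1 .. arr.length)
    {
        if (!pred(arr[i - 1], arr[i]))   // boundary between chunks
        {
            result ~= arr[start .. i];
            start = i;
        }
    }
    if (arr.length)
        result ~= arr[start .. $];       // final chunk
    return result;
}

void main()
{
    auto chunks = chunkBySlices!((a, b) => a / 10 == b / 10)(
        [14, 16, 18, 20, 22, 20, 18, 16]);
    writeln(chunks);  // [[14, 16, 18], [20, 22, 20], [18, 16]]
}
```

Because each chunk aliases the original array, no copying of elements happens at all; the only allocations are for the result array itself.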

In other cases, iterating the elements is going to be expensive, so you don't want to do that twice if possible.

I think a good solution might be to provide different versions of the function or an enum designating what mechanism to prefer (e.g. IterationPolicy.MinimizePopfront or IterationPolicy.MinimizePredicateEval).

And of course, the .save behavior sucks, as always.

-Steve
March 18, 2020
On Wed, Mar 18, 2020 at 01:06:02PM -0400, Steven Schveighoffer via Digitalmars-d wrote:
> On 3/18/20 12:37 PM, H. S. Teoh wrote:
[...]
> > That's up for debate, but yes, the whole point of the reference counting thing is to avoid traversing a forward range twice when iterating over subranges.
[...]
> Ugh, I think this is way overkill in most cases, and depends heavily on where the performance hit is.
> 
> Not only that, but you are only seeing a benefit if you iterate a
> chunk completely (I think).

Yeah probably.


> For instance, an array has nearly zero cost to iterate elements, but the predicate for checking the chunking is probably way more expensive. The algorithm would be nicer if you simply iterated the array until you found the boundary, and then returned a slice as the element. Only one iteration through everything necessary (you pre-calculate the "next" range).

We *could* just detect hasSlicing!R and switch to an alternative implementation that doesn't need to jump through all of those refcounting hoops.
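
A sketch of that capability-based dispatch (`chooseImpl` and the returned labels are invented for illustration): the strategy is picked from the range's compile-time traits, with no user-facing knob.

```d
import std.range;

// Hypothetical dispatch: choose the implementation strategy from the
// range's compile-time capabilities rather than a policy parameter.
string chooseImpl(R)(R r)
{
    static if (hasSlicing!R && hasLength!R)
        return "slice-based";   // e.g. built-in arrays: cheap slices, no ref counting
    else
        return "ref-counted";   // general input/forward ranges: current mechanism
}

void main()
{
    import std.algorithm : filter;
    assert(chooseImpl([1, 2, 3]) == "slice-based");
    assert(chooseImpl([1, 2, 3].filter!(x => x > 0)) == "ref-counted");
}
```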


> In other cases, iterating the elements is going to be expensive, so you don't want to do that twice if possible.
> 
> I think a good solution might be to provide different versions of the function or an enum designating what mechanism to prefer (e.g. IterationPolicy.MinimizePopfront or IterationPolicy.MinimizePredicateEval).
[...]

It sounds like a good idea, but these days I'm wary of proliferating these implementation-detail-based parameters.  They're a pain to write, a pain to use, and a pain to maintain, and once you have the option you're committed to supporting it even if one day you invent a better algorithm that completely changes the implementation.

I'd much rather detect hasSlicing on the incoming range and switch to a better implementation.


T

-- 
INTEL = Only half of "intelligence".
March 18, 2020
On 3/18/20 2:30 PM, H. S. Teoh wrote:

> It sounds like a good idea, but these days I'm wary of proliferating
> these implementation-detail-based parameters.  They're a pain to write,
> a pain to use, and a pain to maintain, and once you have the option
> you're committed to supporting it even if one day you invent a better
> algorithm that completely changes the implementation.
> 
> I'd much rather detect hasSlicing on the incoming range and switch to
> a better implementation.

The problem is that you don't know what the bottleneck is, only the user does.

auto r = arr.map!(elem => someHorribleCalculation(elem));

static assert(hasSlicing!(typeof(r))); // 2x horrible calculations

The thing I would do is provide the "implementation detail" mechanisms, but default to something reasonable. It could even change based on whether slicing is available (the default, that is).
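
That idea could be sketched like this (the `ChunkPolicy` enum and names are invented): the caller may force a strategy, but the default is derived from the range's traits.

```d
import std.range;

// Hypothetical policy parameter: explicit override, trait-driven default.
enum ChunkPolicy { byTraits, fewPopFronts, fewPredicateCalls }

string pickStrategy(ChunkPolicy policy = ChunkPolicy.byTraits, R)(R r)
{
    static if (policy == ChunkPolicy.byTraits)
        // Reasonable default: sliceable ranges favor one predicate call
        // per boundary; everything else minimizes popFront calls.
        return hasSlicing!R ? "fewPredicateCalls" : "fewPopFronts";
    else static if (policy == ChunkPolicy.fewPopFronts)
        return "fewPopFronts";
    else
        return "fewPredicateCalls";
}

void main()
{
    assert(pickStrategy([1, 2, 3]) == "fewPredicateCalls");  // default for arrays
    assert(pickStrategy!(ChunkPolicy.fewPopFronts)([1, 2, 3])
        == "fewPopFronts");                                  // explicit override
}
```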

-Steve
March 18, 2020
On Wed, Mar 18, 2020 at 02:55:35PM -0400, Steven Schveighoffer via Digitalmars-d wrote:
> On 3/18/20 2:30 PM, H. S. Teoh wrote:
[...]
> > I'd much rather detect hasSlicing on the incoming range and switch to a better implementation.
> 
> The problem is that you don't know what the bottleneck is, only the user does.
> 
> auto r = arr.map!(elem => someHorribleCalculation(elem));
> 
> static assert(hasSlicing!(typeof(r))); // 2x horrible calculations
> 
> The thing I would do is provide the "implementation detail" mechanisms, but default to something reasonable. It could even change based on whether slicing is available (the default, that is).
[...]

Fair enough.

Still, I think we should totally detect hasSlicing, or at least built-in arrays, and default that to the slicing implementation instead.


T

-- 
Bomb technician: If I'm running, try to keep up.
March 19, 2020
On Wednesday, 18 March 2020 at 16:29:53 UTC, H. S. Teoh wrote:
> This is a dangerous assumption.  My personal advice is, if you expect a range to retain its contents after iteration, call .save explicitly. Don't assume anything not explicitly required by the range API.

This means that even if the API does thing X right now, I should not assume it won't change unless it's documented, right?

