[Issue 13936] groupBy must be redone (page 3)

https://issues.dlang.org/show_bug.cgi?id=13936 --- Comment #20 from Andrei Alexandrescu <andrei@erdani.com> --- (In reply to Dicebot from comment #18) > Some quick notes from a very quick look: > > > I think we shouldn't care about non-equivalence relations > > This is something I simply can't accept. groupBy is not a specialized linear algebra thing - it is a quite common utility that will be widely used and widely misused. groupByAdjacent should take care of that. --

https://issues.dlang.org/show_bug.cgi?id=13936 --- Comment #21 from Dicebot <public@dicebot.lv> --- And it is not good exactly for a reason mentioned above - seeing function named groupBy one would never keep looking for something named like groupByAdjacent. Resulting in broken app and frustration. Going other way around is as trivial as mentioning "use this flag for pure equivalence relation" in docs as those who know linear algebra will immediately recognize it (and even of they don't, program will still be correct, just slower) --

https://issues.dlang.org/show_bug.cgi?id=13936 --- Comment #22 from Andrei Alexandrescu <andrei@erdani.com> --- (In reply to Dicebot from comment #21) > And it is not good exactly for a reason mentioned above - seeing function named groupBy one would never keep looking for something named like groupByAdjacent. Resulting in broken app and frustration. > > Going other way around is as trivial as mentioning "use this flag for pure equivalence relation" in docs as those who know linear algebra will immediately recognize it (and even of they don't, program will still be correct, just slower) groupBy evokes the linear algebra groupBy. groupByAdjacent is special purpose. We can't hurt everybody for the benefit of very few. --

https://issues.dlang.org/show_bug.cgi?id=13936 --- Comment #24 from Dicebot <public@dicebot.lv> --- If you consider those who know/care about math "everyone" and everyone else "few" you are of much better opinion of people than I am. Also performance penalty is not even remotely as harmful as infinite loop /broken logic so this argument goes against your proposal. Anyway, this is up to you to decide. You have been asking for opinion - here it is. --

https://issues.dlang.org/show_bug.cgi?id=13936 --- Comment #26 from hsteoh@quickfur.ath.cx --- For some reason I couldn't access the dpaste yesterday. But anyway, it works now. The first thing I note, is that this still does not address the situation where the group is incompletely iterated, then the user decides to pop the outer range. In that case you duplicate almost all of the effort of popping the entire group again. The choice of implementation also imposes an artificial limitation on group length to uint.max, and would insert an artificial group boundary at that point even if predicate() still holds true. This is unlikely to ever happen, but still, it would be nice to have some insurance for it in documentation and in throwing an exception should this case get triggered (but unfortunately that breaks nothrow, which could be a show-stopper). Or use size_t, which defers the problem to the more unlikely distant future. A similar thing happens with wrapped-around group serial numbers, but that's a far less likely case to cause problems 'cos the user would have to keep a reference to a group around for uint.max groups, which realistically won't happen unless he was deliberately trying to shoot his own foot. Now on to the implementation: - What on earth is going on with GroupBy.groupStart and GroupBy.groupCurrent?! Shouldn't there just be a single reference to the underlying range, which gets advanced to groupNext once the current Group is finished? Currently this code looks really suspect because .groupStart and .groupCurrent seem to be redundant with each other, yet .empty checks .groupCurrent whereas .front copies .groupStart. - GroupBy.front passes the current range by value which Group stores by value. This will cause broken behaviour if the range has reference semantics: Group.popFront will cause subsequent calls to GroupBy.front to return a truncated group. Previously-saved Group's will also break once the mothership advances to the next group, because they use groupStart to evaluate the predicate, which would have changed if the underlying range has reference semantics. Basically, .save should be used everywhere except where you can justify not using it, otherwise you will almost certainly get broken boundary cases. - OTOH, if the range has very short groups, then most of the performance will be bottlenecked on all those .save's every time .front is accessed. We should probably profile this thing, with at least one test case with very long groups, and one test case with very short (1-2 element) groups. And of course, test it to death. - Group doesn't implement .save? Shouldn't that be possible with this design? - Group.this assumes groupStart is non-empty. Should assert in case future changes break GroupBy.front. There are probably more issues, but I haven't looked closely enough yet. :-P I think the design seems OK (barring the issue with incompletely-iterated Groups, but I'm not sure if fixing that wouldn't introduce too much additional complexity), but the implementation is atrocious! :-P Plus, it totally doesn't work with input ranges, as you noted. In fact, input ranges totally don't need refcounting or any of that serial number fanciness -- those would just be useless overhead. I'd even propose that for input ranges we should just use my current implementation. :-D I'm almost ready to bet that performance is better in that case. They could just be side-by-side overloads. --

https://issues.dlang.org/show_bug.cgi?id=13936 --- Comment #27 from hsteoh@quickfur.ath.cx --- Anyway, another design occurred to me. For ranges with slicing, we could scan Groups in advance and just return slices from .front. The predicate only needs to be evaluated once per element pair. This can be extended to non-sliceable ranges by precomputing the length of each Group in advance: then Group.popFront won't need to evaluate the predicate again (just decrement the length), whereas now if you iterate over 10 copies of each Group, each one will have to evaluate the predicate each time. --

https://issues.dlang.org/show_bug.cgi?id=13936 --- Comment #28 from hsteoh@quickfur.ath.cx --- Another note: your current implementation looks like it could be easily extended to handle non-equivalence predicates. All you need to do is to evaluate the predicate on adjacent elements vs. with the head of the group (IOW, just advance groupStart each time in Group.popFront). I doubt it would cause too much performance hit. --

January 12, 2015

[Issue 13936] groupBy must be redone

Posted by Andrei Alexandrescu

Permalink

Andrei Alexandrescu

Permalink

https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #29 from Andrei Alexandrescu <andrei@erdani.com> ---
(In reply to hsteoh from comment #26)
> The first thing I note, is that this still does not address the situation where the group is incompletely iterated, then the user decides to pop the outer range. In that case you duplicate almost all of the effort of popping the entire group again.

It's trivial - just add the adjustment in the Group destructor as well.

> The choice of implementation also imposes an artificial limitation on group length to uint.max, and would insert an artificial group boundary at that point even if predicate() still holds true. This is unlikely to ever happen, but still, it would be nice to have some insurance for it in documentation and in throwing an exception should this case get triggered (but unfortunately that breaks nothrow, which could be a show-stopper). Or use size_t, which defers the problem to the more unlikely distant future.

Come on, that's nitpicking. Make that size_t or ulong. Implementation may use an extra bool, too. It's not a golden standard, it's a proof of concept!

> A similar thing happens with wrapped-around group serial numbers, but that's a far less likely case to cause problems 'cos the user would have to keep a reference to a group around for uint.max groups, which realistically won't happen unless he was deliberately trying to shoot his own foot.
> 
> Now on to the implementation:
> 
> - What on earth is going on with GroupBy.groupStart and GroupBy.groupCurrent?! Shouldn't there just be a single reference to the underlying range, which gets advanced to groupNext once the current Group is finished? Currently this code looks really suspect because .groupStart and .groupCurrent seem to be redundant with each other, yet .empty checks .groupCurrent whereas .front copies .groupStart.

groupStart is there only for its front. Calling front doesn't use groupStart.

> - GroupBy.front passes the current range by value which Group stores by value. This will cause broken behaviour if the range has reference semantics: Group.popFront will cause subsequent calls to GroupBy.front to return a truncated group. Previously-saved Group's will also break once the mothership advances to the next group, because they use groupStart to evaluate the predicate, which would have changed if the underlying range has reference semantics. Basically, .save should be used everywhere except where you can justify not using it, otherwise you will almost certainly get broken boundary cases.

The implementation is intended for forward ranges, which are the hard case here. If there are other issues, could you give an example?

> - OTOH, if the range has very short groups, then most of the performance will be bottlenecked on all those .save's every time .front is accessed. We should probably profile this thing, with at least one test case with very long groups, and one test case with very short (1-2 element) groups. And of course, test it to death.

That would be great indeed.

> - Group doesn't implement .save? Shouldn't that be possible with this design?

It should.

> - Group.this assumes groupStart is non-empty. Should assert in case future changes break GroupBy.front.

That seems like a safe assumption to me. Of course assert is appropriate.

> There are probably more issues, but I haven't looked closely enough yet. :-P

I'm surprised. It's a very short codebase - 128 lines all told and uses no trick and no subterfuge. All straight code.

> I think the design seems OK (barring the issue with incompletely-iterated Groups, but I'm not sure if fixing that wouldn't introduce too much additional complexity), but the implementation is atrocious! :-P

I find that hard to agree with, especially since you had predicted dire difficulties and contortions several times in your long posts dedicated to explaining how it can't be done. Frankly after 128 lines of code destroy assertions that took several walls of text, I was shooting for "Oh okay, I see what you mean."

> Plus, it
> totally doesn't work with input ranges, as you noted.

Input non-forward ranges are the easy case.

> In fact, input ranges
> totally don't need refcounting or any of that serial number fanciness --
> those would just be useless overhead. I'd even propose that for input ranges
> we should just use my current implementation. :-D  I'm almost ready to bet
> that performance is better in that case. They could just be side-by-side
> overloads.

That was my intent as well.

--

Forums