Jump to page: 1 26  
Page
Thread overview
[Issue 13936] groupBy must be redone
Jan 09, 2015
Dicebot
Jan 11, 2015
Dicebot
Jan 11, 2015
Dicebot
Jan 11, 2015
Dicebot
Jan 11, 2015
Dicebot
January 05, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #1 from Andrei Alexandrescu <andrei@erdani.com> ---
Sorry, pressed Enter too soon.

The implementation of groupBy in https://github.com/D-Programming-Language/phobos/pull/2654 must be redone from the ground up.

The main use case for groupBy is to iterate each element in each group and do something. That pattern must be supported with no or minimum overhead. Currently, groupBy iterates the same range twice - once inside each group, and once again to find the next group in the range of ranges.

One detail - this:

    in
    {
        import core.exception : RangeError;
        if (r.empty) throw new RangeError();
    }

seems superfluous seeing as r itself has whatever protection mechanism in its own implementation. Just defer to it; add an assert() at the maximum.

--
January 05, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

Andrei Alexandrescu <andrei@erdani.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|nobody@puremagic.com        |hsteoh@quickfur.ath.cx

--
January 05, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #2 from Andrei Alexandrescu <andrei@erdani.com> ---
With regards to the equivalence relation etc. - there's just too much fuss about it. Define it simple and robust: there's one quick moving range "fast" and one slow moving range "slow".

Initially "slow" and "fast" are in the same position. Then move "fast" forward without moving "slow" until !pred(slow.front, fast.front). At that point fast-forward slow to fast and continue. Only "fast" needs to call popFront.

For now let the caller worry about transitory vs. non transitory ranges or equivalence vs. non-equivalence predicates.

--
January 05, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

bearophile_hugs@eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs@eml.cc

--- Comment #3 from bearophile_hugs@eml.cc ---
Very nice comments, Andrei.

--
January 05, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #4 from hsteoh@quickfur.ath.cx ---
@andrei:

The current implementation of groupBy only traverses the range once if it's an input range (non-forward).

The difference between equivalence and non-equivalence relations is actually important; it's not a "fussy" issue. A non-equivalence relation requires that you always evaluate the predicate on two adjacent elements, because you cannot assume transitivity: the fact that pred(r[0], r[1]) and pred(r[1], r[2]) holds, does not imply pred(r[0], r[2]) also holds, so you have to .save every previous position of the range instead of just once per subrange (determining the end of the subrange simply by evaluating the predicate with the head element of that subrange), when you know that pred is an equivalence relation.

Your fast/slow range idea is flawed, because the outer range's .front must return an independent instance of the subrange. This is an artifact of the range API. There is no way for the outer range to keep track of how far the subrange has moved ahead without introducing some kind of reference semantics to it, which becomes messy if the caller invokes .front multiple times. For example, if the user calls .front twice and .save's both subranges, then randomly iterates one or the other, what should happen? Should the outer range's 'fast' range track the first subrange, the second, or both, or neither (because they are .save'd copies)? Does advancing the first copy of the subrange also advance the second? (Which, btw, breaks the semantics of .save)?

The only way for your fast/slow range idea to work is to make subranges non-forward, so that there is only one copy of the 'fast' range that gets advanced no matter which copy of it you call popFront on. But this means users can no longer call .save on subranges, which sux if they handed you a forward range precisely for that purpose.

--
January 06, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #5 from Andrei Alexandrescu <andrei@erdani.com> ---
(In reply to hsteoh from comment #4)
> @andrei:

Thanks for answering. OK, let's see.

> The current implementation of groupBy only traverses the range once if it's an input range (non-forward).

Yes - couldn't even do otherwise. Input non-forward ranges move along all together (that's somewhat implied but let's assume it's always the case). So groupBy relies on the fact that if r1 and r2 are copies of the same input/non-forward range, calling popFront in one also pops the front in the other.

That's efficient and all. The challenge here is to make sure the same goes for forward ranges. The simplest way to achieve that is by using a pointer, i.e. instead of using `Range`, use `Range*`. Then copying the pointer around will refer to the same actual range. Calling popFront through one will also pop the other.

But that produces garbage so a RefCounted!Range comes to mind. That would probably be a nice solution. In my opinion what we have right now penalizes forward ranges too much to be workable.

> The difference between equivalence and non-equivalence relations is actually important; it's not a "fussy" issue. A non-equivalence relation requires that you always evaluate the predicate on two adjacent elements, because you cannot assume transitivity: the fact that pred(r[0], r[1]) and pred(r[1], r[2]) holds, does not imply pred(r[0], r[2]) also holds, so you have to .save every previous position of the range instead of just once per subrange (determining the end of the subrange simply by evaluating the predicate with the head element of that subrange), when you know that pred is an equivalence relation.

We must go with efficiency for the common classic relational case. I think a distinct place we do not want to be in is: "Well for cool relational stuff you can use groupBy, it's very general and actually goes outside what relational algebra can do. However, if you want speed don't use it; you gotta use manual loops."

We must handle the typical relational algebra treatment, and handle it well enough that handmade approaches are unnecessary. If someone has a weird predicate and a weird requirement they can use adjacentFind or handmade loops etc. We don't need to cater for them. Pushing two ranges along is inefficient and therefore unacceptable.

> Your fast/slow range idea is flawed, because the outer range's .front must return an independent instance of the subrange. This is an artifact of the range API. There is no way for the outer range to keep track of how far the subrange has moved ahead without introducing some kind of reference semantics to it, which becomes messy if the caller invokes .front multiple times.

So there is no way, or there's a messy way? Big difference. I say there is a way.

> For example, if the user calls .front twice and .save's both
> subranges, then randomly iterates one or the other, what should happen?
> Should the outer range's 'fast' range track the first subrange, the second,
> or both, or neither (because they are .save'd copies)? Does advancing the
> first copy of the subrange also advance the second? (Which, btw, breaks the
> semantics of .save)?
> 
> The only way for your fast/slow range idea to work is to make subranges non-forward, so that there is only one copy of the 'fast' range that gets advanced no matter which copy of it you call popFront on. But this means users can no longer call .save on subranges, which sux if they handed you a forward range precisely for that purpose.

This can be handled, it's indeed not simple. There is an implementation in https://github.com/D-Programming-Language/phobos/pull/1186 that is working but a bit oddly and is probably more complicated than it needs to.

Anyhow, the larger point is we want groupBy to work as close to native speed for the common case of going once through all groups and with an equivalence relation. For more complex uses, Just Fine(tm) at a small extra cost should be okay. Is this reasonable?

--
January 08, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #6 from Andrei Alexandrescu <andrei@erdani.com> ---
ping pliz pliz :o)

--
January 08, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #7 from hsteoh@quickfur.ath.cx ---
I don't understand your beef against isEquivRelation. It's there precisely to give users the choice of having maximal expressive power at a slight performance cost, or being restricted to only equivalence relations but have better performance.

As for only iterating things once, a thought occurred to me today that instead of creating a new subrange every time the outer .front is called, we can simply keep *both* inner and outer ranges in the same struct, and .front simply returns a reference to the inner range. Multiple calls to .front returns (a reference to) the same inner range, so popping any of them will pop all of them at the same time. The inner range is then only valid until the next call to popFront() -- i.e., groupBy will return a transient range. This will give you maximum performance.

However, it forces the inner range to be non-forward -- there is no way to .save it without making a copy of both the outer and inner ranges, and once you .save it, it detaches from the outer range and no longer influences it when popFront is called. There is no way to make .save work without completely breaking the idea of single traversal, because if I use .save to make n copies of an inner range, then which copy's popFront will influence the outer range? And what happens if I call popFront on the outer range and then go back to a .save'd copy of the inner range afterwards? There are endless nasty corner cases that make this extremely messy.

So there you have it: a fast solution. But it comes at the cost of messy semantics (transient outer range, inner ranges have reference semantics and cannot be .save'd). The transience of the outer range is quite a major concern IMO, because many algorithms in Phobos do not deal with them correctly. Expect all the same problems with byLine to turn up all over again.

Honestly, I'd rather make this single-pass implementation a *different* function (or use a different compile-time argument) that is opt-in rather than opt-out. D's motto after all is "correct by default, fast if you ask for it". Naïve use cases of groupBy should return the "safest" behaviour -- i.e., the current behaviour where nothing is transient and .save has the usual semantics. If people find that the performance is too slow, recommend groupByFast (or equivalent), with the caveat that they will have to deal with the more unconventional semantics in their code.

--
January 08, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #8 from Andrei Alexandrescu <andrei@erdani.com> ---
Great going, thx. Let's beat this into shape.

(In reply to hsteoh from comment #7)
> I don't understand your beef against isEquivRelation. It's there precisely to give users the choice of having maximal expressive power at a slight performance cost, or being restricted to only equivalence relations but have better performance.

I'd say offering non-equivalence relations would be desirable only if (a) the default is to assume the predicate is an equivalence relation, and (b) implementing it doesn't take focus away from the matter at hand, which is an efficient groupBy with equivalence relations against input and forward ranges.

> As for only iterating things once, a thought occurred to me today that instead of creating a new subrange every time the outer .front is called, we can simply keep *both* inner and outer ranges in the same struct, and .front simply returns a reference to the inner range. Multiple calls to .front returns (a reference to) the same inner range, so popping any of them will pop all of them at the same time. The inner range is then only valid until the next call to popFront() -- i.e., groupBy will return a transient range. This will give you maximum performance.

Yah, but it's not sufficient as you note below.

> However, it forces the inner range to be non-forward -- there is no way to .save it without making a copy of both the outer and inner ranges, and once you .save it, it detaches from the outer range and no longer influences it when popFront is called. There is no way to make .save work without completely breaking the idea of single traversal, because if I use .save to make n copies of an inner range, then which copy's popFront will influence the outer range? And what happens if I call popFront on the outer range and then go back to a .save'd copy of the inner range afterwards? There are endless nasty corner cases that make this extremely messy.
>
> So there you have it: a fast solution. But it comes at the cost of messy semantics (transient outer range, inner ranges have reference semantics and cannot be .save'd). The transience of the outer range is quite a major concern IMO, because many algorithms in Phobos do not deal with them correctly. Expect all the same problems with byLine to turn up all over again.

I agree that's not a good solution. We shouldn't code it.

> Honestly, I'd rather make this single-pass implementation a *different* function (or use a different compile-time argument) that is opt-in rather than opt-out. D's motto after all is "correct by default, fast if you ask for it". Naïve use cases of groupBy should return the "safest" behaviour -- i.e., the current behaviour where nothing is transient and .save has the usual semantics. If people find that the performance is too slow, recommend groupByFast (or equivalent), with the caveat that they will have to deal with the more unconventional semantics in their code.

"It can't be done satisfactorily well" is in a way a given - it's the baseline we're starting from. It's also not terribly interesting because it precludes continuing to look into the matter. The trick is to actually find a way to do it, and here's where I'm relying on your wits.

I think it can be done in a satisfactory manner. There needs to be communication between individual groups and the mothership (range of ranges). Did you take a look at that pull request of mine? In there, each group has a serial number, and the mothership also has a serial number. That way it's easy to figure which group the mothership is currently in.

Once a group reaches its end, it "calls home" (the mothership) and checks whether the serial number of the mothership is the same as the serial number of the group. If so, great - it offers the mothership the O(1) answer for the mothership's next popFront. If not, it means the mothership has already moved forward so there's nothing to do.

I feel I'm not explaining this very well. Basically the proper iteration can be achieved by establishing a simple protocol between the range of ranges and each individual group.

Does this make sense?

--
January 09, 2015
https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #9 from hsteoh@quickfur.ath.cx ---
The nice thing about assuming non-equivalence relation by default is that it's the most general: it will produce correct behaviour in all use cases with no further work required from the user. In my initial implementation of groupBy, it assumed equivalence relation by default, and led to buggy behaviour like non-terminating empty ranges or "wrong" groupings because either transitivity or reflexivity failed to hold, but the code assumed they did.

My initial reaction was to define it away -- state that only equivalence relations are supported and close the bug. But upon further thought, I decided that supporting non-equivalence relations opens up more possibilities: you can now do things like grouping by maximum adjacent difference, for example, which can be useful as a kind of edge detector in a stream of data, but maximum adjacent difference is not an equivalence relation due to non-transitivity. It makes groupBy useful beyond the usual mundane equivalence relation based predicates.

In any case, this is a (mostly) orthogonal issue to performance. The problem I see with your "call home" proposal is that it introduces a lot of hidden complexities that either may offset the performance gain, or introduces subtle semantic corner cases that are ugly to deal with. First of all, this would require the subranges to retain a reference to the outer range. There are several problems with this:

(1) the outer range can no longer be a by-value type, because if it's a struct, structs are liable to being moved around, invalidating references to it;

(2) so we have to allocate on the heap, which is already a minus given the current push for reducing Phobos allocations.

(3) Which in turn means GC load, unless we introduce reference counting, which
adds additional complication (plus overhead to maintain the ref count).

(4) The subranges may need extra care to implement -- if they are a by-value struct type, the struct may get copied around, leaving stray references to the outer range, so either you need GC to clean up, or you need to tread into dangerous territory of struct dtors (which leads into areas with many compiler bugs and/or limitations, and fragile behaviour -- a lot of Phobos code doesn't work well with structs that have dtors because they don't have nice semantics for things like "T tmp; return tmp;" to get an empty range of the same type, for example).

(5) This *still* doesn't fully solve the performance issue -- if I have a 1000-element subrange and I iterate over 900 of them and then give up and decide to move on to the next subrange, the parent range will still repeat the 1000 popFront's from the beginning of the subrange because the subrange never called home.

(6) Your call-home trick falls flat on its face if somebody calls .save on the outer range.  Which copy of the outer range should the subranges call home to?

Using the single-subrange idea I had, does solve (5), but then it introduces poor API because the subranges can no longer be forward ranges. (In retrospect this is not surprising, since we *are* trying to optimize for the single-pass use case, so input-only ranges would be the natural solution. It's wanting *both* single-pass performance *plus* the full semantics of range-of-ranges that makes things complicated.) One idea that occurs to me is to store the subrange as a struct inside the parent range (so it never needs to call home, the parent range just picks up on where it left off and continues scanning for the next subrange), and have .save just return an independent copy of the subrange. However, this isn't a foolproof solution either, since a common idiom with forward ranges is to always work with a .save'd copy of a range, so then the user never actually uses the embedded copy of the subrange and any popFront()'s called on the .save'd copy doesn't benefit the outer range.

One possible solution I can think of, that still requires the extra complexity of heap-allocating the outer range, but does solve (5): one of the possibly multiple copies of the subranges is designated as the "furthest subrange", and gets the special privilege of having a reference to it from the outer range. Each copy of the subrange "competes" for this position by keeping a pop-counter that is bumped every time popFront is called. If the pop-counter exceeds the pop-counter of the current "furthest subrange", the range that went further installs itself as the new "furthest subrange" in the outer range, replacing the old champion. (The old champion can snatch the position back if it later on has more popFront's called.) When popFront on the outer range is called, it .save's the current furthest subrange and scans for the next subrange boundary. It also bumps the serial number, which indicates to all previous subranges that they've all been demoted and can no longer compete for the furthest subrange trophy. This ensures that the outer range always does the minimum necessary work to find the next subrange.

However, this requires extra overhead in popFront for each subrange (a pointer dereference, a comparison, and potentially one or more pointer assignments). At the end of the day, this may turn out making performance worse, because the optimizer is hampered by the aliasing caused by all these references and probably won't be able to generate optimal code or elide all this extra work that isn't needed in the single-pass use case. So sure, we save on the big-O characteristics (O(n) where n=length of range, instead of O(n*r) where r=number of subranges)), but it may not actually *run* faster in practice compared to a plain old for-loop. And (6) is still not solved, which I surmise is a pretty major showstopper.

I still think that if you want maximal performance for the single-pass case, you should just return an input range -- and make this something chosen by the user. We can have groupBy() return a fully-general forward range, and have groupByFast() (or groupBy!(Yes.singlePass)) implement a truly optimal single-pass algorithm where .save is not implemented, thus side-stepping the complexity and reducing to basically a manually-written for-loop except it has a range API. Well actually, you *can* implement .save for the outer range if you use the embedded-subrange idea I proposed: the same struct holds both the outer range and the inner range, so .save can just copy the struct and now you have sane semantics on both copies. You just can't .save the inner range, that's all. (Well actually, you *can*... by .save'ing the outer range, since the inner range has reference semantics so .front will resume where the inner range left off. You just can't save the inner range without also saving the outer range.)

In retrospect, this seems to confirm my observation that basically the current state of things is an artifact of subranges having the full (forward) range API. If the outer range only allowed a "current subrange element" cursor/iterator/what-have-you, we wouldn't have this problem, since it'd be clear that there is only a single outer range coupled with a single subrange at any time, so you don't have the many-to-many problem where somebody .save's both outer and inner ranges and now you don't know how to have them talk to each other in a sane way. By making the inner iteration a full-fledged range, all of this baggage gets inherited. Even in my proposed single-subrange idea, if you restrict the inner range to be input-only, the problem is still somewhat tractable. But once you allow .save in the inner range, all hell breaks loose. The thing is, if you knew ahead of time how the user is going to traverse the range, you can optimize for that, but since we can't know that ahead of time, either we need the user to tell us (I want full .save semantics on everything, so give me GC outer/inner ranges; or, I want single-pass only, so make everything a non-copyable input range, I don't care), or we have to cater to the general case, which unfortunately means poor performance.

--
« First   ‹ Prev
1 2 3 4 5 6