Thread overview
chunkBy implementation
Sep 29, 2020
H. S. Teoh
September 29, 2020
Here is the chunkBy implementation for an array: https://github.com/dlang/phobos/blob/4d7e5829cb60946fefad84d066de948e813eb6f5/std/algorithm/iteration.d#L2057

Complete with RefCounted allocations and everything. The whole works! Warning, do not pass this into std.array.array, it really isn't what you want to do.

Here's a simple range that slices an array according to a predicate:

struct SliceBy(alias pred, R)
{
    R remaining;
    size_t nextIdx;
    auto front() { return remaining[0 .. nextIdx]; }
    void popFront() {
        remaining = remaining[nextIdx .. $];
        if(remaining.length == 0)
            return;
        nextIdx = 1;
        while(nextIdx < remaining.length && pred(remaining[nextIdx - 1], remaining[nextIdx]))
            ++nextIdx;
    }
    bool empty() { return remaining.empty; }
}

No allocations, no RefCounted. The range elements have the same properties as the original (random access, etc.). I can squash this into std.array.array and it works as expected (i.e. I get an R[]). I can add forward and bidirectional range primitives if I wanted to (and all the other machinery that would be required to get it phobos-ready), but I don't need all that for my purposes, I just need a foreachable thingy.

But why is this not something that is already in there? Is there a reason we want to call the predicate all over again if you process a group multiple times? Is there a reason we aren't taking advantage of slicing when available?

Just trying to figure out the reasoning behind the complexity.

-Steve
September 29, 2020
On Tue, Sep 29, 2020 at 05:33:35PM -0400, Steven Schveighoffer via Digitalmars-d wrote:
> Here is the chunkBy implementation for an array: https://github.com/dlang/phobos/blob/4d7e5829cb60946fefad84d066de948e813eb6f5/std/algorithm/iteration.d#L2057
> 
> Complete with RefCounted allocations and everything. The whole works!

I was the one who wrote the original (non-RefCounted) implementation of chunkBy. During PR review, however, we ran into the case of how the resulting range iterates the original range twice: once in the subranges as you .popFront through them, and once in the parent range's .popFront (which needs to find the end of the subrange).

To eliminate this redundancy, Andrei came up with the mothership idea: the subrange would keep a reference to the parent range, so that if the subrange has already been consumed, the parent range can just pick up where it left off instead of starting from the beginning of the subrange and scan until the next subrange.

Keeping a reference to the parent range, however, opens a whole can o' worms about lifetime and all that jazz, so in the end, RefCounted was brought into the mix, to solve what's arguably a case of premature optimization.


[...]
> Is there a reason we want to call the predicate all over again if you process a group multiple times? Is there a reason we aren't taking advantage of slicing when available?

I think I was so burned out by the time we arrived at the RefCounted implementation, I just had no more strength or interest left to think about the case when the incoming range has slicing!  It's probably a case of an over-engineered solution displacing all other locally more optimal solutions -- since the RefCounted monster already handles all cases, nobody wants to introduce another case into the code that only handles a subset of cases.


T

-- 
You are only young once, but you can stay immature indefinitely. -- azephrahel
September 29, 2020
On 9/29/20 6:08 PM, H. S. Teoh wrote:
> On Tue, Sep 29, 2020 at 05:33:35PM -0400, Steven Schveighoffer via Digitalmars-d wrote:
>> Here is the chunkBy implementation for an array: https://github.com/dlang/phobos/blob/4d7e5829cb60946fefad84d066de948e813eb6f5/std/algorithm/iteration.d#L2057
>>
>> Complete with RefCounted allocations and everything. The whole works!
> 
> I was the one who wrote the original (non-RefCounted) implementation of
> chunkBy. During PR review, however, we ran into the case of how the
> resulting range iterates the original range twice: once in the subranges
> as you .popFront through them, and once in the parent range's .popFront
> (which needs to find the end of the subrange).
> 
> To eliminate this redundancy, Andrei came up with the mothership idea:
> the subrange would keep a reference to the parent range, so that if the
> subrange has already been consumed, the parent range can just pick up
> where it left off instead of starting from the beginning of the subrange
> and scan until the next subrange.

Hm... well, what about eagerly deciding when to stop, and just returning a Take range with the right number of elements? Then you aren't running the predicates more than once per element.

In any case, I've solved my local problem, and I wanted to ask about it here, since it seemed so out of left field to have this bizarre "mothership" mechanism (I haven't seen that on any other ranges).

If something like my quick-and-dirty version was added, I'm assuming maybe we'd have to name it something else.

-Steve