Ranges/algorithms for aggregation

Is there a neat way to do this transformation with ranges and std.algorithms? Input: ------- B foo B bar C ble B big A begga Output: (aggregated and sorted on length) ------- B -> [foo, bar, big] C -> [ble] A -> [begga] The most obvious way (to me) to do this without standard algorithms is with an AA to the aggregation. The most obvious way (to me) to do this with std.algorithms is: B foo B bar C ble B big A begga => [B, foo] [B, bar] [C, ble] [B, big] [A, begga] => [B, foo] [B, bar] [B, big] [C, ble] [A, begga] => B -> [foo, bar, big] C -> [ble] A -> [begga] But this seems wasteful on memory. Is there a better way to do this in a more algorithmic way?

Luís Marques: > Is there a neat way to do this transformation with ranges and std.algorithms? > > Input: > ------- > B foo > B bar > C ble > B big > A begga > > Output: (aggregated and sorted on length) > ------- > B -> [foo, bar, big] > C -> [ble] > A -> [begga] What is the desired output data structure? An associative array of dynamic arrays? Or is a dynamic arrays of dynamic arrays of 2-tuples enough? There are various ways to solve your problem. Related: https://d.puremagic.com/issues/show_bug.cgi?id=5968 https://d.puremagic.com/issues/show_bug.cgi?id=9842 Bye, bearophile

On Friday, 21 March 2014 at 15:38:23 UTC, bearophile wrote: >> Output: (aggregated and sorted on length) >> ------- >> B -> [foo, bar, big] >> C -> [ble] >> A -> [begga] > > What is the desired output data structure? An associative array of dynamic arrays? Or is a dynamic arrays of dynamic arrays of 2-tuples enough? I'm doing this for a D newbie, to teach him the range/algorithmic approach. The function he wrote to output the result of that transformation takes as an input an array of arrays (the latter), but he builds that input iteratively using an AA of arrays (the former). I asked him about that mismatch and at the time he told me that "for now" he only needed the latter, suggesting he had other future plans where he might need the AA, but I'm not sure. So let's just say that the client is unclear on his requirements, which does happen in the real world anyway :-). In any case, I think the hashGroupBy is what I was asking about :-). Neat. (did anyone actually implement it?) I'm not sure how if "a dynamic arrays of dynamic arrays of 2-tuples" sufficed that would help with the intermediate step, if we wanted to avoid the sorting step. Did you have anything in particular in mind there?

Luís Marques: > So let's just say that the client is unclear on his requirements, which does happen in the real world anyway :-). Yes, it happens, But it's a problem, because often if you know what you need you can produce the results more efficiently :-) > (did anyone actually implement it?) Not yet. > I'm not sure how if "a dynamic arrays of dynamic arrays of 2-tuples" sufficed that would help with the intermediate step, if we wanted to avoid the sorting step. Did you have anything in particular in mind there? I think this problem needs a sorting or a hash. One possible solution, if you don't need an associative array as output, is to use a multiSort followed by a building of groups using slicing. It could be efficient enough. Later you search the keys with a some kind of binary search. Bye, bearophile

On Friday, 21 March 2014 at 16:04:45 UTC, bearophile wrote: > I think this problem needs a sorting or a hash. One possible solution, if you don't need an associative array as output, is to use a multiSort followed by a building of groups using slicing. It could be efficient enough. Later you search the keys with a some kind of binary search. The number of keys is large and unbounded (not just three as in my example), so I guess this multiSort approach would not be practical, right? I think we really need the hashGroupBy.

March 21, 2014

Re: Ranges/algorithms for aggregation

Posted by Justin Whear
in reply to Luís Marques

Permalink

Justin Whear

Posted in reply to Luís Marques

Permalink

On Fri, 21 Mar 2014 15:22:13 +0000, Luís Marques wrote:

> Is there a neat way to do this transformation with ranges and std.algorithms?
> 
>      Input:
>      -------
>      B foo B bar C ble B big A begga
> 
>      Output: (aggregated and sorted on length)
>      -------
>      B -> [foo, bar, big]
>      C -> [ble]
>      A -> [begga]
> 
> The most obvious way (to me) to do this without standard algorithms is
> with an AA to the aggregation. The most obvious way (to me) to do this
> with std.algorithms is:
> 
>      B foo B bar C ble B big A begga
> 
>      =>
> 
>      [B, foo]
>      [B, bar]
>      [C, ble]
>      [B, big]
>      [A, begga]
> 
>      =>
> 
>      [B, foo]
>      [B, bar]
>      [B, big]
>      [C, ble]
>      [A, begga]
> 
>      =>
> 
>      B -> [foo, bar, big]
>      C -> [ble]
>      A -> [begga]
> 
> But this seems wasteful on memory. Is there a better way to do this in a more algorithmic way?

This pull request[1] for groupBy has been hanging around for a year now, driving me to copy-and-paste the implementation into a couple of my projects.  Using it, you could do this:

auto tuples = ... // get your list of (B, foo), (B, bar), etc.
auto output = tuples.sort!`a[0] < b[0]`
                    .groupBy!`a[0] == b[0]`;
// output is a range of:
//    [
//     [(A, begga)],
//     [(B, foo), (B, bar), (B, big)],
//     [(C, ble)]
//    ]

The advantage being that output isn't an array at all but a lazy range of lazy ranges.

1 https://github.com/D-Programming-Language/phobos/pull/1186

On Fri, Mar 21, 2014 at 04:10:12PM +0000, Justin Whear wrote: [...] > This pull request[1] for groupBy has been hanging around for a year now, driving me to copy-and-paste the implementation into a couple of my projects. Using it, you could do this: > > auto tuples = ... // get your list of (B, foo), (B, bar), etc. > auto output = tuples.sort!`a[0] < b[0]` > .groupBy!`a[0] == b[0]`; > // output is a range of: > // [ > // [(A, begga)], > // [(B, foo), (B, bar), (B, big)], > // [(C, ble)] > // ] > > The advantage being that output isn't an array at all but a lazy range of lazy ranges. > > 1 https://github.com/D-Programming-Language/phobos/pull/1186 Be aware, though, that groupBy only compares *adjacent* elements for equivalence; it does not sort the input. So if your input has equivalent elements interspersed with non-equivalent elements, you will have the equivalent elements split into multiple runs in the output. Example: auto data = [ tuple(1, "a") tuple(1, "b") tuple(2, "c") tuple(1, "d") ]; writeln(data.groupBy!((a,b) => a[0] == b[0])); Will output: [[tuple(1, "a"), tuple(1, "b")], [tuple(2, "c")], [tuple(1, "d")]] Which may not be what the OP wants. T -- Unix was not designed to stop people from doing stupid things, because that would also stop them from doing clever things. -- Doug Gwyn

On Friday, 21 March 2014 at 16:53:46 UTC, H. S. Teoh wrote: > Be aware, though, that groupBy only compares *adjacent* elements for > equivalence; it does not sort the input. So if your input has equivalent > elements interspersed with non-equivalent elements, you will have the > equivalent elements split into multiple runs in the output. I think that's why Justin used sort. The hashGroupBy proposed by bearophile would avoid the sort and the additional memory usage though, so that would be even better.

On Friday, 21 March 2014 at 17:20:38 UTC, Luís Marques wrote: > I think that's why Justin used sort. The hashGroupBy proposed by bearophile would avoid the sort and the additional memory usage though, so that would be even better. I was thinking, we don't even need the full power of sort. Is there a standard algorithm that makes elements with equal keys be in sequence, but that otherwise is less expensive than sort?

Luís Marques: > I was thinking, we don't even need the full power of sort. Is there a standard algorithm that makes elements with equal keys be in sequence, but that otherwise is less expensive than sort? I don't know any, how is it supposed to know where to group items? Usually you build an associative array for that. Bye, bearophile

Forums