January 26, 2015
>
> If "group by" in other languages refers to the latter function, then
> that means "groupBy" is poorly-named and we need to come up with a
> better name for it. Changing it to return tuples and what-not seems to
> be beating around the bush to me.
>
>
> T

T: you are good with algorithms.  In many applications you have a bunch of results and want to summarise them.  This is often what the corporate manager is doing with Excel pivot tables, and it is what the groupby function is used for in pandas.  See here for a simple tutorial.

http://wesmckinney.com/blog/?p=125

And here for a summary of what pandas can do with data:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html

Is there any reason why we shouldn't add to Phobos: median, ranking, stddev, variance, correlation, covariance, skew, kurtosis, quantile, moving average, exponential moving average, rolling window (see pandas)?

I personally am fine with the implementation we have (although, as Ray Dalio would say, I haven't yet earned the right for you to care what I think).  All it means is that you need to sort your results on the multi-part key first before passing them to groupby.
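For concreteness, the two-step pattern I have in mind looks roughly like this (just a sketch: the record type and keys are invented, and groupBy is the current Phobos name, which may yet change per this thread):

import std.algorithm : groupBy, sort;
import std.stdio : writeln;

struct Trade { string region; string desk; double pnl; }

void main()
{
    auto trades = [Trade("EU", "rates", 1.0), Trade("US", "fx", -2.0),
                   Trade("EU", "fx", 3.0),   Trade("EU", "rates", 0.5)];

    // Step 1: sort on the multi-part key (region, then desk).
    trades.sort!((a, b) => a.region < b.region ||
                           (a.region == b.region && a.desk < b.desk));

    // Step 2: equal keys are now consecutive, so groupBy's runs really
    // are the relational-style groups.
    foreach (grp; trades.groupBy!((a, b) => a.region == b.region &&
                                            a.desk == b.desk))
        writeln(grp);
}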

My question is how much is lost by doing it in two steps (sort, groupby) rather than one.  I don't think all that much, but it is not my field.  I am also not that bothered, because this comes at the end of processing, not within the inner loop, so for me I don't think it makes a difference for now.  If data sets reach Common Crawl type sizes then it might be different, although I will take D over Java any day, warts and all.

In any case, the documentation should be very clear on what groupby does, and how the user can do what he might be expecting to achieve, coming from a different framework.

It would be interesting to benchmark D against pandas (the key bits of which are implemented in Cython) and see how we look.
January 26, 2015
On Mon, Jan 26, 2015 at 07:59:10PM +0000, Laeeth Isharc via Digitalmars-d wrote: [...]
> My question is how much is lost by doing it in two steps (sort,
> groupby) rather than one.
[...]

I'm not sure I understand what you mean by "lost"... I assume you're not talking about data loss, since that's not applicable either way. In any case, as with all things, there's a tradeoff.

Building a hash table of groups, as some people demand, has the advantage that the grouping predicate is applied globally, and no sorting (O(n log n)) is needed. The disadvantage is that the entire data set has to fit in memory, and you cannot process it lazily. Before you've seen all of the data, you cannot know whether there are more distinct groups to come, or whether currently known groups have more members. Once the data size gets too large, you run into trouble. Plus, some people are averse to gratuitous GC use in Phobos algorithms. Perhaps some of this is misplaced, but that's the perception.
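To make that concrete, the eager approach is essentially this (a toy sketch over ints with a made-up key function; a real implementation would of course be generic over element and key types):

import std.stdio : writeln;

// Eager, hash-based grouping: one pass, no sorting required, but the
// associative array holds every element until the whole input is consumed.
int[][int] hashGroupBy(alias keyOf)(int[] data)
{
    int[][int] groups;
    foreach (x; data)
        groups[keyOf(x)] ~= x;  // GC allocation as groups grow
    return groups;
}

void main()
{
    auto data = [3, 1, 4, 1, 5, 9, 2, 6];
    // Grouping is global: input order doesn't matter.
    writeln(hashGroupBy!(x => x % 3)(data));
}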

The current implementation has the advantage that it requires very little memory to work, and can process data lazily. It only needs to see a tiny portion of the entire data set to do its work. Each group returned is also lazy, so it doesn't need to store the whole group in its entirety. It can handle 10GB long groups in a 100GB data set streamed over the network with very little memory, whereas such a monstrous thing would be infeasibly resource-hungry if you need to allocate an on-disk hashtable (very inefficient!), iterate over the entire dataset, and then be I/O bound as you iterate over each group loading group members from disk. The big disadvantage, however, is that if your data is not sorted, then the groupings returned won't be quite what you'd expect, since the predicate is evaluated only over consecutive runs of elements, not globally over the entire data set.
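Here's a tiny illustration of that last point (values chosen arbitrarily): with unsorted input, equal elements that aren't adjacent end up in separate groups:

import std.algorithm : groupBy;
import std.stdio : writeln;

void main()
{
    // The 1s are not all adjacent, so they get split across two runs.
    auto data = [1, 1, 2, 2, 1, 3, 3];

    // Prints four groups -- [1, 1], [2, 2], [1], [3, 3] -- one per
    // consecutive run, not one per distinct value.
    foreach (run; data.groupBy!((a, b) => a == b))
        writeln(run);
}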

Which one is better? That depends on your specific application and its needs. For some applications, there is simply no way around the fact that you have a large dataset with randomly-ordered elements, so the sorting has to be done *somewhere*. For other applications, you *can* arrange to stream your data in sorted order -- or perhaps the application doesn't care if groupings aren't global -- so you can take advantage of the fact that the current groupBy is extremely cheap. It requires minimal memory to do its work, and it can do so inside a pipeline without requiring anything more than an input range.

Basically, std.algorithm is (currently) primarily focused on linear algorithms.  While there *are* some sorting algorithms and sublinear algorithms that take advantage of sorted data, it is still primarily focused on linear processing. There is no good abstraction as yet for more complex processing like multidimensional or graph traversals. As a result, it is best suited for stream-based processing.

Generally speaking, most application code is primarily concerned with linear processing (copy this text from here to there, scan this text for some linear string AKA search keyword, render this line of text to the screen, draw this line around the window, move this item along in the processing queue, etc.). Non-linear computations generally take place only in dedicated computing modules of limited scope in the application, where one tends to write dedicated algorithms rather than reuse stock generic algorithms.


> In any case, the documentation should be very clear on what groupby does, and how the user can do what he might be expecting to achieve, coming from a different framework.
[...]

As far as I know, the current groupBy docs explain quite clearly what it does.  If you find it still inadequate or unclear, please file bugs against it so that we can look into improving the docs.


T

-- 
GEEK = Gatherer of Extremely Enlightening Knowledge
January 26, 2015
On 1/26/15 12:31 PM, H. S. Teoh via Digitalmars-d wrote:
> As far as I know, the current groupBy docs explain quite clearly what it
> does.

I agree. That said, perhaps a change of name would avoid confusion. I like groupBy as we currently define it; it's squarely within the spirit of D. However, some people may legitimately expect it to do global processing and group all identical keys together.

So here's what I think we should do:

1. Rename our current groupBy to chunkBy

2. Change code to have chunkBy with binary predicates return tuples, as discussed

3. Add a method called groupEquivalent to SortedRange. It takes no predicate (because predicate already is embedded in SortedRange). It groups together elements that are equivalent according to the sorting order. Returns a range of ranges (similar to what groupBy returns now).

4. Perhaps add a global chunkEquivalent as an alias for groupBy!((a, b) => a == b).

So then in order to get relational group-by semantics by sorting, one would write range.sort!(predicate).groupEquivalent...


Andrei


January 26, 2015
Don't forget that there is already http://dlang.org/phobos/std_algorithm.html#.group which matches groupBy in this context (it only groups consecutive elements). It would be totally awkward to have those named differently (which is why I suggested changing the name from chunkBy to groupBy in the very beginning).
January 26, 2015
Thank you for the thoughtful reply.

I meant lost in terms of extra processing time and memory consumption in order to achieve the fairly common use case of a groupby pivot table, pandas style (i.e. what you get if you sort the data by group and then run D's groupby).

If you first sort the data and then run groupby, it seems likely that in some cases this will be more expensive than a straight pandas-style groupby, and my question was how much more expensive.

If you needed to sort the data anyway, because you were producing a report, this doesn't matter.  But if you didn't need it sorted, because you just wanted to produce statistics on each group (eg median, interquartile range), then it might.

I hadn't thought clearly before about in memory vs not.

I guess for pandas-style groupby, having to sort a dataset externally just to use groupby is a comparative disaster, because it seems to involve much more I/O than is needed.  If I am not mistaken, this might be one reason to also have, at some point, a separate function that does a pandas-style groupby efficiently for data that may not be held in memory.

Can you explain why, for a hash groupby, the entire dataset needs to fit in memory?  If I were writing this naively, I would have an associative array of lists of relevant items, hashed by group.  You go through each line once, add the index to the list for that group, then at the end sort the group keys and return indices (or whatever the slice equivalent is) of the lines matching each group.  But I may be missing some obvious points.

Ie my imagined use case has large data for each line, but the hash table and lists would fit in memory.

If thinking of a case where the data per line is not that large, but there are so many lines that it becomes an external sort, then I see your point.  However, I suggest that the use case I describe - whilst not the only one - is important enough at least to briefly consider.
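Something along these lines is the naive picture I have in mind (purely a sketch: the "lines" and the key extraction are invented, and I'm ignoring error handling and efficiency):

import std.algorithm : sort;
import std.array : split;
import std.stdio : writeln;

void main()
{
    // Pretend each line carries a large payload; only the keys and the
    // per-group index lists need to live in memory.
    auto lines = ["EU big-payload-1", "US big-payload-2",
                  "EU big-payload-3", "US big-payload-4"];

    // One pass: group key -> indices of the lines in that group.
    size_t[][string] byGroup;
    foreach (i, line; lines)
        byGroup[line.split[0]] ~= i;

    // At the end, sort the keys and hand back the indices per group.
    foreach (key; sort(byGroup.keys))
        writeln(key, ": ", byGroup[key]);
}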

"The  big disadvantage, however,
> is that
> if your data is not sorted, then the groupings returned won't be quite
> what you'd expect, since the predicate is evaluated only over
> consecutive runs of elements, not globally over the entire data set."

Well - it is a completely different function...

I really do believe that pandas style dataframe type processing could be an interesting area for D, and most of what is needed to construct this is already here.


> As far as I know, the current groupBy docs explain quite clearly what it
> does.  If you find it still inadequate or unclear, please file bugs
> against it so that we can look into improving the docs.


Sorry - I hadn't thought to check!  Will do so now.


Laeeth.
January 26, 2015
> As far as I know, the current groupBy docs explain quite clearly what it
> does.  If you find it still inadequate or unclear, please file bugs against it so that we can look into improving the docs.

I've read the docs now - they are perfect within the context of the style of documentation (and these documents themselves have certainly been improved over the past months).

It is very clear to me, but I am not sure it would be to everyone I work with.  The formal library description may not be the place for this, but I do think many less sophisticated readers would find the style below more accessible and easier to follow (this is not about groupBy, but a broader question of documentation to supplement the formal reference description for the standard library):

http://pandas.pydata.org/pandas-docs/stable/groupby.html

I appreciate that it is all very well to suggest improvements but someone actually has to do the work.  Recognizing the direction of what might be helpful is perhaps a start though.  I might be on my own with this view, but I don't think I am.
January 26, 2015
On Monday, 26 January 2015 at 18:31:05 UTC, Andrei Alexandrescu wrote:
>
> So the key (ahem) here is to make groupBy with unary predicate different from groupBy with binary predicate. The former returns the tuple, the latter is unchanged. Makes sense?

The current implementation has a certain beauty and does not lend itself easily to a split in two different versions. Also, the additional key return value might be unnecessary and unexpected in many cases.

You can easily build the extended groupBy function out of the current one, if you only calculate the predicate beforehand. If the predicate is expensive, this might be sensible anyway.

import std.algorithm : groupBy, map;
import std.range : iota;
import std.stdio : writeln;
import std.typecons : tuple;

auto groupByStar(alias pred, Range)(Range r)
{
    // Evaluate the predicate once per element, pairing it with the element,
    // then group on the precomputed key and strip it back out of each group.
    return r.map!(pred, "a")
            .groupBy!(a => a[0])
            .map!(inner => tuple(inner.front[0], inner.map!"a[1]"));
}

void main()
{
    auto a = iota(10);
    foreach (r; a.groupByStar!"a/3")
        writeln(r[0], " ", r[1]);
}

I do not know if this solution is any better. The above might be a terrible idea.

Note: I failed to compile `r.groupBy!"a[0]"` with

iteration.d(1648): Error: function expected before (), not "a[0]" of type string
January 26, 2015
On Monday, 26 January 2015 at 20:58:36 UTC, Dicebot wrote:
> Don't forget that there is already http://dlang.org/phobos/std_algorithm.html#.group which matches groupBy in this context (it only groups consecutive elements). It would be totally awkward to have those named differently (which is why I suggested changing the name from chunkBy to groupBy in the very beginning).

This seems valid to me. I'd suggest keeping the name and implementation as they are and adding the "common" groupBy function as an example.

https://github.com/D-Programming-Language/phobos/pull/2915