January 22, 2015 sortUniq | ||||
---|---|---|---|---|
| ||||
There's this classic patter on Unix: |sort|uniq, i.e. sort some data and only display the unique elements. What would be a better integrated version - one that does sorting and uniq in one shot? I suspect the combination could be quite a bit better than doing the two in sequence. A few google searches didn't yield much. Ideas? Thanks, Andrei |
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Thu, Jan 22, 2015 at 01:40:56PM -0800, Andrei Alexandrescu via Digitalmars-d wrote: > There's this classic patter on Unix: |sort|uniq, i.e. sort some data and only display the unique elements. > > What would be a better integrated version - one that does sorting and uniq in one shot? I suspect the combination could be quite a bit better than doing the two in sequence. > > A few google searches didn't yield much. Ideas? [...] Off the top of my head: a sorting algorithm that swallows identical elements? I've never actually seen such an algorithm before, though. Generally, sorting algorithms try not to output less elements than they were given. :-P T -- Marketing: the art of convincing people to pay for what they didn't need before which you fail to deliver after. |
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Thursday, 22 January 2015 at 21:40:57 UTC, Andrei Alexandrescu wrote:
> There's this classic patter on Unix: |sort|uniq, i.e. sort some data and only display the unique elements.
>
> What would be a better integrated version - one that does sorting and uniq in one shot? I suspect the combination could be quite a bit better than doing the two in sequence.
I think the only way you can go lower than O(n log n) is pigeonholing (counting sort without the counting), but for generic keys you need AAs which aren't O(1) (and slow).
Maybe heapsort (you could drop duplicates while building the heap), but that needs an extra allocation.
|
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu:
> There's this classic patter on Unix: |sort|uniq, i.e. sort some data and only display the unique elements.
In Bugzilla I've asked for a hashGroup, it returns a built-in associative array.
Bye,
bearophile
|
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Thu, 22 Jan 2015 13:40:56 -0800, Andrei Alexandrescu wrote: > There's this classic patter on Unix: |sort|uniq, i.e. sort some data and only display the unique elements. > > What would be a better integrated version - one that does sorting and uniq in one shot? I suspect the combination could be quite a bit better than doing the two in sequence. > > A few google searches didn't yield much. Ideas? > > > Thanks, > > Andrei Efficiency improvement-wise, perhaps a generalization of a counting sort (http://en.wikipedia.org/wiki/Counting_sort), see "Variant algorithms." |
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Justin Whear | On 1/22/15 2:06 PM, Justin Whear wrote:
> On Thu, 22 Jan 2015 13:40:56 -0800, Andrei Alexandrescu wrote:
>
>> There's this classic patter on Unix: |sort|uniq, i.e. sort some data and
>> only display the unique elements.
>>
>> What would be a better integrated version - one that does sorting and
>> uniq in one shot? I suspect the combination could be quite a bit better
>> than doing the two in sequence.
>>
>> A few google searches didn't yield much. Ideas?
>>
>>
>> Thanks,
>>
>> Andrei
>
> Efficiency improvement-wise, perhaps a generalization of a counting sort
> (http://en.wikipedia.org/wiki/Counting_sort), see "Variant algorithms."
Thanks. I'm thinking of something based on general comparisons. Something like a modified quicksort with fat pivot:
A. Partition into three subranges: R1<pivot, R2=pivot, R3>pivot
B. Recurse on R1 obtaining R11, a subrange of R1 containing only unique elements.
C. Sort-move-unique R2 into the portion after R11, obtaining the result.
A and B seem fine, I'm unclear about C. It would be an algorithm that sorts and simultaneously moves data in a range that overlaps the input.
Andrei
|
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Thu, Jan 22, 2015 at 02:29:58PM -0800, Andrei Alexandrescu via Digitalmars-d wrote: [...] > Thanks. I'm thinking of something based on general comparisons. Something like a modified quicksort with fat pivot: > > A. Partition into three subranges: R1<pivot, R2=pivot, R3>pivot > > B. Recurse on R1 obtaining R11, a subrange of R1 containing only unique elements. > > C. Sort-move-unique R2 into the portion after R11, obtaining the result. > > A and B seem fine, I'm unclear about C. It would be an algorithm that sorts and simultaneously moves data in a range that overlaps the input. [...] Wait, if R1 and R3, after the recursion, are sorted and contain only unique elements, doesn't that mean that we can just collapse R2 into a single element along with the endpoints of R1 and R3? So if R1[$-1] < pivot, then we're done, otherwise coalesce it with the pivot, and likewise with R3[0]. T -- Famous last words: I wonder what will happen if I do *this*... |
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
On Thu, Jan 22, 2015 at 02:51:38PM -0800, H. S. Teoh via Digitalmars-d wrote: > On Thu, Jan 22, 2015 at 02:29:58PM -0800, Andrei Alexandrescu via Digitalmars-d wrote: [...] > > Thanks. I'm thinking of something based on general comparisons. Something like a modified quicksort with fat pivot: > > > > A. Partition into three subranges: R1<pivot, R2=pivot, R3>pivot > > > > B. Recurse on R1 obtaining R11, a subrange of R1 containing only unique elements. > > > > C. Sort-move-unique R2 into the portion after R11, obtaining the result. > > > > A and B seem fine, I'm unclear about C. It would be an algorithm that sorts and simultaneously moves data in a range that overlaps the input. > [...] > > Wait, if R1 and R3, after the recursion, are sorted and contain only unique elements, doesn't that mean that we can just collapse R2 into a single element along with the endpoints of R1 and R3? So if R1[$-1] < pivot, then we're done, otherwise coalesce it with the pivot, and likewise with R3[0]. [...] Working proof of concept: import std.algorithm; import std.array; import std.stdio; // Returns: final index of pivot size_t partition(R)(R r) { auto pivot = r[0]; swap(r[0], r[$-1]); // move pivot to end auto leftIdx = 0; foreach (i; 0 .. r.length) { if (r[i] < pivot) swap(r[i], r[leftIdx++]); } swap(r[leftIdx], r[$-1]); return leftIdx; } R uniqueSort(R)(R r) { if (r.empty) return r; // Partition auto pivotIdx = partition(r); auto pivot = r[pivotIdx]; // Recurse auto left = uniqueSort(r[0 .. pivotIdx]); auto right = uniqueSort(r[pivotIdx + 1 .. $]); // Coalesce left ~ pivot ~ right. if (!left.empty && left[$-1] == pivot) left = left[0 .. $-1]; if (!right.empty && right[0] == pivot) right = right[1 .. $]; r[left.length] = pivot; auto vacant = copy(right, r[left.length + 1 .. $]); return r[0 .. $ - vacant.length]; } unittest { auto data = [4, 0, 9, 7, 2, 5, 1, 1, 3, 4, 2, 0, 9, 5, 3, 6, 1, 7, 4, 8]; auto sorted = data.uniqueSort(); assert(sorted == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]); } I'm not sure about the performance of this algorithm, though, because the call to copy() might become prohibitive once you add up all the recursive calls to uniqueSort(). Also, I only got one test case working; no idea if there are subtle bugs that are lurking in there somewhere. Please test with more test cases. :-) T -- "No, John. I want formats that are actually useful, rather than over-featured megaliths that address all questions by piling on ridiculous internal links in forms which are hideously over-complex." -- Simon St. Laurent on xml-dev |
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Thursday, 22 January 2015 at 23:35:32 UTC, H. S. Teoh via Digitalmars-d wrote: > On Thu, Jan 22, 2015 at 02:51:38PM -0800, H. S. Teoh via Digitalmars-d wrote: >> Wait, if R1 and R3, after the recursion, are sorted and contain only >> unique elements, doesn't that mean that we can just collapse R2 into a >> single element along with the endpoints of R1 and R3? So if R1[$-1] < >> pivot, then we're done, otherwise coalesce it with the pivot, and >> likewise with R3[0]. I'm pretty sure that when Andrei said: >> > C. Sort-move-unique R2 into the portion after R11, obtaining the >> > result. He meant R3, not R2. R2 is obviously collapsed to one element (the pivot). > Working proof of concept: I think this will actually have worse performance than sort+uniq, since you're now shuffling half of the elements at each recursion depth - an additional O(n log n). |
January 22, 2015 Re: sortUniq | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Thursday, 22 January 2015 at 22:53:59 UTC, H. S. Teoh via Digitalmars-d wrote:
> So if R1[$-1] < pivot,
This should be always true, no? R1 might be empty, though.
|
Copyright © 1999-2021 by the D Language Foundation