January 23, 2015
Andrei Alexandrescu:

> What would be a better integrated version - one that does sorting and uniq in one shot? I suspect the combination could be quite a bit better than doing the two in sequence.

What I need most, and have for years, is a "grouping by hashing". A unique sort can be handy, but it's less useful than the first.

Bye,
bearophile
January 23, 2015
On Thu, Jan 22, 2015 at 11:52:18PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
> On Thursday, 22 January 2015 at 23:35:32 UTC, H. S. Teoh via Digitalmars-d wrote:
> >On Thu, Jan 22, 2015 at 02:51:38PM -0800, H. S. Teoh via Digitalmars-d wrote:
> >>Wait, if R1 and R3, after the recursion, are sorted and contain only unique elements, doesn't that mean that we can just collapse R2 into a single element along with the endpoints of R1 and R3? So if R1[$-1] < pivot, then we're done, otherwise coalesce it with the pivot, and likewise with R3[0].
> 
> I'm pretty sure that when Andrei said:
> 
> >>> C. Sort-move-unique R2 into the portion after R11, obtaining the result.
> 
> He meant R3, not R2. R2 is obviously collapsed to one element (the
> pivot).

Ah, that's why I misunderstood him.


> >Working proof of concept:
> 
> I think this will actually have worse performance than sort+uniq, since you're now shuffling half of the elements at each recursion depth - an additional O(n log n).

That's what I thought. At each step, you have to coalesce the elements by copying `right` over. So that adds up to quite a cost.

Actually, I just realized that the code has an unnecessary check for left[$-1] being equal to pivot: this is actually impossible, since partition() makes sure that the left half of the range is strictly less than the pivot, so it cannot possibly contain the pivot. So we can eliminate that check. But it doesn't solve the problem of excessive copying.

Hmm... actually, I think it might be possible to modify partition so that it eliminates duplicated pivots for you. However, this still doesn't solve the recursive case where `left` might have shrunk, thus necessitating `right` be shifted over afterwards.

It looks like a modified heapsort might be a better candidate for
uniqueSort() than quicksort.
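Since the proof-of-concept code itself isn't quoted in this thread, here is a hypothetical sketch of the scheme under discussion (Python rather than D, with list copies standing in for the in-place partition and shift; `unique_sort` is an illustrative name, not the actual posted code):

```python
def unique_sort(r):
    """Hypothetical sketch: quicksort-style sort + uniq in one pass.

    List copies stand in for the in-place partition; the final
    concatenation is where the in-place version would have to shift
    `right` over the gap left by collapsed duplicates -- the copying
    cost discussed above."""
    if len(r) <= 1:
        return r[:]
    pivot = r[len(r) // 2]
    # left is strictly < pivot, so it can never contain the pivot --
    # hence the left[$-1] == pivot check is indeed unnecessary.
    left = unique_sort([x for x in r if x < pivot])
    right = unique_sort([x for x in r if x > pivot])
    # All copies of the pivot collapse into a single element.
    return left + [pivot] + right
```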


T

-- 
Music critic: "That's an imitation fugue!"
January 23, 2015
On 1/22/15 2:51 PM, H. S. Teoh via Digitalmars-d wrote:
> Wait, if R1 and R3, after the recursion, are sorted and contain only
> unique elements, doesn't that mean that we can just collapse R2 into a
> single element along with the endpoints of R1 and R3? So if R1[$-1] <
> pivot, then we're done, otherwise coalesce it with the pivot, and
> likewise with R3[0].

Right, forgot to mention the pivot.

You don't recurse on the right, only the left; that way you massage the unique stuff in R1 toward the left in the possibly smaller range R11. Then the challenge is to sort, uniquefy, and move in one fell swoop from R3 (padded with the pivot to the left) into the portion adjacent to R11.

Example: say the range has 100 elements. Then we divide like this:

r[0 .. 30] is <pivot
r[30 .. 34] is =pivot
r[34 .. 100] is >pivot

First, recurse on r[0 .. 30] getting a smaller range that's sorted and unique. Say r[0 .. 24].

The step needed now is to take r[33 .. 100], which is ONE copy of the pivot plus everything greater than pivot, and in one shot deposit it starting at r[24], sorted and uniquefied.

I'm thinking of a modified heapsort that progressively creates the heap at r[24], or something like that.
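As a stand-in for this layout, a small Python illustration (illustrative only: `sorted`/`set` do the work the real algorithm would do in place, the indices depend on the data, and the seed is just for determinism):

```python
import random

random.seed(1)                 # deterministic for the example
r = [random.randrange(50) for _ in range(100)]
pivot = r[0]

# Recurse on the left part only: the sorted, unique R11.
r11 = sorted(set(x for x in r if x < pivot))

# ONE copy of the pivot plus everything greater than it,
# sorted and uniquefied "in one shot".
rest = sorted(set(x for x in r if x >= pivot))

# Deposit starting right after R11 (r[24] in the example above).
result = r11 + rest
assert result == sorted(set(r))
assert result.count(pivot) == 1
```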



Andrei

January 23, 2015
On 1/22/15 3:52 PM, Vladimir Panteleev wrote:
> On Thursday, 22 January 2015 at 23:35:32 UTC, H. S. Teoh via
> Digitalmars-d wrote:
>> On Thu, Jan 22, 2015 at 02:51:38PM -0800, H. S. Teoh via Digitalmars-d
>> wrote:
>>> Wait, if R1 and R3, after the recursion, are sorted and contain only
>>> unique elements, doesn't that mean that we can just collapse R2 into a
>>> single element along with the endpoints of R1 and R3? So if R1[$-1] <
>>> pivot, then we're done, otherwise coalesce it with the pivot, and
>>> likewise with R3[0].
>
> I'm pretty sure that when Andrei said:
>
>>> > C. Sort-move-unique R2 into the portion after R11, obtaining the
>>> > result.
>
> He meant R3, not R2. R2 is obviously collapsed to one element (the pivot).

Yes, thanks and sorry for the distraction.

>> Working proof of concept:
>
> I think this will actually have worse performance than sort+uniq, since
> you're now shuffling half of the elements at each recursion depth - an
> additional O(n log n).

Yah, the key here is to exploit the combination to get better performance than composing the two. So you must do surgery on the core algorithms involved.


Andrei

January 23, 2015
On 1/22/15 4:06 PM, bearophile wrote:
> Andrei Alexandrescu:
>
>> What would be a better integrated version - one that does sorting and
>> uniq in one shot? I suspect the combination could be quite a bit
>> better than doing the two in sequence.
>
> What I need most, and have for years, is a "grouping by hashing". A unique
> sort can be handy, but it's less useful than the first.

I agree a hash uniq would be nice. -- Andrei
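For comparison, a hash-based uniq (no sorting needed, first-occurrence order preserved) is only a few lines; `hash_uniq` is an illustrative name, not an actual Phobos function:

```python
def hash_uniq(items):
    """Drop duplicates via a hash set; O(n) expected time, no ordering
    or comparisons required, only hashing and equality."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:       # expected O(1) membership test
            seen.add(x)
            out.append(x)
    return out
```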

January 23, 2015
On 01/22/2015 07:06 PM, bearophile wrote:
> Andrei Alexandrescu:
> 
>> What would be a better integrated version - one that does sorting and uniq in one shot? I suspect the combination could be quite a bit better than doing the two in sequence.
> 
> What I need most, and have for years, is a "grouping by hashing". A unique sort can be handy, but it's less useful than the first.
> 
> Bye,
> bearophile

Is that different than groupBy?

-- 
Paul O'Neil
Github / IRC: todayman
January 23, 2015
Paul O'Neil:

> Is that different than groupBy?

It works with hashing and doesn't require sorting, so it's different. On the other hand, it's just a few lines of code and quite easy to write...
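Indeed, a minimal sketch of grouping by hashing takes only a few lines (Python here for illustration; `group_by_hash` and its `key` parameter are hypothetical names):

```python
from collections import defaultdict

def group_by_hash(items, key=lambda x: x):
    """Bucket items by key via hashing; no sorting or ordering needed,
    and each item is visited exactly once."""
    groups = defaultdict(list)
    for x in items:
        groups[key(x)].append(x)
    return dict(groups)
```

For example, `group_by_hash(range(6), key=lambda x: x % 2)` splits evens from odds in a single pass.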

Bye,
bearophile
January 23, 2015
On Thu, Jan 22, 2015 at 04:10:45PM -0800, Andrei Alexandrescu via Digitalmars-d wrote:
> On 1/22/15 2:51 PM, H. S. Teoh via Digitalmars-d wrote:
> >Wait, if R1 and R3, after the recursion, are sorted and contain only unique elements, doesn't that mean that we can just collapse R2 into a single element along with the endpoints of R1 and R3? So if R1[$-1] < pivot, then we're done, otherwise coalesce it with the pivot, and likewise with R3[0].
> 
> Right, forgot to mention the pivot.
> 
> You don't recurse on the right, only the left; that way you massage the unique stuff in R1 toward the left in the possibly smaller range R11. Then the challenge is to sort, uniquefy, and move in one fell swoop from R3 (padded with the pivot to the left) into the portion adjacent to R11.
> 
> Example: say the range has 100 elements. Then we divide like this:
> 
> r[0 .. 30] is <pivot
> r[30 .. 34] is =pivot
> r[34 .. 100] is >pivot
> 
> First, recurse on r[0 .. 30] getting a smaller range that's sorted and unique. Say r[0 .. 24].
> 
> The step needed now is to take r[33 .. 100], which is ONE copy of the pivot plus everything greater than pivot, and in one shot deposit it starting at r[24], sorted and uniquefied.
> 
> I'm thinking of a modified heapsort that progressively creates the heap at r[24], or something like that.
[...]

Well, that would work. Recurse on left to get a uniqSorted left segment, then modified heapsort on right (instead of heapsorting in place, deposit extracted heap elements into the target range).

The question, though, is whether this hybrid recursion actually wins us anything, since heapsort is known to be somewhat slower than quicksort. In the worst case, you choose a poor pivot and get *both* the worst case of quicksort and the inferior performance of heapsort compounded together.

Unless there's a way to modify heapsort to be more efficient as well?

Actually, another idea occurred to me: suppose we modify partition() instead, so that instead of partitioning in-place, it deposits partitioned elements into a target (possibly overlapping) range. Then we modify the right-recursion, so that its partitioning step actually simultaneously moves the range over to the target area for us. Not sure if this will eliminate the copying cost, but it might, since partition has to traverse the entire subrange anyway.
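For concreteness, the modified heapsort mentioned earlier (extract heap elements and deposit them into the target range rather than sorting in place) might be sketched like this; `heapq` stands in for an in-place binary heap, and `heap_sort_unique_into` is an illustrative name:

```python
import heapq

def heap_sort_unique_into(src, dest, start):
    """Sort src, drop duplicates, and deposit the result into dest
    beginning at index `start`; returns one past the last index written.

    A heap pops elements in sorted order, so duplicates come out
    adjacently and can be skipped with a single comparison each."""
    heapq.heapify(src)               # O(n) heap construction
    i = start
    prev = object()                  # sentinel that equals nothing
    while src:
        x = heapq.heappop(src)       # O(lg n) per extraction
        if x != prev:                # skip runs of duplicates
            dest[i] = x
            i += 1
            prev = x
    return i
```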


T

-- 
An old friend is better than two new ones.
January 23, 2015
On 1/22/15 4:26 PM, H. S. Teoh via Digitalmars-d wrote:
> The question, though, is whether this hybrid recursion actually wins us
> anything, since heapsort is known to be somewhat slower than quicksort.
> In the worst case, you choose a poor pivot and get*both*  the worst case
> of quicksort and the inferior performance of heapsort compounded
> together.

The glass-half-full view of this is you combine the advantages of quicksort and heapsort :o).

> Unless there's a way to modify heapsort to be more efficient as well?
>
> Actually, another idea occurred to me: suppose we modify partition()
> instead, so that instead of partitioning in-place, it deposits
> partitioned elements into a target (possibly overlapping) range. Then we
> modify the right-recursion, so that its partitioning step actually
> simultaneously moves the range over to the target area for us. Not sure
> if this will eliminate the copying cost, but it might, since partition
> has to traverse the entire subrange anyway.

That's interesting!


Andrei

January 23, 2015
On Thursday, 22 January 2015 at 21:40:57 UTC, Andrei Alexandrescu wrote:
> There's this classic pattern on Unix: |sort|uniq, i.e. sort some data and only display the unique elements.
>
> What would be a better integrated version - one that does sorting and uniq in one shot? I suspect the combination could be quite a bit better than doing the two in sequence.
>
> A few google searches didn't yield much. Ideas?
>
>
> Thanks,
>
> Andrei

My thought is that uniq on a sorted list is only an O(n) operation, so it's not an expensive operation by any means. If there's to be any benefit to a sortUniq, it has to reduce the work incurred during the sorting process; otherwise you'll just end up with something less efficient.

One solution may be RedBlackTree, which has an option to disallow duplicate elements. This container has three useful properties:

(1) The tree grows one element at a time, in contrast to algorithms like quicksort or heapsort, which must operate on the entire set of elements at once.

(2) Removing duplicates only requires a single comparison per element, thus retaining the worst-case cost of |sort|uniq.

(3) Duplicate elements are removed immediately. Inserting an element into a balanced tree with N elements is an O(lg n) operation, so the smaller n is, the less work that is required.

The shortcoming of RedBlackTree is that it's not very fast in practice. However, I don't know of any other O(n lg n) sorting algorithm with these properties.
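A sketch of the tree-based approach (Python's bisect-maintained list stands in for RedBlackTree here, so each insert is O(n) rather than O(lg n); the point is property (3), rejecting duplicates at insertion time so the structure never grows past the number of unique values):

```python
import bisect

def tree_sort_unique(items):
    """Insert elements one at a time into a sorted structure, skipping
    duplicates immediately, so n in the per-insert cost is the count of
    unique values seen so far, not the total input size."""
    tree = []                                # stand-in for a balanced tree
    for x in items:
        i = bisect.bisect_left(tree, x)
        if i == len(tree) or tree[i] != x:   # one equality check per element
            tree.insert(i, x)
        # else: duplicate, rejected without ever entering the structure
    return tree
```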