January 28, 2012
Andrei Alexandrescu wrote:

> That's three sorts and at least one temporary array.

If a temporary array is allowed, the ranks and the sum of the ranks might be computed by a diagonal sweep over the area defined by the two dimensions and populated by the elements.

Population and sweep would each be linear in time.

-manfred
January 28, 2012
On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu wrote:
> The brute force approach would essentially compute the two ranks and then sort by their sum. That's three sorts and at least one temporary array. But I think things can be done considerably better. Any ideas?

Can you define "better"?

Obviously you can't beat the O(n lg n) for comparison based ordering. The O(n) space may be avoidable, but it doesn't seem like the end of the world.

Does this problem occur frequently enough in practice to justify having an optimised version in the standard library? Three sorts + O(n) space sounds fine to me.
January 28, 2012
On 1/27/12 8:26 PM, Manfred Nowak wrote:
> Andrei Alexandrescu wrote:
>
>> That's three sorts and at least one temporary array.
>
> If a temporary array is allowed, the ranks and the sum of the ranks
> might be computed by a diagonal sweep over the area defined by the two
> dimensions and populated by the elements.
>
> Population and sweep would each be linear in time.
>
> -manfred

Interesting; I have some thoughts along the same lines. Could you please give more detail?

Andrei
January 28, 2012
On 1/27/12 8:33 PM, Peter Alexander wrote:
> On Friday, 27 January 2012 at 19:36:49 UTC, Andrei Alexandrescu wrote:
>> The brute force approach would essentially compute the two ranks and
>> then sort by their sum. That's three sorts and at least one temporary
>> array. But I think things can be done considerably better. Any ideas?
>
> Can you define "better"?
>
> Obviously you can't beat the O(n lg n) for comparison based ordering.
> The O(n) space may be avoidable, but it doesn't seem like the end of the
> world.
>
> Does this problem occur frequently enough in practice to justify having
> an optimised version in the standard library? Three sorts + O(n) space
> sounds fine to me.

I wasn't thinking necessarily of the standard library, but it is something that is often needed even if not perceived as such. For example, the stock screeners offered by financial sites are crappy - they allow you to specify a range of e.g. P/E, P/book value, dividends etc. etc. and sort by one of those. This is crappy - generally what one wants is a selection of "best of breed" stocks that are among the top ones in a variety of categories. (Relative importance could be assigned to each category.)

A colleague at Facebook mentioned that you can use O(n) bucket sort for the last-leg sorting.
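Indeed, since every rank sum is an integer between 0 and k(n-1), the final sort can be done in O(n) time. A minimal Python sketch of that counting/bucket sort (the function name is mine, for illustration only):

```python
def bucket_sort_by_ranksum(pairs, max_sum):
    """pairs: (item, rank_sum) tuples, with 0 <= rank_sum <= max_sum."""
    # One bucket per possible rank sum.
    buckets = [[] for _ in range(max_sum + 1)]
    for item, s in pairs:
        buckets[s].append(item)
    # Concatenating the buckets in order yields the sorted items.
    return [item for bucket in buckets for item in bucket]
```

For n elements and k keys, max_sum is k*(n-1), so both the bucket array and the two passes are linear in n for fixed k.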

The best approach I'm thinking of is to allocate an extra array of size_t. First sort by k1, then fill the array with 0, 1, ..., n-1. Then sort the two arrays simultaneously by k2, and afterwards add 0 to the first slot, 1 to the second slot, ..., n-1 to the last slot. Continue this for any number of keys, and finally sort using the secondary array as the key. This works for any number of keys and requires only one extra array of size_t.
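This scheme can be sketched as follows, assuming distinct key values so ranks are well defined (names and the list-of-pairs representation are mine, not from the post):

```python
def sort_by_rank_sum(items, keys):
    """Sort items by the sum of their ranks under each key function.

    One auxiliary integer per element, one sort per key, plus a
    final sort by the accumulated rank sums.
    """
    # Pair each item with its running rank sum, initially zero.
    pairs = [(item, 0) for item in items]
    for key in keys:
        # Sort items and their accumulators simultaneously by this key...
        pairs.sort(key=lambda p: key(p[0]))
        # ...then add each element's position, i.e. its rank under this key.
        pairs = [(item, acc + i) for i, (item, acc) in enumerate(pairs)]
    # Finally sort by the accumulated rank sum.
    pairs.sort(key=lambda p: p[1])
    return [item for item, _ in pairs]
```

For example, sorting [40, 10, 30, 20] by the sum of its ranks under the identity and under x % 7 yields [10, 30, 20, 40].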

The more interesting problem is yielding the answer lazily. I'm unclear what the best algorithm for such would be.


Andrei
January 28, 2012
On Fri, Jan 27, 2012 at 09:22:26PM -0500, Andrei Alexandrescu wrote: [...]
> I wasn't thinking necessarily of the standard library, but it is something that is often needed even if not perceived as such. For example, the stock screeners offered by financial sites are crappy - they allow you to specify a range of e.g. P/E, P/book value, dividends etc. etc. and sort by one of those. This is crappy - generally what one wants is a selection of "best of breed" stocks that are among the top ones in a variety of categories. (Relative importance could be assigned to each category.)
[...]

If I understand the problem correctly, isn't this a kind of optimization problem? I.e., given a set A of k objects with n attributes, find the object(s) that maximizes a given measure function f over its attributes, where f is a weighted sum of n attributes.

If you treat the attributes as components of an n-dimensional vector, then you could recast the problem as the convex hull of k hypercuboids, where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The points on the convex hull then are the high-ranking points (points inside the hull are obviously suboptimal). Furthermore, the points closest to the line spanned by the vector (1,1,1,...) are those that do well in the greatest number of attributes.

If you "peel away" the points on the convex hull, then the convex hull generated by the remaining points gives you the "second class" points. Peel away those, and the convex hull gives you the "third class" points, etc..
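In two dimensions the peeling idea can be sketched like this (a rough illustration using Andrew's monotone chain hull, not anything from the thread; generalizing to n dimensions is considerably harder):

```python
def convex_hull(points):
    """Andrew's monotone chain; returns the hull counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_layers(points):
    """Repeatedly peel away the hull: layer 0 = "first class" points,
    layer 1 = "second class" points, and so on."""
    layers, remaining = [], list(points)
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        remaining = [p for p in remaining if p not in hull]
    return layers
```

For a square with one interior point, the first layer is the four corners and the second layer is the center.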

Granted, this isn't exactly the original problem as stated, but it seems to be closer to what you intend to achieve with your stocks example.


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
January 28, 2012
Andrei Alexandrescu wrote:

> generally what one wants is a selection of "best of breed" stocks that are among the top ones in a variety of categories. (Relative importance could be assigned to each category.)

This is quite different from, and easier than, the problem initially stated, because ranks need not be computed.

1) Compute the individual importance `ii' of each element `e', which
seems to be the sum over all categories `c' of all
  `e.c.value * c.importance'
2) Use the linear median algorithm to find the element `ek' whose
individual importance `ek.ii' has rank `k' in the set of all elements
`e' and `k' is the number of elements `e' acceptable as "best of the
breed"
3) return all elements `e' with `e.ii >= ek.ii'. This can be done
by a sweep over the array.

Total running time approximately: #elements * #categories
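The three steps above can be sketched as follows (names are illustrative; an expected-linear quickselect stands in for the worst-case-linear median-of-medians algorithm, for brevity):

```python
import random

def top_k_by_importance(elements, importances, k):
    """Return the k "best of breed" elements.

    Each element is a tuple of category values; `importances` is the
    weight vector, one weight per category.
    """
    # Step 1: individual importance = weighted sum over categories.
    scored = [(sum(v * w for v, w in zip(e, importances)), e)
              for e in elements]
    # Step 2: find the k-th largest score with expected-linear quickselect.
    def kth_largest(scores, k):
        pivot = random.choice(scores)
        above = [s for s in scores if s > pivot]
        equal = [s for s in scores if s == pivot]
        if k <= len(above):
            return kth_largest(above, k)
        if k <= len(above) + len(equal):
            return pivot
        return kth_largest([s for s in scores if s < pivot],
                           k - len(above) - len(equal))
    threshold = kth_largest([s for s, _ in scored], k)
    # Step 3: one sweep returns every element at or above the threshold.
    return [e for s, e in scored if s >= threshold]
```

Steps 1 and 3 are single passes, and quickselect is expected linear, giving the #elements * #categories total stated above.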


Please note that this relies on having chosen the vector of relative importances for the categories correctly. How does one know that the choice is correct?

For example, in marketing it is desirable to occupy unique selling points, because the gain in profit might be huge. But this also means that the loss for the buyer may be huge.

-manfred


January 28, 2012
On 1/27/12 11:02 PM, H. S. Teoh wrote:
> If you treat the attributes as components of an n-dimensional vector,
> then you could recast the problem as the convex hull of k hypercuboids,
> where each hypercuboid lies between (0,0,0,...) and (x1,x2,x3,...). The
> points on the convex hull then are the high-ranking points (points
> inside the hull are obviously suboptimal). Furthermore, the points
> closest to the line spanned by the vector (1,1,1,...) are those that
> do well in the greatest number of attributes.
>
> If you "peel away" the points on the convex hull, then the convex hull
> generated by the remaining points gives you the "second class" points.
> Peel away those, and the convex hull gives you the "third class" points,
> etc..
>
> Granted, this isn't exactly the original problem as stated, but it seems
> to be closer to what you intend to achieve with your stocks example.

That's close. Closer is to use simplexes instead of hypercubes. And the space is defined by rank, not the original features.

Andrei
January 28, 2012
On 1/28/12 1:08 AM, Manfred Nowak wrote:
> Andrei Alexandrescu wrote:
>
>> generally what one wants is a selection of "best of breed" stocks
>> that are among the top ones in a variety of categories. (Relative
>> importance could be assigned to each category.)
>
> This is quite different from, and easier than, the problem initially
> stated, because ranks need not be computed.

Actually they need to be computed. I think it's possible to find cases where rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2) but alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2.
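One such case, as a quick check (a hypothetical three-element instance with feature vectors (k1, k2) and alpha = beta = 1, constructed for this reply):

```python
# Three elements; c exists only to shift the ranks of a and b apart.
a, b, c = (10, 10), (11, 0), (0, 5)
elems = [a, b, c]

def rank(e, key):
    """Position e would occupy if elems were sorted by key."""
    return sorted(elems, key=key).index(e)

def ranksum(e):
    return rank(e, lambda x: x[0]) + rank(e, lambda x: x[1])

def weighted(e, alpha=1, beta=1):
    return alpha * e[0] + beta * e[1]

# b beats a on rank sum, yet a beats b on the weighted feature sum:
assert ranksum(b) < ranksum(a)
assert weighted(a) > weighted(b)
```

Ranks discard the magnitudes of the feature values, so the two orderings can disagree whenever one element wins by a large margin on a single key.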

One potentially confusing issue is that importance applies to rank, not features.


Andrei
January 28, 2012
Andrei Alexandrescu wrote:

[ About ranks]
> Actually they need to be computed.

Then the problem is still too unclear.

> I think it's possible to find cases where
>    rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2)
> but
>    alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2.

Of course. Especially if the a, b, ... are only partially sortable. But if the ranks are known, the problem can be finished with the linear median algorithm.

> One potentially confusing issue is that importance applies to rank, not features.

Do you mean that
  alpha * rank(a, k1) + beta * rank(a, k2)
has to be extremized for all `a'?

Again: the problem is still unclear.

-manfred
January 28, 2012
On 1/28/12 9:10 AM, Manfred Nowak wrote:
> Andrei Alexandrescu wrote:
>
> [ About ranks]
>> Actually they need to be computed.
>
> Then the problem is still too unclear.

It's very simple and clear in formulation actually.

Given a range r of elements, define rank(a, p) as the position that object a would have if range r were sorted by predicate p. So the rank is a number between 0 and n-1 (n being the number of elements in the range). For example, if we sort [ 40, 10, 30, 20 ] using "<" as the predicate, the rank of 20 is 1. (For simplicity we can consider for now that all elements are distinct.)

Now say we have multiple predicates, p1, p2, ..., pk. We want to sort elements in increasing order of rank(a, p1) + ... + rank(a, pk).
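As a specification-level transcription of that definition (O(k * n^2), so not the intended algorithm, just a check of what is being asked; names are mine):

```python
def rank(r, a, pred):
    """Position a would occupy if r were sorted by pred (elements distinct)."""
    # With distinct elements, a's sorted position equals the number of
    # elements that precede it under the predicate.
    return sum(1 for x in r if pred(x, a))

def rank_sum_sorted(r, preds):
    """Sort r in increasing order of rank(a, p1) + ... + rank(a, pk)."""
    return sorted(r, key=lambda a: sum(rank(r, a, p) for p in preds))
```

With r = [40, 10, 30, 20] and "<" as the sole predicate, rank(r, 20, <) is 1, matching the example above.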

>> I think it's possible to find cases where
>>     rank(a, k1) + rank(a, k2) < rank(b, k1) + rank(b, k2)
>> but
>>     alpha * a.k1 + beta * a.k2 > alpha * b.k1 + beta * b.k2.
>
> Of course. Especially if the a, b, ... are only partially sortable.
> But if the ranks are known, the problem can be finished with the
> linear median algorithm.
>
>> One potentially confusing issue is that importance applies to
>> rank, not features.
>
> Do you mean that
>    alpha * rank(a, k1) + beta * rank(a, k2)
> has to be extremized for all `a'?

I don't understand this.

> Again: the problem is still unclear.

Hopefully the above has clarified it. Thanks for looking into it!


Andrei