November 17, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Timon Gehr
On Sunday, 17 November 2013 at 13:07:40 UTC, Timon Gehr wrote:
> On 11/17/2013 02:07 AM, Ivan Kazmenko wrote:
>> The random pick fails in the following sense: if we seed the RNG,
>> construct a killer case, and then start with the same seed, we get
>> Theta(n^2) behavior reproduced.
>
> Hence, in no sense. This does not perform independent uniform random picks.
Not at all. There are a number of situations where you want your program to use an RNG, but it is also important for the result to be reproducible. In such cases, you typically store the RNG seed for re-use.
Of course there are also many cases where you don't need reproducibility guarantees, and there, the attack is useless.
Ivan Kazmenko.
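A minimal sketch of that reproducibility point, assuming Phobos' std.random (the seed value and names are illustrative, not from the thread): two generators constructed from the same stored seed produce the same "random" pick sequence, which is exactly what lets a killer input be replayed.
---
import std.random;

void main()
{
    enum storedSeed = 42u;          // persisted so the run can be reproduced
    auto a = Mt19937(storedSeed);
    auto b = Mt19937(storedSeed);

    // Identical seeds give identical pick sequences, so an input crafted
    // against one run also defeats every replay of that run.
    foreach (_; 0 .. 10)
        assert(uniform(0, 1000, a) == uniform(0, 1000, b));
}
---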
November 17, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Andrei Alexandrescu
On Sunday, 17 November 2013 at 07:19:26 UTC, Andrei Alexandrescu wrote:
> On 11/16/13 9:21 PM, Chris Cain wrote:
>> That said, it might also be reproduced "well enough" using a random
>> generator to create similar strings to sort, but the basic idea is
>> there. I just like using real genomes for performance testing things :)
>
> I am hoping for some more representative corpora, along the lines of http://sortbenchmark.org/. Some data that we can use as good proxies for typical application usage.
>
> Andrei
I think I get what you're saying, but sortbenchmark.org uses completely pseudorandom (but reproducible) entries that I don't think are representative of real data either:
(using gensort -a minus the verification columns)
---
AsfAGHM5om
~sHd0jDv6X
uI^EYm8s=|
Q)JN)R9z-L
o4FoBkqERn
*}-Wz1;TD-
0fssx}~[oB
...
---
Most places use very fake data as proxies for real data. It's better to have something somewhat structured and choose data that, despite not being real data, stresses the benchmark in a unique way.
I'm not suggesting my benchmark be the only one; if we're going to use pseudorandom data (I'm not certain we could actually get "realistic data" that would serve us that much better) we might as well have different test cases that stress the sort routine in different ways. Obviously, using the real genome would be preferable to generating some (since it's actually truly "real" data, just used in an unorthodox way) but there's a disadvantage to attaching a 4.6MB file to a benchmarking setup. Especially if more might come.
Anyway, it's a reasonable representation of "data that has no discernible order that can occasionally take some time to compare." Think something like sorting a list of customer records by name. If they're ordered by ID, then the names would not likely have a discernible pattern, and the comparison between names might be "more expensive" because some names are common (more characters must be compared before a difference is found).
We can do "more realistic" for that type of scenario, if you'd like. I could look at a distribution for last names/first names and generate fake names to sort in a reasonable approximation of a distribution of real names. I'm not certain the outcome would change that much.
November 17, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Ivan Kazmenko
On 11/17/13 2:20 AM, Ivan Kazmenko wrote:
> On Sunday, 17 November 2013 at 03:58:58 UTC, Andrei Alexandrescu wrote:
>> On 11/16/13 5:07 PM, Ivan Kazmenko wrote:
>>> The above is just my retelling of a great short article "A Killer
>>> Adversary for Quicksort" by M. D. McIlroy here:
>>> http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
>>
>> Nice story, but the setup is a tad tenuous (albeit indeed
>> theoretically interesting). For starters, if the attacker has a hook
>> into the comparison function, they could trivially do a lot worse...
>
> Actually, I was thinking about a two-phase attack, and it is not at all
> unlikely.
>
> 0. Say, the Sorter presents a library quicksort solution. It may be
> closed source, but can take a comparison function as an argument.
>
> 1. On the preparation phase, the Attacker uses the tricky comparison
> function described previously. What's important is that, besides
> observing Theta(n^2) behavior once, he now gets a real array a[] such
> that this behavior can be reproduced.
>
> 2. On the attack phase, the Attacker does not need access to the
> comparison function. He just feeds the array obtained on the previous
> step as plain data.
>
> Ivan Kazmenko.
That won't do with randomized pivot selection.
Andrei
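For reference, a minimal D sketch of the two-phase construction described in the quote, following McIlroy's paper; `buildKillerInput`, `victimSort`, and the other names are illustrative, and the sketch assumes the victim's pivot choice is deterministic, which is exactly the assumption randomized selection breaks.
---
import std.range, std.array;

// Phase 1: run the victim sort on the item identities 0 .. n-1 with an
// adversarial comparator that pins values lazily so the chosen pivot always
// ends up looking bad.
// Phase 2: the frozen values themselves are the killer input; feeding them to
// the same deterministic sort with an ordinary "<" replays the quadratic run.
int[] buildKillerInput(size_t n,
        void delegate(size_t[], bool delegate(size_t, size_t)) victimSort)
{
    immutable gas = cast(int) n;    // marker: value not frozen yet
    auto val = new int[n];
    val[] = gas;
    int nsolid = 0;                 // how many values have been frozen so far
    size_t candidate = 0;           // current pivot candidate

    bool adversaryLess(size_t x, size_t y)
    {
        if (val[x] == gas && val[y] == gas)
        {
            if (x == candidate) val[x] = nsolid++;
            else                val[y] = nsolid++;
        }
        if (val[x] == gas)      candidate = x;
        else if (val[y] == gas) candidate = y;
        return val[x] < val[y];
    }

    auto items = iota(n).array;
    victimSort(items, &adversaryLess);

    // Freeze anything never involved in a decisive comparison.
    foreach (ref v; val)
        if (v == gas) v = nsolid++;
    return val;
}
---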
November 17, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Jean Christophe
On Sunday, 17 November 2013 at 05:30:24 UTC, Jean Christophe wrote:
>> You mean, sort!`a.foo < b.foo` ?
>
> Yes.
>
> An indirect sorting, assuming a and b to be objects of class SomePotentialyLargeClass.
>
> Because the array to sort contains pointers only, all the data movement is essentially the same as if we were sorting integers.
If the range elements are reference types, that's what will happen (unless they overload opAssign or something like that). Otherwise, there's makeIndex (already mentioned by Andrei; see the sketch below), or you could also do it by hand:
1. r.length.iota.array.sort!((a, b) => r[a]<r[b]);
2. r.length.iota.map!(a => &r[a]).array.sort!((a, b) => *a<*b);
r.map!`&a` doesn't work though, because we don't get a reference to the range element in the predicate.
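A short sketch of the makeIndex route mentioned above (the Record type is made up for illustration): makeIndex fills an array of indices and sorts it by the given predicate, so only size_t values move while the records stay put.
---
import std.algorithm, std.stdio;

struct Record { string name; int id; }

void main()
{
    auto records = [Record("Smith", 3), Record("Adams", 1), Record("Jones", 2)];

    // Fill `index` with 0 .. records.length, ordered so that
    // records[index[0]], records[index[1]], ... come out sorted by name.
    auto index = new size_t[records.length];
    makeIndex!((a, b) => a.name < b.name)(records, index);

    foreach (i; index)
        writeln(records[i].name);   // Adams, Jones, Smith
}
---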
November 17, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Craig Dillabaugh
On Sunday, 17 November 2013 at 08:33:11 UTC, Craig Dillabaugh wrote:
> http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15451-s07/www/lecture_notes/lect0125.pdf
Nice! I didn't know about this - although it still seems to be a lot of work for each 'n'.
November 17, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Chris Cain
On 11/17/13 6:20 AM, Chris Cain wrote:
> I'm not suggesting my benchmark be the only one; if we're going to use
> pseudorandom data (I'm not certain we could actually get "realistic
> data" that would serve us that much better) we might as well have
> different test cases that stress the sort routine in different ways.
> Obviously, using the real genome would be preferable to generating some
> (since it's actually truly "real" data, just used in an unorthodox way)
> but there's a disadvantage to attaching a 4.6MB file to a benchmarking
> setup. Especially if more might come.

OK, since I see you have some interest... You said nobody would care to actually sort genome data. I'm aiming for data that's likely to be a good proxy for tasks people are interested in doing.

For example, people may be interested in sorting floating-point numbers resulting from sales, measurements, frequencies, probabilities, and whatnot. Since most of those have a Gaussian distribution, a corpus with Gaussian-distributed measurements would be nice.

Then, people may want to sort things by date/time. Depending on the scale the distribution is different - diurnal cycle, weekly cycle, seasonal cycle, secular ebbs and flows etc. I'm unclear on what would be a good set of data. For sub-day time ranges uniform distribution may be appropriate.

Then, people may want to sort records by e.g. Lastname, Firstname, or index a text by words. For names we'd need some census data or phonebook. For general text sorting we can use classic texts such as Alice in Wonderland or the King James Bible (see http://corpus.canterbury.ac.nz/descriptions/). Sorting by word length is a possibility (word lengths are probably Gaussian-distributed).

Uniform random data is also a baseline, not terribly representative, but worth keeping an eye on. In fact uniform data is unfairly rigged in quicksort's favor: any pivot is likely to be pretty good, and there are no sorted runs that often occur in real data.

Andrei
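As a rough illustration of the Gaussian-distributed corpus idea: a sketch using a Box-Muller transform. Phobos has no normal-distribution sampler, so the `gaussian` helper below is made up for this example, and the mean/standard deviation are arbitrary.
---
import std.algorithm, std.math, std.random, std.stdio;

// Box-Muller transform: two uniform deviates in (0, 1) -> one normal deviate.
double gaussian(ref Random rng, double mean = 0.0, double stdDev = 1.0)
{
    immutable u1 = uniform!"()"(0.0, 1.0, rng);
    immutable u2 = uniform!"()"(0.0, 1.0, rng);
    return mean + stdDev * sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

void main()
{
    auto rng = Random(unpredictableSeed);
    auto samples = new double[1_000_000];
    foreach (ref s; samples)
        s = gaussian(rng, 100.0, 15.0);   // fake "measurements"

    // The corpus for the benchmark; time this call with StopWatch etc.
    sort(samples);
}
---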
November 17, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Andrei Alexandrescu
On Sunday, 17 November 2013 at 16:57:19 UTC, Andrei Alexandrescu wrote:
> Then, people may want to sort records by e.g. Lastname, Firstname, or
> index a text by words. For names we'd need some census data or phonebook.

Sure.

As a source of data, app_c.csv from http://www.census.gov/genealogy/www/data/2000surnames/ -> File B: Surnames...

Test case:
---
import std.stdio, std.algorithm, std.random, std.range, std.conv, std.datetime;

// Name generator constants
enum NumberOfPossibleNames = 1_000_000;
enum NumberOfRandomNames = 1_000_000;
enum SortStrategy = SwapStrategy.unstable;

void main() {
    auto data = File("app_c.csv").byLine
                .drop(1) // First line is just column names
                .take(NumberOfPossibleNames)
                .map!(e => e.text.split(","))
                .array;
    auto names = data.map!(e => e[0]).array;
    auto proportions = data.map!(e => e[2].to!size_t).array;

    auto rnd = Random(50);
    auto randomNames = fastDice(rnd, proportions)
                       .take(NumberOfRandomNames)
                       .map!(i => names[i])
                       .array;

    StopWatch sw = AutoStart.yes;
    randomNames.sort!("a < b", SortStrategy)();
    sw.stop();

    writeln(randomNames.length, " names sorted in ", sw.peek.msecs,
            " msecs using ", SortStrategy, " sort");
}

struct FastDice(Rng, SearchPolicy pol = SearchPolicy.gallop) {
    SortedRange!(size_t[]) _propCumSumSorted;
    size_t _sum;
    size_t _front;
    Rng* _rng;

    this(ref Rng rng, size_t[] proportions) {
        size_t[] _propCumSum = proportions.save.array;
        _rng = &rng;

        size_t mass = 0;
        foreach(ref e; _propCumSum) {
            mass += e;
            e = mass;
        }
        _sum = _propCumSum.back;
        _propCumSumSorted = assumeSorted(_propCumSum);

        popFront();
    }

    void popFront() {
        immutable point = uniform(0, _sum, *_rng);
        assert(point < _sum);
        _front = _propCumSumSorted.lowerBound!pol(point).length;
    }

    enum empty = false;

    @property auto front() { return _front; }
}

auto fastDice(Rng)(ref Rng rng, size_t[] proportions) {
    return FastDice!(Rng)(rng, proportions);
}

auto fastDice(size_t[] proportions) {
    return fastDice(rndGen, proportions);
}
---

Results (using `-O -inline -release -noboundschecks` on my computer):

unstable sort: 738 msecs
stable sort: 1001 msecs

Also, to do this (in reasonable time) I had to create a new range which I called "FastDice" ... it does the same as std.random.dice, but is intended for cases where you'll be throwing dice throws often on the same data, so it does a bit of precomputation (creating a cumulative sum array) and allows for binary search/gallop to reduce the time to come up with each throw. I opted for gallop in this since the data is sorted in such a way that most common names come first.

It needs a bit of work to make it actually generic, but it's a good start and I'll just say it's WTFPL code, so use it for whatever. It'll be especially good for generating test cases like I have done above.

FWIW, FastDice takes ~400ms to generate all those dice throws. I did a back-of-the-envelope calculation on using dice, and just the time saved caching the sum saved maybe 30 minutes (No, I didn't wait that long, I stopped after 5 and wrote FastDice) of time each run. And the time saved using a gallop search instead of linear search is pretty significant as well (difference between n and log n time).
November 17, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Ivan Kazmenko
On 11/17/2013 02:14 PM, Ivan Kazmenko wrote:
> On Sunday, 17 November 2013 at 13:07:40 UTC, Timon Gehr wrote:
>> On 11/17/2013 02:07 AM, Ivan Kazmenko wrote:
>>> The random pick fails in the following sense: if we seed the RNG,
>>> construct a killer case, and then start with the same seed, we get
>>> Theta(n^2) behavior reproduced.
>>
>> Hence, in no sense. This does not perform independent uniform random
>> picks.
>
> Not at all. There are a number of situations where you want your program
> to use an RNG, but it is also important for the result to be reproducible.
> In such cases, you typically store the RNG seed for re-use.
>
> Of course there are also many cases where you don't need reproducibility
> guarantees, and there, the attack is useless.
>
> Ivan Kazmenko.
One can't say random picking is bad because some other (deterministic) pivot selection algorithm used in its place is bad. If you need deterministic reproducibility guarantees, then random picking is useless.
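A small sketch of the distinction (the names are illustrative): with a stored seed the pivot picks form one fixed, replayable sequence, whereas seeding from fresh entropy each run keeps the usual randomized-quicksort guarantee, at the cost of reproducibility.
---
import std.random;

size_t pickPivot(ref Random rng, size_t lo, size_t hi)
{
    // One uniform pick in [lo, hi).
    return uniform(lo, hi, rng);
}

void main()
{
    auto replayed = Random(42);                // reproducible, hence attackable by replay
    auto fresh    = Random(unpredictableSeed); // differs every run, so replay attacks fail

    auto p1 = pickPivot(replayed, 0, 1000);
    auto p2 = pickPivot(fresh, 0, 1000);
    assert(p1 < 1000 && p2 < 1000);
}
---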
November 18, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Chris Cain
On 11/17/13 1:23 PM, Chris Cain wrote:
> On Sunday, 17 November 2013 at 16:57:19 UTC, Andrei Alexandrescu wrote:
>> Then, people may want to sort records by e.g. Lastname, Firstname, or
>> index a text by words. For names we'd need some census data or phonebook.
>
> Sure.
>
> As a source of data, app_c.csv from
> http://www.census.gov/genealogy/www/data/2000surnames/ -> File B:
> Surnames...
[snip]
> It needs a bit of work to make it actually generic, but it's a good
> start and I'll just say it's WTFPL code, so use it for whatever. It'll
> be especially good for generating test cases like I have done above.
>
> FWIW, FastDice takes ~400ms to generate all those dice throws. I did a
> back-of-the-envelope calculation on using dice, and just the time saved
> caching the sum saved maybe 30 minutes (No, I didn't wait that long, I
> stopped after 5 and wrote FastDice) of time each run. And the time saved
> using a gallop search instead of linear search is pretty significant as
> well (difference between n and log n time).

I think that's a terrific start! (Not sure I understand where the 30 minutes come from...)

Andrei
November 18, 2013 Re: Worst-case performance of quickSort / getPivot
Posted in reply to Chris Cain
Chris Cain:

> Also, to do this (in reasonable time) I had to create a new range which
> I called "FastDice" ... it does the same as std.random.dice, but is
> intended for cases where you'll be throwing dice throws often on the
> same data, so it does a bit of precomputation (creating a cumulative
> sum array)

See also: http://d.puremagic.com/issues/show_bug.cgi?id=5849

Bye,
bearophile