topN using a heap (page 6) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » topN using a heap (page 6)

January 19, 2016

Re: topN using a heap

Posted by Ilya
in reply to Andrei Alexandrescu

Ilya

Posted in reply to Andrei Alexandrescu

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu wrote:
> On 1/18/16 7:46 PM, Ilya wrote:
>> On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
>>> On 01/19/2016 01:12 AM, Ilya wrote:
>>>>>
>>>>
>>>> There is already implementation with predictable seed. Proof:
>>>> https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
>>>>
>>>> --Ilya
>>>
>>> The default RNG is seeded with unpredictableSeed. What is the point
>>> you are trying to make?
>>
>> unpredictableSeed is initialized only once and can be easily estimated.
>> --Ilya
>
> https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120
>
> unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0.
>
> How do you think things could be improved?
>
>
> Andrei

A good variant with minimal overhead is to call unpredictableSeed for sorting big arrays each time (one seed per array):
   a. A hacker would not be able to estimate a seed using other API calls. For example, "give me a set of random numbers".
   b. A hacker would not be able to estimate a seed using a small arrays because they don't use RNGs. (and they have not any overhead!)
   c. A hacker would not be able to estimate a seed for big arrays, because attack based on time measurement would not work for big arrays.

EDIT:
  c. ... because each big array has own seed and time measurement would not work for big arrays with different seeds.

__EDIT2__:
To be fair (c) is not really correct. To do (c) correct a thread local global seed should be used to generate unpredictableSeed for each big arrays.

Ilya

January 19, 2016

Re: topN using a heap

Posted by deadalnix
in reply to Andrei Alexandrescu

deadalnix

Posted in reply to Andrei Alexandrescu

On Tuesday, 19 January 2016 at 00:17:30 UTC, Andrei Alexandrescu wrote:
> How would this translate to a matter of selecting the pivot during sort? -- Andrei

A large chunk of a given datacenter going quadratic at the same time.

January 19, 2016

Re: topN using a heap

Posted by Ivan Kazmenko
in reply to Andrei Alexandrescu

Ivan Kazmenko

Posted in reply to Andrei Alexandrescu

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
> 4. sort was and is attackable before all of these changes

No, sort utilizes Introsort (Quicksort but switch to Heapsort if recurse too deep): see https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/sorting.d#L1048-L1053.  This improvement happened a year or two ago, when algorithm.d was not even separated into sub-modules.

My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") admits that by not being able to slow the sort down.  But, if I comment out line 1053 to disable the switching, things go quadratic as expected.

L1053:    depth = depth >= depth.max / 2 ? (depth / 3) * 2 : (depth * 2) / 3;

The same remedy can help for topN: if we called partition more than log (length) in base (3/2) times, switch to the heap implementation of topN regardless of whether n is close to the edge.

Ivan Kazmenko.

January 19, 2016

Re: topN using a heap

Posted by Xinok
in reply to Andrei Alexandrescu

Xinok

Posted in reply to Andrei Alexandrescu

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
> Of course not. I think this back-and-forth takes away from the gist of things. So let me summarize what has happened:
>
> 1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses.
>
> 2. Now that both functions use getPivot, I set out to improve it in https://github.com/D-Programming-Language/phobos/pull/3922. The problem is I couldn't make getPivot impure; pure functions already call sort, so making it impure would have been a regression.
>
> 3. So I made getPivot use just a bit of randomness taken from the length of the range.
>
> 4. sort was and is attackable before all of these changes
>
> 5. So now we have pure topN and sort (subject to the range offering pure primitives) but they are both attackable.
>
> 6. PRNGs don't have any other downside than making functions impure.
>
> The way I see it we have these solutions:
>
> (a) make topN still use a random pivot. That way there's no regression. Then proceed and make sort() avoid quadratic cases in other ways, e.g. switch to heapsort if performance degrades.
>
> (b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all.
>
> (c) Seek randomness in other places, e.g. address of elements, data hashes etc. Come with a solution that may still be attacked in narrow cases but make that unlikely enough to be a real concern.
>
>
> Andrei

To be clear, sort is NOT attackable. Introsort is used for unstable sorting which begins with quick sort but falls back to heap sort after too many recursions to maintain O(n log n) running time. Timsort is used for stable sorting which is a variant of merge sort but still runs in O(n log n) time. In either case, you're guaranteed to have O(n log n) running time in the worst case.

On the other hand, someone can improve upon quick sort by choosing better pivots but be careful not to add too much overhead. A couple years ago, I tried choosing the pivot from a median of five but the result was as much as 10% slower. One idea is to begin initially choosing better pivots, but once the sublists fall below a certain length, simply choose the pivot from a median of three to avoid the extra overhead.

As for topN, there are two approaches I'm aware of that are deterministic without resorting to impurities or RNGs. The first is introselect which is similar to introsort in that it has a fall back algorithm. The other approach is to choose the pivot from a median of medians. My idea is a variant of the latter approach: Begin by choosing the pivot from a median of five. If you continuously choose bad pivots, take a larger sample of elements to choose the pivot from. Keep growing the sample as necessary.

Speaking of RNGs, they're technically pure as long as you always use the same seed. So what if we generated a random seed at the start of the process, but then reused that same seed over and over in pure functions for the duration of the process?

January 19, 2016

Re: topN using a heap

Posted by Ivan Kazmenko
in reply to Andrei Alexandrescu

Ivan Kazmenko

Posted in reply to Andrei Alexandrescu

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
> 4. sort was and is attackable before all of these changes
>
> (b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all.

Since point 4 is in fact already fixed a good while ago, my suggestion would be (b): to do the same for topN.

Introselect (https://en.wikipedia.org/wiki/Introselect), already mentioned in this thread, uses median-of-medians to achieve worst case O(n) performance if we recurse too deep.  There is already an issue suggesting to implement it: https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement deterministic topN).

In fact, the O(n log n) heap approach as it is implemented now could be faster than O(n) median-of-medians approach on reasonable inputs.  So, someone will have to implement median-of-medians and, well, measure.

At the very least, googling for "median of medians in practice" and such yields the tag wiki from StackOverflow.com: "The constant factor in the O(n) is large, and the algorithm is not commonly used in practice.".

Ivan Kazmenko.

January 19, 2016

Re: topN using a heap

Posted by Andrei Alexandrescu
in reply to Ivan Kazmenko

Andrei Alexandrescu

Posted in reply to Ivan Kazmenko

On 01/18/2016 09:44 PM, Ivan Kazmenko wrote:
> On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
>> 4. sort was and is attackable before all of these changes
>>
>> (b) Improve sort() first, then apply a similar strategy to improving
>> topN. Do not use RNGs at all.
>
> Since point 4 is in fact already fixed a good while ago, my suggestion
> would be (b): to do the same for topN.
>
> Introselect (https://en.wikipedia.org/wiki/Introselect), already
> mentioned in this thread, uses median-of-medians to achieve worst case
> O(n) performance if we recurse too deep.  There is already an issue
> suggesting to implement it:
> https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement
> deterministic topN).
>
> In fact, the O(n log n) heap approach as it is implemented now could be
> faster than O(n) median-of-medians approach on reasonable inputs.  So,
> someone will have to implement median-of-medians and, well, measure.
>
> At the very least, googling for "median of medians in practice" and such
> yields the tag wiki from StackOverflow.com: "The constant factor in the
> O(n) is large, and the algorithm is not commonly used in practice.".

Yah, I also think heap/double-heap topN is better; median-of-medians (MoM) is theoretically nice but practically less so. Though there are two things that could help MoM in D: (a) median of five (core part of MoM) is relatively cheap to compute with six comparisons; (b) we can use stride() to compute median without needing to copy a fifth of the input or to complicate the algorithm - just recurse on the stride! All in all this might be a good algo to implement and test.

So anyhow we need the "intro" part. I'll get to that soon.

Andrei

January 19, 2016

Re: topN using a heap

Posted by Andrei Alexandrescu
in reply to Xinok

Andrei Alexandrescu

Posted in reply to Xinok

On 01/18/2016 09:42 PM, Xinok wrote:
> To be clear, sort is NOT attackable. Introsort is used for unstable
> sorting which begins with quick sort but falls back to heap sort after
> too many recursions to maintain O(n log n) running time. Timsort is used
> for stable sorting which is a variant of merge sort but still runs in
> O(n log n) time. In either case, you're guaranteed to have O(n log n)
> running time in the worst case.

Sorry, my bad. Could you please point to the code doing the introspection? I might do the same in topN.

> On the other hand, someone can improve upon quick sort by choosing
> better pivots but be careful not to add too much overhead. A couple
> years ago, I tried choosing the pivot from a median of five but the
> result was as much as 10% slower. One idea is to begin initially
> choosing better pivots, but once the sublists fall below a certain
> length, simply choose the pivot from a median of three to avoid the
> extra overhead.

That's what https://github.com/D-Programming-Language/phobos/pull/3922 and in my measurements it's about as fast or faster than the current. Could you please re-run your measurements against that PR?

> Speaking of RNGs, they're technically pure as long as you always use the
> same seed. So what if we generated a random seed at the start of the
> process, but then reused that same seed over and over in pure functions
> for the duration of the process?

That's a great idea. The way D is defined, it already implies that "immutable" is process-immutable, not immutable across runs. Consider http://dpaste.dzfl.pl/2fa5baf0c149.

Very nice insights, Xinok. Thanks!

Andrei

January 19, 2016

Re: topN using a heap

Posted by Andrei Alexandrescu
in reply to Ivan Kazmenko

Andrei Alexandrescu

Posted in reply to Ivan Kazmenko

On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
> On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
>> 4. sort was and is attackable before all of these changes
>
> No, sort utilizes Introsort (Quicksort but switch to Heapsort if recurse
> too deep): see
> https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/sorting.d#L1048-L1053.
> This improvement happened a year or two ago, when algorithm.d was not
> even separated into sub-modules.
>
> My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") admits
> that by not being able to slow the sort down.  But, if I comment out
> line 1053 to disable the switching, things go quadratic as expected.
>
> L1053:    depth = depth >= depth.max / 2 ? (depth / 3) * 2 : (depth * 2)
> / 3;
>
> The same remedy can help for topN: if we called partition more than log
> (length) in base (3/2) times, switch to the heap implementation of topN
> regardless of whether n is close to the edge.

Do you think sort and topN would be attackable if they used a per-process-seeded RNG as per Xinok's idea? -- Andrei

January 19, 2016

Re: topN using a heap

Posted by Andrei Alexandrescu
in reply to Andrei Alexandrescu

Andrei Alexandrescu

Posted in reply to Andrei Alexandrescu

On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:
> So anyhow we need the "intro" part. I'll get to that soon.

Destroy! I made it part of https://github.com/D-Programming-Language/phobos/pull/3934.

https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04

One interesting side topic would be how to avoid the use of goto without losing clarity.


Andrei

January 19, 2016

Re: topN using a heap

Posted by Timon Gehr
in reply to Andrei Alexandrescu

Timon Gehr

Posted in reply to Andrei Alexandrescu

On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:
> On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:
>> So anyhow we need the "intro" part. I'll get to that soon.
>
> Destroy! I made it part of
> https://github.com/D-Programming-Language/phobos/pull/3934.
>
> https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04
> ...

The switching heuristic is bad. It always switches after at most 8 steps.

I'd just use the second heuristic discussed in https://en.wikipedia.org/wiki/Introselect .

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation