topN using a heap (page 5) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » topN using a heap (page 5)

January 19, 2016

Re: topN using a heap

Posted by Ivan Kazmenko
in reply to Andrei Alexandrescu

Ivan Kazmenko

Posted in reply to Andrei Alexandrescu

On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu wrote:
> On 1/18/16 6:18 PM, Ilya wrote:
>> A RNGs don't improve worst case. It only changes an permutation for worst case. --Ilya
>
> Well it does improve things. The probability of hitting the worst case repeatedly is practically zero, and it's impossible to create an attack input.

Possible, but only if the seed and previous usage of victim program's global RNG are also known to the attacker... at which point the victim may well have a bigger problem than just quadratic topN.

> I'm not sure whether we should worry about this. Probably we do because sort attacks are a press favorite.
>
> The nice thing about not relying on randomness is that pure functions can call sort, topN etc.

Is it feasible to have two overloads of sort and/or topN, one pure and the other with better complexity guarantees?

> As a sort of a compromise I was thinking of seeding the RNG with not only the length of the range, but also the integral representation of the address of the first element. This would still allow an attack if the input is always at the same address.

Hmm.  Isn't using any pointers still breaking strong purity?  Say the GC moved our array, and it now has a different address.  A strongly pure function would order the elements exactly the same then.

The same goes about sorting, where the thing depending on pivot choice is the relative order of "equal" elements.

From a theoretical standpoint (not taking current D purity rules into account), I'd say using a pointer (which may be modified by GC) is as pure as just allowing a static RNG (a global one, or even another instance dedicated specifically to sort/topN).

Ivan Kazmenko.

January 18, 2016

Re: topN using a heap

Posted by Andrei Alexandrescu
in reply to Ilya

Andrei Alexandrescu

Posted in reply to Ilya

On 1/18/16 6:51 PM, Ilya wrote:
> On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
>> On 1/18/16 6:44 PM, Ilya wrote:
>>> On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
>>>> On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
>>>>> A RNGs don't improve worst case. It only changes an permutation for
>>>>> worst case. --Ilya
>>>>
>>>> Still, use of RNG makes it impossible to construct the worst case
>>>> beforehand, once and for all.  In that sense, this is a regression.
>>>
>>> No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
>>
>> unpredictableSeed uses the system clock as a source of randomness, so
>> we're good there. -- Andrei
>
> Would this work for pure functions? --Ilya

Of course not. I think this back-and-forth takes away from the gist of things. So let me summarize what has happened:

1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses.

2. Now that both functions use getPivot, I set out to improve it in https://github.com/D-Programming-Language/phobos/pull/3922. The problem is I couldn't make getPivot impure; pure functions already call sort, so making it impure would have been a regression.

3. So I made getPivot use just a bit of randomness taken from the length of the range.

4. sort was and is attackable before all of these changes

5. So now we have pure topN and sort (subject to the range offering pure primitives) but they are both attackable.

6. PRNGs don't have any other downside than making functions impure.

The way I see it we have these solutions:

(a) make topN still use a random pivot. That way there's no regression. Then proceed and make sort() avoid quadratic cases in other ways, e.g. switch to heapsort if performance degrades.

(b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all.

(c) Seek randomness in other places, e.g. address of elements, data hashes etc. Come with a solution that may still be attacked in narrow cases but make that unlikely enough to be a real concern.

Andrei

January 19, 2016

Re: topN using a heap

Posted by Ilya
in reply to Timon Gehr

Ilya

Posted in reply to Timon Gehr

On Tuesday, 19 January 2016 at 00:02:08 UTC, Timon Gehr wrote:
> On 01/19/2016 12:55 AM, Ilya wrote:
>> On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:
>>> On 01/19/2016 12:50 AM, Ilya wrote:
>>>> ...
>>>>
>>>> 1. Yes, probability of hitting the worst case repeatedly is is
>>>> practically zero. But RNGs do not change this probability.
>>>> 2. It is possible to build attack for our RNGs, because they are
>>>> Pseudo-RNGs.
>>>> --Ilya
>>>
>>> You also need to predict the seed. How do you do that?
>>
>> We can not use unpredictable seed (like system clock) in pure functions.
>> --Ilya
>
> Clearly. The point is, there already was an impure implementation of topN whose expected linear running time is still specified in the documentation. The current implementation does not fulfill this bound.

There is already implementation with predictable seed. Proof: https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151 --Ilya

January 18, 2016

Re: topN using a heap

Posted by Andrei Alexandrescu
in reply to deadalnix

Andrei Alexandrescu

Posted in reply to deadalnix

On 1/18/16 6:55 PM, deadalnix wrote:
> On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
>> unpredictableSeed uses the system clock as a source of randomness, so
>> we're good there. -- Andrei
>
> I got problem with that even when crytographically secure randomness
> wasn't needed more than once. A specific case included adding jitter on
> some timeout to avoid all hosts to expire at the same time, and used
> mersenne twister with a pseudo random seed. There still were way too
> many hosts ending up with the same timeout.
>
> System time is not enough.

How would this translate to a matter of selecting the pivot during sort? -- Andrei

January 19, 2016

Re: topN using a heap

Posted by Ilya
in reply to Andrei Alexandrescu

Ilya

Posted in reply to Andrei Alexandrescu

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
> On 1/18/16 6:51 PM, Ilya wrote:
>> On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
>>> On 1/18/16 6:44 PM, Ilya wrote:
>>>> On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
>>>>> On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
>>>>>> A RNGs don't improve worst case. It only changes an permutation for
>>>>>> worst case. --Ilya
>>>>>
>>>>> Still, use of RNG makes it impossible to construct the worst case
>>>>> beforehand, once and for all.  In that sense, this is a regression.
>>>>
>>>> No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya
>>>
>>> unpredictableSeed uses the system clock as a source of randomness, so
>>> we're good there. -- Andrei
>>
>> Would this work for pure functions? --Ilya
>
> Of course not. I think this back-and-forth takes away from the gist of things. So let me summarize what has happened:
>
> 1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses.
>
> 2. Now that both functions use getPivot, I set out to improve it in https://github.com/D-Programming-Language/phobos/pull/3922. The problem is I couldn't make getPivot impure; pure functions already call sort, so making it impure would have been a regression.
>
> 3. So I made getPivot use just a bit of randomness taken from the length of the range.
>
> 4. sort was and is attackable before all of these changes
>
> 5. So now we have pure topN and sort (subject to the range offering pure primitives) but they are both attackable.
>
> 6. PRNGs don't have any other downside than making functions impure.
>
> The way I see it we have these solutions:
>
> (a) make topN still use a random pivot. That way there's no regression. Then proceed and make sort() avoid quadratic cases in other ways, e.g. switch to heapsort if performance degrades.
>
> (b) Improve sort() first, then apply a similar strategy to improving topN. Do not use RNGs at all.
>
> (c) Seek randomness in other places, e.g. address of elements, data hashes etc. Come with a solution that may still be attacked in narrow cases but make that unlikely enough to be a real concern.
>
>
> Andrei

(a) Hope no
(b) Yes
(c) Memory addresses may not work with user defined ranges and hashes could be slow.

(d) Make PRNGs optional. --Ilya

January 19, 2016

Re: topN using a heap

Posted by Timon Gehr
in reply to Ilya

Timon Gehr

Posted in reply to Ilya

On 01/19/2016 01:12 AM, Ilya wrote:
>>
>
> There is already implementation with predictable seed. Proof:
> https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
> --Ilya

The default RNG is seeded with unpredictableSeed. What is the point you are trying to make?

January 19, 2016

Re: topN using a heap

Posted by Ilya
in reply to Timon Gehr

Ilya

Posted in reply to Timon Gehr

On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
> On 01/19/2016 01:12 AM, Ilya wrote:
>>>
>>
>> There is already implementation with predictable seed. Proof:
>> https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
>> --Ilya
>
> The default RNG is seeded with unpredictableSeed. What is the point you are trying to make?

unpredictableSeed is initialized only once and can be easily estimated. --Ilya

January 18, 2016

Re: topN using a heap

Posted by Andrei Alexandrescu
in reply to Ilya

Andrei Alexandrescu

Posted in reply to Ilya

On 1/18/16 7:46 PM, Ilya wrote:
> On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
>> On 01/19/2016 01:12 AM, Ilya wrote:
>>>>
>>>
>>> There is already implementation with predictable seed. Proof:
>>> https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
>>>
>>> --Ilya
>>
>> The default RNG is seeded with unpredictableSeed. What is the point
>> you are trying to make?
>
> unpredictableSeed is initialized only once and can be easily estimated.
> --Ilya

https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0.

How do you think things could be improved?

Andrei

January 19, 2016

Re: topN using a heap

Posted by Ilya
in reply to Andrei Alexandrescu

Ilya

Posted in reply to Andrei Alexandrescu

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu wrote:
> On 1/18/16 7:46 PM, Ilya wrote:
>> On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
>>> On 01/19/2016 01:12 AM, Ilya wrote:
>>>>>
>>>>
>>>> There is already implementation with predictable seed. Proof:
>>>> https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
>>>>
>>>> --Ilya
>>>
>>> The default RNG is seeded with unpredictableSeed. What is the point
>>> you are trying to make?
>>
>> unpredictableSeed is initialized only once and can be easily estimated.
>> --Ilya
>
> https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120
>
> unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0.
>
> How do you think things could be improved?
>
>
> Andrei

A good variant with minimal overhead is to call unpredictableSeed for sorting big arrays each time (one seed per array):
   a. A hacker would not able to estimate a seed using other API calls. For example, "give me a set of random numbers".
   b. A hacker would not be able to estimate a seed using a small arrays because they don't use RNGs. (and they have not any overhead!)
   c. A hacker would not be able to estimate a seed for big arrays, because attack based on time measurement would not work for big arrays.

Ilya

January 19, 2016

Re: topN using a heap

Posted by Ilya
in reply to Andrei Alexandrescu

Ilya

Posted in reply to Andrei Alexandrescu

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu wrote:
> On 1/18/16 7:46 PM, Ilya wrote:
>> On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
>>> On 01/19/2016 01:12 AM, Ilya wrote:
>>>>>
>>>>
>>>> There is already implementation with predictable seed. Proof:
>>>> https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
>>>>
>>>> --Ilya
>>>
>>> The default RNG is seeded with unpredictableSeed. What is the point
>>> you are trying to make?
>>
>> unpredictableSeed is initialized only once and can be easily estimated.
>> --Ilya
>
> https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120
>
> unpredictableSeed uses MinstdRand0. The sources of randomness used for initializing MinstdRand0 are the PID, the thread ID, and the system time at the moment of the seeding. Then at each subsequent call to unpredictableSeed, the time of the call is XORed with the current value of the MinstdRand0.
>
> How do you think things could be improved?
>
>
> Andrei

A good variant with minimal overhead is to call unpredictableSeed for sorting big arrays each time (one seed per array):
   a. A hacker would not able to estimate a seed using other API calls. For example, "give me a set of random numbers".
   b. A hacker would not be able to estimate a seed using a small arrays because they don't use RNGs. (and they have not any overhead!)
   c. A hacker would not be able to estimate a seed for big arrays, because attack based on time measurement would not work for big arrays.

EDIT:
  c. ... because each big array has own seed and time measurement would not work for big arrays with different seeds.

Ilya

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation