topN using a heap (page 7)

On 01/19/2016 01:27 PM, Timon Gehr wrote: > On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote: >> On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote: >>> So anyhow we need the "intro" part. I'll get to that soon. >> >> Destroy! I made it part of >> https://github.com/D-Programming-Language/phobos/pull/3934. >> >> https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04 >> >> ... > > The switching heuristic is bad. It always switches after at most 8 steps. My algebra is completely rotten. Thanks! Updated https://github.com/andralex/phobos/commit/cdd59b51a397ee1a4584e55c07d921c59acb5978. Andrei

On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu wrote: > On 01/18/2016 09:21 PM, Ivan Kazmenko wrote: > Do you think sort and topN would be attackable if they used a per-process-seeded RNG as per Xinok's idea? -- Andrei Yes, but with a little interactivity (not generating the input in advance) and more labor (finding the state of RNG). Suppose the adversary has the means to generate an input for sort/topN, inspect the output, and then generate a bad input. With the current (not-secure) RNGs, from the first two actions, they can learn the current state of RNG: with sort, the order of elements which compare as equal depends on the choice of pivot, and therefore exposes the random bits involved; with topN, the order of everything except what topN guarantees does the same thing. After that, generating a bad input is definitely possible, since everything is known and deterministic at this point. The previously discussed method can be used, or something more clever if they want to make it fast. Ivan Kazmenko.

On 01/19/2016 08:08 PM, Ivan Kazmenko wrote: > On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu wrote: >> On 01/18/2016 09:21 PM, Ivan Kazmenko wrote: >> Do you think sort and topN would be attackable if they used a >> per-process-seeded RNG as per Xinok's idea? -- Andrei > > Yes, but with a little interactivity (not generating the input in > advance) and more labor (finding the state of RNG). [snip] Thanks! -- Andrei

September 21, 2016

Re: topN using a heap

Posted by Jon Degenhardt
in reply to Andrei Alexandrescu

Permalink

Jon Degenhardt

Posted in reply to Andrei Alexandrescu

Permalink

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
>
> So let me summarize what has happened:
>
> 1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses.
>
> [snip]
>
Not completely clear from this thread what the conclusion was wrt getting known topN performance issues addressed. From pull requests it appears identified fixes are in current release versions of the DMD/LDC. However, I hit significant issues on one of the first large data sets I tried. Not an artificial data, but one with very skewed distributions of values (a google ngram file).

Details here:  https://issues.dlang.org/show_bug.cgi?id=16517. Includes test program, url for the ngram file.

A brief summary - Data file is a TSV file with 3 numeric fields, a bit over 10 million values each with different distribution properties. Used both topN and sort get the median value. Since this was median, it topN for the mid-point value, not at one end or the other. (This is a specific callout for some of the issues identified.)

Timing comparison of sort and topN, times in milliseconds:

          sort      topN
Field 2:   289      1756
Field 3:   285    148793
Field 4:   273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior is seen for DMD 2.071.2. This makes topN pretty much unusable.

On 9/21/16 4:16 AM, Jon Degenhardt wrote: > Timing comparison of sort and topN, times in milliseconds: > > sort topN > Field 2: 289 1756 > Field 3: 285 148793 > Field 4: 273 668906 > > The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior > is seen for DMD 2.071.2. This makes topN pretty much unusable. I have it on my list to move https://arxiv.org/abs/1606.00484 into Phobos. Thanks for the data! -- Andrei

On Wednesday, 21 September 2016 at 14:58:27 UTC, Andrei Alexandrescu wrote: > On 9/21/16 4:16 AM, Jon Degenhardt wrote: >> Timing comparison of sort and topN, times in milliseconds: >> >> sort topN >> Field 2: 289 1756 >> Field 3: 285 148793 >> Field 4: 273 668906 >> >> The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior >> is seen for DMD 2.071.2. This makes topN pretty much unusable. > > I have it on my list to move https://arxiv.org/abs/1606.00484 into Phobos. Thanks for the data! -- Andrei Very good, thanks. It'll be interesting to see how this algorithm does on this data set. --Jon

On 09/21/2016 01:37 PM, Jon Degenhardt wrote: > On Wednesday, 21 September 2016 at 14:58:27 UTC, Andrei Alexandrescu wrote: >> On 9/21/16 4:16 AM, Jon Degenhardt wrote: >>> Timing comparison of sort and topN, times in milliseconds: >>> >>> sort topN >>> Field 2: 289 1756 >>> Field 3: 285 148793 >>> Field 4: 273 668906 >>> >>> The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior >>> is seen for DMD 2.071.2. This makes topN pretty much unusable. >> >> I have it on my list to move https://arxiv.org/abs/1606.00484 into >> Phobos. Thanks for the data! -- Andrei > > Very good, thanks. It'll be interesting to see how this algorithm does > on this data set. Preliminary results indicate that QuickselectAdaptive is about 10x faster than sort, net of data reading overheads, on the second-field data set. I'll proceed with submitting the algorithm in Phobos. -- Andrei

On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote: > Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, which is quite a head-scratcher. Any ideas for a workaround? Thanks! -- Andrei annotate the templates.

On 09/23/2016 11:40 AM, Stefan Koch wrote: > On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote: >> Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, >> which is quite a head-scratcher. Any ideas for a workaround? Thanks! >> -- Andrei > > annotate the templates. I don't want to lose the deduction. I used another workaround (enumerate the potential unsafe operations under an if (false) and then forward to a @trusted function) in https://github.com/dlang/phobos/pull/4815. Reviews are welcome! BTW, as I commented in https://issues.dlang.org/show_bug.cgi?id=16517, using the new topN implementation instead of sort to compute the median over google's 1-grams is over 11x faster using dmd. Andrei

Forums