October 10, 2008
BCS wrote:
> Reply to Andrei,
> 
>> BCS wrote:
>>
>>> Reply to Andrei,
>>>
>>>> BCS wrote:
>>>>
>>>>> I don't think there is any way to avoid storing the whole file
>>>>> because for a uniform sort there is a possibility that the last
>>>>> line will come out first.
>>>>>
>>>> I agree with the last paragraph, but lseeking seems overly
>>>> inefficient. Could you avoid that?
>>>>
>>>> Andrei
>>>>
>>> algorithmically, I don't think the lseek will matter,
>>>
>> I think it does. Essentially you impose random access on the input, or
>> copy to a medium that offers it.
>>
>> gunzip --stdout bigfile.gz | shuffle
>>
>> You'll have to compulsively store a copy of the input.
> 
> You have to anyway, or is that not what you agreed with above?

Well I agree things can be coded such that you make the copy only if the file is too large anyway. But one lseek call per line is a complete disaster, whether you call it I/O effect or whatnot. I'd go as far as saying I'd never even try such a solution, especially since there exists a forward-pass algorithm. Of course measuring would be a great way to prove me wrong.

Andrei
October 10, 2008
Fri, 10 Oct 2008 16:27:03 -0500,
Andrei Alexandrescu wrote:
> > while (a.length || b.length)
> > {
> >   if (uniform(gen, 0, a.length + b.length) < a.length)
> >   {
> >     ab.put(a.head);
> >     a.next();
> >   }
> >   else
> >   {
> >     ab.put(b.head);
> >     b.next();
> >   }
> > }
> 
> This should work. Use of ranges noted :o).

Actually I felt an urge to write:

if (uniform(gen, 0, a.length + b.length) < a.length)
  ab.put(a.next);
else
  ab.put(b.next);

I had to really get hold of myself.  ;)
October 10, 2008
Reply to Andrei,


> Well I agree things can be coded such that you make the copy only if
> the file is too large anyway. But one lseek call per line is a
> complete disaster, whether you call it I/O effect or whatnot. I'd go
> as far as saying I'd never even try such a solution, especially since
> there exists a forward-pass algorithm. Of course measuring would be a
> great way to prove me wrong.
> 
> Andrei
> 

I think Big-O on mine is n.

I don't think that can be beat. (Maybe I wasn't at all clear that I was only looking at that aspect; the rest doesn't interest me as a puzzle.)


October 10, 2008
BCS wrote:
> Reply to Andrei,
> 
> 
>> Well I agree things can be coded such that you make the copy only if
>> the file is too large anyway. But one lseek call per line is a
>> complete disaster, whether you call it I/O effect or whatnot. I'd go
>> as far as saying I'd never even try such a solution, especially since
>> there exists a forward-pass algorithm. Of course measuring would be a
>> great way to prove me wrong.
>>
>> Andrei
>>
> 
> I think Big-O on mine is n.

I disagree. Big-O with lseek under the rug? That's a non-solution because it essentially redefines the problem into "if I had a random-access operator then this is pretty much like array shuffling". If you have a random-access operator there is no problem to think about.

Andrei
October 10, 2008
Andrei Alexandrescu wrote:
> Sean Kelly wrote:
>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
>>
>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
> 
> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?

When a is merged with (b&c) then its lines will be interleaved randomly in the result.


Sean
October 10, 2008
Sean Kelly wrote:
> Andrei Alexandrescu wrote:
>> Sean Kelly wrote:
>>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
>>>
>>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
>>
>> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
> 
> When a is merged with (b&c) then its lines will be interleaved randomly in the result.

I think that works with Sergey's appropriate crooking of the dice. For the record, my solution involved generating k shuffled sub-files, then rolling a fair k-sided die and writing one line from the chosen file to the output, until all files are exhausted. Sergey showed that my solution is wrong unless it so happens that the last fragment is the same size as all the others. Thanks!
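
For concreteness, here's a sketch of what the crooked dice might look like in D. The crookedMerge name, the linesLeft bookkeeping, and passing around open Files are just illustration, nothing we've settled on:

import std.random : uniform;
import std.stdio : File;

// Pick the next line from fragment i with probability proportional
// to the number of lines it has left, until all fragments are empty.
void crookedMerge(File[] frags, size_t[] linesLeft, File output)
{
    size_t total = 0;
    foreach (n; linesLeft)
        total += n;

    while (total > 0)
    {
        auto roll = uniform(0, total);   // roll the crooked k-sided die
        size_t i = 0;
        while (roll >= linesLeft[i])
        {
            roll -= linesLeft[i];
            ++i;
        }
        output.write(frags[i].readln()); // emit one line from fragment i
        --linesLeft[i];
        --total;
    }
}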

Well are we ready for the next puzzle, or bored?

Andrei
October 10, 2008
Reply to Andrei,

> BCS wrote:
> 
>> Reply to Andrei,
>> 
>>> Well I agree things can be coded such that you make the copy only if
>>> the file is too large anyway. But one lseek call per line is a
>>> complete disaster, whether you call it I/O effect or whatnot. I'd go
>>> as far as saying I'd never even try such a solution, especially
>>> since there exists a forward-pass algorithm. Of course measuring
>>> would be a great way to prove me wrong.
>>> 
>>> Andrei
>>> 
>> I think Big-O on mine is n.
>> 
> I disagree. Big-O with lseek under the rug? That's a non-solution
> because it essentially redefines the problem into "if I had a
> random-access operator then this is pretty much like array shuffling".
> If you have a random-access operator there is no problem to think
> about.
> 
> Andrei
> 

In that case, this isn't a problem /I/ am interested in trying to solve.


October 10, 2008
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
>> Fri, 10 Oct 2008 15:54:18 -0500,
>> Andrei Alexandrescu wrote:
>>> Sean Kelly wrote:
>>>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
>>>>
>>>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
>>> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
>>
>> After merging b and c you end up with a and bc.  Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result.
>>
>> When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.
> 
> How do you "randomly merge"? Describe the process.


while( a.hasMore && b.hasMore )  // stop as soon as either file runs dry
{
    if( rand % 2 )
        r ~= a.readLine;
    else
        r ~= b.readLine;
}
// drain whichever file still has lines left
while( a.hasMore )
    r ~= a.readLine;
while( b.hasMore )
    r ~= b.readLine;

The downside to randomly choosing two files is that it's possible you could end up always choosing a small and a large file, thus increasing the chance that the large file will have a bunch of lines left over which are essentially appended.

The easy way around this would be to maintain a list of files so only equal-sized files are merged (similar to how mergesort works).  However, I think a better way would be to use some sort of bias for the random check above so that different-sized files are on equal footing.  Perhaps write out the file with the first 4 bytes containing the number of lines in that file and use that to construct the bias.  I'll leave those with better math-fu than I to come up with a formula which would be fair.


Sean
October 10, 2008
Sean Kelly wrote:
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 15:54:18 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sean Kelly wrote:
>>>>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
>>>>>
>>>>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
>>>> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
>>>
>>> After merging b and c you end up with a and bc.  Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result.
>>>
>>> When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.
>>
>> How do you "randomly merge"? Describe the process.
> 
> 
> while( a.hasMore || b.hasMore )
> {
>     if( rand % 2 )
>         r ~= a.readLine;
>     else
>         r ~= b.readLine;
> }
> while( a.hasMore )
>     r ~= a.readLine;
> while( b.hasMore )
>     r ~= b.readLine;
> 
> The downside to randomly choosing two files is that it's possible you could end up always choosing a small and a large file, thus increasing the chance that the large file will have a bunch of lines left over which are essentially appended.

Yah, Sergey got me thinking too. What if you have a large file that doesn't fit in memory by exactly one line?

Then I shuffle (n - 1) lines at random, and there remains the last line to write. If I choose from either fragment with probability 0.5, the last line will be with high probability among the first in the result!

> The easy way around this would be to maintain a list of files so only equal-sized files are merged (similar to how mergesort works).  However, I think a better way would be to use some sort of bias for the random check above so that different-sized files are on equal footing.  Perhaps write out the file with the first 4 bytes containing the number of lines in that file and use that to construct the bias.  I'll leave those with better math-fu than I to come up with a formula which would be fair.

(In similar situations I've put the number of lines right in the name of the temporary file. I kid you not! And it's O(1) too!)

It looks like any fragment should be chosen from with probability proportional to its length. The lingering question for me is whether that probability needs to be updated, or can be fixed at the beginning of merging. I think fixed doesn't quite work unless special measures are taken whenever a range is done.

Again many thanks to all participants for your illuminating insights.


Andrei
October 11, 2008
Fri, 10 Oct 2008 17:45:28 -0500,
Andrei Alexandrescu wrote:
> Sean Kelly wrote:
> > Andrei Alexandrescu wrote:
> >> Sergey Gromov wrote:
> >>> Fri, 10 Oct 2008 15:54:18 -0500,
> >>> Andrei Alexandrescu wrote:
> >>>> Sean Kelly wrote:
> >>>>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
> >>>>>
> >>>>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
> >>>> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
> >>>
> >>> After merging b and c you end up with a and bc.  Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result.
> >>>
> >>> When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.
> >>
> >> How do you "randomly merge"? Describe the process.
> > 
> > 
> > while( a.hasMore || b.hasMore )
> > {
> >     if( rand % 2 )
> >         r ~= a.readLine;
> >     else
> >         r ~= b.readLine;
> > }
> > while( a.hasMore )
> >     r ~= a.readLine;
> > while( b.hasMore )
> >     r ~= b.readLine;
> > 
> > The downside to randomly choosing two files is that it's possible you could end up always choosing a small and a large file, thus increasing the chance that the large file will have a bunch of lines left over which are essentially appended.
> 
> Yah, Sergey got me thinking too. What if you have a large file that doesn't fit in memory by exactly one line?
> 
> Then I shuffle (n - 1) lines at random, and there remains the last line to write. If I choose from either fragment with probability 0.5, the last line will be with high probability among the first in the result!
> 
> > The easy way around this would be to maintain a list of files so only equal-sized files are merged (similar to how mergesort works).  However, I think a better way would be to use some sort of bias for the random check above so that different-sized files are on equal footing.  Perhaps write out the file with the first 4 bytes containing the number of lines in that file and use that to construct the bias.  I'll leave those with better math-fu than I to come up with a formula which would be fair.
> 
> (In similar situations I've put the number of lines right in the name of the temporary file. I kid you not! And it's O(1) too!)
> 
> It looks like any fragment should be chosen from with a probability proportional to its length. The lingering question for me is whether that probability needs to be updated, or can be fixed at the beginning of merging. I think fixed doesn't quite work unless special measures are taken at whenever a range is done.
> 
> Again many thanks to all participants for your illuminating insights.

Here's my little proof.  Paste this into Wikipedia to see the formulas.

Probability of a line from <math>a</math> becoming the first line of the result:

<math>
p_a^1=\frac{n_a}{n_a+n_b}
</math>

Likewise for <math>b</math>:

<math>
p_b^1=\frac{n_b}{n_a+n_b}
</math>

Probability of a line from <math>a</math> becoming the second line of the output, <math>p_a^2</math>, is:

<math>
p_a^2=p_a^2\Big|_{a_1} + p_a^2\Big|_{b_1}
</math>

where <math>p_a^2\Big|_{a_1}</math> and <math>p_a^2\Big|_{b_1}</math> are the probabilities of a line from <math>a</math> getting into position <math>2</math> when the first line is from <math>a</math> and from <math>b</math>, respectively.

<math>
p_a^2\Big|_{a_1} = \frac{n_a}{n_a+n_b}\cdot\frac{n_a-1}{n_a-1+n_b} =
\frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}
</math>

<math>
p_a^2\Big|_{b_1} = \frac{n_b}{n_a+n_b}\cdot\frac{n_a}{n_a+n_b-1} = \frac
{n_a n_b}{(n_a+n_b)(n_a+n_b-1)}
</math>

Therefore

<math>
p_a^2 = \frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}+\frac{n_a n_b}{(n_a+n_b)
(n_a+n_b-1)} = \frac{n_a(n_a+n_b-1)}{(n_a+n_b)(n_a+n_b-1)} = \frac{n_a}
{n_a+n_b}
</math>

So at least <math>p_a^1=p_a^2</math>.  Induce anyone?
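
A sketch of how the step would go through, writing <math>p_a^k</math> for the probability that position <math>k</math> of the output comes from <math>a</math>: assume the claim holds for every merge with fewer lines in total.  For position <math>k\ge 2</math>, condition on where the first line comes from; what remains afterwards is a smaller merge of either <math>(n_a-1,n_b)</math> or <math>(n_a,n_b-1)</math> lines, and position <math>k</math> of the output is position <math>k-1</math> of that smaller merge.  Then

<math>
p_a^k = \frac{n_a}{n_a+n_b}\cdot\frac{n_a-1}{n_a+n_b-1} + \frac{n_b}{n_a+n_b}\cdot\frac{n_a}{n_a+n_b-1} = \frac{n_a}{n_a+n_b}
</math>

so every output position comes from <math>a</math> with the same probability <math>\frac{n_a}{n_a+n_b}</math>.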