October 10, 2008
Sean Kelly wrote:
> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
> 
> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.

Sweet! This is even better than my solution. (And it's much faster than seeking.)

Andrei
October 10, 2008
Sean Kelly wrote:
> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
> 
> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.

I think I found a problem with your solution. Suppose you break the file into three chunks: a, b, and c. Then you pick b and c at random and randomly merge them. The problem is that you decide early on that nothing from a will appear in the first two thirds of the result, so the quality of the randomness suffers. How would you address that?

Andrei
October 10, 2008
Fri, 10 Oct 2008 15:54:18 -0500,
Andrei Alexandrescu wrote:
> Sean Kelly wrote:
> > It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
> > 
> > For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
> 
> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?

After merging b and c you end up with a and bc.  Then you randomly merge these two files, and lines from a have every chance to appear anywhere in the result.

When randomly merging, the probability of picking the next line from a given file should be proportional to the number of lines left in that file.
October 10, 2008
Sergey Gromov wrote:
> Fri, 10 Oct 2008 15:54:18 -0500,
> Andrei Alexandrescu wrote:
>> Sean Kelly wrote:
>>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
>>>
>>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
>> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
> 
> After merging b and c you end up with a and bc.  Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result.
> 
> When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.

How do you "randomly merge"? Describe the process.

Andrei
October 10, 2008
Reply to Andrei,

> BCS wrote:
> 

>> I don't think there is any way to avoid storing the whole file
>> because for a uniform sort there is a possibility that the last line
>> will come out first.
>> 
> I agree with the last paragraph, but lseeking seems overly
> inefficient. Could you avoid that?
> 
> Andrei
> 

Algorithmically, I don't think the lseek will matter; as to I/O and cache effects, I'll leave that to someone else.


October 10, 2008
Fri, 10 Oct 2008 16:10:29 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Fri, 10 Oct 2008 15:54:18 -0500,
> > Andrei Alexandrescu wrote:
> >> Sean Kelly wrote:
> >>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
> >>>
> >>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
> >> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
> > 
> > After merging b and c you end up with a and bc.  Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result.
> > 
> > When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.
> 
> How do you "randomly merge"? Describe the process.

a[] and b[] are the files to merge, ab[] is the result, and a.length and b.length are the number of lines left in each file.

while (a.length || b.length)
{
  if (uniform(gen, 0, a.length + b.length) < a.length)
  {
    ab.put(a.head);
    a.next();
  }
  else
  {
    ab.put(b.head);
    b.next();
  }
}
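
Spelled out as a self-contained, compilable sketch in D, with in-memory string arrays standing in for the two files (randomMerge is just an illustrative name, and uniform here is the plain std.random.uniform(lo, hi) overload):

import std.random : uniform;
import std.stdio : writeln;

// Merge a and b, taking the next line from a with probability
// a.length / (a.length + b.length), i.e. proportional to the number
// of lines each side still has left.
string[] randomMerge(string[] a, string[] b)
{
    string[] ab;
    ab.reserve(a.length + b.length);
    while (a.length || b.length)
    {
        // uniform(lo, hi) returns a value in [lo, hi).
        if (uniform(0, a.length + b.length) < a.length)
        {
            ab ~= a[0];
            a = a[1 .. $];
        }
        else
        {
            ab ~= b[0];
            b = b[1 .. $];
        }
    }
    return ab;
}

void main()
{
    writeln(randomMerge(["a1", "a2", "a3"], ["b1", "b2"]));
}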
October 10, 2008
BCS wrote:
> Reply to Andrei,
> 
>> BCS wrote:
>>
> 
>>> I don't think there is any way to avoid storing the whole file
>>> because for a uniform sort there is a possibility that the last line
>>> will come out first.
>>>
>> I agree with the last paragraph, but lseeking seems overly
>> inefficient. Could you avoid that?
>>
>> Andrei
>>
> 
> algorithmically, I don't think the lseek will matter,

I think it does. Essentially you impose random access on the input, or copy it to a medium that offers it.

gunzip --stdout bigfile.gz | shuffle

You'll have to store a copy of the input no matter what. Besides, random access is kind of a dicey proposition on large files. Of course, only measurement will show...


Andrei
October 10, 2008
Sergey Gromov wrote:
> Fri, 10 Oct 2008 16:10:29 -0500,
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 15:54:18 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sean Kelly wrote:
>>>>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
>>>>>
>>>>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
>>>> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
>>> After merging b and c you end up with a and bc.  Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result.
>>>
>>> When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.
>> How do you "randomly merge"? Describe the process.
> 
> a[] and b[] are files to merge.  ab[] is the result.  a.length and b.length are number of lines left in either file.
> 
> while (a.length || b.length)
> {
>   if (uniform(gen, 0, a.length + b.length) < a.length)
>   {
>     ab.put(a.head);
>     a.next();
>   }
>   else
>   {
>     ab.put(b.head);
>     b.next();
>   }
> }

This should work. Use of ranges noted :o).

Andrei
October 10, 2008
Fri, 10 Oct 2008 16:27:03 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Fri, 10 Oct 2008 16:10:29 -0500,
> > Andrei Alexandrescu wrote:
> >> Sergey Gromov wrote:
> >>> Fri, 10 Oct 2008 15:54:18 -0500,
> >>> Andrei Alexandrescu wrote:
> >>>> Sean Kelly wrote:
> >>>>> It would be slower than the seeking option, but something like a randomized mergesort would work as well.  If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file.  Repeat until the input is exhausted.  Now randomly pick two of the temporary files and randomly merge them.  Repeat until two temporary files remain, then output the result of the final random merge to the screen.
> >>>>>
> >>>>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
> >>>> I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
> >>> After merging b and c you end up with a and bc.  Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result.
> >>>
> >>> When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.
> >> How do you "randomly merge"? Describe the process.
> > 
> > a[] and b[] are files to merge.  ab[] is the result.  a.length and b.length are number of lines left in either file.
> > 
> > while (a.length || b.length)
> > {
> >   if (uniform(gen, 0, a.length + b.length) < a.length)
> >   {
> >     ab.put(a.head);
> >     a.next();
> >   }
> >   else
> >   {
> >     ab.put(b.head);
> >     b.next();
> >   }
> > }
> 
> This should work.

I just rigorously proved that the probability of a line from a ending up at output position 0 equals the probability of a line from a ending up at position 1, both being a.length/(a.length+b.length).  I need to brush up on TeX to write this proof down for electronic use.
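
In the meantime, here's the calculation for the first two positions in plain text, writing m = a.length and n = b.length:

  P(output[0] is from a) = m / (m + n)

  P(output[1] is from a)
    = P(output[0] from a) * (m - 1) / (m + n - 1)
      + P(output[0] from b) * m / (m + n - 1)
    = (m * (m - 1) + n * m) / ((m + n) * (m + n - 1))
    = m * (m + n - 1) / ((m + n) * (m + n - 1))
    = m / (m + n)

The same cancellation happens at every later position; in fact each of the C(m+n, m) possible interleavings of a and b comes out with probability m! * n! / (m + n)!, so they are all equally likely.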

> Use of ranges noted :o).

Except that a.length and b.length would have to be stored alongside the files somehow.
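
For completeness, here's how the whole scheme could look as one self-contained D sketch, with in-memory string[] chunks standing in for the temporary files (a real, file-based version would write each shuffled chunk to disk and record its line count next to it, as noted above; shuffleByMerging and randomMerge are just illustrative names):

import std.array : array;
import std.random : randomShuffle, uniform;
import std.range : chunks;
import std.stdio : writeln;

// Merge two chunks, taking the next line from each side with
// probability proportional to the lines it has left.
string[] randomMerge(string[] a, string[] b)
{
    string[] ab;
    while (a.length || b.length)
    {
        if (uniform(0, a.length + b.length) < a.length)
        {
            ab ~= a[0];
            a = a[1 .. $];
        }
        else
        {
            ab ~= b[0];
            b = b[1 .. $];
        }
    }
    return ab;
}

// Shuffle `lines` using at most k lines of buffer per chunk: shuffle
// each chunk in memory, then repeatedly pick two chunks at random and
// randomly merge them until a single chunk remains.
string[] shuffleByMerging(string[] lines, size_t k)
{
    string[][] tempChunks;
    foreach (chunk; lines.chunks(k))   // assumes lines is non-empty
    {
        auto c = chunk.array;
        c.randomShuffle();
        tempChunks ~= c;
    }
    while (tempChunks.length > 1)
    {
        // Remove a random chunk a, then merge it into another random chunk.
        auto i = uniform(0, tempChunks.length);
        auto a = tempChunks[i];
        tempChunks = tempChunks[0 .. i] ~ tempChunks[i + 1 .. $];
        auto j = uniform(0, tempChunks.length);
        tempChunks[j] = randomMerge(a, tempChunks[j]);
    }
    return tempChunks[0];
}

void main()
{
    auto lines = ["l1", "l2", "l3", "l4", "l5", "l6", "l7"];
    writeln(shuffleByMerging(lines, 3));
}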
October 10, 2008
Reply to Andrei,

> BCS wrote:
> 
>> Reply to Andrei,
>> 
>>> BCS wrote:
>>> 
>>>> I don't think there is any way to avoid storing the whole file
>>>> because for a uniform sort there is a possibility that the last
>>>> line will come out first.
>>>> 
>>> I agree with the last paragraph, but lseeking seems overly
>>> inefficient. Could you avoid that?
>>> 
>>> Andrei
>>> 
>> algorithmically, I don't think the lseek will matter,
>> 
> I think it does. Essentially you impose random access on the input, or
> copy to a medium that offers it.
> 
> gunzip --stdout bigfile.gz | shuffle
> 
> You'll have to store a copy of the input no matter what.

You have to store it anyway, or is that not what you agreed with above?

> Besides, random
> access is kind of a dicey proposition on large files. Of course, only
> measurement will show...

Is that not an I/O effect?

>>  as to I/O and cache effects, I'll leave that to someone else.
> 
> Andrei
>