October 11, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Sergey Gromov | Sergey Gromov wrote:
> Fri, 10 Oct 2008 09:20:44 -0500,
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 07:55:38 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sergey Gromov wrote:
>>>>> Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
>>> This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.
>> Creating temporary files is fine as long as you never need to store the entire input in memory.
>
> Ok, let me re-formulate the task then, since the creation of a temporary file containing all the lines seems inevitable. You basically want to uniformly shuffle lines in a very large file.
>
> The idea is this. Let N be number of lines in the file, which you know after storing it. There are two passes, forward and backward. They are symmetric. On a forward pass you read as many lines as you can fit in memory, let it be m. Shuffle them. Then choose a number, k which is
> 0 < k < m and is somehow based upon the m and N, like k = m * m/N. Write the k first lines back into the file and discard them from memory. Append as many new lines to your memory array as possible. Shuffle. Choose k. Write. Repeat. The reverse pass is exactly the same but you start from the end of file.
>
> The tricky part is how to choose k. It obviously needs to take number of discarded lines, q, into account like in k = m * m/(N-q). But my math-fu is almost non-existent so I'm unable to prove if this approach can give an uniform distribution. I feel like it can though.
You have to continually alter k to maintain a random distribution, but you have an invariant 0 <= k <= m. Those bounds mean you can't use this approach.
| |||
October 11, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> Sean Kelly wrote:
>> It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen.
>>
>> For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.
>
> Sweet! This is even better than my solution. (And it's much faster than seeking.)
>
> Andrei
You can reduce the number of passes by using more temporary files per iteration. For instance, with O(log n) files per iteration, you'll get a time complexity of O(n log* n) rather than O(n log n). However, using more files makes the disk cache unhappy.
| |||
October 13, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | I've read most of the responses (not all) and haven't yet seen a
duplicate of my algorithm, so here it is. You can debate whether the
cost of opening many little files is as bad as lseek or not...I don't know.
BEGIN CODE
void main()
{
CreateAnEmptyTmpDirectoryAndChdirToIt();
while(!feof(stdin))
{
char[] name = Random8CharacterFilename();
if(!file_exists(name))
WriteToFile(name, ReadALine(stdin));
else
{
if(CoinFlip())
WriteToFile("junk", ReadALine(stdin));
else
{
rename(name, "junk");
WriteToFile(name, ReadALine(stdin));
}
// shuffle "junk" to some non-duplicate name. chains of
// renames are allowed so as to prevent any biasing of
// the odds towards lines earlier in the file.
name = Random8CharacterFilename();
while(file_exists(name))
{
if(CoinFlip())
{
rename(name, "tmp");
rename("junk", name);
rename("tmp", "junk")
}
name = Random8CharacterFilename();
}
}
}
char[][] names = GetSortedFilenames();
foreach(name; names)
{
PrintFileContents(name);
unlink(name);
}
DestroyTmpDir();
}
END CODE
Andrei Alexandrescu wrote:
> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>
> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>
> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>
> In case you hate random stuff, just wait; the next problem won't have any random element in it :o).
>
>
> Andrei
| |||
October 16, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Robert Fraser | Robert Fraser wrote: > > I know it's been suggested before, but we really need digitalmars.d.offtopic Word. -- Bruno Medeiros - Software Developer, MSc. in CS/E graduate http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D | |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply