October 10, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to BLS | BLS wrote:
> Andrei Alexandrescu schrieb:
>> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>>
>> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>>
>> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>>
>> In case you hate random stuff, just wait; the next problem won't have any random element in it :o).
>>
>>
>> Andrei
>
> Just in general
> I would use a memory mapped file. Map data in chunks, say 4MB. Chunks are unmapped as necessary ~LRU, means implement a LRUCache.
>
> This is how some "in memory database-engines" solve that sort of problems.
> Also possible : I miss the real problem :(
I think I overspecified it. Pure and simple you must randomize lines in a file that may not fit in memory. Note that memory mapping may help the speed of whichever implementation, but gives zero indication on how you would actually tackle the problem.
Please add [OT] to the title when answering, as I did, so people can filter it out. I think these are cool things to discuss nonetheless in D's context because it gives us ideas on language features and libraries.
Andrei
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Thu, 09 Oct 2008 23:42:47 -0500,
Andrei Alexandrescu wrote:
> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>
> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>
> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
module shuffle;
import std.cstream: din, dout;
import std.stream: BufferedFile, FileMode;
import std.file: remove, exists;
import std.random: Random, uniform, unpredictableSeed, randomShuffle;
import std.stdio: writeln;
void main()
{
auto gen = Random(unpredictableSeed);
enum tempName = "temp.$$$";
if (exists(tempName))
remove(tempName);
auto temp = new BufferedFile(tempName,
FileMode.In | FileMode.OutNew);
ulong[] cache;
foreach (char[] s; din)
{
cache ~= temp.position;
temp.writeLine(s);
}
randomShuffle(cache, gen);
foreach (pos; cache)
{
temp.position = pos;
writeln(temp.readLine());
}
temp.close();
}
| |||
October 10, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Sergey Gromov | Sergey Gromov wrote:
> Thu, 09 Oct 2008 23:42:47 -0500,
> Andrei Alexandrescu wrote:
>> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>>
>> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>>
>> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>
> Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
Andrei
| |||
October 10, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Fri, 10 Oct 2008 07:55:38 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Thu, 09 Oct 2008 23:42:47 -0500,
> > Andrei Alexandrescu wrote:
> >> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
> >>
> >> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
> >>
> >> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
> >
> > Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
>
> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.
| |||
October 10, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Sergey Gromov | Sergey Gromov wrote:
> Fri, 10 Oct 2008 07:55:38 -0500,
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>> Andrei Alexandrescu wrote:
>>>> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>>>>
>>>> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>>>>
>>>> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>>> Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
>> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
>
> This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.
Creating temporary files is fine as long as you never need to store the entire input in memory.
Andrei
| |||
October 10, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
>> Fri, 10 Oct 2008 07:55:38 -0500,
>> Andrei Alexandrescu wrote:
>>> Sergey Gromov wrote:
>>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>>> Andrei Alexandrescu wrote:
>>>>> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>>>>>
>>>>> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>>>>>
>>>>> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>>>> Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
>>> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
>>
>> This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.
>
> Creating temporary files is fine as long as you never need to store the entire input in memory.
>
> Andrei
So HDD (or any file system) is not counted as memory? Then just store all temporary variables to HDD.
| |||
October 10, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to KennyTM~ | KennyTM~ wrote:
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 07:55:38 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sergey Gromov wrote:
>>>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>>>> Andrei Alexandrescu wrote:
>>>>>> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>>>>>>
>>>>>> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>>>>>>
>>>>>> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>>>>> Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
>>>
>>> This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.
>>
>> Creating temporary files is fine as long as you never need to store the entire input in memory.
>>
>> Andrei
>
> So HDD (or any file system) is not counted as memory? Then just store all temporary variables to HDD.
This takes us back to one lseek per line. You can create temporary files. They behave as files usually do - they are better accessed in sequential patterns and not too many times.
Andrei
| |||
October 10, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Fri, 10 Oct 2008 09:20:44 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Fri, 10 Oct 2008 07:55:38 -0500,
> > Andrei Alexandrescu wrote:
> >> Sergey Gromov wrote:
> >>> Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
> >>
> >> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
> >
> > This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.
>
> Creating temporary files is fine as long as you never need to store the entire input in memory.
Ok, let me re-formulate the task then, since the creation of a temporary file containing all the lines seems inevitable. You basically want to uniformly shuffle lines in a very large file.
The idea is this. Let N be number of lines in the file, which you know
after storing it. There are two passes, forward and backward. They are
symmetric. On a forward pass you read as many lines as you can fit in
memory, let it be m. Shuffle them. Then choose a number, k which is
0 < k < m and is somehow based upon the m and N, like k = m * m/N.
Write the k first lines back into the file and discard them from memory.
Append as many new lines to your memory array as possible. Shuffle.
Choose k. Write. Repeat. The reverse pass is exactly the same but you
start from the end of file.
The tricky part is how to choose k. It obviously needs to take number of discarded lines, q, into account like in k = m * m/(N-q). But my math-fu is almost non-existent so I'm unable to prove if this approach can give an uniform distribution. I feel like it can though.
| |||
October 10, 2008 Re: [OT] shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Sergey Gromov | Sergey Gromov wrote:
> Fri, 10 Oct 2008 09:20:44 -0500,
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 07:55:38 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sergey Gromov wrote:
>>>>> Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
>>> This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.
>> Creating temporary files is fine as long as you never need to store the entire input in memory.
>
> Ok, let me re-formulate the task then, since the creation of a temporary file containing all the lines seems inevitable. You basically want to uniformly shuffle lines in a very large file.
>
> The idea is this. Let N be number of lines in the file, which you know after storing it. There are two passes, forward and backward. They are symmetric. On a forward pass you read as many lines as you can fit in memory, let it be m. Shuffle them. Then choose a number, k which is
> 0 < k < m and is somehow based upon the m and N, like k = m * m/N. Write the k first lines back into the file and discard them from memory. Append as many new lines to your memory array as possible. Shuffle. Choose k. Write. Repeat. The reverse pass is exactly the same but you start from the end of file.
>
> The tricky part is how to choose k. It obviously needs to take number of discarded lines, q, into account like in k = m * m/(N-q). But my math-fu is almost non-existent so I'm unable to prove if this approach can give an uniform distribution. I feel like it can though.
Feeling doesn't qualify.
Andrei
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place. Sean | |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply