| Thread overview | ||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
October 10, 2008 shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). Andrei | ||||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Reply to Andrei,
> The high traffic enjoyed by "random k-sample of a file" confirms that
> plenty of people around here have idle cycles ready to be spent on fun
> tasks.
>
> In that vein, here's another problem. Say you want to randomly shuffle
> lines in a stream. You don't know the size beforehand, and one pass is
> preferable. The stream can be larger than the working memory, but not
> an unbounded amount of times larger. Also lines in the stream can be
> unbounded, but it is guaranteed that at least a nonzero fraction of
> the file fits in working memory.
>
> You can use temporary files, but of course minimizing their total size
> is recommended. Speed is of course important. Draft your solution and
> explain why you think it's correct and advantageous.
>
> In case you hate random stuff, just wait; the next problem won't have
> any random element in it :o).
>
> Andrei
>
Um... I'm not following that.
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to BCS | BCS wrote:
> Reply to Andrei,
>
>> The high traffic enjoyed by "random k-sample of a file" confirms that
>> plenty of people around here have idle cycles ready to be spent on fun
>> tasks.
>>
>> In that vein, here's another problem. Say you want to randomly shuffle
>> lines in a stream. You don't know the size beforehand, and one pass is
>> preferable. The stream can be larger than the working memory, but not
>> an unbounded amount of times larger. Also lines in the stream can be
>> unbounded, but it is guaranteed that at least a nonzero fraction of
>> the file fits in working memory.
>>
>> You can use temporary files, but of course minimizing their total size
>> is recommended. Speed is of course important. Draft your solution and
>> explain why you think it's correct and advantageous.
>>
>> In case you hate random stuff, just wait; the next problem won't have
>> any random element in it :o).
>>
>> Andrei
>>
>
> Um... I'm not following that.
Very simple: given a file, just randomly change the order of its lines and output them. For example (assume read from standard input):
$ ./shuffle
a
b
c
^D
b
a
c
$ ./shuffle
a
b
c
^D
b
c
a
$ _
Andrei
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Reply to Andrei,
> You don't know the size beforehand, and
> one pass is preferable. The stream can be larger than the working
> memory, but not an unbounded amount of times larger. Also lines in
> the stream can be unbounded, but it is guaranteed that at least a
> nonzero fraction of the file fits in working memory.
I got the bit about the end effect. The above is what I didn't get.
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to BCS | BCS wrote:
> Reply to Andrei,
>
>> You don't know the size beforehand, and
>> one pass is preferable. The stream can be larger than the working
>> memory, but not an unbounded amount of times larger. Also lines in
>> the stream can be unbounded, but it is guaranteed that at least a
>> nonzero fraction of the file fits in working memory.
>
> I got the bit about the end effect. The above is what I didn't get.
Don't worry about that for now.
Andrei
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Reply to Andrei,
> BCS wrote:
>
>> Reply to Andrei,
>>
>>> You don't know the size beforehand, and
>>> one pass is preferable. The stream can be larger than the working
>>> memory, but not an unbounded amount of times larger. Also lines in
>>> the stream can be unbounded, but it is guaranteed that at least a
>>> nonzero fraction of the file fits in working memory.
>> I got the bit about the end effect. The above is what I didn't get.
>>
> Don't worry about that for now.
>
> Andrei
>
cache the whole file into a tmp file
store slice data on the way through (start/length of each line)
shuffle that set
lseek/read/write each slice in order
Quick'n'dirty
The only fun part is how to shuffle the deck. What I think might be easiest to do would be to qsort it with rand as the compare function.
I don't think there is any way to avoid storing the whole file because for a uniform sort there is a possibility that the last line will come out first.
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to BCS | BCS wrote:
> Reply to Andrei,
>
>> BCS wrote:
>>
>>> Reply to Andrei,
>>>
>>>> You don't know the size beforehand, and
>>>> one pass is preferable. The stream can be larger than the working
>>>> memory, but not an unbounded amount of times larger. Also lines in
>>>> the stream can be unbounded, but it is guaranteed that at least a
>>>> nonzero fraction of the file fits in working memory.
>>> I got the bit about the end effect. The above is what I didn't get.
>>>
>> Don't worry about that for now.
>>
>> Andrei
>>
>
> cache the whole file into a tmp file
> store slice data on the way through (start/length of each line)
> shuffle that set
> lseek/read/write each slice in order
>
> Quick'n'dirty
>
> The only fun part is how to shuffle the deck. What I think might be easiest to do would be to qsort it with rand as the compare function.
>
> I don't think there is any way to avoid storing the whole file because for a uniform sort there is a possibility that the last line will come out first.
I agree with the last paragraph, but lseeking seems overly inefficient. Could you avoid that?
Andrei
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>
> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>
> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>
> In case you hate random stuff, just wait; the next problem won't have any random element in it :o).
>
>
> Andrei
Forgot to mention a desirable property. It would be nice if the algorithm would naturally subscale, e.g. if ran on a short file, it would produce a shuffled file with the speed competitive with an algorithm conceived for shorter files.
Andrei
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu wrote:
> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>
> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>
> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>
> In case you hate random stuff, just wait; the next problem won't have any random element in it :o).
>
>
> Andrei
I know it's been suggested before, but we really need digitalmars.d.offtopic
| |||
October 10, 2008 Re: shuffling lines in a stream | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu schrieb:
> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>
> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>
> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>
> In case you hate random stuff, just wait; the next problem won't have any random element in it :o).
>
>
> Andrei
Just in general
I would use a memory mapped file. Map data in chunks, say 4MB. Chunks are unmapped as necessary ~LRU, means implement a LRUCache.
This is how some "in memory database-engines" solve that sort of problems.
Also possible : I miss the real problem :(
Bjoern
| |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply