Re: [OT] shuffling lines in a stream
October 10, 2008
Andrei Alexandrescu Wrote:

> Sergey Gromov wrote:
> > Thu, 09 Oct 2008 23:42:47 -0500,
> > Andrei Alexandrescu wrote:
> >> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
> >>
> >> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
> >>
> >> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
> > 
> > Well, my temporary file contains the whole stream.  Because there is a probability that the last line of the stream goes first.
> 
> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
> 
> Andrei

What you want to optimize is still unclear to me. At its heart, you must pass over the data twice: once to index it, and once to display it. A library-based memory-mapped file or a seekable stream seems like enough to get the job done, roughly as in the sketch below. Neither that bit nor the randomization of indices seems particularly interesting. What am I missing?
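A minimal sketch in Python of what I have in mind, assuming the input is actually a seekable file rather than a true stream (the file name and output target are only placeholders):

import random
import sys

def shuffle_lines(path, out):
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:                 # pass 1: record each line's start offset
            offsets.append(pos)
            pos += len(line)
        random.shuffle(offsets)        # shuffle the index, not the data
        for off in offsets:            # pass 2: one seek + read per output line
            f.seek(off)
            out.write(f.readline())

shuffle_lines("input.txt", sys.stdout.buffer)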

I look forward to the next challenge ;)
October 10, 2008
Jason House wrote:
> Andrei Alexandrescu Wrote:
> 
>> Sergey Gromov wrote:
>>> Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:
>>>> The high traffic enjoyed by "random k-sample of a file"
>>>> confirms that plenty of people around here have idle cycles
>>>> ready to be spent on fun tasks.
>>>> 
>>>> In that vein, here's another problem. Say you want to randomly
>>>> shuffle lines in a stream. You don't know the size beforehand,
>>>> and one pass is preferable. The stream can be larger than the
>>>> working memory, but not an unbounded amount of times larger.
>>>> Also lines in the stream can be unbounded, but it is guaranteed
>>>> that at least a nonzero fraction of the file fits in working
>>>> memory.
>>>> 
>>>> You can use temporary files, but of course minimizing their
>>>> total size is recommended. Speed is of course important. Draft
>>>> your solution and explain why you think it's correct and
>>>> advantageous.
>>> Well, my temporary file contains the whole stream.  Because there
>>> is a probability that the last line of the stream goes first.
>> Yes, that probability exists, but that doesn't imply you must load
>> the entire file in memory.
>> 
>> Andrei
> 
> What you want to optimize is still unclear to me.

At the end of the day, devise a means to take care of the task. Faster
is better, occupying less space is better.

> At its heart, you must pass over the data twice: once to index it,
> and once to display it. A library-based memory-mapped file or a
> seekable stream seems like enough to get the job done. Neither that
> bit nor the randomization of indices seems particularly interesting.
> What am I missing?

Your solution will be slow. Doing one random seek per line is too much, even in a memory-mapped file. Mapping into memory is not magic! We are talking about many millions of lines. Also you can't memory-map streams unless you create a full-blown copy first.

> I look forward to the next challenge ;)

I think you will be surprised by the juicy parts of this one.


Andrei
October 10, 2008
Jason House wrote:
> Andrei Alexandrescu Wrote:
> 
>> Sergey Gromov wrote:
>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>> Andrei Alexandrescu wrote:
>>>> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>>>>
>>>> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>>>>
>>>> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>>> Well, my temporary file contains the whole stream.  Because there is a probability that the last line of the stream goes first.
>> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
>>
>> Andrei
> 
> What you want to optimize is still unclear to me. At its heart, you must pass over the data twice: once to index it, and once to display it. A library-based memory-mapped file or a seekable stream seems like enough to get the job done. Neither that bit nor the randomization of indices seems particularly interesting. What am I missing?
> 
> I look forward to the next challenge ;) 
Hi Jason,
I think (I may be wrong) that the normal distribution is the key
to optimizing speed. Let's say the stream Andrei is talking about is a book. Each sentence in this book consists of N characters (end of line is a DOT). Now we read the first ~10 pages, character by character. Based on this information we can say that each sentence has an average size of 100 characters, and the distribution of sentence lengths looks similar to a normal distribution:
no graphic possible :(
so see Wikipedia: http://en.wikipedia.org/wiki/Normal_distribution

Indeed, this is NOT the solution Andrei is looking for, because indexing is still necessary. But since finding the EOL/DOT is probably faster when we only search a slice from position 90 to 110 (for sentence #1, +/- 10 percent of our average sentence size), this could already be a win.
Further: during the indexing process we can also adjust our initial (base) distribution, as well as our +/- tolerance. A rough sketch is below.
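Something like this, as a minimal Python sketch (not the intended solution; the helper name and numbers are made up for illustration): look for the next end-of-line only up to the position predicted by the estimated mean line length, and widen the window if nothing turns up.

def next_eol(data, start, mean, tol=0.1):
    # data is a bytes-like buffer (e.g. a memory-mapped file);
    # search only up to the expected end of the current line first.
    hi = start + int(mean * (1 + tol))
    i = data.find(b"\n", start, hi)
    while i < 0 and hi < len(data):
        hi += max(int(mean * tol), 1)   # widen the window and retry
        i = data.find(b"\n", start, hi)
    return i                            # -1 if no newline remains

Whether this really beats a plain data.find(b"\n", start) is debatable, since find already scans quickly; the estimate mainly bounds how much is examined per call.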

Heck it is late... hope this is not completely bullshi*
Bjoern



October 10, 2008
BLS wrote:
> Jason House wrote:
>> Andrei Alexandrescu Wrote:
>>
>>> Sergey Gromov wrote:
>>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>>> Andrei Alexandrescu wrote:
>>>>> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>>>>>
>>>>> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>>>>>
>>>>> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>>>> Well, my temporary file contains the whole stream.  Because there is a probability that the last line of the stream goes first.
>>> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
>>>
>>> Andrei
>>
>> What you want to optimize is still unclear to me. At its heart, you must pass over the data twice: once to index it, and once to display it. A library-based memory-mapped file or a seekable stream seems like enough to get the job done. Neither that bit nor the randomization of indices seems particularly interesting. What am I missing?
>>
>> I look forward to the next challenge ;) 
> Hi Jason,
> I think (I may be wrong) that the normal distribution is the key
> to optimizing speed. Let's say the stream Andrei is talking about is a book. Each sentence in this book consists of N characters (end of line is a DOT). Now we read the first ~10 pages, character by character. Based on this information we can say that each sentence has an average size of 100 characters, and the distribution of sentence lengths looks similar to a normal distribution:
> no graphic possible :(
> so see Wikipedia: http://en.wikipedia.org/wiki/Normal_distribution
> 
> Indeed, this is NOT the solution Andrei is looking for, because indexing is still necessary. But since finding the EOL/DOT is probably faster when we only search a slice from position 90 to 110 (for sentence #1, +/- 10 percent of our average sentence size), this could already be a win.
> Further: during the indexing process we can also adjust our initial (base) distribution, as well as our +/- tolerance.
> 
> Heck it is late... hope this is not completely bullshi*
> Bjoern

Of course, this relies on the assumption that sentences have normally-distributed length (which may or may not be true on any specific document).

You might be interested in checking out something called a "Kernel Density Estimator", which is the typical solution for estimating the shape of an unknown distribution based on a series of samples.

Here's the relevant wikipedia article:

http://en.wikipedia.org/wiki/Kernel_density_estimation
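As a tiny illustration (the sample numbers are made up), SciPy's gaussian_kde does this kind of estimation directly from a set of observations:

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample of observed line lengths.
sample = np.array([87, 95, 102, 110, 131, 76, 99, 104, 98, 120])
kde = gaussian_kde(sample)      # bandwidth chosen automatically (Scott's rule)
xs = np.linspace(60, 140, 81)
density = kde(xs)               # estimated probability density at each point
print(xs[np.argmax(density)])   # rough mode of the estimated distribution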

I learned about this technique because it was the subject of a job interview question once. The question was something like "Given a file containing 100 billion floating point values, what's the most effective algorithm for finding the median value? Your machine only has room for a billion values in memory at any given time, and making multiple passes over the file is very expensive."

In my solution, I built a histogram during the first pass, counting the number of entries in each bucket. At the end of the first pass, I could isolate the bucket containing the median, and then during the second pass I'd be able to definitively identify the median value. Roughly like this:
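Here's a minimal Python sketch of that idea, assuming the values lie in a known range [lo, hi) and that the median's bucket fits in memory; read_values is a placeholder that yields the file's values, once per pass:

def median_two_pass(read_values, lo, hi, nbuckets=1000000):
    width = (hi - lo) / nbuckets
    bucket = lambda x: min(int((x - lo) / width), nbuckets - 1)

    # Pass 1: histogram the values.
    counts = [0] * nbuckets
    total = 0
    for x in read_values():
        counts[bucket(x)] += 1
        total += 1

    # Locate the bucket that contains the median.
    target = total // 2
    seen = 0
    for b, c in enumerate(counts):
        if seen + c > target:
            break
        seen += c

    # Pass 2: keep only that bucket's values and select within it.
    in_bucket = sorted(x for x in read_values() if bucket(x) == b)
    return in_bucket[target - seen]

The degenerate cases mentioned below are presumably those where a single bucket still holds more values than fit in memory, or where the value range isn't known up front.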

They pointed out several degenerate cases where my algorithm would collapse (because I always had to make assumptions about the shape of the distribution), and I left the interview stumped about the real answer.

When I got home, I did my research and found the KDE technique, which I've used on a few projects since then.

--benji
October 10, 2008
Benji Smith wrote:
> BLS wrote:
>> Jason House wrote:
>>> Andrei Alexandrescu Wrote:
>>>
>>>> Sergey Gromov wrote:
>>>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>>>> Andrei Alexandrescu wrote:
>>>>>> The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks.
>>>>>>
>>>>>> In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.
>>>>>>
>>>>>> You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.
>>>>> Well, my temporary file contains the whole stream.  Because there is a probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
>>>>
>>>> Andrei
>>>
>>> What you want to optimize is still unclear to me. At its heart, you must pass over the data twice: once to index it, and once to display it. A library-based memory-mapped file or a seekable stream seems like enough to get the job done. Neither that bit nor the randomization of indices seems particularly interesting. What am I missing?
>>>
>>> I look forward to the next challenge ;) 
>> Hi Jason,
>> I think (I may be wrong) that the normal distribution is the key
>> to optimizing speed. Let's say the stream Andrei is talking about is a book. Each sentence in this book consists of N characters (end of line is a DOT). Now we read the first ~10 pages, character by character. Based on this information we can say that each sentence has an average size of 100 characters, and the distribution of sentence lengths looks similar to a normal distribution:
>> no graphic possible :(
>> so see Wikipedia: http://en.wikipedia.org/wiki/Normal_distribution
>>
>> Indeed, this is NOT the solution Andrei is looking for, because indexing is still necessary. But since finding the EOL/DOT is probably faster when we only search a slice from position 90 to 110 (for sentence #1, +/- 10 percent of our average sentence size), this could already be a win.
>> Further: during the indexing process we can also adjust our initial (base) distribution, as well as our +/- tolerance.
>>
>> Heck it is late... hope this is not completely bullshi*
>> Bjoern
> 
> Of course, this relies on the assumption that sentences have normally-distributed length (which may or may not be true on any specific document).
> 
> You might be interested in checking out something called a "Kernel Density Estimator", which is the typical solution for estimating the shape of an unknown distribution based on a series of samples.
> 
> Here's the relevant wikipedia article:
> 
> http://en.wikipedia.org/wiki/Kernel_density_estimation
> 
> I learned about this technique because it was the subject of a job interview question once. The question was something like "Given a file containing 100 billion floating point values, what's the most effective algorithm for finding the median value? Your machine only has room for a billion values in memory at any given time, and making multiple passes over the file is very expensive."
> 
> In my solution, I built a histogram during the first pass, counting the number of entries in each bucket. At the end of the first pass, I could isolate the bucket containing the median, and then during the second pass I'd be able to definitively identify the median value.
> 
> They pointed out several degenerate cases where my algorithm would collapse (because I always had to make assumptions about the shape of the distribution), and I left the interview stumped about the real answer.
> 
> When I got home, I did my research and found the KDE technique, which I've used on a few projects since then.
> 
> --benji

Thanks for the pointer, Benji. I should have a book about it somewhere... well, I'll have to look for it another day. And yes, my idea is pretty weak.
bjoern