October 23, 2008  [OT] The horizon of a stream
Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
adipisicing
elit

This is because the word "corpus" occurs on line 5 and line 14. This is the farthest two lines in the file are; notice that "ipsum" also occurs twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.

This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!

Andrei
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to Andrei Alexandrescu

First try, D1, Phobos:
import std.stdio, std.stream;

void main() {
    int[string] seen;
    int horizon;
    int nline;
    foreach (string line; new BufferedFile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
        nline++;
    }
    writefln(horizon);
}
Idem, with my libs:
import d.all, d.string;

void main() {
    int[string] seen;
    int horizon;
    foreach (nline, line; xfile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
    }
    putr(horizon);
    putr(seen);
}
I'll show an "improved" version soon...
Bye,
bearophile
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to bearophile

bearophile wrote:
> First try, D1, Phobos:
[snip]
At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.
Andrei
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to bearophile

This version compares hash values first, avoiding some slower string compares, because in D1 strings don't store their hash value:
import d.all, d.string;

void main() {
    int[Record!(hash_t, string)] seen;
    int horizon;
    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        auto hash_clean_line = hash(clean_line);
        auto pos_ptr = record(hash_clean_line, clean_line) in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[record(hash_clean_line, clean_line.dup)] = nline;
    }
    putr(horizon);
}
Maybe it's not fully correct yet, so I'll test it some more...
Bye,
bearophile
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to Andrei Alexandrescu

Andrei Alexandrescu:
> At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.
I understand, but I have 2 GB of RAM, so that solution (in Python) works most of the time. When you design an algorithm, you start with the simpler version (that's also easy to debug and test), and only when you experimentally see that it doesn't suffice (too slow, not enough RAM, etc.) do you try to improve it. When you debug the refined version you can also reuse the unittests developed with the basic version; this is very useful.
Another solution is to keep an associative array of just the string hash values, and look up the input file again when you get a (possible) match, to rule out false positives.
Another solution that requires even less RAM is to use a bloom filter (I have created a bloom filter class in Python but not in D yet) to record which strings have been seen, and to look up the input file (as before) to remove (possible) false positives.
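An untested D2-style sketch of such a bloom filter (the bit count, the two seeded hashes and the names are just for the example; a real filter would size the bit array and the number of hash functions from the expected number of lines):

import std.stdio;

struct Bloom {
    ubyte[] bits;

    this(size_t nbits) {
        bits = new ubyte[](nbits / 8);
    }

    // Two bit positions per string, from two differently seeded hashes (k = 2
    // keeps the example short).
    private size_t[2] positions(const(char)[] s) const {
        immutable size_t nbits = bits.length * 8;
        immutable size_t h1 = hashOf(s);
        immutable size_t h2 = hashOf(s, 0x9e3779b9);
        return [h1 % nbits, h2 % nbits];
    }

    void insert(const(char)[] s) {
        immutable pos = positions(s);
        foreach (p; pos)
            bits[p / 8] |= cast(ubyte)(1 << (p % 8));
    }

    // false: the line was certainly never seen; true: seen, or a false positive.
    bool maybeContains(const(char)[] s) const {
        immutable pos = positions(s);
        foreach (p; pos)
            if ((bits[p / 8] & (1 << (p % 8))) == 0)
                return false;
        return true;
    }
}

void main() {
    auto seen = Bloom(1 << 27);   // 2^27 bits = 16 MB, an arbitrary example size
    size_t candidates = 0;
    foreach (line; File("data.txt").byLine) {
        if (seen.maybeContains(line))
            candidates++;         // possible repeat: still has to be verified against the file
        else
            seen.insert(line);
    }
    writefln("%s candidate repeats to verify", candidates);
}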
Finally, looking up the input file every time there's a match is too slow, so you just store the candidate matches in a file or in RAM and test them all in a second read of the file. If the bloom filter is sized well enough, such possible matches aren't too many, and you can keep them in RAM.
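An untested D2-style sketch of this two-pass approach (hashSeen, candidates and firstLineOf are names invented for the example):

import std.stdio;

void main() {
    enum fileName = "data.txt";

    // Pass 1: keep only hashes. A hash seen twice marks a *candidate* repeat,
    // since two different lines may share a hash value.
    bool[size_t] hashSeen;
    bool[size_t] candidates;
    foreach (line; File(fileName).byLine) {
        immutable size_t h = hashOf(line);
        if (h in hashSeen)
            candidates[h] = true;
        else
            hashSeen[h] = true;
    }
    hashSeen = null;   // not needed any more

    // Pass 2: keep the actual text, but only for lines whose hash is a
    // candidate; this removes the false positives and gives the horizon.
    size_t[string] firstLineOf;
    size_t horizon = 0;
    size_t nline = 0;
    foreach (line; File(fileName).byLineCopy) {
        if (hashOf(line) in candidates) {
            if (auto p = line in firstLineOf) {
                if (nline - *p > horizon)
                    horizon = nline - *p;
            } else {
                firstLineOf[line] = nline;
            }
        }
        nline++;
    }
    writeln(horizon);
}

The first pass needs memory proportional to the number of distinct hash values, the second only to the lines whose hashes collided.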
Bye,
bearophile
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to bearophile

bearophile wrote:
> Andrei Alexandrescu:
>> At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.
>
> I understand, but I have 2 GB of RAM, so that solution (in Python) works most of the time. When you design an algorithm, you start with the simpler version (that's also easy to debug and test), and only when you experimentally see that it doesn't suffice (too slow, not enough RAM, etc.) do you try to improve it. When you debug the refined version you can also reuse the unittests developed with the basic version; this is very useful.
>
> Another solution is to keep an associative array of just the string hash values, and look up the input file again when you get a (possible) match, to rule out false positives.

Now you're talking! This is a solution worth a lot of attention. Under what circumstances does it do well vs. not so well? How can you ensure it works for a stream of any size if multiple passes are possible? etc.

> Another solution that requires even less RAM is to use a bloom filter (I have created a bloom filter class in Python but not in D yet) to record which strings have been seen, and to look up the input file (as before) to remove (possible) false positives.

Bloom filter is also an excellent idea!

Andrei
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to bearophile

bearophile:
> This version compares hash values first, avoiding some slower string compares, because in D1 strings don't store their hash value:
>...
> auto pos_ptr = record(hash_clean_line, clean_line) in seen;
Ignore this solution; it's not faster, because the whole string is hashed twice, etc. You have to use a String class like mine that lazily computes and stores the hash value too...
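An untested D2-style sketch of such a key (not my actual String class, and it computes the hash eagerly in the constructor instead of lazily, to keep it short; the point is that the table reuses the stored hash, and equality compares the cached hashes before the full strings):

import std.stdio;

struct HashedLine {
    string text;
    size_t cachedHash;

    this(string s) {
        text = s;
        cachedHash = hashOf(s);   // hash the line once, up front
    }

    // The associative array reuses the stored hash instead of re-hashing the text.
    size_t toHash() const nothrow @safe {
        return cachedHash;
    }

    // Cheap hash comparison first; the full string compare runs only on a hash match.
    bool opEquals(const HashedLine other) const {
        return cachedHash == other.cachedHash && text == other.text;
    }
}

void main() {
    size_t[HashedLine] firstSeen;
    size_t horizon = 0;
    size_t nline = 0;
    foreach (line; File("data.txt").byLineCopy) {
        auto key = HashedLine(line);
        if (auto p = key in firstSeen) {
            if (nline - *p > horizon)
                horizon = nline - *p;
        } else {
            firstSeen[key] = nline;
        }
        nline++;
    }
    writeln(horizon);
}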
Bye,
bearophile
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to Andrei Alexandrescu

Andrei Alexandrescu:
> Under what circumstances does it do well vs. not so well?
Assuming the hash function is good, even very similar strings get different hash values.
When there are a lot of equal lines it doesn't work well, I presume, because there are a lot of equal hash values.
Note that in our example the lines are very short, so replacing them with a hash_t as the key of the associative array doesn't save you that much RAM (the table then effectively stores that value twice, because it also keeps the hash of the key, and a hash_t, being probably a size_t or uint, is hashed to itself).
Bye,
bearophile
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to bearophile

bearophile wrote:
> Another solution that requires even less RAM is to use a bloom filter...
There's a third possible solution that is very fast: assuming the hash values are uint (32 bit), you can create a bitfield of 2^32 bits; it requires just 1/2 GB, that is 1/4 of my RAM. Each bit then represents one possible value of the hash.
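An untested D2-style sketch of that bitfield (the candidate set and the second verification pass are just for the example, because different lines can share the same 32-bit hash):

import std.stdio;

void main() {
    // One bit for each possible uint hash value: 2^32 bits = 512 MB.
    auto bits = new ubyte[](512 * 1024 * 1024);

    // Returns the previous state of the bit for hash h, then sets it.
    bool testAndSet(uint h) {
        immutable size_t idx = h >> 3;
        immutable ubyte mask = cast(ubyte)(1 << (h & 7));
        immutable bool wasSet = (bits[idx] & mask) != 0;
        bits[idx] |= mask;
        return wasSet;
    }

    // Hashes seen twice are only candidate repeats; a second pass over the
    // file (keeping just those lines) is still needed to get the horizon.
    bool[uint] candidates;
    foreach (line; File("data.txt").byLine) {
        immutable uint h = cast(uint) hashOf(line);
        if (testAndSet(h))
            candidates[h] = true;
    }
    writefln("%s candidate hash values to check in a second pass", candidates.length);
}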
Bye,
bearophile
October 23, 2008  Re: [OT] The horizon of a stream
Posted in reply to Andrei Alexandrescu

Thu, 23 Oct 2008 14:29:36 -0500, Andrei Alexandrescu wrote:
> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines.
>
> For example, this stream has horizon 9:
>
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
>
> This is because the word "corpus" occurs on line 5 and line 14. This is the farthest two lines in the file are; notice that "ipsum" also occurs twice, but at a smaller distance.
>
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
>
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!
Traverse the stream, line by line. If the word is new, add it to a dictionary, size_t[string], with the current line number. If it's already in the dictionary, look at the distance to its first occurrence and update the horizon if that distance is greater than what we already have. If we don't run out of memory, we're done in one pass.
If we do run out of memory, recover ;) then continue without adding words into the dictionary but dumping them into a temporary file instead, just checking and updating the horizon.
The stream ends. If our current horizon is greater than the number of lines we dumped into a temp file, we're done. Otherwise, empty the dictionary, open the temp file, and repeat.
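An untested D2-style sketch of those steps (maxEntries stands in for running out of memory, and the two spill file names alternate so a pass never reads and writes the same file):

import std.stdio;

void main() {
    enum maxEntries = 10_000_000;   // stand-in for "as much as fits in memory"
    string input = "data.txt";
    string spillName = "spill_a.tmp";
    size_t horizon = 0;

    while (true) {
        size_t[string] firstSeen;
        auto spill = File(spillName, "w");
        size_t nline = 0, dumped = 0;   // line numbers restart for each pass, as described above

        foreach (line; File(input).byLineCopy) {
            if (auto p = line in firstSeen) {
                // Seen before in this pass: update the horizon.
                if (nline - *p > horizon)
                    horizon = nline - *p;
            } else if (firstSeen.length < maxEntries) {
                firstSeen[line] = nline;
            } else {
                // "Out of memory": stop adding, dump the word for the next pass.
                spill.writeln(line);
                dumped++;
            }
            nline++;
        }
        spill.close();

        // If the current horizon beats the number of dumped lines, no pair
        // hiding in the spill file can do better; otherwise repeat on it.
        if (dumped == 0 || horizon > dumped)
            break;
        input = spillName;
        spillName = (spillName == "spill_a.tmp") ? "spill_b.tmp" : "spill_a.tmp";
    }
    writeln(horizon);
}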