[OT] The horizon of a stream
October 23, 2008
Consider some text stream. We define the horizon of the stream as the largest number of lines spanned by two identical lines, minus one; it is zero if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
adipisicing
elit

This is because the word "corpus" occurs on line 5 and line 14, and that is the farthest apart any two identical lines in the file are; notice that "ipsum" also occurs twice, but at a smaller distance.
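In symbols, with 1-based line numbers: horizon = max { j - i : line_i = line_j, i < j }, or 0 when no line repeats. Here the maximum is 14 - 5 = 9.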

The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.

This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!


Andrei
October 23, 2008
First try, D1, Phobos:

import std.stdio, std.stream;

void main() {
    // Maps each distinct line to the line number of its first occurrence.
    int[string] seen;
    int horizon;

    int nline;
    foreach (string line; new BufferedFile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            // Seen before: the distance to the first occurrence may be a new maximum.
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;  // dup: the iteration buffer is reused
        nline++;
    }

    writefln(horizon);
}


Idem, with my libs:

import d.all, d.string;

void main() {
    // Same algorithm; xfile yields (line number, line) pairs directly.
    int[string] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
    }

    putr(horizon);
    putr(seen);  // also prints the table of first occurrences
}

I'll show an "improved" version soon...

Bye,
bearophile
October 23, 2008
bearophile wrote:
> First try, D1, Phobos:
[snip]

At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.

Andrei
October 23, 2008
This version compares hash values first, avoiding some slower string comparisons, because in D1 strings don't store their hash value:

import d.all, d.string;

void main() {
    // The key packs (hash, string), so the precomputed hash is compared
    // before falling back to the slower string comparison.
    int[Record!(hash_t, string)] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        auto hash_clean_line = hash(clean_line);
        auto pos_ptr = record(hash_clean_line, clean_line) in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[record(hash_clean_line, clean_line.dup)] = nline;
    }

    putr(horizon);
}

Maybe it's not fully correct yet, so I'll test it some more...

Bye,
bearophile
October 23, 2008
Andrei Alexandrescu:
> At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.

I understand, but I have 2 GB of RAM, so that solution (in Python) works most of the time. When you design an algorithm, you start with the simpler version (which is also easy to debug and test), and only when you experimentally see that it doesn't suffice (too slow, not enough RAM, etc.) do you try to improve it. When you debug the refined version you can also reuse the unit tests developed with the basic version; this is very useful.

Another solution is to keep an associative array of just the string hash values, and look up the actual lines in the input file when there's a (possible) match, to rule out false positives.

Another solution that requires even less RAM is to use a bloom filter (I have created a bloom filter class in Python but not in D yet) to track which strings have been seen, and to look up the input file (as before) to weed out false positives.
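(A minimal sketch of such a filter in D1; the bit-array size, the number of derived hashes, and the double-hashing scheme are all assumptions here, not bearophile's actual class, and it is untested:)

import std.stdio;

// A tiny Bloom filter: K derived hash positions each set one bit;
// a line is a *candidate* duplicate only if all K bits were already set.
struct Bloom {
    uint[] bits;
    uint nbits;
    const int K = 4;  // number of derived hashes (assumed)

    static Bloom opCall(uint nbits) {
        Bloom b;
        b.nbits = nbits;
        b.bits = new uint[(nbits + 31) / 32];
        return b;
    }

    // Derive K probe positions from two base hashes (double hashing).
    bool testAndSet(hash_t h1, hash_t h2) {
        bool all_set = true;
        for (int i = 0; i < K; i++) {
            uint pos = cast(uint)(h1 + i * h2) % nbits;
            uint mask = 1u << (pos & 31);
            if (!(bits[pos >> 5] & mask))
                all_set = false;
            bits[pos >> 5] |= mask;
        }
        return all_set;
    }
}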

Finally, seeking into the input file every time there's a match is too slow, so instead you store the candidate matches in a file or in RAM, and test them all during a second read of the file. If the bloom filter is sized well enough, such candidate matches aren't too many, and you can keep them in RAM.
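(A rough D1 sketch of that two-pass scheme, storing hashes only in pass 1 and verifying candidates in pass 2; the file name and the use of the built-in hash through typeid are assumptions, and it is untested:)

import std.stdio, std.stream;

void main() {
    int[hash_t] first_seen;          // hash -> line of first occurrence
    int[] cand_first, cand_second;   // line pairs whose hashes matched
    bool[int] wanted;                // line numbers to re-read in pass 2

    // Pass 1: store only hashes, so memory stays bounded even when the
    // lines themselves would not fit.
    int nline;
    foreach (string line; new BufferedFile("data.txt")) {
        hash_t h = typeid(string).getHash(&line);
        auto p = h in first_seen;
        if (p) {
            cand_first ~= *p;
            cand_second ~= nline;
            wanted[*p] = true;
            wanted[nline] = true;
        } else
            first_seen[h] = nline;
        nline++;
    }

    // Pass 2: keep only the candidate lines, then compare the real
    // strings to weed out hash collisions.
    string[int] line_at;
    nline = 0;
    foreach (string line; new BufferedFile("data.txt")) {
        if (nline in wanted)
            line_at[nline] = line.dup;
        nline++;
    }

    int horizon;
    foreach (i, first; cand_first)
        if (line_at[first] == line_at[cand_second[i]]
                && cand_second[i] - first > horizon)
            horizon = cand_second[i] - first;

    writefln(horizon);
}

Only the candidate lines are ever kept as strings, so memory grows with the number of hash matches rather than with the file size.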

Bye,
bearophile
October 23, 2008
bearophile wrote:
> Andrei Alexandrescu:
>> At this point a discussion "ok, I'll store every line in a
>> hashtable" would have sufficed. I can tell straight away that this
>> is not a workable solution for a large file. It's actually happened
>> to me :o|.
> 
> I understand, but I have 2 GB of RAM, so that solution (in Python)
> works most of the time. When you design an algorithm, you start with
> the simpler version (which is also easy to debug and test), and only
> when you experimentally see that it doesn't suffice (too slow, not
> enough RAM, etc.) do you try to improve it. When you debug the refined
> version you can also reuse the unit tests developed with the basic
> version; this is very useful.
> 
> Another solution is to keep an associative array of just the string
> hash values, and look up the actual lines in the input file when
> there's a (possible) match, to rule out false positives.

Now you're talking! This is a solution worth a lot of attention. Under what circumstances does it do well vs. not so well? How can you ensure it works for a stream of any size, if multiple passes are possible? etc.

> Another solution that requires even less RAM is to use a bloom filter
> (I have created a bloom filter class in Python but not in D yet) to
> track which strings have been seen, and to look up the input file (as
> before) to weed out false positives.

Bloom filter is also an excellent idea!


Andrei

October 23, 2008
bearophile:
> This version compares hash values first, avoiding some slower string comparisons, because in D1 strings don't store their hash value:
>...
>         auto pos_ptr = record(hash_clean_line, clean_line) in seen;

Ignore this solution; it's not faster, because the whole string is hashed twice, etc. You have to use a String class like mine, one that lazily computes and stores the hash value too...
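(Something along these lines, perhaps; a hypothetical sketch, not bearophile's actual String class:)

// The hash is computed on first use and cached, so repeated AA
// operations hash the string only once.
struct LazyString {
    string data;
    hash_t cached;
    bool hashed;

    hash_t toHash() {
        if (!hashed) {
            cached = typeid(string).getHash(&data);
            hashed = true;
        }
        return cached;
    }

    int opEquals(LazyString other) {
        // Compare the cheap hashes first, the characters only on a match.
        return toHash() == other.toHash() && data == other.data;
    }

    int opCmp(LazyString other) {
        return typeid(string).compare(&data, &other.data);
    }
}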

Bye,
bearophile
October 23, 2008
Andrei Alexandrescu:
> Under what circumstances it does well vs. not so well?

Assuming the hash function is good, even very similar strings have different hash values.

When there are lots of equal lines it doesn't work well, I presume, because there are lots of equal hash values.

Note that in our example the lines are very short, so replacing them with a hash_t as the key of the associative array doesn't save you that much RAM (and the AA effectively stores this value twice, because it also keeps the hash of the hash value, which, being probably a size_t or uint, hashes to itself).

Bye,
bearophile
October 23, 2008
bearophile Wrote:
> Another solution that requires even less RAM is to use a bloom filter...

There's a third possible solution that is very fast: assuming the hash values are uint (32 bits), you can create a bitfield of 2^32 bits; it requires just 1/2 GB, that is, 1/4 of my RAM. Each bit then represents one possible hash value.
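(A sketch of that bitfield in D1; testAndSet is a hypothetical helper, and the whole thing is untested:)

import std.stdio;

// 2^32 bits = 2^27 uints = 512 MB: one bit per possible 32-bit hash.
const size_t NWORDS = 1u << 27;

// Returns true if the bit for hash h was already set, which marks the
// current line as a candidate duplicate to verify in a second pass.
bool testAndSet(uint[] bits, uint h) {
    size_t word = h >> 5;        // which 32-bit word
    uint mask = 1u << (h & 31);  // which bit inside it
    bool seen = (bits[word] & mask) != 0;
    bits[word] |= mask;
    return seen;
}

void main() {
    auto bits = new uint[NWORDS];  // zero-initialized by the runtime
    // Pass 1 would call testAndSet(bits, lineHash) on every line and
    // record the lines for which it returns true.
}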

Bye,
bearophile
October 23, 2008
Thu, 23 Oct 2008 14:29:36 -0500,
Andrei Alexandrescu wrote:
> Consider some text stream. We define the horizon of the stream as the largest number of lines spanned by two identical lines, minus one; it is zero if the stream consists of unique lines.
> 
> For example, this stream has horizon 9:
> 
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
> 
> This is because the word "corpus" occurs on line 5 and line 14, and that is the farthest apart any two identical lines in the file are; notice that "ipsum" also occurs twice, but at a smaller distance.
> 
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
> 
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!

Traverse the stream, line by line. If the word is new, add it to a dictionary, size_t[string], with the current line number. If it's already in the dictionary, look at the distance to its first occurrence and update the horizon if that distance is greater than what we already have. If we don't run out of memory, we're done in one pass.

If we do run out of memory, recover ;) then continue without adding words to the dictionary, dumping them into a temporary file instead, while still checking and updating the horizon.

The stream ends.  If our current horizon is greater than the number of lines we dumped into a temp file, we're done.  Otherwise, empty the dictionary, open the temp file, and repeat.
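
(A sketch of one such pass in D1, with one adjustment: spilled words keep their original line numbers, since distances measured inside the temp file alone would come out wrong. The record format, the word budget, and the helpers are assumptions; untested:)

import std.stdio, std.stream, std.string, std.conv;

// One pass over "lineno <tab> word" records: keep at most maxWords words
// in the dictionary, spill the rest (with their original line numbers)
// for the next pass.
void onePass(Stream input, Stream spill, size_t maxWords,
             ref size_t horizon, ref size_t nspilled) {
    size_t[char[]] first_seen;
    foreach (char[] rec; input) {
        int tab = find(rec, '\t');
        size_t nline = toUint(rec[0 .. tab]);
        char[] word = rec[tab + 1 .. $];

        auto p = word in first_seen;
        if (p) {
            if (nline - *p > horizon)     // seen before: update the horizon
                horizon = nline - *p;
        } else if (first_seen.length < maxWords) {
            first_seen[word.dup] = nline; // still room in the dictionary
        } else {
            spill.writefln("%d\t%s", nline, word);  // "out of memory": defer
            nspilled++;
        }
    }
}

The driver would first rewrite the input into tagged records, then call onePass repeatedly, feeding each spill file back in; every pass retires up to maxWords distinct words, so the loop terminates, and it stops as soon as a pass spills nothing.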