October 23, 2008
If size is a problem, use a trie!

//---CODE---
import std.stream;
import std.stdio;

// One trie node per line prefix; each complete line ends at some node, where
// we record whether the line has been seen and its first line number.
class position_trie
{
  bool seen = false;
  ulong first_seen = 0;
  position_trie[char] branch;
}

void main()
{
  ulong max;
  Stream file = new MemoryStream(
    "lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
adipisicing
elit"
  );
  auto root = new position_trie();
  position_trie node = root;
  position_trie next;
  foreach(ulong lineno, char[] line; file)
  {
    // Walk/extend the trie so that `node` ends up at the node for this line.
    foreach(int chrno, char chr; line)
    {
      if(chr in node.branch)
      {
        node = node.branch[chr];
      }
      else
      {
        // No existing branch: append the rest of the line as new nodes.
        next = new position_trie();
        node.branch[chr] = next;
        node = next;
        foreach(char chr2; line[chrno+1..$])
        {
          next = new position_trie();
          node.branch[chr2] = next;
          node = next;
        }
        break;
      }
    }
    if(node.seen)
    {
      // Repeated line: the spread is measured from its first occurrence.
      auto dist = lineno - node.first_seen;
      if(dist > max)
        max = dist;
    }
    else
    {
      node.seen = true;
      node.first_seen = lineno;
    }
    node = root;
  }
  writefln(max);
}

October 23, 2008
(going at this blind so I might dup others' ideas)

- walk the file line by line
- hash each line with a "good hash function"*
- store the line index and the start location of each line, keyed by its hash

- look for the hash bucket with the largest spread of line indexes
- grab that spread
- page in the needed file segments
- test whether the lines are the same; if so, return
- else try again
-- using some kind of LRU cache, keep as many pages in RAM as you can
-- prefer testing stuff that is in memory over stuff that is not.

The main issue will be the what-to-pick-next function. It must be fast, must not require too much metadata, and must work well with the cache.
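A rough D1-style sketch of the indexing pass only (the paging, verification, and LRU-cache parts are left out); the struct, the file-name argument, and the built-in getHash as a stand-in for the "good hash function" are all just assumptions for illustration:

import std.stdio;
import std.stream;

// One record per occurrence: line number plus byte offset of the line start.
struct Entry { ulong line; ulong offset; }

void main(char[][] args)
{
  Stream file = new BufferedFile(args[1]);
  Entry[][hash_t] buckets;          // hash -> occurrences, in file order
  ulong offset = 0;
  ulong lineno = 0;
  foreach (char[] line; file)
  {
    hash_t h = typeid(char[]).getHash(&line);   // stand-in for a "good hash function"
    Entry e;
    e.line = lineno;
    e.offset = offset;
    if (auto list = h in buckets)
      *list ~= e;
    else
      buckets[h] = [e];
    offset += line.length + 1;      // +1 assumes '\n' line endings
    ++lineno;
  }

  // The next phase would start from the bucket with the largest spread of
  // line indexes; entries are appended in file order, so first/last suffice.
  ulong best = 0;
  foreach (Entry[] list; buckets)
    if (list.length > 1 && list[$ - 1].line - list[0].line > best)
      best = list[$ - 1].line - list[0].line;
  writefln("largest candidate spread: %d (still needs string verification)", best);
}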



Another option:

load every line into a database table along with its index:

INSERT INTO tmp SELECT MAX(index) - MIN(index) AS dif FROM lines GROUP BY line;
SELECT MAX(dif) INTO i, COUNT(dif) INTO c FROM tmp;

if(c == 0)
 return 0;
else
 return i;


October 24, 2008
What if "corpus" appeared one more time, but at a smaller distance?

line 5: corpus
...
line 14: corpus
..
line 16: corpus

Would this stream have a different horizon?
October 24, 2008
Reply to Lionello,

> What if "corpus" appeared one more time, but at a smaller distance?
> 
> line 5: corpus
> ...
> line 14: corpus
> ..
> line 16: corpus
> Would this stream have a different horizon?
> 

IMNSHO no, the max spread would be from line 5 to line 16.


October 24, 2008
By hash value, store a linked list of (line #, char offset within file)

When a hash collision occurs, append to the list, but also calculate the horizon from the head of the list.

While the candidate horizon (measured from the head of the list) is higher than the maximum previously observed, scan the list until a string match is found, then update the maximum observed horizon.


The performance is linear, but it involves extra string lookups compared with some alternatives. The string comparisons could be optimized with some kind of temporary caching of strings, such as weak pointers.
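A hedged D1-style sketch of this scheme (all names invented for illustration): each bucket holds (line #, offset) pairs in file order, and on a collision candidate lines are re-read from the file and compared only while the head of the list could still beat the best horizon so far.

import std.stdio;
import std.stream;

struct Occurrence { ulong line; ulong offset; }

// Re-read one line from the file at a known byte offset for verification.
char[] readLineAt(Stream f, ulong offset)
{
  f.seekSet(cast(long) offset);
  return f.readLine();
}

void main(char[][] args)
{
  Stream input  = new BufferedFile(args[1]);       // sequential pass
  Stream verify = new File(args[1], FileMode.In);  // random-access re-reads

  Occurrence[][hash_t] buckets;
  ulong offset = 0, lineno = 0, horizon = 0;

  foreach (char[] line; input)
  {
    hash_t h = typeid(char[]).getHash(&line);
    Occurrence cur;
    cur.line = lineno;
    cur.offset = offset;

    if (auto list = h in buckets)
    {
      foreach (Occurrence old; *list)
      {
        if (cur.line - old.line <= horizon)
          break;                           // no later entry can improve the horizon
        if (readLineAt(verify, old.offset) == line)
        {
          horizon = cur.line - old.line;   // a real match, not just a hash collision
          break;
        }
      }
      *list ~= cur;
    }
    else
      buckets[h] = [cur];

    offset += line.length + 1;   // assumes '\n' line endings
    ++lineno;
  }
  writefln("horizon: %d", horizon);
}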



Andrei Alexandrescu wrote:

> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines.
> 
> For example, this stream has horizon 9:
> 
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
> 
> This is because the word "corpus" occurs on line 5 and line 14. This is the farthest two lines in the file are; notice that "ipsum" also occurs twice, but at a smaller distance.
> 
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
> 
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!
> 
> 
> Andrei

October 26, 2008
On Thu, 23 Oct 2008 14:29:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines.
> 
> For example, this stream has horizon 9:
> 
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
> 
> This is because the word "corpus" occurs on line 5 and line 14. This is the farthest two lines in the file are; notice that "ipsum" also occurs twice, but at a smaller distance.
> 
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
> 
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!
> 
> 
> Andrei

The hash solution is simple and fast, but uses too much memory. So, sub-divide your data.

Obviously, you cannot process the first half and then the second, but you can make multiple complete passes, and only consider a subset during each pass.

For example, hash all the lines beginning with 'a' (or 'A', or whatever is applicable to your data) on the first pass, the lines beginning with 'b' on the second pass, and so on. At the end of each pass you retain only the best horizon discovered so far and discard the rest.

Assuming all lower-case lines and a roughly even distribution of first letters, you reduce your memory requirement to ~4% of a single-pass approach.
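A rough D1-style sketch of a single pass of this scheme; the helper name, the file argument, and the all-lower-case assumption are illustrative rather than a finished implementation:

import std.stdio;
import std.stream;

// One pass: consider only lines starting with passChar, track the first line
// number of each such line, and return the best spread found in this pass.
ulong onePass(char[] filename, char passChar)
{
  Stream file = new BufferedFile(filename);
  ulong[char[]] firstSeen;     // line text -> first line number (this pass only)
  ulong lineno = 0, best = 0;

  foreach (char[] line; file)
  {
    if (line.length && line[0] == passChar)
    {
      if (auto p = line in firstSeen)
      {
        if (lineno - *p > best)
          best = lineno - *p;
      }
      else
        firstSeen[line.dup] = lineno;   // .dup in case the read buffer is reused
    }
    ++lineno;
  }
  file.close();
  return best;
}

void main(char[][] args)
{
  ulong best = 0;
  for (char c = 'a'; c <= 'z'; ++c)     // 26 passes, all-lower-case assumption
  {
    ulong b = onePass(args[1], c);
    if (b > best)
      best = b;
  }
  writefln("horizon: %d", best);
}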

By way of example, I knocked up a data file consisting of 100+ million lines (1GB) of words chosen randomly from a dictionary. Using Perl and 26 passes, I got my result in just under 25 minutes using < 2MB of memory.

Admittedly, with this size of file, the entire thing was in the file system cache for the second and subsequent passes, which saved considerable time. If your file is much bigger, that probably wouldn't be the case. But if you're going to write this in D, you can probably speed things up a bit anyway.

wc -l with a cold cache manages a throughput of 28MB/sec on my system, whereas Perl only achieved a best of 60MB/sec with a warm cache.

I seem to recall seeing a memory-mapped file module in the D library somewhere. Maybe that would reduce the IO for multiple passes?
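The module is std.mmfile. A minimal sketch of mapping the file once and walking the mapped bytes; whether it actually reduces IO across passes depends on the OS page cache:

import std.mmfile;
import std.stdio;

void main(char[][] args)
{
  auto mmf = new MmFile(args[1]);      // read-only mapping of the whole file
  char[] data = cast(char[]) mmf[];    // the entire file as one array
  ulong lines = 0;
  foreach (char c; data)
    if (c == '\n')
      ++lines;
  writefln("%d lines mapped, %d bytes", lines, data.length);
}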

Also, given the minimal 2MB max. memory usage above, if you know (or can probe) your data, then you can reduce the number of passes by doing (say) a-f on the first pass, g-l on the second etc.

Or, if all your data begins with (say) "Line ...", key off the appropriate position in the line to form your pass split. Or, if your file is really huge, use the first two characters for a 676-way split. (And wait :)

Anyway, just a thought from a drive-by reader.



October 26, 2008
On Sun, 26 Oct 2008 01:00:59 GMT, Nigel Sandever <nigelsandever@btconnect.com> wrote:

> system, whereas Perl only achieved a best of 60MB/sec with a warm cache.
> 
Should have been 12MB/sec. I was instrumenting in 5 second intervals.


October 26, 2008
Could you post that file somewhere so we could get some even level testing going for these algos?
October 26, 2008
On Sat, 25 Oct 2008 21:45:29 -0400, mgen <bmeck@stedwards.edu> wrote:
> Could you post that file somewhere so we could get some even level testing going for these algos?

Sure, if you know somewhere that would allow me to upload a (bzip -9) 300MB file?

And are prepared to wait a while; I'm on dial-up :(




October 26, 2008
Nigel Sandever:
> For example, hash all the lines beginning with 'a' (or 'A', or whatever is applicable to your data) on the first pass, the lines beginning with 'b' on the second pass, and so on. At the end of each pass you retain only the best horizon discovered so far and discard the rest.

A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to split the input data into bins. But you want to fill those bins uniformly, so how can you do that? A good hash is nearly random, so you can partition on its bits to split your data into more uniform groups :-) The downside is that you have to compute the hash for each line on every pass. To speed this up you can hash just a sample of the chars of each line...
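A small hedged sketch of that idea: partition on a few bits of a hash of a short prefix of each line. The bit count, prefix length, and names are arbitrary, and only the bin-selection test for one pass is shown.

import std.stdio;
import std.stream;

// Does this line belong to the current pass? 5 bits -> 32 roughly even bins.
bool inThisPass(char[] line, uint pass, uint bits)
{
  // Hash only a short prefix, as suggested, to keep the per-pass test cheap.
  char[] sample = line.length > 16 ? line[0 .. 16] : line;
  hash_t h = typeid(char[]).getHash(&sample);
  return (h & ((1u << bits) - 1)) == pass;
}

void main(char[][] args)
{
  Stream file = new BufferedFile(args[1]);
  uint pass = 0;                    // a real run would loop pass over 0 .. 31
  ulong kept = 0, total = 0;
  foreach (char[] line; file)
  {
    if (inThisPass(line, pass, 5))
      ++kept;                       // this line would be hashed/stored in this pass
    ++total;
  }
  writefln("pass %d keeps %d of %d lines", pass, kept, total);
}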

Note that a 1 GB file is too small for a true benchmark; I have 2 GB of RAM... You don't need to load the lines as strings: you can load the whole file as a single memory block in RAM. Then you don't even need slices (8 bytes) to represent a string, because each line's end point is either the end of the file or the next newline, so you just need the starting points as keys and the line indexes as values. Unfortunately the built-in AA with the built-in memory allocator uses the same memory whether your keys are 8 or 4 bytes long. So if you see that memory isn't enough to run your program, you need a less memory-hungry hash map. You can find plenty around, written in C; I have translated one to D, it's easy. With that, if the lines aren't too short or there are some duplicate lines, then the RAM (assuming 2 GB) suffices to keep the whole associative array, so you need to load the file just one time.
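A minimal D1-style sketch of the single-memory-block part; the less memory-hungry hash map keyed by these offsets is not shown, and the uint offsets assume the file fits in 4 GB:

import std.file;
import std.stdio;

void main(char[][] args)
{
  // Load the whole file as one block; no per-line string copies.
  char[] data = cast(char[]) std.file.read(args[1]);
  uint[] starts;                 // one 4-byte starting offset per line
  uint begin = 0;
  foreach (uint i, char c; data)
  {
    if (c == '\n')
    {
      starts ~= begin;
      begin = i + 1;
    }
  }
  if (begin < data.length)
    starts ~= begin;             // last line without a trailing newline
  // Line n runs from starts[n] to the next newline (or the end of the block),
  // so an offset plus the block is enough to recover the text for hashing.
  writefln("%d lines, %d bytes of offsets", starts.length, starts.length * uint.sizeof);
}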

Bye,
bearophile