October 26, 2008 Re: [OT] The horizon of a stream
Posted in reply to mgen

mgen:
> Could you post that file somewhere so we could get some even level testing going for these algos?

It looks a bit too large. Better to write a little program that generates random data with a suitable distribution. To make it portable across Phobos/Tango/etc., the RNG can be included in the code; there are very fast ones around.
Bye,
bearophile
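A minimal sketch of such a generator (in Python for illustration; the thread's context is D, and the function and parameter names here are mine, not from the thread). It embeds its own tiny PRNG, a linear congruential generator, so the output is bit-identical regardless of which standard library's random implementation is available:

```python
# Hypothetical test-data generator: embeds its own PRNG (a simple LCG)
# so the output is identical on every platform and runtime.

def lcg(seed):
    """Deterministic linear congruential generator yielding 32-bit ints."""
    state = seed
    while True:
        state = (state * 1664525 + 1013904223) % 2**32
        yield state

def generate_lines(n, n_distinct=1000, seed=42):
    """Yield n lines drawn from a pool of n_distinct synthetic words."""
    rng = lcg(seed)
    for _ in range(n):
        yield "word%d" % (next(rng) % n_distinct)

if __name__ == "__main__":
    for line in generate_lines(10):
        print(line)
```

Because the generator is part of the program, two people running it with the same seed get byte-identical test files, which is the portability bearophile is after.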
October 26, 2008 Re: [OT] The horizon of a stream
Posted in reply to bearophile

On Sat, 25 Oct 2008 22:54:47 -0400, bearophile <bearophileHUGS@lycos.com> wrote:

> mgen:
> > Could you post that file somewhere so we could get some even level testing going for these algos?
>
> It looks a bit too large. Better to write a little program that generates random data with a suitable distribution. To make it portable across Phobos/Tango/etc., the RNG can be included in the code; there are very fast ones around.
>
> Bye,
> bearophile

You'd need the PRNG to produce the same sequence for the same seed cross-platform (the MT, for example). And you'd need to distribute (or point to) a common dictionary.
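As it happens, Python's `random` module uses the Mersenne Twister that Nigel mentions, so the same-seed-same-sequence requirement can be sketched there (illustrative only; the thread itself is about D and Perl):

```python
import random

def sample_words(words, n, seed):
    """Sample n words reproducibly. CPython's random module is a
    Mersenne Twister with a specified seeding procedure, so the same
    seed yields the same sequence on every platform."""
    rng = random.Random(seed)  # independent generator, fixed seed
    return [rng.choice(words) for _ in range(n)]

words = ["lorem", "ipsum", "dolor", "corpus"]
# Two runs with the same seed agree word-for-word:
assert sample_words(words, 8, seed=123) == sample_words(words, 8, seed=123)
```

With a shared seed and a shared dictionary, everyone benchmarks against the exact same stream, which is the "even level testing" mgen asked for.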
October 26, 2008 Re: [OT] The horizon of a stream
Posted in reply to bearophile

On Sat, 25 Oct 2008 22:52:51 -0400, bearophile <bearophileHUGS@lycos.com> wrote:

> Nigel Sandever:
> > For example, if you hash all lines beginning with 'a' (or 'A' or whatever is applicable with your data), and 'b' on the second pass etc. At the end of each pass you retain only the best horizon so far discovered and discard the rest.
>
> A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to split the input data into bins. But you want to fill those bins uniformly. So how can you do it? The hash number, if good, is nearly random, so you can partition its bits to partition your data into more uniform groups :-) The downside is that you have to compute the hash for each line many times. To speed this up you can hash just a sample of the chars of each line...

I did try that (using md5), but the penalty in Perl was horrible. The hash(AA) just had to rehash the binary md5 anyway, so it was pure cost for my known dataset. For a real application with unknown/non-uniform data it would work like a charm though ... assuming you don't hit that rarest of non-impossibilities: duplicate md5s of non-identical inputs :)

That said, for most applications of this, one would probably have some feel for the dataset(s) one is likely to deal with. And trading domain-specific knowledge for total generality is often the best optimisation one can perform.

> Note that a 1 GB file is too small for a true benchmark, I have 2 GB RAM...

Agreed and ditto. But provided algorithms are written to assume files larger than memory, the numbers do scale linearly.

> You don't need to load the lines as strings, you can load the whole file as a single memory block in RAM. Then you don't even need slices (8 bytes) to represent a string, because their ending point is the end of the file or the successive newline, so you just need the starting points as keys and the line indexes as values.
>
> Unfortunately the built-in AA with the built-in memory allocator uses the same memory whether your keys are 8 or 4 bytes long. So if you see memory isn't enough to run your program, then you need to use a less memory-hungry hash map. You can find plenty around, written in C; I have translated one to D, it's easy. With that, if the lines aren't too short or there are some duplicate lines, then the RAM (assuming 2 GB) suffices to keep the whole associative array, so you need to load the file just one time.

I agree that a 'testset producer' and a freely accessible known dictionary is a better option. I used (a slightly modified version of) 2of12inf available from http://downloads.sourceforge.net/wordlist/alt12dicts-4.tar.gz

> Bye,
> bearophile
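bearophile's bit-partitioning idea might be sketched like this (Python for brevity; the function and its parameters are illustrative names, not anything from the thread). Instead of binning lines by their first letter, bin them by the low bits of their hash, which spreads lines far more uniformly; each pass keeps in memory only the lines whose hash falls in the current bin:

```python
def horizon_multipass(lines, bits=2):
    """Compute the horizon in 2**bits passes over the data, keeping in
    memory only the lines whose hash falls in the current bin.
    `lines` must be re-iterable: a zero-argument function returning a
    fresh iterator (e.g. reopening the file)."""
    nbins = 1 << bits
    best = 0
    for b in range(nbins):
        first_seen = {}  # line -> line number of its first occurrence
        for i, line in enumerate(lines()):
            # hash() is stable within one process run, which is all
            # the binning needs: identical lines land in the same bin.
            if hash(line) % nbins != b:
                continue
            if line in first_seen:
                best = max(best, i - first_seen[line])
            else:
                first_seen[line] = i
    return best

data = ["lorem", "ipsum", "dolor", "ipsum", "corpus", "lorem"]
print(horizon_multipass(lambda: iter(data)))  # 5 ("lorem" at lines 0 and 5)
```

The trade-off is exactly the one discussed above: memory per pass shrinks by roughly 2**bits, at the cost of re-reading (and re-hashing) the stream that many times.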
October 26, 2008 Re: [OT] The horizon of a stream
Posted in reply to Nigel Sandever

Nigel Sandever:

>I did try that (using md5), but the penalty in Perl was horrible,<

This is a D newsgroup, so use D, it allows you to manage bits more efficiently. md5 is too slow for this purpose; use the built-in hashing of strings.

>That said. For most applications of this, one would probably have some feel for the dataset(s) one is likely to deal with. And trading domain specific knowledge for total generality is often the best optimisation one can perform.<

I agree, when you have gigabytes of data it's better to gain some knowledge about the data before processing it.

>But provided algorithms are written to assume files larger than memory, the numbers do scale linearly.<

Several of the provided algorithms don't assume files larger than memory, or they work quite a bit better when the file is smaller than memory.

>I used (a slightly modified version of) 2of12inf available from<

That's a quite complex file, so I suggest something simpler, such as this one, after cleaning out the non-ASCII words:
http://www.norvig.com/big.txt

Bye,
bearophile
October 26, 2008 Re: [OT] The horizon of a stream
Posted in reply to bearophile

On Sun, 26 Oct 2008 03:39:50 -0400, bearophile <bearophileHUGS@lycos.com> wrote:

> Nigel Sandever:
>
> >I did try that (using md5), but the penalty in Perl was horrible,<
>
> This is a D newsgroup, so use D, it allows you to manage bits more efficiently.

Sorry. No disrespect meant to D. I always prototype in Perl and then convert to C or D if I need performance. I'm just more familiar with Perl.

> >I used (a slightly modified version of) 2of12inf available from<
>
> That's a quite complex file, so I suggest something simpler, such as this one, after cleaning out the non-ASCII words:
> http://www.norvig.com/big.txt

I don't know what is "complex" about a one-word-per-line, 81536-line dictionary file? Or how having everyone clean up Conan Doyle would be simpler?

If you have Perl, you can produce a suitable testfile from any one-word-per-line dictionary with the command line:

perl -l12n0777aF/\n/ -ne'print $F[rand @F] for 1..4e8' yourdict >thedata

With the 2of12inf dictionary file, 4e8 produces a file a little under 4GB in around 10 minutes. YMWV depending upon the average length of the lines in your local dict. Of course they won't all be the same as mine, or anyone else's, but given the random nature, the results will be broadly comparable.

My D foo is too rusty to try and write that in D. Especially for a D audience :) I'm sure one of you guys can knock that up in the blink of an eye.

> Bye,
> bearophile
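For those without Perl, a rough Python equivalent of that one-liner (the function name and file paths are mine, purely illustrative): sample words uniformly, with replacement, from a one-word-per-line dictionary file.

```python
import random

def make_testfile(dict_path, out_path, n_lines, seed=0):
    """Write n_lines words sampled uniformly (with replacement) from a
    one-word-per-line dictionary file; same idea as the Perl one-liner,
    but seeded so the output is reproducible."""
    with open(dict_path) as f:
        words = f.read().split()
    rng = random.Random(seed)
    with open(out_path, "w") as out:
        for _ in range(n_lines):
            out.write(rng.choice(words) + "\n")

# Usage (paths are examples): make_testfile("2of12inf.txt", "thedata", 4 * 10**8)
```

Unlike the Perl version, the fixed seed means two people running it against the same dictionary get identical files, which sidesteps the "broadly comparable" caveat.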
November 01, 2008 Re: [OT] The horizon of a stream
Posted in reply to Andrei Alexandrescu

Andrei Alexandrescu wrote:
> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines.
>
> For example, this stream has horizon 9:
>
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
>
> This is because the word "corpus" occurs on line 5 and line 14. This is the farthest two lines in the file are; notice that "ipsum" also occurs twice, but at a smaller distance.
>
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
>
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!
>
>
> Andrei
Simplicity, considering external memory constraints:
Read in the file, line by line. Output the line followed by the line number. Sort the output (merge sort works wonderfully). Now you go through it a second time and find the maximum distance.
This takes O(n log n) time, which is far from optimal. However, I think we already did sorting in external memory, so I could implement this much quicker than most alternatives. Depending on the number of times this has to be done, that might be a net advantage.
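Christopher's approach, sketched in-memory in Python (a real run at scale would replace `sorted` with an external merge sort over the (line, line-number) pairs; the function name is illustrative):

```python
def horizon_by_sorting(lines):
    """Tag each line with its number, sort by content, then scan runs
    of identical lines: the horizon is the largest span of line numbers
    within any run. O(n log n), but the sort can be done externally."""
    tagged = sorted((line, i) for i, line in enumerate(lines))
    best, start = 0, 0
    for j in range(1, len(tagged) + 1):
        if j == len(tagged) or tagged[j][0] != tagged[start][0]:
            # End of a run of identical lines. Line numbers within a
            # run sort ascending, so the span is last minus first.
            best = max(best, tagged[j - 1][1] - tagged[start][1])
            start = j
    return best

data = ["lorem", "ipsum", "dolor", "ipsum", "corpus", "lorem"]
print(horizon_by_sorting(data))  # 5
```

The second "pass" is the run scan over the sorted output, which is sequential and needs O(1) extra memory, so external sorting really is the only hard part.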
November 01, 2008 Re: [OT] The horizon of a stream
Posted in reply to Christopher Wright | Christopher Wright wrote:
> Andrei Alexandrescu wrote:
>> [challenge statement quoted in full; snipped]
>
> Simplicity considering external memory constraints.
> Read in the file, line by line. Output the line followed by the line number. Sort the output (merge sort works wonderfully). Now you go through it a second time and find the maximum distance.
>
> This takes O(n log n) time, which is far from optimal. However, I think we already did sorting in external memory, so I could implement this much quicker than most alternatives. Depending on the number of times this has to be done, that might be a net advantage.
I like this solution a lot. Kudos!
Andrei
Copyright © 1999-2021 by the D Language Foundation