Thread overview | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
July 16, 2011 Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Hello. I'm new to D but bear with me, please. I have several files that look like this: 71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216 I'm trying to read those files and search for sequences of digits inside, hopefully with the Boyer-Moore implementation in std.algorithm. Right now I have made a small script that iterates over the .txt files in the current directory and reads line by line and uses find on it. But I haven't been able to write a range that removes the whitespace and can be used with find. It should generate one long stream of digits like: 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216 Any help? Thanks |
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Willy Martinez | On 7/16/2011 3:05 PM, Willy Martinez wrote: > Any help? import std.algorithm; import std.ascii; import std.stdio; void main(string[] argv) { auto s = "71104 08924 72394 13995 49707 98696 48245 08311 44066 67172 56025 07952 00384 37808 90166 13871 94258 37216"; auto t = filter!("!std.ascii.isWhite(a)")(s); bool found = canFind(t, "408"); writeln(t); writeln(found); } Outputs: 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216 true |
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Johann MacDonagh | On 7/16/2011 3:56 PM, Johann MacDonagh wrote:
> On 7/16/2011 3:05 PM, Willy Martinez wrote:
>> Any help?
>
> import std.algorithm;
> import std.ascii;
> import std.stdio;
>
> void main(string[] argv)
> {
> auto s = "71104 08924 72394 13995 49707 98696
> 48245 08311 44066 67172 56025 07952
> 00384 37808 90166 13871 94258 37216";
>
> auto t = filter!("!std.ascii.isWhite(a)")(s);
>
> bool found = canFind(t, "408");
>
> writeln(t);
> writeln(found);
> }
>
> Outputs:
>
> 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216
>
> true
>
This is actually better:
import std.algorithm;
import std.ascii;
import std.stdio;
void main(string[] argv)
{
auto s = "71104 08924 72394 13995 49707 98696
48245 08311 44066 67172 56025 07952
00384 37808 90166 13871 94258 37216";
auto t = filter!( (c) { return isWhite(c); } )(s);
bool found = canFind(t, "408");
writeln(t);
writeln(found);
}
Otherwise we're relying on std.algorithm importing std.ascii.
|
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Willy Martinez | On 16.07.2011 23:05, Willy Martinez wrote: > Hello. I'm new to D but bear with me, please. > > I have several files that look like this: > > 71104 08924 72394 13995 49707 98696 > 48245 08311 44066 67172 56025 07952 > 00384 37808 90166 13871 94258 37216 > > I'm trying to read those files and search for sequences of digits inside, > hopefully with the Boyer-Moore implementation in std.algorithm. > > Right now I have made a small script that iterates over the .txt files in the > current directory and reads line by line and uses find on it. > > But I haven't been able to write a range that removes the whitespace and can > be used with find. It should generate one long stream of digits like: > > 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216 If you wish to avoid storing all of this in an array by using e.g. filter _and_ use Boyer-Moore search on it then: No, you can't do that. The reason is that filter is ForwardRange with an important consequence that you can't look at arbitrary Nth element in O(1). And Boyer-Moore requires such and access to be anywhere efficient. Why doesn't filter not provide O(1) random access ? Because to get Nth element you'd need to check at least N (and potentially unlimited) number of elements before in case they get filtered out. > Any help? If I'd had this sort of problem I'd use something along the lines: auto file = File("yourfile"); foreach( line; file.ByLine) { auto onlyDigitis = array(filter!((x){ return !isWhite(x); })(line)); // this copies all digits to a new array auto result = find(onlyDigits, ... ); //your query here ///.... } > Thanks -- Dmitry Olshansky |
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | == Quote from Dmitry Olshansky (dmitry.olsh@gmail.com)'s article > If you wish to avoid storing all of this in an array by using e.g. > filter _and_ use Boyer-Moore search on it then: No, you can't do that. > The reason is that filter is ForwardRange with an important consequence > that you can't look at arbitrary Nth element in O(1). And Boyer-Moore > requires such and access to be anywhere efficient. > Why doesn't filter not provide O(1) random access ? Because to get Nth > element you'd need to check at least N (and potentially unlimited) > number of elements before in case they get filtered out. > > Any help? > If I'd had this sort of problem I'd use something along the lines: > auto file = File("yourfile"); > foreach( line; file.ByLine) > { > auto onlyDigitis = array(filter!((x){ return !isWhite(x); > })(line)); // this copies all digits to a new array > auto result = find(onlyDigits, ... ); //your query here > ///.... > } > > Thanks I don't mind storing it in memory. Each .txt file is around 20MB so the filtered string should be even smaller. Still, calling array gives this error: ..\..\src\phobos\std\algorithm.d(3252): Error: function std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string haystack) is not callable using argument types (dchar[]) ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert expression (haystack) of type dchar[] to string ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert expression (needle.beFound((__error))) of type string to dchar[] search_seq.d(13): Error: template instance std.algorithm.find!(dchar[],result,string) error instantiating From this code: import std.algorithm; import std.array; import std.file; import std.stdio; void main(string[] args) { auto needle = boyerMooreFinder(args[1]); foreach (string name; dirEntries(".", SpanMode.shallow)) { if (name[$-3 .. $] == "txt") { writeln(name); string text = readText(name); auto haystack = array(filter!("a >= '0' && a <= '9'")(text)); auto result = find(haystack, needle); writeln(result); } } } I'm using DMD 2.054 on Windows if that helps |
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Willy Martinez | On 7/16/2011 4:41 PM, Willy Martinez wrote:
> auto needle = boyerMooreFinder(args[1]);
I'm confused. Boyer-Moore is when you want to find two or more elements efficiently in a range. When you create a BoyerMooreFinder you give it a range of elements it should search for. When you give it args[1] you are giving it a string which is immutable(char)[], or a range of characters. So if args[1] = "123" you're saying "search for 1, 2, or 3".
Are you actually trying to search for multiple elements at the same time?
|
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Johann MacDonagh | On 7/16/2011 4:43 PM, Johann MacDonagh wrote:
> On 7/16/2011 4:41 PM, Willy Martinez wrote:
>> auto needle = boyerMooreFinder(args[1]);
>
> I'm confused. Boyer-Moore is when you want to find two or more elements
> efficiently in a range. When you create a BoyerMooreFinder you give it a
> range of elements it should search for. When you give it args[1] you are
> giving it a string which is immutable(char)[], or a range of characters.
> So if args[1] = "123" you're saying "search for 1, 2, or 3".
>
> Are you actually trying to search for multiple elements at the same time?
>
Ugh, I'm dumb. I misread the documentation.
Instead of defining "needle", just pass arg[1] into your call to find.
|
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Willy Martinez | On 17.07.2011 0:41, Willy Martinez wrote: > == Quote from Dmitry Olshansky (dmitry.olsh@gmail.com)'s article >> If you wish to avoid storing all of this in an array by using e.g. >> filter _and_ use Boyer-Moore search on it then: No, you can't do that. >> The reason is that filter is ForwardRange with an important consequence >> that you can't look at arbitrary Nth element in O(1). And Boyer-Moore >> requires such and access to be anywhere efficient. >> Why doesn't filter not provide O(1) random access ? Because to get Nth >> element you'd need to check at least N (and potentially unlimited) >> number of elements before in case they get filtered out. >>> Any help? >> If I'd had this sort of problem I'd use something along the lines: >> auto file = File("yourfile"); >> foreach( line; file.ByLine) >> { >> auto onlyDigitis = array(filter!((x){ return !isWhite(x); >> })(line)); // this copies all digits to a new array >> auto result = find(onlyDigits, ... ); //your query here >> ///.... >> } >>> Thanks > I don't mind storing it in memory. Each .txt file is around 20MB so the filtered > string should be even smaller. > > Still, calling array gives this error: Not exactly calling array but I perfectly understand why you have confused it. > > ..\..\src\phobos\std\algorithm.d(3252): Error: function > std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string > haystack) is not callable using argument types (dchar[]) > ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert > expression (haystack) of type dchar[] to string > ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert > expression (needle.beFound((__error))) of type string to dchar[] search_seq.d(13): > Error: template instance std.algorithm.find!(dchar[],result,string) error > instantiating Let's drill down to the problem through this barrage of crap: the problem statement is Error: cannot implicitly convert expression (haystack) of type dchar[] to string So (apparently) the problem is that after array(filter!(... you get array of dchars (unicode codepoints)as a result of filtering string (which is UTF-8 under the hood btw) while you are going to search an UTF-8 string. And UTF-8 string is (once again) is not random accessible in sense of codepoints (it's needs an UTF decode though it's clearly not needed in your case). The simplest workaround I can think of is convert needle to dstring: auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv > > > From this code: > > import std.algorithm; > import std.array; > import std.file; > import std.stdio; > > void main(string[] args) { > auto needle = boyerMooreFinder(args[1]); > foreach (string name; dirEntries(".", SpanMode.shallow)) { > if (name[$-3 .. $] == "txt") { > writeln(name); > string text = readText(name); > auto haystack = array(filter!("a>= '0'&& a<= '9'")(text)); > auto result = find(haystack, needle); > writeln(result); > } > } > } > > > I'm using DMD 2.054 on Windows if that helps -- Dmitry Olshansky |
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:
>
> Let's drill down to the problem through this barrage of crap:
>
> the problem statement is
>
> Error: cannot implicitly convert expression (haystack) of type dchar[]
> to string
>
> So (apparently) the problem is that after array(filter!(... you get
> array of dchars (unicode codepoints)as a result of filtering string
> (which is UTF-8 under the hood btw) while you are going to search an
> UTF-8 string.
> And UTF-8 string is (once again) is not random accessible in sense of
> codepoints (it's needs an UTF decode though it's clearly not needed in
> your case).
> The simplest workaround I can think of is convert needle to dstring:
> auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv
>
>
Nope, that doesn't work.
Here's what he should do:
import std.algorithm;
import std.array;
import std.file;
import std.stdio;
void main(string[] args) {
foreach (string name; dirEntries(".", SpanMode.shallow)) {
if (name[$-3 .. $] == "txt") {
writeln(name);
string text = readText(name);
auto haystack = filter!("a >= '0' && a <='9'")(text);
auto result = find(haystack, args[1]);
writeln(result);
}
}
}
If you call array() on the filter you're already doing O(n) operations, meaning you're not going to get any performance benefit by being able to do Boyer-Moore.
Correct me if I'm wrong, but I believe find() will do Boyer-Moore if it knows its haystack is random access. This is the power of templated programming. find! can generate an efficient find algorithm if it knows the range is random access, or default to a less efficient one if not.
|
July 16, 2011 Re: Search a file skiping whitespace | ||||
---|---|---|---|---|
| ||||
Posted in reply to Johann MacDonagh | On 17.07.2011 1:28, Johann MacDonagh wrote: > On 7/16/2011 5:07 PM, Dmitry Olshansky wrote: >> >> Let's drill down to the problem through this barrage of crap: >> >> the problem statement is >> >> Error: cannot implicitly convert expression (haystack) of type dchar[] >> to string >> >> So (apparently) the problem is that after array(filter!(... you get >> array of dchars (unicode codepoints)as a result of filtering string >> (which is UTF-8 under the hood btw) while you are going to search an >> UTF-8 string. >> And UTF-8 string is (once again) is not random accessible in sense of >> codepoints (it's needs an UTF decode though it's clearly not needed in >> your case). >> The simplest workaround I can think of is convert needle to dstring: >> auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv >> >> > > Nope, that doesn't work. > > Here's what he should do: > > import std.algorithm; > import std.array; > import std.file; > import std.stdio; > > void main(string[] args) { > foreach (string name; dirEntries(".", SpanMode.shallow)) { > if (name[$-3 .. $] == "txt") { > writeln(name); > string text = readText(name); > auto haystack = filter!("a >= '0' && a <='9'")(text); > auto result = find(haystack, args[1]); > writeln(result); > } > } > } > > If you call array() on the filter you're already doing O(n) operations, meaning you're not going to get any performance benefit by being able to do Boyer-Moore. > Actually with brute force you do O(mn) comparisons in general while there are Boyer-Moor implementations that will do O(n). Now with an alphabet of 0..9 I'd say you won't get a lot of speed up with Boyer-Moore, in any case I'd suggest OP to at least test both. In this case however my bet is on bruteforce (given an extra allocation on each string with B-M). > Correct me if I'm wrong, but I believe find() will do Boyer-Moore if it knows its haystack is random access. This is the power of templated programming. find! can generate an efficient find algorithm if it knows the range is random access, or default to a less efficient one if not. No, the problem of BoyerMoor search is there is some preparatory overhead to construct lookup table, apparently that table is stored in that BoyerMoorFinder. Frankly, std.algorithm might add at least one another sting searching algorithm e.g. http://monge.univ-mlv.fr/~lecroq/string/ -- Dmitry Olshansky |
Copyright © 1999-2021 by the D Language Foundation