Thread overview
Search a file skiping whitespace
Jul 16, 2011
Willy Martinez
Jul 16, 2011
Johann MacDonagh
Jul 16, 2011
Johann MacDonagh
Jul 16, 2011
Dmitry Olshansky
Jul 16, 2011
Willy Martinez
Jul 16, 2011
Johann MacDonagh
Jul 16, 2011
Johann MacDonagh
Jul 16, 2011
Dmitry Olshansky
Jul 16, 2011
Johann MacDonagh
Jul 16, 2011
Dmitry Olshansky
July 16, 2011
Hello. I'm new to D but bear with me, please.

I have several files that look like this:

71104 08924 72394 13995 49707 98696
48245 08311 44066 67172 56025 07952
00384 37808 90166 13871 94258 37216

I'm trying to read those files and search for sequences of digits inside, hopefully with the Boyer-Moore implementation in std.algorithm.

Right now I have made a small script that iterates over the .txt files in the current directory and reads line by line and uses find on it.

But I haven't been able to write a range that removes the whitespace and can be used with find. It should generate one long stream of digits like:

711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216

Any help?

Thanks
July 16, 2011
On 7/16/2011 3:05 PM, Willy Martinez wrote:
> Any help?

import std.algorithm;
import std.ascii;
import std.stdio;

void main(string[] argv)
{
    auto s = "71104 08924 72394 13995 49707 98696
48245 08311 44066 67172 56025 07952
00384 37808 90166 13871 94258 37216";

    auto t = filter!("!std.ascii.isWhite(a)")(s);

    bool found = canFind(t, "408");

    writeln(t);
    writeln(found);
}

Outputs:

711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216
true

July 16, 2011
On 7/16/2011 3:56 PM, Johann MacDonagh wrote:
> On 7/16/2011 3:05 PM, Willy Martinez wrote:
>> Any help?
>
> import std.algorithm;
> import std.ascii;
> import std.stdio;
>
> void main(string[] argv)
> {
> auto s = "71104 08924 72394 13995 49707 98696
> 48245 08311 44066 67172 56025 07952
> 00384 37808 90166 13871 94258 37216";
>
> auto t = filter!("!std.ascii.isWhite(a)")(s);
>
> bool found = canFind(t, "408");
>
> writeln(t);
> writeln(found);
> }
>
> Outputs:
>
> 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216
>
> true
>

This is actually better:

import std.algorithm;
import std.ascii;
import std.stdio;

void main(string[] argv)
{
    auto s = "71104 08924 72394 13995 49707 98696
48245 08311 44066 67172 56025 07952
00384 37808 90166 13871 94258 37216";

    auto t = filter!( (c) { return isWhite(c); } )(s);

    bool found = canFind(t, "408");

    writeln(t);
    writeln(found);
}

Otherwise we're relying on std.algorithm importing std.ascii.
July 16, 2011
On 16.07.2011 23:05, Willy Martinez wrote:
> Hello. I'm new to D but bear with me, please.
>
> I have several files that look like this:
>
> 71104 08924 72394 13995 49707 98696
> 48245 08311 44066 67172 56025 07952
> 00384 37808 90166 13871 94258 37216
>
> I'm trying to read those files and search for sequences of digits inside,
> hopefully with the Boyer-Moore implementation in std.algorithm.
>
> Right now I have made a small script that iterates over the .txt files in the
> current directory and reads line by line and uses find on it.
>
> But I haven't been able to write a range that removes the whitespace and can
> be used with find. It should generate one long stream of digits like:
>
> 711040892472394139954970798696482450831144066671725602507952003843780890166138719425837216

If you wish to avoid storing all of this in an array by using e.g. filter _and_  use Boyer-Moore search on it then: No, you can't do that. The reason is that filter is ForwardRange with an important consequence that you can't look at arbitrary Nth element in O(1). And Boyer-Moore requires such and access to be anywhere efficient.
Why doesn't filter not provide O(1) random access ? Because to get Nth element you'd need to check at least N (and potentially unlimited)  number of elements before in case they get filtered out.


> Any help?

If I'd had this sort of problem I'd use something along the lines:

auto file = File("yourfile");
foreach( line; file.ByLine)
{
    auto onlyDigitis = array(filter!((x){   return !isWhite(x); })(line)); // this copies all digits to a new array
    auto result = find(onlyDigits, ... ); //your query here
    ///....
}

> Thanks


-- 
Dmitry Olshansky

July 16, 2011
== Quote from Dmitry Olshansky (dmitry.olsh@gmail.com)'s article
> If you wish to avoid storing all of this in an array by using e.g.
> filter _and_  use Boyer-Moore search on it then: No, you can't do that.
> The reason is that filter is ForwardRange with an important consequence
> that you can't look at arbitrary Nth element in O(1). And Boyer-Moore
> requires such and access to be anywhere efficient.
> Why doesn't filter not provide O(1) random access ? Because to get Nth
> element you'd need to check at least N (and potentially unlimited)
> number of elements before in case they get filtered out.
> > Any help?
> If I'd had this sort of problem I'd use something along the lines:
> auto file = File("yourfile");
> foreach( line; file.ByLine)
> {
>      auto onlyDigitis = array(filter!((x){   return !isWhite(x);
> })(line)); // this copies all digits to a new array
>      auto result = find(onlyDigits, ... ); //your query here
>      ///....
> }
> > Thanks

I don't mind storing it in memory. Each .txt file is around 20MB so the filtered string should be even smaller.

Still, calling array gives this error:

..\..\src\phobos\std\algorithm.d(3252): Error: function
std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string
haystack) is not callable using argument types (dchar[])
..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
expression (haystack) of type dchar[] to string
..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
expression (needle.beFound((__error))) of type string to dchar[] search_seq.d(13):
Error: template instance std.algorithm.find!(dchar[],result,string) error
instantiating



From this code:

import std.algorithm;
import std.array;
import std.file;
import std.stdio;

void main(string[] args) {
	auto needle = boyerMooreFinder(args[1]);
	foreach (string name; dirEntries(".", SpanMode.shallow)) {
		if (name[$-3 .. $] == "txt") {
			writeln(name);
			string text = readText(name);
			auto haystack = array(filter!("a >= '0' && a <= '9'")(text));
			auto result = find(haystack, needle);
			writeln(result);
		}
	}
}


I'm using DMD 2.054 on Windows if that helps
July 16, 2011
On 7/16/2011 4:41 PM, Willy Martinez wrote:
> 	auto needle = boyerMooreFinder(args[1]);

I'm confused. Boyer-Moore is when you want to find two or more elements efficiently in a range. When you create a BoyerMooreFinder you give it a range of elements it should search for. When you give it args[1] you are giving it a string which is immutable(char)[], or a range of characters. So if args[1] = "123" you're saying "search for 1, 2, or 3".

Are you actually trying to search for multiple elements at the same time?

July 16, 2011
On 7/16/2011 4:43 PM, Johann MacDonagh wrote:
> On 7/16/2011 4:41 PM, Willy Martinez wrote:
>> auto needle = boyerMooreFinder(args[1]);
>
> I'm confused. Boyer-Moore is when you want to find two or more elements
> efficiently in a range. When you create a BoyerMooreFinder you give it a
> range of elements it should search for. When you give it args[1] you are
> giving it a string which is immutable(char)[], or a range of characters.
> So if args[1] = "123" you're saying "search for 1, 2, or 3".
>
> Are you actually trying to search for multiple elements at the same time?
>

Ugh, I'm dumb. I misread the documentation.

Instead of defining "needle", just pass arg[1] into your call to find.
July 16, 2011
On 17.07.2011 0:41, Willy Martinez wrote:
> == Quote from Dmitry Olshansky (dmitry.olsh@gmail.com)'s article
>> If you wish to avoid storing all of this in an array by using e.g.
>> filter _and_  use Boyer-Moore search on it then: No, you can't do that.
>> The reason is that filter is ForwardRange with an important consequence
>> that you can't look at arbitrary Nth element in O(1). And Boyer-Moore
>> requires such and access to be anywhere efficient.
>> Why doesn't filter not provide O(1) random access ? Because to get Nth
>> element you'd need to check at least N (and potentially unlimited)
>> number of elements before in case they get filtered out.
>>> Any help?
>> If I'd had this sort of problem I'd use something along the lines:
>> auto file = File("yourfile");
>> foreach( line; file.ByLine)
>> {
>>       auto onlyDigitis = array(filter!((x){   return !isWhite(x);
>> })(line)); // this copies all digits to a new array
>>       auto result = find(onlyDigits, ... ); //your query here
>>       ///....
>> }
>>> Thanks
> I don't mind storing it in memory. Each .txt file is around 20MB so the filtered
> string should be even smaller.
>
> Still, calling array gives this error:

Not exactly calling array but I perfectly understand why you have confused it.

>
> ..\..\src\phobos\std\algorithm.d(3252): Error: function
> std.algorithm.BoyerMooreFinder!(result,string).BoyerMooreFinder.beFound (string
> haystack) is not callable using argument types (dchar[])
> ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
> expression (haystack) of type dchar[] to string
> ..\..\src\phobos\std\algorithm.d(3252): Error: cannot implicitly convert
> expression (needle.beFound((__error))) of type string to dchar[] search_seq.d(13):
> Error: template instance std.algorithm.find!(dchar[],result,string) error
> instantiating

Let's drill down to the problem through this barrage of crap:

the problem statement is

 Error: cannot implicitly convert expression (haystack) of type dchar[] to string

So (apparently) the problem is that after array(filter!(... you get array of dchars (unicode codepoints)as a result of filtering string (which is UTF-8 under the hood btw) while you are going to search an UTF-8 string.
And UTF-8 string is (once again) is not random  accessible in sense of codepoints (it's needs an UTF decode though it's clearly not needed in your case).
The simplest workaround I can think of is convert needle to dstring:
auto needle =  boyerMooreFinder(to!dstring(args[1])); //found in std.conv


>
>
>  From this code:
>
> import std.algorithm;
> import std.array;
> import std.file;
> import std.stdio;
>
> void main(string[] args) {
> 	auto needle = boyerMooreFinder(args[1]);
> 	foreach (string name; dirEntries(".", SpanMode.shallow)) {
> 		if (name[$-3 .. $] == "txt") {
> 			writeln(name);
> 			string text = readText(name);
> 			auto haystack = array(filter!("a>= '0'&&  a<= '9'")(text));
> 			auto result = find(haystack, needle);
> 			writeln(result);
> 		}
> 	}
> }
>
>
> I'm using DMD 2.054 on Windows if that helps


-- 
Dmitry Olshansky

July 16, 2011
On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:
>
> Let's drill down to the problem through this barrage of crap:
>
> the problem statement is
>
> Error: cannot implicitly convert expression (haystack) of type dchar[]
> to string
>
> So (apparently) the problem is that after array(filter!(... you get
> array of dchars (unicode codepoints)as a result of filtering string
> (which is UTF-8 under the hood btw) while you are going to search an
> UTF-8 string.
> And UTF-8 string is (once again) is not random accessible in sense of
> codepoints (it's needs an UTF decode though it's clearly not needed in
> your case).
> The simplest workaround I can think of is convert needle to dstring:
> auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv
>
>

Nope, that doesn't work.

Here's what he should do:

import std.algorithm;
import std.array;
import std.file;
import std.stdio;

void main(string[] args) {
	foreach (string name; dirEntries(".", SpanMode.shallow)) {
		if (name[$-3 .. $] == "txt") {
			writeln(name);
			string text = readText(name);
			auto haystack = filter!("a >= '0' && a <='9'")(text);
			auto result = find(haystack, args[1]);
			writeln(result);
		}
	}
}

If you call array() on the filter you're already doing O(n) operations, meaning you're not going to get any performance benefit by being able to do Boyer-Moore.

Correct me if I'm wrong, but I believe find() will do Boyer-Moore if it knows its haystack is random access. This is the power of templated programming. find! can generate an efficient find algorithm if it knows the range is random access, or default to a less efficient one if not.
July 16, 2011
On 17.07.2011 1:28, Johann MacDonagh wrote:
> On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:
>>
>> Let's drill down to the problem through this barrage of crap:
>>
>> the problem statement is
>>
>> Error: cannot implicitly convert expression (haystack) of type dchar[]
>> to string
>>
>> So (apparently) the problem is that after array(filter!(... you get
>> array of dchars (unicode codepoints)as a result of filtering string
>> (which is UTF-8 under the hood btw) while you are going to search an
>> UTF-8 string.
>> And UTF-8 string is (once again) is not random accessible in sense of
>> codepoints (it's needs an UTF decode though it's clearly not needed in
>> your case).
>> The simplest workaround I can think of is convert needle to dstring:
>> auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv
>>
>>
>
> Nope, that doesn't work.
>
> Here's what he should do:
>
> import std.algorithm;
> import std.array;
> import std.file;
> import std.stdio;
>
> void main(string[] args) {
>     foreach (string name; dirEntries(".", SpanMode.shallow)) {
>         if (name[$-3 .. $] == "txt") {
>             writeln(name);
>             string text = readText(name);
>             auto haystack = filter!("a >= '0' && a <='9'")(text);
>             auto result = find(haystack, args[1]);
>             writeln(result);
>         }
>     }
> }
>
> If you call array() on the filter you're already doing O(n) operations, meaning you're not going to get any performance benefit by being able to do Boyer-Moore.
>

Actually with brute force you do O(mn) comparisons in general while there are Boyer-Moor implementations that will do O(n). Now with an alphabet of 0..9 I'd say you won't get a lot of speed up with Boyer-Moore, in any case I'd suggest OP to at least test both. In this case  however my bet is on  bruteforce (given an extra allocation on each string with B-M).

> Correct me if I'm wrong, but I believe find() will do Boyer-Moore if it knows its haystack is random access. This is the power of templated programming. find! can generate an efficient find algorithm if it knows the range is random access, or default to a less efficient one if not.

No, the problem of BoyerMoor search is there is some preparatory overhead to construct lookup table, apparently that table is stored in that BoyerMoorFinder. Frankly, std.algorithm might add at least one another sting searching algorithm e.g.
http://monge.univ-mlv.fr/~lecroq/string/

-- 
Dmitry Olshansky