December 06, 2013 Searching for a string in a text buffer with a regular expression | ||||
---|---|---|---|---|
| ||||
While porting a simple Python script to D, I found the following problem. I need to read in a few thousand small text files and search each one for a match with a given regular expression. Obviously, the program can't (and should not) be certain about the encoding of each input file. I initially used read() with a cast(char[]), but at some point the regex engine failed with an exception: it encountered a byte sequence it couldn't decode as UTF-8. This is right, since char[] is not byte[]. Now I'm casting to Latin1String, since I know this is the right encoding for the input buffers, and it works fine at last... but what if I needed to handle a RAW (binary? unknown encoding?) buffer? Is there a simple and elegant solution in D for such a case? Python didn't give such problems! |
December 06, 2013 Re: Searching for a string in a text buffer with a regular expression | ||||
---|---|---|---|---|
| ||||
Posted in reply to maxpat78 | maxpat78:
> Is there a simple and elegant solution in D for such a case?
> Python didn't give such problems!
Do you mean Python3?
Bye,
bearophile
|
December 06, 2013 Re: Searching for a string in a text buffer with a regular expression | ||||
---|---|---|---|---|
| ||||
Posted in reply to maxpat78 | On 2013-12-06 08:53:04 +0000, maxpat78 said:
> While porting a simple Python script to D, I found the following problem.
>
> I need to read in a few thousand small text files and search each one for a match with a given regular expression.
>
> Obviously, the program can't (and should not) be certain about the encoding of each input file.
>
> I initially used read() with a cast(char[]), but at some point the regex engine failed with an exception: it encountered a byte sequence it couldn't decode as UTF-8. This is right, since char[] is not byte[].
>
> Now I'm casting to Latin1String, since I know this is the right encoding for the input buffers, and it works fine at last... but what if I needed to handle a RAW (binary? unknown encoding?) buffer?
>
> Is there a simple and elegant solution in D for such a case?
> Python didn't give such problems!
Why don't you follow one of the file reading examples? readText is what you're looking for.
http://dlang.org/phobos/std_file.html#.readText |
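A note on the readText suggestion, as a rough sketch rather than a definitive answer: std.file.readText is documented to validate its input as UTF, so a Latin-1 file with accented characters would be expected to raise a std.utf.UTFException there too. The check can be tried on an in-memory buffer with std.utf.validate. The helper name isValidUtf8 and the sample bytes below are made up for illustration.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `raw` is well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Pure ASCII is always valid UTF-8.
    assert(isValidUtf8(cast(const(ubyte)[]) "plain ASCII"));

    // "più" encoded as Latin-1: the byte 0xF9 is not valid UTF-8.
    ubyte[] latin1 = [0x70, 0x69, 0xF9];
    assert(!isValidUtf8(latin1));
}
```

A check like this could be used to decide, file by file, whether a buffer can be treated as a string directly or needs to be transcoded first.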
December 09, 2013 Re: Searching for a string in a text buffer with a regular expression | ||||
---|---|---|---|---|
| ||||
Posted in reply to Shammah Chancellor | I mean a code fragment like this:

    foreach (i; 1 .. 2085)
    {
        // Bugbug: when we read in the buffer, we can't know anything about its encoding...
        // But REGEX could fail if it contained unknown chars!
        Latin1String buf;
        string s;
        try
        {
            buf = cast(Latin1String) read(format("psi\\psi%04d.htm", i));
            transcode(buf, s);
        }
        catch (Exception e)
        {
            writeln("Last record (", i, ") reached.");
            exit(1);
        }
        // Exception "Invalid UTF-8 sequence @index 1" in file 55
        enum rx = ctRegex!(`<p class="aggiornamentoAlbo">.+?</div>`, "gs");
        auto m = match(s, rx);
        if (!m.empty())
        {
            if (indexOf(m.captures[0], "xxxxxxxx", 0) > -1 && indexOf(m.captures[0], "1983", 0) > -1)
                writeln(m.captures[0]);
        }
    }

The question is: what kind of cast should I use to safely scan (that is, without conversion exceptions being raised) every possible kind of textual (or binary) buffer, like in Python 2.7.x? Thanks! |
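One possible approach to the question above, offered as a minimal sketch and not as the canonical D answer: since every byte value corresponds to exactly one Latin-1 character, interpreting an arbitrary buffer as Latin-1 and transcoding it to a string should not fail on any input, and an ASCII-only pattern can then be matched against the result. The helper name decodeAsLatin1, the sample bytes and the pattern are assumptions made for illustration.

```d
import std.encoding : Latin1String, transcode;
import std.regex : matchFirst, regex;

/// Hypothetical helper: interpret every byte of `raw` as one Latin-1
/// character and return the text re-encoded as a UTF-8 string.
string decodeAsLatin1(const(ubyte)[] raw)
{
    string s;
    // idup gives an immutable copy, so the cast to Latin1String is safe.
    transcode(cast(Latin1String) raw.idup, s);
    return s;
}

void main()
{
    // "<b>più</b>" encoded as Latin-1 (0xF9 is 'ù'); not valid UTF-8.
    immutable(ubyte)[] raw =
        [0x3C, 0x62, 0x3E, 0x70, 0x69, 0xF9, 0x3C, 0x2F, 0x62, 0x3E];

    auto text = decodeAsLatin1(raw);
    auto m = matchFirst(text, regex(`<b>(.+?)</b>`));

    assert(!m.empty);
    assert(m[1] == "più");
}
```

For buffers that are not really Latin-1, the non-ASCII characters in the decoded string would be wrong, but ASCII parts of the pattern would still match at the same places, which is roughly what a byte-oriented search gives you in Python 2.7.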