Issues with std.regex
February 16, 2013
Hey all,

I'm currently trying to port a small toy language I invented a while back from Java to D. However, a main part of my lexical analyzer was regular expression matching, which I've been having issues with in D. The regular expression in question is as follows:

[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]

This works well enough in Java to produce a series of tokens that I could then pass to my parser. But when I tried to port this to D, I almost always got an error when using brackets, braces, or parentheses. I've tried several different combinations, looked through the std.regex library reference, Googled the issue, tested my regular expression in several online regex testers (primarily http://regexpal.com/ and http://regexhelper.com/), and even looked it up in the book "The D Programming Language" (good book, by the way), yet I still can't get it working right. Here's the code I've been using:

...
auto tempCont = cast(char[])read(location, fileSize);
string contents = cast(string)tempCont;
auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");
auto m = match(contents, reg);
auto token = m.captures;
...

When I try to run the code above, I get:
parser.d(64): Error: undefined escape sequence \[
parser.d(64): Error: undefined escape sequence \]

When I remove the escaped characters (turning my regex into
"[ 0-9a-zA-Z.*=+-;()\"\'[]<>,{}^#/\\]"), I get no issues compiling or linking. However, on first run, I get the following error (I cut the error short, full error is pasted http://pastebin.com/vjMhkx4N):

std.regex.RegexException@/usr/include/dmd/phobos/std/regex.d(1942): wrong CodepointSet
Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- `<>,{}^#/\]`

I'm very confused about what to do, and much of the information in the library reference seems to contradict what I'm doing. Any help would be greatly appreciated!

Thanks!
~Mr. Appleseed

Additional information:

OS/Compiler information:
Ubuntu 12.10 x64
DMD64 D Compiler v2.061

Compiled with:
dmd main.d parser.d





February 16, 2013
On 2013-02-16 21:22, MrAppleseed wrote:
> auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");
>
> When I try to run the code above, I get:
> parser.d(64): Error: undefined escape sequence \[
> parser.d(64): Error: undefined escape sequence \]
>
> When I remove the escaped characters (turning my regex into
> "[ 0-9a-zA-Z.*=+-;()\"\'[]<>,{}^#/\\]"), I get no issues compiling or linking.
> However, on first run, I get the following error (I cut the error short, full
> error is pasted http://pastebin.com/vjMhkx4N):
>
> std.regex.RegexException@/usr/include/dmd/phobos/std/regex.d(1942): wrong
> CodepointSet
> Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- `<>,{}^#/\]`
>

Perhaps try this:  "[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\]"

February 16, 2013
On Sat, Feb 16, 2013 at 09:22:07PM +0100, MrAppleseed wrote:
> Hey all,
> 
> I'm currently trying to port a small toy language I invented a while back from Java to D. However, a main part of my lexical analyzer was regular expression matching, which I've been having issues with in D. The regular expression in question is as follows:
> 
> [ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]
> 
> This works well enough in Java to produce a series of tokens that I could then pass to my parser. But when I tried to port this to D, I almost always got an error when using brackets, braces, or parentheses. I've tried several different combinations, looked through the std.regex library reference, Googled the issue, tested my regular expression in several online regex testers (primarily http://regexpal.com/ and http://regexhelper.com/), and even looked it up in the book "The D Programming Language" (good book, by the way), yet I still can't get it working right. Here's the code I've been using:
> 
> ...
> auto tempCont = cast(char[])read(location, fileSize);
> string contents = cast(string)tempCont;
> auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");

The problem is that you're using D's double-quoted string literal, which adds another level of interpretation to the \'s. What you should do is to use the backtick string literal, which does *not* interpret backslashes:

auto reg = regex(`[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]`);

If you have trouble typing `, you can also use r"...", which means the same thing.
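
A minimal sketch of the two levels of interpretation (not part of the original post; `sameText` is a hypothetical helper name used only for illustration). It also shows why \" is still a problem in a WYSIWYG literal: the backslash is no longer consumed by the compiler, so std.regex receives a backslash followed by a quote.

```d
// Hypothetical helper, for illustration only: the three literal kinds
// below all denote the same two characters, backslash + '['.
bool sameText()
{
    string dq  = "\\[";   // double-quoted: the compiler turns \\ into one backslash
    string bt  = `\[`;    // backtick (WYSIWYG): nothing is interpreted
    string raw = r"\[";   // r"..." (WYSIWYG): same as the backtick form
    return dq == bt && bt == raw && dq.length == 2;
}

void main()
{
    assert(sameText());

    // The quote escape differs between the two kinds of literal:
    assert("\"".length == 1);   // double-quoted: just the quote character
    assert(`\"`.length == 2);   // backtick: backslash AND quote are kept
}
```

Note that r"..." cannot contain a double quote at all, so for a pattern that must match " the backtick form is the practical choice.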

Hope this helps.


--T
February 16, 2013
As long as there is \" I get the same error.
February 16, 2013
On Saturday, 16 February 2013 at 20:35:48 UTC, H. S. Teoh wrote:
> On Sat, Feb 16, 2013 at 09:22:07PM +0100, MrAppleseed wrote:
>> Hey all,
>> 
>> I'm currently trying to port a small toy language I invented a while
>> back from Java to D. However, a main part of my lexical analyzer was
>> regular expression matching, which I've been having issues with in
>> D. The regular expression in question is as follows:
>> 
>> [ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]
>> 
>> This works well enough in Java to produce a series of tokens that I
>> could then pass to my parser. But when I tried to port this to D,
>> I almost always got an error when using brackets, braces, or
>> parentheses. I've tried several different combinations, looked
>> through the std.regex library reference, Googled the issue,
>> tested my regular expression in several online regex testers
>> (primarily http://regexpal.com/ and http://regexhelper.com/), and
>> even looked it up in the book "The D Programming Language"
>> (good book, by the way), yet I still can't get it working right.
>> Here's the code I've been using:
>> 
>> ...
>> auto tempCont = cast(char[])read(location, fileSize);
>> string contents = cast(string)tempCont;
>> auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");
>
> The problem is that you're using D's double-quoted string literal, which
> adds another level of interpretation to the \'s. What you should do is
> to use the backtick string literal, which does *not* interpret
> backslashes:
>
> auto reg = regex(`[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]`);
>
> If you have trouble typing `, you can also use r"...", which means the
> same thing.
>
> Hope this helps.
>
>
> --T

Thanks for the quick reply!

I replaced the double-quotes with backticks, compiled it with no problems, but on the first run I got a similar error:

std.regex.RegexException@/usr/include/dmd/phobos/std/regex.d(1942): invalid escape sequence
Pattern with error: `[ 0-9a-zA-Z.*=+-;()\"` <--HERE-- `\'[]<>,{}^#/\\]`

After removing the invalid escape sequence, I compiled it, once again with no problems, and attempted to run it, but I got the same error as before:

std.regex.RegexException@/usr/include/dmd/phobos/std/regex.d(1942): wrong CodepointSet
Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- `<>,{}^#/\\]`

(Entire error here: http://pastebin.com/Su9XzbXW)
February 16, 2013
On Saturday, 16 February 2013 at 20:33:15 UTC, FG wrote:
> On 2013-02-16 21:22, MrAppleseed wrote:
>> auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");
>>
>> When I try to run the code above, I get:
>> parser.d(64): Error: undefined escape sequence \[
>> parser.d(64): Error: undefined escape sequence \]
>>
>> When I remove the escaped characters (turning my regex into
>> "[ 0-9a-zA-Z.*=+-;()\"\'[]<>,{}^#/\\]"), I get no issues compiling or linking.
>> However, on first run, I get the following error (I cut the error short, full
>> error is pasted http://pastebin.com/vjMhkx4N):
>>
>> std.regex.RegexException@/usr/include/dmd/phobos/std/regex.d(1942): wrong
>> CodepointSet
>> Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- `<>,{}^#/\]`
>>
>
> Perhaps try this:  "[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\]"

Hey,

Thanks for the reply! You guys are quite the friendly people. :)

I made the changes you suggested above, and although it compiled fine, on the first run I got a similar error:

std.regex.RegexException@/usr/include/dmd/phobos/std/regex.d(1942): unexpected end of CodepointSet
Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\]` <--HERE-- ``

(Full error is here: http://pastebin.com/rTmHuVjG)

February 16, 2013
> std.regex.RegexException@/usr/include/dmd/phobos/std/regex.d(1942): wrong CodepointSet
> Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'[]` <--HERE-- `<>,{}^#/\\]`
>
> (Entire error here: http://pastebin.com/Su9XzbXW)

You need to put \ in front of [ or ] if you want to match those two characters. The relevant part of std.regex documentation:

\c where c is one of [|*+?()	Matches the character c itself.
February 16, 2013
> Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\]` <--HERE-- ``

The problem here is that you have \ right before the ] at the end of the string. Because it is preceded by \, ] is interpreted as a character you are matching on, not as a closing bracket for the initial [. If you want to match \ you need this:

[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\\]
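
A small sketch of the same idea on a reduced character class (the class and the helper name `inClass` are made up for illustration, and it assumes a Phobos release that provides `matchFirst`, which as far as I know is newer than the 2.061 mentioned at the top of the thread; with older releases `match` should play the same role):

```d
import std.regex;

// Hypothetical helper: true if `s` contains a character from the class.
bool inClass(string s)
{
    // Reduced class for illustration: lowercase letters, '[', ']' and '\'.
    // The backtick literal hands \[ \] and \\ to std.regex exactly as written,
    // so the final ] is unescaped and closes the class.
    auto re = regex(`[a-z\[\]\\]`);
    return !matchFirst(s, re).empty;
}

void main()
{
    assert(inClass("["));
    assert(inClass("]"));
    assert(inClass(`\`));   // a single backslash
    assert(inClass("q"));
    assert(!inClass("9"));  // digits are not in this reduced class
}
```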
February 16, 2013
On 2013-02-16 22:36, MrAppleseed wrote:
>> Perhaps try this:  "[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\]"
>
> I made the changes you suggested above, and although it compiled fine, on the
> first run I got a similar error:
>
> std.regex.RegexException@/usr/include/dmd/phobos/std/regex.d(1942): unexpected
> end of CodepointSet
> Pattern with error: `[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\]` <--HERE-- ``

Ah, right. Sorry for that. You'd need as many as 4 backslashes there. :)
"[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\\\]"

Ain't pretty so it's better to go with raw strings, but apparently there are some problems with them right now, looking at the other posts here, right?
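
As a quick check of the four-backslash version (a sketch only; the names `doubleQuoted` and `wysiwyg` are invented here), the double-quoted literal and the backtick literal below denote exactly the same pattern text, which can be verified at compile time without involving std.regex:

```d
// Double-quoted form: the compiler resolves \" \' and every \\ pair.
enum doubleQuoted = "[ 0-9a-zA-Z.*=+-;()\"\'\\[\\]<>,{}^#/\\\\]";

// WYSIWYG form: the text that std.regex actually receives.
enum wysiwyg = `[ 0-9a-zA-Z.*=+-;()"'\[\]<>,{}^#/\\]`;

// Checked by the compiler; a mismatch would be a build error.
static assert(doubleQuoted == wysiwyg);

void main() {}
```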
February 17, 2013
17-Feb-2013 01:36, MrAppleseed wrote:
> On Saturday, 16 February 2013 at 20:33:15 UTC, FG wrote:
>> On 2013-02-16 21:22, MrAppleseed wrote:
>>> auto reg = regex("[ 0-9a-zA-Z.*=+-;()\"\'\[\]<>,{}^#/\\]");
>>>
>>> When I try to run the code above, I get:
>>> parser.d(64): Error: undefined escape sequence \[
>>> parser.d(64): Error: undefined escape sequence \]
>>>
>>> When I remove the escaped characters (turning my regex into
>>> "[ 0-9a-zA-Z.*=+-;()\"\'[]<>,{}^#/\\]"), I get no issues compiling or
>>> linking.

As others noted, the problem is twofold:

- Escaping special characters (both at the D string literal level and at the regex level). So just use `` or r"" or some other form of WYSIWYG string *AND* escape anything that is part of regex syntax.

- [] are used for nesting, since std.regex supports set-wise operations inside a [...] character class, e.g.
[[A-Z]&&[A-D]]
means intersection and yields the set [A-D]. This gets more useful with Unicode character sets.
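
A minimal sketch of the intersection example (the helper name `inIntersection` is hypothetical, and `matchFirst` is assumed to be available in the Phobos version at hand):

```d
import std.regex;

// Hypothetical helper: true if `s` contains a character that lies in
// both [A-Z] and [A-D], i.e. in the intersection described above.
bool inIntersection(string s)
{
    // Backtick literal so the pattern reaches std.regex unchanged.
    auto re = regex(`[[A-Z]&&[A-D]]`);
    return !matchFirst(s, re).empty;
}

void main()
{
    assert(inIntersection("C"));   // in both sets
    assert(!inIntersection("X"));  // in [A-Z] only
}
```

Because an unescaped [ inside a class opens a nested set like this, a literal bracket has to be written as \[ (and \] for the closing one).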


-- 
Dmitry Olshansky