[Issue 7551] New: Regex parsing bug for right bracket in character class
February 20, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551

           Summary: Regex parsing bug for right bracket in character class
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: magnus@hetland.org


--- Comment #0 from Magnus Lie Hetland <magnus@hetland.org> 2012-02-20 03:06:36 PST ---
It seems that a bug has appeared in how std.regex handles character sets. In previous versions, a right bracket could be included in a character set by placing it first, as is the case in many other languages/libraries. In the current version (I'm using the canned DMD 2.058 for OS X), that doesn't work:

import std.regex;
void main() {
    auto r = regex("[]]");
}

This gives the following exception:

std.regex.RegexException@/usr/share/dmd/src/phobos/std/regex.d(1951): wrong CodepointSet
Pattern with error: `[]` <--HERE-- `]`

This should probably be permitted, as a "least surprise" practice, and to preserve compatibility with older versions. (It doesn't seem to be explicitly documented in the standard library docs, though. Then again, as far as I can see, no other mechanism for including right brackets in charsets is documented either.)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
February 24, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com


--- Comment #1 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-02-24 11:28:25 PST ---
It's perfectly fine to use escapes for special characters:

import std.regex;
void main() {
    auto r = regex("[\]]");
}

The reason for killing the "first bracket doesn't count" rule (if I ever knew it
existed) is that the new regex allows doing things like
[[abc0-9]--[bcd||1-9]]
i.e. set operations
The above should get you [a0] (everything in [abc0-9] that is not in the union of [bcd] and [1-9]); it's more useful with \p{xxx} things.
Basically, brackets matter more now.
But these "many other languages" (or, better, libraries): which ones? Unless
there is strong precedent, I'm not adding another special case.
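
As an illustration of the set operations mentioned above, here is a minimal sketch using a simpler difference than the one in the comment. It assumes a current Phobos (matchFirst was not yet available in 2.058, where match would be used instead), and the helper name isLowerConsonant is made up for this example.

```d
import std.regex;

// Hypothetical helper: true when `s` is a single lowercase ASCII consonant.
bool isLowerConsonant(string s)
{
    // [a-z] minus [aeiou], written with the -- set-difference operator.
    // The raw string keeps the pattern free of D-level escapes.
    auto re = regex(r"^[[a-z]--[aeiou]]$");
    return !matchFirst(s, re).empty;
}

void main()
{
    assert(isLowerConsonant("b"));
    assert(!isLowerConsonant("a")); // vowel, removed by the difference
    assert(!isLowerConsonant("B")); // uppercase, never in [a-z]
}
```

Because an inner '[' now opens a nested set, a bare bracket inside a class can no longer be read as a literal without ambiguity, which seems to be the motivation given for dropping the old rule.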

February 27, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551



--- Comment #2 from Magnus Lie Hetland <magnus@hetland.org> 2012-02-27 00:44:59 PST ---
It did exist in the previous version -- my code broke with the new regexp engine, but worked before :-)

If this is a conscious choice, then that's totally fine by me. Special cases aren't the right way to go when the general mechanism works. I had some trouble getting this to work (did something like what you wrote here, which won't work -- but double-escaping does, of course), so I ended up using the or-operator, which was kind of hackish ;-)
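
For reference, a small sketch of the two approaches mentioned here: double-escaping inside an ordinary string literal, and an alternation that keeps the bracket out of the class. The exact patterns are assumptions made for illustration (the original code is not shown in the thread), countMatches is a made-up helper, and matchAll assumes a current Phobos.

```d
import std.regex;

// Hypothetical helper: counts the non-overlapping matches of `pattern` in `text`.
size_t countMatches(string text, string pattern)
{
    size_t n = 0;
    foreach (m; matchAll(text, regex(pattern)))
        ++n;
    return n;
}

void main()
{
    // Ordinary literal: "\\" yields one backslash, so the engine sees [\]x].
    assert(countMatches("a]x]", "[\\]x]") == 3);

    // Alternation workaround: the bracket is matched outside the class.
    // Backtick (WYSIWYG) strings pass the backslash through unchanged.
    assert(countMatches("a]x]", `(?:\]|[x])`) == 3);
}
```

A WYSIWYG literal such as r"[\]x]" avoids the doubled backslash altogether.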

So, yeah, I guess I "retract" my bug report :->

As for other languages: Yeah, I think this is pretty common. E.g., Python (http://docs.python.org/library/re.html) and in Perl and Perl-compatible regexps, as used in all kinds of places, such as PHP, Apache, Safari, … (http://www.php.net/manual/en/regexp.reference.character-classes.php).

So I think the "place member end brackets as first character" is the "industry standard" behavior.

But as a compromise: Perhaps a useful error message pointing out the escape thing could be added? Or it could be explicitly pointed out in a note in the documentation (to avoid special-casing the error code)?

I think some kind of "least surprise" handling for people coming from basically anywhere else might be useful ;-)

February 27, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551



--- Comment #3 from Magnus Lie Hetland <magnus@hetland.org> 2012-02-27 00:51:18 PST ---
This whole thing goes for start brackets, too, I guess. As far as I can see, they, too, must be escaped when used inside character classes, now. This follows from the definition in the docs, for sure, but wasn't entirely obvious to me -- especially given that it worked before. (I.e., that was another thing that broke in my code recently, when upgrading.)
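
A minimal sketch of a class containing both brackets, each escaped, assuming a current Phobos (matchFirst); hasBracket is a name invented for this example.

```d
import std.regex;

// Hypothetical helper: true when `s` contains '[' or ']'.
bool hasBracket(string s)
{
    // Both brackets are escaped inside the class; the raw string
    // avoids having to double the backslashes at the D level.
    auto re = regex(r"[\[\]]");
    return !matchFirst(s, re).empty;
}

void main()
{
    assert(hasBracket("a[0"));
    assert(hasBracket("0]"));
    assert(!hasBracket("plain text"));
}
```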

February 27, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement


--- Comment #4 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-02-27 02:36:06 PST ---
Full backwards compatibility looked like a nice idea at the start.
I increasingly regret that decision, as things still got broken when I had to
add new features that block some undocumented behavior.

Ehm escape sequences were partly broken in 2.057 ... sorry about that.

BTW this page shows that [ and ] should be escaped, and says not a single word about ] being used as the first character (unlike '-', which is supported). http://www.php.net/manual/en/regexp.reference.character-classes.php

About Python, heh, I'm eager to see how they would go about adding set operations without breaking compatibility (they count [ as a plain '[' in the middle of a charset). I guess with a brand new module, if they ever do.

> 
> But as a compromise: Perhaps a useful error message pointing out the escape thing could be added? Or it could be explicitly pointed out in a note in the documentation (to avoid special-casing the error code)?
> 
> I think some kind of "least surprise" handling for people coming from basically anywhere else might be useful ;-)

Hm.. that's a good idea. Hereby it's an enhancement request ;)

February 27, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551



--- Comment #5 from Magnus Lie Hetland <magnus@hetland.org> 2012-02-27 03:18:50 PST ---
Quoting Dmitry:
> BTW this page shows that [ and ] should be escaped, and not a single word on it used as first character (unlike '-' that is supported). http://www.php.net/manual/en/regexp.reference.character-classes.php

Huh? Did you read the first paragraph…?-)

Quoted, for your convenience (my highlight):
> An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. **If a closing square bracket is required as a member of the class, it should be the first data character in the class** (after an initial circumflex, if present) or escaped with a backslash.

It says so right there, no? This is the way it's been in several languages I've used throughout the years. I guess they just didn't have escaping inside character classes in the olden days ;-)

February 27, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7551



--- Comment #6 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-02-27 05:18:12 PST ---
(In reply to comment #5)
> Quoting Dmitry:
> > BTW this page shows that [ and ] should be escaped, and not a single word on it used as first character (unlike '-' that is supported). http://www.php.net/manual/en/regexp.reference.character-classes.php
> 
> Huh? Did you read the first paragraph…?-)

Searching got the better of me :( I grepped for "["

> 
> Quoted, for your convenience (my highlight):
> > An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. **If a closing square bracket is required as a member of the class, it should be the first data character in the class** (after an initial circumflex, if present) or escaped with a backslash.
> 
> It says so right there, no? This is the way it's been in several languages I've used throughout the years. I guess they just didn't have escaping inside character classes in the olden days ;-)

Apparently it's one of these historical kind of things.
