[Issue 3136] New: Incorrect and strange behavior of std.regexp.RegExp if using a pattern with optional prefix and suffix longer than 1 char
Jul 08, 2009
Marcello Gnani
Jun 06, 2011
Dmitry Olshansky
July 05, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=3136

           Summary: Incorrect and strange behavior of std.regexp.RegExp if
                    using a pattern with optional prefix and suffix longer
                    than 1 char
           Product: D
           Version: 2.030
          Platform: x86
        OS/Version: Windows
            Status: NEW
          Severity: major
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: marcellognani@gmail.com


It seems that std.regexp.RegExp gets confused when I use a pattern with an
optional prefix and suffix longer than 1 char.
An expression of the form ([A]{0,2})(C)([D]{0,2}) matches all of "AC", "BC",
"CD", "CE", "ACD", "BCE", "ABCDE", "C" (as expected).
An expression of the form ([AB]{0,2})(C)([DE]{0,2}) or
([AB]?[AB]?)(C)([DE]?[DE]?) fails (incorrectly and unexpectedly) in some of the
cases above (both "CD" and "CE", for example).

Here is the code:
---
import std.regexp;
import std.stdio;

public
{
    static void main()
    {
        RegExp eTest;
        void SetExp(string pattern)
        {
            eTest=new RegExp(pattern,"g");
            std.stdio.writeln("Testing expression ",pattern);
        }
        void TryString(string s)
        {
            std.stdio.writeln("Trying on string\"",s,"\":");
            auto captures=eTest.exec(s);
            if(captures.length)
            {
                std.stdio.writeln("Success!");
                foreach(uint i,string capture;captures)
                    std.stdio.writeln(i,"): \"",capture,"\"");
            }
            else
            {
                std.stdio.writeln("Failure!");
            }
        }
        SetExp(r"([A]{0,2})(C)([D]{0,2})");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
        SetExp(r"([AB]{0,2})(C)([DE]{0,2})");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
        SetExp(r"([AB]?[AB]?)(C)([DE]?[DE]?)");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
    }
}
---

Here is the output:
---
Testing expression ([A]{0,2})(C)([D]{0,2})
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"CD":
Success!
0): "CD"
1): ""
2): "C"
3): "D"
Trying on string"CE":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"ABCDE":
Success!
0): "CD"
1): ""
2): "C"
3): "D"
Trying on string"C":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"F":
Failure!
Testing expression ([AB]{0,2})(C)([DE]{0,2})
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "BC"
1): "B"
2): "C"
3): ""
Trying on string"CD":
Failure!
Trying on string"CE":
Failure!
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "BCE"
1): "B"
2): "C"
3): "E"
Trying on string"ABCDE":
Success!
0): "ABCDE"
1): "AB"
2): "C"
3): "DE"
Trying on string"C":
Failure!
Trying on string"F":
Failure!
Testing expression ([AB]?[AB]?)(C)([DE]?[DE]?)
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "BC"
1): "B"
2): "C"
3): ""
Trying on string"CD":
Failure!
Trying on string"CE":
Failure!
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "BCE"
1): "B"
2): "C"
3): "E"
Trying on string"ABCDE":
Success!
0): "ABCDE"
1): "AB"
2): "C"
3): "DE"
Trying on string"C":
Failure!
Trying on string"F":
Failure!
---

Kind regards,
Marcello Gnani

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 08, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=3136





--- Comment #1 from Marcello Gnani <marcellognani@gmail.com>  2009-07-08 12:06:26 PDT ---
I had time to investigate further; the problem is caused by an incorrect
optimization that Phobos performs on the optional prefix.
The constructor of the RegExp object calls "public void compile(string
pattern, string attributes)", which builds a correct internal RegExp program;
it then attempts an optimization by calling the "void optimize()" function. In
this function, during the optimization of the REbit opcode (the opcode that
implements the prefix match when the prefix has more than one letter), the
optionality of the prefix is lost, leading to the incorrect behavior reported.
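
To illustrate why the optionality matters, here is a minimal sketch of how a
first-character filter can be derived from a pattern. It is a simplified model
and not the Phobos implementation: Item and firstChars are hypothetical names,
and each pattern element is reduced to a character class with a minimum repeat
count. What it tries to show is that an optional element (minimum 0) must not
end the scan, because a match may start with whatever follows it.

```d
/// Hypothetical, simplified pattern element (not a Phobos type):
/// a character class together with its minimum repeat count.
struct Item
{
    string chars;   // characters accepted by the class
    uint minCount;  // 0 means the element is optional
}

/// Collects every character that can start a match of the element sequence.
/// Returns false when every element is optional: the pattern can then match
/// the empty string, so no first-character filter is safe.
bool firstChars(const Item[] items, ref bool[char] set)
{
    foreach (item; items)
    {
        foreach (c; item.chars)
            set[c] = true;
        if (item.minCount > 0)
            return true;    // a mandatory element ends the scan
        // optional element: keep scanning so the next element contributes too
    }
    return false;
}

void main()
{
    // Models ([AB]{0,2})(C)([DE]{0,2})
    auto items = [Item("AB", 0), Item("C", 1), Item("DE", 0)];
    bool[char] set;
    assert(firstChars(items, set));
    assert('A' in set);
    assert('B' in set);
    assert('C' in set);     // needed so that "CD", "CE" and "C" pass the filter
    assert('D' !in set);    // the scan stopped at the mandatory "C"
}
```

A filter that accepted only 'A' and 'B' for this pattern would reject "CD",
"CE" and "C" before the matcher even runs, which would be consistent with the
failures in the output above.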

The simplest patch I came up with is a slight modification to the "int
starrchars(Range r, const(ubyte)[] prog)" function (which is called by
"optimize"). The current code
. . .
        case REnm:
        case REnmq:
        // len, n, m, ()
        len = (cast(uint *)&prog[i + 1])[0];
        n   = (cast(uint *)&prog[i + 1])[1];
        m   = (cast(uint *)&prog[i + 1])[2];
        pop = &prog[i + 1 + uint.sizeof * 3];
        if (!starrchars(r, pop[0 .. len]))
            return 0;
        if (n)
            return 1;
        i += 1 + uint.sizeof * 3 + len;
        break;
. . .
should return 0 if the n operand of the REnm opcode is 0 (this changes the line
before the break statement); this avoids the insertion of the
optionality-killing first filter:
. . .
        case REnm:
        case REnmq:
        // len, n, m, ()
        len = (cast(uint *)&prog[i + 1])[0];
        n   = (cast(uint *)&prog[i + 1])[1];
        m   = (cast(uint *)&prog[i + 1])[2];
        pop = &prog[i + 1 + uint.sizeof * 3];
        if (!starrchars(r, pop[0 .. len]))
            return 0;
        if (n)
            return 1;
        return 0;
        break;
. . .

I tried it and it works now.
Maybe this also fixes some other regexp bugs that are still open.
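
For readers not familiar with the operand layout used in the snippets above,
the "// len, n, m, ()" comment and the casts suggest that the opcode byte is
followed by three uint operands and then by the sub-program of len bytes. The
sketch below decodes such a record. NmOperands, readNm and the opcode value
are hypothetical names and values chosen for illustration, and the reading of
n and m as the minimum and maximum repeat counts is my assumption.

```d
import core.stdc.string : memcpy;

/// Hypothetical view of the three operands that follow an REnm opcode byte.
struct NmOperands
{
    uint len;   // byte length of the repeated sub-program
    uint n;     // assumed: minimum repeat count
    uint m;     // assumed: maximum repeat count
}

/// Reads the operands stored right after the opcode byte at prog[i].
NmOperands readNm(const(ubyte)[] prog, size_t i)
{
    NmOperands ops;
    // memcpy avoids assuming the operands are aligned for uint access
    memcpy(&ops, &prog[i + 1], NmOperands.sizeof);
    return ops;
}

void main()
{
    enum ubyte fakeREnm = 0x2A;     // hypothetical opcode value
    ubyte[] prog = [fakeREnm];
    foreach (uint v; [2u, 0u, 2u])  // len = 2, n = 0, m = 2
        prog ~= (cast(ubyte*) &v)[0 .. uint.sizeof];
    prog ~= [cast(ubyte) 0, cast(ubyte) 0];  // placeholder sub-program

    auto ops = readNm(prog, 0);
    assert(ops.len == 2 && ops.n == 0 && ops.m == 2);
    // n == 0 is the case the patch above treats specially: the group is optional
    assert(ops.n == 0);
}
```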

Best regards,
Marcello Gnani

October 11, 2009
http://d.puremagic.com/issues/show_bug.cgi?id=3136


Andrei Alexandrescu <andrei@metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |andrei@metalanguage.com
         AssignedTo|nobody@puremagic.com        |andrei@metalanguage.com


June 05, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=3136


Andrei Alexandrescu <andrei@metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|andrei@metalanguage.com     |dmitry.olsh@gmail.com


--- Comment #2 from Andrei Alexandrescu <andrei@metalanguage.com> 2011-06-05 08:11:26 PDT ---
Reassigning to Dmitry.

June 06, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=3136


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


--- Comment #3 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2011-06-06 08:03:48 PDT ---
Fixed for std.regex https://github.com/D-Programming-Language/phobos/commit/9afb00e36b625322d7f1d8ec0fbd876c2b5c03fc
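
A minimal regression check for the cases in the original report could look
like the sketch below. It assumes a current Phobos where std.regex provides
matchFirst (the API available at the time of the fix may have differed), and
it is not taken from the linked commit.

```d
import std.regex : matchFirst, regex;

void main()
{
    // The two patterns that failed under std.regexp in the original report.
    auto patterns = [r"([AB]{0,2})(C)([DE]{0,2})",
                     r"([AB]?[AB]?)(C)([DE]?[DE]?)"];
    // Inputs from the report that consist only of characters the patterns accept.
    auto inputs = ["AC", "BC", "CD", "CE", "ACD", "BCE", "ABCDE", "C"];

    foreach (p; patterns)
    {
        auto re = regex(p);
        foreach (s; inputs)
        {
            auto m = matchFirst(s, re);
            assert(!m.empty, "no match for " ~ s ~ " with " ~ p);
            assert(m[0] == s);      // the whole input is matched
            assert(m[2] == "C");    // the mandatory middle group
        }
        // "F" contains no "C", so it must not match.
        assert(matchFirst("F", re).empty);
    }
}
```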
