[Issue 4250] New: std.regex does not support character sets other than unicode
May 29, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4250

           Summary: std.regex does not support character sets other than
                    unicode
           Product: D
           Version: 2.041
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: lio+bugzilla@lunesu.com


--- Comment #0 from Lionello Lunesu <lio+bugzilla@lunesu.com> 2010-05-29 07:46:59 PDT ---
Created an attachment (id=647)
Patch against phobos/std/regex.d in dmd.2.046.zip

I'm writing an application that works with Chinese text encoded in GBK (http://en.wikipedia.org/wiki/GBK). I could convert all the text to UTF-8 before using regex, but it's much faster to leave the text as-is and convert only the regular expression to GBK instead.

I suspect the following opcodes need patching:
1. REanychar uses std.utf.stride;
2. REdchar and REidchar are used when the character in the regex is >= 0x80;
3. REichar and REidchar use std.ctype.toupper (during both creation and execution).

Points 1 and 3 are easily solved by letting the user supply callback functions.
To prevent unnecessary indirection, these can be aliases if
(is(__traits(compiles, std.utf.stride(new E[], 0)))).

Attached is a proof-of-concept patch for point 1. If this is OK, I can do the same for points 2 and 3 as well. (Point 2 might not even need a patch; I'm not sure about that yet.)
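
The callback idea in points 1 and 3 can be sketched roughly as follows. This is only an illustration, not the attached patch: gbkStride, StrideFor and countChars are hypothetical names, and the lead-byte range 0x81..0xFE is assumed from the GBK layout. Only std.utf.stride is an existing Phobos function.

```d
import std.utf : stride;

/// Hypothetical GBK counterpart of std.utf.stride: the number of bytes
/// occupied by the character that starts at s[i].
size_t gbkStride(const(ubyte)[] s, size_t i)
{
    immutable ubyte b = s[i];
    // Assumption: 0x81..0xFE is a GBK lead byte followed by one trail byte;
    // every other byte (ASCII in particular) stands alone.
    immutable isLead = b >= 0x81 && b <= 0xFE;
    return (isLead && i + 1 < s.length) ? 2 : 1;
}

/// Pick the stride function at compile time: alias std.utf.stride when it
/// accepts E[], otherwise use the user-supplied fallback.
template StrideFor(E, alias fallback)
{
    static if (__traits(compiles, stride((E[]).init, size_t(0))))
        alias StrideFor = stride;
    else
        alias StrideFor = fallback;
}

/// Count characters by stepping through s with the selected stride.
size_t countChars(alias strideFn, E)(const(E)[] s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += strideFn(s, i))
        ++n;
    return n;
}

void main()
{
    // 'a' followed by one two-byte GBK character (0xD6 0xD0).
    immutable(ubyte)[] gbk = [0x61, 0xD6, 0xD0];
    assert(countChars!(StrideFor!(ubyte, gbkStride))(gbk) == 2);

    // char[] takes the std.utf.stride path: 'a' plus a three-byte sequence.
    assert(countChars!(StrideFor!(char, gbkStride))("a\u4E2D") == 2);
}
```

Because the selection happens in a static if, the UTF case keeps calling std.utf.stride directly and pays no extra indirection; only non-UTF element types go through the fallback.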

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
May 30, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Lionello Lunesu <lio+bugzilla@lunesu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #647 is|0                           |1
           obsolete|                            |


--- Comment #1 from Lionello Lunesu <lio+bugzilla@lunesu.com> 2010-05-29 17:53:38 PDT ---
Created an attachment (id=648)
Patch against phobos/std/regex.d in dmd.2.046.zip

Fixed the diff.

May 30, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4250



--- Comment #2 from Lionello Lunesu <lio+bugzilla@lunesu.com> 2010-05-29 18:02:11 PDT ---
Created an attachment (id=649)
Testcase (using GB18030-encoded data with std.regex)

May 30, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Walter Bright <bugzilla@digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla@digitalmars.com
           Severity|normal                      |enhancement


--- Comment #3 from Walter Bright <bugzilla@digitalmars.com> 2010-05-30 11:02:48 PDT ---
It's not designed to do anything but UTF, so marked as an enhancement request.

January 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Andrei Alexandrescu <andrei@metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |andrei@metalanguage.com
         AssignedTo|nobody@puremagic.com        |andrei@metalanguage.com


March 12, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com
         AssignedTo|andrei@metalanguage.com     |dmitry.olsh@gmail.com


--- Comment #4 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-03-12 03:34:56 PDT ---
The first straightforward step would be to add an option to skip UTF processing,
assuming the input is plain ASCII; that covers an important use case.
The next move largely depends on std.encoding or whatever it turns out to be.
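
As a rough illustration of why the plain-ASCII case is cheap: when every code unit is below 0x80, each character is exactly one code unit, so no decoding step is needed. The sketch below is not part of std.regex; isPlainAscii is a hypothetical helper, and std.utf.count is used only to cross-check the result.

```d
import std.utf : count;

/// Hypothetical helper: true when every code unit of s is below 0x80.
bool isPlainAscii(const(char)[] s)
{
    foreach (char c; s) // iterates code units, no UTF decoding
        if (c >= 0x80)
            return false;
    return true;
}

void main()
{
    // Plain ASCII: the code point count equals the code unit count.
    assert(isPlainAscii("2010-05-29"));
    assert(count("2010-05-29") == "2010-05-29".length);

    // One non-ASCII character: 4 code points, but 5 UTF-8 code units.
    assert(!isPlainAscii("caf\u00E9"));
    assert(count("caf\u00E9") == 4);
    assert("caf\u00E9".length == 5);
}
```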

July 22, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #648 is|0                           |1
           obsolete|                            |


--- Comment #5 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-07-22 08:21:13 PDT ---
(From update of attachment 648)
The old regex has been gone for good since 2.056.
