May 29, 2010 [Issue 4250] New: std.regex does not support character sets other than unicode

http://d.puremagic.com/issues/show_bug.cgi?id=4250

Summary: std.regex does not support character sets other than unicode
Product: D
Version: 2.041
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Phobos
AssignedTo: nobody@puremagic.com
ReportedBy: lio+bugzilla@lunesu.com

--- Comment #0 from Lionello Lunesu <lio+bugzilla@lunesu.com> 2010-05-29 07:46:59 PDT ---

Created an attachment (id=647)
Patch against phobos/std/regex.d in dmd.2.046.zip

I'm writing an application that works with Chinese text encoded in GBK, http://en.wikipedia.org/wiki/GBK . I could convert all the text to UTF-8 before using regex, but it's much faster to leave the text as-is and convert only the regular expression to GBK instead.

I suspect the following opcodes need patching:

1. REanychar uses std.utf.stride;
2. REdchar and REidchar are used when the character in the regex is >= 0x80;
3. REichar and REidchar use std.ctype.toupper (during creation and execution).

Points 1 and 3 are easily solved by providing the user with callback functions. To prevent unnecessary indirection, these can be aliases if (is(__traits(compiles, std.utf.stride(new E[], 0)))).d

Attached is a proof-of-concept patch for point 1. If this is OK, I can do the same for points 2 and 3 as well. (Point 2 might not even need a patch; I'm not clear about that yet.)

--
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
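As a rough illustration of point 1, the sketch below shows how an "advance by one character" step could take its stride function as an alias, so that UTF-8 text uses std.utf.stride while GBK text uses a user-supplied function. The names gbkStride and countChars are hypothetical and are not taken from the attached patch; the GBK rule used here (a lead byte in 0x81..0xFE is followed by exactly one trail byte) is a simplification that does not validate trail bytes or cover the four-byte forms of GB18030.

```d
import std.utf : stride;

/// Hypothetical helper (not part of Phobos): number of bytes occupied by
/// the GBK-encoded character that starts at s[i].
size_t gbkStride(const(ubyte)[] s, size_t i)
{
    immutable lead = s[i];
    // Simplified rule: lead bytes 0x81..0xFE start a two-byte character.
    return (lead >= 0x81 && lead <= 0xFE) ? 2 : 1;
}

/// Hypothetical stand-in for the matcher's "any character" step: walks the
/// input one character at a time using whatever stride function is passed
/// in as an alias, and returns the number of characters seen.
size_t countChars(alias strideFn, T)(const(T)[] s)
{
    size_t count = 0;
    size_t i = 0;
    while (i < s.length)
    {
        i += strideFn(s, i); // bound at compile time, no runtime indirection
        ++count;
    }
    return count;
}

void main()
{
    // "中a" in GBK: 0xD6 0xD0 encodes 中, 0x61 encodes 'a'.
    immutable(ubyte)[] gbk = [0xD6, 0xD0, 0x61];
    assert(countChars!gbkStride(gbk) == 2);

    // The same text as a UTF-8 string, stepped with std.utf.stride.
    assert(countChars!stride("中a") == 2);
}
```

Passing the stride function as an alias template parameter, rather than as a delegate, is one way to get the "no unnecessary indirection" property mentioned in the comment above, since the call is resolved at compile time.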
May 30, 2010 [Issue 4250] std.regex does not support character sets other than unicode

Posted in reply to Lionello Lunesu

http://d.puremagic.com/issues/show_bug.cgi?id=4250

Lionello Lunesu <lio+bugzilla@lunesu.com> changed:

What                        |Removed |Added
----------------------------------------------------------------------------
Attachment #647 is obsolete |0       |1

--- Comment #1 from Lionello Lunesu <lio+bugzilla@lunesu.com> 2010-05-29 17:53:38 PDT ---

Created an attachment (id=648)
Patch against phobos/std/regex.d in dmd.2.046.zip

Fixed the diff.
May 30, 2010 [Issue 4250] std.regex does not support character sets other than unicode

Posted in reply to Lionello Lunesu

http://d.puremagic.com/issues/show_bug.cgi?id=4250

--- Comment #2 from Lionello Lunesu <lio+bugzilla@lunesu.com> 2010-05-29 18:02:11 PDT ---

Created an attachment (id=649)
Testcase (using GB18030 encoded date with std.regex)
May 30, 2010 [Issue 4250] std.regex does not support character sets other than unicode

Posted in reply to Lionello Lunesu

http://d.puremagic.com/issues/show_bug.cgi?id=4250

Walter Bright <bugzilla@digitalmars.com> changed:

What     |Removed |Added
----------------------------------------------------------------------------
CC       |        |bugzilla@digitalmars.com
Severity |normal  |enhancement

--- Comment #3 from Walter Bright <bugzilla@digitalmars.com> 2010-05-30 11:02:48 PDT ---

It's not designed to do anything but UTF, so marked as an enhancement request.
January 09, 2011 [Issue 4250] std.regex does not support character sets other than unicode

Posted in reply to Lionello Lunesu

http://d.puremagic.com/issues/show_bug.cgi?id=4250

Andrei Alexandrescu <andrei@metalanguage.com> changed:

What       |Removed              |Added
----------------------------------------------------------------------------
Status     |NEW                  |ASSIGNED
CC         |                     |andrei@metalanguage.com
AssignedTo |nobody@puremagic.com |andrei@metalanguage.com
March 12, 2012 [Issue 4250] std.regex does not support character sets other than unicode

Posted in reply to Lionello Lunesu

http://d.puremagic.com/issues/show_bug.cgi?id=4250

Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

What       |Removed                 |Added
----------------------------------------------------------------------------
CC         |                        |dmitry.olsh@gmail.com
AssignedTo |andrei@metalanguage.com |dmitry.olsh@gmail.com

--- Comment #4 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-03-12 03:34:56 PDT ---

The first straightforward step would be to add an option to skip UTF processing, assuming the input is plain ASCII; that covers an important use case. The next move largely depends on std.encoding or whatever it would be.
July 22, 2012 [Issue 4250] std.regex does not support character sets other than unicode

Posted in reply to Lionello Lunesu

http://d.puremagic.com/issues/show_bug.cgi?id=4250

Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

What                        |Removed |Added
----------------------------------------------------------------------------
Attachment #648 is obsolete |0       |1

--- Comment #5 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-07-22 08:21:13 PDT ---

(From update of attachment 648)
Old regex is gone for good since 2.056.
Copyright © 1999-2021 by the D Language Foundation