[Issue 6791] New: std.algorithm.splitter random indexes utf strings
October 08, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6791

           Summary: std.algorithm.splitter random indexes utf strings
           Product: D
           Version: D2
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: dawg@dawgfoto.de


--- Comment #0 from dawg@dawgfoto.de 2011-10-07 22:51:09 PDT ---
The following code throws a UTFException:

import std.array, std.stdio;

string s = `là dove terminava quella valle`;
foreach (word; std.array.splitter(s))
    writeln(word);

---

The second UTF-8 code unit of 'à' (U+00E0, encoded as 0xC3 0xA0) is 0xA0. Read on its own as a code point, 0xA0 is U+00A0 (no-break space), for which isWhite is true.
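
The observation can be checked in isolation. This is a minimal sketch, assuming a current Phobos in which std.uni.isWhite and std.string.representation are available; it does not call splitter and only shows why indexing a narrow string hands a misleading value to isWhite.

```d
import std.string : representation;
import std.uni : isWhite;

void main()
{
    string s = "\u00E0"; // 'à', stored as two UTF-8 code units

    // representation exposes the raw code units without decoding.
    auto units = s.representation;
    assert(units.length == 2);
    assert(units[0] == 0xC3 && units[1] == 0xA0);

    // Indexing a string yields a char (one code unit), not a decoded dchar.
    char second = s[1];

    // The implicit char -> dchar conversion turns 0xA0 into U+00A0,
    // the no-break space, so isWhite reports true for half of 'à'.
    assert(isWhite(second));

    // The properly decoded code point is not whitespace.
    assert(!isWhite('\u00E0'));
}
```

A split made at that position leaves the lone byte 0xC3 at the end of the first piece, which is not valid UTF-8 and is consistent with the reported UTFException.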

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
August 19, 2013
http://d.puremagic.com/issues/show_bug.cgi?id=6791


hsteoh@quickfur.ath.cx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hsteoh@quickfur.ath.cx


--- Comment #1 from hsteoh@quickfur.ath.cx 2013-08-18 22:22:41 PDT ---
This is caused by struct SplitterResult in std.algorithm, which uses array slicing and array indexing and therefore passes a char (not a dchar!) to the predicate. SplitterResult appears to have several problems: it slices the range without a signature constraint on hasSlicing, and it breaks on narrow strings because indexing a narrow string returns individual code units rather than decoded multibyte UTF-8 sequences.

It appears to need a rewrite that uses only forward range primitives or, at least, an overload for narrow strings that properly takes multibyte characters into account.
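
As a rough illustration of the "forward range primitives only" idea, here is a hedged sketch. PredSplitter and predSplitter are hypothetical names invented for this example, not the Phobos implementation or the eventual fix. It assumes a Phobos recent enough to have std.range.primitives, where front and popFront on a string decode whole code points.

```d
import std.algorithm.comparison : equal;
import std.range : takeExactly;
import std.range.primitives : empty, front, popFront, save, isForwardRange;
import std.uni : isWhite;

// Hypothetical sketch: splits on single elements matching pred, using
// only empty/front/popFront/save (no slicing, no indexing).
struct PredSplitter(alias pred, R)
if (isForwardRange!R)
{
    private R rest;     // input not yet consumed
    private R head;     // start of the current piece
    private size_t len; // elements (code points for strings) in the piece
    private bool done;

    this(R input)
    {
        rest = input;
        if (rest.empty) done = true;
        else findNext();
    }

    // Advance rest to the next separator, counting elements on the way.
    private void findNext()
    {
        head = rest.save;
        len = 0;
        while (!rest.empty && !pred(rest.front))
        {
            rest.popFront();
            ++len;
        }
    }

    @property bool empty() { return done; }

    // The current piece: the first len elements starting at head.
    @property auto front() { return takeExactly(head.save, len); }

    void popFront()
    {
        if (rest.empty) { done = true; return; }
        rest.popFront(); // drop the separator itself
        findNext();
    }
}

// Convenience wrapper so R can be inferred from the argument.
auto predSplitter(alias pred, R)(R r)
{
    return PredSplitter!(pred, R)(r);
}

void main()
{
    string s = `là dove terminava quella valle`;
    string[] expected = ["là", "dove", "terminava", "quella", "valle"];

    size_t i = 0;
    foreach (word; predSplitter!isWhite(s))
    {
        // Each piece is a range of decoded code points.
        assert(equal(word, expected[i]));
        ++i;
    }
    assert(i == expected.length);
}
```

Because the predicate only ever sees decoded dchar values from front, the 0xA0 byte inside 'à' is never tested on its own. The cost is that each piece is a lazy wrapper rather than a string slice; a dedicated narrow-string overload could slice at decoded boundaries instead.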

August 19, 2013
http://d.puremagic.com/issues/show_bug.cgi?id=6791


monarchdodra@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |monarchdodra@gmail.com
         AssignedTo|nobody@puremagic.com        |monarchdodra@gmail.com


--- Comment #2 from monarchdodra@gmail.com 2013-08-18 23:25:05 PDT ---
(In reply to comment #1)
> This is caused by struct SplitterResult in std.algorithm using array slicing and array indexing to pass char (not dchar!) to the lambda. SplitterResult appears to have multiple issues: it uses array slicing without a proper signature constraint on hasSlicing, and doesn't work properly for narrow strings because it uses indexing which for narrow strings doesn't handle multibyte UTF-8 sequences properly.
> 
> It appears to be wanting a rewrite that uses only forward range primitives, or at least, an overload for narrow strings that properly take multibyte characters into account.

I submitted a fix for this about a year ago, but it ended up being too big in scope (*all* splitter flavors have bugs). It also got messy because I was trying to avoid code duplication.

It might be better to fix things little by little, though, rather than not at all.

I'll fix *just* "splitter!pred", since it's the easiest to fix. We'll see where we go from there.
