Jump to page: 1 2
Thread overview
[Issue 4483] foreach over string or wstring, where element type not specified, does not support unicode
Jan 17, 2014
Lionello Lunesu
Jan 22, 2014
Lionello Lunesu
Jan 22, 2014
Martin Nowak
Jan 22, 2014
yebblies
Jan 24, 2014
Walter Bright
Jan 24, 2014
Daniel Kozak
Jan 26, 2014
Martin Nowak
Jan 26, 2014
Walter Bright
January 17, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483


Lionello Lunesu <lio+bugzilla@lunesu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lio+bugzilla@lunesu.com
            Summary|Make foreach over string or |foreach over string or
                   |wstring where element type  |wstring, where element type
                   |not specified a warning     |not specified, does not
                   |                            |support unicode


--- Comment #2 from Lionello Lunesu <lio+bugzilla@lunesu.com> 2014-01-17 01:30:01 PST ---
I took the liberty to remove the suggested solution from the title, since I think there are a couple of possible fixes here:

1. Issue a warning (original suggestion)
2. Issue an error, always require a value type (breaking change)
3. Infer the value type as "dchar" in all cases (breaking change)
4. Throw an exception at runtime when >char, >wchar unicode is encountered
(breaking change)

I think this issue is serious enough to warrant a breaking change. I taught a D workshop, in China, and everybody expected foreach to "just work", and rightfully so.

foreach(c; "你好") {}

This should just work! And it's hard to explain people why it doesn't, without getting into Unicode encoding issues, which no user wants to care about.

I'm going to argue for fix 3. and I'd say it's worth taking a breaking change for this issue.

The breaking change is compile time only, and limited to foreach over char[] or wchar[], with a non-ref, inferred value type, and where the scope cares about the value type being char or wchar.

That last part is important: In all of druntime and phobos there were only 2 places where that was the case. All others, including all tests(!), compiled (and ran) successfully without changes. The two places were fixed by adding the appropriate type, in both cases "char". A nice side effect of this change is that it makes it immediately obvious that the foreach does NOT handle the full Unicode character set. It's self-documenting, in a way.

Note that we might still choose a runtime exception. It's hardly useful to get a char with value 0xE8 out of a char[]. But throwing a sudden exception is a breaking change that might be too risky to take on.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 21, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483


monarchdodra@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |monarchdodra@gmail.com


--- Comment #3 from monarchdodra@gmail.com 2014-01-21 10:07:21 PST ---
(In reply to comment #2)
> I took the liberty to remove the suggested solution from the title, since I think there are a couple of possible fixes here:
> 
> 1. Issue a warning (original suggestion)
> 2. Issue an error, always require a value type (breaking change)
> 3. Infer the value type as "dchar" in all cases (breaking change)
> 4. Throw an exception at runtime when >char, >wchar unicode is encountered
> (breaking change)
> 
> I think this issue is serious enough to warrant a breaking change. I taught a D workshop, in China, and everybody expected foreach to "just work", and rightfully so.
> 
> foreach(c; "你好") {}
> 
> This should just work! And it's hard to explain people why it doesn't, without getting into Unicode encoding issues, which no user wants to care about.
> 
> I'm going to argue for fix 3. and I'd say it's worth taking a breaking change for this issue.
> 
> The breaking change is compile time only, and limited to foreach over char[] or wchar[], with a non-ref, inferred value type, and where the scope cares about the value type being char or wchar.
> 
> That last part is important: In all of druntime and phobos there were only 2 places where that was the case. All others, including all tests(!), compiled (and ran) successfully without changes. The two places were fixed by adding the appropriate type, in both cases "char". A nice side effect of this change is that it makes it immediately obvious that the foreach does NOT handle the full Unicode character set. It's self-documenting, in a way.

The major issue though is that while it still compiles, it becomes a *silent* breaking change, which is the worst kind of change.

I can think of a few places where the foreach is written "foreach(c; s)"
*because* we specifically want to iterate on the raw codeunits (not necessarily
the *char*).

Just because the code still compiles doesn't mean it's still correct. These kinds of changes can introduce some really insidious bugs, or performance issues.

At this point the only "solution" I really see would be to simply deprecate and ban implicit char inference. Code would have to make an explicit choice. After a two or three years of this, we'd be able to safely assume that the is no more implicit char conversion out there, and then we'd be able to set the default to dchar. Anything else (IMO) is just a disaster waiting to happen.

So I'd say I'd opt for "2)", and *once* we've had "2)" for a while, we can opt
to add "3)".

The question though is is it all worth it... I don't know?

> Note that we might still choose a runtime exception. It's hardly useful to get a char with value 0xE8 out of a char[]. But throwing a sudden exception is a breaking change that might be too risky to take on.

Except if you are explicitly iterating the codeunits, because you know your
needle is ASCII. Besides,
1) You'd *kill* performance.
2) You'd make a fair amount of nothrow functions throw.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 22, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483



--- Comment #4 from Lionello Lunesu <lio+bugzilla@lunesu.com> 2014-01-21 18:20:57 PST ---
The pull request contains some more arguments against option 2:

* different type for "ref" compared to non ref is unprecedented
* won't fix iteration for composed characters
* what if iteration over code point was intended in the first place? Silent
breaking change

I'm personally particularly interested to fix the foreach over a unicode literal. I guess we could change the type of such a literal and, for example, pick a type that fits the characters in the literal.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 22, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483


Martin Nowak <code@dawg.eu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |code@dawg.eu


--- Comment #5 from Martin Nowak <code@dawg.eu> 2014-01-21 18:30:31 PST ---
(In reply to comment #4)
> I'm personally particularly interested to fix the foreach over a unicode literal. I guess we could change the type of such a literal and, for example, pick a type that fits the characters in the literal.

Sounds pretty hacky.

Andrei posted a good alternative solution. http://forum.dlang.org/post/j7soe4$2rvt$1@digitalmars.com

We should also try to use range forech (.front, .popFront, .empty) because decoding is much faster than with the runtime implementation which uses a delegate.

After a proper deprecation cycle this could work like so.

    // iterate over any unicode string
    foreach (c; "foobar") {}  //dchar
    foreach (c; "foobar"c) {} //dchar
    foreach (c; "foobar"w) {} //dchar
    foreach (c; "foobar"d) {} //dchar

    // iterate over representation
    foreach (c; "foobar".rep) {}  //ubyte
    foreach (c; "foobar"c.rep) {} //ubyte
    foreach (c; "foobar"w.rep) {} //ushort
    foreach (c; "foobar"d.rep) {} //uint

    // conversion becomes an error
    foreach (char c; "foobar"c) {}   //error can't convert dchar to char
    foreach (char c; "foobar"w) {}   //error can't convert dchar to char
    foreach (char c; "foobar"d) {}   //error can't convert dchar to char
    foreach (wchar wc; "foobar"c) {} //error can't convert dchar to wchar
    foreach (wchar wc; "foobar"w) {} //error can't convert dchar to wchar
    foreach (wchar wc; "foobar"d) {} //error can't convert dchar to wchar

    // std.utf.transcode for transcoding (lazy iteration)
    foreach (c; "foobar".transcode!char()) {}   //char
    foreach (wc; "foobar".transcode!wchar()) {} //wchar
    // and so on and so forth

    // further changes
    "foobar".length // use "foobar".rep.length
    "foobar"[0]     // use "foobar".rep[0]

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 22, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483


yebblies <yebblies@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yebblies@gmail.com


--- Comment #6 from yebblies <yebblies@gmail.com> 2014-01-22 15:13:11 EST ---
I suggest the following steps:
1. Warn when foreach arg type is inferred to be a narrow string
2. Deprecate ...
3. Error ...
4. New behavior, maybe

This both prevents accidental narrow-char iteration, and lets us put off deciding on the new behavior for at least a year.  Win!

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 22, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483



--- Comment #7 from monarchdodra@gmail.com 2014-01-21 23:42:52 PST ---
(In reply to comment #5)
> (In reply to comment #4)
> Sounds pretty hacky.
> 
> Andrei posted a good alternative solution. http://forum.dlang.org/post/j7soe4$2rvt$1@digitalmars.com
> 
> We should also try to use range forech (.front, .popFront, .empty) because decoding is much faster than with the runtime implementation which uses a delegate.

Arguably, that's an implementation problem? In particular, front/popFront is *know* to be a slow iterative process, due to the double decode+stride.

Whenever I need *real* speed for string decoding, *NOTHING* beats std.utf.decode. And technically, I see no reason the built-in foreach couldn't *hope* to one day achieve the same speeds.

> After a proper deprecation cycle this could work like so.
> 
>     // iterate over any unicode string
>     foreach (c; "foobar") {}  //dchar
>     foreach (c; "foobar"c) {} //dchar
>     foreach (c; "foobar"w) {} //dchar
>     foreach (c; "foobar"d) {} //dchar
> 
>     // iterate over representation
>     foreach (c; "foobar".rep) {}  //ubyte
>     foreach (c; "foobar"c.rep) {} //ubyte
>     foreach (c; "foobar"w.rep) {} //ushort
>     foreach (c; "foobar"d.rep) {} //uint

Sounds good to me.

>     // conversion becomes an error
>     foreach (char c; "foobar"c) {}   //error can't convert dchar to char
>     foreach (char c; "foobar"w) {}   //error can't convert dchar to char
>     foreach (char c; "foobar"d) {}   //error can't convert dchar to char
>     foreach (wchar wc; "foobar"c) {} //error can't convert dchar to wchar
>     foreach (wchar wc; "foobar"w) {} //error can't convert dchar to wchar
>     foreach (wchar wc; "foobar"d) {} //error can't convert dchar to wchar
> 
>     // std.utf.transcode for transcoding (lazy iteration)
>     foreach (c; "foobar".transcode!char()) {}   //char
>     foreach (wc; "foobar".transcode!wchar()) {} //wchar
>     // and so on and so forth
> 
>     // further changes
>     "foobar".length // use "foobar".rep.length
>     "foobar"[0]     // use "foobar".rep[0]

Now that just seems excessive to me. Especially the last one. We can't completely act like a string isn't an array. Plus:

auto c = "foobar"[0];
static assert(is(typeof(c) == char); //Fails

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 24, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483


Walter Bright <bugzilla@digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla@digitalmars.com


--- Comment #8 from Walter Bright <bugzilla@digitalmars.com> 2014-01-23 21:34:22 PST ---
I'm very much opposed to this. The way it works now has been that way from the beginning, and an unknowably vast amount of code depends on it.

Furthermore, I don't like the inherent slowdowns it causes. Only rarely does one need decoded chars, the rest of the time working with bytes is fast and correct.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 24, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483


Daniel Kozak <kozzi11@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kozzi11@gmail.com


--- Comment #9 from Daniel Kozak <kozzi11@gmail.com> 2014-01-24 02:22:51 PST ---
(In reply to comment #8)
> I'm very much opposed to this. The way it works now has been that way from the beginning, and an unknowably vast amount of code depends on it.
> 
> Furthermore, I don't like the inherent slowdowns it causes. Only rarely does one need decoded chars, the rest of the time working with bytes is fast and correct.

Yes, I agree with this. string is just an alias for immutable array of chars, there is nothing special about it. (I am not happy with this, but this is current situation)

What I really hate is this:

char[] a = cast(char[])"asdsad";
writeln(typeid(typeof(a.front))) // prints dchar instead char;

This is really unexpected behavior :(.

I think new special type lets call it `EncodeString` should be added.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 26, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483



--- Comment #10 from Martin Nowak <code@dawg.eu> 2014-01-25 20:50:26 PST ---
(In reply to comment #8)
> I'm very much opposed to this. The way it works now has been that way from the beginning, and an unknowably vast amount of code depends on it.
> 
You fail to recognize that it's broken from the begging.
The knowably vast amount of people that stumble over this should
show you how surprising this is. It's the only place in the language
where unicode handling is opt-in and choosing an incorrect default
goes against basic D principles.

D is designed such that most "obvious" code is fast and safe. On occasion a function might need to escape the confines of type safety for ultimate speed and control. For such rare cases D offers native pointers, type casts...

> Furthermore, I don't like the inherent slowdowns it causes. Only rarely does one need decoded chars, the rest of the time working with bytes is fast and correct.

That's not the best argument. Handling UTF-8 only adds a single comparison
    if (str[i] < 0x80)
        return str[i];
    else
        decodeUTF8(str, i);
that the branch predictor will always get right.
What's true is that we had several codegen issues with unicode decoding,
e.g. the comparison wasn't inlined.
Currently we decode foreach with a delegate, creeping slow.
https://github.com/rejectedsoftware/vibe.d/pull/327
So handling UTF-8 requires care in library design and a good optimizer but I
don't see how it is inherently slower.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
January 26, 2014
https://d.puremagic.com/issues/show_bug.cgi?id=4483


Walter Bright <bugzilla@digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Platform|Other                       |All


--- Comment #11 from Walter Bright <bugzilla@digitalmars.com> 2014-01-26 01:17:28 PST ---
> You fail to recognize that it's broken from the begging.

I understand and recognize your arguments, I just do not agree with the conclusion.

1. Silently breaking code with such a change is a seriously bad idea at this point.

2. I do not agree with the hypothesis that the slowdown is trivial. I just got done doing a project for a client, where high speed was essential. I got bit by front() doing a decode. The speed hit was not acceptable, and I had to insert lots of ugly casts to ubyte arrays. The code almost never needed a decode character, but simply doing a copy() did a decode and then an encode. Arghh!

Actually needing a dchar is rather rare. The bulk of algorithms with strings work better and faster using octets (like good ole' copy()). I've suspected for a while now that std.algorithm is slower than necessary because the pointless decoding going on, and I know that std.regex suffered serious slowdowns because of it.

To me this is much like the recurring argument that D should hide the reality of twos-complement arithmetic with various checks for overflows and carries. D strings are octets, not dchars, and one simply needs to be aware of that when programming with them.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
« First   ‹ Prev
1 2