Thread overview: Behavior of strings with invalid unicode...
- Nov 21, 2012: monarch_dodra
- Nov 21, 2012: Jonathan M Davis
- Nov 26, 2012: monarch_dodra
- Nov 27, 2012: Jonathan M Davis
November 21, 2012
I made a commit that was meant to better certify which functions in std.utf throw.

In doing so, I noticed that some of our functions are unsafe. For example:

string s = [0b1100_0000]; // 1 byte of a 2-byte sequence
s.popFront();             // Assertion error because of invalid
                          // slicing of s[2 .. $];

popFront is nothrow, so throwing an exception is out of the question, and the implementation seems to imply that invalid Unicode sequences are simply removed.

This is a bug, right?
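To make the failure concrete, here is a minimal sketch of how the stride is derived from the lead byte alone (leadByteLength is a made-up helper, not the actual Phobos code): 0b1100_0000 announces a 2-byte sequence, so the slice start lands past the end of a 1-byte string.

```d
// Hypothetical helper mirroring the lead-byte classification that
// popFront relies on; this is a sketch, not the Phobos implementation.
size_t leadByteLength(char c)
{
    if (c < 0x80)           return 1; // ASCII
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx: 2-byte sequence
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx: 3-byte sequence
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx: 4-byte sequence
    return 1;                         // not a valid lead byte
}

void main()
{
    string s = [0b1100_0000];
    assert(leadByteLength(s[0]) == 2); // the lead byte claims 2 bytes...
    assert(s.length == 1);             // ...but only 1 is present, so
                                       // s[2 .. $] violates the bounds check
}
```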

--------
Things get more complicated if you take into account "partial invalidity". For example:

string s = [0b1100_0000, 'a', 'b'];

Here, the first byte starts an invalid sequence, since the second byte is not of the form 0b10XX_XXXX. What's more, the second byte is itself a valid sequence. We do not detect this, though, and produce this output:
s.popFront(); => s == "b";
*arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
Where only the single invalid first byte is removed.

The problem is that doing this would actually be much more expensive, especially for such a rare case. Worse yet, chances are you'd end up validating the same character again and again (and again).
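For illustration, a sketch of what that stricter behavior might look like (popFrontStrict is a hypothetical name, not a Phobos function): it validates the continuation bytes and, on any mismatch or truncation, pops only the single invalid lead byte.

```d
// Sketch of the *arguably* correct behavior described above: check the
// continuation bytes, and on failure drop only the invalid lead byte.
// Assumes s is non-empty.
void popFrontStrict(ref string s)
{
    immutable c = s[0];
    size_t len = 1;
    if ((c & 0xE0) == 0xC0)      len = 2;
    else if ((c & 0xF0) == 0xE0) len = 3;
    else if ((c & 0xF8) == 0xF0) len = 4;

    if (len > s.length)
        len = 1; // truncated sequence: pop the lead byte only
    else
    {
        // every following byte must be a continuation byte, 0b10XX_XXXX
        foreach (i; 1 .. len)
            if ((s[i] & 0xC0) != 0x80) { len = 1; break; }
    }
    s = s[len .. $];
}

void main()
{
    string s = [0b1100_0000, 'a', 'b'];
    popFrontStrict(s);
    assert(s == "ab"); // only the single invalid first byte was removed
}
```

This is exactly the expensive path: every continuation byte gets touched here, and then typically touched again when the next character is decoded.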

--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to follow when decoding utf with invalid codes"?

2. Do we even really support invalid UTF after we "leave" the std.utf.decode layer? E.g. do we simply assume that the string is valid?

November 21, 2012
On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
> So here are my 2 questions:
> 1. Is there, or does anyone know of, a standardized "behavior to
> follow when decoding utf with invalid codes"?
> 
> 2. Do we even really support invalid UTF after we "leave" the std.utf.decode layer? E.g. do we simply assume that the string is valid?

We don't support invalid Unicode beyond providing ways to check for it and in some cases throwing if it's encountered. If you create a string with invalid Unicode, then you're shooting yourself in the foot, and you could get weird results. Some code checks for validity and will throw when it's given invalid Unicode (decode in particular does this), whereas some code will simply ignore the fact that it's invalid and move on (generally because it's not bothering to go to the effort of validating it). I believe that at the moment, the idea is that when the full decoding of a character occurs, a UTFException will be thrown if an invalid code point is encountered, whereas anything which only partially decodes characters (e.g. just figures out how large a code point is) may or may not throw. popFront used to throw but doesn't any longer in an effort to make it faster, letting decode be the one to throw (so front would still throw, but popFront wouldn't).

I'm not aware of there being any standard way to deal with invalid Unicode, but I believe that popFront currently just treats invalid code points as being of length 1.
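That length-1 fallback can be seen with a byte that can never start a UTF-8 sequence (a small sketch of the behavior described above; 0xFF is chosen because no valid lead byte has that form):

```d
import std.range; // supplies front/popFront for narrow strings

void main()
{
    string s = [0b1111_1111, 'a']; // 0xFF is never a valid UTF-8 lead byte
    s.popFront();                  // invalid code unit treated as length 1
    assert(s == "a");
}
```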

- Jonathan M Davis
November 26, 2012
On Wednesday, 21 November 2012 at 18:25:56 UTC, Jonathan M Davis wrote:
> On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
>> So here are my 2 questions:
>> 1. Is there, or does anyone know of, a standardized "behavior to
>> follow when decoding utf with invalid codes"?
>> 
>> 2. Do we even really support invalid UTF after we "leave" the
>> std.utf.decode layer? EG: We simply suppose that the string is
>> valid?
>
> We don't support invalid Unicode beyond providing ways to check for it and in
> some cases throwing if it's encountered. If you create a string with invalid
> unicode, then you're shooting yourself in the foot, and you could get weird
> results. Some code checks for validity and will throw when it's given invalid
> unicode (decode in particular does this), whereas some code will simply ignore
> the fact that it's invalid and move on (generally, because it's not bothering
> to go to the effort of validating it). I believe that at the moment, the idea
> is that when the full decoding of a character occurs, a UTFException will be
> thrown if an invalid code point is encountered, whereas anything which
> partially decodes characters (e.g. just figures out how large a code point is)
> may or may not throw. popFront used to throw but doesn't any longer in an
> effort to make it faster, letting decode be the one to throw (so front would
> still throw, but popFront wouldn't).

OK: I guess that makes sense. I kind of wish there'd be more of a documented "two-level" scheme, but that should be fine.

> I'm not aware of there being any standard way to deal with invalid Unicode,
> but I believe that popFront currently just treats invalid code points as being
> of length 1.
>
> - Jonathan M Davis

Well, popFront only pops 1 element if the very first element is an invalid code point, but it will not "see" whether the code point at index 2 is invalid for multi-byte codes.

This kind of gives it a double-standard behavior, but I guess we have to draw a line somewhere.
November 27, 2012
On Monday, November 26, 2012 08:47:48 monarch_dodra wrote:
> OK: I guess that makes sense. I kind of wish there'd be more of a documented "two-level" scheme, but that should be fine.

It's pretty much grown over time and isn't necessarily applied consistently.

> Well, popFront only pops 1 element if the very first element is an invalid code point, but it will not "see" whether the code point at index 2 is invalid for multi-byte codes.
> 
> This kind of gives it a double-standard behavior, but I guess we have to draw a line somewhere.

We care about making popFront as fast as possible, and in general, front is called on the character as well (which unfortunately makes the whole way that front and popFront work for strings naturally inefficient), so it makes sense to skip as much checking as possible in popFront. It's basically doing the best that it can to be as fast as it can, so any checking that it doesn't need to do is best skipped. Speed wins over correctness here, and anything that we can do to make it faster is desirable. It's not perfect that way, but since in most cases the Unicode will be correct, and correctness is generally checked by front (or decode), it was deemed to be the best approach.
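That division of labor can be sketched directly: assuming the behavior described above, front throws a UTFException when it fully decodes the invalid sequence, while popFront just advances without throwing.

```d
import std.exception : assertThrown;
import std.range;               // front/popFront for narrow strings
import std.utf : UTFException;

void main()
{
    string s = [0b1100_0000, 'a']; // invalid: 'a' is not a continuation byte
    // front fully decodes the first code point, so it throws...
    assertThrown!UTFException(s.front);
    // ...while popFront (nothrow) just advances by the apparent stride.
    s.popFront();
    assert(s.length < 2); // at least the invalid lead byte was consumed
}
```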

- Jonathan M Davis