November 21, 2012
Behavior of strings with invalid unicode...
I made a commit that was meant to better certify which functions 
throw in UTF.

I thus noticed that some of our functions are unsafe. For 
example:

string s = [0b1100_0000]; //1 byte of a 2 byte sequence
s.popFront();             //Assertion error because of invalid
                          //slicing of s[2 .. $];

"popFront" is nothrow, so throwing an exception is out of the 
question, and the implementation seems to imply that "invalid 
unicode sequences are removed".

This is a bug, right?

--------
Things get more complicated if you take into account "partial 
invalidity". For example:

string s = [0b1100_0000, 'a', 'b'];

Here, the first byte is actually an invalid sequence, since the 
second byte is not of the form 0b10XX_XXXX. What's more, byte 2 
itself is actually a valid sequence. We do not detect this 
though, and create this output:
s.popFront(); => s == "b";
*arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
Where only the single invalid first byte is removed.

The problem is that doing this would actually be much more 
expensive, especially for a rare case. Worse yet, chances are you 
would validate the same character again and again (and again).
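
For illustration, the stricter behavior argued for above could be sketched roughly as follows. This is not the Phobos implementation: strictStride is a hypothetical helper, and it only checks the shape of the bytes (lead byte plus 0b10XX_XXXX trailing bytes), not overlong encodings or surrogates. The extra loop over the trailing bytes is the additional cost being discussed.

```d
// Hypothetical helper, not part of Phobos.
// Returns how many code units a "strict" popFront would drop from s.
size_t strictStride(const(char)[] s)
{
    assert(s.length > 0, "cannot pop from an empty string");
    immutable uint lead = s[0];

    // Sequence length announced by the lead byte. ASCII, stray
    // continuation bytes and impossible lead bytes count as 1.
    size_t len = 1;
    if (lead >= 0b1100_0000 && lead < 0b1110_0000) len = 2;
    else if (lead >= 0b1110_0000 && lead < 0b1111_0000) len = 3;
    else if (lead >= 0b1111_0000 && lead < 0b1111_1000) len = 4;

    // Truncated sequence: drop only the lead byte.
    if (len > s.length) return 1;

    // Every trailing byte must be of the form 0b10XX_XXXX;
    // otherwise drop only the single invalid lead byte.
    foreach (i; 1 .. len)
        if ((s[i] & 0b1100_0000) != 0b1000_0000)
            return 1;

    return len;
}

unittest
{
    char[] s = [cast(char) 0b1100_0000, 'a', 'b'];
    s = s[strictStride(s) .. $];
    assert(s == "ab"); // only the invalid first byte is removed
}
```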

--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to 
follow when decoding utf with invalid codes"?

2. Do we even really support invalid UTF after we "leave" the 
std.utf.decode layer? EG: We simply suppose that the string is 
valid?
November 21, 2012
Re: Behavior of strings with invalid unicode...
On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
> So here are my 2 questions:
> 1. Is there, or does anyone know of, a standardized "behavior to
> follow when decoding utf with invalid codes"?
> 
> 2. Do we even really support invalid UTF after we "leave" the
> std.utf.decode layer? EG: We simply suppose that the string is
> valid?

We don't support invalid unicode beyond providing ways to check for it and in 
some cases throwing if it's encountered. If you create a string with invalid 
unicode, then you're shooting yourself in the foot, and you could get weird 
results. Some code checks for validity and will throw when it's given invalid 
unicode (decode in particular does this), whereas some code will simply ignore 
the fact that it's invalid and move on (generally, because it's not bothering 
to go to the effort of validating it). I believe that at the moment, the idea 
is that when the full decoding of a character occurs, a UTFException will be 
thrown if an invalid code point is encountered, whereas anything which 
partially decodes characters (e.g. just figures out how large a code point is) 
may or may not throw. popFront used to throw but doesn't any longer in an 
effort to make it faster, letting decode be the one to throw (so front would 
still throw, but popFront wouldn't).

I'm not aware of there being any standard way to deal with invalid Unicode, 
but I believe that popFront currently just treats invalid code points as being 
of length 1.
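
As a rough illustration of the "full decoding throws" half of this scheme, the sketch below calls std.utf.decode on the first code point and reports whether it threw. frontDecodes is a hypothetical helper name used only for this example.

```d
import std.utf : decode, UTFException;

// Hypothetical helper: true if the first code point of s decodes cleanly.
bool frontDecodes(const(char)[] s)
{
    assert(s.length > 0, "nothing to decode");
    size_t index = 0;
    try
    {
        decode(s, index); // full decode; throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

unittest
{
    char[] bad = [cast(char) 0b1100_0000, 'a', 'b'];
    assert(!frontDecodes(bad)); // 'a' is not a valid trailing byte
    assert(frontDecodes("ab"));
}
```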

- Jonathan M Davis
November 26, 2012
Re: Behavior of strings with invalid unicode...
On Wednesday, 21 November 2012 at 18:25:56 UTC, Jonathan M Davis 
wrote:
> On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
>> So here are my 2 questions:
>> 1. Is there, or does anyone know of, a standardized "behavior 
>> to
>> follow when decoding utf with invalid codes"?
>> 
>> 2. Do we even really support invalid UTF after we "leave" the
>> std.utf.decode layer? EG: We simply suppose that the string is
>> valid?
>
> We don't support invalid unicode beyond providing ways to check 
> for it and in
> some cases throwing if it's encountered. If you create a string 
> with invalid
> unicode, then you're shooting yourself in the foot, and you 
> could get weird
> results. Some code checks for validity and will throw when it's 
> given invalid
> unicode (decode in particular does this), whereas some code 
> will simply ignore
> the fact that it's invalid and move on (generally, because it's 
> not bothering
> to go to the effort of validating it). I believe that at the 
> moment, the idea
> is that when the full decoding of a character occurs, a 
> UTFException will be
> thrown if an invalid code point is encountered, whereas 
> anything which
> partially decodes characters (e.g. just figures out how large a 
> code point is)
> may or may not throw. popFront used to throw but doesn't any 
> longer in an
> effort to make it faster, letting decode be the one to throw 
> (so front would
> still throw, but popFront wouldn't).

OK: I guess that makes sense. I kind of wish there'd be more of 
a documented "two-level" scheme, but that should be fine.

> I'm not aware of there being any standard way to deal with 
> invalid Unicode,
> but I believe that popFront currently just treats invalid code 
> points as being
> of length 1.
>
> - Jonathan M Davis

Well, popFront pops only 1 element if the very first element is 
an invalid code point, but it will not "see" if the code point 
at index 2 is invalid for multi-byte codes.

This kind of gives it a double-standard behavior, but I guess we 
have to draw a line somewhere.
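
To make the double standard concrete, here is a minimal sketch of a lead-byte-only stride, assuming the behavior described in this thread. fastStride is a hypothetical helper, not the Phobos code; the clamp at the end is an assumption added so that a truncated sequence never slices past the end of the string.

```d
// Hypothetical helper, not part of Phobos.
// Looks only at the first byte, never at the trailing bytes.
size_t fastStride(const(char)[] s)
{
    assert(s.length > 0, "cannot pop from an empty string");
    immutable uint lead = s[0];

    // A byte that cannot start a sequence is treated as length 1.
    size_t len = 1;
    if (lead >= 0b1100_0000 && lead < 0b1110_0000) len = 2;
    else if (lead >= 0b1110_0000 && lead < 0b1111_0000) len = 3;
    else if (lead >= 0b1111_0000 && lead < 0b1111_1000) len = 4;

    // Clamp so that a truncated sequence stays within bounds.
    return len < s.length ? len : s.length;
}

unittest
{
    // Invalid first byte: exactly one element is popped.
    char[] stray = [cast(char) 0b1000_0000, 'a'];
    assert(fastStride(stray) == 1);

    // Invalid second byte: not seen, so 'a' is swallowed as well.
    char[] s = [cast(char) 0b1100_0000, 'a', 'b'];
    s = s[fastStride(s) .. $];
    assert(s == "b");
}
```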
November 27, 2012
Re: Behavior of strings with invalid unicode...
On Monday, November 26, 2012 08:47:48 monarch_dodra wrote:
> OK: I guess that makes sense. I kind of wish there'd be more of
> a documented "two-level" scheme, but that should be fine.

It's pretty much grown over time and isn't necessarily applied consistently.

> Well, popFront pops only 1 element if the very first element is
> an invalid code point, but it will not "see" if the code point
> at index 2 is invalid for multi-byte codes.
> 
> This kind of gives it a double-standard behavior, but I guess we
> have to draw a line somewhere.

We care about making popFront as fast as possible, and in general, front is 
called on the character as well (making the whole way that front and popFront 
work for strings naturally inefficient unfortunately), so it makes sense to skip 
the checking as much as possible in popFront. It's basically doing the best 
that it can to be as fast as it can, so any checking that it doesn't need to 
do is best skipped. Speed wins over correctness here, and anything that we 
can do to make it faster is desirable. It's not perfect that way, but since in 
most cases the Unicode will be correct, and the correctness is generally 
checked by front (or decode), it was deemed to be the best approach.
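
Since the fast path assumes well-formed input, one option for untrusted data is to check it once where it enters the program. A minimal sketch, assuming std.utf.validate, which throws a UTFException on malformed input; isWellFormed is a hypothetical helper name.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isWellFormed(const(char)[] s)
{
    try
    {
        validate(s); // walks the whole string; throws on the first bad sequence
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

unittest
{
    char[] bad = [cast(char) 0b1100_0000, 'a', 'b'];
    assert(!isWellFormed(bad));
    assert(isWellFormed("ab"));
}
```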

- Jonathan M Davis