March 21, 2014
21.03.2014 12:25, monarch_dodra wrote:
> On Thursday, 20 March 2014 at 23:34:02 UTC, Brad Anderson wrote:
>> I'm a fan of this approach but Timon pointed out when I wrote about it
>> once that it's rather trivial to get an invalid string through slicing
>> mid-code point so now I'm not so sure.
>
> It's just as easy to slice mid-codepoint as it is to access a range out
> of bounds. In both cases, it's a programming error.
>
> The only excuse I see for throwing an exception for slicing
> mid-codepoint, is that
> 1. programmers are less aware of the issue, so it's more forgiving in a
> released program (nobody likes a crash).
> 2. arguably, it's not the *program* state that's bad. It's the *data*.
>
> Well, in regards to "2", you could argue that program state and data
> state is one and the same.
>
>> I think I'm still in favor of it because you've obviously got a logic
>> error if that happens so your program isn't correct anyway (it's not a
>> matter of bad user input).
>
>
> If I remember correctly, with a specially written UTF string, it *was*
> possible to corrupt program state. I think. I need to double check. I
> didn't give it much thought then ("it should virtually never happen"),
> but it could be used as deliberate security vulnerability.

Almost nothing to add here. We already have `-noboundscheck`, which can dramatically increase performance; throwing `UTFError` should either use the same mechanics (`-noutfcheck`?) or just be stripped in release builds. Personally I'd choose the latter, as real programs strip lots of (sometimes very slow) assertions with `-release`, and those indicate the same kind of critical data corruption.
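
A minimal sketch of that idea (the names `utfOk` and `decodeAssumingValid` are invented here, not Phobos API): put the validity check in an `assert`, so `-release` removes it exactly like other assertions.

```d
import std.utf : decode, validate;

// Hypothetical helper (not in Phobos): wrap the throwing validate in a bool check.
bool utfOk(string s)
{
    try { validate(s); return true; }       // validate throws UTFException on bad input
    catch (Exception) { return false; }
}

// Sketch of a decoder whose UTF check disappears with -release,
// analogous to bounds checks under -noboundscheck.
dchar decodeAssumingValid(string s, ref size_t i)
{
    assert(utfOk(s[i .. $]), "invalid UTF-8 passed to decoder");
    return decode(s, i);
}
```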

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
March 21, 2014
21-Mar-2014 02:39, Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

If we're talking about decoding, then only dchar is relevant.
If we're transcoding, then 0xFF makes for a broken UTF-8 encoding, so I see no sense in going for it.

>
> 2. U+FFFD
>

It also has the benefit of being recommended by the standard specifically as the substitute for bad encoding.

Details:
https://d.puremagic.com/issues/show_bug.cgi?id=12113
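
A quick illustration of both points (a sketch, not a proposed API): a lone 0xFF byte is never well-formed UTF-8, while U+FFFD encodes cleanly to the three bytes EF BF BD.

```d
import std.exception : assertThrown;
import std.utf : encode, validate;

void main()
{
    char[4] buf;
    auto n = encode(buf, cast(dchar) 0xFFFD); // the replacement character
    assert(buf[0 .. n] == "\xEF\xBF\xBD");    // valid three-byte UTF-8

    assertThrown(validate("\xFF"));           // 0xFF is never valid UTF-8
}
```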

> I kinda like option 1.
>

Not enough of an argument ;)


-- 
Dmitry Olshansky
March 21, 2014
On Thursday, March 20, 2014 15:39:50 Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has problems:
> 
> 1. about anything that deals with UTF cannot be made nothrow
> 
> 2. turns innocuous errors into major problems, such as DOS attack vectors http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
> 
> One option to fix this is to treat invalid sequences as:
> 
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
> 
> 2. U+FFFD
> 
> I kinda like option 1.
> 
> What do you think?

After a discussion on this a few weeks back (where I was in favor of the current behavior when the discussion started), I'm now completely in favor of making it so that std.utf.decode simply replaces invalid sequences with U+FFFD per the standard. Most code won't care and will continue to work as before. The main difference is that invalid Unicode would then fall into the same category as when a program is given a string with characters that it's not supposed to be given. Any code that checks for that sort of thing will then treat invalid Unicode as it would have treated other invalid strings, and code that doesn't care will continue to not care, except that now it will work with invalid Unicode instead of throwing.
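
A minimal sketch of that behavior, layered over today's throwing decode (the name `decodeReplacing` and the skip-one-code-unit recovery are just illustrative choices, not the actual proposal's details):

```d
import std.utf : decode, UTFException;

dchar decodeReplacing(string s, ref size_t i)
{
    immutable start = i;
    try
    {
        return decode(s, i);    // current behavior: throws on an invalid sequence
    }
    catch (UTFException)
    {
        i = start + 1;          // skip a single code unit past the bad byte
        return '\uFFFD';        // substitute the replacement character
    }
}
```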

A prime example is something like find. What does it care if it's given invalid Unicode? It will simply look for what you tell it to look for, and if it's not there, it won't find it. U+FFFD will just be one more character that doesn't match what it's looking for.

The few programs that really care about whether a string that they're given contains any invalid Unicode can simply validate the string ahead of time. The main problem there is that we need to replace std.utf.validate with something like std.utf.isValidUnicode, because validate makes the horrendous decision of throwing rather than returning a bool (which is what triggered the previous discussion on the topic IIRC).
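
Something along these lines would do (a sketch only; `isValidUnicode` is the name suggested above, not an existing Phobos symbol):

```d
import std.utf : validate;

bool isValidUnicode(S)(S str)
{
    try
    {
        validate(str);      // current Phobos: throws UTFException on bad input
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}

unittest
{
    assert(isValidUnicode("hello"));
    assert(!isValidUnicode("\xC3"));  // truncated two-byte sequence
}
```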

There may be some concern about this change silently changing behavior, but I think that the reality is that the vast majority of programs will continue to work just fine, and our string processing code will be that much cleaner and faster as a result. So, I'm very much inclined to take the path of making this change and putting a warning about it in the changelog rather than not making the change or trying to do this alongside what we currently have.

- Jonathan M Davis
March 21, 2014
On 3/21/2014 10:14 AM, Dmitry Olshansky wrote:
> 21-Mar-2014 02:39, Walter Bright wrote:
>> Currently we do it by throwing a UTFException. This has problems:
>>
>> 1. about anything that deals with UTF cannot be made nothrow
>>
>> 2. turns innocuous errors into major problems, such as DOS attack vectors
>> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>>
>> One option to fix this is to treat invalid sequences as:
>>
>> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> If we're talking about decoding, then only dchar is relevant.
> If we're transcoding, then 0xFF makes for a broken UTF-8 encoding, so I see no sense in going for it.
>
>>
>> 2. U+FFFD
>>
>
> It also has the benefit of being recommended by the standard specifically as the substitute for bad encoding.
>
> Details:
> https://d.puremagic.com/issues/show_bug.cgi?id=12113

Ah, that's what I was looking for. The Wikipedia article was a bit wishy-washy about the whole thing.

>> I kinda like option 1.
>>
>
> Not enough of an argument ;)
>
>

March 22, 2014
On Friday, 21 March 2014 at 10:39:49 UTC, Denis Shelomovskij wrote:
> 21.03.2014 12:25, monarch_dodra wrote:
>> If I remember correctly, with a specially written UTF string, it *was*
>> possible to corrupt program state. I think. I need to double check. I
>> didn't give it much thought then ("it should virtually never happen"),
>> but it could be used as deliberate security vulnerability.
>
> Almost nothing to add here. We already have `-noboundscheck`, which can dramatically increase performance; throwing `UTFError` should either use the same mechanics (`-noutfcheck`?) or just be stripped in release builds. Personally I'd choose the latter, as real programs strip lots of (sometimes very slow) assertions with `-release`, and those indicate the same kind of critical data corruption.

Except it's a Unicode *Exception*. Invalid Unicode is *NOT* supposed to be an error.

Now I remember: truncated Unicode strings can cause out-of-bounds slicing in popFront.
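
A sketch of that failure mode: `stride` computes the expected sequence length from the lead byte alone, so a string truncated mid-code point reports a stride longer than what is actually there.

```d
import std.utf : stride;

void main()
{
    string s = "é";                   // UTF-8: 0xC3 0xA9
    string truncated = s[0 .. 1];     // slicing mid-code point keeps only 0xC3
    auto len = stride(truncated, 0);  // 2, derived from the lead byte alone
    assert(len > truncated.length);
    // truncated[len .. $] (what a naive popFront does) slices out of bounds
}
```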

This means we are currently operating under a double standard: sometimes an exception, sometimes an error, sometimes corruption.