Handling invalid UTF sequences - D Programming Language Discussion Forum



March 20, 2014

Handling invalid UTF sequences

Posted by Walter Bright

Currently we do it by throwing a UTFException. This has problems:

1. about anything that deals with UTF cannot be made nothrow

2. turns innocuous errors into major problems, such as DOS attack vectors
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

One option to fix this is to treat invalid sequences as:

1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

2. U+FFFD

I kinda like option 1.

What do you think?
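
For reference, a minimal D sketch of the values under discussion and of the current throwing behaviour. It assumes a reasonably recent Phobos in which `std.utf` exposes `replacementDchar`; the sample string is made up for illustration.

```d
import std.exception : assertThrown;
import std.utf : UTFException, replacementDchar, validate;

void main()
{
    // Option 1: the .init values of the three character types.
    static assert(char.init == 0xFF);
    static assert(wchar.init == 0xFFFF);
    static assert(dchar.init == 0xFFFF);

    // Option 2: U+FFFD, the Unicode replacement character.
    static assert(replacementDchar == 0xFFFD);

    // Current behaviour: validating a string with a stray 0xFF byte throws.
    // (\x escapes may place arbitrary bytes in a D string literal.)
    string bad = "abc\xFFxyz";
    assertThrown!UTFException(validate(bad));
}
```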

March 20, 2014

Re: Handling invalid UTF sequences

Posted by deadalnix
in reply to Walter Bright

On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

Hiding errors under the carpet is not a good strategy. These sequences are invalid, and doomed to explode at some point. I'm not sure what the solution is, but the .init one does not seem like the right one to me.

March 20, 2014

Re: Handling invalid UTF sequences

Posted by monarch_dodra
in reply to Walter Bright

On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

I had thought of this before, and had an idea along the lines of:
1. strings "inside" the program are always valid.
2. encountering invalid strings "inside" the program is an Error.
3. strings from the "outside" world must be validated before use.

The advantage is *more* than just a nothrow guarantee; it is also a performance guarantee in release. And it *is* a pretty sane approach to the problem:
- User data: validate before use.
- Internal data: if it's bad, your program is in a failure state.
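
A rough sketch of what "validate at the boundary" could look like in D. The helper names `fromOutside` and `countCodePoints` are hypothetical and only illustrate the split between external and internal data; `std.utf.validate` is the existing Phobos check.

```d
import std.exception : assertThrown;
import std.range : walkLength;
import std.utf : UTFException, validate;

/// Hypothetical boundary helper: the one place that may see invalid UTF-8.
string fromOutside(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;
    validate(text);              // throws UTFException on bad input
    return text.idup;
}

/// Hypothetical internal routine: it relies on its input being valid UTF-8.
size_t countCodePoints(string s)
{
    return s.walkLength;         // counts code points, not code units
}

void main()
{
    immutable(ubyte)[] good = [0x68, 0xC3, 0xA9];   // "hé" encoded as UTF-8
    assert(countCodePoints(fromOutside(good)) == 2);

    immutable(ubyte)[] bad = [0x68, 0xFF];          // 0xFF is never valid UTF-8
    assertThrown!UTFException(fromOutside(bad));
}
```

With this split, only `fromOutside` needs to deal with `UTFException`; anything past it treats invalid data as a bug rather than as input to be handled.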

----

As for your proposal, I can't really say. Silently accepting invalid sequences sounds nice at first, but it's kind of just squelching the problem, isn't it?

----

In any case, both proposals would be major breaking changes...

March 20, 2014

Re: Handling invalid UTF sequences

Posted by Nick Sabalausky
in reply to Walter Bright

On 3/20/2014 6:39 PM, Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

I'd have to give it some thought before having an opinion on the right solution; however, I do want to say the current UTFException throwing is something I've always been unhappy with. So it definitely should get addressed in some way.

March 20, 2014

Re: Handling invalid UTF sequences

Posted by Chris Williams
in reply to Walter Bright

To the extent possible, it should try to retain the data. But if ever the character is actually needed for something (like parsing JSON or displaying a glyph), the bad region should be replaced with a series of replacement characters:

http://en.wikipedia.org/wiki/Replacement_character#Replacement_character
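
As a hedged illustration of "retain the data, substitute only when decoding": in current Phobos, `std.utf.byDchar` gives a lazily decoded view that yields U+FFFD for invalid input while leaving the underlying bytes alone. The sample string is invented, and this shows existing library behaviour rather than the change proposed in this thread.

```d
import std.algorithm.searching : canFind;
import std.utf : byDchar, replacementDchar;

void main()
{
    string data = "ab\xFFcd";    // 0xFF can never appear in valid UTF-8

    // The stored bytes are untouched...
    assert(data.length == 5);
    assert(data[2] == 0xFF);

    // ...and only the decoded view shows U+FFFD where decoding failed.
    auto decoded = data.byDchar;
    assert(decoded.front == 'a');
    assert(data.byDchar.canFind(replacementDchar));
}
```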

March 20, 2014

Re: Handling invalid UTF sequences

Posted by Walter Bright
in reply to monarch_dodra

On 3/20/2014 3:51 PM, monarch_dodra wrote:
> In any case, both proposals would be major breaking changes...

Or we could do this as alternate names, leaving the originals as throwing.

> Silently accepting invalid sequences sounds nice at first, but it's kind of just squelching the problem, isn't it?

Not exactly. The decoded/encoded string will still have invalid code units in it. It'd be like floating point nan, the invalid bits will still be propagated onwards to the output.

I'm also of the belief that UTF sequences should be validated on input, not necessarily on every operation on them.

March 20, 2014

Re: Handling invalid UTF sequences

Posted by Brad Anderson
in reply to monarch_dodra

On Thursday, 20 March 2014 at 22:51:27 UTC, monarch_dodra wrote:
> On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
>> Currently we do it by throwing a UTFException. This has problems:
>>
>> 1. about anything that deals with UTF cannot be made nothrow
>>
>> 2. turns innocuous errors into major problems, such as DOS attack vectors
>> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>>
>> One option to fix this is to treat invalid sequences as:
>>
>> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>>
>> 2. U+FFFD
>>
>> I kinda like option 1.
>>
>> What do you think?
>
> I had thought of this before, and had an idea along the lines of:
> 1. strings "inside" the program are always valid.
> 2. encountering invalid strings "inside" the program is an Error.
> 3. strings from the "outside" world must be validated before use.
>
> The advantage is *more* than just a nothrow guarantee, but also a performance guarantee in release. And it *is* a pretty sane approach to the problem:
> - User data: validate before use.
> - Internal data: if it's bad, your program is in a failure state.
>

I'm a fan of this approach but Timon pointed out when I wrote about it once that it's rather trivial to get an invalid string through slicing mid-code point so now I'm not so sure. I think I'm still in favor of it because you've obviously got a logic error if that happens so your program isn't correct anyway (it's not a matter of bad user input).

March 21, 2014

Re: Handling invalid UTF sequences

Posted by Steven Schveighoffer
in reply to Walter Bright

On Thu, 20 Mar 2014 18:39:50 -0400, Walter Bright <newshound2@digitalmars.com> wrote:

> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

Can't say I like it. Especially since current code expects a throw.

I understand the need. What about creating a different type which decodes into a known invalid code, and doesn't throw? This leaves the selection of throwing or not up to the type, which is generally decided on declaration, instead of having to change all your calls.

-Steve

March 21, 2014

Re: Handling invalid UTF sequences

Posted by monarch_dodra
in reply to Brad Anderson

On Thursday, 20 March 2014 at 23:34:02 UTC, Brad Anderson wrote:
> I'm a fan of this approach but Timon pointed out when I wrote about it once that it's rather trivial to get an invalid string through slicing mid-code point so now I'm not so sure.

It's just as easy to slice mid-codepoint as it is to access a range out of bounds. In both cases, it's a programming error.
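
To make the slicing point concrete, here is a small D sketch (the sample string is made up). It shows a slice that cuts a code point in half, and one way to stay on a boundary using `std.utf.stride`.

```d
import std.exception : assertThrown;
import std.utf : UTFException, stride, validate;

void main()
{
    string s = "h\u00E9llo";     // 'é' occupies two UTF-8 code units
    assert(s.length == 6);

    // Slicing by code unit can cut 'é' in half; the result is invalid UTF-8.
    string broken = s[0 .. 2];
    assertThrown!UTFException(validate(broken));

    // stride() reports how many code units the code point at an index uses,
    // which keeps the slice on a code point boundary.
    size_t i = 1;
    assert(stride(s, i) == 2);
    string ok = s[0 .. i + stride(s, i)];
    assert(ok == "h\u00E9");
    validate(ok);                // does not throw
}
```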

The only excuse I see for throwing an exception for slicing mid-codepoint is that:
1. programmers are less aware of the issue, so it's more forgiving in a released program (nobody likes a crash).
2. arguably, it's not the *program* state that's bad. It's the *data*.

Well, in regards to "2", you could argue that program state and data state are one and the same.

> I think I'm still in favor of it because you've obviously got a logic error if that happens so your program isn't correct anyway (it's not a matter of bad user input).

If I remember correctly, with a specially written UTF string, it *was* possible to corrupt program state. I think. I need to double check. I didn't give it much thought then ("it should virtually never happen"), but it could be used as a deliberate security vulnerability.

March 21, 2014

Re: Handling invalid UTF sequences

Posted by Regan Heath
in reply to Walter Bright

On Thu, 20 Mar 2014 22:39:50 -0000, Walter Bright <newshound2@digitalmars.com> wrote:

> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

In Windows/Win32:

WideCharToMultiByte has flags for a bunch of similar behaviours and allows you to define a default char to use as a replacement in such cases.

swprintf when passed %S will convert a wchar_t UTF-16 argument into ASCII, and replaces invalid characters with ? as it does so.

swprintf_s (the safe version), IIRC, will invoke the invalid parameter handler for sequences which cannot be converted.

I think, ideally, we want some sensible default behaviour but also the ability to alter it globally, and even better in specific calls where it makes sense to do so (where flags/arguments can be passed to that effect).

So, the default behaviour could be to throw (therefore no breaking change) and we provide a function to change this to one of the other options, and another to select a replacement character (which would default to .init or U+FFFD).
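
A minimal sketch of the per-call variant of this idea, assuming a hypothetical `OnInvalid` policy enum and `decodeWith` wrapper (neither exists in Phobos). It is built on the existing throwing `std.utf.decode`, and leaves out the global switch.

```d
import std.exception : assertThrown;
import std.utf : decode, replacementDchar, UTFException;

/// Hypothetical per-call policy; not part of Phobos.
enum OnInvalid { throwException, replace }

/// Hypothetical wrapper around std.utf.decode with a selectable policy
/// and a caller-chosen replacement character.
dchar decodeWith(OnInvalid policy, string s, ref size_t i,
                 dchar replacement = replacementDchar)
{
    immutable start = i;
    try
    {
        return decode(s, i);
    }
    catch (UTFException e)
    {
        if (policy == OnInvalid.throwException)
            throw e;
        i = start + 1;           // skip one code unit so the caller makes progress
        return replacement;
    }
}

void main()
{
    string bad = "\xFFok";
    size_t i = 0;
    assertThrown!UTFException(decodeWith(OnInvalid.throwException, bad, i));

    i = 0;
    assert(decodeWith(OnInvalid.replace, bad, i) == replacementDchar);
    assert(i == 1);
    assert(decodeWith(OnInvalid.replace, bad, i) == 'o');

    i = 0;
    assert(decodeWith(OnInvalid.replace, bad, i, '?') == '?');
}
```

Catching the exception keeps the sketch short but does not make the wrapper nothrow-friendly in spirit; a real implementation would presumably avoid throwing internally.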

R



Copyright © 1999-2021 by the D Language Foundation