[Issue 14519] Get rid of unicode validation in string processing
[Issue 14519] [Enh] foreach on strings should return replacementDchar rather than throwing
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

Vladimir Panteleev <thecybershadow@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thecybershadow@gmail.com

--- Comment #1 from Vladimir Panteleev <thecybershadow@gmail.com> ---
As discussed in the newsgroup: please don't do this.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

Jonathan M Davis <issues.dlang@jmdavisProg.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |issues.dlang@jmdavisProg.com

--- Comment #2 from Jonathan M Davis <issues.dlang@jmdavisProg.com> ---
(In reply to Vladimir Panteleev from comment #1)
> As discussed in the newsgroup: please don't do this.

If anything, I thought that the consensus leaned towards making the change, but maybe I didn't pay enough attention to the discussion in question.

Certainly, I'm very much in favor of using replacementDchar on invalid UTF in general and having code explicitly validate Unicode if that's what's desired. There is some risk with making the change, since existing code may not handle the replacement character well, but very little string processing is going to actually care, and throwing exceptions like we currently do is not only a performance problem, it's hugely annoying when you need to actually process invalid Unicode, which is pretty easy to have happen if you're doing stuff like parsing websites or files which were written by programs that didn't handle Unicode correctly.
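
For readers following along, the two behaviours being compared can be sketched roughly as follows. This is an illustration only, assuming a compiler where foreach auto-decoding still throws on malformed UTF-8 (the behaviour this issue proposes to change). The helper names decodesWithoutThrowing and hasReplacement are made up for the example; std.utf.byDchar and std.utf.replacementDchar are existing Phobos symbols.

```d
import std.utf : byDchar, replacementDchar;

/// Hypothetical helper: decodes via foreach auto-decoding and
/// reports whether decoding completed without throwing.
bool decodesWithoutThrowing(string s)
{
    try
    {
        foreach (dchar c; s) {}
        return true;
    }
    catch (Exception e)
    {
        // Malformed UTF-8 surfaces here as an exception.
        return false;
    }
}

/// Hypothetical helper: decodes via std.utf.byDchar, which substitutes
/// U+FFFD for malformed sequences, and reports whether any showed up.
bool hasReplacement(string s)
{
    foreach (dchar c; s.byDchar)
        if (c == replacementDchar)
            return true;
    return false;
}

void main()
{
    // 'a', a lone continuation byte 0x80 (malformed UTF-8), 'b'.
    immutable ubyte[] raw = [0x61, 0x80, 0x62];
    auto s = cast(string) raw;

    assert(decodesWithoutThrowing("ab"));
    assert(!decodesWithoutThrowing(s)); // current foreach behaviour
    assert(hasReplacement(s));          // replacement-character behaviour
}
```

The proposal would, in effect, make the plain foreach loop behave like the byDchar loop.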

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #3 from Vladimir Panteleev <thecybershadow@gmail.com> ---
> it's hugely annoying when you need to actually process invalid unicode

Yeah, and it'll be a lot more than just "annoying" when you discover too late that your data's been irreversibly corrupted because it wasn't in the correct encoding.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #4 from Vladimir Panteleev <thecybershadow@gmail.com> ---
Here's a counter-proposal: when encountering invalid UTF-8, instead of throwing exceptions, throw errors. This will fix the nothrow and performance problems, and will avoid the risk of data corruption. The workaround is to pre-sanitize the input. The impact of breaking existing code is the same as the original proposal.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #5 from Sobirari Muhomori <dfj1esp02@sneakemail.com> ---
Or provide a global override similar to assertHandler.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #6 from Jonathan M Davis <issues.dlang@jmdavisProg.com> ---
(In reply to Vladimir Panteleev from comment #4)
> Here's a counter-proposal: when encountering invalid UTF-8, instead of throwing exceptions, throw errors. This will fix the nothrow and performance problems, and will avoid the risk of data corruption.


Yikes. That is far worse than throwing Exceptions, since it would kill your program, and it's indicative of a bug in the program rather than bad input.

> The workaround is to
> pre-sanitize the input. The impact of breaking existing code is the same as
> the original proposal.

Pre-sanitizing input is exactly what should be done if you care about unicode validation. You validate any strings entering the program from a file, a socket, or from user input, and then you know that you're operating on valid Unicode. But most programs just don't care about how valid the Unicode is, and the fact that throwing is how it's handled is incredibly annoying. It forces validation on all programs whether they need it or not, and it makes it so that string-based code can pretty much never be nothrow. Using the replacement character in the stead of invalid unicode is exactly what it was created for in the first place.
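
A minimal sketch of the "validate at the boundary" approach described above, assuming input arrives as raw bytes. The function name acceptText is hypothetical; std.utf.validate and UTFException are existing Phobos symbols.

```d
import std.utf : validate, UTFException;

/// Hypothetical boundary helper: accepts raw bytes only if they are
/// well-formed UTF-8, so later processing can assume valid text.
string acceptText(immutable(ubyte)[] raw)
{
    auto text = cast(string) raw;
    validate(text); // throws UTFException on malformed input
    return text;
}

void main()
{
    immutable ubyte[] good = [0x6F, 0x6B]; // "ok"
    immutable ubyte[] bad  = [0x6F, 0x80]; // 'o' plus a lone continuation byte

    assert(acceptText(good) == "ok");

    bool rejected;
    try
    {
        acceptText(bad);
    }
    catch (UTFException e)
    {
        // The error is reported where the data enters the program.
        rejected = true;
    }
    assert(rejected);
}
```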

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #7 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Jonathan M Davis from comment #6)
> Yikes. That is far worse than throwing Exceptions, since it would kill your program, and it's indicative of a bug in the program rather than bad input.

Yes. The bug is that the string should've been sanitized.

> But most programs just don't care about how valid the Unicode is,

Maybe most programs YOU write.

> and the fact that throwing is how it's handled is incredibly annoying.

I can see how it can be annoying - when you don't care about your data.

> It forces validation on all programs whether they need it or not,
> and it makes it so that string-based code can pretty much never be nothrow.

Throwing errors is allowed in nothrow code.

> Using the replacement character in the stead of invalid unicode is exactly what it was created for in the first place.

Yes, in circumstances when you don't care about the "invalid" data, which should always be opt-in.
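
The point about nothrow can be illustrated with a small sketch. The function requireAscii is a hypothetical stand-in, not anything in druntime or Phobos; it only shows that a nothrow function may throw an Error, whereas throwing an Exception from it would be rejected at compile time.

```d
/// Hypothetical stand-in for a decoder under the counter-proposal:
/// unsanitized input is treated as a program bug and reported with an Error.
char requireAscii(const(char)[] s, size_t i) nothrow
{
    if (s[i] >= 0x80)
    {
        // Allowed in nothrow code. Replacing Error with Exception here
        // would make the compiler reject the function.
        throw new Error("unsanitized input reached string processing");
    }
    return s[i];
}

void main()
{
    assert(requireAscii("abc", 1) == 'b');
}
```

As comment #6 notes, an uncaught Error normally terminates the program, which is the crux of the disagreement.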

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

bearophile_hugs@eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs@eml.cc

--- Comment #8 from bearophile_hugs@eml.cc ---
(In reply to Walter Bright from comment #0)

> Changing foreach to return replacementDchar on invalid UTF encodings fixes these problems, and makes it possible to do faster loops.

Another solution is to deprecate foreach iteration on strings, and require something like "foreach(c; mystring.byCharThrowing)" and similar things.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #9 from Jonathan M Davis <issues.dlang@jmdavisProg.com> ---
Most string-based functions work perfectly well with invalid Unicode. Does find care? Does startsWith? Does filter? The replacement character simply won't match what you're looking for. The functions themselves don't care. The replacement character is just another character. They need a way to deal with invalid Unicode, but the replacement character deals with that beautifully.

The concern is whether program input is valid - whether the user manages to type in invalid Unicode due to bad terminal settings, or whether you get junk off a socket, or whether a file has been corrupted. Anything that cares should be checking that when the data enters the program so that the error can be reported to whoever or wherever the data is coming from. Having it done via exceptions later on disconnects the reporting of the error from the point when it can actually be handled. What do you do if you read in an XML file and process half of it before you hit invalid Unicode? If the whole file was read into memory, then you may not even have any idea where that string came from, and it's likely far too late to report to the user that they're opening a corrupted file. That validation really needs to be done when the string enters the program - not at some arbitrary point later in the program when the invalid portion happens to be decoded. So, if you insist that all strings be validated, then maybe throwing an Error makes sense, but an Exception sure doesn't. And throwing an Error assumes that you always need to validate the Unicode in strings, which definitely isn't the case when the replacement character is used. So, throwing an Error is forcing everyone to validate the Unicode in their strings whether they care or not, and using the replacement character will work, whereas the programs that do care about validating their strings should be doing the validation up front anyway.

So, given that the code that cares about validation needs to be validating up front and therefore doesn't care about the replacement character being used later and that programs that don't care about validating their Unicode input will work just fine with the replacement character, it seems to me that it makes perfect sense to just use the replacement character rather than throwing.
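
As a rough illustration of the claim that searching functions treat the replacement character as just another character, the sketch below runs startsWith and canFind over text decoded with std.utf.byDchar. The helper name lossy and the sample bytes are assumptions made for the example.

```d
import std.algorithm.searching : canFind, startsWith;
import std.utf : byDchar, replacementDchar;

/// Hypothetical helper: views raw bytes as text, decoding malformed
/// sequences to U+FFFD instead of throwing.
auto lossy(immutable(ubyte)[] raw)
{
    return (cast(string) raw).byDchar;
}

void main()
{
    // "ab", a lone continuation byte 0x80 (malformed UTF-8), "cd".
    immutable ubyte[] raw = [0x61, 0x62, 0x80, 0x63, 0x64];
    auto text = lossy(raw);

    assert(text.startsWith("ab"));
    assert(text.canFind("cd"));
    // The bad byte becomes U+FFFD, which simply fails to match "bc".
    assert(!text.canFind("bc"));
    assert(text.canFind(replacementDchar));
}
```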

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #10 from Vladimir Panteleev <thecybershadow@gmail.com> ---
OK, I see from your post that you don't see many of the problems with the replacement character. Let me show you some example problematic situations:

1.

Bob wants to update his company's documents to use the new name for his product. He writes a program that does a recursive pattern search & replace in a directory. After testing the program on a few sample files, he is satisfied with the results, and runs the program on his company's document store.

Six months later, long after the documents went out of backup rotation, Sue finds that some important historical documents have been irreversibly corrupted and full of Unicode replacement characters encoded as UTF-8. Why? Because these old documents did not use UTF-8, and Bob used D.

2.

Bob is writing a secure server-side software package (let's say, a confidential
document store). He is using a std.algorithm-based hashing algorithm to store
the passwords securely. At some point, Mary signs up and creates a secure
password, which contains entirely Cyrillic letters (let's say, "ЭтоМойПароль").

Not long after, Eve successfully logs into Mary's account with the password
"ЯЯЯЯЯЯЯЯЯЯЯЯ". Why? Because the passwords just happened to be sent in some
non-UTF-8 encoding, and, since Bob used D, when "normalized" through
std.algorithm's replacement character substitution, all Unicode-only passwords
of the same length have the same hash.
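
The collapse described in the second scenario can be sketched as below. This is only an illustration, assuming the passwords arrived as Windows-1251 bytes and were decoded with replacement via std.utf.byDchar. The helper name sameAfterReplacement and the three-letter samples are made up for the example.

```d
import std.algorithm.comparison : equal;
import std.utf : byDchar;

/// Hypothetical helper: true if two byte strings become indistinguishable
/// once malformed sequences are decoded to U+FFFD.
bool sameAfterReplacement(immutable(ubyte)[] a, immutable(ubyte)[] b)
{
    return equal((cast(string) a).byDchar, (cast(string) b).byDchar);
}

void main()
{
    // "ЭТО" and "ЯЯЯ" encoded as Windows-1251, which is not valid UTF-8.
    immutable ubyte[] first  = [0xDD, 0xD2, 0xCE];
    immutable ubyte[] second = [0xDF, 0xDF, 0xDF];

    assert(first != second);                     // different raw input
    assert(sameAfterReplacement(first, second)); // identical once decoded
}
```

Anything computed from the decoded characters, such as a hash, would therefore be the same for both inputs.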

Automatic use of the replacement character will come as a surprise to many people who come from other languages. For example, in Delphi, strings are also the de-facto ubyte[] / void[] type - you can safely read a binary file into a string, perform search and replace, and write it back, knowing that the result will be exactly what you expected.

Furthermore, from your message it appears to me that you've missed the point of my argument:

> What do you do if you read in an XML file and process half of it before you hit invalid Unicode?

You abort! This should not happen. Either the XML file is in an incorrect encoding (which calls into question the integrity of all the data parsed so far - what if it was some 8-bit encoding that only LOOKED like valid UTF-8?) or the program should've sanitized the input first if it really didn't care about data correctness. But this is an XML file, meaning it's very likely to be machine generated - if it contains errors, it might indicate a problem somewhere else in the system, which is why it's all the more important to abort and get the user to figure out the true source of the problem. Ignoring the error here reminds me of how PHP never stops on errors by default, or Basic's "ON ERROR RESUME NEXT".

> So, throwing an Error is forcing everyone to validate the Unicode in their strings whether they care or not, and using the replacement character will work, whereas the programs that do care about validating their strings should be doing the validation up front anyway.

Yes, but then there is no way to make sure you're not accidentally corrupting data! Whereas today we at least have a runtime check against invalid UTF-8, with this change we will have no check at all. With no automatic mechanism to ensure that all text is sanitized before it gets into std.algorithm, it becomes impossible to be sure that you're not accidentally corrupting data along the way.

--