[Issue 17861] UTF Decode fails with exception

Sep 26, 2017

Steven Schveighoffer

Sep 26, 2017

Sep 26, 2017

Sep 26, 2017

Sep 26, 2017

Sep 26, 2017

Sep 27, 2017

Oct 03, 2017

Oct 03, 2017

Oct 03, 2017

Oct 03, 2017

Oct 03, 2017

Oct 03, 2017

Oct 03, 2017

Oct 03, 2017

Dec 17, 2022

Dec 07

dlangBugzillaToGithub

https://issues.dlang.org/show_bug.cgi?id=17861 Steven Schveighoffer <schveiguy@yahoo.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |schveiguy@yahoo.com --- Comment #1 from Steven Schveighoffer <schveiguy@yahoo.com> --- std.utf.decode accepts a parameter to allow for exactly what you ask: import std.utf, std.typecons; void main() { size_t idx = 0; auto x = decode!(Yes.useReplacementDchar)("\xff", idx); assert(x == 0xfffd); } What is the actual use case you are using the druntime version for? --

https://issues.dlang.org/show_bug.cgi?id=17861 --- Comment #2 from Etienne <etcimon@gmail.com> --- That's not terribly useful as it is hidden behind the whole of std.string, which is what actually deals with UTF8 text. --

https://issues.dlang.org/show_bug.cgi?id=17861 --- Comment #3 from Steven Schveighoffer <schveiguy@yahoo.com> --- I looked at decode because that is what I thought you were referring to. What is the actual use case? What is the code that causes the exception? It's hard to suggest help when you are simply asking for a significant (and breaking) implementation change. --

https://issues.dlang.org/show_bug.cgi?id=17861 --- Comment #4 from Etienne <etcimon@gmail.com> --- I was encountering this in the vibe.d parseJsonString function, when decoding some Json string with user-generated data. About a year ago I found this was the simplest way to patch this up but I'll run another test case. I don't remember the exact trace but I see that it does use std.decode without the replacementDchar option, so maybe that would solve it. I remember it being due to phobos calling it though, not sure where. I have an issue with the idea that string parsers can throw on a whim and break the application, because as I said this user generated content can cause a crash that way. --

https://issues.dlang.org/show_bug.cgi?id=17861 --- Comment #5 from Steven Schveighoffer <schveiguy@yahoo.com> --- (In reply to Etienne from comment #4) > I have > an issue with the idea that string parsers can throw on a whim and break the > application, because as I said this user generated content can cause a crash > that way. This is understandable, but it is going to be difficult to justify a change from the default, as much code may depend on the current default. --

https://issues.dlang.org/show_bug.cgi?id=17861 --- Comment #6 from Etienne <etcimon@gmail.com> --- I think it might have thrown this exception in the foreach loop of a string due to an invalid code point. I don't see `rt.decode` called anywhere else, not even in phobos. I'm not sure anyone has actually planned for this behavior, because it's so marginal, undocumented and difficult to trigger, so I'd say it might affect mostly some unit tests that could expect it. Where in documentation does it say that foreach char[] can throw on unicode error? https://tour.dlang.org/tour/en/gems/unicode This specialized article doesn't even talk about any kind of exceptions. --

https://issues.dlang.org/show_bug.cgi?id=17861 Jon Degenhardt <jrdemail2000-dlang@yahoo.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |jrdemail2000-dlang@yahoo.co | |m --- Comment #7 from Jon Degenhardt <jrdemail2000-dlang@yahoo.com> --- (Not sure a bug is the correct place to discuss, but...) In many of the apps I work on it's important that the application have control over the error handling behavior when an invalid UTF-8 sequence is encountered. Any time data is received from an external source. When using Phobos, it's often necessary to do this validation on initial input without giving control to Phobos routines. This is not always the most convenient. What would be really useful would be to have some sort of configurable utf-8 error handling setting that could be established for a scope. e.g. When entering a scope, set the invalid character behavior to replace, drop, or throw. The low level routines like std.utf.decode would obey the settings. When exiting the scope the previous error handling setting would be re-established. A somewhat similar idea came up in a forum thread recently with respect to establishing the precision used converting floating point numbers to strings with std.conv.to (https://forum.dlang.org/thread/xlbiekgdijcxwqjsrika@forum.dlang.org). Don't know how hard something like this would be, perhaps prohibitively hard, but it would be very pragmatic for many production applications. --

October 03, 2017

[Issue 17861] UTF Decode fails with exception

Posted by Jonathan M Davis

Permalink

Jonathan M Davis

Permalink

https://issues.dlang.org/show_bug.cgi?id=17861

Jonathan M Davis <issues.dlang@jmdavisProg.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |issues.dlang@jmdavisProg.co
                   |                            |m
           Hardware|x86                         |All
                 OS|Windows                     |All

--- Comment #8 from Jonathan M Davis <issues.dlang@jmdavisProg.com> ---
This has been discussed before. There's a strong argument for making it so that decode uses the replacement character by default (it's even what the Unicode standard says you should do), and all string-based stuff then follows suit, at which point anyone wanting exceptions would need to call decode manually with the template argument indicating that that's what they wanted - which is the opposite of what we have now. And Walter is actually in favor of using the replacement character instead of exceptions and possibly even making the change in spite of the issues, but there have been some folks who have been strongly opposed to that. The problem is twofold:

1. Making the change risks silently breaking a ton of code.

2. Others (Vladimir in particular IIRC) have argued about how negative it is to have the contents of strings silently changed, since there are cases where it would be highly detrimental for that to happen.

And on some level, all of this gets wrapped into the auto-decoding debate, because that's the main reason that this is out of the control of the user. front and popFront on strings call decode for you and call it in the way that results in exceptions on invalid UTF instead of using the replacement character. Anyone making the calls manually has the choice.

So, I think that the chances are very high that we would go with the replacement character by default rather than exceptions (maybe not even have the exceptions at all) if we were starting from scratch - just like we wouldn't have auto-decoding if we were starting from scratch. But it's highly questionable that we can get away with making the change now due to the ramifications that it will have on existing code.

At this point, the situation with decoding code points and not having it throw is in pretty much the same boat as using strings with range-based code and not auto-decoding: you have to use wrappers like byCodeUnit and/or special-case your code on strings. And to avoid the exceptions on bad Unicode, you either have to not be decoding code points, or you need to do so yourself with std.utf.decode. No, that's not ideal, but no one has been able to come up with a reasonable way to change the status quo with any kind of reasonable deprecation process.

--

https://issues.dlang.org/show_bug.cgi?id=17861 --- Comment #9 from Etienne <etcimon@gmail.com> --- The std.utf.decode debate is a rough one and seems to have its workarounds. On the other hand, you have this forgotten code at rt.util.utf where everything is so low-level and there is no option to use a replacement character. This is something that comes up on a foreach loop on a string. I don't know why anyone would expect foreach to throw? It's a little strange. Obviously the whole library should default to a replacement character, but rt.util.utf is a little more urgent because of how much of a surprise it is for a foreach loop to be throwing, whereas you don't really expect it to behave like a function and it can happen in edge cases on release builds where things become very difficult to debug. --

https://issues.dlang.org/show_bug.cgi?id=17861 --- Comment #10 from Jonathan M Davis <issues.dlang@jmdavisProg.com> --- If you don't want foreach to throw when decoding, then don't use foreach to decode. Call std.utf.decode yourself rather than having foreach automatically call the druntime version for you. We can't change foreach right now for exactly the same reasons that we can't change front and popFront for strings. If it were decided that the breakage that would result from changing front and popFront to use the replacement character was low enough that we were willing to just change them and let things break, then we'd change foreach. But as long as it's considered too risk to change front and popFront to not throw, it's too risky to change foreach. There is nothing special about foreach that makes it any different from front and popFront for strings in terms of whether we can risk breaking code. I'm sorry that you find the current behavior surprising, but string handling in D in general requires understanding a number of quirks that we'd like to change but can't due to how much existing code would break if we did make the desired changes. --

Forums