On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote:
>On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:
Rust has more than ten kinds
of strings. Maybe we can add
two or three.
November 07, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to jfondren | On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote: >On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote: Rust has more than ten |
November 06, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to FeepingCreature | On 11/4/2021 11:30 PM, FeepingCreature wrote:
> These programs are *wrong.* They thought they could only get Unicode and they've gotten non-Unicode. So we know they're written on wrong assumptions; why do we want to continue running code we know is untrustworthy? Let them crash, let them be fixed to make fewer assumptions. Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
NaN and ReplacementChar are not valid and are easily distinguished.
|
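A minimal sketch of the distinction being argued here, written in Python rather than D and using made-up temperature readings (the `mean` helper and both lists are hypothetical). A missing reading stored as 0.0 yields a plausible-looking average, while one stored as NaN poisons the result in a way a later check can detect:

```python
import math


def mean(values):
    """Plain arithmetic mean; performs no validity checks of its own."""
    return sum(values) / len(values)


# Hypothetical readings where the middle sample is missing.
MISSING_AS_ZERO = [36.6, 0.0, 37.0]          # placeholder looks like real data
MISSING_AS_NAN = [36.6, float("nan"), 37.0]  # placeholder is marked invalid

zero_result = mean(MISSING_AS_ZERO)  # a finite, plausible-looking number
nan_result = mean(MISSING_AS_NAN)    # NaN propagates through the arithmetic

# Only the NaN variant can be detected after the fact.
print(math.isnan(zero_result))
print(math.isnan(nan_result))
```

Neither version fails on its own; the NaN only helps if something downstream actually calls `math.isnan`.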
November 07, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to norm | On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:
> It isn't always that simple, e.g. working on medical devices crashing isn't an option when it comes to how we're going to deal with bad data.
Mm, I have a totally different take on this. In my view all incoming data should be sanitised on entry into the application, this takes place at what I think of as leaf nodes in the application. This sanitisation includes conversion of all measurements into standard units, checking validity of strings etc.
Once data has entered the main application then the application should fail fast. This is especially important for medical devices. This allows the developers of the application to see, early in development, problems with their code and the logic thereof. Signs of developers ignoring the fail fast principle include a disease I've identified where
If I am on a ventilator and the program enters a state that the programmer did not anticipate, then life can start to get very uncomfortable for me. I would far prefer that it stopped, coughed up an error code, and the medical staff can unplug it and (quickly, I hope) replace it with another one.
If there is actually a scenario where staggering on is considered better, then at the very least it should be under instruction from the programmer. The idea of the language runtime silently modifying application data is somewhat frightening for me in this scenario.
|
November 07, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
> On 11/4/2021 11:30 PM, FeepingCreature wrote:
>> [...] Let them crash, [...] Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
> It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
Technically it makes no difference if you do not check for 0.0 or not for NaN. What makes a difference is using "out of band signalling" (exceptions) if its default behavior is process termination.
> NaN and ReplacementChar are not valid
The replacement character '�' is a valid Unicode codepoint (U+FFFD).
> and are easily distinguished.
Someone may forget to write explicit code to handle this case which most likely leads to data corruption. I choose stack trace over potentially corrupted data.
|
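To make the U+FFFD point concrete, here is a small Python sketch (not D; `decode_strict` and `decode_lossy` are hypothetical helper names, and the byte strings are made up). Strict decoding signals the bad byte out of band with an exception, while lossy decoding returns a string that is indistinguishable from input that really contained U+FFFD:

```python
REPLACEMENT = "\ufffd"  # U+FFFD, itself a valid Unicode code point

BAD = b"caf\xff"                      # 0xFF can never appear in valid UTF-8
LEGIT = "caf\ufffd".encode("utf-8")   # valid UTF-8 that really contains U+FFFD


def decode_strict(data: bytes) -> str:
    """Out-of-band signalling: raises UnicodeDecodeError on invalid input."""
    return data.decode("utf-8")


def decode_lossy(data: bytes) -> str:
    """In-band signalling: invalid bytes become U+FFFD."""
    return data.decode("utf-8", errors="replace")


# After lossy decoding, the corrupted and the legitimate input look the same.
print(decode_lossy(BAD) == decode_lossy(LEGIT))

try:
    decode_strict(BAD)
except UnicodeDecodeError as err:
    print("strict decoding failed:", err.reason)
```

Note that Python's default is strict decoding; the lossy behaviour has to be requested explicitly.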
November 07, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to kdevel | On Sunday, 7 November 2021 at 16:28:33 UTC, kdevel wrote:
> On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
>> On 11/4/2021 11:30 PM, FeepingCreature wrote:
>>> [...]
>> It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
>
> Technically it makes no difference if you do not check for 0.0 or not for NaN. What makes a difference is using "out of band signalling" (exceptions) if its default behavior is process termination.
>
>> NaN and ReplacementChar are not valid
>
> The replacement character '�' is a valid Unicode codepoint (U+FFFD).
>
>> and are easily distinguished.
>
> Someone may forget to write explicit code to handle this case which most likely leads to data corruption. I choose stack trace over potentially corrupted data.
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
|
November 07, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to zjh | On Sunday, 7 November 2021 at 02:12:36 UTC, zjh wrote: >On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote: >On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote: Rust has more than ten Meanwhile, in Rust:
If you smuggle invalid UTF into a type that Rust expects to be valid UTF (the same case as 104, 101, 108, 108, 110 - "hello" This is similar to |
November 07, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to kdevel | On 11/7/2021 8:28 AM, kdevel wrote:
> Technically it makes no difference if you do not check for 0.0 or not for NaN.
Yes, it does. 0.0 is not distinguishable from valid data. NaN is.
|
November 07, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to Imperatorn | On 11/7/2021 8:46 AM, Imperatorn wrote:
> https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
The money quote:
"By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. For more information, see Replacement fallback and Exception fallback."
|
November 08, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Sunday, 7 November 2021 at 23:29:39 UTC, Walter Bright wrote:
> On 11/7/2021 8:46 AM, Imperatorn wrote:
>> https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
>
> The money quote:
>
> "By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. For more information, see Replacement fallback and Exception fallback."
💲💲💲
|
November 08, 2021 Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
> It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value. NaN and ReplacementChar are not valid and are easily distinguished.
No, that's exactly the problem. ReplacementChar is not easily distinguished, because it's a valid Unicode character - that's the whole point of it. So just like NaN, it can propagate arbitrarily far through your processing pipeline before some downstream process decides that it actually doesn't like it. And at that point you generally have no chance to recover the source of the issue - you know that something maybe has gone wrong, but you don't even know if it was in your process or in the input data. After all, if you were screening your input data for ReplacementChar, you could as easily have been screening it for invalid UTF-8 to begin with.
So while yes it's marginally better than 0.0, because at least you know that something is wrong, it does as little as possible to help you locate the problem while technically informing you. And all the workarounds for that take the form of "throw everywhere where a ReplacementChar could be generated." So imo just do the equivalent of turning on FE_INVALID, and do that to begin with. There's no point to getting rid of throw sites when you just force the user to re-add them manually because they fulfill a genuine need.
IMO if you want to get rid of the exception overhead, I'd go the other way and make invalid Unicode an abort(). Check your input data, people.
|
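One way to read "check your input data" is to decode strictly at the boundary and keep the location of the failure, which a replacement character cannot carry. A hedged Python sketch (not D), assuming a hypothetical `validate_utf8` helper and a made-up source label:

```python
def validate_utf8(data: bytes, source: str) -> str:
    """Decode at the input boundary and fail fast.

    `source` is a hypothetical label (file name, device id, ...) used only
    to make the error message point at where the bad bytes came from.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        raise ValueError(
            f"{source}: invalid UTF-8 at byte offset {err.start}"
        ) from err


# Valid input passes through unchanged.
print(validate_utf8(b"hello", "sensor-feed"))

# Invalid input is rejected immediately, with its origin and offset.
try:
    validate_utf8(b"hel\xfflo", "sensor-feed")
except ValueError as exc:
    print(exc)
```

Once everything past this point is known to be valid, downstream code has no need for replacement characters or further throw sites.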