dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead (page 9)

November 07, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by zjh
in reply to jfondren

On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote:

On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:

Rust has more than ten kinds of strings. Maybe we can add two or three.

November 06, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Walter Bright
in reply to FeepingCreature

On 11/4/2021 11:30 PM, FeepingCreature wrote:
> These programs are *wrong.* They thought they could only get Unicode and they've gotten non-Unicode. So we know they're written on wrong assumptions; why do we want to continue running code we know is untrustworthy? Let them crash, let them be fixed to make fewer assumptions. Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.

It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.

NaN and ReplacementChar are not valid and are easily distinguished.

November 07, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Abdulhaq
in reply to norm

On Friday, 5 November 2021 at 10:09:40 UTC, norm wrote:

> It isn't always that simple, e.g. when working on medical devices, crashing isn't an option when it comes to how we're going to deal with bad data.

Mm, I have a totally different take on this. In my view all incoming data should be sanitised on entry into the application; this takes place at what I think of as leaf nodes in the application. This sanitisation includes conversion of all measurements into standard units, checking the validity of strings, etc.

Once data has entered the main application then the application should fail fast. This is especially important for medical devices. This allows the developers of the application to see, early in development, problems with their code and the logic thereof.
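A minimal sketch of the "sanitise at the leaf nodes, then fail fast" idea, written in Rust because that is the language of the other code in this thread. `CheckedText`, `from_bytes`, and `count_chars` are hypothetical names invented for illustration; only `std::str::from_utf8` and `Utf8Error` come from the standard library. It is one possible reading of the approach, not a description of how any particular application does it.

```rust
use std::str::Utf8Error;

/// Hypothetical wrapper: text that has already been validated at the boundary.
#[derive(Debug)]
struct CheckedText<'a>(&'a str);

impl<'a> CheckedText<'a> {
    /// The only constructor. Rejects any byte sequence that is not valid UTF-8.
    fn from_bytes(raw: &'a [u8]) -> Result<Self, Utf8Error> {
        std::str::from_utf8(raw).map(CheckedText)
    }

    fn as_str(&self) -> &str {
        self.0
    }
}

/// Interior code accepts only CheckedText, so it never sees unvalidated bytes.
fn count_chars(text: &CheckedText<'_>) -> usize {
    text.as_str().chars().count()
}

fn main() {
    // Valid input passes the boundary check and reaches the interior code.
    match CheckedText::from_bytes(b"hello there") {
        Ok(text) => println!("accepted, {} chars", count_chars(&text)),
        Err(e) => println!("rejected at the boundary: {}", e),
    }

    // 0xa7 cannot start a UTF-8 sequence, so this input is rejected on entry.
    let bad: [u8; 3] = [0x68, 0x69, 0xa7];
    match CheckedText::from_bytes(&bad) {
        Ok(text) => println!("accepted, {} chars", count_chars(&text)),
        Err(e) => println!("rejected at the boundary: {}", e),
    }
}
```

The design choice here is that the error is reported once, at the point of entry, where the source of the bad data is still known.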

Signs of developers ignoring the fail fast principle include a disease I've identified where if (x is null) is seen to start proliferating through the code. This happens when you are calling a function that you did not write and one day you find it has returned null, you don't know why. So you add an if (null) return null to your code and carry on. This allows the program to stagger on in the face of being in a state that is not understood by the developer.

If I am on a ventilator and the program enters a state that the programmer did not anticipate, then life can start to get very uncomfortable for me. I would far prefer that it stopped, coughed up an error code, and the medical staff can unplug it and (quickly, I hope) replace it with another one. If there is actually a scenario where staggering on is considered better, then at the very least it should be under instruction from the programmer. The idea of the language runtime silently modifying application data is somewhat frightening for me in this scenario.

November 07, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by kdevel
in reply to Walter Bright

On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
> On 11/4/2021 11:30 PM, FeepingCreature wrote:
>> [...] Let them crash, [...] Automagically handling errors by propagating them in an inert form robs the developers and users of a chance to avoid a mistake. It's no better than 0.0.
> It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.

Technically it makes no difference whether you neglect to check for 0.0 or for NaN. What makes a difference is using "out of band signalling" (exceptions) if its default behavior is process termination.

> NaN and ReplacementChar are not valid

The replacement character '�' is a valid Unicode codepoint (U+FFFD).

> and are easily distinguished.

Someone may forget to write explicit code to handle this case which most likely leads to data corruption. I choose stack trace over potentially corrupted data.
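To make the point about U+FFFD concrete, here is a small Rust sketch. It assumes lossy decoding via the standard library's `String::from_utf8_lossy`; `decode_lossy` is a hypothetical helper name. It shows that, after decoding, text produced from invalid bytes cannot be told apart from valid text that already contained U+FFFD.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode bytes, substituting U+FFFD for invalid sequences.
fn decode_lossy(raw: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(raw)
}

fn main() {
    // Input 1: invalid UTF-8 (0xa7 cannot start a sequence).
    let invalid: [u8; 2] = [b'a', 0xa7];
    // Input 2: valid UTF-8 in which U+FFFD was present on purpose.
    let valid = "a\u{FFFD}";

    let from_invalid = decode_lossy(&invalid);
    let from_valid = decode_lossy(valid.as_bytes());

    // Both results are the same string; the origin of the U+FFFD is lost.
    assert_eq!(from_invalid, from_valid);
    println!(
        "contains U+FFFD: {}",
        from_invalid.contains(char::REPLACEMENT_CHARACTER)
    );
}
```

A later check for U+FFFD can therefore say that something may be wrong, but not whether the input was actually malformed.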

November 07, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Imperatorn
in reply to kdevel

On Sunday, 7 November 2021 at 16:28:33 UTC, kdevel wrote:
> On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:
>> On 11/4/2021 11:30 PM, FeepingCreature wrote:
>>> [...]
>> It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
>
> Technically it makes no difference whether you neglect to check for 0.0 or for NaN. What makes a difference is using "out of band signalling" (exceptions) if its default behavior is process termination.
>
>> NaN and ReplacementChar are not valid
>
> The replacement character '�' is a valid Unicode codepoint (U+FFFD).
>
>> and are easily distinguished.
>
> Someone may forget to write explicit code to handle this case which most likely leads to data corruption. I choose stack trace over potentially corrupted data.

https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding

November 07, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by jfondren
in reply to zjh

On Sunday, 7 November 2021 at 02:12:36 UTC, zjh wrote:

On Sunday, 7 November 2021 at 01:59:47 UTC, jfondren wrote:

On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:

Rust has more than ten kinds of strings. Maybe we can add two or three.

Meanwhile, in Rust:

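#[cfg(test)]
mod tests {
    // Returns the compiler's name for the type of the value passed in.
    fn type_of<T>(_: T) -> &'static str {
        core::any::type_name::<T>()
    }
    // "hello", then three bytes that are not valid UTF-8 (0xa7, 0x85, 0xaf), then "there".
    // This deliberately violates the safety contract of from_utf8_unchecked.
    const INVALID: &'static str = unsafe {
        std::str::from_utf8_unchecked(&[
            0x68, 0x65, 0x6c, 0x6c, 0x6f, 0xa7, 0x85, 0xaf, 0x74, 0x68, 0x65, 0x72, 0x65,
        ])
    };
    // Run with `cargo test -- --nocapture` to see the printed output.
    #[test]
    fn iter_invalid() {
        for c in INVALID.chars() {
            // Prints the type name, the numeric code point, and the char itself.
            println!("{} {}, {}", type_of(c), c as u32, c);
        }
    }
}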

If you smuggle invalid UTF into a type that Rust expects to be valid UTF (the same case as string in D, allegedly), then Rust's equivalent of foreach (dchar c; str) { } just emits invalid chars -- two of 'em, somehow.

104, 101, 108, 108, 111 - "hello"
453, 1012 - ???
104, 101, 114, 101 - "here" (the 't' is lost)

This is similar to foreach (dchar c; std.encoding.codePoints(str)) { } which emits three dchars between "hello" and "there", but which also has an assert failure in non-release builds.
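For comparison, a sketch of what the safe standard-library paths do with the same 13 bytes, without `unsafe`. `BYTES` and `replacement_count` are names chosen for this illustration; `std::str::from_utf8` and `String::from_utf8_lossy` are the standard calls. On my reading of the standard library, the strict path should report the error at byte offset 5, and the lossy path should yield one U+FFFD per stray byte while keeping the 't'.

```rust
// The same bytes as in the test above: "hello", three stray bytes, "there".
const BYTES: [u8; 13] = [
    0x68, 0x65, 0x6c, 0x6c, 0x6f, 0xa7, 0x85, 0xaf, 0x74, 0x68, 0x65, 0x72, 0x65,
];

/// Counts how many U+FFFD characters lossy decoding produces for `raw`.
fn replacement_count(raw: &[u8]) -> usize {
    String::from_utf8_lossy(raw)
        .chars()
        .filter(|&c| c == '\u{FFFD}')
        .count()
}

fn main() {
    // Strict decoding: returns an error describing where validation stopped.
    match std::str::from_utf8(&BYTES) {
        Ok(s) => println!("valid: {}", s),
        Err(e) => println!(
            "invalid after {} valid bytes (error_len = {:?})",
            e.valid_up_to(),
            e.error_len()
        ),
    }

    // Lossy decoding: substitutes U+FFFD and keeps going.
    let lossy = String::from_utf8_lossy(&BYTES);
    println!("lossy: {}", lossy);
    println!("replacement characters: {}", replacement_count(&BYTES));
}
```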

November 07, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Walter Bright
in reply to kdevel

On 11/7/2021 8:28 AM, kdevel wrote:
> Technically it makes no difference whether you neglect to check for 0.0 or for NaN.

Yes, it does. 0.0 is not distinguishable from valid data. NaN is.
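A small Rust sketch of the distinction being drawn, under the assumption that a failed parse is reported through the return value rather than an exception. `parse_or_zero` and `parse_or_nan` are hypothetical helpers made up for this illustration.

```rust
/// Hypothetical helper: reports a parse failure as 0.0, which is also a legitimate reading.
fn parse_or_zero(s: &str) -> f64 {
    s.trim().parse().unwrap_or(0.0)
}

/// Hypothetical helper: reports a parse failure as NaN.
/// (Note: Rust's own f64 parser also accepts the literal text "NaN".)
fn parse_or_nan(s: &str) -> f64 {
    s.trim().parse().unwrap_or(f64::NAN)
}

fn main() {
    for input in ["1.5", "0.0", "oops"] {
        let z = parse_or_zero(input);
        let n = parse_or_nan(input);
        // With the 0.0 convention, "0.0" and "oops" give the same result.
        // With the NaN convention, the failure can still be detected afterwards.
        println!(
            "{:?}: or_zero = {}, or_nan = {}, flagged = {}",
            input,
            z,
            n,
            n.is_nan()
        );
    }
}
```

Whether anyone actually calls `is_nan()` downstream is the separate question raised earlier in the thread.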

November 07, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Walter Bright
in reply to Imperatorn

On 11/7/2021 8:46 AM, Imperatorn wrote:
> https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding

The money quote:

"By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. For more information, see Replacement fallback and Exception fallback."

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Imperatorn
in reply to Walter Bright

On Sunday, 7 November 2021 at 23:29:39 UTC, Walter Bright wrote:
> On 11/7/2021 8:46 AM, Imperatorn wrote:
>> https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
>
> The money quote:
>
> "By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. For more information, see Replacement fallback and Exception fallback."

💲💲💲

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by FeepingCreature
in reply to Walter Bright

On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:

> It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.
>
> NaN and ReplacementChar are not valid and are easily distinguished.

No, that's exactly the problem. ReplacementChar is not easily distinguished, because it's a valid Unicode character; that's the whole point of it. So just like NaN, it can propagate arbitrarily far through your processing pipeline before some downstream process decides that it actually doesn't like it. And at that point you generally have no chance to recover the source of the issue: you know that something may have gone wrong, but you don't even know whether it was in your process or in the input data. After all, if you were screening your input data for ReplacementChar, you could just as easily have been screening it for invalid UTF-8 to begin with.

So while yes, it's marginally better than 0.0, because at least you know that something is wrong, it does as little as possible to help you locate the problem while technically informing you. And all the workarounds for that take the form of "throw everywhere a ReplacementChar could be generated." So imo just do the equivalent of turning on FE_INVALID, and do that to begin with. There's no point in getting rid of throw sites when you just force the user to re-add them manually because they fulfill a genuine need.

IMO if you want to get rid of the exception overhead, I'd go the other way and make invalid Unicode an abort(). Check your input data, people.
