dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead (page 2)


November 04, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Dukc
in reply to Adam D Ruppe

On Thursday, 4 November 2021 at 02:34:54 UTC, Adam D Ruppe wrote:
>
> I agree it is a good idea. If you want an exception, it is easy enough to just check it in the loop and throw then.
>
> Let's do it.

Plus, the present behavior is inconsistent with the rest of the language. Implicit language-level conversions in D do not usually throw.
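A minimal sketch of the "check it in the loop and throw" idea quoted above, assuming decoding goes through std.utf.byUTF with its default replacement policy. The helper name decodeOrThrow is hypothetical, and a plain Exception stands in for whatever error type a real program would pick. Note that the check cannot tell a decoding failure from a U+FFFD that was already in the text.

```d
import std.utf : byUTF, replacementDchar;

// Hypothetical helper, not part of Phobos: decode leniently, then throw by hand.
dstring decodeOrThrow(string text)
{
    dstring result;
    foreach (dchar c; text.byUTF!dchar) // default policy yields U+FFFD on bad input
    {
        if (c == replacementDchar)
            throw new Exception("invalid UTF-8 (or a literal U+FFFD) in input");
        result ~= c;
    }
    return result;
}

unittest
{
    import std.exception : assertThrown;

    assert(decodeOrThrow("hello") == "hello"d);
    assertThrown(decodeOrThrow("hello\247\205\257there"));
}
```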

November 04, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Steven Schveighoffer
in reply to Walter Bright

On 11/3/21 10:26 PM, Walter Bright wrote:
> https://issues.dlang.org/show_bug.cgi?id=22473
>
> I've tried to fix this before, but too many people objected.
>
> Are we fed up with this yet? I sure am.
>
> Who wants to take up this cudgel and fix the durned thing once and for all?
>
> (It's unclear if it would even break existing code.)

Honestly, I'd say foreach(dchar c; somestr) should not work.

1. It's slow and calls opaque functions.

2. It adds more requirements to the runtime that are simply solved by basic wrappers.

3. If writing wrappers, you can decide what you want.

4. It gets people used to language-magic character conversion, when this doesn't work on ranges of char that aren't arrays -- which then performs integer promotion.

What I would not suggest, though, is to just disable the feature. If it falls back to integer promotion (which is the worst thing ever for characters), then tons and tons of code will break, and much code will silently keep working only for English strings.

Autodecoding might be a huge problem with Phobos, but character promotion is a huge problem with the language.

-Steve
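The difference between language-level decoding and integer promotion can be sketched as follows. This is an illustration only: the two helper names are hypothetical, and std.utf.byCodeUnit stands in for "a range of char that isn't an array".

```d
import std.utf : byCodeUnit;

// Hypothetical helper: iterate a string array, asking for dchar.
size_t iterationsOverArray(string s)
{
    size_t n;
    foreach (dchar c; s) // language-level decoding: one iteration per code point
        ++n;
    return n;
}

// Hypothetical helper: iterate a non-array range of char, asking for dchar.
size_t iterationsOverRange(string s)
{
    size_t n;
    foreach (dchar c; s.byCodeUnit) // no decoding: each char is promoted to dchar
        ++n;
    return n;
}

unittest
{
    // U+00E9 takes two UTF-8 code units.
    assert(iterationsOverArray("\u00E9") == 1);
    assert(iterationsOverRange("\u00E9") == 2);
}
```

For pure ASCII input both loops agree, which is why such code can appear to work until it meets non-English text.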

November 04, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by jfondren
in reply to Walter Bright

On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
> https://issues.dlang.org/show_bug.cgi?id=22473

This doesn't throw, actually:

unittest {
    import std.stdio : writeln;
    enum invalid = "hello\247\205\257there";

    foreach (c; invalid)
        writeln(cast(ubyte) c);
}

Which is as usual in D:

@("std.utf.byUTF 2/3 (throwing)")
@safe unittest {
    import std.utf : byUTF, UTFException, UseReplacementDchar;
    import std.exception : assertThrown, assertNotThrown;
    import std.algorithm : count;

    string partial = "hello\247\205\257there";

    // byChar misses the bad UTF8 ...
    assertNotThrown!UTFException(partial.byUTF!(char, UseReplacementDchar.no).count);

    // byDchar objects to it
    assertThrown!UTFException(partial.byUTF!(dchar, UseReplacementDchar.no).count);
}

This does throw:

unittest {
    import std.stdio : writeln;
    enum invalid = "hello\247\205\257there";

    foreach (dchar c; invalid)
        writeln(cast(int) c);
}

but by asking for dchars from an immutable(char)[] you're asking for some unicode work to happen, so throwing is a reasonable default IMO. Emitting the replacement character is also a reasonable default, and objections in the thread can be answered the same way that objections to throwing can be: if you don't like it, iterate some other way:

// throw on invalid UTF
unittest {
    import std.utf : byUTF, UseReplacementDchar, UTFException;

    enum invalid = "hello\247\205\257there";

    int sum;
    try {
        foreach (dchar c; invalid.byUTF!(dchar, UseReplacementDchar.no))
            sum += cast(int) c;
        assert(sum == 197667);
    } catch (UTFException e) {
        assert(sum == 532);
    }
}

// AssertError on invalid UTF
// (release behavior: "\247\205\257" is three dchars!)
unittest {
    import std.stdio : writeln;
    import std.encoding : codePoints;

    enum invalid = "hello\247\205\257there";

    foreach (dchar c; invalid.codePoints)
        writeln(cast(int) c);
}

// stop iterating on invalid UTF
unittest {
    import std.encoding : validLength;

    enum invalid = "hello\247\205\257there";
    char[] s;

    foreach (dchar c; invalid[0 .. invalid.validLength])
        s ~= c;
    assert(s == "hello");
}

November 04, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Walter Bright
in reply to FeepingCreature

On 11/3/2021 10:41 PM, FeepingCreature wrote:
> On Thursday, 4 November 2021 at 05:34:29 UTC, FeepingCreature wrote:
>> One may disagree about autodecoding; I for one think it's a sensible idea. However, a program should either process data correctly or, if that is impossible, not at all. It should not, ever, silently modify it "for you" while reading! I predict this will lead to cryptic, hair-pulling bugs in user code involving replacement characters appearing far downstream of the error site.

Surprisingly, the reverse seems to be true. Suppose you're writing a text editor. Then read a file with some bad UTF in it. The editor dies with an exception. You can't even edit the file to fix it.

If you need to display user provided text, like in a browser, or all sorts of tools, you don't want to die with an exception. What are you going to do in an exception handler? You're just going to replace the offending bytes with ReplacementChar and go render it anyway.
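A rough sketch of the "replace the offending bytes and render it anyway" approach, assuming std.utf.byUTF with its default replacement policy does the decoding. The helper name sanitize is hypothetical.

```d
import std.utf : byUTF;

// Hypothetical helper: returns valid UTF-8, with U+FFFD where the input was bad.
string sanitize(const(char)[] raw)
{
    string clean;
    foreach (dchar c; raw.byUTF!dchar)
        clean ~= c; // appending a dchar to a string re-encodes it as UTF-8
    return clean;
}

unittest
{
    import std.algorithm : canFind, endsWith, startsWith;
    import std.utf : validate;

    auto clean = sanitize("hello\247\205\257there");
    validate(clean); // does not throw: the result is well-formed
    assert(clean.startsWith("hello") && clean.endsWith("there"));
    assert(clean.canFind('\uFFFD'));
}
```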

> (This is floating point NaN all over again!)

Poor NaNs are terribly misunderstood.

Suppose you have an array of sensors. One goes bad. The "bad" value is 0.0. So now your data analyzer is happily averaging 0.0 into the results, silently skewing them.

Now, if a NaN is returned instead, your "average" will be NaN. You know it's no good. It won't be hidden.

Uninitialized variables are sensors giving bad data. Having a NaN in your result is a *good* thing.
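The sensor example can be sketched in a few lines. The helper name average is hypothetical; the point is only that a 0.0 placeholder blends into the result while a NaN shows up in it.

```d
import std.algorithm : sum;
import std.math : isNaN;

// Hypothetical helper: plain arithmetic mean of sensor readings.
double average(const(double)[] readings)
{
    return readings.sum / readings.length;
}

unittest
{
    // A dead sensor reporting 0.0 silently drags the mean down.
    assert(average([2.0, 4.0, 0.0]) == 2.0);

    // A dead sensor reporting NaN makes the whole result NaN.
    assert(isNaN(average([2.0, 4.0, double.nan])));
}
```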

November 04, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Walter Bright
in reply to Elronnd

On 11/4/2021 12:51 AM, Elronnd wrote:
> In the hot path it's the same speed.

C++ sold everyone the myth that exceptions not thrown are zero cost. This has been thoroughly debunked, though the myth persists :-(

November 04, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Walter Bright
in reply to Elronnd

On 11/4/2021 12:55 AM, Elronnd wrote:
> Part of the problem, as mentioned, is that this throws away information, because text may legitimately contain replacement characters.  (And this makes the 'check if replacement char and throw yourself' approach a non-starter).  But there are lossless encodings.  I think if we are really going to go this route, we should use something like raku's utf8-c8 (https://docs.raku.org/language/unicode#UTF8-C8).

There's only one replacement character, and this use is officially what it is for. If you're using it for other porpoises, you've got a whale of a problem.
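The information-loss concern in the quoted post can be illustrated with a small sketch: after lenient decoding, well-formed text that already contains U+FFFD looks the same as malformed text. Both helper names are hypothetical; std.utf.validate is used here as one way to tell the two cases apart before decoding.

```d
import std.algorithm : canFind;
import std.utf : byUTF, replacementDchar, UTFException, validate;

// Hypothetical helper: true if lenient decoding yields any U+FFFD.
bool hasReplacement(string s)
{
    return s.byUTF!dchar.canFind(replacementDchar);
}

// Hypothetical helper: true if the input is well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    string legit = "a\uFFFDb";  // valid text that contains U+FFFD
    string broken = "a\247b";   // stray continuation byte

    // Indistinguishable once decoded with replacement...
    assert(hasReplacement(legit) && hasReplacement(broken));

    // ...but separable by validating the raw input first.
    assert(isWellFormed(legit) && !isWellFormed(broken));
}
```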

November 04, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Walter Bright
in reply to jfondren

On 11/4/2021 7:52 AM, jfondren wrote:
> Emitting the replacement character is also a reasonable default, and objections in the thread can be answered the same way that objections to throwing can be: if you don't like it, iterate some other way:

Technically, you are correct. But experience shows this does not work, because people will be human.

Two things are abundantly clear:

1. throwing exceptions must not be default behavior

2. allocating with the GC must not be the default behavior

and pushing against that is like trying to get people to eat their vegetables.
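As a sketch of what a non-throwing, non-allocating default makes possible: the loop below is marked nothrow and @nogc, and the compiler checks both. It assumes Phobos infers these attributes for std.utf.byUTF with the default replacement policy; the helper name is hypothetical.

```d
import std.utf : byUTF, replacementDchar;

// Hypothetical helper; the attributes are enforced at compile time.
size_t countReplacementChars(const(char)[] text) @safe nothrow @nogc
{
    size_t found;
    foreach (dchar c; text.byUTF!dchar) // yields U+FFFD instead of throwing
        if (c == replacementDchar)
            ++found;
    return found;
}

unittest
{
    assert(countReplacementChars("hello") == 0);
    assert(countReplacementChars("hello\247\205\257there") > 0);
}
```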

November 05, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by deadalnix
in reply to Walter Bright

On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
> https://issues.dlang.org/show_bug.cgi?id=22473
>
> I've tried to fix this before, but too many people objected.
>
> Are we fed up with this yet? I sure am.
>
> Who wants to take up this cudgel and fix the durned thing once and for all?
>
> (It's unclear if it would even break existing code.)

For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether.

Trying to fix what shouldn't exist is by far the biggest time sink engineers involve themselves in.

November 05, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Adam D Ruppe
in reply to deadalnix

On Friday, 5 November 2021 at 02:06:01 UTC, deadalnix wrote:
> On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
>> https://issues.dlang.org/show_bug.cgi?id=22473
>
> For the love of god, if you are going to make a breaking change there, just remove autodecoding altogether.

This post isn't about autodecoding. With foreach, you opt into the decoding by specifically asking for it.

November 05, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Mathias LANG
in reply to Walter Bright

On Friday, 5 November 2021 at 00:38:59 UTC, Walter Bright wrote:
> Surprisingly, the reverse seems to be true. Suppose you're writing a text editor. Then read a file with some bad UTF in it. The editor dies with an exception. You can't even edit the file to fix it.
>
> If you need to display user provided text, like in a browser, or all sorts of tools, you don't want to die with an exception. What are you going to do in an exception handler? You're just going to replace the offending bytes with ReplacementChar and go render it anyway.

If you handle user input, you take it as ubyte[] and validate it.
Any decent editor will try to detect the encoding instead of blindly assuming UTF-8.

If you want to fix it, just deprecate the special case and tell people to use foreach (dchar d; someString.byUTF!(dchar, No.useReplacementDchar)) and voilà. And if they don't want it to throw, it's shorter:
foreach (dchar d; someString.byUTF!dchar) (or byDchar).
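The two spellings can be put side by side in a short sketch. The helper names are hypothetical; byDchar is the std.utf shorthand for byUTF!dchar, and No comes from std.typecons.

```d
import std.algorithm : equal;
import std.range : walkLength;
import std.typecons : No;
import std.utf : byDchar, byUTF, UTFException;

// Hypothetical helper: true if strict decoding rejects the input.
bool strictDecodingThrows(string s)
{
    try
    {
        s.byUTF!(dchar, No.useReplacementDchar).walkLength;
        return false;
    }
    catch (UTFException)
    {
        return true;
    }
}

// Hypothetical helper: checks that the two lenient spellings agree.
bool lenientSpellingsAgree(string s)
{
    return s.byUTF!dchar.equal(s.byDchar);
}

unittest
{
    string bad = "hello\247\205\257there";
    assert(strictDecodingThrows(bad));
    assert(!strictDecodingThrows("hello"));
    assert(lenientSpellingsAgree(bad));
}
```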
