dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead (page 10)

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by FeepingCreature
in reply to FeepingCreature

On Monday, 8 November 2021 at 08:11:12 UTC, FeepingCreature wrote:

On Sunday, 7 November 2021 at 04:18:25 UTC, Walter Bright wrote:

It's much better than 0.0. 0.0 is indistinguishable from valid data, and is a very common valid value.

NaN and ReplacementChar are not valid and are easily distinguished.

No, that's exactly the problem. ReplacementChar is not easily distinguished, because it's a valid Unicode character - that's the whole point of it. So just like NaN, it can propagate arbitrarily far through your processing pipeline before some downstream process decides that it actually doesn't like it.

Sorry, let me expand on this because I think it's the very core of the disagreement.

I feel you have two options with NaN/ReplacementChar. You can either just accept that this is what you get, and let it propagate throughout your entire pipeline. In that case it's no better than 0.0 - actually, NaN would be worse, because your process would be completely broken with no way to fix it, whereas at least with 0.0 you can maybe get some reasonably-usable data out.

Or you can say that "we don't want to be generating NaN/ReplacementChar." Then where do you draw the line? At the process input/output boundary? But then the process needs to be fixed if it generates nans/fffds. So you want to move your signaling as close to the production site as possible. Preferably, you want to fail at the exact line that the problematic data was produced. So we're back at exceptions in foreach. (Actually, an exception in cast(string) would be the best.)

And that's why I think ReplacementChar/NaN are no better than 0.0. You either embrace them fully as "valid" data, or you handle them at the site of origin; any compromise just makes you worse off than either extreme.
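To make the two behaviours under discussion concrete, here is a minimal D sketch that is not part of the thread. The helper names countStrict and countReplaced are made up for illustration; byUTF and replacementDchar come from std.utf, and the sketch assumes the decoding foreach reports bad input with core.exception.UnicodeException or a subclass of it.

```d
import core.exception : UnicodeException;
import std.stdio : writeln;
import std.utf : byUTF, replacementDchar;

/// Hypothetical helper: counts code points with the decoding foreach,
/// which throws when it meets malformed UTF-8.
size_t countStrict(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

/// Hypothetical helper: counts how many U+FFFD values byUTF yields.
size_t countReplaced(string s)
{
    size_t n;
    foreach (dchar c; s.byUTF!dchar)
        if (c == replacementDchar)
            ++n;
    return n;
}

void main()
{
    // 0xFF never occurs in well-formed UTF-8; the cast performs no validation.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    string bad = cast(string) raw;

    writeln("replacement characters seen: ", countReplaced(bad));

    try
    {
        countStrict(bad);
    }
    catch (UnicodeException e)
    {
        writeln("failed at the decoding site: ", e.msg);
    }
}
```

Note that countReplaced cannot tell a substituted U+FFFD from one that was genuinely present in the input, which is the propagation concern raised above.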

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Ola Fosheim Grøstad
in reply to FeepingCreature

On Monday, 8 November 2021 at 08:18:51 UTC, FeepingCreature wrote:

(Actually, an exception in cast(string) would be the best.)

D should distinguish more clearly between strong and weak casting at the language level. UTF-8 is now so dominant that D really should reconsider the string type and require it to be valid UTF-8 (like Python 3 did). C++ has even introduced a new character type to signify UTF-8; I use it all the time.

It is very difficult to follow your line of reasoning, because ReplacementChar is nothing like qNaN, it is more like sNaN. ReplacementChar is not the result of an approximation failure, it is corruption of the input (or maybe a foreign encoding).

Getting a 0.0 instead of qNaN in a signal is absolutely disastrous. Walter is 100% right on that one. 0.0 will introduce a peak across the frequency range. qNaN can be removed with no distortion.

Should you express your types strongly? Yes, but then you also should include things like negative numbers, denormal numbers, ±infinity, ranges such as [0.0, 1.0] and so on.
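A small D sketch of the "NaN can be removed, 0.0 cannot" argument, added for illustration only. It uses a plain mean rather than any frequency analysis, and the function name meanIgnoringNaN is hypothetical.

```d
import std.math : isNaN;
import std.stdio : writeln;

/// Hypothetical helper: mean of the samples that are not NaN;
/// returns NaN if no valid sample exists.
double meanIgnoringNaN(const double[] samples)
{
    double total = 0.0;
    size_t count;
    foreach (x; samples)
    {
        if (x.isNaN)
            continue; // a missing sample is recognisable and can be skipped
        total += x;
        ++count;
    }
    return count ? total / count : double.nan;
}

void main()
{
    // The same signal with one missing sample, marked two different ways.
    double[] markedWithNaN = [1.0, 2.0, double.nan, 3.0];
    double[] markedWithZero = [1.0, 2.0, 0.0, 3.0];

    writeln(meanIgnoringNaN(markedWithNaN));  // gap skipped
    writeln(meanIgnoringNaN(markedWithZero)); // gap silently counted as data
}
```

With the 0.0 marker there is nothing to filter on, so the gap is averaged in as if it were a measurement.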

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by FeepingCreature
in reply to Ola Fosheim Grøstad

On Monday, 8 November 2021 at 12:02:12 UTC, Ola Fosheim Grøstad wrote:

Getting a 0.0 instead of qNaN in a signal is absolutely disastrous. Walter is 100% right on that one. 0.0 will introduce a peak across the frequency range. qNaN can be removed with no distortion.

Should you express your types strongly? Yes, but then you also should include things like negative numbers, denormal numbers, ±infinity, ranges such as [0.0, 1.0] and so on.

Yeah I noticed this after I clicked post, but I didn't want to add a third comment. I think the difference is fundamentally one of "time-series vs progressive data". I don't think that's the right word, but I don't know a better one. Like, if you have a measurement series of values interspersed with NaNs, you can know for instance that the values are assigned to times, or to positions, and then you can semantically decide what to do with the data. For instance you may mark the NaNs with an error, or drop them and interpolate. However, it is much harder to see where such a behavior would be useful for ReplacementCharacter. Generally, you're reading data that someone wrote for a reason, and ReplacementCharacter would almost universally indicate that there was something you were meant to pick up on but failed to handle. As such, it's much less clear to me whether there even are cases where "text with replacement characters" or "text with replacement characters removed" is even useful.

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Ola Fosheim Grøstad
in reply to FeepingCreature

On Monday, 8 November 2021 at 12:32:08 UTC, FeepingCreature wrote:

Generally, you're reading data that someone wrote for a reason, and ReplacementCharacter would almost universally indicate that there was something you were meant to pick up on but failed to handle. As such, it's much less clear to me whether there even are cases where "text with replacement characters" or "text with replacement characters removed" is even useful.

It could mean that someone did cut'n'paste of text from a more recent version of the Unicode standard. ReplacementCharacter makes it possible for you to use the input regardless (replacing it with a question mark in a square or something).

I think this is an application level feature, and not a language level feature, so it doesn't make sense for the language to do this IMO. That we can agree on.

(D is not a scripting language.)

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Atila Neves
in reply to max haughton

On Friday, 5 November 2021 at 12:38:36 UTC, max haughton wrote:

On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright wrote:

On 11/4/2021 9:11 PM, max haughton wrote:

On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright wrote:

On 11/4/2021 7:41 PM, Mathias LANG wrote:

If you want to fix it, just deprecate the special case and tell people to use foreach (dchar d; someString.byUTF!(dchar, No.useReplacementDchar)) and voilà.
And if they don't want it to throw, it's shorter:
foreach (dchar d; someString.byUTF!dchar) (or byDChar).

People will always gravitate towards the smaller, simpler syntax. Like [] instead of std::vector<>.

I have never observed this mistake in any C++ code,

You've never observed people write:

int array[3];

in C++ code?

unless you mean as a point of language design.

D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.

This decision should be guided by how current D programmers act rather than a hyperreal ideal of someone encountering the language.

The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient.

I've seen over and over and over that syntactic convenience matters a lot.

is what I meant, vector doesn't do the same thing as [].

Aside from not depending on GC-allocated memory, what does vector do that [] doesn't?

It's more common in (so-called) modern C++ to see std::array these days than a raw static array in certain contexts, since you still want a constant-length buffer but want iterators etc.

int src[10]{};
int dst[10]{};
transform(begin(src), end(src), begin(dst), [](int i) { return i + 1; });
for(const auto i: dst)
    cout << i << " ";
cout << endl;

But yes, std::array is an option that's better, but legacy code means C arrays have to be supported.

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by max haughton
in reply to Atila Neves

On Monday, 8 November 2021 at 14:29:47 UTC, Atila Neves wrote:

On Friday, 5 November 2021 at 12:38:36 UTC, max haughton wrote:

On Friday, 5 November 2021 at 06:15:44 UTC, Walter Bright wrote:

On 11/4/2021 9:11 PM, max haughton wrote:

On Friday, 5 November 2021 at 04:02:44 UTC, Walter Bright wrote:

[...]

I have never observed this mistake in any C++ code,

You've never observed people write:

int array[3];

in C++ code?

unless you mean as a point of language design.

D (still) has a rather verbose way of doing lambdas. People constantly complained that D didn't have lambdas. Until the => syntax was added, and suddenly lambdas in D became noticed and useful.

This decision should be guided by how current D programmers act rather than a hyperreal ideal of someone encountering the language.

The only reason D's associative arrays continue to exist is because they are so darned syntactically convenient.

I've seen over and over and over that syntactic convenience matters a lot.

is what I meant, vector doesn't do the same thing as [].

Aside from not depending on GC-allocated memory, what does vector do that [] doesn't?

It's more common in (so-called) modern C++ to see std::array these days than a raw static array in certain contexts, since you still want a constant-length buffer but want iterators etc.

int src[10]{};
int dst[10]{};
transform(begin(src), end(src), begin(dst), [](int i) { return i + 1; });
for(const auto i: dst)
    cout << i << " ";
cout << endl;

But yes, std::array is an option that's better, but legacy code means C arrays have to be supported.

In my post I was referring to a C-style array (in C++) rather than a D slice, to be clear. It's entirely possible Walter originally meant a slice, but the point about following the syntactic path of least resistance seems to refer to a [] in C++ rather than a slice, i.e. I was intending to get across that I've never seen someone make this mistake in practice (either using a mere [] to pass data around, or using a vector in place of a static array or vice versa).

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Ola Fosheim Grøstad
in reply to max haughton

On Monday, 8 November 2021 at 22:12:15 UTC, max haughton wrote:

Could happen in C. Does not happen in C++, you use std::span for passing around data.

November 08, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by kdevel
in reply to Ola Fosheim Grøstad

On Monday, 8 November 2021 at 12:02:12 UTC, Ola Fosheim Grøstad wrote:
[...]

ReplacementChar is not the result of an approximation failure, it is corruption of the input (or maybe a foreign encoding).

I can write the replacement character '�' in this very line, since it is a valid Unicode code point (U+FFFD). It even round-trips correctly. I think the iconv library [1] has a nice approach: among other conditions, it stops the conversion when it encounters an invalid input sequence.

The ideal conversion without throwing or using the replacement character is IMHO generating a list of pairs of ranges, named "left" and "right". Left contains successfully parsed data, right contains invalid data. For valid UTF-8 input this list has only one element. The left element of this pair contains the conversion and the right is empty. From this representation one can easily compute all required presentations.

[1] https://man7.org/linux/man-pages/man3/iconv.3.html
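The left/right pairing described above could be sketched in D roughly as follows. This is only an illustration under stated assumptions: Segment, nextValid and splitValidInvalid are hypothetical names, the "left" part keeps a slice of the original UTF-8 instead of a converted form, and std.utf.decode is used with exceptions as the validity test, which is simple but not fast.

```d
import std.stdio : writeln;
import std.utf : decode, UTFException;

/// Hypothetical pair: well-formed text followed by the invalid bytes after it.
struct Segment
{
    string left;  // successfully parsed UTF-8 (slice of the input)
    string right; // invalid bytes that followed it; may be empty
}

/// Returns the index just past the code point starting at `i`,
/// or `i` itself if the bytes there do not decode. Requires i < s.length.
size_t nextValid(string s, size_t i)
{
    size_t j = i;
    try
    {
        decode(s, j); // advances j past one code point on success
        return j;
    }
    catch (UTFException)
    {
        return i;
    }
}

/// Splits `s` into alternating valid/invalid runs.
Segment[] splitValidInvalid(string s)
{
    Segment[] result;
    size_t i = 0;
    do
    {
        immutable leftStart = i;
        while (i < s.length && nextValid(s, i) != i)
            i = nextValid(s, i);
        immutable rightStart = i;
        while (i < s.length && nextValid(s, i) == i)
            ++i; // skip one undecodable byte at a time
        result ~= Segment(s[leftStart .. rightStart], s[rightStart .. i]);
    } while (i < s.length);
    return result;
}

void main()
{
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    foreach (seg; splitValidInvalid(cast(string) raw))
        writeln(seg.left.length, " valid byte(s), ", seg.right.length, " invalid byte(s)");
}
```

From such a list a caller could choose to throw on the first non-empty right part, substitute U+FFFD for it, or drop it, without the decoder having made that choice.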

November 10, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Ola Fosheim Grøstad
in reply to Guillaume Piolat

On Friday, 5 November 2021 at 10:13:13 UTC, Guillaume Piolat wrote:

Well you only know that it is meant to be utf8 in the context of the auto-decoding foreach (which must still exist). string in actual programs may contain binary files or strings in other code page encodings.

I had a look at the documentation today, and it said:

«char[] strings are in UTF-8 format.»

I would assume that this is normative? Maybe change the documentation to use more forceful specification language so that it says: «char[] strings MUST be in UTF-8 format.»

So, I think a messed up string should be considered a type error and it would be good if the compiler checked this statically where possible (e.g. literals) and simply assumed it to hold when parsing strings (like in a for loop).

In C++ I use span<uint8_t> for raw string-slices and span<char8_t> for utf8 string-slices. I find that to be quite clear. In C++ these are distinct types.

(newbies need a wrapper that is foolproof)
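As a rough idea of what such a wrapper might look like in D, here is a minimal sketch. Utf8String is a hypothetical type, not something Phobos provides; it relies only on std.utf.validate, which throws UTFException on malformed input, and it checks at run time rather than statically.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical wrapper: can only be constructed from bytes that pass validation,
/// so holding a Utf8String implies the contents were well-formed UTF-8.
struct Utf8String
{
    private string data;

    this(const(ubyte)[] raw)
    {
        auto text = cast(const(char)[]) raw;
        validate(text); // throws UTFException on malformed input
        data = text.idup;
    }

    string toString() const
    {
        return data;
    }
}

void main()
{
    immutable(ubyte)[] good = [0x68, 0x69];
    immutable(ubyte)[] bad = [0x68, 0xFF];

    writeln(Utf8String(good).toString());

    try
    {
        auto s = Utf8String(bad);
    }
    catch (UTFException e)
    {
        writeln("rejected at construction: ", e.msg);
    }
}
```

Raw data would stay in ubyte[] and only cross into the wrapper through the validating constructor, which mirrors the span<uint8_t> versus span<char8_t> split mentioned above.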

November 10, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Guillaume Piolat
in reply to Ola Fosheim Grøstad

On Wednesday, 10 November 2021 at 10:23:31 UTC, Ola Fosheim Grøstad wrote:

On Friday, 5 November 2021 at 10:13:13 UTC, Guillaume Piolat wrote:

I had a look at the documentation today, and it said:

«char[] strings are in UTF-8 format.»

I would assume that this is normative? Maybe change the documentation to use more forceful specification language so that it says: «char[] strings MUST be in UTF-8 format.»

I'm not sure what is intended.

import("file.stuff") yields string.
So there is at least one gap, as it is often used with binary files that ain't UTF-8.

Also look at that signature: https://dlang.org/phobos/std_utf.html#validate
By spec it shall only return true then.

It seems that in practice it doesn't have to be UTF-8 until you use something that assumes it is. Which is OK for me.
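A short D sketch of that observation, added for illustration. The helper isWellFormedUtf8 is a made-up name; as far as I can tell std.utf.validate itself returns nothing and signals failure by throwing UTFException, so the helper turns that into a bool. The four bytes stand in for binary data of the kind a string import might yield.

```d
import std.stdio : writeln;
import std.string : representation;
import std.utf : validate, UTFException;

/// Hypothetical helper: reports whether `s` passes std.utf.validate.
bool isWellFormedUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Binary bytes viewed as a string; the cast checks nothing.
    immutable(ubyte)[] header = [0x89, 0x50, 0x4E, 0x47];
    string blob = cast(string) header;

    // Operations on code units work without any decoding.
    writeln(blob.length);
    writeln(blob.representation[0] == 0x89);

    // Only an operation that assumes UTF-8 notices the problem.
    writeln(isWellFormedUtf8(blob));
}
```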
