dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
> Unfortunately, codepoint != grapheme. This was the fundamental error with autodecoding that made it so bad. It costs us a performance hit but doesn't even produce the right results in return.
>
> And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do it unless your code absolutely has to.
>
> T

```D
struct graphstring
{
    grapheme[] grapheme_elements;
}

struct grapheme
{
    dchar[] codepoints;
}
```

Would this really be _that_ slow? Also, there is no need to do error checks on every action a user may perform on graphstrings: concatenation and slicing, for instance, need no checks. Checks are needed only on conversions from other string/ubyte[] types and to those types.
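To make the idea above concrete, here is a minimal sketch of the conversion step, assuming segmentation is delegated to `std.uni.byGrapheme`. The function name `toGraphstring` is hypothetical and not part of the proposal or of Phobos. It suggests where the cost would sit: one segmentation pass plus one array allocation per grapheme at conversion time, after which indexing by grapheme is a plain array index.

```D
import std.array : array;
import std.uni : byGrapheme;

struct grapheme
{
    dchar[] codepoints;
}

struct graphstring
{
    grapheme[] grapheme_elements;
}

// Hypothetical conversion helper: segment once, store each cluster's code points.
graphstring toGraphstring(string s)
{
    graphstring result;
    foreach (g; s.byGrapheme)
        result.grapheme_elements ~= grapheme(g[].array); // g[] iterates the cluster's code points
    return result;
}

void main()
{
    auto gs = toGraphstring("noe\u0308l"); // noël with a combining diaeresis
    assert(gs.grapheme_elements.length == 4);
    assert(gs.grapheme_elements[2].codepoints == "e\u0308"d);
}
```

Whether the per-grapheme allocation is acceptable depends on the workload; this sketch measures nothing and only shows the shape of the data.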

On Saturday, 6 November 2021 at 06:17:55 UTC, Alexey wrote:
> On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
>> Unfortunately, codepoint != grapheme. This was the fundamental error with autodecoding that made it so bad. It costs us a performance hit but doesn't even produce the right results in return.
>>
>> And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do it unless your code absolutely has to.
>>
>> T
>
> ```D
> struct graphstring
> {
>     grapheme[] grapheme_elements;
> }
>
> struct grapheme
> {
>     dchar[] codepoints;
> }
> ```
> Would this really be _that_ slow? also, there is no need to do error checks on every action which user may do with graphstrings: no need to check on concatenations or slicings, for instance. but do checks on conversions from other string/ubyte[] types and to those types.

This is 1 grapheme A̶͙̜͚̫̬̻ͅ (U+0041 U+0336 U+0359 U+0345 U+031c U+035a U+032b U+032c U+033b) but 9 codepoints (9 dchar, 9 wchar, 17 char: 0x41 0xcc 0xb6 0xcd 0x99 0xcd 0x85 0xcc 0x9c 0xcd 0x9a 0xcc 0xab 0xcc 0xac 0xcc 0xbb).
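The counts above can be checked with a short sketch like the following. `UnitCounts` and `countUnits` are hypothetical names introduced only for this illustration; the conversions use `std.conv.to` and the grapheme count uses `std.uni.byGrapheme`.

```D
import std.conv : to;
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper type: sizes of one string in each unit.
struct UnitCounts
{
    size_t chars;     // UTF-8 code units
    size_t wchars;    // UTF-16 code units
    size_t dchars;    // code points
    size_t graphemes; // grapheme clusters
}

UnitCounts countUnits(string s)
{
    return UnitCounts(s.length,
                      s.to!wstring.length,
                      s.to!dstring.length,
                      s.byGrapheme.walkLength);
}

void main()
{
    // 'A' followed by the eight combining marks listed above
    auto c = countUnits("A\u0336\u0359\u0345\u031c\u035a\u032b\u032c\u033b");
    assert(c.chars == 17);
    assert(c.wchars == 9);
    assert(c.dchars == 9);
    assert(c.graphemes == 1);
}
```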

On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
> And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do it unless your code absolutely has to.

It is suitable for a library though.

```D
@safe unittest
{
    import std.array : array;
    import std.conv : text;
    import std.range : retro;
    import std.uni : byGrapheme, byCodePoint;

    string s = "noe\u0308l"; // noël

    // reverse it and convert the result to a string
    string reverse = s.byGrapheme
        .array
        .retro
        .byCodePoint
        .text;

    assert(reverse == "le\u0308on"); // lëon
}
```

On 11/5/2021 9:25 PM, max haughton wrote:
> Although I understand what Walter is trying to say, he picked a poor example, this one does actually make sense. Although in the world of sanitizers and such it is not a hard thing to catch, bounds checking by default is a win.

Not sure what your point is, as D has bounds checking by default with [ ].

On 11/6/2021 9:09 AM, Vladimir Panteleev wrote:
> On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
>> https://issues.dlang.org/show_bug.cgi?id=22473
>
> Previous discussions:
>
> - https://wiki.dlang.org/DIP76
> - https://forum.dlang.org/post/mfvi86$10ml$1@digitalmars.com
> - https://issues.dlang.org/show_bug.cgi?id=14519
> - https://github.com/dlang/druntime/pull/1240
> - https://github.com/dlang/druntime/pull/1279
> - https://issues.dlang.org/show_bug.cgi?id=20134
> - https://github.com/dlang/phobos/pull/7144

Thanks, Vladimir.

November 07, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by jfondren
in reply to zjh

On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:
> On Thursday, 4 November 2021 at 08:24:59 UTC, zjh wrote:
>> The fundamental problem is that we should provide users with options at compile time, not we choose for users.
>> If you choose for users, there will always be dissatisfaction.
>> You provide options, and users choose according to their needs.
>
> auto decoding and utf8 string encoding are both like this. If you choose for users, some people are always not happy.

d index with range checking: arr[ind]
d index without range checking: arr.ptr[ind]

c++ index with range checking: arr.at(ind)
c++ index without range checking: arr[ind]

There are two ways to index, and both D and C++ offer both ways. Neither language removes a choice. If whether arr[ind] should range-check were up for debate, the real question would be which behavior the language should encourage by making it the default: the option that is more naturally expressed and requires less typing.
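As a small illustration of the two D spellings, here is a sketch with hypothetical helper names `checkedAt` and `uncheckedAt`. The unchecked form goes through `.ptr`, which is not allowed in `@safe` code, so the extra typing also marks the spot where the programmer took responsibility for the index.

```D
// Bounds-checked: an out-of-range index is caught at run time.
int checkedAt(const int[] arr, size_t ind) @safe
{
    return arr[ind];
}

// Unchecked: pointer indexing skips the check, so the caller must guarantee ind < arr.length.
int uncheckedAt(const int[] arr, size_t ind) @system
{
    return arr.ptr[ind];
}

void main()
{
    int[] arr = [10, 20, 30];
    assert(checkedAt(arr, 1) == 20);
    assert(uncheckedAt(arr, 1) == 20);
}
```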

The question here of "what should a foreach over the dchar of a char[] do?" is the same kind of question.

default: str
throwing: str.byUTF!(dchar, UseReplacementDchar.no)
asserting: std.encoding.codePoints(str)
replacement: std.utf.byDchar(str)
truncation: str[0 .. std.encoding.validLength(str)]
promotion: std.string.representation(str)

Put one of those inside foreach (dchar; ...) { } and you get that handling of bad UTF. Changing the default doesn't make the other options go away, and the default has to do something (even a compile-time error of "this is not supported behavior" is something), so you have to make a choice about the default and make some users unhappy.
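Here is a minimal sketch of several of those options side by side, run on a string containing one stray continuation byte. The helpers `badInput` and `defaultForeachThrows` are hypothetical names for this illustration only. The asserting option (`std.encoding.codePoints`) is left out because a failed contract would abort the program, and the promotion option is shown only by inspecting the raw bytes. The throwing default reflects the compiler behavior discussed in this thread; if that default changes, the first assertion would no longer hold.

```D
import std.algorithm.searching : canFind;
import std.array : array;
import std.encoding : validLength;
import std.exception : assertThrown;
import std.string : representation;
import std.utf : byDchar, byUTF, replacementDchar, UseReplacementDchar, UTFException;

// Hypothetical helper: 'a', a stray continuation byte 0x80, then 'b'.
string badInput()
{
    immutable(ubyte)[] raw = [0x61, 0x80, 0x62];
    return cast(string) raw;
}

// Hypothetical helper: true if the built-in decoding foreach throws on str.
bool defaultForeachThrows(string str)
{
    try
    {
        size_t n;
        foreach (dchar c; str)
            ++n;
        return false;
    }
    catch (Exception e) // caught broadly; the exact exception type is not relied on here
    {
        return true;
    }
}

void main()
{
    string str = badInput();

    // default: the built-in foreach throws on the bad byte
    assert(defaultForeachThrows(str));

    // throwing: same policy, spelled explicitly through std.utf
    assertThrown!UTFException(str.byUTF!(dchar, UseReplacementDchar.no).array);

    // replacement: the bad byte decodes to U+FFFD
    assert(str.byDchar.canFind(replacementDchar));

    // truncation: keep only the valid prefix
    assert(str[0 .. validLength(str)] == "a");

    // promotion: the bytes are exposed as ubyte values, with no decoding at all
    assert(str.representation[1] == 0x80);
}
```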
