November 06, 2021
On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
>
> Unfortunately, codepoint != grapheme. This was the fundamental error with autodecoding that made it so bad. It costs us a performance hit but doesn't even produce the right results in return.
>
> And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do it unless your code absolutely has to.
>
>
> T

```D
struct graphstring
{
    grapheme[] grapheme_elements;
}

struct grapheme
{
    dchar[] codepoints;
}

```
Would this really be _that_ slow? Also, there is no need to run error checks on every operation a user may perform on graphstrings: concatenation and slicing, for instance, need no checks. Checks are only needed on conversions from and to other string/ubyte[] types.
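A minimal sketch of that idea, assuming std.uni.byGrapheme does the segmentation and std.utf.validate does the one-time check on conversion. The members `fromString`, `toUTF8` and the `~` overload are hypothetical names chosen for illustration, not an existing API:

```D
import std.conv : to;
import std.uni : byGrapheme;
import std.utf : validate;

struct grapheme
{
    dchar[] codepoints;
}

struct graphstring
{
    grapheme[] grapheme_elements;

    // Conversion in: the only place where the input is validated.
    static graphstring fromString(string s)
    {
        validate(s); // throws on malformed UTF-8
        graphstring result;
        foreach (g; s.byGrapheme)
        {
            dchar[] cps;
            foreach (i; 0 .. g.length)
                cps ~= g[i];
            result.grapheme_elements ~= grapheme(cps);
        }
        return result;
    }

    // Concatenation just joins the two element arrays, with no checks.
    graphstring opBinary(string op : "~")(graphstring rhs)
    {
        return graphstring(grapheme_elements ~ rhs.grapheme_elements);
    }

    // Conversion out: flatten the code points and re-encode as UTF-8.
    string toUTF8()
    {
        dchar[] all;
        foreach (g; grapheme_elements)
            all ~= g.codepoints;
        return all.to!string;
    }
}

void main()
{
    auto a = graphstring.fromString("noe\u0308l");
    assert(a.grapheme_elements.length == 4);
    assert((a ~ a).grapheme_elements.length == 8);
    assert(a.toUTF8 == "noe\u0308l");
}
```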
November 06, 2021
On Saturday, 6 November 2021 at 06:17:55 UTC, Alexey wrote:
> On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
>>
>> Unfortunately, codepoint != grapheme. This was the fundamental error with autodecoding that made it so bad. It costs us a performance hit but doesn't even produce the right results in return.
>>
>> And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do it unless your code absolutely has to.
>>
>>
>> T
>
> ```D
> struct graphstring
> {
>     grapheme[] grapheme_elements;
> }
>
> struct grapheme
> {
>     dchar[] codepoints;
> }
>
> ```
> Would this really be _that_ slow? also, there is no need to do error checks on every action which user may do with graphstrings: no need to check on concatenations or slicings, for instance. but do checks on conversions from other string/ubyte[] types and to those types.

This is 1 grapheme A̶͙̜͚̫̬̻ͅ


(U+0041 U+0336 U+0359 U+0345 U+031c U+035a U+032b U+032c U+033b), but 9 code points: 9 dchar, 9 wchar, or 17 char (0x41 0xcc 0xb6 0xcd 0x99 0xcd 0x85 0xcc 0x9c 0xcd 0x9a 0xcc 0xab 0xcc 0xac 0xcc 0xbb).
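Those counts can be checked with Phobos. This is a small sketch; the constant name `zalgoA` is made up for the example, and the string is written with escapes so the source file's encoding does not matter:

```D
import std.conv : to;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

// 'A' followed by the eight combining marks listed above.
enum zalgoA = "A\u0336\u0359\u0345\u031c\u035a\u032b\u032c\u033b";

void main()
{
    string s = zalgoA;
    assert(s.length == 17);               // UTF-8 code units (char)
    assert(s.to!wstring.length == 9);     // UTF-16 code units (wchar)
    assert(s.byDchar.walkLength == 9);    // code points (dchar)
    assert(s.byGrapheme.walkLength == 1); // graphemes
}
```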
November 06, 2021
On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
> And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do it unless your code absolutely has to.

It is suitable for a library though.

November 06, 2021

On Saturday, 6 November 2021 at 06:17:55 UTC, Alexey wrote:

> ```D
> struct graphstring
> {
>     grapheme[] grapheme_elements;
> }
>
> struct grapheme
> {
>     dchar[] codepoints;
> }
> ```

std.uni.Grapheme is more complex than a dchar[] (it tries to avoid allocating, and it owns its dchars), but its .length and opIndex work like those of a dchar[]. Do read the warning on opSlice, though.

You can get a Grapheme[] with just s1.byGrapheme.array.
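For instance, a short sketch of indexing into such an array (the variable names and the helper `graphemesOf` are only illustrative):

```D
import std.array : array;
import std.uni : byGrapheme, Grapheme;

// Hypothetical helper: eagerly segment a string into graphemes.
Grapheme[] graphemesOf(string s1)
{
    return s1.byGrapheme.array;
}

void main()
{
    auto gs = graphemesOf("noe\u0308l"); // noël, with a combining diaeresis
    assert(gs.length == 4);              // n, o, ë, l
    assert(gs[2].length == 2);           // 'e' plus U+0308
    assert(gs[2][0] == 'e');             // opIndex yields a dchar
    assert(gs[2][1] == '\u0308');
}
```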

Round-trip example from std.uni:

```D
@safe unittest {
    import std.array : array;
    import std.conv : text;
    import std.range : retro;
    import std.uni : byGrapheme, byCodePoint;

    string s = "noe\u0308l"; // noël

    // reverse it and convert the result to a string
    string reverse = s.byGrapheme
        .array
        .retro
        .byCodePoint
        .text;

    assert(reverse == "le\u0308on"); // lëon
}
```
November 06, 2021

On Saturday, 6 November 2021 at 13:07:53 UTC, jfondren wrote:

>

...

I doubt that std.uni.Grapheme is faster than dchar[]. I also doubt that all the checks and other things std.uni.Grapheme does are really necessary in the context of a hypothetical 'graphstring'.

November 06, 2021

On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:

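> https://issues.dlang.org/show_bug.cgi?id=22473

Previous discussions:

- https://wiki.dlang.org/DIP76
- https://forum.dlang.org/post/mfvi86$10ml$1@digitalmars.com
- https://issues.dlang.org/show_bug.cgi?id=14519
- https://github.com/dlang/druntime/pull/1240
- https://github.com/dlang/druntime/pull/1279
- https://issues.dlang.org/show_bug.cgi?id=20134
- https://github.com/dlang/phobos/pull/7144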

November 06, 2021
On 11/5/2021 9:25 PM, max haughton wrote:
> Although I understand what Walter is trying to say, he picked a poor example, this one does actually make sense. Although in the world of sanitizers and such it is not a hard thing to catch, bounds checking by default is a win.

Not sure what your point is, as D has bounds checking by default with [ ].
November 06, 2021
On 11/6/2021 9:09 AM, Vladimir Panteleev wrote:
> On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
>> https://issues.dlang.org/show_bug.cgi?id=22473
> 
> Previous discussions:
> 
> - https://wiki.dlang.org/DIP76
> - https://forum.dlang.org/post/mfvi86$10ml$1@digitalmars.com
> - https://issues.dlang.org/show_bug.cgi?id=14519
> - https://github.com/dlang/druntime/pull/1240
> - https://github.com/dlang/druntime/pull/1279
> - https://issues.dlang.org/show_bug.cgi?id=20134
> - https://github.com/dlang/phobos/pull/7144
> 

Thanks, Vladimir.
November 07, 2021

On Thursday, 4 November 2021 at 08:24:59 UTC, zjh wrote:

The fundamental problem is that we should give users options at compile time rather than choose for them.
If you choose for users, someone will always be dissatisfied.
Provide the options, and users will choose according to their needs.

Auto-decoding and UTF-8 string encoding are both like this. If you choose for users, some people will always be unhappy.

November 07, 2021

On Sunday, 7 November 2021 at 01:12:19 UTC, zjh wrote:

> On Thursday, 4 November 2021 at 08:24:59 UTC, zjh wrote:
>
> The fundamental problem is that we should provide users with options at compile time, not we choose for users.
> If you choose for users, there will always be dissatisfaction.
> You provide options ,and Users choose according to their needs.
>
> auto decoding and utf8 string encoding are both like this. If you choose for users, some people are always not happy.

D index with range checking: arr[ind]
D index without range checking: arr.ptr[ind]

C++ index with range checking: arr.at(ind)
C++ index without range checking: arr[ind]

There are two ways to index, and both D and C++ offer both. Neither language removes a choice. If whether arr[ind] should range-check were up for debate, the debate would be about what the language should encourage by making it the default: the option that is more naturally expressed and requires less typing.
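A small D sketch of the two indexing forms. The function names `checked` and `unchecked` are invented for the example, and it assumes bounds checks have not been switched off entirely (for example with -boundscheck=off):

```D
import core.exception : RangeError;

// Checked: arr[ind] verifies ind < arr.length.
int checked(int[] arr, size_t ind) @safe
{
    return arr[ind];
}

// Unchecked: indexing through .ptr skips the check, so it is not allowed in @safe code.
int unchecked(int[] arr, size_t ind) @system
{
    return arr.ptr[ind];
}

void main()
{
    int[] arr = [10, 20, 30];
    assert(checked(arr, 1) == 20);
    assert(unchecked(arr, 1) == 20);

    // Out of range: the checked form fails with a RangeError.
    // Catching an Error is done here only to demonstrate the failure.
    bool caught = false;
    try
    {
        checked(arr, 3);
    }
    catch (RangeError)
    {
        caught = true;
    }
    assert(caught);
}
```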

The question here of "what should a foreach over the dchar of a char[] do?" is the same kind of question.

default: str
throwing: str.byUTF!(dchar, UseReplacementDchar.no)
asserting: std.encoding.codePoints(str)
replacement: std.utf.byDchar(str)
truncation: str[0 .. std.encoding.validLength(str)]
promotion: std.string.representation(str)

Put one of those inside foreach (dchar; ...) { } and you get that handling of bad UTF. Changing the default doesn't make the other options go away, and the default has to do something (even a compile-time error of "this is not supported behavior" is something), so you have to make a choice about the default and make some users unhappy.
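As a rough illustration, here are four of those options applied to a string containing one invalid byte. The helpers `collect` and `badInput` are hypothetical, written only for this sketch; note that std.utf spells the flag UseReplacementDchar. The plain `str` default is left out because its behavior is exactly what the linked issue is about:

```D
import std.algorithm.searching : canFind;
import std.encoding : validLength;
import std.exception : assertThrown;
import std.string : representation;
import std.utf : byDchar, byUTF, UseReplacementDchar;

// Hypothetical helper: runs foreach (dchar; ...) and collects what it yields.
dchar[] collect(R)(R r)
{
    dchar[] result;
    foreach (dchar c; r)
        result ~= c;
    return result;
}

// "ab", then a stray continuation byte (invalid UTF-8), then "c".
string badInput()
{
    immutable(ubyte)[] raw = [0x61, 0x62, 0x80, 0x63];
    return cast(string) raw;
}

void main()
{
    string str = badInput();

    // throwing: iteration fails with an exception at the bad byte
    assertThrown(collect(str.byUTF!(dchar, UseReplacementDchar.no)));

    // replacement: the bad byte is turned into U+FFFD
    assert(collect(str.byDchar).canFind(dchar(0xFFFD)));

    // truncation: only the valid prefix is iterated
    assert(collect(str[0 .. validLength(str)]) == "ab"d);

    // promotion: the string is viewed as raw bytes, one element per byte
    assert(str.representation.length == 4);
    assert(str.representation[2] == 0x80);
}
```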