August 13, 2019
On Tuesday, 13 August 2019 at 18:24:23 UTC, jmh530 wrote:
> On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis wrote:
>> [snip]
>>
>> It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output.
>>
>> - Jonathan M Davis
>
> On my machine & browser, it looks like it is on the e on both.

You're not alone; in my Firefox on Windows 10 Pro, the accents are both on the e.
August 13, 2019
On Tue, Aug 13, 2019 at 06:24:23PM +0000, jmh530 via Digitalmars-d wrote:
> On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis wrote:
> > [snip]
> > 
> > It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output.
> > 
> > - Jonathan M Davis
> 
> On my machine & browser, it looks like it is on the e on both.

Probably what Jonathan said about the browser munging the Unicode. Unicode is notoriously hard to process correctly, and I wouldn't be surprised if the majority of applications out there actually don't handle it correctly in all cases.

The whole auto-decoding deal is a prime example of this: even an expert programmer like Andrei fell into the trap of assuming that code point == grapheme. I have no confidence that less capable programmers, who form the majority of today's programmers and write the bulk of the industry's code, are any more likely to get it right.  (For years I myself didn't even know there was such a thing as a "grapheme".)  In fact, almost every day I see "enterprise" code that commits atrocities against Unicode -- because QA hasn't thought to pass a *real* Unicode string as test input yet. The day the idea occurs to them, a LOT of code (and I mean a LOT) will need to be rewritten, probably from scratch.
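To make the conflation concrete, here is a small Python sketch (Python used only for brevity; the underlying Unicode behaviour is language-independent). The same on-screen "é" can be one code point or two:

```python
# The same user-perceived character ("é") in two different encodings:
# one precomposed code point vs. 'e' plus a combining accent.
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(composed))    # 1 code point
print(len(decomposed))  # 2 code points -- yet one grapheme on screen
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Any code that equates "number of code points" with "number of characters" gets the second string wrong.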


T

--
"Real programmers can write assembly code in any language. :-)" -- Larry Wall
August 13, 2019
On Tuesday, 13 August 2019 at 16:29:33 UTC, jmh530 wrote:
> On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
>> ...
>> Expected output:
>> 	тевирп
>> 	те́вирп
>>
>> Actual output:
>> 	тевирп
>> 	т́евирп
>>
>
> Huh, those two look the same.

Copy and paste the Expected and Actual output into Notepad and you will see the difference, or just take a look at the HTML page source in your browser (search for "Expected output"):

<span class="forum-quote-prefix">&gt; </span>Expected output:
<span class="forum-quote-prefix">&gt; </span>	тевирп
<span class="forum-quote-prefix">&gt; </span>	те́вирп
<span class="forum-quote-prefix">&gt;</span>
<span class="forum-quote-prefix">&gt; </span>Actual output:
<span class="forum-quote-prefix">&gt; </span>	тевирп
<span class="forum-quote-prefix">&gt; </span>	т́евирп

For me it shows the difference pretty clearly.

Matheus.
August 13, 2019
On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
> [snip]
>
> Copy and paste the Expected and Actual output into Notepad and you will see the difference, or just take a look at the HTML page source in your browser (search for "Expected output"):
>
> <span class="forum-quote-prefix">&gt; </span>Expected output:
> <span class="forum-quote-prefix">&gt; </span>	тевирп
> <span class="forum-quote-prefix">&gt; </span>	те́вирп
> <span class="forum-quote-prefix">&gt;</span>
> <span class="forum-quote-prefix">&gt; </span>Actual output:
> <span class="forum-quote-prefix">&gt; </span>	тевирп
> <span class="forum-quote-prefix">&gt; </span>	т́евирп
>
> For me it shows the difference pretty clearly.
>
> Matheus.

Interestingly enough, what you have there does not look any different. However, if I actually do what you say and paste it into Notepad or something, then it does look different.
August 13, 2019
On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
> ...

As others said, you may not be able to see it in the browser, because the renderer may "fix" this.

Here is how it looks through HTML code inspection: https://i.imgur.com/e57wCZp.png

Notice the character '´' position.

Matheus.
August 14, 2019
On Tuesday, 13 August 2019 at 16:51:57 UTC, jmh530 wrote:
> On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
>> [snip]
>>
>> The location of the acute accent on the second line is wrong.
>>
>>
>> T
>
> I'm still confused...
>
> What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?

We can take Chinese characters as an example; there it's clear:

```
 writeln("汉语&中国🇨🇳".retro);
 writeln("汉字🐠中国🇨🇳".retro);
```

expected:

🇨🇳国中&语汉
🇨🇳国中🐠字汉

actual:
🇳🇨国中&语汉
🇳🇨国中🐠字汉
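The flag case can be reproduced outside D as well; here is a Python sketch of the same failure (the 🇨🇳 flag is two regional-indicator code points, so a code-point-level reversal swaps them into a different flag):

```python
# 🇨🇳 is U+1F1E8 U+1F1F3 (regional indicators C + N).
# Reversing code point by code point yields U+1F1F3 U+1F1E8, i.e. 🇳🇨.
s = "中国\U0001F1E8\U0001F1F3"  # "中国🇨🇳"
r = s[::-1]                     # naive code-point reversal
print(r)                        # "🇳🇨国中" -- the wrong flag
```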


--
binghoo dang

August 14, 2019
On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:

> But we can't make that the default because it's a big performance hit, and many string algorithms don't actually need grapheme segmentation.

Can you provide examples of algorithms and use cases that don't need grapheme segmentation?
Are they really SO common that the correct default is to go for code points?

Is it not better to have grapheme segmentation, the correct way of handling a string, as the default instead?
August 14, 2019
On Wednesday, 14 August 2019 at 07:15:54 UTC, Argolis wrote:
> On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
>
>> But we can't make that the default because it's a big performance hit, and many string algorithms don't actually need grapheme segmentation.
>
> Can you provide examples of algorithms and use cases that don't need grapheme segmentation?
> Are they really SO common that the correct default is to go for code points?
>
> Is it not better to have grapheme segmentation, the correct way of handling a string, as the default instead?

There is no single universally correct way to segment a string. Grapheme segmentation requires a correct assumption of the text encoding in the string and also the assumption that the encoding is flawless. Neither may be guaranteed in general. There are a lot of ways to corrupt UTF-8 strings, for example. And then there is the question of the length of a grapheme: IIRC, a grapheme can consist of up to 6 or 7 code points, each of them encoded in a varying number of bytes in UTF-8, UTF-16 or UCS-2. So what data type do you use to represent graphemes that is both not wasteful and doesn't require dynamic memory management?
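As a concrete data point on grapheme size (a Python sketch, since the counts are language-independent): the "family" emoji is a single grapheme built from seven code points, which take 25 bytes in UTF-8:

```python
# 👨‍👩‍👧‍👦 = MAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY: 7 code points,
# one user-perceived character. Any fixed-size grapheme type must
# either waste space or fall back to dynamic allocation for cases
# like this.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))                  # 7 code points
print(len(family.encode("utf-8")))  # 25 bytes in UTF-8
```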

Then there are other nasty quirks around graphemes: their encoding is not unique. This Unicode TR gives a good impression of how complex this single aspect is: https://unicode.org/reports/tr15/

So if you want to use graphemes, do you want to keep the original encoding or do you implicitly convert them to NFC or NFD? NFC tends to be better for language processing, NFD tends to be better for text rendering (with exceptions). If you don't normalize, semantically equivalent graphemes may not be equal under comparison.
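A small illustration of the non-unique encoding (shown with Python's unicodedata; the normalization forms themselves are defined by the Unicode standard): the Hangul syllable 한 can be stored precomposed or as three conjoining jamo, and the two spellings only compare equal after normalization:

```python
import unicodedata

precomposed = "\ud55c"             # 한 as one code point (NFC form)
decomposed = "\u1112\u1161\u11ab"  # the same syllable as three jamo (NFD form)

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```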

At this point you're probably approaching the complexity of libraries like ICU. You can take a look at it if you want a good scare. ;)
August 14, 2019
On Wed, Aug 14, 2019 at 07:15:54AM +0000, Argolis via Digitalmars-d wrote:
> On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
> 
> > But we can't make that the default because it's a big performance hit, and many string algorithms don't actually need grapheme segmentation.
> 
> Can you provide examples of algorithms and use cases that don't need grapheme segmentation?

Most cases of string processing involve:
- Taking substrings: does not need grapheme segmentation; you just slice
  the string.
- Copying one string to another: does not need grapheme segmentation,
  you just use memcpy (or equivalent).
- Concatenating n strings: does not need grapheme segmentation, you just
  use memcpy (or equivalent).  In D, you just use array append, or
  std.array.appender if you get fancy.
- Comparing one string to another: does not need grapheme segmentation;
  you either use strcmp/memcmp, or if you need more delicate semantics,
  call one of the standard Unicode string collation algorithms (in
  std.uni), meaning your code does not need to worry about grapheme
  segmentation; besides, Unicode collation algorithms operate at the
  code point level, not at the grapheme level.
- Matching a substring: does not need grapheme segmentation; most
  applications just need subarray matching, i.e., treat the substring as
  an opaque blob of bytes, and match it against the target.  If you need
  more delicate semantics, there are standard Unicode algorithms for
  substring matching (i.e., user code does not need to worry about the
  low-level details -- the inputs are basically opaque Unicode strings
  whose internal structure is unimportant).
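The substring-matching point can be demonstrated in a few lines (a Python sketch; the property being relied on is that UTF-8 is self-synchronizing, so a validly encoded needle can never match in the middle of another character's byte sequence):

```python
# Byte-level substring search on UTF-8 data needs no decoding at all.
haystack = "привет мир".encode("utf-8")
needle = "мир".encode("utf-8")

print(needle in haystack)     # True
print(haystack.find(needle))  # 13 -- a *byte* offset, not a character index
```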

You really only need grapheme segmentation when:
- Implementing a text layout algorithm where you need to render glyphs
  to some canvas.  Usually, this is already taken care of by the GUI
  framework or the terminal emulator, so user code rarely has to worry
  about this.
- Measuring the size of some piece of text for output alignment
  purposes: in this case, grapheme segmentation isn't enough; you need
  font size information and other such details (like kerning, spacing
  parameters, etc.). Usually, you wouldn't write this yourself, but use
  a text rendering library.  So most user code doesn't actually have to
  worry about this.  (Note that iterating by graphemes does NOT give you
  the correct value for width even with a fixed-width font in a text
  mode terminal emulator, because there are such things as double-width
  characters in Unicode, which occupy two cells each. And also
  zero-width characters which count as distinct (empty) graphemes, but
  occupy no space.)
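The double-width/zero-width point can be checked against the Unicode character database directly (a Python sketch using unicodedata; terminal emulators consult the same East Asian Width property, typically via wcwidth):

```python
# Grapheme count != column count: CJK characters occupy two cells,
# combining marks occupy none.
import unicodedata

print(unicodedata.east_asian_width("汉"))  # 'W'  -> two terminal cells
print(unicodedata.east_asian_width("a"))   # 'Na' -> one cell
print(unicodedata.combining("\u0301"))     # 230  -> combining mark, zero cells
```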


And as an appendix, the way most string processing code is done in C/C++ (iterating over characters) is actually wrong w.r.t. Unicode, because it's really only reliable for ASCII inputs. For "real" Unicode strings, you can't really get away with the "character by character" approach, even if you use grapheme segmentation: in some writing systems, like Arabic, breaking up a string like this can cause incorrect behaviour, like breaking ligatures, which may not be intended.  For this sort of operation the application really needs to use the standard Unicode algorithms, which depend on the *purpose* of the function, not the mechanics of iterating over characters: e.g., find suitable line breaks, find suitable hyphenation points, etc.  There's a reason the Unicode Consortium defines standard algorithms for these operations: naïvely iterating over graphemes, in general, does *not* yield the correct results in all cases.

Ultimately, the whole point behind removing autodecoding is to put the onus on the user code to decide what kind of iteration it wants: code units, code points, or graphemes. (Or just use one of the standard algorithms and don't reinvent the square wheel.)


> Are they really SO common that the correct default is to go for code points?

The whole point behind removing autodecoding is so that we do NOT default to code points, which is currently the default.  We want to put the choice in the user's hand, not silently default to iteration by code point under the illusion of correctness, which is actually incorrect for non-trivial inputs.


> Is it not better to have grapheme segmentation, the correct way of handling a string, as the default instead?

Grapheme segmentation is very complex, and therefore, very slow.  Most string processing doesn't actually need grapheme segmentation.  Setting that as the default would mean D string processing will be excruciatingly slow by default, and furthermore all that extra work will be mostly for nothing because most of the time we don't need it anyway.

And to repeat: most naïve iterations over graphemes actually do *not* yield what one might think is the correct result. For example, measuring the size of a piece of text in a fixed-width font in a text-mode terminal by counting graphemes is actually wrong, due to double-width and zero-width characters.


T

-- 
The most powerful one-line C program: #include "/dev/tty" -- IOCCC
August 14, 2019
On Wed, Aug 14, 2019 at 09:29:30AM +0000, Gregor Mückl via Digitalmars-d wrote: [...]
> At this point you're probably approaching the complexity of libraries like ICU. You can take a look at it if you want a good scare. ;)

Or, instead of homebrewing your own string-handling algorithms and probably getting it all wrong, actually *use* ICU to handle Unicode strings for you instead.  Saves you from writing more code, and from unintentional bugs.


T

-- 
Truth, Sir, is a cow which will give [skeptics] no more milk, and so they are gone to milk the bull. -- Sam. Johnson