August 13, 2019
On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
> [snip]
>
> Because it *appears* to be right, but it's actually wrong. For example:
>
> 	import std.range : retro;
> 	import std.stdio;
>
> 	void main() {
> 		writeln("привет".retro);
> 		writeln("приве́т".retro);
> 	}
>
> Expected output:
> 	тевирп
> 	те́вирп
>
> Actual output:
> 	тевирп
> 	т́евирп
>

Huh, those two look the same.

August 13, 2019
On Tue, Aug 13, 2019 at 04:29:33PM +0000, jmh530 via Digitalmars-d wrote:
> On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
> > [snip]
> > 
> > Because it *appears* to be right, but it's actually wrong. For example:
> > 
> > 	import std.range : retro;
> > 	import std.stdio;
> > 
> > 	void main() {
> > 		writeln("привет".retro);
> > 		writeln("приве́т".retro);
> > 	}
> > 
> > Expected output:
> > 	тевирп
> > 	те́вирп
> > 
> > Actual output:
> > 	тевирп
> > 	т́евирп
> > 
> 
> Huh, those two look the same.

The location of the acute accent on the second line is wrong.


T

-- 
GEEK = Gatherer of Extremely Enlightening Knowledge
August 13, 2019
On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
> [snip]
>
> The location of the acute accent on the second line is wrong.
>
>
> T

I'm still confused...

What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
August 13, 2019
On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via Digitalmars-d wrote:
> On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
> > [snip]
> >
> > The location of the acute accent on the second line is wrong.
> >
> >
> > T
>
> I'm still confused...
>
> What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?

It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output.

- Jonathan M Davis



August 13, 2019
On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis wrote:
> On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via Digitalmars-d wrote:
>> On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
>> > [snip]
>> >
>> > The location of the acute accent on the second line is wrong.
>> >
>> >
>> > T
>>
>> I'm still confused...
>>
>> What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
>
> It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output.
>
> - Jonathan M Davis

We must be seeing different things then. I've taken a screenshot of how the post looks to me:

http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png

August 13, 2019
On Tuesday, 13 August 2019 at 16:51:57 UTC, jmh530 wrote:
> On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
>> [snip]
>>
>> The location of the acute accent on the second line is wrong.
>
> I'm still confused...
>
> What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?

accent in wchar array can looks like:
приве'т - accent to vowel 'e'.. two glyphs combined in one
but in reverse it can be like:
т'евирп - accent to consonants 'т'.. that is wrong, accent can be for vowels only (for RU)

its Russian lang and it have not additions to glyphs but other langs has and one letter can be represented as 2 wchars or just 1.. depends from editor.. and it looks same with same meanings.

OT: about Russian(cyrillic) letter with additions

Е and Ё are 2 different letters, vowels, sometimes interchangeable (when u cannot find letter Ё on keyboard u can use Е)
И and Й are 2 different letters, vowel and consonant, not interchangeable, they are totally different.

Ё(upper) ё(lower), Й(up) й(low) - letters where addition is part of letter itself, it cannot be separated.
E - transcription is "jˈe"
Ё - "jˈɵ"
И - "ˈi"
Й - "j"

Russian has not additions to glyphs except accent for right reading unknown word or for dictionaries like Oxford, Wiki and etc.
Many Europian - Latin or Cyrillic - can have additions to letters, idk their meanings.
August 13, 2019
On Tuesday, August 13, 2019 11:43:19 AM MDT Gregor Mückl via Digitalmars-d wrote:
> On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis
>
> wrote:
> > On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via
> >
> > Digitalmars-d wrote:
> >> On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
> >> > [snip]
> >> >
> >> > The location of the acute accent on the second line is wrong.
> >> >
> >> >
> >> > T
> >>
> >> I'm still confused...
> >>
> >> What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
> >
> > It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output.
> >
> > - Jonathan M Davis
>
> We must be seeing different things then. I've taken a screenshot of how the post looks to me:
>
> http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png

I suspect that some clients are not handling the text correctly (probably due to bugs in their Unicode handling). If I view this thread on forum.dlang.org in firefox, then the text ends up with the accent on the T in the code with it being on the B in the expected output and on the e in the actual output. If I view it in chrome, the code has it on the e, the expected output has it on the e, and the actual output has it on the T - which is exactly what happens in my e-mail client. If I run the program on run.dlang.io in either firefox or chrome, it does the same thing as chrome and my e-mail client do with the forum post, putting the accent on the e in the code and putting it on the T in the output. The same thing happens when I run it locally in my console on FreeBSD. In no case do I see the accent on the e in the actual output, but it probably wouldn't be hard for a bug in a program's Unicode handling to put it on the e. Unicode is stupidly hard to process correctly, and the correct output of this program isn't something that you would normally see in real text.

- Jonathan M Davis




August 13, 2019
On Tue, Aug 13, 2019 at 05:43:19PM +0000, Gregor Mückl via Digitalmars-d wrote: [...]
> We must be seeing different things then. I've taken a screenshot of how the post looks to me:
> 
> http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png

Did you copy-n-paste the code and run it?  If you did, the browser may have done some Unicode processing on the string literal and munged the results.  Maybe spelling out the second string literal might help:

	writeln("приве\u0301т".retro);

Basically, the issue here is that "е\u0301" should be processed as a single grapheme, but since it's two separate code points, auto-decoding splits the grapheme, and when .retro is applied to it, the \u0301 is now attached to the wrong code point.

This is probably not the best example, since е\u0301 isn't really how Russian is normally written (it could be used in some learner dictionaries to indicate stress, but it's non-standard and most printed material don't do that).  Perhaps a better example might be Hangul Jamo or Arabic ligatures, but I'm unfamiliar with those languages so I don't know how to come up with a realistic example.

But the point is that according to Unicode, a grapheme consists of a base character followed by zero or more combining diacritics. Auto-decoding treats the base character separately from any combining diacritics, because it iterates over code points rather than graphemes, thus when the application is logically dealing with graphemes, you'll get incorrect results.  But if you're working only with code points, then auto-decoding works.

The problem is that most of the time, either (1) you're working with
"characters" ("visual" characters, i.e. graphemes), or (2) you don't
actually care about the string contents but just need to copy / move /
erase a substring.  For (1), auto-decoding gives the wrong results.  For
(2), auto-decoding wastes time decoding code units: you could have just
used a straight memcpy / memcmp / etc..

Unless you're implementing Unicode algorithms, you rarely need to work with code points directly. And if you're implementing Unicode algorithms, you already know (or should already know) at which level you need to be working with (code units, code points, or graphemes), so you hardly need the default iteration to be code points (just write .byCodePoint for clarity).

It doesn't make sense to have Phobos iterate over code points *by default* when it's not the common use case, represents a hidden performance hit, and in spite of that still not 100% correct anyway.


T

-- 
Век живи - век учись. А дураком помрёшь.
August 13, 2019
On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis wrote:
> [snip]
>
> It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output.
>
> - Jonathan M Davis

On my machine & browser, it looks like it is on the e on both.
August 13, 2019
On Tuesday, 13 August 2019 at 18:24:23 UTC, jmh530 wrote:
> On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis wrote:
>> [snip]
>>
>> It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output.
>>
>> - Jonathan M Davis
>
> On my machine & browser, it looks like it is on the e on both.

And for me, both on e at Windows but the bottom one on T at Linux, on the same browser (Firefox)!