April 20, 2015
On Monday, 20 April 2015 at 03:39:54 UTC, ketmar wrote:
> On Mon, 20 Apr 2015 01:27:36 +0000, Nick B wrote:
>
>> Perhaps Unicode needs to be rebuild from the ground up ?
>
> alas, it's too late. now we'll live with that "unicode" crap for many
> years.

Perhaps. Or perhaps not. This community got together under Walter's and Andrei's leadership to build a new programming language on the pillars of the old.
Perhaps a new Unicode standard could start that way as well?

April 20, 2015
On 19/04/15 22:58, ketmar wrote:
> On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:
>
> it's not crazy, it's just broken in all possible ways:
> http://file.bestmx.net/ee/articles/uni_vs_code.pdf
>

This is not a very accurate depiction of Unicode.

For example:
And, moreover, BOM is meaningless without mentioning of encoding. So we have to specify encoding anyway.

No. The BOM is what lets you auto-detect the encoding. If you know the text is in UTF-8, UTF-16, or UTF-32, but not which, the BOM tells you which one it is. That is its entire purpose, in fact.
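To sketch that purpose: detection is a simple prefix check on the raw bytes. (The function name and the "unknown" fallback here are mine, not a proposed API; a real decoder would also apply heuristics when no BOM is present.)

```d
import std.algorithm.searching : startsWith;

// Sketch only: identify a UTF encoding from its BOM, if one is present.
string encodingFromBom(const(ubyte)[] data)
{
    // Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    if (data.startsWith([0x00, 0x00, 0xFE, 0xFF])) return "UTF-32BE";
    if (data.startsWith([0xFF, 0xFE, 0x00, 0x00])) return "UTF-32LE";
    if (data.startsWith([0xEF, 0xBB, 0xBF]))       return "UTF-8";
    if (data.startsWith([0xFE, 0xFF]))             return "UTF-16BE";
    if (data.startsWith([0xFF, 0xFE]))             return "UTF-16LE";
    return "unknown"; // no BOM: fall back to heuristics or external metadata
}

void main()
{
    assert(encodingFromBom([0xEF, 0xBB, 0xBF, 0x41]) == "UTF-8");
    assert(encodingFromBom([0xFF, 0xFE, 0x00, 0x00]) == "UTF-32LE");
    assert(encodingFromBom([0x41, 0x42]) == "unknown");
}
```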

There, pretty much, goes point #1.

And then:
Unicode contains at least “writing direction” control symbols (LTR is U+200E and RTL is U+200F) which role is IDENTICAL to the role of codepage-switching symbols with the associated disadvantages.

That's just ignorance of how the UBA (TR#9) works. LRM and RLM are mere invisible characters with defined directionality. Cutting them away from a substring would not invalidate your text more than cutting away actual text would under the same conditions. In any case, unlike page switching symbols, it would only affect your display, not your understanding of the text.
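For illustration, LRM really is just another code point sitting in the string, with no hidden state attached:

```d
void main()
{
    // U+200E (LEFT-TO-RIGHT MARK) is an invisible character with strong
    // LTR directionality; it influences display order only, not content.
    string s = "abc\u200E123";
    assert(s.length == 9); // "abc" (3) + LRM (3 UTF-8 code units) + "123" (3)
}
```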

So point #2 is out.

He has some valid argument under point #3, but also lots of !(@#&$ nonsense. He is right, I think, that denoting units with separate code points makes no sense, but the rest of his arguments seem completely off. For example, asking Latin and Cyrillic to share the same region merely because some letters look alike makes no sense, implementation-wise.


Points #4, #5, #6 and #7 are the same point. The main objection I have there is his assumption that the situation is, somehow, worse than it was. Yes, if you knew your encoding was Windows-1255, you could assume the text is Hebrew.

Or Yiddish.

And this, I think, is one of the encodings with the fewest languages riding on it. Windows-1256 has Arabic, Persian, Urdu and others. Windows-1252 has the entire Western European script. As pointed out elsewhere in this thread, Spanish and French treat case folding of accented letters differently.

Also, we see that the solution he thinks would work better actually doesn't. People living in France don't switch to a QWERTY keyboard when they want to type English. They type English with their AZERTY keyboard. There simply is no automatic way to tell what language something is typed in without a human telling you (or applying content based heuristics).

Microsoft Word stores, for each letter, the keyboard language it was typed with. This causes great problems when copying to other editors, performing searches, or simply trying to get bidirectional text to appear correctly. The problem is so bad that phone numbers where the prefix appears after the actual number are not considered bad form or unusual, even in official PR material or when sending resumes.

In fact, the only time you can count on someone to switch keyboards is when they need to switch to a language with a different alphabet. No Russian speaker will type English using the Russian layout, even if what she has to say happens to use letters with the same glyphs. You simply do not plan that much ahead.

The point I'm driving at is that just because someone posted a rant on the Internet doesn't mean it's correct. When someone says something is broken, always ask them what they suggest instead.

Shachar
April 20, 2015
On 2015-04-20 08:04, Nick B wrote:

> Perhaps a new Unicode standard, could start that way as well ?

https://xkcd.com/927/

-- 
/Jacob Carlborg
April 20, 2015
On Saturday, 18 April 2015 at 17:04:54 UTC, Tobias Pankrath wrote:
>> Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline.
>
> Yes.
>
>> Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei
>
> I don't think so. The thing is, even after normalization we have to deal with combining characters because in all normalization forms there will be combining characters left after normalization.

Yes, again and again I have encountered length-related bugs with Unicode characters. Normalization is not 100% reliable. I don't know anyone who works with non-English characters who doesn't run into Unicode-related issues sometimes.
April 20, 2015
>
> Yes, again and again I encountered length related bugs with Unicode characters. Normalization is not 100% reliable.

I think it is 100% reliable; it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the Unicode sense. It says nothing about columns, string length, or grapheme count.
April 20, 2015
On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:
>>
>> Yes, again and again I encountered length related bugs with Unicode characters. Normalization is not 100% reliable.
>
> I think it is 100% reliable; it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the Unicode sense. It says nothing about columns, string length, or grapheme count.

The problem is not normalization as such, the problem is with string (as opposed to dstring):

import std.uni : normalize, NFC;
void main() {

  dstring de_one = "é";        // U+00E9, precomposed
  dstring de_two = "e\u0301";  // 'e' + U+0301 combining acute accent

  assert(de_one.length == 1);  // one UTF-32 code unit
  assert(de_two.length == 2);  // two UTF-32 code units

  string e_one = "é";
  string e_two = "e\u0301";

  string random = "ab";

  assert(e_one.length == 2);   // é is two UTF-8 code units
  assert(e_two.length == 3);   // 'e' (1 code unit) + U+0301 (2 code units)
  assert(e_one.length == random.length); // "é" and "ab" have the same length!

  assert(normalize!NFC(e_one).length == 2);
  assert(normalize!NFC(e_two).length == 2); // NFC recomposes to U+00E9
}

This can lead to subtle bugs, cf. length of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.
April 20, 2015
> This can lead to subtle bugs, cf. length of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.

There are three things you need to be aware of when handling Unicode: code units, code points, and graphemes.

In general the length of one guarantees nothing about the length of the other, except for UTF-32, where there is a 1:1 mapping between code units and code points.

In this thread, we were discussing the relationship between code points and graphemes. Your examples, however, apply to the relationship between code units and code points.

To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.
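All three levels can be counted directly with what Phobos already provides:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301"; // 'e' plus U+0301 combining acute: one visible character
    assert(s.length == 3);                // code units (UTF-8)
    assert(s.walkLength == 2);            // code points (via auto-decoding)
    assert(s.byGrapheme.walkLength == 1); // graphemes
}
```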

If you normalize a string (in the sequence-of-characters/code-points sense, not object.string) to NFC, it will decompose every precomposed character in the string (like é, a single code point), establish a defined order between the combining characters, and then recompose a selected few graphemes (like é). This way é always ends up as a single code point in NFC. There are dozens of other combinations where you'll still have an n:1 mapping between code points and graphemes left after normalization.

Example given already in this thread: putting an arrow over a Latin letter is typical in math and is always more than one code point.
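That math example is quick to check (U+20D7 is COMBINING RIGHT ARROW ABOVE; no precomposed "x with arrow" exists, so NFC has nothing to recompose):

```d
import std.range : walkLength;
import std.uni : NFC, normalize;

void main()
{
    string vec = "x\u20D7"; // "x" with a combining arrow, as in vector notation
    assert(normalize!NFC(vec) == vec); // no precomposed form to recompose into
    assert(vec.walkLength == 2);       // still two code points, one grapheme
}
```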

April 20, 2015
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
> To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.

Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can be (and is; see urxvt for example) shoe-horned in.
April 20, 2015
On Mon, Apr 20, 2015 at 06:03:49PM +0000, John Colvin via Digitalmars-d wrote:
> On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
> >To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.
> 
> Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can (and is, see urxvt for example) be shoe-horned in.

Yeah, even the grapheme count does not necessarily tell you how wide the printed string really is. The characters in the CJK block are usually rendered with fonts that are, on average, twice as wide as your typical Latin/Cyrillic character, so even applications like urxvt that shoehorn proportional-width fonts into a text grid render CJK characters as two columns rather than one.

Because of this, I actually wrote a function at one time to determine the width of a given Unicode character (i.e., zero, single, or double) as displayed in urxvt. Obviously, this is no help if you need to wrap lines rendered with a proportional font. And it doesn't even attempt to work correctly with bidi text.
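A minimal sketch of such a width function might look like the following. This is my own toy version, not T's actual code: the ranges are deliberately incomplete, and a real implementation needs the full East Asian Width data (UAX #11).

```d
// Toy classification: 0 columns for a few zero-width/combining characters,
// 2 for some common wide CJK ranges, 1 otherwise. Illustrative only.
int roughDisplayWidth(dchar c)
{
    if (c == 0x200B || (c >= 0x0300 && c <= 0x036F)) return 0;
    if ((c >= 0x1100 && c <= 0x115F) ||  // Hangul Jamo
        (c >= 0x2E80 && c <= 0xA4CF) ||  // CJK radicals through Yi
        (c >= 0xAC00 && c <= 0xD7A3) ||  // Hangul syllables
        (c >= 0xF900 && c <= 0xFAFF) ||  // CJK compatibility ideographs
        (c >= 0xFF00 && c <= 0xFF60))    // fullwidth forms
        return 2;
    return 1;
}

void main()
{
    assert(roughDisplayWidth('A') == 1);      // ordinary Latin letter
    assert(roughDisplayWidth('中') == 2);     // CJK ideograph, two columns
    assert(roughDisplayWidth('\u0301') == 0); // combining acute accent
}
```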

This is why I said at the beginning that wrapping a line of text is a LOT harder than it sounds. A function that only takes a string as input does not have the necessary information to do this correctly in all use cases. The current wrap() function doesn't even do it correctly modulo the information available: it doesn't handle combining diacritics and zero-width characters properly. In fact, it doesn't even handle control characters properly, except perhaps for \t and \n. There are so many things wrong with the current wrap() function (and many other string-processing functions in Phobos) that it makes it look like a joke when we claim that D provides Unicode correctness out-of-the-box.

The only use case where wrap() gives the correct result is when you stick with pre-Unicode Latin strings to be displayed on a text console. As such, I don't really see the general utility of wrap() as it currently stands, and I question its value in Phobos, as opposed to an actually more useful implementation that, for instance, correctly implements the Unicode line-breaking algorithm.


T

-- 
It said to install Windows 2000 or better, so I installed Linux instead.
April 20, 2015
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
>> This can lead to subtle bugs, cf. length of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.
>
> There are three things you need to be aware of when handling Unicode: code units, code points, and graphemes.

This is why I use a helper function that uses byCodePoint and byGrapheme. At least for my use cases it returns the correct length. However, I might think about an alternative version based on the discussion here.

> In general the length of one guarantees nothing about the length of the other, except for UTF-32, where there is a 1:1 mapping between code units and code points.
>
> In this thread, we were discussing the relationship between code points and graphemes. Your examples, however, apply to the relationship between code units and code points.
>
> To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.
>
> If you normalize a string (in the sequence-of-characters/code-points sense, not object.string) to NFC, it will decompose every precomposed character in the string (like é, a single code point), establish a defined order between the combining characters, and then recompose a selected few graphemes (like é). This way é always ends up as a single code point in NFC. There are dozens of other combinations where you'll still have an n:1 mapping between code points and graphemes left after normalization.
>
> Example given already in this thread: putting an arrow over a Latin letter is typical in math and is always more than one code point.