April 19, 2015
On 18/04/15 21:40, Walter Bright wrote:
>
> I'm not arguing against the existence of the Unicode standard, I'm
> saying I can't figure any justification for standardizing different
> encodings of the same thing.
>

A lot of areas in Unicode are due to pre-Unicode legacy.

I'm guessing here, but looking at the code points: é (U+00E9, Latin small letter E with acute) comes from the Latin-1 Supplement block, which is designed to follow ISO-8859-1. U+0301 (combining acute accent) comes from the Combining Diacritical Marks block.

The way I understand things, Unicode would really prefer to use U+0065+U+0301 rather than U+00E9. Because of legacy systems, and because they would rather have the ISO-8859 code pages map to Unicode 1:1 rather than 1:n, they introduced code points they would really rather do without.

This also explains the "presentation forms" code charts (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs rather than code points, but due to legacy reasons it was not possible to simply discard them. They received code points, with a warning not to use those code points directly.
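You can watch that folding happen with compatibility normalization. A minimal D sketch, assuming Phobos's std.uni.normalize and its NFKC form:

import std.stdio;
import std.uni;

void main()
{
    // U+FB01 is the legacy "fi" ligature from the presentation forms chart
    dstring fiLigature = "\uFB01";
    // NFKC compatibility normalization folds it back to plain 'f' + 'i'
    assert(normalize!NFKC(fiLigature) == "fi"d);
    writeln(fiLigature, " -> ", normalize!NFKC(fiLigature));
}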

Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, typically do not have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with fewer than six code points, despite having only three letters.
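A minimal D sketch of that count, assuming std.uni.byGrapheme for the grapheme segmentation (the code points being shin, shin dot, patach, chet, patach, resh):

import std.range : walkLength;
import std.stdio;
import std.uni : byGrapheme;

void main()
{
    // shin + shin dot + patach, chet + patach, resh
    dstring name = "\u05E9\u05C1\u05B7\u05D7\u05B7\u05E8";
    assert(name.length == 6);                // six code points
    assert(name.byGrapheme.walkLength == 3); // three letters (grapheme clusters)
    writeln(name);
}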

The last paragraph isn't strictly true. You can use U+FB2A + U+05B7 for the first letter instead of U+05E9 + U+05C1 + U+05B7. You would be using the presentation form which, as pointed out above, is only there for legacy.

Shachar
or shall I say
שחר
April 19, 2015
On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu wrote:
> On 4/18/15 4:35 AM, Jacob Carlborg wrote:
>> On 2015-04-18 12:27, Walter Bright wrote:
>>
>>> That doesn't make sense to me, because the umlauts and the accented e
>>> all have Unicode code point assignments.
>>
>> This code snippet demonstrates the problem:
>>
>> import std.stdio;
>>
>> void main ()
>> {
>>     dstring a = "e\u0301";
>>     dstring b = "é";
>>     assert(a != b);
>>     assert(a.length == 2);
>>     assert(b.length == 1);
>>     writefln("%s %s", a, b);
>> }
>>
>> If you run the above code, all asserts should pass. If your system
>> correctly supports Unicode (it works on OS X 10.10), the two printed
>> characters should look exactly the same.
>>
>> \u0301 is the "combining acute accent" [1].
>>
>> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>
> Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei

Normalisation can sometimes allow simplifications, but knowing whether it will requires a lot of a priori knowledge about the input, as well as about the normalisation form.
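For reference, Phobos already ships such a pass in std.uni (normalize, rather than Andrei's suggested normalizeUTF name). A minimal sketch, assuming std.uni.normalize with its NFC/NFD forms, folding the two encodings from the snippet above together:

import std.stdio;
import std.uni;

void main()
{
    dstring a = "e\u0301"; // 'e' + combining acute accent
    dstring b = "\u00E9";  // precomposed é
    assert(a != b);                               // raw code points differ
    assert(normalize!NFC(a) == normalize!NFC(b)); // equal after NFC
    assert(normalize!NFD(b).length == 2);         // NFD splits the precomposed form
    writeln(normalize!NFC(a), " ", b);
}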
April 19, 2015
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
> [...]
>
> Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, typically do not have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with fewer than six code points, despite having only three letters.
>
> [...]

Yes, Arabic is similar too.

April 19, 2015
On Saturday, 18 April 2015 at 17:50:12 UTC, Walter Bright wrote:
> On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
>> \u0301 is the "combining acute accent" [1].
>>
>> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>
> I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.

é might be obvious, but Unicode isn't just for writing European prose. Uses for combining characters include (but are *nowhere* near limited to) mathematical notation, where the combinatorial explosion of possible combinations that still belong to one grapheme cluster ("character" is a familiar but misleading word when talking about Unicode) would require an insanely large number of precomposed characters; more than there are atoms in the universe.
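A minimal D sketch of the asymmetry, assuming std.uni.normalize: é has a precomposed code point, while the math-style x-hat does not.

import std.uni;

void main()
{
    // é has a precomposed code point, so NFC composes it into one...
    assert(normalize!NFC("e\u0301"d).length == 1);
    // ...but math-style x-hat has none: it stays base + combining mark
    assert(normalize!NFC("x\u0302"d).length == 2);
}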

Unicode is a nightmarish system in some ways, but considering how incredibly difficult the problem it solves is, it's actually not too crazy.
April 19, 2015
On 19/04/15 10:51, Abdulhaq wrote:
> On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
>> On 18/04/15 21:40, Walter Bright wrote:

>> Also, notice that some letters can only be achieved using multiple
>> code points. Hebrew diacritics, for example, typically do not have a
>> composite form. My name fully spelled (which you rarely would do),
>> שַׁחַר, cannot be represented with fewer than six code points, despite
>> having only three letters.
>>
>
> Yes, Arabic is similar too.
>

Actually, the Arabic presentation forms serve a slightly different purpose. In Hebrew, the presentation forms are mostly for Biblical text, where certain decorations are usually applied.

For Arabic, the main reason for the presentation forms is shaping. Almost every Arabic letter can be written in up to four different forms (alone, start of word, middle of word, and end of word). This means that Arabic has 28 letters, but over 100 different shapes for those letters. These days, when the font can do the shaping, the 28 letters suffice. In the DOS days, you needed to actually store those glyphs somewhere, which meant you needed to allocate a number to each of them.
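For illustration, a minimal D sketch; the beh code points below are from the Arabic Presentation Forms-B chart:

import std.stdio;

void main()
{
    // the single logical Arabic letter beh
    dchar beh = '\u0628';
    // its four legacy shape code points: isolated, final, initial, medial
    dchar[4] shapes = ['\uFE8F', '\uFE90', '\uFE91', '\uFE92'];
    writefln("logical letter: U+%04X", cast(uint) beh);
    foreach (shape; shapes)
        writefln("presentation form: U+%04X", cast(uint) shape);
}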

In Hebrew, some letters also have a final form. Since the numbers are so much smaller, however (22 letters, 5 of which have final forms), Hebrew keyboards actually have all 27 letters on them. Going strictly by the "Unicode way", one would be expected to spell שלום with U+05DE (the ordinary Mem) as the last letter, and let the shaping engine figure out that it should use the final form (or add a ZWNJ). Since all Hebrew code charts contained a final-form Mem, however, you actually spell it with U+05DD at the end, and it is considered a distinct letter.
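A minimal D sketch of that spelling, using the code points listed above:

import std.stdio;

void main()
{
    // shalom, spelled with the distinct final-mem code point at the end
    dstring shalom = "\u05E9\u05DC\u05D5\u05DD"; // שלום
    foreach (dchar c; shalom)
        writefln("U+%04X", cast(uint) c);
    // the last code point is U+05DD (final Mem), not U+05DE (ordinary Mem)
    assert(shalom[$ - 1] == '\u05DD');
}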

Shachar
April 19, 2015
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
> U+0065+U+0301 rather than U+00E9. Because of legacy systems, and because they would rather have the ISO-8859 code pages map to Unicode 1:1 rather than 1:n, they introduced code points they would really rather do without.

That's probably right. It is in fact a major feat to have the world adopt a new standard wholesale, but there are also difficult "semiotic" issues when you encode symbols, since different languages view symbols differently (e.g. is "ä" an "a" with dots, or a distinct letter of the alphabet?).

Take "å": it can represent a unit (ångström), a letter with a circle above it, or a unique letter of the alphabet. The letter "æ" can likewise be seen as a combination of "ae" or as a unique letter.
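Unicode even encodes that ambiguity directly: there is a separate "angstrom sign" code point whose NFC normalization folds into the letter. A minimal D sketch, assuming std.uni.normalize:

import std.uni;

void main()
{
    dstring unit   = "\u212B"; // Å as the legacy "angstrom sign" code point
    dstring letter = "\u00C5"; // Å as "Latin capital letter A with ring above"
    assert(unit != letter);                // distinct code points
    assert(normalize!NFC(unit) == letter); // NFC folds the unit into the letter
}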

And we can expect languages, signs and practices to evolve over time too. How can you normalize encodings without normalizing writing practice and natural language development? That would be beyond the mandate of a Unicode standards organization...
April 19, 2015
On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

> é might be obvious, but Unicode isn't just for writing European prose.

it is also for inserting pictures of animals into text.

> Unicode is a nightmarish system in some ways, but considering how incredibly difficult the problem it solves is, it's actually not too crazy.

it's not crazy, it's just broken in all possible ways: http://file.bestmx.net/ee/articles/uni_vs_code.pdf

April 19, 2015
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
> On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:
>
>> é might be obvious, but Unicode isn't just for writing European prose.
>
> it is also for inserting pictures of animals into text.

There's other uses for Unicode?
🐧
April 20, 2015
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
> On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

> it's not crazy, it's just broken in all possible ways:
> http://file.bestmx.net/ee/articles/uni_vs_code.pdf

Ketmar

Great link, and a really good argument about the problems with Unicode.

Quote from 'Instead of Conclusion':

Yes. This is the root of Unicode misdesign. They mixed up two mutually exclusive approaches. They blended badly two different abstraction levels: the textual level which corresponds to a language idea and the graphical level which does not care of a language, yet cares of writing direction, subscripts, superscripts and so on.

In other words we need two different Unicodes built on these two opposite principles, instead of the one built on an insane mix of controversial axioms.

end quote.

Perhaps Unicode needs to be rebuilt from the ground up?
April 20, 2015
On Mon, 20 Apr 2015 01:27:36 +0000, Nick B wrote:

> Perhaps Unicode needs to be rebuild from the ground up ?

alas, it's too late. now we'll live with that "unicode" crap for many years.