August 15, 2019
On Thursday, August 15, 2019 4:59:45 PM MDT H. S. Teoh via Digitalmars-d wrote:
> On Thu, Aug 15, 2019 at 10:37:57PM +0000, Gregor Mückl via Digitalmars-d wrote:
> > On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
> > > Basically, there is no sane way to avoid detaching the character encoding from the physical appearance of the character.  It simply makes no sense to have a different character for every variation of glyph across a set of fonts.  You *have* to work on a more abstract level, at the level of the *logical* identity of the character, not its specific physical appearance per font.
> >
> > OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same and should appear exactly the same (the latter doesn't necessarily happen because of font rendering deficiencies). E.g. the word "schön" can be encoded in two different ways while using only code points intended for German. So you can get the situation that "schön" != "schön".  This is unnecessary duplication.
>
> Well, yes, that part I agree with.  Unicode does have some dark corners like that.[*]  But I was just pointing out that Walter's ideal of 1 character per glyph is fallacious.
>
> [*] And some worse-than-dark-corners, like the whole codepage dedicated to emoji *and* combining marks for said emoji that change their *appearance* -- something that ought not to have any place in a character encoding scheme!  Talk about scope creep...

Considering that emojis are supposed to be pictures formed with letters (simple ASCII art, basically), they have no business being part of an encoding scheme in the first place - but having combining marks to change their appearance definitely makes it that much worse.
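(As an aside on the quoted "schön" example, here is a small D sketch of the duplication using Phobos's std.uni normalize; the string literals are purely illustrative. The two encodings compare unequal until both are put into the same normalization form.)

import std.stdio;
import std.uni : normalize, NFC, NFD;

void main()
{
    string precomposed = "sch\u00F6n";  // "ö" as the single code point U+00F6
    string decomposed  = "scho\u0308n"; // "o" followed by U+0308 (combining diaeresis)

    writeln(precomposed == decomposed);                               // false
    writeln(normalize!NFC(precomposed) == normalize!NFC(decomposed)); // true
    writeln(normalize!NFD(precomposed) == normalize!NFD(decomposed)); // true
}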

- Jonathan M Davis




August 15, 2019
On Thu, Aug 15, 2019 at 05:06:57PM -0600, Jonathan M Davis via Digitalmars-d wrote:
> On Thursday, August 15, 2019 4:59:45 PM MDT H. S. Teoh via Digitalmars-d wrote:
[...]
> > Unicode does have some dark corners like that.[*]
[...]
> > [*] And some worse-than-dark-corners, like the whole codepage dedicated to emoji *and* combining marks for said emoji that change their *appearance* -- something that ought not to have any place in a character encoding scheme!  Talk about scope creep...
> 
> Considering that emojis are supposed to be pictures formed with letters (simple ASCII art, basically), they have no business being part of an encoding scheme in the first place - but having combining marks to change their appearance definitely makes it that much worse.
[...]

It's not just emojis; GUI icons are already a thing in Unicode.  If this trend of encoding graphics in a string continues, in about a decade's time we'll be able to reinvent Nethack with graphical tiles inside a text mode terminal, using Unicode RPG icon "characters" which you can animate by attaching various "combining diacritics".  It would be kewl. But also utterly pointless and ridiculous.

(In fact, I wouldn't be surprised if you can already do this to some extent using emojis and GUI icon "characters". Just add a few more Unicode "characters" for in-game objects and a few more "diacritics" for animation frames, and we're already there. Throw in a zero-width, non-spacing "animation frame variant selector" "character", and we could have an entire animation sequence encoded as a string. Who even needs PNGs and animated SVGs anymore?!)
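(That variant selector isn't even hypothetical: U+FE0F already switches U+2764 from its text presentation to the emoji one. A quick, illustrative D sketch of how many encoding units hide behind that single visible symbol:)

import std.stdio;
import std.range : walkLength;

void main()
{
    // U+2764 HEAVY BLACK HEART followed by U+FE0F VARIATION SELECTOR-16,
    // which requests the emoji-style rendering of the same symbol.
    string heart = "\u2764\uFE0F";

    writeln(heart.length);     // 6 UTF-8 code units
    writeln(heart.walkLength); // 2 code points, one visible symbol
}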


T

-- 
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
August 15, 2019
On 8/13/19 2:11 PM, H. S. Teoh wrote:
> But if you're working only with code points,
> then auto-decoding works.

Albeit much slower than necessary in most cases...
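(For what it's worth, a small D sketch of the three levels in play; the decomposed "noël" literal is purely illustrative. byCodeUnit skips decoding entirely, auto-decoding walks code points, and byGrapheme walks user-perceived characters:)

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "noe\u0308l"; // "noël" with "ë" as 'e' + U+0308 (combining diaeresis)

    writeln(s.byCodeUnit.length);     // 6 UTF-8 code units -- no decoding at all
    writeln(s.walkLength);            // 5 code points -- what auto-decoding iterates
    writeln(s.byGrapheme.walkLength); // 4 graphemes -- what the reader perceives
}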
August 15, 2019
On 8/13/19 3:17 PM, matheus wrote:
> On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
>> ...
> 
> Like others said, you may not be able to see it through the browser, because the renderer may "fix" this.
> 

Jesus, haven't browser devs learned *ANYTHING* from their very own, INFAMOUS, "Let's completely fuck up 'the reliability principle'" debacle? I guess not. Cult of the amateurs wins out again...
August 15, 2019
On 8/15/2019 3:56 PM, H. S. Teoh wrote:
> And now that you agree that character encoding should be based on
> "symbol" rather than "glyph", the next step is the realization that, in
> the wide world of international languages out there, there exist
> multiple "symbols" that are rendered with the *same* glyph.  This is a
> hard fact of reality, and no matter how you wish it to be otherwise, it
> simply ain't so.  Your ideal of "character == glyph" simply doesn't
> work in real life.

Splitting semantic hairs is pointless; the fact remains it worked just fine in real life before Unicode: it's called "printing" on paper.

As for not working in real life, that's Unicode.
August 15, 2019
On 8/15/2019 3:16 PM, H. S. Teoh wrote:
> Please explain how you solve this problem.

The same way printers have solved the problem for the last 500 years.
August 16, 2019
On Thursday, 15 August 2019 at 19:05:32 UTC, Gregor Mückl wrote:
> On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:
>> [...]
>
> This is the point we're trying to get across to you: this isn't sufficient. Depending on the context and the script/language, you need access to the string at various levels. E.g. a font renderer sometimes needs to iterate over code points, not graphemes, in order to compose the correct glyphs.
>
> [...]

I want to thank you; that was really inspiring and got me to dig deeper into the problem!
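(To make the quoted point concrete: std.uni lets you walk either level, and each grapheme cluster exposes the code points it is built from. A minimal, illustrative D sketch:)

import std.stdio;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0308l"; // 'e' + U+0308 (combining diaeresis), then a plain 'l'

    foreach (g; s.byGrapheme)
    {
        writefln("grapheme made of %s code point(s):", g.length);
        foreach (i; 0 .. g.length)
            writefln("  U+%04X", cast(uint) g[i]); // the code points a renderer may need
    }
}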

August 16, 2019
On Friday, 16 August 2019 at 06:28:30 UTC, Walter Bright wrote:
> On 8/15/2019 3:56 PM, H. S. Teoh wrote:
>> And now that you agree that character encoding should be based on
>> "symbol" rather than "glyph", the next step is the realization that, in
>> the wide world of international languages out there, there exist
>> multiple "symbols" that are rendered with the *same* glyph.  This is a
>> hard fact of reality, and no matter how you wish it to be otherwise, it
>> simply ain't so.  Your ideal of "character == glyph" simply doesn't
>> work in real life.
>
> Splitting semantic hairs is pointless; the fact remains it worked just fine in real life before Unicode: it's called "printing" on paper.

Sorry, no, it didn't work in reality before Unicode. Multi-language systems were a mess.
I work on the biggest translation memory in the world, the Euramis system of the European Union, and when I started there in 2002 the system supported only 11 languages. The data in the Oracle database was already in Unicode, but the whole supporting translation chain was codepage-based. It was a catastrophe, and the amount of crap, especially in the Greek data, was staggering. The issues H. S. Teoh described above were indeed a real pain point. In Greek text it was very common to find Latin characters mixed in with Greek characters from codepage 1253: was that A an alpha or a \x41? This crap drove a lot of the algorithms used downstream from the database (CAT tools, automatic translation, etc.) completely bonkers.
For the 2004 enlargement of the EU we had to support one more alphabet (Cyrillic, for Bulgarian) and four more codepages (CP-1250 Latin-2 Extended-A, CP-1251 Cyrillic, CP-1257 Baltic and ISO-8859-3 Maltese). It would have been such a mess that we decided to convert everything to Unicode.
We don't have that crap data anymore. Our code is not perfect, far from it, but adopting Unicode through and through and dropping all support for the old codepage crap simplified our lives tremendously.
When we got the request in 2010 from the EEAS (European External Action Service) to also support languages other than the 24 official EU languages, i.e. Russian, Arabic and Chinese, we didn't break a sweat implementing it, thanks to Unicode.
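(To put the "A or alpha" confusion in code: with Unicode you can at least query the script property and spot the mix. A rough D sketch using std.uni's property sets; the sample word is illustrative:)

import std.stdio;
import std.uni : unicode;

void main()
{
    // A Latin 'A' (U+0041) at the start of an otherwise Greek word; on screen
    // it looks identical to GREEK CAPITAL LETTER ALPHA (U+0391).
    string word = "A\u0398\u0397\u039D\u0391";

    auto greek = unicode.Greek;
    auto latin = unicode.Latin;

    foreach (dchar c; word)
    {
        if (latin[c])
            writefln("U+%04X looks Greek here, but is Latin", cast(uint) c);
        else if (greek[c])
            writefln("U+%04X is Greek", cast(uint) c);
    }
}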

>
> As for not working in real life, that's Unicode.

Unicode works much, much better than anything that existed before. The issue is that not a lot of people work in a multi-language environment, so they have no clue of the unholy mess it was before.



August 16, 2019
On Friday, 16 August 2019 at 06:29:50 UTC, Walter Bright wrote:
> On 8/15/2019 3:16 PM, H. S. Teoh wrote:
>> Please explain how you solve this problem.
>
> The same way printers have solved the problem for the last 500 years.

They didn't have to do automatic processing of the represented data; it was for pure human consumption.
When the data is to be processed automatically, it is a whole other problem. I'm quite sure that you sometimes appreciate the results of automatic translation (Google Translate, Yandex, Systran, etc.). While the results are far from perfect, they would be absolutely impossible if we used what you propose here.

August 16, 2019
On 8/16/2019 2:20 AM, Patrick Schluter wrote:
> Sorry, no, it didn't work in reality before Unicode. Multi-language systems were a mess.

I have several older books that move facilely between multiple languages. It's not a mess.

Since the reader can figure all this out without invisible semantic information in the glyphs, that invisible information is not necessary.

Once you print/display the Unicode string, all that semantic information is gone. It is not needed.


> Unicode works much, much better than anything that existed before. The issue is that not a lot of people work in a multi-language environment, so they have no clue of the unholy mess it was before.

Actually, I do. Zortech C++ supported multiple code pages, multiple multibyte encodings, and had error messages in 4 languages.

Unicode, in its original vision, solved those problems.