August 16, 2019
On Friday, 16 August 2019 at 09:34:21 UTC, Walter Bright wrote:
> On 8/16/2019 2:20 AM, Patrick Schluter wrote:
>> Sorry, no, it didn't work in reality before Unicode. Multi-language systems were a mess.
>
> I have several older books that move facilely between multiple languages. It's not a mess.
>
> Since the reader can figure all this out without invisible semantic information in the glyphs, that invisible information is not necessary.

Unicode's purpose is not limited to the output at the end of the processing chain. It's the whole processing chain that is the point.

>
> Once you print/display the Unicode string, all that semantic information is gone. It is not needed.

As I said, printing is only a minor part of language processing. To give an example from the EU again, just to illustrate: we have exactly three laser printers (one of which is a photocopier) on each floor of our offices. You may say, "oh, you're the IT guys, you don't need to print that much", to which I respond: half of the floor is occupied by the English translation unit, and while they indeed use the printers more than we do, printing is not a significant part of their workflow either.

>
>
>> Unicode works much, much better than anything that existed before. The issue is that not many people work in a multi-language environment, so they have no clue what an unholy mess it was before.
>
> Actually, I do. Zortech C++ supported multiple code pages, multiple multibyte encodings, and had error messages in 4 languages.

Each string was in its own language. We have to deal with texts that mix languages: sentences in Bulgarian with an office address in Greece, embedded in an XML file. Code pages don't work in that case, unless you introduce an escaping scheme far more brittle and annoying than UTF-8 or UTF-16 encoding.
The European Parliament's session logs are what are called panaché documents, i.e. each transcript is in the native language of the intervening MEP. So completely mixed documents.
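To make that concrete, here is a minimal D sketch (the Bulgarian label and Greek address below are invented for illustration): a single UTF-8 string mixes Cyrillic and Greek text with no code-page switching and no escaping.

    // A minimal sketch: one UTF-8 string holding Bulgarian Cyrillic and Greek,
    // with no code-page switch or escape sequence anywhere.
    import std.stdio : writeln;
    import std.utf : validate;

    void main()
    {
        string mixed = "Офис адрес: Πλατεία Συντάγματος 1, Αθήνα";

        validate(mixed);       // throws if the string is not well-formed UTF-8
        writeln(mixed.length); // length in UTF-8 code units (bytes), not characters
        writeln(mixed);
    }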

>
> Unicode, in its original vision, solved those problems.

Unicode is not perfect, and the emoji stuff is indeed crap, but Unicode is better than what was used before.
And to insist again, Unicode is mostly about "DATA PROCESSING". Sometimes it might result in a human-readable result, but that is only one part of its purpose.
August 16, 2019
On Friday, August 16, 2019 4:32:06 AM MDT Patrick Schluter via Digitalmars-d wrote:
> > Unicode, in its original vision, solved those problems.
>
> Unicode is not perfect, and the emoji stuff is indeed crap, but
> Unicode is better than what was used before.
> And to insist again, Unicode is mostly about "DATA PROCESSING".
> Sometimes it might result in a human-readable result, but that is
> only one part of its purpose.

I don't think that anyone is arguing that Unicode is worse than what we had before. The problem is that there are aspects of Unicode that are screwed up, making it far worse to deal with than it should be. We'd be way better off if those mistakes had not been made. So, we're better off than we were but also definitely worse off than we should be.

- Jonathan M Davis



August 16, 2019
On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright wrote:
> And yet somehow people manage to read printed material without all these problems.

If the same glyphs had the same codes, what would you do in these cases?

1) Sorting strings.

In my phone's contact list there are entries in Russian, in English, and mixed.
Currently they are sorted as:
A (Latin), B (Latin), C, А (ru), Б, В (ru),
which is pretty easy to search and navigate.

What would the order be if Unicode worked the way you want?

2) Case conversion:
- in English: 'B'.toLower == 'b'
- in Russian: 'В'.toLower == 'в'
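A minimal D sketch of both points, using Phobos std.algorithm and std.uni (the contact names are invented for the example):

    import std.algorithm.sorting : sort;
    import std.stdio : writefln, writeln;
    import std.uni : toLower;

    void main()
    {
        // 1) Sorting: distinct code points keep the Latin and Cyrillic entries
        //    in separate, predictable runs.
        auto names = ["Boris", "Борис", "Anna", "Анна"];
        sort(names);
        writeln(names); // ["Anna", "Boris", "Анна", "Борис"]

        // 2) Case conversion: the same printed glyph, but different letters
        //    with different lowercase forms.
        dchar latinB    = '\u0042'; // Latin capital B
        dchar cyrillicV = '\u0412'; // Cyrillic capital Ve, drawn with the same shape
        writefln("%s -> %s", latinB, toLower(latinB));       // B -> b
        writefln("%s -> %s", cyrillicV, toLower(cyrillicV)); // В -> в
    }

If lookalike glyphs shared one code, neither operation could be done without guessing the language from context.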



August 16, 2019
On Friday, 16 August 2019 at 10:32:06 UTC, Patrick Schluter wrote:
> On Friday, 16 August 2019 at 09:34:21 UTC, Walter Bright wrote:
>> [...]
>
> Unicode's purpose is not limited to the output at the end of the processing chain. It's the whole processing chain that is the point.
>
>> [...]
>
> As I said, printing is only a minor part of language processing. To give an example from the EU again, just to illustrate: we have exactly three laser printers (one of which is a photocopier) on each floor of our offices. You may say, "oh, you're the IT guys, you don't need to print that much", to which I respond: half of the floor is occupied by the English translation unit, and while they indeed use the printers more than we do, printing is not a significant part of their workflow either.
>
>> [...]
>
> Each string was in its own language. We have to deal with texts that mix languages: sentences in Bulgarian with an office address in Greece, embedded in an XML file. Code pages don't work in that case, unless you introduce an escaping scheme far more brittle and annoying than UTF-8 or UTF-16 encoding.
> The European Parliament's session logs are what are called panaché documents, i.e. each transcript is in the native language of the intervening MEP. So completely mixed documents.
>
>> [...]
>
> Unicode is not perfect, and the emoji stuff is indeed crap, but Unicode is better than what was used before.
> And to insist again, Unicode is mostly about "DATA PROCESSING". Sometimes it might result in a human-readable result, but that is only one part of its purpose.

These are great examples and I totally agree with you (and H. S. Teoh). It's no coincidence that the people who can read, write, and speak more than one language with more than one script are the ones who think Unicode is beneficial. It seems that those who are stuck in the world of Anglo/Latin characters just don't have the experience required to understand why their simpler schemes won't work.
August 16, 2019
On Thu, Aug 15, 2019 at 11:29:50PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/15/2019 3:16 PM, H. S. Teoh wrote:
> > Please explain how you solve this problem.
> 
> The same way printers solved the problem for the last 500 years.

Please elaborate.  Because you appear to be saying that Unicode should encode the specific glyph, i.e., every font will have unique encodings for its glyphs, because every unique glyph corresponds to a unique encoding.  This is patently absurd, since your string encoding becomes dependent on font selection.

How do you reconcile these two things:

(1) The encoding of a character should not be font-dependent. I.e., it
    should encode the abstract "symbol" rather than the physical
    rendering of said symbol.

(2) In the real world, there exist different symbols that share the same
    glyph shape.
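For instance (the three capitals below are chosen only to illustrate point (2)), Latin A, Greek Alpha, and Cyrillic A usually render with the same glyph, yet they are three distinct symbols and thus three distinct code points:

    import std.stdio : writefln;

    void main()
    {
        // Three symbols that typically share one glyph shape but remain
        // logically distinct, and so carry distinct code points.
        dchar[] lookalikes = ['\u0041', '\u0391', '\u0410']; // Latin A, Greek Alpha, Cyrillic A

        foreach (c; lookalikes)
            writefln("%s = U+%04X", c, cast(uint) c);
        // A = U+0041, Α = U+0391, А = U+0410
    }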


T

-- 
Customer support: the art of getting your clients to pay for your own incompetence.
August 16, 2019
On Fri, Aug 16, 2019 at 04:41:01PM +0000, Abdulhaq via Digitalmars-d wrote: [...]
> It's no coincidence that those people who can read, write and speak more than one language with more than one script are those who think Unicode is beneficial.

To be clear, there are aspects of Unicode that I don't agree with.  But what Walter is proposing (1 glyph == 1 character) simply does not work. It fails to handle the inherent complexities of working with multi-lingual strings.


> It seems that those who are stuck in the world of anglo/latin characters just don't have the experience required to understand why their simpler schemes won't work.

Walter claims to have experience working with code translated into 4 languages.  I suspect (Walter please correct me if I'm wrong) that it mostly just involved selecting a language at the beginning of the program, and substituting strings with translations into said language during output.  If this is the case, his stance of 1 glyph == 1 character makes sense, because that's all that's needed to support this limited functionality.

Where this scheme falls down is when you need to perform automatic processing of multi-lingual strings -- an inevitability in this day and age of global communications. It makes no sense for a single letter to have two different encodings just because your user decided to use a different font, but that's exactly what Walter is proposing -- I wonder if he realizes that.


T

-- 
Written on the window of a clothing store: No shirt, no shoes, no service.
August 16, 2019
On Fri, Aug 16, 2019 at 10:01:57AM -0700, H. S. Teoh via Digitalmars-d wrote: [...]
> How do you reconcile these two things:
> 
> (1) The encoding of a character should not be font-dependent. I.e., it
>     should encode the abstract "symbol" rather than the physical
>     rendering of said symbol.
> 
> (2) In the real world, there exist different symbols that share the same
>     glyph shape.
[...]

Or, to use a different example that stems from the same underlying issue, let's say we take a Russian string:

	Я тебя люблю.

In a cursive font, it might look something like this:

	Я mеδя ∧юδ∧ю.

(I'm deliberately substituting various divergent Unicode characters to
make a point.)

According to your proposal, the upright т and its cursive form, which looks like m, ought to be encoded differently, because they are different glyphs. So that means that Cyrillic lowercase т has *two* different encodings (and ditto for the other lookalikes). This is obviously absurd, because it's the SAME LETTER in Cyrillic. Insisting that they be encoded differently means your string encoding depends on the font, which is in itself already ridiculous, and worse yet, it means that if you're writing a web script that accepts input from users, you have no idea which encoding they will use when they want to write Cyrillic lowercase т. You end up with two strings that are logically identical but bitwise different, just because the user happened to have a font where т is displayed as m. Goodbye sane substring search, goodbye sane automatic string processing, goodbye consistent string rendering code.

This is equivalent to saying that English capital A in serif ought to have a different encoding from English capital A in sans serif, because their glyph shapes are different. If you follow that route, pretty soon you'll have a different encoding for bolded A, another encoding for slanted A (which is different from italic A), and the combinatorial explosion of useless redundant encodings thereof. It simply does not make any sense.

The only sane way out of this mess is the way Unicode has taken: you encode *not* the glyph, but the logical entity behind the glyph, i.e., the "symbol" as you call it, or in Unicode parlance, the code point. Cyrillic lowercase т is a unique entity that corresponds to exactly one code point, notwithstanding that some of its forms are lookalikes of Latin lowercase m. Even if the font ultimately uses literally the same glyph to render them, they remain distinct entities in the encoding because they are *logically different things*.
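Here is a short D sketch of that distinction (the strings are just for illustration): Cyrillic т (U+0442) and Latin m (U+006D) compare as different code points, so substring search stays unambiguous no matter how a cursive font happens to draw т:

    import std.algorithm.searching : canFind;
    import std.stdio : writeln;

    void main()
    {
        string russian = "Я тебя люблю.";

        writeln(russian.canFind("т")); // true:  Cyrillic small te matches itself
        writeln(russian.canFind("m")); // false: Latin small m is a different code point
        writeln('\u0442' == '\u006D'); // false: distinct encodings, whatever the font draws
    }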

In today's age of international communications and multilingual strings, the fact of different logical characters sharing the same rendered form is an unavoidable, harsh reality.  You either face it and deal with it in a sane way, or you can hold on to broken old approaches that don't work and fade away in the rearview mirror.  Your choice. :-D


T

-- 
Без труда не выловишь и рыбку из пруда. (Without effort you won't even pull a fish out of the pond.)
August 16, 2019
On Fri, Aug 16, 2019 at 02:34:21AM -0700, Walter Bright via Digitalmars-d wrote: [...]
> Once you print/display the Unicode string, all that semantic information is gone. It is not needed.
[...]

So in other words, we should encode 1, I, |, and l with exactly the same value, because in print, they aII look about the same anyway, and the user is well able to figure out from context which one is meant. After a11, once you print the string the semantic distinction is gone anyway, and human beings are very good at te||ing what was actually intended in spite of the ambiguity.

Bye-bye unambiguous D lexer, we hardly knew you; now we need to rewrite you with a context-sensitive algorithm that figures out whether we meant 11, ||, II, or ll in our source code encoded in Walter Encoding.
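A toy sketch (not the real D lexer, of course) of why the distinct encodings matter: because '1', 'l', 'I', and '|' are separate code points, a single character is enough to classify a token start, with no context-sensitive guessing. The classify helper below is hypothetical, purely for illustration.

    import std.stdio : writefln;

    // Hypothetical helper, for illustration only.
    string classify(dchar c)
    {
        switch (c)
        {
            case '1':      return "digit";
            case 'l', 'I': return "identifier start";
            case '|':      return "operator";
            default:       return "other";
        }
    }

    void main()
    {
        foreach (dchar c; "1lI|")
            writefln("U+%04X -> %s", cast(uint) c, classify(c));
    }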


T

-- 
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
August 16, 2019
On 8/16/2019 3:32 AM, Patrick Schluter wrote:
> Unicode is not perfect, and the emoji stuff is indeed crap, but Unicode is better than what was used before.

I'm not arguing otherwise.

> And to insist again, Unicode is mostly about "DATA PROCESSING". Sometimes it might result in a human-readable result, but that is only one part of its purpose.

And that's mission creep, which came later and should not have occurred.

With such mission creep, there will be no end of intractable problems. People assign new semantic meanings to characters all the time. Trying to embed that into Unicode is beyond impractical.

To repeat an example:

    a + b = c

Why not have special Unicode code points for when letters are used as mathematical symbols?

   18004775555

Maybe some special Unicode code points for phone numbers?

How about Social Security digits? Credit card digits?
August 16, 2019
On 8/16/2019 10:52 AM, H. S. Teoh wrote:
> So in other words, we should encode 1, I, |, and l with exactly the same
> value, because in print, they aII look about the same anyway, and the
> user is well able to figure out from context which one is meant. After
> a11, once you print the string the semantic distinction is gone anyway,
> and human beings are very good at te||ing what was actually intended in
> spite of the ambiguity.
> 
> Bye-bye unambiguous D lexer, we hardly knew you; now we need to rewrite
> you with a context-sensitive algorithm that figures out whether we meant
> 11, ||, II, or ll in our source code encoded in Walter Encoding.

Fonts people use for programming take pains to distinguish them.