August 15, 2019
On 8/15/2019 2:25 PM, Vladimir Panteleev wrote:
> On Thursday, 15 August 2019 at 21:21:33 UTC, Walter Bright wrote:
>> On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
>>> I sent a few PRs for the modules that I am listed as a code owner of.
>>
>> Can you please add a link to those PRs in
>> https://github.com/dlang/phobos/pull/7130 ?
> 
> I added a link to #7130 to the PR descriptions, which should do it. I noticed you added some comments just now doing the same.
> 

I went one better: I added a [no autodecode] label!
August 15, 2019
On 8/15/2019 2:26 PM, a11e99z wrote:
> On Thursday, 15 August 2019 at 19:59:34 UTC, Walter Bright wrote:
>> On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
>>> There should only be a
>>> single way to represent a given character.
>>
>> Exactly. And two glyphs that render identically should be the same code point.
>>
> 
> If that wasn't sarcasm:
> different code points can refer to the same glyph, but not vice versa:
> A(EN,\u0041), A(RU,\u0410), A(EL,\u0391)
> otherwise sorting for non-English text will not work.
> 
> Even the ordering (A<B) would be wrong. For example, the Russian glyphs
> ABCEHKMOPTXacepuxy
> correspond to the following English letters by sound or meaning:
> AVSENKMORTHaserihu
> As you can see, the uppercase and lowercase forms don't even exist as pairs, and they have different meanings.

Yes, I've heard this argument before.

The answer is that language should not be embedded in Unicode; it will lead to nothing but problems. Language is something assigned externally to a block of text, not embedded in the text itself, just as in printed text.

Again,

   a + b = c

Should those be separate code points? How about:

   a) one thing
   b) another
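As a concrete reference for the A(EN)/A(RU)/A(EL) example above: those are three distinct code points that happen to share a glyph, and each one case-maps within its own script. A minimal D sketch, using only std.stdio and std.uni from Phobos:

   import std.stdio : writefln;
   import std.uni : toLower;

   void main()
   {
       // Latin, Cyrillic and Greek capital "A": one glyph shape in many fonts,
       // three different code points.
       foreach (dchar c; "\u0041\u0410\u0391")
           writefln("U+%04X -> lowercase U+%04X",
                    cast(uint) c, cast(uint) toLower(c));
       // Prints:
       //   U+0041 -> lowercase U+0061   (Latin A -> a)
       //   U+0410 -> lowercase U+0430   (Cyrillic А -> а)
       //   U+0391 -> lowercase U+03B1   (Greek Α -> α)
       // Each letter lowercases within its own script, which is what makes
       // per-language sorting and case conversion workable.
   }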
August 15, 2019
On Thu, Aug 15, 2019 at 12:59:34PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
> > There should only be a single way to represent a given character.
> 
> Exactly. And two glyphs that render identically should be the same code point.
[...]

It's not as simple as you imagine.  Letter shapes across different languages can look alike, but have zero correspondence with each other. Conflating two distinct letter forms just because they happen to look alike is the beginning of the road to madness.

First and foremost, the exact glyph shape depends on the font -- a cursive M is a different shape from a serif upright M which is different from a sans-serif bolded M.  They are logically the exact same character, but they are rendered differently depending on the font.

What's the problem with that, you say?  Here's the problem: if we follow
your suggestion of identifying characters by rendered glyph, that means
a lowercase English 'u' ought to be the same character as the cursive
form of Cyrillic и (because that's how it's written in cursive).
However, non-cursive Cyrillic и is printed as и (i.e., the equivalent of
a "backwards" small-caps English N).  You cannot be seriously suggesting
that и and u should be the same character, right?!  The point is that
this changes *based on the font*; Russian speakers recognize the two
*distinct* glyphs as the SAME letter.  They also recognize that it's a
DIFFERENT letter from English u, in spite of the fact the glyphs are
identical.

This is just one of many such examples.  Yet another Cyrillic example: lowercase cursive Т is written with a glyph that, for all practical purposes, is identical to the glyph for English 'm'.  Again, conflating the two based on your idea is outright ridiculous.  Just because the user changes the font, should not mean that now the character becomes a different letter! (Or that the program needs to rewrite all и's into lowercase u's!)

How a letter is rendered is a question of *font*, and I'm sure you'll agree that it doesn't make sense to make decisions on character identity based on which font you happen to be using.

Then take an example from Chinese: the character for "one" is, once you strip away the stylistic embellishments (which is an issue of font, and ought not to come into play with a character encoding), basically the same shape as a hyphen. You cannot seriously be telling me that we should treat the two as the same thing.

Basically, there is no sane way to avoid detaching the character encoding from the physical appearance of the character.  It simply makes no sense to have a different character for every variation of glyph across a set of fonts.  You *have* to work on a more abstract level, at the level of the *logical* identity of the character, not its specific physical appearance per font.

But that *inevitably* means you'll end up with multiple distinct characters that happen to share the same glyph (again, modulo which font the user selected for displaying the text).  See the Cyrillic examples above.  There are many other examples of logically-distinct characters from different languages that happen to share the same glyph shape with some English letter in some cases, which you cannot possibly conflate without ending up with nonsensical results.  You cannot eliminate dependence on the specific font if you insist on identifying characters by shape.  The only sane solution is to work on the abstract level, where the same logical character (e.g., Cyrillic letter N) can have multiple different glyphs depending on the font (in cursive, for example, capital И looks like English U).

But once you work at the abstract level, you cannot avoid some logically-distinct letters coinciding in glyph shape (e.g., English lowercase u vs. Cyrillic и).  And once you start on that slippery slope, you're not very far from descending into the "chaos" of the current Unicode standard -- because inevitably you'll have to make distinctions like "lowercase Greek mu as used in mathematics" vs. "lowercase Greek mu as used by Greeks to write their language" -- because although historically the two were identical, over time their usage has diverged and now there exists some contexts where you have to differentiate between the two.

The fact of the matter is that human language is inherently complex (not to mention *changes over time* -- something many people don't consider), and no amount of cleverness is going to surmount that without producing an inherently-complex solution.


T

-- 
Why ask rhetorical questions? -- JC
August 15, 2019
On Thu, Aug 15, 2019 at 11:38:08PM +0200, ag0aep6g via Digitalmars-d wrote:
> On 15.08.19 21:54, Walter Bright wrote:
> > Unicode also fouled up by adding semantic information that is invisible to the rendering. It should have stuck with a Unicode<=>print round-trip that doesn't lose information.
> > 
> > Naturally, people have already used such to trick people, track people, etc.
> 
> 'I' and 'l' are (virtually) identical in many fonts.

And 0 and O are also identical in many fonts.  But none of us would seriously entertain the idea that O and 0 ought to be the same character.


T

-- 
Indifference will certainly be the downfall of mankind, but who cares? -- Miquel van Smoorenburg
August 15, 2019
On Thu, Aug 15, 2019 at 02:42:50PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/15/2019 2:26 PM, a11e99z wrote:
[...]
> > If that wasn't sarcasm:
> > different code points can refer to the same glyph, but not vice versa:
> > A(EN,\u0041), A(RU,\u0410), A(EL,\u0391)
> > otherwise sorting for non-English text will not work.
> > 
> > Even the ordering (A<B) would be wrong. For example, the Russian glyphs
> > ABCEHKMOPTXacepuxy
> > correspond to the following English letters by sound or meaning:
> > AVSENKMORTHaserihu
> > As you can see, the uppercase and lowercase forms don't even exist as
> > pairs, and they have different meanings.
> 
> Yes, I've heard this argument before.
> 
> The answer is that language should not be embedded in Unicode; it will lead to nothing but problems. Language is something assigned externally to a block of text, not embedded in the text itself, just as in printed text.
[...]

You cannot avoid conveying language in a string. Certain characters only exist in certain languages, and the existence of the character itself already encodes language. But that's a peripheral issue.

The more pertinent point is that *different* languages may reuse the *same* glyphs for different (often completely unrelated) purposes. And because of these different purposes, it changes the way the *same* glyph is printed / laid out, and may affect other things in the surrounding context as well.

Put it this way: you agree that the encoding of a character ought not to change depending on font, right?

If so, consider your proposal to identify characters by glyph shape. A letter with the shape 'u', by your argument, ought to be represented by one, and only one, Unicode code point -- because, after all, it has the same glyph shape.  Correct?

If so, now you have a problem: the shape 'u' in Cyrillic is the cursive lowercase form of и.  So now you're essentially saying that all occurrences of 'u' in Cyrillic text must be substituted with и when you change the font from cursive to non-cursive.  Which is a contradiction of the initial axiom that character encoding should not be font-dependent.

Please explain how you solve this problem.


T

-- 
Real men don't take backups. They put their source on a public FTP-server and let the world mirror it. -- Linus Torvalds
August 15, 2019
On 8/15/2019 2:38 PM, ag0aep6g wrote:
> On 15.08.19 21:54, Walter Bright wrote:
>> Unicode also fouled up by adding semantic information that is invisible to the rendering. It should have stuck with a Unicode<=>print round-trip that doesn't lose information.
>>
>> Naturally, people have already used such to trick people, track people, etc.
> 
> 'I' and 'l' are (virtually) identical in many fonts.

That's a problem with some fonts, not with the concept. When such fonts are used, the distinction comes from the context, not from the symbol itself.

On the other hand, the Unicode spec itself routinely shows identical glyphs for different code points.

Consider also:

   (800)555-1212

You know it's a phone number because of the context. The digits used in it are NOT actually numbers; they do not have any mathematical properties. Should Unicode have separate code points for them?

The point is, the meaning of a symbol comes from its context, not from the symbol itself. This is the fundamental error Unicode made.
August 15, 2019
On 8/15/2019 3:04 PM, H. S. Teoh wrote:
> [...]

And yet somehow people manage to read printed material without all these problems.
August 15, 2019
On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
> Basically, there is no sane way to avoid detaching the character encoding from the physical appearance of the character.  It simply makes no sense to have a different character for every variation of glyph across a set of fonts.  You *have* to work on a more abstract level, at the level of the *logical* identity of the character, not its specific physical appearance per font.

OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same, and should appear exactly the same (the latter doesn't always happen, because of font rendering deficiencies). E.g. the word "schön" can be encoded in two different ways while using only code points intended for German, so you can end up with the situation that "schön" != "schön". This is unnecessary duplication.
August 15, 2019
On Thu, Aug 15, 2019 at 03:21:32PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/15/2019 2:38 PM, ag0aep6g wrote:
> > On 15.08.19 21:54, Walter Bright wrote:
> > > Unicode also fouled up by adding semantic information that is invisible to the rendering. It should have stuck with a Unicode<=>print round-trip that doesn't lose information.
> > > 
> > > Naturally, people have already used such to trick people, track people, etc.
> > 
> > 'I' and 'l' are (virtually) identical in many fonts.
> 
> That's a problem with some fonts, not with the concept. When such fonts are used, the distinction comes from the context, not from the symbol itself.
[...]

And there you go: you're basically saying that "symbol" is different from "glyph", and therefore, you're contradicting your own axiom that character == glyph.  "Symbol" is basically an abstract notion of a character that exists *apart from the glyph used to render it*.

And now that you agree that character encoding should be based on "symbol" rather than "glyph", the next step is the realization that, in the wide world of international languages out there, there exist multiple "symbols" that are rendered with the *same* glyph.  This is a hard fact of reality, and no matter how you wish it to be otherwise, it simply ain't so.  Your ideal of "character == glyph" simply doesn't work in real life.


T

-- 
There's light at the end of the tunnel. It's the oncoming train.
August 15, 2019
On Thu, Aug 15, 2019 at 10:37:57PM +0000, Gregor Mückl via Digitalmars-d wrote:
> On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
> > Basically, there is no sane way to avoid detaching the character encoding from the physical appearance of the character.  It simply makes no sense to have a different character for every variation of glyph across a set of fonts.  You *have* to work on a more abstract level, at the level of the *logical* identity of the character, not its specific physical appearance per font.
> 
> OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same, and should appear exactly the same (the latter doesn't always happen, because of font rendering deficiencies). E.g. the word "schön" can be encoded in two different ways while using only code points intended for German, so you can end up with the situation that "schön" != "schön". This is unnecessary duplication.

Well, yes, that part I agree with.  Unicode does have some dark corners like that.[*]  But I was just pointing out that Walter's ideal of 1 character per glyph is fallacious.

[*] And some worse-than-dark-corners, like the whole codepage dedicated to emoji *and* combining marks for said emoji that change their *appearance* -- something that ought not to have any place in a character encoding scheme!  Talk about scope creep...
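A tiny D sketch of one such invisible, appearance-changing code point (a variation selector here; the emoji skin-tone modifiers are in the same spirit):

   import std.stdio : writefln;

   void main()
   {
       string text  = "\u263A";        // WHITE SMILING FACE, text presentation
       string emoji = "\u263A\uFE0F";  // same face + VARIATION SELECTOR-16,
                                       // which asks for emoji-style rendering
       writefln("equal? %s", text == emoji);  // false, yet both are "the same" face
       foreach (dchar c; emoji)
           writefln("U+%04X", cast(uint) c);  // U+263A, U+FE0F (the second is invisible)
   }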


T

-- 
MS Windows: 64-bit rehash of 32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit operating system originally coded for a 4-bit microprocessor, written by a 2-bit company that can't stand 1-bit of competition.