August 16, 2019
On Friday, 16 August 2019 at 20:14:33 UTC, Walter Bright wrote:
> To repeat an example:
>
>     a + b = c
>
> Why not have special Unicode code points for when letters are used as mathematical symbols?

Uhm, well, the Unicode block "Mathematical Alphanumeric Symbols" already exists and is basically that.
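For instance, a minimal sketch in D (my own illustration; the code point is taken from the Unicode charts):

-------------------------
// The ASCII letter 'a' versus MATHEMATICAL ITALIC SMALL A (U+1D44E) from
// the Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF).
import std.stdio : writefln;

void main()
{
    dchar plain = 'a';            // U+0061 LATIN SMALL LETTER A
    dchar math  = '\U0001D44E';   // U+1D44E MATHEMATICAL ITALIC SMALL A
    writefln("U+%04X  %s", cast(uint) plain, plain);
    writefln("U+%05X  %s", cast(uint) math, math);
}
-------------------------

Both print as the letter a (one upright, one italic), yet they are distinct code points.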
August 16, 2019
On Fri, Aug 16, 2019 at 01:18:54PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/16/2019 10:52 AM, H. S. Teoh wrote:
> > So in other words, we should encode 1, I, |, and l with exactly the same value, because in print, they aII look about the same anyway, and the user is well able to figure out from context which one is meant. After a11, once you print the string the semantic distinction is gone anyway, and human beings are very good at te||ing what was actually intended in spite of the ambiguity.
> > 
> > Bye-bye unambiguous D lexer, we hardly knew you; now we need to rewrite you with a context-sensitive algorithm that figures out whether we meant 11, ||, II, or ll in our source code encoded in Walter Encoding.
> 
> Fonts people use for programming take pains to distinguish them.

So you're saying that what constitutes a "character" should be determined by fonts??


T

-- 
Programming is not just an act of telling a computer what to do: it is also an act of telling other programmers what you wished the computer to do. Both are important, and the latter deserves care. -- Andrew Morton
August 16, 2019
On 8/16/2019 2:27 AM, Patrick Schluter wrote:
> While the results are far from perfect, they would be absolutely impossible if we used what you propose here.

Google translate can (and does) figure it out from the context, just like a human reader would.

Sentences written in mixed languages *are* written for human consumption. I have many books written that way. They are quite readable, and don't have any need to clue in the reader "the next word is in French/Latin/Greek/German".

And frankly, if data processing software is totally reliant on using the correct language-specific glyph, it will fail, because people will not type in the correct one, and visually they cannot proof it for correctness. Anything that does OCR is going to completely fail at this.

Robust data processing software is going to be forced to accept and allow for multiple encodings of the same glyph, pretty much rendering the semantic difference meaningless.

I bet in 10 or 20 years of being clobbered by experience you'll reluctantly agree with me that assigning semantics to individual code points was a mistake. :-)

BTW, I was a winner in the 1986 Obfuscated C Code Contest with:

-------------------------
#include <stdio.h>
#define O1O printf
#define OlO putchar
#define O10 exit
#define Ol0 strlen
#define QLQ fopen
#define OlQ fgetc
#define O1Q abs
#define QO0 for
typedef char lOL;

lOL*QI[] = {"Use:\012\011dump file\012","Unable to open file '\x25s'\012",
 "\012","   ",""};

main(I,Il)
lOL*Il[];
{	FILE *L;
	unsigned lO;
	int Q,OL[' '^'0'],llO = EOF,

	O=1,l=0,lll=O+O+O+l,OQ=056;
	lOL*llL="%2x ";
	(I != 1<<1&&(O1O(QI[0]),O10(1011-1010))),
	((L = QLQ(Il[O],"r"))==0&&(O1O(QI[O],Il[O]),O10(O)));
	lO = I-(O<<l<<O);
	while (L-l,1)
	{	QO0(Q = 0L;((Q &~(0x10-O))== l);
			OL[Q++] = OlQ(L));
		if (OL[0]==llO) break;
		O1O("\0454x: ",lO);
		if (I == (1<<1))
		{	QO0(Q=Ol0(QI[O<<O<<1]);Q<Ol0(QI[0]);
			Q++)O1O((OL[Q]!=llO)?llL:QI[lll],OL[Q]);/*"
			O10(QI[1O])*/
			O1O(QI[lll]);{}
		}
		QO0 (Q=0L;Q<1<<1<<1<<1<<1;Q+=Q<0100)
		{	(OL[Q]!=llO)? /* 0010 10lOQ 000LQL */
			((D(OL[Q])==0&&(*(OL+O1Q(Q-l))=OQ)),
			OlO(OL[Q])):
			OlO(1<<(1<<1<<1)<<1);
		}
		O1O(QI[01^10^9]);
		lO+=Q+0+l;}
	}
	D(l) { return l>=' '&&l<='\~';
}
-------------------------

http://www.formation.jussieu.fr/ars/2000-2001/C/cours/COMPLEMENTS/DOC/www.ioccc.org/years.html#1986_bright

I am indeed aware of the problems with confusing O0l1|. D does take steps to be more tolerant of bad fonts, such as 10l being allowed in C, but not D. I seriously considered banning the identifiers l and O. Perhaps I should have. | is not a problem because the grammar (i.e. the context) detects errors with it.
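For example, a minimal sketch (illustration only, assuming current compiler behavior):

-------------------------
// D rejects the lowercase integer suffix because 'l' is easily mistaken
// for the digit '1'.
void main()
{
    // long a = 10l;  // error in D; C accepts it
    long b = 10L;     // fine: the uppercase suffix is unambiguous
}
-------------------------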
August 16, 2019
On 8/16/2019 9:32 AM, xenon325 wrote:
> On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright wrote:
>> And yet somehow people manage to read printed material without all these problems.
> 
> If same glyphs had same codes, what will you do with these:
> 
> 1) Sort string.
> 
> In my phone's contact lists there are entries in russian, in english and mixed.
> Now they are sorted as:
> A (latin), B (latin), C, А (ru), Б, В (ru).
> Which is pretty easy to search/navigate.

Except that there's no guarantee that whoever entered the data used the right code point.

The pragmatic solution, again, is to use context. I.e. if a glyph is surrounded by Russian characters, it's likely a Russian glyph. If it is surrounded by characters that form a common Russian word, it's likely a Russian glyph.

Of course it isn't perfect, but I bet using context will work better than expecting the code points to have been entered correctly.
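Roughly something like this (a sketch only, not a worked-out design; it checks just the basic Cyrillic block):

-------------------------
// Decide whether an ambiguous glyph such as 'B'/'В' should be treated as
// Cyrillic by looking at its neighbours. Only U+0400..U+04FF is checked;
// a real implementation would need full script data and word lists.
bool isBasicCyrillic(dchar c)
{
    return c >= '\u0400' && c <= '\u04FF';
}

bool ambiguousGlyphLooksCyrillic(dstring text, size_t i)
{
    size_t hits, total;
    // Look at up to three characters on each side of position i.
    foreach (j; (i < 3 ? 0 : i - 3) .. (i + 4 > text.length ? text.length : i + 4))
    {
        if (j == i)
            continue;
        total++;
        if (isBasicCyrillic(text[j]))
            hits++;
    }
    return total > 0 && hits * 2 > total; // majority of neighbours are Cyrillic
}
-------------------------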

I note that you had to tag В with (ru), because otherwise no human reader or OCR would know what it was. This is exactly the problem I'm talking about.

Writing software that relies on invisible semantic information is never going to work.
August 16, 2019
On Fri, Aug 16, 2019 at 01:44:20PM -0700, Walter Bright via Digitalmars-d wrote: [...]
> Google translate can (and does) figure it out from the context, just
> like a human reader would.

Ha!  Actually, IME, randomly substituting lookalike characters from other languages in the input to Google Translate often transmutes the result from passably-understandable to outright hilarious (and ridiculous).  Or the poor befuddled software just gives up and spits the input back at you verbatim.


[...]
> And frankly, if data processing software is totally reliant on using the correct language-specific glyph, it will fail, because people will not type in the correct one, and visually they cannot proof it for correctness.  Anything that does OCR is going to completely fail at this.
> 
> Robust data processing software is going to be forced to accept and allow for multiple encodings of the same glyph, pretty much rendering the semantic difference meaningless.

It's not a hard problem. You just need a preprocessing stage to normalize such stray glyphs into the correct language-specific code points, and all subsequent stages in your software pipeline will Just Work(tm). Think of it as a rudimentary "OCR" stage to sanitize your inputs.
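A minimal sketch of such a pass (hypothetical, partial mapping; the names are my own):

-------------------------
// Fold Latin lookalikes into their Cyrillic counterparts once a string has
// been classified as Russian. Only a handful of letters are shown.
dchar foldLatinLookalikeToCyrillic(dchar c)
{
    switch (c)
    {
        case 'A': return 'А'; // U+0410
        case 'B': return 'В'; // U+0412
        case 'C': return 'С'; // U+0421
        case 'E': return 'Е'; // U+0415
        case 'O': return 'О'; // U+041E
        case 'P': return 'Р'; // U+0420
        default:  return c;   // everything else passes through unchanged
    }
}
-------------------------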

This option would be unavailable if you used an encoding scheme that *cannot* encode language as part of the string.


> I bet in 10 or 20 years of being clobbered by experience you'll reluctantly agree with me that assigning semantics to individual code points was a mistake. :-)

That remains to be seen. :-)


> BTW, I was a winner in the 1986 Obfuscated C Code Contest with:
[...]
> I am indeed aware of the problems with confusing O0l1|. D does take steps to be more tolerant of bad fonts, such as 10l being allowed in C, but not D. I seriously considered banning the identifiers l and O. Perhaps I should have.  | is not a problem because the grammar (i.e. the context) detects errors with it.

I also won an IOCCC award once, albeit anonymously (see 2005/anon)...
though it had nothing to do with lookalike characters, but more to do
with what I call M.A.S.S. (Memory Allocated by Stack-Smashing), in which
the program does not declare any variables (besides the two parameters
to main()) nor calls any memory allocation functions, but happily
manipulates arrays of data. :-D


T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
August 16, 2019
On Friday, 16 August 2019 at 16:32:05 UTC, xenon325 wrote:
> If same glyphs had same codes, what will you do with these:
> ...
> 2) Convert cases:
> - in english: 'B'.toLower == 'b'
> - in russian: 'В'.toLower == 'в'

FWIW, we have that problem today with Unicode and the letter i:
https://en.wikipedia.org/wiki/Dotted_and_dotless_I#In_computing
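For example (my illustration; D's std.uni.toLower uses the default, locale-independent mapping):

-------------------------
// Unicode's default case mapping lowercases 'I' (U+0049) to 'i' (U+0069),
// which is wrong for Turkish, where the expected result is dotless 'ı'
// (U+0131).
import std.uni : toLower;

void main()
{
    assert(toLower('I') == 'i');      // fine for English
    assert(toLower('I') != '\u0131'); // but Turkish expects U+0131 here
}
-------------------------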
August 16, 2019
On 8/16/2019 1:26 PM, lithium iodate wrote:
> On Friday, 16 August 2019 at 20:14:33 UTC, Walter Bright wrote:
>> To repeat an example:
>>
>>     a + b = c
>>
>> Why not have special Unicode code points for when letters are used as mathematical symbols?
> 
> Uhm, well, the Unicode block "Mathematical Alphanumeric Symbols" already exists and is basically that.

ye gawds:

 https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols

I see they forgot the phone number code points.
August 17, 2019
On Friday, 16 August 2019 at 21:05:44 UTC, Walter Bright wrote:
> On 8/16/2019 9:32 AM, xenon325 wrote:
>> On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright wrote:
>>> And yet somehow people manage to read printed material without all these problems.
>> 
>> If same glyphs had same codes, what will you do with these:
>> 
>> 1) Sort string.
>> 
>> In my phone's contact lists there are entries in russian, in english and mixed.
>> Now they are sorted as:
>> A (latin), B (latin), C, А (ru), Б, В (ru).
>> Which is pretty easy to search/navigate.
>
> Except that there's no guarantee that whoever entered the data used the right code point.

Depends. On smartphones, switching the keyboard language is easy (just a swipe on Android), so users that are regularly multilingual should be fine there. Windows also offers keyboard layout switching on the fly with an awkward keyboard shortcut, but it is pretty well hidden. So again, users that are multilingual in their daily routines should really be fine.

But taking a step back and trying to take a bird's-eye view of this discussion, it becomes clear to me that the argument could be solved if there were a clear separation of text representations for processing (sorting, spell checking, whatever other NLP you can think of) and a completely separate one for display. The transformation to the latter would naturally be lossy and not perfectly reversible. The funny thing about that part is that text rendering with OpenType fonts is *already* doing exactly this transformation to derive the font-specific glyph indices from the text. But all the bells and whistles in Unicode blur this boundary way too much. And this is what we are getting hung up over, I think.

Man, we really managed to go off track in this thread, didn't we? ;)
August 17, 2019
On Friday, 16 August 2019 at 21:05:44 UTC, Walter Bright wrote:
> On 8/16/2019 9:32 AM, xenon325 wrote:
>> On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright wrote:
>>> And yet somehow people manage to read printed material without all these problems.
>> 
>> If same glyphs had same codes, what will you do with these:
>> 
>> 1) Sort string.
>> 
>> In my phone's contact lists there are entries in russian, in english and mixed.
>> Now they are sorted as:
>> A (latin), B (latin), C, А (ru), Б, В (ru).
>> Which is pretty easy to search/navigate.
>
> Except that there's no guarantee that whoever entered the data used the right code point.

From my experience, that was an issue we encountered often before Unicode: the uppercase letters in Greek texts were mixes of ASCII (A, 0x41) and Greek (Α, 0xC1 in CP-1253). It was so bad that the Greek translation department didn't use Euramis for a significant amount of time. It was only when we got completely rid of this crap (and also of the RTF file format) and embraced Unicode that we got rid of this issue of misused encodings.
While I get that Unicode is (over-)complicated and in some aspects silly, it nonetheless has two essential virtues that no other encoding scheme was ever able to achieve:
- it is a norm that is widely used, almost universal.
- it is a norm that is widely used, almost universal.

Yeah, I'm lame, I repeated it twice :-)

The fact that it is widely adopted even in the Far East makes it really something essential. Could they have defined things differently or more simply? Maybe, but I doubt it, as the complexity of Unicode comes from the complexity of languages themselves.


>
> The pragmatic solution, again, is to use context. I.e. if a glyph is surrounded by Russian characters, it's likely a Russian glyph. If it is surrounded by characters that form a common Russian word, it's likely a Russian glyph.

No, that doesn't work for mixed-language documents; we've been there, we had that, and it sucks. UTF was such a relief.
Here's a little example from our configuration: the regular expression used to detect a document reference in a text as a replaceable:

0:UN:EC_N:((№|č.|nr.|št.|αριθ.|No|nr|N:o|Uimh.|br.|n.|Nr.|Nru|[Nn][º°o]|[Nn].[º°o])[  ][0-9]+/[0-9]+/(EC|ES|EF|EG|EK|EΚ|CE|EÜ|EY|CE|EZ|EB|KE|WE))

What is the context here? By the way, the EC is Cyrillic and the first EK is Greek,

and their substitution expressions:
T:BG:EC_N:№\2/ЕС
T:CS:EC_N:č.\2/ES
T:DA:EC_N:nr.\2/EF
T:DE:EC_N:Nr.\2/EG
T:EL:EC_N:αριθ.\2/EΚ
T:EN:EC_N:No\2/EC
T:ES:EC_N:nº\2/CE
T:ET:EC_N:nr\2/EÜ
T:FI:EC_N:N:o\2/EY
T:FR:EC_N:nº\2/CE
T:GA:EC_N:Uimh.\2/CE
T:HR:EC_N:br.\2/EZ
T:IT:EC_N:n.\2/CE
T:LT:EC_N:Nr.\2/EB
T:LV:EC_N:Nr.\2/EK
T:MT:EC_N:Nru\2/KE
T:NL:EC_N:nr.\2/EG
T:PL:EC_N:nr\2/WE
T:PT:EC_N:n.º\2/CE
T:RO:EC_N:nr.\2/CE
T:SK:EC_N:č.\2/ES
T:SL:EC_N:št.\2/ES
T:SV:EC_N:nr\2/EG

And as said before, such a number can appear in a citation in the language of the citation, not in the language of the document.
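To make it concrete, here is a much simplified sketch (my own, nothing like the real Euramis code, with only a few language variants and plain spaces) of how such a detection rule might be applied:

-------------------------
import std.regex : matchFirst, regex;
import std.stdio : writeln;

void main()
{
    // A cut-down version of the detection pattern above.
    auto docRef = regex(r"(№|No|Nr\.|nr\.)\s*([0-9]+/[0-9]+)/(EC|CE|ES|EG)");
    auto m = matchFirst("... Regulation No 1234/2013/EC applies ...", docRef);
    if (!m.empty)
        writeln("replaceable reference: ", m[2]); // prints "1234/2013"
}
-------------------------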

>
> Of course it isn't perfect, but I bet using context will work better than expecting the code points to have been entered correctly.
>
> I note that you had to tag В with (ru), because otherwise no human reader or OCR would know what it was. This is exactly the problem I'm talking about.

Yeah, but what you propose makes it even worse, not better.

>
> Writing software that relies on invisible semantic information is never going to work.

Invisible to your eyes, but not invisible to the machines; that's the whole point. Why do we need to annotate all the functions in D with these annoying attributes if the compiler could detect them automagically via context? Because in general it can't; the semantic information must be provided somehow.



