Updating D beyond Unicode 2.0 (page 6) - D Programming Language Discussion Forum

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by Steven Schveighoffer
in reply to Jonathan M Davis

On 9/22/18 8:58 AM, Jonathan M Davis wrote:
> On Saturday, September 22, 2018 6:37:09 AM MDT Steven Schveighoffer via
> Digitalmars-d wrote:
>> On 9/22/18 4:52 AM, Jonathan M Davis wrote:
>>>> I was laughing out loud when reading about composing "family"
>>>> emojis with zero-width joiners. If you told me that was a tech
>>>> parody, I'd have believed it.
>>>
>>> Honestly, I was horrified to find out that emojis were even in Unicode.
>>> It makes no sense whatsoever. Emojis are supposed to be sequences of
>>> characters that can be interpreted as images. Treating them like
>>> Unicode symbols is like treating entire words like Unicode symbols.
>>> It's just plain stupid and a clear sign that Unicode has gone
>>> completely off the rails (if it was ever on them). Unfortunately, it's
>>> the best tool that we have for the job.
>> But aren't some (many?) Chinese/Japanese characters representing whole
>> words?
> 
> It's true that they're not characters in the sense that Roman characters are
> characters, but they're still part of the alphabets for those languages.
> Emojis are specifically formed from sequences of characters - e.g. :) is two
> characters which are already expressible on their own. They're meant to
> represent a smiley face, but it's a sequence of characters already. There's
> no need whatsoever to represent anything extra in Unicode. It's already enough
> of a disaster that there are multiple ways to represent the same character
> in Unicode without nonsense like emojis. It's stuff like this that really
> makes me wish that we could come up with a new standard that would replace
> Unicode, but that's likely a pipe dream at this point.

But there are tons of emojis that have nothing to do with sequences of characters. Like houses, or planes, or whatever. I don't even know what the sequences of characters are for them.

I think it started out like that, but turned into something else.

Either way, I can't imagine any benefit from using emojis in symbol names.

-Steve

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by Steven Schveighoffer
in reply to Neia Neutuladh

On 9/24/18 12:23 AM, Neia Neutuladh wrote:
> On Monday, 24 September 2018 at 01:39:43 UTC, Walter Bright wrote:
>> On 9/23/2018 3:23 PM, Neia Neutuladh wrote:
>>> Okay, that's why you previously selected C99 as the standard for what characters to allow. Do you want to update to match C11? It's been out for the better part of a decade, after all.
>>
>> I wasn't aware it changed in C11.
> 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page 522 (PDF numbering) or 504 (internal numbering).
> 
> Outside the BMP, almost everything is allowed, including many things that are not currently mapped to any Unicode value. Within the BMP, a heck of a lot of stuff is allowed, including a lot that D doesn't currently allow.
> 
> GCC hasn't even updated to the C99 standard here, as far as I can tell, but clang-5.0 is up to date.

I searched around for the current state of symbol names in C and found some really crappy rules, though maybe this site isn't up to date:

https://en.cppreference.com/w/c/language/identifier

What I understand from that is:

1. Yes, you can use any Unicode character you want in C/C++ (seemingly since C99).
2. There are no rules about what *encoding* is acceptable; it's implementation-defined. So various compilers have different rules as to what will be accepted in the actual source code. In fact, I read somewhere that not even ASCII is guaranteed to be supported.

The result is that you have to write the identifiers with ASCII escape sequences for them to be actually portable, which to me completely defeats the purpose of using such identifiers in the first place.

For example, on that page, they have a line that works in clang, not in GCC (tagged as implementation defined):

char *🐱 = "cat";

The portable version looks like this:

char *\U0001f431 = "cat";

Seriously, who wants to use that?

Now, D can potentially do better (especially when all front-ends are the same) and support such things in the spec, but I think the argument "because C supports it" is kind of bunk.

Or am I reading it wrong?

In any case, I would expect that symbol name support should be focused only on languages which people use, not emojis. If there are words in Chinese or Japanese that can't be expressed using D, while other words can, it would seem inconsistent to a Chinese or Japanese speaking user, and I think we should work to fix that. I just have no idea what the state of that is.

I also tend to agree that most code is going to be written in English, even when the primary language of the user is not. Part of the reason, which I haven't read here yet, is that all the keywords are in English. Someone has to kind of understand those to get the meaning of some constructs, and it's going to read strangely with the non-English words.

One group which I believe hasn't spoken up yet is the group making the hunt framework, who I believe are all Chinese? At least their web site is. It would be good to hear from a group like that, which has a lot of experience writing mature D code (it appears all to be in English), about how they feel about the support.

-Steve

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by Steven Schveighoffer
in reply to Neia Neutuladh

On 9/22/18 12:56 PM, Neia Neutuladh wrote:
> On Saturday, 22 September 2018 at 12:35:27 UTC, Steven Schveighoffer wrote:
>> But aren't we arguing about the wrong thing here? D already accepts non-ASCII identifiers.
> 
> Walter was doing that thing that people in the US who only speak English tend to do: forgetting that other people speak other languages, and that people who speak English can learn other languages to work with people who don't speak English.

I don't think he was doing that. I think what he was saying was, D tried to accommodate users who don't normally speak English, and they still use English (for the most part) for coding.

I'm actually surprised there isn't much code out there written with non-ASCII identifiers, given that C99 supported them. I had assumed that was because they weren't supported. Now I learn that they are supported, yet almost all C code I've ever seen is written in English. Perhaps that's just because I don't frequent foreign-language sites though :) But many people here speak English as a second language, and vouch for their cultures still using English to write code.

> He was saying it's inevitably a mistake to use non-ASCII characters in identifiers and that nobody does use them in practice.

I would expect people probably do try to use them in practice, it's just that the problems they run into aren't worth the effort (tool/environment support). But I have no first or even second hand experience with this. It does seem like Walter has a lot of experience with it though.

> Walter talking like that sounds like he'd like to remove support for non-ASCII identifiers from the language. I've gotten by without maintaining a set of personal patches on top of DMD so far, and I'd like it if I didn't have to start.

I don't think he was saying that. I think he was against expanding support for further Unicode identifiers because the first effort did not produce any measurable benefit. Given the recent positions of Walter and Andrei, I'd be shocked if they decided to remove non-ASCII identifiers that are currently supported, thereby breaking existing code.

>> What languages need an upgrade to unicode symbol names? In other words, what symbols aren't possible with the current support?
> 
> Chinese and Japanese have gained about eleven thousand symbols since Unicode 2.
> 
> Unicode 2 covers 25 writing systems, while Unicode 11 covers 146. Just updating to Unicode 3 would give us Cherokee, Ge'ez (multiple languages), Khmer (Cambodian), Mongolian, Burmese, Sinhala (Sri Lanka), Thaana (Maldivian), Canadian aboriginal syllabics, and Yi (Nuosu).

Very interesting! I would agree that we should at least add support for Unicode symbols that are used in spoken languages, especially since we already support symbols that aren't ASCII. I don't see the downside, given that you can already use Unicode 2.0 symbols for identifiers (that ship has already sailed).

It could be a good incentive to get kids in countries where English isn't commonly spoken to try D out as a first programming language ;) Using your native language to show example code could be a huge benefit for teaching coding.

My recommendation is to put the PR up for review (the one you said you had ready) and see what happens. Having an actual patch to talk about could change minds. At the very least, it's worth not wasting the effort you have already spent. Even if it does need a DIP, the PR can show that one less piece of effort is needed to get it implemented.

-Steve

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by Adam D. Ruppe
in reply to Steven Schveighoffer

On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:
> Part of the reason, which I haven't read here yet, is that all the keywords are in English.

Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)

> One group which I believe hasn't spoken up yet is the group making the hunt framework, whom I believe are all Chinese? At least their web site is.

I know they used a lot of my code as a starting point, and I, of course, wrote it in English, so that could have biased it a bit too. Though that might be a general point: you want to use existing libraries, and they are already written in some language.

Even so, I still find it kinda hard to believe that everybody everywhere uses only English in all their code. Maybe our efforts should be going toward the Chinese market via natural language support instead of competing with Rust on computer language features :P

> It would be good to hear from a group like that which has large experience writing mature D code (it appears all to be in English) and how they feel about the support.

Definitely.

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by Adam D. Ruppe
in reply to Jonathan M Davis

On Monday, 24 September 2018 at 10:36:50 UTC, Jonathan M Davis wrote:
> Given that the typical keyboard has none of those characters, maintaining code that used any of them would be a royal pain.

It is pretty easy to type them with a little keyboard config change, and vim, for example, can even pick those settings up from comments in the file. You do have to train your fingers to use it effectively... but if you were maintaining something long term, you'd just do that.

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by Steven Schveighoffer
in reply to Adam D. Ruppe

On 9/24/18 10:14 AM, Adam D. Ruppe wrote:
> On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:
>> Part of the reason, which I haven't read here yet, is that all the keywords are in English.
> 
> Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)

Well, even on top of that, the standard library is full of English words that read very coherently when used together (if you understand English).

I can't imagine a long chain of English algorithms with some Chinese one pasted in the middle looks very good :) I suppose you could alias them all...

-Steve

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by 0xEAB
in reply to Walter Bright

On Sunday, 23 September 2018 at 20:49:39 UTC, Walter Bright wrote:
> There's a reason why dmd doesn't have international error messages. My experience with it is that international users don't want it. They prefer the english messages.

I'm a native German speaker.
For my part, I do agree with this.


There are several reasons for this:
- Usually such translations are terrible, simply put.
- Disjointed translations [0]
- Non-idiomatic sentences that still sound like English somehow.
- Translations of tech terms [1]
- Non-idiomatic translations of tech terms [2]

However, well-done translations might be quite nice at the beginning when learning programming. Back then, when I was coding C# in VS 2010, I was happy with the German error messages. I'm not sure whether it was just my imagination, but I think it got worse with some later version.




[0] There's nothing worse than every single sentence being treated on its own during the translation process. At least that's what you'd often think when you face a longer error message. Usually you're confronted with non-linked and kindergarten-like sentences that don't seem to be meant to be put together. Often you'd think there were several translators. Favorite problem with this: 2 different terms for the same thing in two sentences.

[1] e.g. "integer type" -> "ganzzahliger Datentyp"
This just sounds weird. Anyone using "int" in their code knows what it means anyway...
Nevertheless, there are some common translations that are fine (primarily because they're common), e.g. "error" -> "Fehler"

[2] e.g. "assertion" -> "Assertionsfehler"
This particular one can be found in Windows 10 and is not even proper German.

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by 0xEAB
in reply to 0xEAB

On Monday, 24 September 2018 at 15:17:14 UTC, 0xEAB wrote:
> Back then, when I was coding C# in VS 2010 I was happy with the German error messages.

addendum: I've been using the English version since VS2017

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by Martin Tschierschke
in reply to Steven Schveighoffer

On Monday, 24 September 2018 at 14:34:21 UTC, Steven Schveighoffer wrote:
> On 9/24/18 10:14 AM, Adam D. Ruppe wrote:
>> On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:
>>> Part of the reason, which I haven't read here yet, is that all the keywords are in English.
>> 
>> Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
>
> Well, even on top of that, the standard library is full of English words that read very coherently when used together (if you understand English).
>
> I can't imagine a long chain of English algorithms with some Chinese one pasted in the middle looks very good :) I suppose you could alias them all...
>
> -Steve

You might get really funny error messages.

🙂 can't be cast to int.

:-)

And if you have to increment the number of cars you can write: 🚗++; This might give really funny looking programs!

September 24, 2018

Re: Updating D beyond Unicode 2.0

Posted by Steven Schveighoffer
in reply to Martin Tschierschke

On 9/24/18 2:20 PM, Martin Tschierschke wrote:
> On Monday, 24 September 2018 at 14:34:21 UTC, Steven Schveighoffer wrote:
>> On 9/24/18 10:14 AM, Adam D. Ruppe wrote:
>>> On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:
>>>> Part of the reason, which I haven't read here yet, is that all the keywords are in English.
>>>
>>> Eh, those are kinda opaque sequences anyway, since the meanings aren't quite what the normal dictionary definition is anyway. Look up "int" in the dictionary... or "void", or even "string". They are just a handful of magic sequences we learn with the programming language. (And in languages like Rust, "fn", lol.)
>>
>> Well, even on top of that, the standard library is full of English words that read very coherently when used together (if you understand English).
>>
>> I can't imagine a long chain of English algorithms with some Chinese one pasted in the middle looks very good :) I suppose you could alias them all...
>>
> You might get really funny error messages.
> 
> 🙂 can't be cast to int.

Haha, it could be cynical as well

int can’t be cast to int🤔

Oh, the games we could play.

-Steve


Copyright © 1999-2021 by the D Language Foundation