September 22, 2018
On Saturday, September 22, 2018 6:37:09 AM MDT Steven Schveighoffer via Digitalmars-d wrote:
> On 9/22/18 4:52 AM, Jonathan M Davis wrote:
> >> I was laughing out loud when reading about composing "family" emojis with zero-width joiners. If you told me that was a tech parody, I'd have believed it.
> >
> > Honestly, I was horrified to find out that emojis were even in Unicode. It makes no sense whatsoever. Emojis are supposed to be sequences of characters that can be interpreted as images. Treating them like Unicode symbols is like treating entire words like Unicode symbols. It's just plain stupid and a clear sign that Unicode has gone completely off the rails (if it was ever on them). Unfortunately, it's the best tool that we have for the job.
> But aren't some (many?) Chinese/Japanese characters representing whole
> words?

It's true that they're not characters in the sense that Roman characters are characters, but they're still part of the alphabets for those languages. Emojis are specifically formed from sequences of characters - e.g. :) is two characters which are already expressible on their own. They're meant to represent a smiley face, but they're already a sequence of characters. There's no need whatsoever to represent anything extra in Unicode. It's already enough of a disaster that there are multiple ways to represent the same character in Unicode without nonsense like emojis. It's stuff like this that really makes me wish that we could come up with a new standard that would replace Unicode, but that's likely a pipe dream at this point.
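
To make that concrete, here's roughly what the "family" emoji and the duplicate encodings of é look like at the code point level (a minimal D sketch; the code points come from the Unicode tables, the rest is just illustration):

    import std.stdio;

    void main()
    {
        // The "family: man, woman, girl" emoji isn't a single character. It's
        // three emoji code points glued together with zero-width joiners.
        dstring family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"d;
        foreach (dchar c; family)
            writefln("U+%04X", cast(uint) c); // 1F468, 200D, 1F469, 200D, 1F467

        // And the "multiple ways to represent the same character" problem:
        // é can be the precomposed U+00E9 or 'e' plus combining U+0301.
        assert("\u00E9"d.length == 1);
        assert("e\u0301"d.length == 2);
    }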

- Jonathan M Davis



September 22, 2018
On Saturday, 22 September 2018 at 08:52:32 UTC, Jonathan M Davis wrote:
> Unicode identifiers may make sense in a code base that is going to be used solely by a group of developers who speak a particular language that uses a number of non-ASCII characters (especially languages like Chinese or Japanese), but it has no business in any code that's intended for international use. It just causes problems.

You have a problem when you need to share a codebase between two organizations using different languages. "Just use ASCII" is not the solution. "Use a language that most developers in both organizations can use" is. That's *usually* going to be English, but not always. For instance, a Belorussian company doing outsourcing work for a Russian company might reasonably write code in Russian.

If you're writing for a global audience, as most open source code is, you're usually going to use the most widely spoken language.
September 22, 2018
On Saturday, 22 September 2018 at 12:24:49 UTC, Shachar Shemesh wrote:
> If memory serves me right, hieroglyphs actually represent consonants (vowels are implicit), and as such, are most definitely "characters".

Egyptian hieroglyphic writing uses logographs (symbols representing whole words, which might be multiple syllables), letters, and determinatives (which don't represent any word themselves but disambiguate the surrounding words).

Looking things up serves me better than memory, usually.

> The only language I can think of, off the top of my head, where words have distinct signs is sign language.

Logographic writing systems. There is one logographic writing system still in common use, and it's the standard writing system for Chinese and Japanese. That's about 1.4 billion people. It was used in Korea until hangul became popularized.

Unicode also aims to support writing systems that aren't used anymore. That means Mayan, cuneiform (several variants), Egyptian hieroglyphics and demotic script, several extinct variants on the Chinese writing system, and Luwian.

Sign languages generally don't have writing systems. They're also not generally related to the ambient spoken languages (for instance, American Sign Language is derived from French Sign Language, not from English), so if you use a sign language and can write, you're bilingual. Anyway, without writing systems, sign languages are irrelevant to Unicode.
September 22, 2018
On Saturday, 22 September 2018 at 12:35:27 UTC, Steven Schveighoffer wrote:
> But aren't we arguing about the wrong thing here? D already accepts non-ASCII identifiers.

Walter was doing that thing that people in the US who only speak English tend to do: forgetting that other people speak other languages, and that people who speak English can learn other languages in order to work with people who don't speak English. He was saying that it's inevitably a mistake to use non-ASCII characters in identifiers and that nobody uses them in practice.

When Walter talks like that, it sounds like he'd like to remove support for non-ASCII identifiers from the language. I've gotten by without maintaining a set of personal patches on top of DMD so far, and I'd like it if I didn't have to start.
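
For anyone who hasn't tried it, something like this should compile with DMD as-is (a minimal sketch; the identifiers are just examples I made up, but both Cyrillic and Greek letters are in the ranges the compiler currently accepts):

    import std.stdio;

    void main()
    {
        // Non-ASCII identifiers that D already accepts today.
        int переменная = 42;   // Cyrillic
        double π = 3.14159;    // Greek
        writeln(переменная, " ", π);
    }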

> What languages need an upgrade to unicode symbol names? In other words, what symbols aren't possible with the current support?

Chinese and Japanese have gained about eleven thousand symbols since Unicode 2.

Unicode 2 covers 25 writing systems, while Unicode 11 covers 146. Just updating to Unicode 3 would give us Cherokee, Ge'ez (multiple languages), Khmer (Cambodian), Mongolian, Burmese, Sinhala (Sri Lanka), Thaana (Maldivian), Canadian aboriginal syllabics, and Yi (Nuosu).
September 22, 2018
On Saturday, 22 September 2018 at 16:56:10 UTC, Neia Neutuladh wrote:
>
> Walter was doing that thing that people in the US who only speak English tend to do: forgetting that other people speak other languages, and that people who speak English can learn other languages in order to work with people who don't speak English. He was saying that it's inevitably a mistake to use non-ASCII characters in identifiers and that nobody uses them in practice.
>

There's a more charitable view, which is that even furriners usually use English identifiers.

Nobody in this thread so far has said they are programming in non-ASCII.

If there were a contingent of Japanese or Chinese users doing that, then surely they would speak up here or in Bugzilla to advocate for this feature?
September 22, 2018
On Saturday, 22 September 2018 at 19:59:42 UTC, Erik van Velzen wrote:
> Nobody in this thread so far has said they are programming in non-ASCII.

I did. https://git.ikeran.org/dhasenan/muzikilo
September 23, 2018
On Saturday, 22 September 2018 at 19:59:42 UTC, Erik van Velzen wrote:
> Nobody in this thread so far has said they are programming in non-ASCII.

This is the obvious observation bias I alluded to before: of course people who don't read and write English aren't in this thread, since they cannot read or write the English used in this thread! Ditto for bugzilla.

Absence of evidence CAN be evidence of absence... but not when the absence is so easily explained by our shared bias.

Neia Neutuladh posted one link. I have seen Japanese D code before on twitter, but cannot find it now (surely because the search engines also share this bias). Perhaps those are the only two examples in existence, but I stand by my belief that we must reach out to these other communities somehow and do a proper, proactive study before dismissing the possibility.
September 22, 2018
On Saturday, September 22, 2018 10:07:38 AM MDT Neia Neutuladh via Digitalmars-d wrote:
> On Saturday, 22 September 2018 at 08:52:32 UTC, Jonathan M Davis
>
> wrote:
> > Unicode identifiers may make sense in a code base that is going to be used solely by a group of developers who speak a particular language that uses a number of non-ASCII characters (especially languages like Chinese or Japanese), but it has no business in any code that's intended for international use. It just causes problems.
>
> You have a problem when you need to share a codebase between two organizations using different languages. "Just use ASCII" is not the solution. "Use a language that most developers in both organizations can use" is. That's *usually* going to be English, but not always. For instance, a Belorussian company doing outsourcing work for a Russian company might reasonably write code in Russian.
>
> If you're writing for a global audience, as most open source code is, you're usually going to use the most widely spoken language.

My point is that if your code base is definitely only going to be used within a group of people whose keyboards support the Unicode characters you want to use, then using them isn't necessarily a problem. But if you're writing code that may be seen or used by a general audience (especially if it's going to be open source), then it needs to be in ASCII, or it's a serious problem. Even if it's a character like lambda that almost everyone is going to understand, many, many programmers are not going to be able to type it on their keyboards, and that's going to cause nothing but problems.

For better or worse, English is the international language of science and engineering, and that includes programming. So, any programs that are intended to be seen and used by the world at large need to be in ASCII. And the biggest practical issue with that is whether a character is even on a typical keyboard. Using a Unicode character in a program makes it so that many programmers cannot type it. And even given the large breadth of Unicode characters, you could have a keyboard that supports a number of Unicode characters and still not have the particular character in question. So, open source programs need to be in ASCII.

Now, I don't know that it's a problem to support a wide range of Unicode characters in identifiers when you consider the needs of folks whose native language is not English (especially when it's a language like Chinese or Japanese), but open source programs should only be using ASCII identifiers. And unfortunately, sometimes the fact that a language supports Unicode identifiers has led English speakers to do stupid things like use the lambda character in identifiers. So, I can understand Walter's reticence to go further with supporting Unicode identifiers, but on the other hand, when you consider how many people there are on the planet who use a language that doesn't even use the Latin alphabet, it's arguably a good idea to fully support Unicode identifiers.

- Jonathan M Davis



September 23, 2018
On Saturday, 22 September 2018 at 12:37:09 UTC, Steven Schveighoffer wrote:
> But aren't some (many?) Chinese/Japanese characters representing whole words?
>
> -Steve

Kind of hair-splitting, but it's more accurate to say that some Chinese/Japanese words can be written with one character.  Like how English speakers wouldn't normally say that "A" and "I" are characters representing whole words.
September 23, 2018
On Sunday, 23 September 2018 at 00:18:06 UTC, Adam D. Ruppe wrote:
> I have seen Japanese D code before on twitter, but cannot find it now (surely because the search engines also share this bias).

You can find a lot more Japanese D code on this blogging platform:
https://qiita.com/tags/dlang

Here's the most recent post to save you a click:
https://qiita.com/ShigekiKarita/items/9b3aa8f716848278ef62