Updating D beyond Unicode 2.0
September 21, 2018
D's currently accepted identifier characters are based on Unicode 2.0:

* ASCII range values are handled specially.
* Letters and combining marks from Unicode 2.0 are accepted.
* Numbers outside the ASCII range are accepted.
* Eight random punctuation marks are accepted.

This follows the C99 standard.

Many languages use the Unicode standard explicitly: C#, Go, Java, Python, ECMAScript, just to name a few. A small number of languages reject non-ASCII characters: Dart, Perl. Some languages are weirdly generous: Swift and C11 allow everything outside the Basic Multilingual Plane.

I'd like to update that so that D accepts something as a valid identifier character if it's a letter or combining mark or modifier symbol that's present in Unicode 11, or a non-ASCII number. This allows the 146 most popular writing systems and a lot more characters from those writing systems. This *would* reject those eight random punctuation marks, so I'll keep them in as legacy characters.

It would mean we don't have to reference the C99 standard when enumerating the allowed characters; we just have to refer to the Unicode standard, which we already need to talk about in the lexical part of the spec.

It might also make the lexer a tiny bit faster; it reduces the number of valid-ident-char segments to search from 245 to 134. On the other hand, it will change the ident char ranges from wchar to dchar, which means the table takes up marginally more memory.
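A rough sketch of what the membership test could look like (the names here are illustrative, not the actual lexer code — the real implementation would binary-search a generated table of dchar ranges, and std.uni's predicates stand in for that table; modifier symbols have no std.uni predicate and are omitted):

```d
// Illustrative sketch only, not the proposed patch.
import std.uni : isAlpha, isMark, isNumber;

bool isIdentChar(dchar c)
{
    // ASCII range is handled specially, as today.
    if (c < 0x80)
        return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
            || ('0' <= c && c <= '9') || c == '_';
    // Proposed rule: letters, combining marks, or non-ASCII numbers.
    // (The proposal also admits modifier symbols, not covered here.)
    return isAlpha(c) || isMark(c) || isNumber(c);
}

unittest
{
    assert(isIdentChar('x'));
    assert(isIdentChar('λ'));  // Greek letter
    assert(!isIdentChar(';'));
}
```

Note the parameter is a dchar rather than a wchar, which is where the slightly larger tables come from.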

And, of course, it lets you write programs entirely in Linear B, and that's a marketing ploy not to be missed.

I've got this coded up and can submit a PR, but I thought I'd get feedback here first.

Does anyone see any horrible potential problems here?

Or is there an interestingly better option?

Does this need a DIP?
September 21, 2018
When I originally started with D, I thought supporting non-ASCII Unicode identifiers was a good idea. I've since slowly become less and less enthusiastic about it.

First off, D source text simply must (and does) fully support Unicode in comments, characters, and string literals. That's not an issue.

But identifiers? I have hardly seen any use of non-ASCII identifiers in C, C++, or D. In fact, I've seen zero use of them outside of test cases. I don't see much point in expanding the support for them. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

Extending it further will also cause problems for all the tools that work with D object code, like debuggers, disassemblers, linkers, filesystems, etc.

Absent a much more compelling rationale for it, I'd say no.
September 21, 2018
Agreed with Walter.

I'm all on board with i18n, but I see no need for non-ASCII identifiers.

Even identifiers with a non-Latin origin are usually written in the Latin script.

As for real-world usage, I've seen Cyrillic identifiers a few times in PHP.


September 21, 2018
On Friday, 21 September 2018 at 23:00:45 UTC, Erik van Velzen wrote:
> Agreed with Walter.
>
> I'm all on board with i18n but I see no need for non-ascii identifiers.
>
> Even identifiers with a non-latin origin are usually written in the latin script.
>
> As for real-world usage I've seen Cyrillic identifiers a few times in PHP.

A: Wait. Using emojis as identifiers is not a good idea?
B: Yes.
A: But the cool kids are doing it:

https://codepen.io/andresgalante/pen/jbGqXj

In all seriousness, I hate it when someone thinks it's funny to use the lambda symbol as an identifier, and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.
(This is already supported in D.)
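For what it's worth, a minimal demonstration of that existing support (the identifiers here are made up for illustration):

```d
// λ (U+03BB) is already a valid D identifier today.
int λ = 42;

// Cyrillic identifiers work too ("udvoit" = "to double").
int удвоить(int x) { return x * 2; }

void main()
{
    assert(удвоить(λ) == 84);
}
```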
September 21, 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
> But identifiers? I haven't seen hardly any use of non-ascii identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases.

Do you look at Japanese D code much? Or Turkish? Or Chinese?

I know there are decently sized D communities in those languages, and I am pretty sure I have seen identifiers in their languages before, but I can't find it right now.

The point is, there's a pretty clear potential for observation bias here. Even our search-engine queries are going to be biased toward English-language results, so there can be a whole D world that's kinda invisible to you and me.

We should reach out and get solid stats before making a final decision.

> most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

Well, for example, with a Chinese company, they may very well find forced English identifiers to be an annoyance.
September 22, 2018
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:
> A: Wait. Using emojis as identifiers is not a good idea?
> B: Yes.
> A: But the cool kids are doing it:

The C11 spec says that emoji should be allowed in identifiers (draft N1570, page 504 of 522), so it's not just the cool kids.

I'm not in favor of emoji in identifiers.

> In all seriousness I hate it when someone thought its funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.

It's supported because λ is a letter in a language spoken by thirteen million people. I mean, would you want to have to name a variable "lumиnosиty" because someone got annoyed at people using "i" as a variable name?
September 22, 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
> But identifiers? I haven't seen hardly any use of non-ascii identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see much point in expanding the support of it. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

...you *do* know that not every codebase is worked on exclusively by people who know English, right?

If I took a software development job in China, I'd need to learn Chinese. I'd expect the codebase to be in Chinese. Because a Chinese company generally operates in Chinese, and they're likely to have a lot of employees who only speak Chinese.

And no, you can't just transcribe Chinese into ASCII.

Same for Spanish, Norwegian, German, Polish, Russian -- heck, it's almost easier to list out the languages you *don't* need non-ASCII characters for.

Anyway, here's some more D code using non-ASCII identifiers, in case you need examples: https://git.ikeran.org/dhasenan/muzikilo
September 22, 2018
On 22/09/2018 11:17 AM, Seb wrote:
> In all seriousness I hate it when someone thought its funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.
> (This is already supported in D.)

This can be strongly mitigated by using a compose key. Unfortunately, compose keys are not terribly common.
September 22, 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
> When I originally started with D, I thought non-ASCII identifiers with Unicode was a good idea. I've since slowly become less and less enthusiastic about it.
>
> First off, D source text simply must (and does) fully support Unicode in comments, characters, and string literals. That's not an issue.
>
> But identifiers? I haven't seen hardly any use of non-ascii identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see much point in expanding the support of it. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.
>
> Extending it further will also cause problems for all the tools that work with D object code, like debuggers, disassemblers, linkers, filesystems, etc.

To wit, a Windows linker error with a Unicode symbol:

https://github.com/ldc-developers/ldc/pull/2850#issuecomment-422968161

> Absent a much more compelling rationale for it, I'd say no.

I'm torn. I completely agree with Adam and others that people should be able to use any language they want. But the Unicode spec is such a tire fire that I'm leery of extending support for it.

Someone linked this Swift chapter on Unicode handling in an earlier forum thread, read the section on emoji in particular:

https://oleb.net/blog/2017/11/swift-4-strings/

I was laughing out loud when reading about composing "family" emojis with zero-width joiners. If you told me that was a tech parody, I'd have believed it.

I believe Swift just punts its Unicode support to ICU, like almost every other project these days. That's a horrible sign: you've created a spec so grotesquely complicated that nearly everybody relies on a single project to avoid dealing with it.
September 22, 2018
On Saturday, 22 September 2018 at 04:54:59 UTC, Joakim wrote:
> To wit, Windows linker error with Unicode symbol:
>
> https://github.com/ldc-developers/ldc/pull/2850#issuecomment-422968161

That's a good argument for sticking to ASCII for name mangling.

> I'm torn. I completely agree with Adam and others that people should be able to use any language they want. But the Unicode spec is such a tire fire that I'm leery of extending support for it.

The compiler doesn't have to do much with Unicode processing, fortunately.