Updating D beyond Unicode 2.0
September 21, 2018
D's currently accepted identifier characters are based on Unicode 2.0:

* ASCII range values are handled specially.
* Letters and combining marks from Unicode 2.0 are accepted.
* Numbers outside the ASCII range are accepted.
* Eight random punctuation marks are accepted.

This follows the C99 standard.

Many languages use the Unicode standard explicitly: C#, Go, Java, Python, ECMAScript, just to name a few. A small number of languages reject non-ASCII characters: Dart, Perl. Some languages are weirdly generous: Swift and C11 allow everything outside the Basic Multilingual Plane.

I'd like to update that so that D accepts something as a valid identifier character if it's a letter or combining mark or modifier symbol that's present in Unicode 11, or a non-ASCII number. This allows the 146 most popular writing systems and a lot more characters from those writing systems. This *would* reject those eight random punctuation marks, so I'll keep them in as legacy characters.

It would mean we don't have to reference the C99 standard when enumerating the allowed characters; we just have to refer to the Unicode standard, which we already need to talk about in the lexical part of the spec.

It might also make the lexer a tiny bit faster; it reduces the number of valid-ident-char segments to search from 245 to 134. On the other hand, it will change the ident char ranges from wchar to dchar, which means the table takes up marginally more memory.
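A rough sketch of what the membership test could look like (the names here are illustrative, not the actual lexer code — the real implementation would binary-search a generated table of dchar ranges, and std.uni's predicates stand in for that table; modifier symbols have no std.uni predicate and are omitted):

```d
// Illustrative sketch only, not the proposed patch.
import std.uni : isAlpha, isMark, isNumber;

bool isIdentChar(dchar c)
{
    // ASCII range is handled specially, as today.
    if (c < 0x80)
        return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
            || ('0' <= c && c <= '9') || c == '_';
    // Proposed rule: letters, combining marks, or non-ASCII numbers.
    // (The proposal also admits modifier symbols, not covered here.)
    return isAlpha(c) || isMark(c) || isNumber(c);
}

unittest
{
    assert(isIdentChar('x'));
    assert(isIdentChar('λ'));  // Greek letter
    assert(!isIdentChar(';'));
}
```

Note the parameter is a dchar rather than a wchar, which is where the slightly larger tables come from.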

And, of course, it lets you write programs entirely in Linear B, and that's a marketing ploy not to be missed.

I've got this coded up and can submit a PR, but I thought I'd get feedback here first.

Does anyone see any horrible potential problems here?

Or is there an interestingly better option?

Does this need a DIP?
September 21, 2018
When I originally started with D, I thought supporting non-ASCII Unicode identifiers was a good idea. I've since slowly become less and less enthusiastic about it.

First off, D source text simply must (and does) fully support Unicode in comments, characters, and string literals. That's not an issue.

But identifiers? I have hardly seen any use of non-ASCII identifiers in C, C++, or D. In fact, I've seen zero use of them outside of test cases. I don't see much point in expanding the support for them. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

Extending it further will also cause problems for all the tools that work with D object code, like debuggers, disassemblers, linkers, filesystems, etc.

Absent a much more compelling rationale for it, I'd say no.
September 21, 2018
Agreed with Walter.

I'm all on board with i18n, but I see no need for non-ASCII identifiers.

Even identifiers with a non-Latin origin are usually written in the Latin script.

As for real-world usage, I've seen Cyrillic identifiers a few times in PHP.


September 21, 2018
On Friday, 21 September 2018 at 23:00:45 UTC, Erik van Velzen wrote:
> Agreed with Walter.
>
> I'm all on board with i18n but I see no need for non-ascii identifiers.
>
> Even identifiers with a non-latin origin are usually written in the latin script.
>
> As for real-world usage I've seen Cyrillic identifiers a few times in PHP.

A: Wait. Using emojis as identifiers is not a good idea?
B: Yes.
A: But the cool kids are doing it:

https://codepen.io/andresgalante/pen/jbGqXj

In all seriousness, I hate it when someone thinks it's funny to use the lambda symbol as an identifier, and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.
(This is already supported in D.)
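For what it's worth, a minimal demonstration of that existing support (the identifiers here are made up for illustration):

```d
// λ (U+03BB) is already a valid D identifier today.
int λ = 42;

// Cyrillic identifiers work too ("udvoit" = "to double").
int удвоить(int x) { return x * 2; }

void main()
{
    assert(удвоить(λ) == 84);
}
```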
September 21, 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
> But identifiers? I haven't seen hardly any use of non-ascii identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases.

Do you look at Japanese D code much? Or Turkish? Or Chinese?

I know there are decently sized D communities in those languages, and I am pretty sure I have seen identifiers in their languages before, but I can't find it right now.

The point is, there's a pretty clear potential for observation bias here. Even our search-engine queries are going to be biased toward English-language results, so there can be a whole D world that's kinda invisible to you and me.

We should reach out and get solid stats before making a final decision.

> most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

Well, for example, with a Chinese company, they may very well find forced English identifiers to be an annoyance.
September 22, 2018
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:
> A: Wait. Using emojis as identifiers is not a good idea?
> B: Yes.
> A: But the cool kids are doing it:

The C11 spec says that emoji should be allowed in identifiers (draft N1570, page 504 of 522), so it's not just the cool kids.

I'm not in favor of emoji in identifiers.

> In all seriousness I hate it when someone thought its funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.

It's supported because λ is a letter in a language spoken by thirteen million people. I mean, would you want to have to name a variable "lumиnosиty" because someone got annoyed at people using "i" as a variable name?
September 22, 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
> But identifiers? I haven't seen hardly any use of non-ascii identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see much point in expanding the support of it. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.

...you *do* know that not every codebase is worked on exclusively by people who know English, right?

If I took a software development job in China, I'd need to learn Chinese. I'd expect the codebase to be in Chinese. Because a Chinese company generally operates in Chinese, and they're likely to have a lot of employees who only speak Chinese.

And no, you can't just transcribe Chinese into ASCII.

Same for Spanish, Norwegian, German, Polish, Russian -- heck, it's almost easier to list out the languages you *don't* need non-ASCII characters for.

Anyway, here's some more D code using non-ASCII identifiers, in case you need examples: https://git.ikeran.org/dhasenan/muzikilo
September 22, 2018
On 22/09/2018 11:17 AM, Seb wrote:
> In all seriousness I hate it when someone thought its funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.
> (This is already supported in D.)

This can be strongly mitigated by using a compose key. Unfortunately, compose keys are not terribly common.
September 22, 2018
On Friday, 21 September 2018 at 20:25:54 UTC, Walter Bright wrote:
> When I originally started with D, I thought non-ASCII identifiers with Unicode was a good idea. I've since slowly become less and less enthusiastic about it.
>
> First off, D source text simply must (and does) fully support Unicode in comments, characters, and string literals. That's not an issue.
>
> But identifiers? I haven't seen hardly any use of non-ascii identifiers in C, C++, or D. In fact, I've seen zero use of it outside of test cases. I don't see much point in expanding the support of it. If people use such identifiers, the result would most likely be annoyance rather than illumination when people who don't know that language have to work on the code.
>
> Extending it further will also cause problems for all the tools that work with D object code, like debuggers, disassemblers, linkers, filesystems, etc.

To wit, a Windows linker error with a Unicode symbol:

https://github.com/ldc-developers/ldc/pull/2850#issuecomment-422968161

> Absent a much more compelling rationale for it, I'd say no.

I'm torn. I completely agree with Adam and others that people should be able to use any language they want. But the Unicode spec is such a tire fire that I'm leery of extending support for it.

Someone linked this Swift chapter on Unicode handling in an earlier forum thread, read the section on emoji in particular:

https://oleb.net/blog/2017/11/swift-4-strings/

I was laughing out loud when reading about composing "family" emojis with zero-width joiners. If you told me that was a tech parody, I'd have believed it.

I believe Swift just punts its Unicode support to ICU, like almost every other project these days. That's a horrible sign: you've created a spec so grotesquely complicated that nearly everybody relies on a single project to avoid dealing with it.
September 22, 2018
On Saturday, 22 September 2018 at 04:54:59 UTC, Joakim wrote:
> To wit, Windows linker error with Unicode symbol:
>
> https://github.com/ldc-developers/ldc/pull/2850#issuecomment-422968161

That's a good argument for sticking to ASCII for name mangling.

> I'm torn. I completely agree with Adam and others that people should be able to use any language they want. But the Unicode spec is such a tire fire that I'm leery of extending support for it.

The compiler doesn't have to do much with Unicode processing, fortunately.