October 30, 2009 [Issue 3455] New: Some Unicode characters not allowed in identifiers

http://d.puremagic.com/issues/show_bug.cgi?id=3455

Summary: Some Unicode characters not allowed in identifiers
Product: D
Version: unspecified
Platform: Other
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: DMD
AssignedTo: nobody@puremagic.com
ReportedBy: andrei@metalanguage.com

--- Comment #0 from Andrei Alexandrescu <andrei@metalanguage.com> 2009-10-30 09:30:44 PDT ---

Consider:

void main() {
    auto aλ = "９";
    auto a９ = "９";
}

The first identifier is an "a" followed by this:

http://www.fileformat.info/info/unicode/char/03bb/index.htm

The second identifier is an "a" followed by this:

http://www.fileformat.info/info/unicode/char/ff19/index.htm

Both string literals contain the latter. The second identifier does not compile, although I checked that my editor inserted the correct three-byte UTF-8 code.

--
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
October 30, 2009 [Issue 3455] Some Unicode characters not allowed in identifiers

Posted in reply to Andrei Alexandrescu

http://d.puremagic.com/issues/show_bug.cgi?id=3455

Matti Niemenmaa <matti.niemenmaa+dbugzilla@iki.fi> changed:

What | Removed | Added
---|---|---
Keywords | | spec
CC | | matti.niemenmaa+dbugzilla@iki.fi
Platform | Other | All
OS/Version | Linux | All
Severity | normal | enhancement

--- Comment #1 from Matti Niemenmaa <matti.niemenmaa+dbugzilla@iki.fi> 2009-10-30 09:51:09 PDT ---

As http://www.digitalmars.com/d/1.0/lex.html#identifier very clearly states, the allowed characters in identifiers are those defined in the C99 standard, ISO/IEC 9899:1999(E) Annex D. Have a look at it: http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf

９, code point 0xff19, is not in that list. The maximum one is 0xd7a3, in fact. This is not a bug, this is an enhancement.

However, rather than an arbitrary and frozen list, I /would/ prefer basing it simply on Unicode properties, such as Java's choice: identifiers may start with letters or numeric letters, and may contain, in addition to those, connecting punctuation, decimal digits, and combining and non-spacing marks. In other words:

Identifiers may start with code points from the general categories Ll, Lm, Lo, Lt, Lu, Nl.

Identifiers may contain code points from the general categories Ll, Lm, Lo, Lt, Lu, Mc, Mn, Nd, Nl, No, Pc.

Java also allows Cc and Cf, of whose usefulness I'm not so convinced. These are control characters and things like "soft hyphen", which isn't even supposed to be displayed unless the word line-wraps. Too much potential for confusion IMHO.
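The category-based rule proposed in comment #1 could be sketched roughly as follows. This is an illustration only, not DMD's lexer: the helper names `isIdentStart` and `isIdentPart` are hypothetical, and the sketch assumes a recent Phobos in which `std.uni` exposes general categories through the `unicode` property sets.

```d
import std.stdio : writefln;
import std.uni : unicode;

// Hypothetical helper (not part of DMD): true if c belongs to one of the
// categories comment #1 lists for the first code point of an identifier.
bool isIdentStart(dchar c)
{
    auto start = unicode.Ll | unicode.Lm | unicode.Lo | unicode.Lt
               | unicode.Lu | unicode.Nl;
    return start[c];
}

// Hypothetical helper: true if c belongs to one of the categories
// comment #1 lists for any position in an identifier.
bool isIdentPart(dchar c)
{
    auto part = unicode.Ll | unicode.Lm | unicode.Lo | unicode.Lt
              | unicode.Lu | unicode.Mc | unicode.Mn | unicode.Nd
              | unicode.Nl | unicode.No | unicode.Pc;
    return part[c];
}

void main()
{
    // 'a', U+03BB (lambda), U+FF19 (fullwidth digit nine), underscore.
    foreach (dchar c; "a\u03BB\uFF19_"d)
        writefln("U+%04X  start: %s  part: %s",
                 cast(uint) c, isIdentStart(c), isIdentPart(c));
}
```

U+FF19 has general category Nd, so under this rule it would be accepted inside an identifier such as the one in comment #0, but not as its first character. The sets are rebuilt on every call to keep the sketch short; real code would build them once.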
October 30, 2009 [Issue 3455] Some Unicode characters not allowed in identifiers

Posted in reply to Andrei Alexandrescu

http://d.puremagic.com/issues/show_bug.cgi?id=3455

--- Comment #2 from Andrei Alexandrescu <andrei@metalanguage.com> 2009-10-30 11:40:05 PDT ---

(In reply to comment #1)
> As http://www.digitalmars.com/d/1.0/lex.html#identifier very clearly states, the allowed characters in identifiers are those defined in the C99 standard, ISO/IEC 9899:1999(E) Annex D. Have a look at it: http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf
>
> ９, code point 0xff19, is not in that list. The maximum one is 0xd7a3, in fact. This is not a bug, this is an enhancement.
>
> However, rather than an arbitrary and frozen list, I /would/ prefer basing it simply on Unicode properties, such as Java's choice: identifiers may start with letters or numeric letters, and may contain, in addition to those, connecting punctuation, decimal digits, and combining and non-spacing marks. In other words:
>
> Identifiers may start with code points from the general categories Ll, Lm, Lo, Lt, Lu, Nl.
>
> Identifiers may contain code points from the general categories Ll, Lm, Lo, Lt, Lu, Mc, Mn, Nd, Nl, No, Pc.
>
> Java also allows Cc and Cf, of whose usefulness I'm not so convinced. These are control characters and things like "soft hyphen", which isn't even supposed to be displayed unless the word line-wraps. Too much potential for confusion IMHO.

Oh ok. Thanks Matti. I'm leaving this as an enhancement request. Currently the error message is:

invalid UTF-8 sequence
unsupported char 0x99

This is factually incorrect because the UTF-8 sequence is correct. I suggest instead:

Unicode character 0xFF19 not allowed in a symbol

Andrei
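To see why the sequence itself is well-formed, the bytes can be inspected directly. The following is a minimal sketch, assuming Phobos' `std.utf.validate`, `std.string.representation` and `std.format.format`; the helper name `utf8Hex` is hypothetical.

```d
import std.format : format;
import std.stdio : writeln;
import std.string : representation;
import std.utf : validate;

// Hypothetical helper: render the UTF-8 code units of s as hex bytes.
string utf8Hex(string s)
{
    validate(s); // throws UTFException if s is not well-formed UTF-8
    return format("%(%02X %)", s.representation);
}

void main()
{
    // U+FF19 FULLWIDTH DIGIT NINE, written as an escape so this file stays ASCII.
    writeln(utf8Hex("\uFF19")); // EF BC 99
    // U+03BB, the lambda from the first identifier, needs only two bytes.
    writeln(utf8Hex("\u03BB")); // CE BB
}
```

The final byte of the three-byte sequence is 0x99, which matches the number in the old message. That the lexer was reporting this trailing byte rather than the decoded code point is an inference from the byte values, not something the thread states.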
December 12, 2009 [Issue 3455] Some Unicode characters not allowed in identifiers

Posted in reply to Andrei Alexandrescu

http://d.puremagic.com/issues/show_bug.cgi?id=3455

Walter Bright <bugzilla@digitalmars.com> changed:

What | Removed | Added
---|---|---
CC | | bugzilla@digitalmars.com

--- Comment #3 from Walter Bright <bugzilla@digitalmars.com> 2009-12-12 00:17:37 PST ---

I'm slowly becoming convinced that allowing Unicode characters in identifiers is just a bad idea anyway. While there is plenty of interest in writing code that manipulates Unicode and has Unicode strings, there is little interest in writing the code itself in Unicode. There's a growing consensus that code should be written in ASCII, for a long list of reasons.

For C compatibility, D should support the C identifiers, but I don't think there's an advantage to going beyond that.

For instance, the Unicode character used in Andrei's test case won't even display properly in Explorer.

I'll fix the error message, then call it resolved.
December 12, 2009 [Issue 3455] Some Unicode characters not allowed in identifiers

Posted in reply to Andrei Alexandrescu

http://d.puremagic.com/issues/show_bug.cgi?id=3455

Kosmonaut <Kosmonaut@tempinbox.com> changed:

What | Removed | Added
---|---|---
CC | | Kosmonaut@tempinbox.com

--- Comment #4 from Kosmonaut <Kosmonaut@tempinbox.com> 2009-12-12 14:21:14 PST ---

[leandro]Relevant SVN commit:[/leandro] http://www.dsource.org/projects/dmd/changeset/292
December 31, 2009 [Issue 3455] Some Unicode characters not allowed in identifiers

Posted in reply to Andrei Alexandrescu

http://d.puremagic.com/issues/show_bug.cgi?id=3455

Walter Bright <bugzilla@digitalmars.com> changed:

What | Removed | Added
---|---|---
Status | NEW | RESOLVED
Resolution | | FIXED

--- Comment #5 from Walter Bright <bugzilla@digitalmars.com> 2009-12-31 11:11:58 PST ---

Fixed dmd 1.054 and 2.038
January 01, 2010 [Issue 3455] Some Unicode characters not allowed in identifiers

Posted in reply to Andrei Alexandrescu

http://d.puremagic.com/issues/show_bug.cgi?id=3455

Ali Cehreli <acehreli@yahoo.com> changed:

What | Removed | Added
---|---|---
CC | | acehreli@yahoo.com

--- Comment #6 from Ali Cehreli <acehreli@yahoo.com> 2009-12-31 16:04:07 PST ---

(In reply to comment #3)
> there is little interest in writing the code itself in Unicode. There's a growing consensus that code should be written in ASCII, for a long list of reasons.

Thank you very much for allowing us to program in UTF-8. There is a yet-to-grow Turkish D community out there who have tremendous joy in being able to program in Turkish.

I may be in the minority here, but UTF-8 identifiers have been the most important feature for me to consider D.

Ali
Copyright © 1999-2021 by the D Language Foundation