September 22, 2006 identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
http://www.digitalmars.com/d/lex.html#identifier
# Identifiers start with a letter, _, or universal alpha, and are followed
# by any number of letters, _, digits, or universal alphas. Universal
# alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
# C99 Standard.)

Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
"universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
"universal alpha".

Sample:
\u00B7 (MIDDLE DOT, Other_Punctuation) isn't a "universal alpha" but
allowed by Appendix D in identifiers.

"ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
"ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
drop the redirection via "Appendix D" and use
"ISO/IEC TR 10176 (current)" instead of the dated version
"ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
chunk of CJK and Math characters that can be found in the current version.

Thomas
|
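The MIDDLE DOT sample above can be checked against the Unicode Character Database. The following is only a sketch in Python (not D; picked because its standard `unicodedata` module exposes general categories), and it assumes the Unicode data bundled with the interpreter. The helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (character name, general category) for a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)

# U+00B7 is allowed in identifiers by C99 Appendix D,
# yet its general category is Other_Punctuation (Po), not a letter.
print(describe("\u00B7"))

# An ordinary Latin letter for comparison (category Ll).
print(describe("a"))
```

Any category starting with "L" would count as a letter; "Po" does not, which is the mismatch the post points at.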
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kuehne | Thomas Kuehne wrote:
>
> http://www.digitalmars.com/d/lex.html#identifier
> # Identifiers start with a letter, _, or universal alpha, and are followed
> # by any number of letters, _, digits, or universal alphas. Universal
> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
> # C99 Standard.)
>
> Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
> "universal alpha".
>
> Sample:
> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't a "universal alpha" but
> allowed by Appendix D in identifiers.
>
> "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
> drop the redirection via "Appendix D" and use
> "ISO/IEC TR 10176 (current)" instead of the dated version
> "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
> chunk of CJK and Math characters that can be found in the current version.
Agreed. Incidentally, the 2003 revision to the C++ standard ("ISO/IEC 14882:2003(E)"), Appendix E, contains a revised copy of the character table (which is likely from "ISO/IEC TR 10176:2003") and appears to have done away with the "special characters" section entirely. So I suspect your suggestion would eliminate the problem you mention above as well?
Sean
|
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sean Kelly | Sean Kelly wrote on 2006-09-22:
> Thomas Kuehne wrote:
>>
>> http://www.digitalmars.com/d/lex.html#identifier
>> # Identifiers start with a letter, _, or universal alpha, and are followed
>> # by any number of letters, _, digits, or universal alphas. Universal
>> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
>> # C99 Standard.)
>>
>> Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
>> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
>> "universal alpha".
>>
>> Sample:
>> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't a "universal alpha" but
>> allowed by Appendix D in identifiers.
>>
>> "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
>> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
>> drop the redirection via "Appendix D" and use
>> "ISO/IEC TR 10176 (current)" instead of the dated version
>> "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
>> chunk of CJK and Math characters that can be found in the current version.
>
> Agreed. Incidentally, the 2003 revision to the C++ standard
> ("ISO/IEC 14882:2003(E)"), Appendix E, contains a revised copy of the
> character table (which is likely from "ISO/IEC TR 10176:2003") and appears
> to have done away with the "special characters" section entirely. So I
> suspect your suggestion would eliminate the problem you mention above as well?

Yes. How about this rewrite:

# Identifier:
#     IdentifierStart
#     IdentifierStart IdentifierChars
#
# IdentifierChars:
#     IdentifierChar
#     IdentifierChar IdentifierChars
#
# IdentifierStart:
#     _
#     Letter
#
# IdentifierChar:
#     IdentifierStart
#     Number
#     NonspacingMark
#
# Identifiers start with a letter, or _ and are followed
# by any number of letters, _, or digits. Letters, Numbers and
# NonspacingMarks are those defined in ISO/IEC TR 10176.

Accessing ISO standards can be complicated. Here are the cross-references for Unicode's UnicodeData.txt. For the relation between Unicode and ISO10176 see
http://en.wikipedia.org/wiki/ISO/IEC_10646#Differences_between_ISO_10646_and_Unicode

Letters:
    Uppercase_Letter (Lu)
    Lowercase_Letter (Ll)
    Titlecase_Letter (Lt)
    Modifier_Letter (Lm)
    Other_Letter (Lo)

NonspacingMarks:
    Nonspacing_Mark (Mn)

Numbers:
    Decimal_Number (Nd)
    Letter_Number (Nl)
    Other_Number (No)

Thomas
|
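The proposed grammar and the category cross-references above can be sketched roughly as follows. This is Python rather than D, it relies on the Unicode data shipped with the interpreter instead of ISO/IEC TR 10176 itself, and the function names are made up for illustration, so treat it as an approximation of the proposal and not as what any D compiler does.

```python
import unicodedata

# General categories taken from the cross-reference list in the post.
LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo"}
NONSPACING_MARK = {"Mn"}
NUMBER = {"Nd", "Nl", "No"}

def is_identifier_start(ch):
    """IdentifierStart: '_' or a Letter."""
    return ch == "_" or unicodedata.category(ch) in LETTER

def is_identifier_char(ch):
    """IdentifierChar: IdentifierStart, Number, or NonspacingMark."""
    category = unicodedata.category(ch)
    return (is_identifier_start(ch)
            or category in NUMBER
            or category in NONSPACING_MARK)

def is_identifier(text):
    """Identifier: one IdentifierStart followed by any IdentifierChars."""
    if not text:
        return False
    return (is_identifier_start(text[0])
            and all(is_identifier_char(ch) for ch in text[1:]))

# MIDDLE DOT (Po) is rejected under this proposal,
# while a combining acute accent (Mn) is accepted after the first character.
print(is_identifier("a\u00B7b"), is_identifier("e\u0301"))
```

Keeping the three category sets separate mirrors the Letter / Number / NonspacingMark split in the grammar, so moving a category between the sets is a one-line change.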
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kuehne | Thomas Kuehne wrote:
>
> http://www.digitalmars.com/d/lex.html#identifier
> # Identifiers start with a letter, _, or universal alpha, and are followed
> # by any number of letters, _, digits, or universal alphas. Universal
> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
> # C99 Standard.)
>
> Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
> "universal alpha".
>
> Sample:
> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't a "universal alpha" but
> allowed by Appendix D in identifiers.
>
> "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
> drop the redirection via "Appendix D" and use
> "ISO/IEC TR 10176 (current)" instead of the dated version
> "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
> chunk of CJK and Math characters that can be found in the current version.
I'd like to leave things as they are for 1.0. I don't think that anyone's code will be adversely affected by not having the latest alpha character additions to identifiers, and I also don't think math characters should be part of identifiers. What is CJK?
As it is now, it matches standard C's definition of identifiers, which is the intent of the reference. I haven't checked, but I think it matches Java's idea of an identifier character, too.
P.S. It also bugs me that the unicode people can't seem to make up their minds. Do character sets really need to change every 2 or 3 years?
|
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | > Thomas Kuehne wrote:
>> What is CJK?
Just a guess: "Chinese, Japanese & Korean"?
- Eric
|
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Pragma | Pragma wrote:
>> Thomas Kuehne wrote:
>>> What is CJK?
>
> Just a guess: "Chinese, Japanese & Korean"?
>
> - Eric

Your guess is correct. Wikipedia does a great job explaining CJK: http://en.wikipedia.org/wiki/CJK
|
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Walter Bright wrote on 2006-09-22:
> Thomas Kuehne wrote:
>>
>> http://www.digitalmars.com/d/lex.html#identifier
>> # Identifiers start with a letter, _, or universal alpha, and are followed
>> # by any number of letters, _, digits, or universal alphas. Universal
>> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
>> # C99 Standard.)
>>
>> Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
>> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
>> "universal alpha".
>>
>> Sample:
>> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't a "universal alpha" but
>> allowed by Appendix D in identifiers.
>>
>> "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
>> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
>> drop the redirection via "Appendix D" and use
>> "ISO/IEC TR 10176 (current)" instead of the dated version
>> "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
>> chunk of CJK and Math characters that can be found in the current version.
>
> I'd like to leave things as they are for 1.0. I don't think that anyone's
> code will be adversely affected by not having the latest alpha character
> additions to identifiers, and I also don't think math characters should be
> part of identifiers. What is CJK?

CJK: Chinese, Japanese & Korean
0x20000 .. 0x2A6D6 CJK Ideograph Extension B
0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS

> As it is now, it matches standard C's definition of identifiers, which is
> the intent of the reference. I haven't checked, but I think it matches
> Java's idea of an identifier character, too.

ISO/IEC 9899:1999 (E) Appendix D
# 1) This clause lists the hexadecimal code values that are valid in
# universal character names in identifiers.

Whereas Appendix D defines valid characters in identifiers, D uses it
as a source for "universal alpha". As a consequence std.uni.isUniAlpha
claims that \u00B7 (MIDDLE DOT) is a letter...

> P.S. It also bugs me that the unicode people can't seem to make up their
> minds. Do character sets really need to change every 2 or 3 years?

Task at hand: Create a table of all characters used by humans all over
the world and minimize friction due to political issues
(e.g. characters' names). Except for bug fixes (typos...) the unicode people
usually only extend previous versions of the standard.

Thomas
|
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kuehne | Thomas Kuehne wrote:
> Walter Bright wrote on 2006-09-22:
>> What is CJK?
>
> CJK: Chinese, Japanese & Korean
> 0x20000 .. 0x2A6D6 CJK Ideograph Extension B
> 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS

Thank you.

>> As it is now, it matches standard C's definition of identifiers, which is
>> the intent of the reference. I haven't checked, but I think it matches
>> Java's idea of an identifier character, too.
>
> ISO/IEC 9899:1999 (E) Appendix D
> # 1) This clause lists the hexadecimal code values that are valid in
> # universal character names in identifiers.
>
> Whereas Appendix D defines valid characters in identifiers, D uses it
> as a source for "universal alpha". As a consequence std.uni.isUniAlpha
> claims that \u00B7 (MIDDLE DOT) is a letter...

I guess I don't see why C99 would say . is a valid identifier character if it isn't an alpha. It's all confusing to me, and I think needlessly complicated. Is \u00B7 the only difference?

>> P.S. It also bugs me that the unicode people can't seem to make up their
>> minds. Do character sets really need to change every 2 or 3 years?
>
> Task at hand: Create a table of all characters used by humans all over
> the world and minimize friction due to political issues
> (e.g. characters' names). Except for bug fixes (typos...) the unicode people
> usually only extend previous versions of the standard.

Chinese, Japanese, and Korean are hardly obscure so I don't see why the character sets for them seem to need large numbers of additions this late in the game.
|
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kuehne | Thomas Kuehne wrote on 2006-09-22:
> Walter Bright wrote on 2006-09-22:
>> Thomas Kuehne wrote:
<snip>
>> I'd like to leave things as they are for 1.0. I don't think that anyone's
>> code will be adversely affected by not having the latest alpha character
>> additions to identifiers, and I also don't think math characters should be
>> part of identifiers. What is CJK?
>
> CJK: Chinese, Japanese & Korean
> 0x20000 .. 0x2A6D6 CJK Ideograph Extension B
> 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS

A closer look reveals that Appendix D is also missing (among many others):

0x0712 .. 0x072F SYRIAC LETTER
0x1200 .. 0x1248 ETHIOPIC SYLLABLE
0x13A0 .. 0x13F4 CHEROKEE LETTER
0x3400 .. 0x4DB5 CJK Ideograph Extension A
0xA016 .. 0xA48C YI SYLLABLE
0xF900 .. 0xFAD9 CJK COMPATIBILITY IDEOGRAPH
0xFB46 .. 0xFBB1 HEBREW / ARABIC LETTER
0xFF21 .. 0xFF3A FULLWIDTH LATIN CAPITAL LETTER
0xFF41 .. 0xFF5A FULLWIDTH LATIN SMALL LETTER

Thomas
|
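Range lists like the one above are typically consulted with a binary search over a sorted table. A minimal Python sketch of that idea follows; the table holds only the ranges quoted in this thread (so it is nowhere near a complete character list), and the names `MISSING_RANGES` and `in_ranges` are hypothetical, not part of any D library.

```python
from bisect import bisect_right

# Inclusive code point ranges copied from the posts above, sorted by start.
MISSING_RANGES = [
    (0x0712, 0x072F),    # SYRIAC LETTER
    (0x1200, 0x1248),    # ETHIOPIC SYLLABLE
    (0x13A0, 0x13F4),    # CHEROKEE LETTER
    (0x3400, 0x4DB5),    # CJK Ideograph Extension A
    (0xA016, 0xA48C),    # YI SYLLABLE
    (0xF900, 0xFAD9),    # CJK COMPATIBILITY IDEOGRAPH
    (0xFB46, 0xFBB1),    # HEBREW / ARABIC LETTER
    (0xFF21, 0xFF3A),    # FULLWIDTH LATIN CAPITAL LETTER
    (0xFF41, 0xFF5A),    # FULLWIDTH LATIN SMALL LETTER
    (0x20000, 0x2A6D6),  # CJK Ideograph Extension B
    (0x2F800, 0x2FA1D),  # CJK COMPATIBILITY IDEOGRAPHS
]
_STARTS = [start for start, _ in MISSING_RANGES]

def in_ranges(code_point):
    """Return True if code_point falls inside one of the listed ranges."""
    # Index of the last range whose start is <= code_point, or -1 if none.
    index = bisect_right(_STARTS, code_point) - 1
    return index >= 0 and code_point <= MISSING_RANGES[index][1]

# CHEROKEE LETTER A is in the list; MIDDLE DOT is not.
print(in_ranges(0x13A0), in_ranges(0x00B7))
```

The lookup is O(log n) in the number of ranges, which matters once the table grows to the several hundred entries a full character list would need.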
September 22, 2006 Re: identifiers & "unialpha" | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Walter Bright wrote:
> Thomas Kuehne wrote:
>>
>> ISO/IEC 9899:1999 (E) Appendix D
>> # 1) This clause lists the hexadecimal code values that are valid in
>> # universal character names in identifiers.
>>
>> Whereas Appendix D defines valid characters in identifiers, D uses it
>> as a source for "universal alpha". As a consequence std.uni.isUniAlpha
>> claims that \u00B7 (MIDDLE DOT) is a letter...
>
> I guess I don't see why C99 would say . is a valid identifier character if
> it isn't an alpha. It's all confusing to me, and I think needlessly
> complicated. Is \u00B7 the only difference?

No, there are other differences as well. I think C99 was simply referring to the latest version of the document available in 1999, and it has since been revised (in 2003, apparently). But I have no idea why characters present in the 1999 doc are not present in the 2003 doc. To pass the buck even further, "ISO/IEC TR 10176:2003" Annex A says the following:

    This list comprises the letters (combining or not), syllables, and
    ideographs from ISO/IEC 10646-1, together with the modifier letters
    and marks conventionally used as parts of words.

So their list of characters is copied from the Unicode standard (ISO/IEC 10646). I can only conclude that the Unicode standard changed between 1999-2003 and ISO/IEC 10176 simply incorporated the new list. But who knows why the list was changed.

This does raise an interesting point however. Since the C and C++ standards separately refer to ISO/IEC 10176 for their character list, the identifiers a compliant C99 and C++2003 compiler should accept are different. This seems contrary to the usual C++ practice of deferring to the C standard on semantic issues.

>>> P.S. It also bugs me that the unicode people can't seem to make up their
>>> minds. Do character sets really need to change every 2 or 3 years?
>>
>> Task at hand: Create a table of all characters used by humans all over
>> the world and minimize friction due to political issues
>> (e.g. characters' names). Except for bug fixes (typos...) the unicode people
>> usually only extend previous versions of the standard.
>
> Chinese, Japanese, and Korean are hardly obscure so I don't see why the
> character sets for them seem to need large numbers of additions this late
> in the game.

Me either. But then I'm not terribly inclined to read the Unicode standards committee minutes to find out either :-)

Sean
|