identifiers & "unialpha"
Thomas Kuehne
September 22, 2006
http://www.digitalmars.com/d/lex.html#identifier
# Identifiers start with a letter, _, or universal alpha, and are followed
# by any number of letters, _, digits, or universal alphas. Universal
# alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
# C99 Standard.)

Why does D reference "ISO/IEC 9899:1999 (E) Appendix D" to define
"universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" doesn't list
"universal alphas".

Sample:
\u00B7 (MIDDLE DOT, Other_Punctuation) isn't a "universal alpha" but is
allowed in identifiers by Appendix D.

"ISO/IEC 9899:1999 (E) Appendix D" itself references
"ISO/IEC TR 10176:1998" for the character data. I strongly suggest
dropping the redirection via "Appendix D" and using
"ISO/IEC TR 10176 (current)" instead of the dated version
"ISO/IEC TR 10176:1998". The 1998 version doesn't yet include quite a
chunk of the CJK and math characters that can be found in the current version.

Thomas


September 22, 2006
Thomas Kuehne wrote:
> http://www.digitalmars.com/d/lex.html#identifier
> # Identifiers start with a letter, _, or universal alpha, and are followed
> # by any number of letters, _, digits, or universal alphas. Universal
> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
> # C99 Standard.)
> 
> Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
> "universal alpha".
> 
> Sample:
> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't an "universal alpha" but
> allowed by Appendix D in identifiers.
> 
> "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
> drop the redirection via "Appendix D" and use
> "ISO/IEC TR 10176 (current)" instead of the dated version
> "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
> chunk of CJK and Math characters that can be found in the current version.

Agreed.  Incidentally, the 2003 revision to the C++ standard ("ISO/IEC 14882:2003(E)"), Appendix E, contains a revised copy of the character table (which is likely from "ISO/IEC TR 10176:2003") and appears to have done away with the "special characters" section entirely.  So I suspect your suggestion would eliminate the problem you mention above as well?


Sean
September 22, 2006
Sean Kelly schrieb am 2006-09-22:
> Thomas Kuehne wrote:
>> 
>> http://www.digitalmars.com/d/lex.html#identifier
>> # Identifiers start with a letter, _, or universal alpha, and are followed
>> # by any number of letters, _, digits, or universal alphas. Universal
>> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
>> # C99 Standard.)
>> 
>> Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
>> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
>> "universal alpha".
>> 
>> Sample:
>> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't an "universal alpha" but
>> allowed by Appendix D in identifiers.
>> 
>> "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
>> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
>> drop the redirection via "Appendix D" and use
>> "ISO/IEC TR 10176 (current)" instead of the dated version
>> "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
>> chunk of CJK and Math characters that can be found in the current version.
>
> Agreed.  Incidentally, the 2003 revision to the C++ standard ("ISO/IEC 14882:2003(E)"), Appendix E, contains a revised copy of the character table (which is likely from "ISO/IEC TR 10176:2003") and appears to have done away with the "special characters" section entirely.  So I suspect your suggestion would eliminate the problem you mention above as well?

Yes. How about this rewrite:

# Identifier:
#	IdentifierStart
#	IdentifierStart IdentifierChars
#
# IdentifierChars:
#	IdentifierChar
#	IdentifierChar IdentifierChars
#
# IdentifierStart:
#	_
#	Letter
#
# IdentifierChar:
#	IdentifierStart
#	Number
#	NonspacingMark
#
# Identifiers start with a letter or _, and are followed
# by any number of letters, _, or digits. Letters, Numbers and
# NonspacingMarks are those defined in ISO/IEC TR 10176.

Accessing ISO standards can be complicated. Here are the cross-references to the general categories in Unicode's UnicodeData.txt. For the relation between Unicode and ISO10176 see http://en.wikipedia.org/wiki/ISO/IEC_10646#Differences_between_ISO_10646_and_Unicode

Letters:
	Uppercase_Letter (Lu)
	Lowercase_Letter (Ll)
	Titlecase_Letter (Lt)
	Modifier_Letter (Lm)
	Other_Letter (Lo)

NonspacingMarks:
	Nonspacing_Mark (Mn)

Numbers:
	Decimal_Number (Nd)
	Letter_Number (Nl)
	Other_Number (No)

Thomas
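
The proposed grammar can be sketched in a few lines. This is only an illustration in Python, not D's lexer: it assumes the Unicode general categories listed above are an adequate stand-in for the ISO/IEC TR 10176 lists, and the names is_identifier_start, is_identifier_char and is_identifier are made up for this example.

```python
import unicodedata

# Assumption: Unicode general categories stand in for the TR 10176 lists.
LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo"}
NUMBER = {"Nd", "Nl", "No"}
NONSPACING_MARK = {"Mn"}


def is_identifier_start(ch):
    """IdentifierStart: '_' or a Letter."""
    return ch == "_" or unicodedata.category(ch) in LETTER


def is_identifier_char(ch):
    """IdentifierChar: IdentifierStart, Number or NonspacingMark."""
    if is_identifier_start(ch):
        return True
    return unicodedata.category(ch) in (NUMBER | NONSPACING_MARK)


def is_identifier(text):
    """Identifier: one IdentifierStart followed by zero or more IdentifierChars."""
    if not text:
        return False
    return is_identifier_start(text[0]) and all(
        is_identifier_char(ch) for ch in text[1:]
    )


if __name__ == "__main__":
    for candidate in ["_foo1", "1foo", "a\u00b7b", "e\u0301"]:
        print(repr(candidate), is_identifier(candidate))
```

Under this sketch "a\u00B7b" is rejected, because MIDDLE DOT is Other_Punctuation (Po), while a base letter followed by a combining accent (Mn) is accepted.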


September 22, 2006
Thomas Kuehne wrote:
> http://www.digitalmars.com/d/lex.html#identifier
> # Identifiers start with a letter, _, or universal alpha, and are followed
> # by any number of letters, _, digits, or universal alphas. Universal
> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
> # C99 Standard.)
> 
> Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
> "universal alpha".
> 
> Sample:
> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't an "universal alpha" but
> allowed by Appendix D in identifiers.
> 
> "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
> drop the redirection via "Appendix D" and use
> "ISO/IEC TR 10176 (current)" instead of the dated version
> "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
> chunk of CJK and Math characters that can be found in the current version.

I'd like to leave things as they are for 1.0. I don't think that anyone's code will be adversely affected by not having the latest alpha character additions to identifiers, and I also don't think math characters should be part of identifiers. What is CJK?

As it is now, it matches standard C's definition of identifiers, which is the intent of the reference. I haven't checked, but I think it matches Java's idea of an identifier character, too.

P.S. It also bugs me that the unicode people can't seem to make up their minds. Do character sets really need to change every 2 or 3 years?
September 22, 2006
> Thomas Kuehne wrote:
>> What is CJK?

Just a guess: "Chinese, Japanese & Korean"?

- Eric
September 22, 2006
Pragma wrote:
>> Thomas Kuehne wrote:
>>> What is CJK?
> 
> Just a guess: "Chinese, Japanese & Korean"?
> 
> - Eric

Your guess is correct. Wikipedia does a great job explaining CJK:

http://en.wikipedia.org/wiki/CJK
September 22, 2006
Walter Bright schrieb am 2006-09-22:
> Thomas Kuehne wrote:
>> 
>> http://www.digitalmars.com/d/lex.html#identifier
>> # Identifiers start with a letter, _, or universal alpha, and are followed
>> # by any number of letters, _, digits, or universal alphas. Universal
>> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
>> # C99 Standard.)
>> 
>> Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
>> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
>> "universal alpha".
>> 
>> Sample:
>> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't an "universal alpha" but
>> allowed by Appendix D in identifiers.
>> 
>> "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
>> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
>> drop the redirection via "Appendix D" and use
>> "ISO/IEC TR 10176 (current)" instead of the dated version
>> "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
>> chunk of CJK and Math characters that can be found in the current version.
>
> I'd like to leave things as they are for 1.0. I don't think that anyone's code will be adversely affected by not having the latest alpha character additions to identifiers, and I also don't think math characters should be part of identifiers. What is CJK?

CJK: Chinese, Japanese & Korean
0x20000 .. 0x2A6D6 CJK Ideograph Extension B
0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS

> As it is now, it matches standard C's definition of identifiers, which is the intent of the reference. I haven't checked, but I think it matches Java's idea of an identifier character, too.

ISO/IEC 9899:1999 (E) Appendix D
# 1) This clause lists the hexadecimal code values that are valid in
# universal character names in identifiers.

Whereas Appendix D defines valid characters in identifiers, D uses it as a source for "universal alpha". As a consequence std.uni.isUniAlpha claims that \u00B7 (MIDDLE DOT) is a letter...
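
For anyone who wants to check the classification of \u00B7 directly, here is a small sketch. It uses Python's unicodedata module as a neutral reference and says nothing about how std.uni is implemented; the helper name describe is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, is-letter flag) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isalpha())


if __name__ == "__main__":
    # U+00B7 is punctuation (Po) in the Unicode Character Database,
    # so it is not a letter under the general-category definition.
    print(describe("\u00b7"))  # ('MIDDLE DOT', 'Po', False)
```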

> P.S. It also bugs me that the unicode people can't seem to make up their minds. Do character sets really need to change every 2 or 3 years?

Task at hand: Create a table of all characters used by humans all over
the world and minimize friction due to political issues
(e.g. characters' names). Except for bug fixes (typos...) the unicode people
usually only extend previous versions of the standard.

Thomas


September 22, 2006
Thomas Kuehne wrote:
> Walter Bright schrieb am 2006-09-22:
>> What is CJK?
> 
> CJK: Chinese, Japanese & Korean
> 0x20000 .. 0x2A6D6 CJK Ideograph Extension B
> 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS

Thank you.

>> As it is now, it matches standard C's definition of identifiers, which is the intent of the reference. I haven't checked, but I think it matches Java's idea of an identifier character, too.
> 
> ISO/IEC 9899:1999 (E) Appendix D
> # 1) This clause lists the hexadecimal code values that are valid in
> # universal character names in identifiers.
> 
> Whereas Appendix D defines valid characters in identifiers, D uses it
> as a source for "universal alpha". As a consequence std.uni.isUniAlpha
> claims that \u00B7 (MIDDLE DOT) is a letter...

I guess I don't see why C99 would say · is a valid identifier character if it isn't an alpha. It's all confusing to me, and I think needlessly complicated. Is \u00B7 the only difference?

> 
>> P.S. It also bugs me that the unicode people can't seem to make up their minds. Do character sets really need to change every 2 or 3 years?
> 
> Task at hand: Create a table of all characters used by humans all over
> the world and minimize friction due to political issues
> (e.g. characters' names). Except for bug fixes (typos...) the unicode people
> usually only extend previous versions of the standard.

Chinese, Japanese, and Korean are hardly obscure so I don't see why the character sets for them seem to need large numbers of additions this late in the game.
September 22, 2006
Thomas Kuehne schrieb am 2006-09-22:
> Walter Bright schrieb am 2006-09-22:
>> Thomas Kuehne wrote:

<snip>
>> I'd like to leave things as they are for 1.0. I don't think that anyone's code will be adversely affected by not having the latest alpha character additions to identifiers, and I also don't think math characters should be part of identifiers. What is CJK?
>
> CJK: Chinese, Japanese & Korean
> 0x20000 .. 0x2A6D6 CJK Ideograph Extension B
> 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS

A closer look reveals that Appendix D is also missing
(among many others):

0x0712 .. 0x072F SYRIAC LETTER
0x1200 .. 0x1248 ETHIOPIC SYLLABLE
0x13A0 .. 0x13F4 CHEROKEE LETTER
0x3400 .. 0x4DB5 CJK Ideograph Extension A
0xA016 .. 0xA48C YI SYLLABLE
0xF900 .. 0xFAD9 CJK COMPATIBILITY IDEOGRAPH
0xFB46 .. 0xFBB1 HEBREW / ARABIC LETTER
0xFF21 .. 0xFF3A FULLWIDTH LATIN CAPITAL LETTER
0xFF41 .. 0xFF5A FULLWIDTH LATIN SMALL LETTER
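
A quick way to confirm that the first code point of each range above is a letter by general category is sketched below. It relies on the Unicode data bundled with Python, not on Appendix D or TR 10176, and the names RANGE_STARTS and letter_categories are invented for this example.

```python
import unicodedata

# First code point of each range listed above, with its label.
RANGE_STARTS = {
    0x0712: "SYRIAC LETTER",
    0x1200: "ETHIOPIC SYLLABLE",
    0x13A0: "CHEROKEE LETTER",
    0x3400: "CJK Ideograph Extension A",
    0xA016: "YI SYLLABLE",
    0xF900: "CJK COMPATIBILITY IDEOGRAPH",
    0xFB46: "HEBREW / ARABIC LETTER",
    0xFF21: "FULLWIDTH LATIN CAPITAL LETTER",
    0xFF41: "FULLWIDTH LATIN SMALL LETTER",
}


def letter_categories(code_points):
    """Map each code point to its Unicode general category."""
    return {cp: unicodedata.category(chr(cp)) for cp in code_points}


if __name__ == "__main__":
    for cp, category in letter_categories(RANGE_STARTS).items():
        print(f"0x{cp:04X} {category} {RANGE_STARTS[cp]}")
```

Every category printed begins with "L", which is what one would expect if these ranges hold letters that a category-based definition would admit.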

Thomas

September 22, 2006
Walter Bright wrote:
> Thomas Kuehne wrote:
>>
>> ISO/IEC 9899:1999 (E) Appendix D
>> # 1) This clause lists the hexadecimal code values that are valid in
>> # universal character names in identifiers.
>>
>> Whereas Appendix D defines valid characters in identifiers, D uses it
>> as a source for "universal alpha". As a consequence std.uni.isUniAlpha
>> claims that \u00B7 (MIDDLE DOT) is a letter...
> 
> I guess I don't see why C99 would say · is a valid identifier character if it isn't an alpha. It's all confusing to me, and I think needlessly complicated. Is \u00B7 the only difference?

No, there are other differences as well.  I think C99 was simply referring to the latest version of the document available in 1999, and it has since been revised (in 2003, apparently).  But I have no idea why characters present in the 1999 doc are not present in the 2003 doc.  To pass the buck even further, "ISO/IEC TR 10176:2003" Annex A says the following:

    This list comprises the letters (combining or not), syllables, and
    ideographs from ISO/IEC 10646-1, together with the modifier letters
    and marks conventionally used as parts of words.

So their list of characters is copied from the Unicode standard (ISO/IEC 10646).  I can only conclude that the Unicode standard changed between 1999 and 2003 and ISO/IEC 10176 simply incorporated the new list.  But who knows why the list was changed.

This does raise an interesting point, however.  Since the C and C++ standards separately refer to ISO/IEC 10176 for their character list, the identifiers a compliant C99 and C++2003 compiler should accept are different.  This seems contrary to the usual C++ practice of deferring to the C standard on semantic issues.

>>> P.S. It also bugs me that the unicode people can't seem to make up their minds. Do character sets really need to change every 2 or 3 years?
>>
>> Task at hand: Create a table of all characters used by humans all over
>> the world and minimize friction due to political issues
>> (e.g. characters' names). Except for bug fixes (typos...) the unicode people
>> usually only extend previous versions of the standard.
> 
> Chinese, Japanese, and Korean are hardly obscure so I don't see why the character sets for them seem to need large numbers of additions this late in the game.

Me either.  But then I'm not terribly inclined to read the Unicode standards committee minutes to find out either :-)


Sean