Why is BOM required to use unicode in tokens? (page 2)

On 9/15/20 10:59 AM, Steven Schveighoffer wrote:
>> Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.
>>
>> What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers?
>
> I'm thinking your issue will not be fixed (just like we don't allow $abc to be an identifier). But the spec can be fixed to refer to the correct standards.
>
> -Steve

Steve: It sounds as if the spec is correct but the glyph (codepoint?) range is outdated. If this is the case, it would be a worthwhile update. Do you really think it would be rejected out of hand?

September 15, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by Steven Schveighoffer
in reply to James Blachly

On 9/15/20 8:10 PM, James Blachly wrote:
> On 9/15/20 10:59 AM, Steven Schveighoffer wrote:
>>> Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.
>>>
>>> What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers?
>>
>> I'm thinking your issue will not be fixed (just like we don't allow $abc to be an identifier). But the spec can be fixed to refer to the correct standards.
>>
> 
> Steve: It sounds as if the spec is correct but the glyph (codepoint?) range is outdated. If this is the case, it would be a worthwhile update. Do you really think it would be rejected out of hand?
> 

I don't really know the answer, as I'm not a unicode expert.

Someone should verify that the character you want to use for a symbol name is actually considered a letter or not. Using phobos to prove this is kind of self-defeating, as I'm pretty sure it would be in league with DMD if there is a bug.

But if it's not a letter, then it would take more than just updating the range. It would be a change in the philosophy of what constitutes an identifier name.

-Steve
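
One way to run the check Steve asks for without relying on Phobos or DMD is to query an independent copy of the Unicode character database. The sketch below uses Python's standard `unicodedata` module purely as an illustration. The helper name `is_letter` is made up here, and "letter" is assumed to mean a general category that starts with `L`; the result also depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """Return True if the code point's general category is a letter (L*)."""
    return unicodedata.category(ch).startswith("L")

# The three characters from the thread: two accepted by DMD, one rejected.
for ch in "Шä∂":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category={unicodedata.category(ch)}, letter={is_letter(ch)}")
```

For U+2202 this reports category `Sm` (Symbol, math), so by this assumed definition it is not a letter, while Ш and ä are.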

On 9/15/20 8:10 PM, James Blachly wrote:
> Steve: It sounds as if the spec is correct but the glyph (codepoint?) range is outdated. If this is the case, it would be a worthwhile update. Do you really think it would be rejected out of hand?

OK, interestingly, this code point 0x2202 falls within the range "Mathematical Operators" [0], and I could see why, in general, a range called "operators" (which includes e.g. set membership, relations, operators you would see in abstract algebra, etc.) would be excluded from identifiers. However, the first 8 codepoints in the range are "Miscellaneous mathematical symbols" and include several that would be appropriately included as/in token names.

Indeed, chapter 22, page 823 of the Unicode standard groups ∂ U+2202 (the partial differential symbol in question) with the "Basic Set of Alphanumeric Characters", which includes Latin 0-9, [a-z,A-Z], uppercase Greek A-Ω, nabla and variant theta, the lowercase Greek letters, and, besides U+2202 ∂, six additional glyph variants. Because code points are de-duplicated, some characters that might rightly appear in multiple ranges (like U+2202 ∂) are assigned to only one, and that, I think, is the fate that befell this variant delta.

On Wednesday, 16 September 2020 at 00:22:15 UTC, Steven Schveighoffer wrote:
> Someone should verify that the character you want to use for a symbol name is actually considered a letter or not. Using phobos to prove this is kind of self-defeating, as I'm pretty sure it would be in league with DMD if there is a bug.

UnicodeData.txt (a data file provided by the Unicode organization itself since version 1) contains exactly the necessary properties in an easily parsable format. So we don't need to hard-code the list of allowed identifier characters, but can instead use the latest version provided by Unicode (which changes every year!). We only need to define which properties a character needs to have to be allowed in an identifier.
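
To make the UnicodeData.txt idea concrete, here is a rough sketch of reading the general category out of lines in that file's semicolon-separated layout (field 0 is the code point in hex, field 2 the general category). The function name, the `ALLOWED` set, and the two embedded sample lines are illustrative assumptions: a real tool would read the actual file, and would also have to expand the `First`/`Last` range entries, which this sketch ignores.

```python
def parse_general_categories(lines):
    """Map code point -> general category from UnicodeData.txt-style lines."""
    categories = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)     # field 0: code point in hex
        categories[code_point] = fields[2]  # field 2: general category
    return categories

# Sample lines in the file's layout; only fields 0-2 are used here.
SAMPLE = [
    "0428;CYRILLIC CAPITAL LETTER SHA;Lu;0;L;;;;;N;;;;0448;",
    "2202;PARTIAL DIFFERENTIAL;Sm;0;ON;;;;;Y;;;;;",
]

# Hypothetical policy: categories a lexer might accept in identifiers.
ALLOWED = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

table = parse_general_categories(SAMPLE)
for cp, cat in sorted(table.items()):
    print(f"U+{cp:04X} category={cat} allowed={cat in ALLOWED}")
```

Keeping the policy (`ALLOWED`) separate from the data means a new Unicode release only requires regenerating the table, which is the point being made above.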

On Wednesday, 16 September 2020 at 07:38:26 UTC, Dominikus Dittes Scherkl wrote:
> We only need to define which properties a character need to be allowed in an identifier.

I think the following change in the grammar would be sufficient:

Identifier:
    IdentifierStart
    IdentifierStart IdentifierChars

IdentifierChars:
    IdentifierChar
    IdentifierChar IdentifierChars

IdentifierStart:
    _
    Any Unicode codepoint with general category Lu, Ll, Lt, Lo, Nl or No

IdentifierChar:
    IdentifierStart
    Any Unicode codepoint with general category Lm, Mn, Me, Mc or Nd
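
As a quick way to try the proposed grammar on sample names, it can be sketched as a predicate over general categories. This is only an illustration in Python using the standard `unicodedata` module, not DMD's lexer; the function names are invented for the example.

```python
import unicodedata

# Categories taken from the proposed grammar.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nl", "No"}
EXTRA_CHAR_CATEGORIES = {"Lm", "Mn", "Me", "Mc", "Nd"}

def is_identifier_start(ch: str) -> bool:
    """IdentifierStart: '_' or a code point in one of the start categories."""
    return ch == "_" or unicodedata.category(ch) in START_CATEGORIES

def is_identifier_char(ch: str) -> bool:
    """IdentifierChar: an IdentifierStart or one of the extra categories."""
    return (is_identifier_start(ch)
            or unicodedata.category(ch) in EXTRA_CHAR_CATEGORIES)

def is_identifier(name: str) -> bool:
    """Identifier: one IdentifierStart followed by zero or more IdentifierChars."""
    return (len(name) > 0
            and is_identifier_start(name[0])
            and all(is_identifier_char(ch) for ch in name[1:]))

for name in ["Шä", "x∂", "_tmp1", "1abc"]:
    print(name, is_identifier(name))
```

Under this reading, `Шä` and `_tmp1` are accepted, `1abc` is rejected because digits (Nd) may not start an identifier, and `x∂` is still rejected because ∂ has category Sm, which the proposal does not list.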

On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote:
> I wish to write a function including ∂x and ∂y (these are trivial to type with appropriate keyboard shortcuts - alt+d on Mac), but without a unicode byte order mark at the beginning of the file, the lexer rejects the tokens.
>
> It is not apparently easy to insert such marks (AFAICT no common tool does this specifically), while other languages work fine (i.e., accept unicode in their source) without it.
>
> Is there a downside to at least presuming UTF-8?

As you probably already know, BOM means byte order mark, so it is only relevant for multi-byte encodings (UTF-16, UTF-32). A BOM for UTF-8 isn't required, and in fact it's discouraged. Your editor should automatically insert a BOM if appropriate when you save your file. You probably need to select the appropriate encoding for your file. Typically this is available in the 'Save as..' dialog, or in the settings.
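
For anyone who wants to see what a BOM actually looks like in a file, here is a small sketch that inspects the leading bytes. It uses Python's standard `codecs` constants and says nothing about how DMD itself detects encodings; the `detect_bom` helper is made up for this example.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

source = "void ∂x() {}"
plain = source.encode("utf-8")         # UTF-8 without a BOM
with_bom = source.encode("utf-8-sig")  # UTF-8 with the EF BB BF signature

print(detect_bom(plain))      # None
print(detect_bom(with_bom))   # utf-8
print(with_bom[:3].hex())     # efbbbf
```

The `utf-8-sig` codec is one way to write (or strip) the UTF-8 signature when an editor does not offer the option.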

On Tuesday, 15 September 2020 at 16:23:01 UTC, Jon Degenhardt wrote:
> # The 'Ш' and 'ä' characters are fine.
> $ echo $'import std.stdio; void Шä() { writeln("Hello World!"); } void main() { Шä(); }' | dmd -run -
> Hello World!
>
> # But not '∂'
> $ echo $'import std.stdio; void x∂() { writeln("Hello World!"); } void main() { x∂(); }' | dmd -run -
> __stdin.d(1): Error: char 0x2202 not allowed in identifier

Yes. The same trouble occurs with widely used Greek symbols (Sigma, alpha and some others). Unfortunately...

On Wednesday, 16 September 2020 at 00:22:15 UTC, Steven Schveighoffer wrote:
> On 9/15/20 8:10 PM, James Blachly wrote:
>> On 9/15/20 10:59 AM, Steven Schveighoffer wrote:
>>>[...]
>>
>> Steve: It sounds as if the spec is correct but the glyph (codepoint?) range is outdated. If this is the case, it would be a worthwhile update. Do you really think it would be rejected out of hand?
>>
>
> I don't really know the answer, as I'm not a unicode expert.
>
> Someone should verify that the character you want to use for a symbol name is actually considered a letter or not. Using phobos to prove this is kind of self-defeating, as I'm pretty sure it would be in league with DMD if there is a bug.

I checked, it's not a letter. None of the math symbols are.

> But if it's not a letter, then it would take more than just updating the range. It would be a change in the philosophy of what constitutes an identifier name.
