Why is BOM required to use unicode in tokens? - D Programming Language Discussion Forum

September 14, 2020

Why is BOM required to use unicode in tokens?

Posted by James Blachly

I wish to write a function including ∂x and ∂y (these are trivial to type with appropriate keyboard shortcuts - alt+d on Mac), but without a unicode byte order mark at the beginning of the file, the lexer rejects the tokens.

It is not apparently easy to insert such marks (AFAICT no common tool does this specifically), while other languages work fine (i.e., accept unicode in their source) without it.

Is there a downside to at least presuming UTF-8?

September 15, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by Paul Backus
in reply to James Blachly

On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote:
> I wish to write a function including ∂x and ∂y (these are trivial to type with appropriate keyboard shortcuts - alt+d on Mac), but without a unicode byte order mark at the beginning of the file, the lexer rejects the tokens.
>
> It is not apparently easy to insert such marks (AFAICT no common tool does this specifically), while other languages work fine (i.e., accept unicode in their source) without it.
>
> Is there a downside to at least presuming UTF-8?

According to the spec [1] this should Just Work. I'd recommend filing a bug.

[1] https://dlang.org/spec/lex.html#source_text

September 14, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by H. S. Teoh
in reply to James Blachly

On Mon, Sep 14, 2020 at 09:49:13PM -0400, James Blachly via Digitalmars-d-learn wrote:
> I wish to write a function including ∂x and ∂y (these are trivial to
> type with appropriate keyboard shortcuts - alt+d on Mac), but without
> a unicode byte order mark at the beginning of the file, the lexer
> rejects the tokens.
> 
> It is not apparently easy to insert such marks (AFAICT no common tool
> does this specifically), while other languages work fine (i.e., accept
> unicode in their source) without it.
> 
> Is there a downside to at least presuming UTF-8?

Tested it locally, with and without BOM; either way the lexer rejects ∂ as not being a valid token. I suspect the reason has nothing to do with BOMs, but with the fact that ∂ is not classified as alphabetic (see std.uni.isAlpha, which returns false for ∂). The following code, which uses a Cyrillic letter as an identifier, compiles just fine without a BOM (std.uni.isAlpha('Ш') returns true):

	import std.stdio;

	void main() {
		int Ш = 1; // Cyrillic letter, accepted as an identifier
		writeln(Ш);
	}

As the docs for std.uni.isAlpha state, it tests for the general Unicode category 'Alphabetic'. Identifiers are probably restricted to characters of this category plus the numerics and '_' (and maybe one or two others, perhaps '$'? Don't remember now).
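
The isAlpha observation above can be reproduced with a few lines of D. This is only a minimal sketch: the choice of sample characters is an assumption made for illustration, and it shows what std.uni reports, not what the lexer itself consults.

```d
import std.stdio : writefln;
import std.uni : isAlpha;

void main()
{
    // Sample characters chosen for illustration: three letters and the
    // partial differential sign U+2202 from the original question.
    foreach (dchar c; "Шäδ∂")
        writefln("U+%04X isAlpha=%s", cast(uint) c, isAlpha(c));
}
```

Per the description above, the line for U+2202 is the one expected to report false.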

T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG

September 15, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by Jon Degenhardt
in reply to Paul Backus

On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
> On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote:
>> I wish to write a function including ∂x and ∂y (these are trivial to type with appropriate keyboard shortcuts - alt+d on Mac), but without a unicode byte order mark at the beginning of the file, the lexer rejects the tokens.
>>
>> It is not apparently easy to insert such marks (AFAICT no common tool does this specifically), while other languages work fine (i.e., accept unicode in their source) without it.
>>
>> Is there a downside to at least presuming UTF-8?
>
> According to the spec [1] this should Just Work. I'd recommend filing a bug.
>
> [1] https://dlang.org/spec/lex.html#source_text

Under the identifiers section (https://dlang.org/spec/lex.html#identifiers) it describes identifiers as:

> Identifiers start with a letter, _, or universal alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of the C99 Standard.

I was unable to find the definition of a "universal alpha", or whether that includes non-ascii alphabetic characters.

September 15, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by Dominikus Dittes Scherkl
in reply to Jon Degenhardt

On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote:
> On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
>> Identifiers start with a letter, _, or universal alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of the C99 Standard.
>
> I was unable to find the definition of a "universal alpha", or whether that includes non-ascii alphabetic characters.

ISO/IEC 9899:1999 (E)
Annex D

Universal character names for identifiers
-----------------------------------------

Latin: 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 05F0-05F2
Arabic: 0621-063A, 0640-0652, 0670-06B7, 06BA-06BE, 06C0-06CE, 06D0-06DC, 06E5-06E8, 06EA-06ED
Devanagari: 0901-0903, 0905-0939, 093E-094D, 0950-0952, 0958-0963
Bengali: 0981-0983, 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2, 09B6-09B9, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09DC-09DD, 09DF-09E3, 09F0-09F1
Gurmukhi: 0A02, 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33, 0A35-0A36, 0A38-0A39, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D, 0A59-0A5C, 0A5E, 0A74
Gujarati: 0A81-0A83, 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD-0AC5, 0AC7-0AC9, 0ACB-0ACD, 0AD0, 0AE0
Oriya: 0B01-0B03, 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 0B32-0B33, 0B36-0B39, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D, 0B5C-0B5D, 0B5F-0B61
Tamil: 0B82-0B83, 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9, 0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
Telugu: 0C01-0C03, 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D, 0C60-0C61
Kannada: 0C82-0C83, 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 0CB5-0CB9, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD, 0CDE, 0CE0-0CE1
Malayalam: 0D02-0D03, 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D3E-0D43, 0D46-0D48, 0D4A-0D4D, 0D60-0D61
Thai: 0E01-0E3A, 0E40-0E5B
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0-0EB9, 0EBB-0EBD, 0EC0-0EC4, 0EC6, 0EC8-0ECD, 0EDC-0EDD
Tibetan: 0F00, 0F18-0F19, 0F35, 0F37, 0F39, 0F3E-0F47, 0F49-0F69, 0F71-0F84, 0F86-0F8B, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093, 309B-309C
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
CJK Unified Ideographs: 4E00-9FA5
Hangul: AC00-D7A3
Digits: 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F33
Special characters: 00B5, 00B7, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029

-----------------------

This is outdated to the brim. Also it doesn't allow for letter-like symbols (which is debatable, but especially the mathematical ones like double-struck letters are intended for such use).
Instead of some old C-Standard, D should better rely directly on the properties from UnicodeData.txt, which is updated with every new unicode version.
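
To make the table above concrete, here is a small sketch of a range lookup in D. The helper name inAnnexDSample is hypothetical, and it holds only a handful of ranges copied from the list above, so it is not the full annex and not how the compiler is implemented. In the full list, the entries nearest to U+2202 are 2160-2182 and 3005-3007, so ∂ is not covered by any of them.

```d
import std.stdio : writefln;

/// Hypothetical helper: true if `c` lies in one of a few Annex D ranges
/// copied from the list above. Partial table, for illustration only.
bool inAnnexDSample(dchar c)
{
    static immutable uint[2][] ranges = [
        [0x00C0, 0x00D6], [0x00D8, 0x00F6], // Latin (partial)
        [0x03A3, 0x03CE],                   // Greek (partial)
        [0x040E, 0x044F],                   // Cyrillic (partial)
        [0x2160, 0x2182],                   // Special characters (partial)
    ];
    foreach (r; ranges)
        if (c >= r[0] && c <= r[1])
            return true;
    return false;
}

void main()
{
    // U+0428, U+00E4 and U+03B4 fall inside listed ranges; U+2202 does not.
    foreach (dchar c; "Шäδ∂")
        writefln("U+%04X listed=%s", cast(uint) c, inAnnexDSample(c));
}
```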

September 15, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by James Blachly
in reply to Dominikus Dittes Scherkl

On 9/15/20 4:36 AM, Dominikus Dittes Scherkl wrote:
> On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote:
>> On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
>>> Identifiers start with a letter, _, or universal alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of the C99 Standard.
>>
>> I was unable to find the definition of a "universal alpha", or whether that includes non-ascii alphabetic characters.
> 
> ISO/IEC 9899:1999 (E)
> Annex D
> 
> Universal character names for identifiers
> -----------------------------------------
...
> -----------------------
> 
> This is outdated to the brim. Also it doesn't allow for letter-like symbols (which is debatable, but especially the mathematical ones like double-struck letters are intended for such use).
> Instead of some old C-Standard, D should better rely directly on the properties from UnicodeData.txt, which is updated with every new unicode version.
> 

Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.

What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers?

James

September 15, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by Steven Schveighoffer
in reply to James Blachly

On 9/15/20 10:18 AM, James Blachly wrote:
> On 9/15/20 4:36 AM, Dominikus Dittes Scherkl wrote:
>> On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote:
>>> On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
>>>> Identifiers start with a letter, _, or universal alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of the C99 Standard.
>>>
>>> I was unable to find the definition of a "universal alpha", or whether that includes non-ascii alphabetic characters.
>>
>> ISO/IEC 9899:1999 (E)
>> Annex D
>>
>> Universal character names for identifiers
>> -----------------------------------------
> ....
>> -----------------------
>>
>> This is outdated to the brim. Also it doesn't allow for letter-like symbols (which is debatable, but especially the mathematical ones like double-struck letters are intended for such use).
>> Instead of some old C-Standard, D should better rely directly on the properties from UnicodeData.txt, which is updated with every new unicode version.
>>
> 
> Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.
> 
> What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers?

I'm thinking your issue will not be fixed (just like we don't allow $abc to be an identifier). But the spec can be fixed to refer to the correct standards.

-Steve

September 15, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by Jon Degenhardt
in reply to Steven Schveighoffer

On Tuesday, 15 September 2020 at 14:59:03 UTC, Steven Schveighoffer wrote:
> On 9/15/20 10:18 AM, James Blachly wrote:
>> What will it take (i.e. order of difficulty) to get this fixed -- will merely a bug report (and PR, not sure if I can tackle or not) do it, or will this require more in-depth discussion with compiler maintainers?
>
> I'm thinking your issue will not be fixed (just like we don't allow $abc to be an identifier). But the spec can be fixed to refer to the correct standards.

Looks like it has to do with the '∂' character specifically. Non-ASCII alphabetic characters generally work.

# The 'Ш' and 'ä' characters are fine.
$ echo $'import std.stdio; void Шä() { writeln("Hello World!"); } void main() { Шä(); }' | dmd -run -
Hello World!

# But not '∂'
$ echo $'import std.stdio; void x∂() { writeln("Hello World!"); } void main() { x∂(); }' | dmd -run -
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token

The difference is that 'Ш' and 'ä' satisfy the definition of a Unicode letter, while '∂' does not (using D's current Unicode definitions). I'll use tsv-filter (from tsv-utils) to show this rather than writing out the full D code; tsv-filter itself uses std.regex.matchFirst().

# The input
$ echo $'x\n∂\nШ\nä'
x
∂
Ш
ä

# The input filtered by Unicode letter '\p{L}'
$ echo $'x\n∂\nШ\nä' | tsv-filter --regex 1:'^\p{L}$'
x
Ш
ä

The spec could be made clearer and more accurate. But if a "universal alpha" is essentially a Unicode letter, then allowing the symbol you chose would take a change to the spec itself.
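
For reference, the same letter filter can be sketched directly in D with std.regex. The helper name isSingleLetter is hypothetical, and this is an approximation of the tsv-filter run above, not its actual source.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

/// Hypothetical helper: true if `s` is exactly one Unicode letter (\p{L}).
bool isSingleLetter(string s)
{
    return !matchFirst(s, regex(`^\p{L}$`)).empty;
}

void main()
{
    // Same four inputs as the shell example; only matching lines are printed.
    foreach (s; ["x", "∂", "Ш", "ä"])
        if (isSingleLetter(s))
            writeln(s);
}
```

If std.regex classifies these characters the way the tsv-filter output suggests, '∂' is the one input that gets filtered out.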

--Jon

September 15, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by Ola Fosheim Grøstad
in reply to James Blachly

On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote:
> I wish to write a function including ∂x and ∂y (these are

You can use the Greek letter delta instead:

δ
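
A minimal sketch of what that workaround might look like, assuming δ (U+03B4, inside the Greek range 03A3-03CE listed earlier in the thread) is accepted in identifiers. The function name δf_δx and the central-difference formula are illustrative assumptions, not something from the original post.

```d
import std.stdio : writeln;

// Hypothetical example: a central-difference estimate of the partial
// derivative of f with respect to x, spelled with δ because the lexer
// rejects the partial differential sign in identifiers.
double δf_δx(double function(double, double) f, double x, double y,
             double δx = 1e-6)
{
    return (f(x + δx, y) - f(x - δx, y)) / (2 * δx);
}

void main()
{
    static double f(double x, double y) { return x * x * y; }
    writeln(δf_δx(&f, 3.0, 2.0)); // analytically 2*x*y = 12
}
```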

September 16, 2020

Re: Why is BOM required to use unicode in tokens?

Posted by starcanopy
in reply to Ola Fosheim Grøstad

On Tuesday, 15 September 2020 at 21:27:25 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly wrote:
>> I wish to write a function including ∂x and ∂y (these are
>
> You can use the greek letter delta instead:
>
> δ

Wouldn't that imply a normal differential?

Copyright © 1999-2021 by the D Language Foundation