DMD: invalid UTF character `\U0000d800`

Nov 07, 2020

Per Nordlöw

Posted by Per Nordlöw

I'm writing a parser generator for ANTLR-grammars and have come across the rule

fragment Letter
    : [a-zA-Z$_] // these are below 0x7F
    | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
    | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
    ;

at

https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158

This rule is converted into

    Match m__Letter()
    {
        return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
                   not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
                   seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
    }

given suitable definitions of `alt`, `rng`, `seq`, and `not`.

This errors as

 CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
 CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
 CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
 CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
 CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
 CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff

Doesn't DMD support these Unicodes yet?

On Saturday, 7 November 2020 at 16:12:06 UTC, Per Nordlöw wrote:
>  CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
>  CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
>  CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
>  CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
>  CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
>  CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff
>
> Doesn't DMD support these Unicodes yet?

They're not valid: "The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points" [1].

"... the standard states that such arrangements should be treated as encoding errors" [1].

Perhaps they need to be combined with other code points to form a valid character.

[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

--
/Jacob Carlborg
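The point above can be checked from D itself. This is a minimal sketch, assuming Phobos' `std.utf.isValidDchar`, which is not mentioned in the thread; the values are built from integers because the escape form `'\uD800'` is exactly what the compiler rejects.

```d
import std.utf : isValidDchar;

void main()
{
    // Surrogate code points (U+D800 .. U+DFFF) are not valid Unicode scalar values.
    assert(!isValidDchar(cast(dchar) 0xD800));
    assert(!isValidDchar(cast(dchar) 0xDFFF));

    // The neighbours just outside the surrogate block are valid.
    assert(isValidDchar(cast(dchar) 0xD7FF));
    assert(isValidDchar(cast(dchar) 0xE000));

    // Values above U+10FFFF are rejected as well.
    assert(!isValidDchar(cast(dchar) 0x110000));
}
```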

On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
> [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

Thanks!

I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance,

    cast(dchar)0x0000d800

for

    `\U0000d800`

to accomplish this?

On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> cast(dchar)0x0000d800

To clarify,

    enum dch1 = cast(dchar)0xa0a0;
    enum dch2 = '\ua0a0';
    assert(dch1 == dch2);

works. Can I use the first variant if I want to postpone these encoding questions for now?

On 11/8/20 5:47 AM, Per Nordlöw wrote:
> On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
>> [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
>
> Thanks!
>
> I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance,
>
>     cast(dchar)0x0000d800
>
> for
>
>     `\U0000d800`
>
> to accomplish this?

Yes, use the cast. It should work. It's just the D grammar that is stopping you; a `dchar` is just an integer under the hood, so the cast should be fine.

-Steve
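A minimal sketch of the cast approach applied to range checks might look like the following. The helper `inRange` and the enum names are hypothetical, introduced only for illustration; they are not the `rng` from the original post, whose definition is not shown in the thread.

```d
// Hypothetical helper: true if `c` lies in the inclusive range [lo, hi].
bool inRange(dchar c, dchar lo, dchar hi) pure nothrow @safe @nogc
{
    return lo <= c && c <= hi;
}

// Writing '\uD800' is rejected by the lexer, but the same value
// built from an integer literal is accepted.
enum dchar highFirst = cast(dchar) 0xD800;
enum dchar highLast  = cast(dchar) 0xDBFF;
enum dchar lowFirst  = dchar(0xDC00); // constructor-style syntax also works
enum dchar lowLast   = dchar(0xDFFF);

void main()
{
    // For code points that are valid, the cast and the escape agree.
    static assert(cast(dchar) 0xA0A0 == '\ua0a0');

    assert(inRange(cast(dchar) 0xD900, highFirst, highLast));
    assert(inRange(cast(dchar) 0xDC00, lowFirst, lowLast));
    assert(!inRange('a', highFirst, highLast));
}
```

Note that the cast only sidesteps the literal check. A `dchar` decoded from valid UTF-8 or UTF-32 input never holds a surrogate value, so such ranges would only ever match if the parser works on raw UTF-16 code units.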

On 2020-11-08 13:39, Kagamin wrote:
> Surrogate pairs are used in rules because java strings are utf-16 encoded, it doesn't make much sense for other encodings.

D supports the UTF-16 encoding as well. The compiler doesn't accept the surrogate pairs even for UTF-16 strings.

--
/Jacob Carlborg
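To illustrate how surrogate pairs do show up in D's UTF-16 strings, here is a small sketch. It assumes only core language features: the supplementary character is written as a single `\U` escape and the compiler produces the two UTF-16 code units itself. The helper `isSurrogatePair` and the constant `grinning` are hypothetical names used for illustration.

```d
// Hypothetical helper: checks whether two UTF-16 code units form a
// high/low surrogate pair, using plain integer bounds.
bool isSurrogatePair(wchar hi, wchar lo) pure nothrow @safe @nogc
{
    return 0xD800 <= hi && hi <= 0xDBFF
        && 0xDC00 <= lo && lo <= 0xDFFF;
}

// U+1F600 lies above U+FFFF, so it needs two UTF-16 code units.
enum wstring grinning = "\U0001F600"w;

void main()
{
    assert(grinning.length == 2);
    assert(grinning[0] == 0xD83D); // high surrogate
    assert(grinning[1] == 0xDE00); // low surrogate
    assert(isSurrogatePair(grinning[0], grinning[1]));

    // A character inside the Basic Multilingual Plane stays a single unit.
    assert("a"w.length == 1);
}
```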

On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> Can I just do, for instance,
>
>     cast(dchar)0x0000d800
>
> for
>
>     `\U0000d800`
>
> to accomplish this?

There's also:

    dchar(0x0000d800)
