Thread overview
DMD: invalid UTF character `\U0000d800`
Nov 07, 2020
Per Nordlöw
Nov 07, 2020
Jacob Carlborg
Nov 08, 2020
Per Nordlöw
Nov 08, 2020
Per Nordlöw
Nov 08, 2020
Kagamin
Nov 08, 2020
Jacob Carlborg
Nov 09, 2020
Boris Carvajal
Nov 12, 2020
Per Nordlöw
November 07, 2020
I'm writing a parser generator for ANTLR-grammars and have come across the rule

    fragment Letter
        : [a-zA-Z$_] // these are below 0x7F
        | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
        | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
        ;

at

https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158

This rule is converted into

    Match m__Letter()
    {
        return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
                   not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
                   seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
    }

given suitable defs of alt, rng, seq, not.

This errors as

 CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
 CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
 CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
 CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
 CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
 CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff

Doesn't DMD support these Unicodes yet?
November 07, 2020
On Saturday, 7 November 2020 at 16:12:06 UTC, Per Nordlöw wrote:

>  CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
>  CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
>  CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
>  CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
>  CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
>  CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff
>
> Doesn't DMD support these Unicodes yet?

They're not valid:

"The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points" [1].

"... the standard states that such arrangements should be treated as encoding errors" [1].

Perhaps they need to be combined with other code points to form a valid character.
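
As a rough illustration of the point above (a sketch, not something taken from the compiler sources): a code point beyond U+FFFF is stored in a D `wstring` as two UTF-16 code units, one from each surrogate range, while a lone surrogate value is not a valid `dchar`. `std.utf.isValidDchar` is a Phobos function; the names `pair` and `isSurrogate` are made up for this example.

```d
import std.utf : isValidDchar;

// U+10000 is the first code point that needs a surrogate pair in UTF-16.
enum wstring pair = "\U00010000"w;

// Hypothetical helper: true for any value in the surrogate block U+D800..U+DFFF.
bool isSurrogate(dchar c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

unittest
{
    assert(pair.length == 2);                        // two UTF-16 code units
    assert(pair[0] == 0xD800 && pair[1] == 0xDC00);  // high, then low surrogate
    assert(isSurrogate(pair[0]) && isSurrogate(pair[1]));
    assert(!isValidDchar(cast(dchar) 0xD800));       // a lone surrogate is not a valid code point
    assert(isValidDchar('\U00010000'));              // the combined code point is
}
```

Compiling with `dmd -unittest -main` should run the checks.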

[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

--
/Jacob Carlborg


November 08, 2020
On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
> [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

Thanks!

I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance,

    cast(dchar)0x0000d800

for

    `\U0000d800`

to accomplish this?
November 08, 2020
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
>     cast(dchar)0x0000d800

To clarify,

    enum dch1 = cast(dchar)0xa0a0;
    enum dch2 = '\ua0a0';
    assert(dch1 == dch2);

works. Can I use the first variant if I want to postpone these encoding questions for now?
November 08, 2020
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> dchar

Surrogate pairs are used in the rules because Java strings are UTF-16 encoded; they don't make much sense for other encodings.
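
A minimal sketch of what that could mean for a parser that works on decoded code points (`dchar`) instead of UTF-16 code units: the surrogate alternatives of the `Letter` rule fall away, because a supplementary character is already a single `dchar`. The function name `isLetter` is hypothetical, and treating every valid non-ASCII code point as a letter is an assumption about the grammar's intent, not something the grammar states.

```d
import std.utf : isValidDchar;

// Hypothetical code-point version of the grammar's Letter fragment.
bool isLetter(dchar c)
{
    if (c <= 0x7F)
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || c == '$' || c == '_';
    // Above ASCII: accept any valid code point. isValidDchar rejects
    // the surrogate block, so no explicit surrogate ranges are needed.
    return isValidDchar(c);
}

unittest
{
    assert(isLetter('x') && isLetter('$') && isLetter('_'));
    assert(!isLetter('1'));
    assert(isLetter('\u00F6'));              // non-ASCII letter in the BMP
    assert(isLetter('\U0001F600'));          // beyond U+FFFF, still one dchar
    assert(!isLetter(cast(dchar) 0xD800));   // lone surrogate is rejected
}
```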
November 08, 2020
On 11/8/20 5:47 AM, Per Nordlöw wrote:
> On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
>> [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
> 
> Thanks!
> 
> I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance,
> 
>      cast(dchar)0x0000d800
> 
> for
> 
>      `\U0000d800`
> 
> to accomplish this?

Yes, use the cast. It should work.

It's just the D grammar that is stopping you; a dchar is just an integer under the hood, so the cast should be fine.
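
A small sketch of what that might look like for the generated ranges. The constant names and `inRange` are hypothetical stand-ins; the real `rng` from the generated parser is not shown in this thread.

```d
// Surrogate range bounds written as casts, because the character
// escapes '\uD800' etc. are rejected by the compiler.
enum dchar highFirst = cast(dchar) 0xD800;
enum dchar highLast  = cast(dchar) 0xDBFF;
enum dchar lowFirst  = cast(dchar) 0xDC00;
enum dchar lowLast   = cast(dchar) 0xDFFF;

// Hypothetical stand-in for rng(): inclusive range test on a code point.
bool inRange(dchar c, dchar lo, dchar hi)
{
    return lo <= c && c <= hi;
}

unittest
{
    // For values that are valid in a literal, cast and escape agree.
    static assert(cast(dchar) 0xA0A0 == '\uA0A0');
    assert(inRange(cast(dchar) 0xD9AB, highFirst, highLast));
    assert(inRange(cast(dchar) 0xDC00, lowFirst, lowLast));
    assert(!inRange('a', lowFirst, lowLast));
}
```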

-Steve
November 08, 2020
On 2020-11-08 13:39, Kagamin wrote:

> Surrogate pairs are used in the rules because Java strings are UTF-16 encoded; they don't make much sense for other encodings.

D supports the UTF-16 encoding as well. The compiler doesn't accept the surrogate pairs even for UTF-16 strings.

-- 
/Jacob Carlborg
November 09, 2020
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> Can I just do, for instance,
>
>     cast(dchar)0x0000d800
>
> for
>
>     `\U0000d800`
>
> to accomplish this?

There's also:

    dchar(0x0000d800)
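
A quick sketch comparing the two spellings; the names `viaCast` and `viaCall` are made up for illustration. As far as I understand the language spec, the call syntax only accepts values that convert implicitly to `dchar` (such as an integer literal that fits), whereas the cast accepts any integer expression.

```d
enum dchar viaCast = cast(dchar) 0xD800;  // explicit cast
enum dchar viaCall = dchar(0xD800);       // construction-style call syntax

// Both spell the same code unit value.
static assert(viaCast == viaCall);
static assert(viaCall == 0xD800);
```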
November 12, 2020
On Monday, 9 November 2020 at 16:39:49 UTC, Boris Carvajal wrote:
> There's also:
>
> dchar(0x0000d800)

Thanks