November 07, 2020 DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
I'm writing a parser generator for ANTLR grammars and have come across the rule

    fragment Letter
        : [a-zA-Z$_] // these are below 0x7F
        | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
        | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
        ;

at https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158

This rule is converted into

    Match m__Letter()
    {
        return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
                   not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
                   seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
    }

given suitable definitions of alt, rng, seq and not. Compiling this fails with

    CtoLexer_parser.d 665 57 error invalid UTF character \U0000d800
    CtoLexer_parser.d 665 67 error invalid UTF character \U0000dbff
    CtoLexer_parser.d 666 28 error invalid UTF character \U0000d800
    CtoLexer_parser.d 666 38 error invalid UTF character \U0000dbff
    CtoLexer_parser.d 666 53 error invalid UTF character \U0000dc00
    CtoLexer_parser.d 666 63 error invalid UTF character \U0000dfff

Doesn't DMD support these Unicode code points yet?
|
November 07, 2020 Re: DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
Posted in reply to Per Nordlöw | On Saturday, 7 November 2020 at 16:12:06 UTC, Per Nordlöw wrote:
> CtoLexer_parser.d 665 57 error invalid UTF character \U0000d800
> CtoLexer_parser.d 665 67 error invalid UTF character \U0000dbff
> CtoLexer_parser.d 666 28 error invalid UTF character \U0000d800
> CtoLexer_parser.d 666 38 error invalid UTF character \U0000dbff
> CtoLexer_parser.d 666 53 error invalid UTF character \U0000dc00
> CtoLexer_parser.d 666 63 error invalid UTF character \U0000dfff
>
> Doesn't DMD support these Unicodes yet?
They're not valid:
"The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points" [1].
"... the standard states that such arrangements should be treated as encoding errors" [1].
Perhaps they need to be combined with other code points to form a valid character.
[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
-- /Jacob Carlborg
|
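As a small illustration of the point above, Phobos' `std.utf.isValidDchar` treats the whole surrogate block U+D800 to U+DFFF as invalid, while the code points immediately around it are accepted. This is only a sketch; the helper name `isSurrogate` is made up for the example and is not part of the thread or of Phobos.

```d
import std.utf : isValidDchar;

/// Hypothetical helper: true if `c` lies in the surrogate block U+D800..U+DFFF.
bool isSurrogate(dchar c) pure nothrow @safe @nogc
{
    return c >= 0xD800 && c <= 0xDFFF;
}

void main()
{
    // The values are built with a cast because the escape '\uD800'
    // is rejected by the compiler, as the errors above show.
    assert(!isValidDchar(cast(dchar) 0xD800)); // first high surrogate
    assert(!isValidDchar(cast(dchar) 0xDFFF)); // last low surrogate
    assert(isValidDchar('\uD7FF'));            // just below the block
    assert(isValidDchar('\uE000'));            // just above the block

    assert(isSurrogate(cast(dchar) 0xDBFF));
    assert(!isSurrogate('\uE000'));
}
```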
November 08, 2020 Re: DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jacob Carlborg | On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
> [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
Thanks!
I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance,
cast(dchar)0x0000d800
for
`\U0000d800`
to accomplish this?
|
November 08, 2020 Re: DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
Posted in reply to Per Nordlöw | On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> cast(dchar)0x0000d800
To clarify,
enum dch1 = cast(dchar)0xa0a0;
enum dch2 = '\ua0a0';
assert(dch1 == dch2);
works. Can I use the first variant if I want to postpone these encoding questions for now?
|
November 08, 2020 Re: DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
Posted in reply to Per Nordlöw | On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> dchar
Surrogate pairs are used in the rules because Java strings are UTF-16 encoded; they don't make much sense for other encodings.
|
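To make this concrete: if the generated parser works on decoded `dchar` values instead of UTF-16 code units, the `Letter` rule can be written without mentioning surrogates at all. The following is a sketch of one possible reading, not a drop-in replacement; `isLetter` is a hypothetical name, and unlike the original ANTLR rule it also rejects lone low surrogates.

```d
import std.utf : isValidDchar;

/// Hypothetical dchar-based version of the Letter rule.
bool isLetter(dchar c) pure nothrow @safe @nogc
{
    // First alternative: [a-zA-Z$_]
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '$' || c == '_')
        return true;
    // Remaining alternatives: any valid code point above 0x7F.
    // Code points U+10000 and above need no special case once decoded.
    return c > 0x7F && isValidDchar(c);
}

void main()
{
    assert(isLetter('x'));
    assert(isLetter('$'));
    assert(!isLetter('1'));
    assert(isLetter('\u00E9'));            // above 0x7F, inside the BMP
    assert(isLetter('\U0001F600'));        // above U+FFFF
    assert(!isLetter(cast(dchar) 0xD800)); // lone surrogate

    // foreach with a dchar loop variable decodes the UTF-8 string on the fly.
    foreach (dchar c; "a\u00F1_\U00010437")
        assert(isLetter(c));
}
```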
November 08, 2020 Re: DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
Posted in reply to Per Nordlöw | On 11/8/20 5:47 AM, Per Nordlöw wrote:
> On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
>> [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
>
> Thanks!
>
> I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance,
>
> cast(dchar)0x0000d800
>
> for
>
> `\U0000d800`
>
> to accomplish this?
Yes, use the cast. It should work.
It's just the D grammar that is stopping you; a dchar is just an integer under the hood, so the cast should be fine.
-Steve
|
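A minimal sketch of the cast approach, assuming the range bounds are only ever compared against other `dchar` values and never encoded into a string. The names `inRange`, `highLo`, `highHi`, `lowLo` and `lowHi` are illustrative and do not come from the thread.

```d
/// Hypothetical helper: true if `c` lies in the inclusive range [lo, hi].
bool inRange(dchar c, dchar lo, dchar hi) pure nothrow @safe @nogc
{
    return lo <= c && c <= hi;
}

// Bounds built from integer literals; writing '\uD800' here would not compile.
enum dchar highLo = cast(dchar) 0xD800;
enum dchar highHi = cast(dchar) 0xDBFF;
enum dchar lowLo  = cast(dchar) 0xDC00;
enum dchar lowHi  = cast(dchar) 0xDFFF;

void main()
{
    static assert(highLo == 0xD800); // the cast is evaluated at compile time

    assert(inRange(cast(dchar) 0xD9AB, highLo, highHi));
    assert(!inRange('a', highLo, highHi));
    assert(inRange(cast(dchar) 0xDC00, lowLo, lowHi));
    assert(!inRange(cast(dchar) 0xDC00, highLo, highHi));
}
```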
November 08, 2020 Re: DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kagamin | On 2020-11-08 13:39, Kagamin wrote:
> Surrogate pairs are used in rules because java strings are utf-16 encoded, it doesn't make much sense for other encodings.
D supports the UTF-16 encoding as well. The compiler doesn't accept the surrogate pairs even for UTF-16 strings.
-- /Jacob Carlborg
|
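For reference, a short sketch of how surrogates do show up in D's UTF-16 strings: the source spells the full code point, and the compiler emits the surrogate pair as two `wchar` code units. The variable names are illustrative only.

```d
void main()
{
    // U+10000 is written as one escape; the `w` suffix requests UTF-16.
    wstring w = "\U00010000"w;

    assert(w.length == 2);  // two UTF-16 code units
    assert(w[0] == 0xD800); // high surrogate
    assert(w[1] == 0xDC00); // low surrogate

    // Iterating with a dchar loop variable decodes the pair back to one code point.
    size_t count;
    foreach (dchar c; w)
    {
        assert(c == 0x10000);
        ++count;
    }
    assert(count == 1);
}
```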
November 09, 2020 Re: DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
Posted in reply to Per Nordlöw | On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> Can I just do, for instance,
>
> cast(dchar)0x0000d800
>
> for
>
> `\U0000d800`
>
> to accomplish this?
There's also:
dchar(0x0000d800)
|
November 12, 2020 Re: DMD: invalid UTF character `\U0000d800` | ||||
---|---|---|---|---|
| ||||
Posted in reply to Boris Carvajal | On Monday, 9 November 2020 at 16:39:49 UTC, Boris Carvajal wrote:
> There's also:
>
> dchar(0x0000d800)
Thanks
|
Copyright © 1999-2021 by the D Language Foundation