Thread overview
Handling of U+2028 and U+2029 in source code
Oct 15, 2023
kdevel
Oct 16, 2023
kdevel
Oct 17, 2023
Walter Bright
Oct 17, 2023
deadalnix
Oct 18, 2023
Walter Bright
October 15, 2023

According to [1] U+2028 and U+2029 are considered end-of-line characters. Does this make sense?

```
$ cat lsps.d
void main ()
{
   enum b = 8;
   mixin ("enum a1 =\u2028b; pragma (msg, a1);");
   mixin ("enum a2\u2028= b; pragma (msg, a2);");
   mixin ("enum\u2028a3 = b; pragma (msg, a3);");
}
$ dmd lsps.d
8
lsps.d-mixin-5(5): Error: char 0x2028 not allowed in identifier
lsps.d-mixin-6(6): Error: char 0x2028 not allowed in identifier
```

[1] https://dlang.org/spec/lex.html#end_of_line

October 16, 2023
Based upon how the identifier tokenization occurs greedily, yes it makes sense.
October 16, 2023
On Monday, 16 October 2023 at 02:45:17 UTC, Richard (Rikki) Andrew Cattermole wrote:
> Based upon how the identifier tokenization occurs greedily, yes it makes sense.

The error message is confusing, compare with this code, using returns:

```
$ cat ret.d
void main ()
{
   enum b = 8;
   mixin ("enum a1 =\rb; pragma (msg, a1);");
   mixin ("enum a2\r= b; pragma (msg, a2);");
   mixin ("enum\ra3 = b; pragma (msg, a3);");
}
$ dmd ret.d
8
8
8
```

Why are U+2028 and U+2029 handled unlike \r and \n?
October 17, 2023
https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578

Basically, when it's in a multi-byte UTF-8 character, it checks whether the character is in the non-ASCII character ranges. No special handling of newlines is provided, but there probably should be.
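
To make the point above concrete, here is a minimal sketch of an identifier scanner that treats U+2028 and U+2029 like `\n` and simply ends the identifier there. It is not the code from dmd's `lexer.d`: `isUnicodeEndOfLine` and `identifierLength` are hypothetical names, and `std.uni.isAlpha` only stands in for whatever tables the real lexer consults. The output in the original post is consistent with LS being accepted between tokens but rejected when it directly follows identifier characters, which is the case this sketch handles differently.

```d
import std.format : format;
import std.uni : isAlpha;
import std.utf : decode;

/// True for the two non-ASCII code points the D spec lists as EndOfLine.
bool isUnicodeEndOfLine(dchar c) pure nothrow @nogc @safe
{
    return c == '\u2028' || c == '\u2029';
}

/// Hypothetical helper: length in bytes of the identifier at the start of `src`.
/// LS/PS end the identifier silently; any other non-ASCII code point that is
/// not alphabetic is reported as an error.
size_t identifierLength(string src)
{
    size_t i = 0;
    while (i < src.length)
    {
        size_t next = i;
        const dchar c = decode(src, next); // advances `next` past one code point

        if (c < 0x80)
        {
            // ASCII: letters and '_' anywhere, digits only after the first character.
            const bool ok = c == '_' || isAlpha(c) || (i > 0 && c >= '0' && c <= '9');
            if (!ok)
                break;
        }
        else if (isUnicodeEndOfLine(c))
        {
            break; // assumption: behave exactly as for '\r' or '\n'
        }
        else if (!isAlpha(c))
        {
            throw new Exception(
                format("char 0x%04x not allowed in identifier", cast(uint) c));
        }
        i = next;
    }
    return i;
}

unittest
{
    // With this rule, LS after `a2` behaves the same as '\r' after `a2`.
    assert(identifierLength("a2\u2028= b") == 2);
    assert(identifierLength("a2\r= b") == 2);
}
```

Under this rule the second and third mixins from the first post would tokenize the same way as their `\r` counterparts.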
October 17, 2023
On 10/16/2023 5:37 PM, Richard (Rikki) Andrew Cattermole wrote:
> https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578
> 
> Basically, when it's in a multi-byte UTF-8 character, it checks whether the character is in the non-ASCII character ranges. No special handling of newlines is provided, but there probably should be.

Yes, please file a bugzilla!
October 17, 2023
On Tuesday, 17 October 2023 at 00:37:41 UTC, Richard (Rikki) Andrew Cattermole wrote:
> https://github.com/dlang/dmd/blob/master/compiler/src/dmd/lexer.d#L578
>
> Basically, when it's in a multi-byte UTF-8 character, it checks whether the character is in the non-ASCII character ranges. No special handling of newlines is provided, but there probably should be.

I've noticed that in the past, but this is clearly wrong. It's not just whitespace; it's also punctuation, emoji, and a ton of other things that are simply not identifier characters.

The lexer should match the proper charset as a character start.
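
One way to read this suggestion is that the lexer decides what a token is from the class of its first code point, instead of assuming every non-ASCII character belongs to an identifier. The sketch below is only an illustration of that idea: `StartKind` and `classifyStart` are hypothetical names, and the Phobos predicates `std.uni.isAlpha` and `std.uni.isWhite` are assumptions standing in for whichever character set the language would actually specify (the D spec's own whitespace set is narrower than `isWhite`).

```d
import std.uni : isAlpha, isWhite;

/// Hypothetical classification of the first code point of a token.
enum StartKind { identifier, endOfLine, whitespace, other }

StartKind classifyStart(dchar c)
{
    // Line terminators first, including the two non-ASCII ones.
    if (c == '\n' || c == '\r' || c == '\u2028' || c == '\u2029')
        return StartKind.endOfLine;
    if (isWhite(c))
        return StartKind.whitespace;
    if (c == '_' || isAlpha(c))
        return StartKind.identifier;
    // Digits, punctuation, emoji, and other symbols do not start an identifier.
    return StartKind.other;
}

unittest
{
    assert(classifyStart('\u00e9') == StartKind.identifier); // é is alphabetic
    assert(classifyStart('\U0001F600') == StartKind.other);  // emoji
}
```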
October 18, 2023
https://issues.dlang.org/show_bug.cgi?id=24190
October 17, 2023
On 10/17/2023 4:18 PM, deadalnix wrote:
> this is clearly wrong.

I blame my parents.