On Thursday, 25 November 2021 at 10:41:05 UTC, Rumbu wrote:
> Well:
> #line IntegerLiteral Filespec? EndOfLine
> Having EndOfLine at the end means for me that there are no other EOLs between, otherwise this syntax should pass but it's not (DMD last):
> #line 12
> "source.d"
The lexical grammar section starts with:
> The source text is decoded from its source representation into Unicode Characters. The Characters are further divided into: WhiteSpace, EndOfLine, Comments, SpecialTokenSequences, and Tokens, with the source terminated by an EndOfFile.
What it fails to mention is that, in the lexical grammar rules, a space between two symbols denotes 'immediate concatenation' of the characters/rules before and after it, e.g.:
DecimalDigits:
DecimalDigit
DecimalDigit DecimalDigits
3 1 4 is not a single IntegerLiteral; it needs to be 314.
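To make the 'immediate concatenation' reading concrete, here is a minimal sketch of how a lexer would group decimal digits. The function name decimalRuns is made up for illustration, and it only handles plain decimal digits (no underscores, suffixes, or other bases):

```d
import std.ascii : isDigit;

/// Hypothetical helper: returns every maximal run of adjacent decimal
/// digits in `src`. Anything that is not a digit ends the current run.
string[] decimalRuns(string src)
{
    string[] runs;
    size_t i = 0;
    while (i < src.length)
    {
        if (!isDigit(src[i]))
        {
            ++i; // not part of a literal, move on
            continue;
        }
        size_t start = i;
        while (i < src.length && isDigit(src[i]))
            ++i; // digits must be directly adjacent to belong together
        runs ~= src[start .. i];
    }
    return runs;
}
```

With this reading, "314" yields one run, while "3 1 4" yields three separate ones.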
Now in the parsing grammar, it should mention that spaces denote immediate concatenation of Tokens, with arbitrary Comments and WhiteSpace in between. So the rule:
AtAttribute:
@ nogc
Means: an @ token, followed by arbitrary comments and whitespace, followed by an identifier token that equals "nogc". That explains your first example.
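As a small illustration of that reading, the following should be accepted if the parser really treats @ and nogc as two separate tokens. The function name f is just a placeholder, and I have not checked this against every compiler version:

```d
// An @ token, then a comment and a line break, then the identifier "nogc".
@ /* a comment between the two tokens */
  nogc
void f() {}
```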
Regarding this lexical rule:
#line IntegerLiteral Filespec? EndOfLine
This is already wrong from a lexical standpoint: read literally, it would suggest that a SpecialTokenSequence looks like this:
#line10"file"
The implementation actually looks for a # token, skips WhiteSpace and Comments, and looks for an identifier token ("line"). It then goes into a custom loop that allows separation by WhiteSpace but not by Comments. The first '\n' is also assumed to be the final EndOfLine, which is why this fails:
#line 12
"source.d"
It thinks it's done after "12".
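Here is a rough sketch of that behavior, not dmd's actual code. The names LineDirective and scanAfterLine are invented, the input is assumed to start right after the "line" identifier, and the sketch ignores '\r', the #line default form, and error reporting:

```d
import std.ascii : isDigit;

/// Result of the hypothetical scan below.
struct LineDirective
{
    bool ok;      // true if the directive was well-formed
    size_t line;  // value of the IntegerLiteral
    string file;  // Filespec without its quotes, empty if absent
    string rest;  // source text left after the terminating '\n'
}

/// Scans the text that follows `#line`.
LineDirective scanAfterLine(string src)
{
    LineDirective r;
    size_t i = 0;

    // Skips spaces and tabs only. A '\n' is never skipped here,
    // and comments are not recognized at all.
    void skipBlanks()
    {
        while (i < src.length && (src[i] == ' ' || src[i] == '\t'))
            ++i;
    }

    skipBlanks();
    if (i >= src.length || !isDigit(src[i]))
        return r; // no line number found
    while (i < src.length && isDigit(src[i]))
    {
        r.line = r.line * 10 + (src[i] - '0');
        ++i;
    }
    r.ok = true;

    skipBlanks();
    if (i < src.length && src[i] == '"')
    {
        size_t start = ++i;
        while (i < src.length && src[i] != '"' && src[i] != '\n')
            ++i;
        if (i >= src.length || src[i] != '"')
        {
            r.ok = false; // unterminated Filespec
            return r;
        }
        r.file = src[start .. i];
        ++i;
        skipBlanks();
    }

    // The first '\n' is taken as the final EndOfLine.
    if (i < src.length && src[i] == '\n')
        ++i;
    else if (i < src.length)
        r.ok = false; // stray characters before the end of the line
    r.rest = src[i .. $];
    return r;
}
```

Given " 12\n\"source.d\"\n", this returns line 12 with an empty file, and the quoted string is left in rest for the ordinary lexer to deal with.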
In conclusion the specification should:
- define the notation used in lexical / parsing grammar blocks
- clearly distinguish lexical / parsing blocks
- fix up the SpecialTokenSequence definition (and maybe change dmd as well)
By the way, the parsing grammar defines:
LinkageType:
C
C++
D
Windows
System
Objective-C
C++ and Objective-C cannot currently be single tokens, so they are actually two and three tokens respectively (C and ++; Objective, - and C), which is why these are allowed:
extern(C
++)
void f() {}
extern(Objective
-
C)
void g() {}
This should also be fixed in the spec.
> I am not asking this questions out of thin air, I am trying to write a conforming lexer and this is one of the ambiguities.
That's cool! Are you writing an editor plugin?