September 12, 2013
On 2013-09-11 21:57, Piotr Szturmaj wrote:

> Delphi designers realized this problem years ago and they came up with a
> solution:
> http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers
>
>
> Basically, Delphi allows escaping reserved identifiers with a '&'. I
> wonder how D solves that problem when interfacing to COM classes if they
> have, for example, a function named "scope".

Scala does it as well: `keyword`, if I recall correctly. It seems like you can put basically anything between the backticks in Scala.

-- 
/Jacob Carlborg
September 12, 2013
On 2013-09-12 00:36, Martin Nowak wrote:

> Also a convenience function that reads a file and processes UTF BOM
> marks would be nice (see toUtf8
> https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429),
> but that could as well fit somewhere else into phobos.

Sounds like that would fit in std.file or similar.
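Something like this, perhaps (just a sketch; readUtf8 is a made-up name, and a complete version would also detect and transcode UTF-16/32 BOMs like the linked toUtf8 does):

string readUtf8(string path)
{
    import std.file : read;

    auto bytes = cast(ubyte[]) read(path);
    // The UTF-8 BOM is the byte sequence EF BB BF.
    if (bytes.length >= 3
        && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        bytes = bytes[3 .. $];
    return cast(string) bytes.idup;
}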

-- 
/Jacob Carlborg
September 12, 2013
On 2013-09-11 17:01, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott

Finally :)

* How does it handle errors? Just return TokenType.invalid?

* Personally I think the module is too big. I would go with:

- std.d.lexer.token
- std.d.lexer.tokentype
- std.d.lexer.lexer - contains the rest
- std.d.lexer.config - IterationStyle, TokenStyle, LexerConfig
- CircularRange, StringCache - possibly put somewhere else; I assume these can be used for things other than lexing?
- Trie-related code - same as above

* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.

If we do decide it should throw, it should absolutely _not_ throw a plain Exception. Create a new type, LexException or similar. I hate when code throws plain Exceptions; it makes them useless to catch selectively.

I would also expect this LexException to contain a Token. One shouldn't need to parse the exception message to get line and column information.
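Something along these lines (just a sketch, assuming the Token type carries line and column information):

class LexException : Exception
{
    Token token; // the offending token, including line and column

    this(string msg, Token token,
        string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
        this.token = token;
    }
}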

* I like that you overall use clear and descriptive variable and function names. Except "sbox": https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L3265

* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table?

* I would like to see a short description for each unit test of what it's testing. Personally I have started with this style:

@describe("byToken")
{
    @context("valid string literal")
    {
        @it("should return a token with the type TokenType.stringLiteral") unittest
        {
            // test
        }

        @it("should return a token with the correct lexeme") unittest
        {
            // test
        }
    }
}

Better formatted: http://pastebin.com/Dx78Vw6r

People here might think that would be a bit too verbose. The following would be ok as well:

@describe("short description of the unit test") unittest { }

* Could you remove debug code and other code that is commented out:

- 344
- 1172
- 1226, is that needed?
- 3165-3166
- 3197-3198
- 3392
- 3410
- 3434

Spelling errors:

* "forwarad" - 292
* "commemnt" - 2031
* "sentenels" - 299
* "messsage" - 301
* "underliying" - 2454
* "alloctors" - 3230
* "strightforward" - 2276

-- 
/Jacob Carlborg
September 12, 2013
On Thursday, 12 September 2013 at 06:17:04 UTC, Walter Bright wrote:
> On 9/11/2013 10:10 PM, deadalnix wrote:
>> See my comment; it is possible, with increased parser complexity, to handle many
>> cases where you don't know what you are parsing yet. Doing so, lookahead is only
>> required to find the matching closing token. I suspect that a fast path in the
>> lexer for that precise use case may be faster than buffering tokens, as it
>> allows saving one branch per token.
>
> I don't believe that, because you can see just about anything in the lookahead tokens and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.

I plan to, but you know how it is: the best optimization is the one that takes you from a non-working to a working state.
September 12, 2013
I got some time to work on the lexer this evening. Changeset here: https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d

The DDoc page has moved here: http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html

* There are a few more unit tests now
* bitAnd renamed to amp
* slice renamed to dotdot
* Much more cross-referencing in the doc comments
* Start line and column can be specified in the lexer config
September 12, 2013
On 2013-09-12 10:15, Brian Schott wrote:
> I got some time to work on the lexer this evening. Changeset
> here:
> https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d
>
> The DDoc page has moved here:
> http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html
>
> * There are a few more unit tests now
> * bitAnd renamed to amp
> * slice renamed to dotdot
> * Much more cross-referencing in the doc comments
> * Start line and column can be specified in the lexer config
>

Problem: many occurrences of the same string.

You should use constants for the tokens (and the other repeated strings):

enum asm_token = "asm"; // enum, so it is usable at compile time
...

immutable(string[TokenType.max + 1]) tokenValues = [
  ...
  asm_token
  ...
];

and reuse these constants in your "optimization"

maybe you can replace these lines with something that gets fed asm_token and gives the same result, but without 'a' and "sm" being split into separate magic values - maybe a nice template or subrange, as sketched below the snippet...

case 'a': if (input[1..$].equal("sm")) return TokenType.asm_; else
...
break;
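For example (restMatches is just a made-up name):

// A sketch: derive both the case label and the tail comparison from
// the single constant, so 'a' and "sm" never appear as separate values.
bool restMatches(string keyword)(string input)
{
    return input.length == keyword.length
        && input[1 .. $] == keyword[1 .. $];
}

// case asm_token[0]: if (restMatches!asm_token(input)) return TokenType.asm_;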

And reuse these constants in your unit tests as well, for example in the "auto expected =" arrays.


September 12, 2013
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:
> On 9/11/2013 6:30 PM, deadalnix wrote:
>> Indeed. What solution do you have in mind ?
>
> The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

I think this is the right approach. It could probably be another function that we put into std.range and reuse for other lexers/parsers. The lexer or the parser should not be made more complex for this.
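A rough sketch of what such a wrapper could look like (Lookahead and peek are made-up names; nothing like this is in std.range today):

import std.container;
import std.range;

// Buffers already-read elements so a parser can peek arbitrarily far
// ahead without the lexer knowing anything about it.
struct Lookahead(R) if (isInputRange!R)
{
    private R source;
    private DList!(ElementType!R) buffer; // saved lookahead tokens
    private size_t buffered;

    this(R source) { this.source = source; }

    bool empty() { return buffered == 0 && source.empty; }

    ElementType!R front()
    {
        fill(1);
        return buffer.front;
    }

    void popFront()
    {
        fill(1);
        buffer.removeFront();
        --buffered;
    }

    // Look n elements ahead; peek(0) is front.
    ElementType!R peek(size_t n)
    {
        fill(n + 1);
        return buffer[].drop(n).front;
    }

    private void fill(size_t n)
    {
        while (buffered < n)
        {
            buffer.insertBack(source.front);
            source.popFront();
            ++buffered;
        }
    }
}

The parser would wrap the token range once; the lexer itself stays unchanged.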
September 12, 2013
just an idea regarding your string[TokenType.max + 1] in

immutable(string[TokenType.max + 1]) tokenValues = [..]

it seems that you are trying to reduce memory usage

wouldn't it be a nice idea to generate a combined immutable string at compile time, like this one

"...pragmaexportpackageprivate..."

and generate string slice accesses?

immutable string big_one = generated_from(token_list);
immutable string export_token = big_one[10 .. 16]; // start and length are illustrative
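A sketch of how that generation could look with CTFE (all names made up):

// Concatenate all token strings at compile time and compute each
// token's offset, so every value is a slice of one big string.
string generated_from(in string[] tokens) pure
{
    string result;
    foreach (t; tokens)
        result ~= t;
    return result;
}

size_t[] offsets_of(in string[] tokens) pure
{
    size_t[] result = [0];
    foreach (t; tokens)
        result ~= result[$ - 1] + t.length;
    return result;
}

enum token_list = ["pragma", "export", "package", "private"];
enum big_one = generated_from(token_list);
enum offsets = offsets_of(token_list);
// token_list[i] is big_one[offsets[i] .. offsets[i + 1]]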


September 12, 2013
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
> The most important goal of this review is to determine any API / design problems. Any internal implementation tweaks may happen after inclusion into Phobos, but it is important to ensure that no breaking changes will be required any time soon after the module gets wider usage.
>

One quick remark: we need some kind of value provider that can be reused across different lexing passes, and that can be used outside the lexer.

If I process a module and have to kick off a new lexing phase because of a mixin, I want to generate identifiers out of the same pool.
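Something like an interning pool whose lifetime is independent of any single lexer run (a sketch; IdentifierPool is a made-up name, and StringCache in the module under review plays a similar role):

struct IdentifierPool
{
    private string[const(char)[]] interned;

    // Returns the pooled copy, so identical identifiers from
    // different lexing passes share the same storage.
    string intern(const(char)[] s)
    {
        if (auto p = s in interned)
            return *p;
        auto copy = s.idup;
        interned[copy] = copy;
        return copy;
    }
}

// auto a = pool.intern("foo"); // module pass
// auto b = pool.intern("foo"); // later mixin pass
// assert(a is b);              // same storage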
September 12, 2013
On 09/12/2013 12:09 AM, Manfred Nowak wrote:
> Walter Bright wrote:
>
>> Since the very beginning.
>>
>> One example is determining if something is a declaration or an expression.
> I see now that you wrote about parsing---not about lexing.
>
> Btw. I wrote an LALR(1)-parser for an early version of D. This means a lookahead of one was sufficient---or I made terrible mistakes.
>
> -manfred
I had problems with it, especially with

IdentifierList:
    Identifier . IdentifierList
    TemplateInstance . IdentifierList

And Bison also complained.