April 14, 2014
On Sunday, 16 March 2014 at 22:13:16 UTC, Martin Nowak wrote:
> On 02/22/2014 09:31 PM, "Marc Schütz" <schuetzm@gmx.net> wrote:
>> But that still doesn't explain why a custom hash table implementation is
>> necessary. Maybe a lightweight wrapper around built-in AAs is sufficient?
>
> I'm also wondering what benefit this hash table provides.

The hash table was added by Dmitry a while back. I assume it's because he had numbers to back it up.

April 14, 2014
On Monday, 14 April 2014 at 20:43:58 UTC, Alix Pexton wrote:
> I know the official review period is long past but I'd not had a good look at this module until this past weekend.
>
> Last year I had been working on my own xml lexer/parser, but as of yet I have nothing to show for it, so I took a look at this proposal with an eye towards using it to make my efforts easier.
>
> Andrei's posts about the possible design of a generic lexer had also influenced me, so I was expecting to find similarities between this module and my own work, albeit with the added benefits of being generic (in the good way). I have, however, found it very difficult to understand much of it, which I entirely put down to my own deficiencies with templates and especially the use of mixins.
>
> In the example Dlang lexer, the constructor takes a ubyte[] as input and wraps it in a LexerRange struct which defines the normal input range primitives as well as various functions for lookahead. It is not documented whether the lexer needs these extra features or if they are only provided for use within the tokenising functions that are supplied to the template by the user. If they are used by the core of the lexer, then it would seem to preclude the use of any other type of input that cannot be coerced into a ubyte[], without effort on the part of the user to implement the same interface.

ubyte[] is required for the lexer to work quickly. That lexer range is designed to keep track of the column and line numbers.

> I think the description of the functionality required of the tokenSeparatingFunction that the user must supply needs to be much better. If I understand correctly, it is intended to differentiate between keywords and identifiers which begin with a keyword. The more I think about this, the less certain I am.

That function's purpose is to determine if the current code unit signifies the end of an identifier/keyword. When lexing "fortunate", the lexer would spot "for", and then call the separating function with the index of "u". In the case of D, it would say that this does NOT end a word, and "fortunate" should be lexed as an identifier. If it was called with an index of a space or parenthesis, it should return true.
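In code, the test might look something like this (a minimal sketch with invented names and an invented signature, just to show the logic):

    import std.ascii : isAlphaNum;

    // Sketch of a token-separating predicate for D-like identifiers;
    // `data` and this signature are made up for illustration.
    bool isSeparating(const(ubyte)[] data, size_t offset)
    {
        if (offset >= data.length)
            return true;                     // end of input ends the word
        immutable c = cast(char) data[offset];
        return !(isAlphaNum(c) || c == '_'); // '_' continues an identifier
    }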

> When the lexer dispatches to a token handler, the front of the range is left pointing to the beginning of the character sequence that was matched, allowing it to be included in the returned token. However, many of the handlers in the example Dlang lexer begin with a number of blind popFront calls to jump to the end of that match. I am aware that in well-meaning code this is a case of the range being logically !empty, but I also wonder how often it might get overlooked when two matches of different lengths are dispatched to the same handler. (I had a similar situation in my own code, and my solution was to have a variable storing the .save of my input range and count how many chars were consumed since it was updated. This way I could either return the whole match or part of it in the token, or discard it and include only what came after it.) As there has been some contention about the correct use of the range primitives of late, I will refrain from making any other comment on their use in this module, especially as I am no longer sure that I have been using them correctly myself.

If more than one prefix is dispatched to the same handler, the handler cannot blindly call popFront or popFrontN. If the lexer author knows that the handler will only be called in certain places, then checking !empty is a waste of cycles. I don't think it's possible for the lexer generator to enforce this.
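For example (a contrived sketch with illustrative names, not code from the module): a handler registered only for the prefixes "/*" and "/+" may pop its prefix blindly, because every prefix that dispatches to it is two code units long. A handler shared by "+" and "+=" has no such guarantee and must inspect the input first.

    // Contrived sketch, not the module's API:
    void skipCommentPrefix(ref LexerRange range)
    {
        range.popFront(); // '/'
        range.popFront(); // '*' or '+': logically !empty by construction
    }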

> In the short time that I have been looking at the features of this lexer I have not been able to figure out a way of writing a standards-compliant XML parser without having to lex some parts of the document at least twice, or subverting the token handlers to change behaviour according to context. Several non-compliant single-pass XML lexers would be possible, but they would not be able to process documents that used some (admittedly obscure and often overlooked) features. The only scalable technique that I can think of to allow XML to be lexed in a single pass in a fully spec-compliant way would be to allow handlers to return multiple tokens. I am not sure how feasible this would be or what mechanism would be best to implement it.

XML doesn't seem to have very distinct lexing/parsing phases like JSON markup or Java code does. This lexer generator may not be the best way to deal with XML.

> On the whole I think the overall design of the module shows promise, but it requires polish to make it both more idiomatically Dlang-y and easier for the user to build upon (both by documentation and interface).

If you have specific problems, please list them.

> On a side note related to the example lexer for Dlang, I believe the predicate function isDocComment will produce false positives for the following comment delimiters, which to my knowledge are not valid DDoc delimiters...
>
> //* //+ /*+ /*/ /+* /+/

https://github.com/Hackerpilot/Dscanner/issues/163
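For reference, a stricter check along these lines avoids all six false positives (a sketch, not the code from the fix):

    // A comment token is DDoc only when its second and third characters
    // repeat the comment kind, i.e. ///, /**, or /++. This rejects //*,
    // //+, /*+, /*/, /+* and /+/, which a check on the third character
    // alone lets through. (The empty comments /**/ and /++/ would still
    // need to be excluded to match the DDoc spec exactly.)
    bool isDocComment(string comment)
    {
        if (comment.length < 3)
            return false;
        return comment[1] == comment[2]
            && (comment[1] == '/' || comment[1] == '*' || comment[1] == '+');
    }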

>
> As the Dlang lexer is not part of the review proper I have not inspected it carefully; this function just happens to be the first one declared in that example.

If you do find issues with the lexer or parser, please file them here:
https://github.com/Hackerpilot/Dscanner/issues

> Again, my apologies for the tardiness of this review.
>
> A...

April 15, 2014
On 14/04/2014 10:34 PM, Brian Schott wrote:
> ubyte[] is required for the lexer to work quickly. That lexer range is
> designed to keep track of the column and line numbers.

I can understand that speed requires the input to be handled as bytes instead of chars, but the restriction to a ubyte[] rather than any random-access range seems un-Dlang-y to me.

If the LexerRange is only there to optionally add line/column numbering support then I think it needs a much more descriptive name and much better documentation.

> That function's purpose is to determine if the current code unit
> signifies the end of an identifier/keyword. When lexing "fortunate", the
> lexer would spot "for", and then call the separating function with the
> index of "u". In the case of D, it would say that this does NOT end a
> word, and "fortunate" should be lexed as an identifier. If it was called
> with an index of a space or parenthesis, it should return true.

Somehow I had skipped the salient part of the documentation for this, so my apologies (at first I thought that the DDoc output I was reading must have been out of sync with the source (it was, but not that much), but further investigations suggest some form of temporary blindness).

This description squares with what I had surmised from reading the code, and I can see why it would be more efficient than the common technique of comparing every matched identifier to the list of keywords. I do wonder, however, if there might be another way to attain the same efficiency without the need for the separating function (I should have replied to this last night when my ideas on the matter were clearer; sleep seems to have stolen my theoretical alternative ><).

I'm also curious about the decision that the signature of the separating function should take the offset to the character that needs to be tested. A more Dlang-y thing to do, in my mind, would be to pass a range that begins at the first character after the matched keyword.
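Something like this, say (a hypothetical signature, not what the module provides today):

    import std.ascii : isAlphaNum;
    import std.range : ElementType, isInputRange;

    // Hypothetical alternative: the lexer passes a slice beginning at
    // the first code unit after the keyword match, instead of an offset
    // into its internal buffer.
    bool isSeparating(R)(R rest)
        if (isInputRange!R && is(ElementType!R : ubyte))
    {
        if (rest.empty)
            return true;                     // end of input separates
        immutable c = cast(char) rest.front;
        return !(isAlphaNum(c) || c == '_'); // '_' continues an identifier
    }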

> If more than one prefix is dispatched to the same handler, the handler
> cannot blindly call popFront or popFrontN. If the lexer author knows
> that the handler will only be called in certain places, than checking
> !empty is a waste of cycles. I don't think it's possible for the lexer
> generator to enforce this.

I don't think the lexer needs to do anything extra to ensure that a handler can begin its work without repeated calls to .empty and without blindly calling popFront. To do so requires that the input be treated as a forward range. Before reading a token, store the .save of the input, then advance the input as the token is matched, counting the consumed elements. When the handler is called, it will have the option of including the token by adding to the count and returning a slice that begins at the .save'd position, or ignoring the length of the match and returning a slice that begins at the position after the matched token.
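Roughly (a sketch of the scheme with invented names; the "match" here just consumes an identifier-like run for illustration):

    import std.ascii : isAlphaNum;
    import std.range : isForwardRange, take;

    auto lexOneToken(R)(ref R input)
        if (isForwardRange!R)
    {
        auto start = input.save; // position before the match
        size_t consumed = 0;
        while (!input.empty && isAlphaNum(cast(char) input.front))
        {
            input.popFront();
            ++consumed;
        }
        // A handler can now slice any part of the match it wants,
        // without blind popFront calls or repeated empty checks:
        return start.take(consumed);
    }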

I strongly believe that a model that requires the user to reason about a library providing ranges that are "logically !empty" is a misstep.

> XML doesn't seem to have very distinct lexing/parsing phases like JSON
> markup or Java code does. This lexer generator may not be the best way
> to deal with XML.

I don't really know how the grammar in the XML standard compares to others in general. It is certainly more convoluted than the published grammar for Dlang, but we all know that that doesn't quite match the language. Some criticise XML for being too complicated, but in some respects it is even more complicated than that; it seems ironic that it was supposed to be simple.

But I might have typed too soon when I wrote that this lexer might not be able to fully tokenize XML in a single pass. As I was drifting off to sleep last night, I had the idea of using the extraFields of TokenStructure to add a token pointer (or equivalent construct) to the tokens, making a singly linked list and allowing me to return multiple tokens from a token handler. I just need to convince myself that it is a sane thing to do.
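Something along these lines, assuming extraFields is mixed into the token struct verbatim and assuming ushort as the ID type:

    import std.lexer : TokenStructure; // the module under review

    // A self-pointer would let a handler chain the extra tokens it
    // wants to return:
    alias Token = TokenStructure!(ushort, q{
        typeof(this)* next; // null unless a handler emitted follow-on tokens
    });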

> If you have specific problems, please list them.

I think the calculator example should be improved or replaced so that the documentation's first example uses all the string[]s that can be passed to the lexer. Perhaps a simple symbolic calculator that accepts letter sequences as variables and reserves pi as a keyword?

I think that the section that describes the token handlers needs to fully document the primitives that the user has access to in order to lex a token, what state to expect them to be in when the handler is called, and what state they should be in when it returns.

If tracking lines and columns is optional and supported only when wrapping the input with LexerRange, then I think that by default Tokens should not contain fields for them. Perhaps an enum holding a q{} string that declares the required fields could be introduced, with an example showing it concatenated with user declarations and passed in via extraFields.
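For example (with the same assumed TokenStructure parameters as above):

    // Ship the line/column fields as an opt-in q{} snippet...
    enum lineColumnFields = q{
        size_t line;
        size_t column;
    };

    // ...which a user who wants them concatenates with their own
    // declarations and passes in via extraFields:
    alias Token = TokenStructure!(ushort, lineColumnFields ~ q{
        string comment;
    });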

I'll again restate that I think LexerRange needs a name that better reflects its functionality. From reading the code I see that it does a little more than simply track line/column, by providing more efficient implementations of some of the range primitives. I think that the user should be able to take advantage of these savings without having to also commit to line/column tracking. I think there should also be the option of tracking only lines, as column information is not always as accurate or useful.

I think that is all my concerns for now, I'll leave my crazy ideas for "improvements" for another day.

> If you do find issues with the lexer or parser, please file them here:
> https://github.com/Hackerpilot/Dscanner/issues

Will do!

A...

May 05, 2014
On 17-Mar-2014 02:13, Martin Nowak wrote:
> On 02/22/2014 09:31 PM, "Marc Schütz" <schuetzm@gmx.net> wrote:
>> But that still doesn't explain why a custom hash table implementation is
>> necessary. Maybe a lightweight wrapper around built-in AAs is sufficient?
>
> I'm also wondering what benefit this hash table provides.

Getting back to this. The custom hash map was originally a product of optimization; the benefits over built-in AAs are:
a) Allocation was amortized by allocating nodes in batches.
b) It allowed a custom hash function to be used with a built-in type (string).

Not sure how much of that stands today.
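A sketch of both points, not the actual code:

    // Nodes are carved out of chunks, so an insert usually costs no
    // allocation, and the hash function is a template parameter rather
    // than the built-in string hash.
    struct HashMapSketch(alias hashFn, size_t chunkSize = 64)
    {
        private static struct Node
        {
            string key;
            Node* next; // separate chaining
        }

        private Node*[] buckets;
        private Node[] chunk; // current batch of pre-allocated nodes
        private size_t used;  // nodes already handed out of the batch

        void insert(string key)
        {
            if (buckets.length == 0)
                buckets = new Node*[1024]; // power of two, so we can mask
            immutable idx = hashFn(key) & (buckets.length - 1);
            auto node = allocate();
            node.key = key;
            node.next = buckets[idx];
            buckets[idx] = node;
        }

        private Node* allocate()
        {
            if (used == chunk.length)
            {
                chunk = new Node[chunkSize]; // one GC allocation per batch
                used = 0;
            }
            return &chunk[used++];
        }
    }

Instantiating it as HashMapSketch!(k => hashOf(k)) would recover the built-in hash; the point was being able to swap in a different one.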

-- 
Dmitry Olshansky