July 31, 2012
On Tuesday, 31 July 2012 at 21:10:52 UTC, Philippe Sigaud wrote:
>> I've only glanced at your code. For most languages lexing is far more
>> expensive than parsing
>
> Is that so?

I have no numbers. It's a statement I found in a few (1-3) sources about parsing. I'll share them if I can find them again.

>> and thus the lexer has to be very fast and I wouldn't
>> pursue your approach and instead use something like ragel. It already has D
>> output but needs a separate build step.
>
> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.

Yeah.
July 31, 2012
On 01-Aug-12 01:10, Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias@pankrath.net> wrote:
>> On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:
>>
>>> I know I'm a little late coming into this conversation. This seems like a
>>> nice thread to toss myself into. I've started working on a generic lexer
>>> that is based off of a defined grammar.
>>
>>
>> Every helping hand is appreciated :-)
>
> Hi Kai, welcome here!
>
>>> As I read through the thread (I unfortunately don't have enough time to
>>> read every post, but I skimmed through as many as I could, and read the ones
>>> that seemed important), it seems like we need a parser in D that can lex D,
>>> and provide a Range of tokens that could be consumed.
>>
>>
>> Yes. To make this statement more precise: We need a lexer that provides
>> a range of tokens and we need a parser which makes it possible to build
>> an AST. Pegged combines both approaches but imposes an overhead if you
>> just need a token list. However I'm not sure if this is a problem.
>
> I think we need a tokenizer generator and a parser generator. They can
> be grouped or not, but we need both, in Phobos.
>
> We also need to define what's needed in a token. Kai's approach is OK,
> but what's the _type field for an operator or an identifier?
> Also, a lexer can fill a symbol table and produce identifier tokens
> that refer to a particular entry in the symbol table.
>
Yeah.

> I guess creating a tree of symbol tables according to scope visibility
> is then more the job of the parser, but I'm not sure.
>
The parser can use constant IDs for nested tables; the IDs point into a string table, and the string table is populated by the lexer.
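Roughly something like this (just a sketch; the names are made up):

struct StringTable
{
    // Sketch only: the lexer interns each identifier once and hands out
    // a stable ID; the parser then refers to IDs instead of strings.
    uint[string] ids;   // lexeme -> ID
    string[] strings;   // ID -> lexeme

    uint intern(string lexeme)
    {
        if (auto p = lexeme in ids)
            return *p;
        immutable id = cast(uint) strings.length;
        strings ~= lexeme;
        ids[lexeme] = id;
        return id;
    }
}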
>
>> I've only glanced at your code. For most languages lexing is far more
>> expensive than parsing
>
> Is that so?

It usually is, unless the parser is overly generic, as with GLL/GLR (not to claim that they are very slow, but GLL/GLR parsers are slower for obvious reasons).

>
>> and thus the lexer has to be very fast and I wouldn't
>> pursue your approach and instead use something like ragel. It already has D
>> output but needs a separate build step.

Have to agree here; if anything, a better DFA generator is needed. The current std.regex can't get as good in this field because of:

a) irregularities like lookahead/lookbehind/etc. in patterns (avoided in ctRegex via codegen)
b) full Unicode features commitment (again, expect some improvement here in the near future)
c) being designed to take multiple captures from the matched piece of text.

I'm not sure when (or even if) std.regex will ever special-case all of the above.

>
> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.
>
> The only problem I see in Kai's approach (which is the standard one, a
> prioritized list of regexes) is how to recognize tokens that are not
> regular (I mean, that cannot be encoded as a regex), like nesting
> comments?

See below.

> How does the lexer know when to stop producing a 'comment'
> token?

Call a special function skipComment when the lexer hits a comment_start pseudotoken.

Typically lexers are regular, as that allows them to be fast.
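For D's nesting /+ +/ comments that function could be something along these lines (just a sketch, not code from any real lexer):

// Sketch: called right after the lexer has consumed an opening "/+".
// Nesting is handled with a simple depth counter.
void skipComment(ref string input)
{
    int depth = 1;
    while (depth > 0 && input.length >= 2)
    {
        if (input[0 .. 2] == "/+")
        {
            ++depth;
            input = input[2 .. $];
        }
        else if (input[0 .. 2] == "+/")
        {
            --depth;
            input = input[2 .. $];
        }
        else
            input = input[1 .. $];
    }
    // depth > 0 here would mean an unterminated comment (error handling omitted).
}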


-- 
Dmitry Olshansky
July 31, 2012
On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
>> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.
>
> I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see).

That was quick! Cool!

> Writing it has been going surprisingly
> quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate.

Could you please describe the kind of token it produces?
Can it build a symbol table?
Does it recognize all kinds of strings (including q{ } ones)?
How does it deal with comments, particularly nested ones?
Does it automatically discard whitespaces or produce them as a token?
I'd favor this approach, if only because wrapping the lexer in a
filter!noWS(tokenRange) is easy.
Does it produce a lazy range btw?
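
(By filter!noWS I mean something like the following; the Token and TokenType names are pure guesses on my part:)

import std.algorithm : filter;

// Hypothetical predicate over a hypothetical Token type.
bool noWS(Token t) { return t.type != TokenType.whitespace; }

// tokenRange is whatever lazy range the lexer returns.
auto significant = filter!noWS(tokenRange);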
July 31, 2012
On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

>> I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.
>>
> The parser can use constant IDs for nested tables; the IDs point into a string table, and the string table is populated by the lexer.

The latter, I get. The former, not so much.


>>> I've only glanced at your code. For most languages lexing is far more expensive than parsing
>>
>>
>> Is that so?
>
>
> It usually is, unless the parser is overly generic, as with GLL/GLR (not to claim that they are very slow, but GLL/GLR parsers are slower for obvious reasons).

I'd have thought that, seeing as 'the end result' is more complicated (richer, if I may say so) for parsing, parsing would be the more expensive step. I'm reading up on GLR, and was wondering what the practical limits are. Do people use GLR to parse thousands of pages of text?


> Have to agree here; if anything, a better DFA generator is needed. The current std.regex can't get as good in this field because of:
>
> a) irregularities like lookahead/lookbehind/etc. in patterns (avoided in
> ctRegex via codegen)
> b) full Unicode features commitment (again, expect some improvement here in
> the near future)
> c) being designed to take multiple captures from the matched piece of text.
>
> I'm not sure when (or even if) std.regex will ever special-case all of the
> above.

Well,

- for a lexer lookahead is sometimes useful (the Dragon book cites the
FORTRAN grammar, for which keywords are not reserved, and so when you
encounter IF, you don't know if (!) it's a function call or a 'real'
if)
- Unicode is needed to lex D correctly, no?
- multiple captures don't seem necessary *for a lexer*


>> How does the lexer know when to stop producing a 'comment' token?
>
>
> Call a special function skipComment when the lexer hits a comment_start pseudotoken.

Ah, and this special function must somehow maintain a stack, to 'count' the comment_start and comment_end pseudotokens. So in a way, it quits the regular world to become temporarily more powerful.

> Typically lexers are regular, as that allows them to be fast.

Hmm, but then it has to treat a comment as one token. So no Ddoc love?
July 31, 2012
On Tuesday, July 31, 2012 23:39:38 Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg@gmx.com>
wrote:
> > On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
> >> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.
> > 
> > I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see).
> 
> That was quick! Cool!

Yeah. Once I started on it, I made a lot of progress really quickly. There's still a fair bit to do (primarily having to do with literals), but it probably won't take all that much longer. Certainly, I'd expect to have it done within a couple of weeks if not sooner, unless something goes wrong.

> > Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate.
> 
> Could you please describe the kind of token it produces?
> Can it build a symbol table?
> Does it recognize all kinds of strings (including q{ } ones)?
> How does it deal with comments, particularly nested ones?
> Does it automatically discard whitespaces or produce them as a token?
> I'd favor this approach, if only because wrapping the lexer in a
> filter!noWS(tokenRange) is easy.
> Does it produce a lazy range btw?

Well, it's still a work in progress, so it certainly can be adjusted as necessary. I intend it to fully implement the spec (and make sure that both it and the spec match what dmd is doing) as far as lexing goes. The idea is that you should be able to build a fully compliant D parser on top of it and build a fully compliant D compiler on top of that.

It already supports all of the comment types and several of the string literal types. I haven't sorted out q{} yet, but I will before I'm done, and that may or may not affect how some things work, since I'm not quite sure how to handle q{} yet (it may end up being done with tokens marking the beginning and end of the token sequence encompassed by q{}, but we'll see). I'm in the middle of dealing with the named entity stuff at the moment, which unfortunately has revealed a rather nasty compiler bug with regards to template compile times, which I still need to report (I intend to do that this evening). The file generating the table of named entities currently takes over 6 minutes to compile on my Phenom II thanks to that bug, so I'm not quite sure how that's going to affect things. Regardless, the lexer should support _everything_ as far as what's required for fully lexing D goes by the time that I'm done.

I don't have the code with me at the moment, but I believe that the token type looks something like

struct Token
{
    TokenType type;     // which kind of token this is (including comment and error types)
    string str;         // the exact text that was lexed for this token
    LiteralValue value; // set only when the token is a literal
    SourcePos pos;      // where in the source the token started
}

struct SourcePos
{
    size_t line;
    size_t col;
    size_t tabWidth = 8;
}

The type is an enum which gives the type of the token (obviously), which includes the various comment types and an error type (so errors are reported by getting a token that is an error token rather than by throwing or anything like that, which should make lexing past malformed stuff easy).

str holds the exact text which was lexed (this includes the entire comment for the various comment token types), which in the case of lexing a string rather than another range type would normally (always? - I don't remember) be a slice of the string being lexed, which should help make lexing strings very efficient. It may or may not make sense to change that to the range type used rather than string. For nesting block comments, the whole comment is one token (with the token type which is specifically for nested comments), regardless of whether there's any nesting going on. But that could be changed if there were a need to get separate tokens for the comments inside.

LiteralValue is a VariantN of the types that a literal can be (long, ulong, real, and string IIRC) and is empty unless the token is a literal type (the various string postfixes - c, w, and d - are treated as different token types rather than giving the literal value different string types - the same with the integral and floating point literals).

And pos holds the position in the text where the token started, which should make it easy to use for syntax highlighting and the like (as well as indicating where an error occurred). The initial position is passed as an optional argument to the lexing function, so it doesn't have to be 1:1 (though that's the default), and it allows you to select the tab width.

So, you'll pass a range and an optional starting position to the lexing function, and it'll return a lazy range of Tokens. Whitespace is stripped as part of the lexing process, but if you take the pos properties of two adjacent tokens, you should be able to determine how much whitespace was between them.
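
In other words, usage would look something along these lines (the function name and enum members here are placeholders, not the final API):

void main()
{
    import std.file : readText;
    import std.stdio : writefln;

    // Placeholder names only - nothing here is final.
    auto source = readText("foo.d");
    auto tokens = lexSource(source, SourcePos(1, 1)); // lazy range of Token

    foreach (token; tokens)
    {
        // Errors come back as error tokens rather than being thrown.
        if (token.type == TokenType.error)
            writefln("lex error at %s:%s: %s",
                     token.pos.line, token.pos.col, token.str);
    }
}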

I _think_ that that's how it currently is, but again, I don't have the code with me at the moment, so it may not be 100% correct. And since it's a work in progress, it's certainly open to changes.

- Jonathan M Davis
July 31, 2012
On 08/01/2012 12:01 AM, Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
> <dmitry.olsh@gmail.com> wrote:
>> Typically lexers are regular, as that allows them to be fast.

Regularity of the language is not required for speed.

>
> Hmm, but then it has to treat a comment as one token. So no Ddoc love?

Ddoc is typically not required. By default it should be treated as
whitespace. If it is required, one token seems reasonable: The
post-processing of the doc comment is best done as a separate step.
July 31, 2012
On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
> Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.

That was how I was intending to deal with ddoc. It's just a nested block comment token. The whole comment string is there, so the ddoc processor can use that to do whatever it does. ddoc isn't part of lexing really. It's a separate thing.

- Jonathan M Davis
August 01, 2012
On Wed, Aug 1, 2012 at 12:58 AM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
>> Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.
>
> That was how I was intending to deal with ddoc. It's just a nested block comment token. The whole comment string is there, so the ddoc processor can use that to do whatever it does. ddoc isn't part of lexing really. It's a separate thing.

OK. Same for standard comments and doc comments?
I was wondering how to get at the code possibly inside a ---- / ----
block (I've never dealt with something like documentation or syntax
highlighting), but your solution makes it easy:

Token(TokenType.DocComment, "/** ... */"), Token(TokenType.Module,
"module"), ...

A small specialised parser can then extract text, DDoc macros and code blocks from inside the comment. Finding and stripping '----' is easy, and then the lexer can be re-invoked locally on the slice containing the example code.
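
Roughly (a quick, untested sketch; the real delimiter rules are simplified here):

import std.string : indexOf;

// Sketch only: pull the first ---- ... ---- example block out of a doc
// comment so the lexer can be run again on just that slice.
string firstExample(string docComment)
{
    auto start = docComment.indexOf("----");
    if (start < 0) return null;
    auto rest = docComment[start + 4 .. $];
    auto end = rest.indexOf("----");
    return end < 0 ? null : rest[0 .. end];
}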
August 01, 2012
On Wednesday, August 01, 2012 07:26:07 Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 12:58 AM, Jonathan M Davis <jmdavisProg@gmx.com>
wrote:
> > On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:
> >> Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.
> > 
> > That was how I was intending to deal with ddoc. It's just a nested block
> > comment token. The whole comment string is there, so the ddoc processor
> > can
> > use that to do whatever it does. ddoc isn't part of lexing really. It's a
> > separate thing.
> 
> OK. Same for standard comment and doc comments?

From the TokenType enum declaration:

    blockComment,         /// $(D /* */)
    lineComment,          /// $(D // )
    nestingBlockComment,  /// $(D /+ +/)

There are then functions which operate on Tokens to give you information about them. Among them is isDdocComment, which will return true if the Token type is a comment, and that comment is a ddoc comment (i.e. starts with /**, ///, or /++ rather than /*, //, or /+). So, anything that wants to process ddoc comments can lex them out and process them, and if it wants to know which symbols a ddoc comment applies to, it can look at the tokens that follow (though a full-on parser would be required to do that correctly).

> I was wondering how to get at the code possibly inside a ---- / ---- block (I've never dealt with something like documentation or syntax highlighting), but your solution makes it easy:
> 
> Token(TokenType.DocComment, "/** ... */"), Token(TokenType.Module,
> "module"), ...
> 
> A small specialised parser can then extract text, DDoc macros and code blocks from inside the comment. Finding and stripping '----' is easy, and then the lexer can be re-invoked locally on the slice containing the example code.

Yes. The lexer isn't concerned with where the text comes from, and it isn't concerned with lexing comments beyond putting them in a token. But that should be powerful enough to lex the examples if you've already extracted them.

- Jonathan M Davis
August 01, 2012
On 01-Aug-12 02:54, Timon Gehr wrote:
> On 08/01/2012 12:01 AM, Philippe Sigaud wrote:
>> On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
>> <dmitry.olsh@gmail.com> wrote:
>>> Typically lexers are regular, as that allows them to be fast.
>
> Regularity of the language is not required for speed.

That's why I wrote "allows" (not "required"); the reason, in general, is that a plain deterministic automaton is insanely fast compared with just about any CFG parsing scheme.

Yet there are certain grammars/cases where it doesn't matter much. Also, tweaking a DFA with "semantic" actions could make it handle some irregularities at no extra cost, etc.

>
>>
>> Hmm, but then it has to treat a comment as one token. So no Ddoc love?
>
> Ddoc is typically not required. By default it should be treated as
> whitespace. If it is required, one token seems reasonable: The
> post-processing of the doc comment is best done as a separate step.

Yup.

-- 
Dmitry Olshansky