August 01, 2012
Thanks for taking the time to write such an explanation!


> Yeah. Once I started on it, I made a lot of progress really quickly. There's still a fair bit to do (primarily having to do with literals), but it probably won't take all that much longer. Certainly, I'd expect to have it done within a couple of weeks if not sooner, unless something goes wrong.

Is it based on a prioritized list of regexes?

> Well, it's still a work in progress, so it certainly can be adjusted as necessary. I intend it to fully implement the spec (and make sure that both it and the spec match what dmd is doing) as far as lexing goes. The idea is that you should be able to build a fully compliant D parser on top of it and build a fully compliant D compiler on top of that.

My fear (having encoded the D grammar once) is that 99% of the code constructs are easily done (even new stuff like the => syntax), but that the remaining 1% is a nightmare. The many different kinds of strings, with or without escaping, for example. Your lexer returns a string literal as a string field, right? If so, escaping chars like " and ` makes me nervous.

>
> It already supports all of the comment types and several of the string literal types. I haven't sorted out q{} yet, but I will before I'm done, and that may or may not affect how some things work, since I'm not quite sure how to handle q{} yet (it may end up being done with tokens marking the beginning and end of the token sequence encompassed by q{}, but we'll see).

I see. My question was because, as I said before, I found this to be a difficult and hairy part of the D grammar, and an IDE might want to colorize/manipulate the content of a string when it knows it's a string mixin.


> I don't have the code with me at the moment, but I believe that the token type looks something like
>
> struct Token
> {
>  TokenType type;
>  string str;
>  LiteralValue value;
>  SourcePos pos;
> }

Seems OK. I guess it can be templated on the string type, because you'll surely want to lex dstrings too.
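Something like this minimal sketch is what I have in mind (the helper types here are simplified stand-ins, not your actual definitions):

enum TokenType { identifier, stringLiteral, intLiteral, comment, error }
struct LiteralValue { }                       // placeholder for the real VariantN
struct SourcePos { size_t line; size_t col; }

// Token parameterized on the source string type, so that wstring/dstring
// sources can keep slices of themselves in str.
struct Token(String = string)
{
    TokenType type;
    String str;
    LiteralValue value;
    SourcePos pos;
}

unittest
{
    Token!dstring t;
    t.str = "void"d;
    assert(t.type == TokenType.identifier); // enum default is its first member
}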


> The type is an enum which gives the type of the token (obviously) which includes the various comment types and an error type (so errors are reported by getting a token that was an error token rather than throwing or anything like that, which should make lexing past malformed stuff easy).

And the recovery is done on the next token. Good idea.

> str holds the exact text which was lexed (this includes the entire comment for the various comment token types), which, in the case of lexing a string rather than another range type, would normally (always? - I don't remember) be a slice of the string being lexed, which should help make lexing strings very efficient.

Yeah, you may as well use this excellent feature. But that means the input must always be held in memory, no? If it's an input range (coming from a big file), can you keep slices from it?

> It may or may not make sense to change that to the range type used rather than string. For nesting block comments, the whole comment is one token (with the token type which is specifically for nested comments), regardless of whether there's any nesting going on. But that could be changed if there were a need to get separate tokens for the comments inside.

That's OK, as I replied in another message. It's now quite clear to me how to deal with that.
IDEs and doc generators can parse the comments at their leisure, as they are neither long nor critical sections of code.
They can even easily get the code fragments used as examples inside ---/--- markers.
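(And treating a whole nesting /+ +/ comment as one token, as you describe, only needs a depth counter. A rough sketch, with error handling for unterminated comments omitted:)

// Returns the index just past the +/ that closes the comment starting at i.
size_t skipNestingComment(string src, size_t i)
{
    assert(src[i .. i + 2] == "/+");
    size_t depth = 1;
    i += 2;
    while (i + 1 < src.length && depth > 0)
    {
        if (src[i .. i + 2] == "/+")      { ++depth; i += 2; }
        else if (src[i .. i + 2] == "+/") { --depth; i += 2; }
        else ++i;
    }
    return i;
}

unittest
{
    enum s = "/+ outer /+ inner +/ still outer +/ code";
    assert(s[skipNestingComment(s, 0) .. $] == " code");
}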

> LiteralValue is a VariantN of the types that a literal can be (long, ulong, real, and string IIRC) and is empty unless the token is a literal type


You can make it an Algebraic!(long, ulong, real, string), as your universe of types is restricted. That can even catch a non-legal value. Array literals are not considered literals, since they can contain arbitrary expressions, is that right?
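For example (a minimal sketch, independent of your code):

import std.variant : Algebraic;

// Restrict the literal value to the known set of literal types.
alias LiteralValue = Algebraic!(long, ulong, real, string);

unittest
{
    LiteralValue v = 42L;
    assert(v.peek!long !is null);
    v = "hello";
    assert(v.get!string == "hello");
    // v = [1, 2, 3]; // would not compile: int[] is outside the allowed set
}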

And it's a good idea to drop the complex literals once and for all.

> (the
> various string postfixes - c,w, and d - are treated as different token types
> rather than giving the literal value different string types - the same with the
> integral and floating point literals).

That seems reasonable enough, but can you really store a dstring literal in a string field?

> And pos holds the position in the text where the token started, which should make it easy to use for syntax highlighting

Does syntax highlighting need more than a token stream? Without having thought a lot about it, it seems to me that IDEs tend to highlight based just on the token type, not on a parse tree. So that means your lexer can be used directly by interested people, which is nice.

> and the like (as well as
> indicating where an error occurred). The initial position is passed as an
> optional argument to the lexing function, so it doesn't have to be 1:1 (though
> that's the default), and it allows you to select the tabwidth.

I really like the idea of having malformed tokens become error tokens, with lexing continuing afterwards. As I said before, that enables an IDE to continue to do syntax highlighting.


> So, you'll pass a range and an optional starting position to the lexing function, and it'll return a lazy range of Tokens. Whitespace is stripped as part of the lexing process, but if you take the pos properties of two adjacent tokens, you should be able to determine how much whitespace was between them.

OK, so whitespace is strictly that, and comments are not dropped. That's OK. I guess a D parser would then just filter the comments out of the token range.
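Something like this sketch, I suppose (the token kind names are just guesses on my part):

import std.algorithm.iteration : filter;

// Guessed token kinds, only for the sketch.
enum TokenType { identifier, lineComment, blockComment, nestingBlockComment }
struct Token { TokenType type; string str; }

// Drop comment tokens from the (lazy) token range before parsing.
auto withoutComments(R)(R tokens)
{
    return tokens.filter!(t => t.type != TokenType.lineComment
                            && t.type != TokenType.blockComment
                            && t.type != TokenType.nestingBlockComment);
}

unittest
{
    import std.algorithm.comparison : equal;
    auto toks = [Token(TokenType.identifier, "x"),
                 Token(TokenType.lineComment, "// hi"),
                 Token(TokenType.identifier, "y")];
    assert(withoutComments(toks).equal([toks[0], toks[2]]));
}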


> I _think_ that that's how it currently is, but again, I don't have the code with me at the moment, so it may not be 100% correct. And since it's a work in progress, it's certainly open to changes.

My only quibble for now would be the string/dstring thing.

Did you handle UFCS with respect to floating point values? Is 1.f seen as a
float, or as f called on 1 (an int)?
August 01, 2012
On 01-Aug-12 02:01, Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
> <dmitry.olsh@gmail.com> wrote:
>
>>> I guess creating a tree of symbol tables according to scope visibility
>>> is then more the job of the parser, but I'm not sure.
>>>
>> Parser can use constant IDs for nested tables, IDs point to string table.
>> String table is populated by lexer.
>
> The latter, I get. The former, not so much.
>
>
>>>> I've only glimpsed at your code. For most languages lexing is far more
>>>> expensive then parsing
>>>
>>>
>>> Is that so?
>>
>>
>> It usually is. Unless parser is overly generic as GLL/GLR (not to claim that
>> it's very slow but GLL/GLR are slower for obvious reasons).
>
> I'd have thought that, seeing as 'the end result' is more complicated
> (richer, if I may say so) for parsing, then parsing is more expensive.
> I'm reading on GLR, and was wondering what the practical limits are.
> Do people use GLR to parse thousands of pages of text?
>
>
>> Have to agree here; if anything, a better DFA generator is needed. Current
>> std.regex can't get as good in this field because of:
>>
>> a) irregularities like lookahead/lookbehind/etc. in patterns (avoided in
>> ctRegex via codegen)
>> b) full unicode features commitment (again expect some improvement here in
>> near future)
>> c) designed to take multiple captures from matched piece of text.
>>
>> I'm not sure when (or even if) std.regex will ever special case all of the
>> above.
>
> Well,
>
> - for a lexer lookahead is sometimes useful (the Dragon book cites the
> FORTRAN grammar, for which keywords are not reserved and so when you
> encounter IF, you don't know if (!) it's a function call or a 'real'
> if)

Well, while lookahead will help, there are simpler ways, e.g.
regex("IF\ZYOUR_LOOKAHEAD");

\Z means that you are capturing text only up to \Z. It's still regular. I think Boost regex does support this. We could have supported this cleanly but chose the ECMA-262 standard.


The critical difference is that lookahead can be used like this:

regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");

> - Unicode is needed to lex D correctly, no?

Not that much of Unicode is needed; e.g. it could skip decoding most of the time.

> - multiple captures doesn't seem necessary *for a lexer*
>
>
>>> How does the lexer know when to stop producing a 'comment'
>>> token?
>>
>>
>> Call special function skipComment when lexer hits comment_start pseudotoken.
>
> Ah, and this special function must somehow maintain a stack, to
> 'count' the comment_start and comment_end pseudotokens. So in a way,
> it quits the regular world to become temporarily more powerful.
>
>> Typically lexers are regular as it allows them to be fast.
>
> Hmm, but then it has to treat comments as one token. So no Ddoc love?
>


-- 
Dmitry Olshansky
August 01, 2012
On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
> Is it based on a prioritized list of regexes?

I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.

> > Well, it's still a work in progress, so it certainly can be adjusted as necessary. I intend it to fully implement the spec (and make sure that both it and the spec match what dmd is doing) as far as lexing goes. The idea is that you should be able to build a fully compliant D parser on top of it and build a fully compliant D compiler on top of that.
> 
> My fear (having encoded the D grammar once) is that 99% of the code constructs are easily done (even new stuff like the => syntax), but that the remaining 1% is a nightmare. The many different kinds of strings, with or without escaping, for example. Your lexer returns a string literal as a string field, right? If so, escaping chars like " and ` makes me nervous.

Well, if you want the original text, there's the str property, which contains everything that was in the source code (including whatever escaping there was), whereas the value property includes only the body, and it will have already processed whatever escaped characters needed to be processed. It's the actual value of the literal. So, \` would be ` in the value property, named entities would be their actual code points, etc.

So, I don't really see a problem there. Sure, depending on what is done with the processed string, it could cause further problems, but that always happens when doing string processing.
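Purely as an illustration of the difference (just literal strings here, not the actual lexer output):

unittest
{
    // For the source text "hello\tworld" (quotes included in the source):
    string str   = `"hello\tworld"`; // exact source slice: quotes and backslash kept
    string value = "hello\tworld";   // processed body: \t is an actual tab
    assert(str.length == value.length + 3); // the two quotes plus the extra backslash
}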

> > It already supports all of the comment types and several of the string literal types. I haven't sorted out q{} yet, but I will before I'm done, and that may or may not affect how some things work, since I'm not quite sure how to handle q{} yet (it may end up being done with tokens marking the beginning and end of the token sequence encompassed by q{}, but we'll see).
> 
> I see. My question was because, as I said before, I found this to be a difficult and hairy part of the D grammar, and an IDE might want to colorize/manipulate the content of a string when it knows it's a string mixin.

Well, it could very well cause me trouble. I haven't gotten to it yet, and I'll need to study it to fully understand what it does before crafting my solution, but it'll probably end up being a sequence of tokens of some kind given that that's what it is - a string of tokens. Regardless, it'll be done in a way that whatever is using the lexer will be able to know what it is and process it accordingly.

> > str holds the exact text which was lexed (this includes the entire comment for the various comment token types), which, in the case of lexing a string rather than another range type, would normally (always? - I don't remember) be a slice of the string being lexed, which should help make lexing strings very efficient.
> Yeah, you may as well use this excellent feature. But that means the input must always be held in memory, no? If it's an input range (coming from a big file), can you keep slices from it?

Well, whatever is using the lexer is going to have to make sure that what it passes to the lexer continues to exist while it uses the lexer. Given how slicing works in D, it's pretty much a given that lexers and parsers are going to use them heavily, and any code using a lexer or parser is going to have to take that into account. Slicing is one of the major reasons that Tango's XML parser is so insanely fast.
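A trivial illustration of why the source has to outlive the tokens:

unittest
{
    string source = `int x = 42; // the answer`;
    string tok = source[0 .. 3];   // "int" - a slice, no copy is made
    assert(tok == "int");
    assert(tok.ptr is source.ptr); // same underlying memory as the source
}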

> > LiteralValue is a VariantN of the types that a literal can be (long,
> > ulong,
> > real, and string IIRC) and is empty unless the token is a literal type
> 
> You can make it an Algebraic!(long, ulong, real, string), as your universe of types is restricted. That can even catch a non-legal value.

I'll have to look at that to see whether using Algebraic is better. I'm not super-familiar with std.variant, so I may have picked the wrong type. However, VariantN already holds a specific set of types though (unlike Variant), so that part isn't a problem.

> Array literals are not considered literals, since they can contain arbitrary expressions, is that right?

There is no concept of array literal in the lexing portion of the spec. There may be at the level of the parser, but that's not the lexer's problem.

> And it's a good idea to drop the complex literals once and for all.

I'm not supporting any symbols which are known to be scheduled for deprecation (e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is to-be-deprecated keywords (e.g. volatile and delete), since they still won't be legal to use as identifiers. And there's a function to query a token as to whether it's using a deprecated or unused keyword so that the program using the lexer can flag it if it wants to.

> > (the
> > various string postfixes - c,w, and d - are treated as different token
> > types rather than giving the literal value different string types - the
> > same with the integral and floating point literals).
> 
> That seems reasonable enough, but can you really store a dstring literal in a string field?

Yeah. Why not? The string is the same in the source code regardless of the type of the literal. The only difference is the letter tacked onto the end. That will be turned into the appropriate string type by the semantic analyzer, but the lexer doesn't care.

> Does syntax highlighting need more than a token stream? Without having thought a lot about it, it seems to me that IDEs tend to highlight based just on the token type, not on a parse tree. So that means your lexer can be used directly by interested people, which is nice.

If all you want to do is highlight specific symbols and keywords, then you don't need a parser. If you want to do fancier stuff related to templates and the like, then you'll need a parser and maybe even a semantic analyzer. But a lexer should be plenty for basic syntax highlighting.
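E.g. something as simple as this sketch (with made-up token kinds) is all that basic highlighting needs:

// Made-up token kinds, purely for the sketch.
enum TokenType { keyword, stringLiteral, comment, identifier, other }

string cssClass(TokenType t)
{
    final switch (t)
    {
        case TokenType.keyword:       return "d-keyword";
        case TokenType.stringLiteral: return "d-string";
        case TokenType.comment:       return "d-comment";
        case TokenType.identifier:
        case TokenType.other:         return "d-plain";
    }
}

unittest
{
    assert(cssClass(TokenType.keyword) == "d-keyword");
}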

> I really like the idea of having malformed tokens become error tokens, with lexing continuing afterwards. As I said before, that enables an IDE to continue to do syntax highlighting.

That was the idea. You need to be able to continue lexing beyond errors in many cases, and that seemed the natural way to do it.

> Did you handle UFCS with respect to floating point values? Is 1.f seen as a
> float, or as f called on 1 (an int)?

I haven't tackled floating point literals yet, but since the f in 1.f is not part of the floating point literal, I'll have to handle it. I suspect that that's one area where I'll need to update the spec though, since it probably hasn't been updated for that yet.
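Roughly, the decision comes down to a bit of lookahead after the integer part - a deliberately simplified sketch, not the actual code (the real rule has more cases, e.g. exponents):

import std.ascii : isDigit;

// Does the '.' at dotIndex continue a floating point literal, or does it
// start something else (a UFCS call like 1.f, the .. token, etc.)?
bool dotContinuesLiteral(string src, size_t dotIndex)
{
    return dotIndex + 1 < src.length && isDigit(src[dotIndex + 1]);
}

unittest
{
    assert(dotContinuesLiteral("1.5", 1));   // float literal
    assert(!dotContinuesLiteral("1.f", 1));  // integer 1, then f called on it
    assert(!dotContinuesLiteral("1..2", 1)); // integer 1, then the .. token
}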

Basically, the lexer that I'm writing needs to be 100% compliant with the D spec and dmd (which means updating the spec or fixing dmd in some cases), and it needs to be possible to build on top of it anything and everything that dmd does that would use a lexer (even if it's not the way that dmd currently does it) so that it's possible to build a fully compliant D parser and compiler on top of it as well as a ddoc generator and anything else that you'd need to do with a lexer for D. So, if you have any questions about what my lexer does (or is supposed to do) with regards to the spec, that should answer it. If my lexer doesn't match the spec or dmd when I'm done (aside from specific exceptions relating to stuff like deprecated symbols), then I screwed up.

- Jonathan M Davis
August 01, 2012
On Wed, Aug 1, 2012 at 7:48 AM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
>> Well,
>>
>> - for a lexer lookahead is sometimes useful (the Dragon book cites the FORTRAN grammar, for which keywords are not reserved and so when you encounter IF, you don't know if (!) it's a function call or a 'real' if)
>
>
> Well, while lookahead will help, there are simpler ways, e.g.
> regex("IF\ZYOUR_LOOKAHEAD");
>
> \Z means that you are capturing text only up to \Z. It's still regular. I think Boost regex does support this. We could have supported this cleanly but chose the ECMA-262 standard.
>
>
> The critical difference is that lookahead can be used like this:
>
> regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");

In the FORTRAN case, you indeed *need* to re-lex the stuff after IF, with another regex, once you've determined it's an IF instruction and not some moron who used IF as an identifier.

You know, as a first step, I'd be happy to get ctRegex to recognize the \Z flag.
August 01, 2012
On 01-Aug-12 02:01, Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
> <dmitry.olsh@gmail.com> wrote:
>
>>> I guess creating a tree of symbol tables according to scope visibility
>>> is then more the job of the parser, but I'm not sure.
>>>
>> Parser can use constant IDs for nested tables, IDs point to string table.
>> String table is populated by lexer.
>
> The latter, I get. The former, not so much.

Okay. Say the lexer maps all unique strings that are not keywords to some ID.

Then the parser creates a stack of scoped symbol tables.
These nested symbol tables use only IDs, not the strings themselves.
(Though I expect that only semantic analysis requires the use of these tables.)
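
Rough sketch of what I mean (nothing fancy):

// Lexer side: intern identifier strings once, hand out stable IDs.
struct StringTable
{
    uint[string] ids;
    string[] strings;

    uint intern(string s)
    {
        if (auto p = s in ids)
            return *p;
        auto id = cast(uint) strings.length;
        strings ~= s;
        ids[s] = id;
        return id;
    }
}

// Parser/semantic side: nested scopes refer to symbols by ID only.
struct Scope
{
    Scope* parent;
    int[uint] symbols; // ID -> whatever the symbol entry ends up being
}

unittest
{
    StringTable tab;
    assert(tab.intern("foo") == 0);
    assert(tab.intern("bar") == 1);
    assert(tab.intern("foo") == 0); // same spelling, same ID
}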

-- 
Dmitry Olshansky
August 01, 2012
On Wed, Aug 1, 2012 at 8:11 AM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
>> Is it based on a prioritized list of regexes?
>
> I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.

OK. It'll be more difficult to genericize, then, but that's not your
problem (could be mine, though).


> Well, whatever is using the lexer is going to have to make sure that what it passes to the lexer continues to exist while it uses the lexer.

Yeah, I realized that just after posting. And anyway, the tokens are meant to be consumed at once, normally.

> I'll have to look at that to see whether using Algebraic is better. I'm not super-familiar with std.variant, so I may have picked the wrong type. However, VariantN already holds a specific set of types though (unlike Variant), so that part isn't a problem.

OK, I forgot about VariantN (I'm not so used to std.variant either)


> I'm not supporting any symbols which are known to be scheduled for deprecation (e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is to-be-deprecated keywords (e.g. volatile and delete), since they still won't be legal to use as identifiers. And there's a function to query a token as to whether it's using a deprecated or unused keyword so that the program using the lexer can flag it if it wants to.

Good idea. A nice set of query functions would be a good complement:

- isDoc
- isComment
- isString
- isDeprecated
...


>> That seems reasonable enough, but can you really store a dstring literal in a string field?
>
> Yeah. Why not? The string is the same in the source code regardless of the type of the literal. The only difference is the letter tacked onto the end. That will be turned into the appropriate string type by the semantic analyzer, but the lexer doesn't care.

I don't get it. Say I have a literal with non-UTF-8 chars; how will it be stored inside the .str field as a string?


> Basically, the lexer that I'm writing needs to be 100% compliant with the D spec and dmd (which means updating the spec or fixing dmd in some cases), and it needs to be possible to build on top of it anything and everything that dmd does that would use a lexer (even if it's not the way that dmd currently does it) so that it's possible to build a fully compliant D parser and compiler on top of it as well as a ddoc generator and anything else that you'd need to do with a lexer for D. So, if you have any questions about what my lexer does (or is supposed to do) with regards to the spec, that should answer it. If my lexer doesn't match the spec or dmd when I'm done (aside from specific exceptions relating to stuff like deprecated symbols), then I screwed up.

That's a lofty goal, but it would indeed be quite good to have an officially integrated lexer in Phobos that would (as Andrei said) "be it". The token spec would be the lexer.
August 01, 2012
On Wed, Aug 1, 2012 at 8:12 AM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
> On 01-Aug-12 02:01, Philippe Sigaud wrote:
>>
>> On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
>>
>>>> I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.
>>>>
>>> Parser can use constant IDs for nested tables, IDs point to string table. String table is populated by lexer.
>>
>>
>> The latter, I get. The former, not so much.
>
>
> Okay. Say the lexer maps all unique strings that are not keywords to some ID.
>
> Then the parser creates a stack of scoped symbol tables.
> These nested symbol tables use only IDs, not the strings themselves.
> (Though I expect that only semantic analysis requires the use of these
> tables.)

Ah, that! This I know; I misunderstood your initial post. The symbol table can also be preloaded with keywords, if needed.
August 01, 2012
On Wednesday, August 01, 2012 08:20:04 Philippe Sigaud wrote:
> OK. It'll be more difficult to genericize, then, but that's not your
> problem (could be mine, though).

It was never intended to be even vaguely generic. It's targeting D specifically. If someone can take it and make it generic when I'm done, then great. But its goal is to lex D as efficiently as possible, and it'll do whatever it takes to do that. From how the main switch statement's cases are constructed though, there's a lot there which could be genericized. I currently have several mixins used to create them, but I'm pretty sure that I can generate a _lot_ of the case statements using a single mixin which just takes the list of symbols and their associated tokens, which I'll probably do before I'm done. So, I'm sure that pieces of what I'm doing could be used to generate a lexer for another language, but plenty of it is very specific to D.
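To give an idea of the kind of thing I mean (purely illustrative, not the actual mixin or token names):

enum TokenType { plus, plusPlus, plusAssign }

// CTFE helper that expands into a run of case statements.
string makeCases(string[] symbols, string[] kinds)
{
    string result;
    foreach (i, s; symbols)
        result ~= `case "` ~ s ~ `": return TokenType.` ~ kinds[i] ~ ";\n";
    return result;
}

TokenType lookupSymbol(string sym)
{
    switch (sym)
    {
        mixin(makeCases(["+", "++", "+="],
                        ["plus", "plusPlus", "plusAssign"]));
        default:
            assert(0, "unknown symbol: " ~ sym);
    }
}

unittest
{
    assert(lookupSymbol("+=") == TokenType.plusAssign);
}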

> >> That seems reasonable enough, but can you really store a dstring literal in a string field?
> > 
> > Yeah. Why not? The string is the same in the source code regardless of the
> > type of the literal. The only difference is the letter tacked onto the
> > end.
> > That will be turned into the appropriate string type by the semantic
> > analyzer, but the lexer doesn't care.
> 
> I don't get it. Say I have a literal with non-UTF-8 chars; how will it be stored inside the .str field as a string?

The literal is written in whatever encoding the range is in. If it's UTF-8, it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set of characters that UTF-32 can. Your range could be UTF-32, but the string literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when the literal is UTF-32. The characters themselves are in the encoding type of the range regardless. It's just the values that the compiler generates which change.

"hello world"
"hello world"c
"hello world"w
"hello world"d

are absolutely identical as far as lexing goes save for the trailing character. It would be the same regardless of the characters in the strings or the encoding used in the source file.

In either case, a lot of string literals have to be decoded (e.g. if they contain escaped characters), so you often can't create them with a slice anyway, and if a range is used which isn't one of the string types, then it's impossible for Token's value property to use the range type whenever it can't use a slice. So, it's just simplest to always use string. It may be a slight performance hit for lexing wstrings and dstrings, since they _could_ be both sliced and created as new strings (unlike other ranges), but I don't think that it's worth the extra complication to make it so that the string literal's value could be other string types, especially when lexing strings is likely to be the common case.
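To be clear about nothing being lost by storing it as a string:

import std.utf : toUTF8, toUTF32;

unittest
{
    dstring original = "héllo ∂ wörld"d;
    string  stored   = toUTF8(original);   // what would end up in the string field
    assert(toUTF32(stored) == original);   // same code points either way
}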

> > Basically, the lexer that I'm writing needs to be 100% compliant with the
> > D
> > spec and dmd (which means updating the spec or fixing dmd in some cases),
> > and it needs to be possible to build on top of it anything and everything
> > that dmd does that would use a lexer (even if it's not the way that dmd
> > currently does it) so that it's possible to build a fully compliant D
> > parser and compiler on top of it as well as a ddoc generator and anything
> > else that you'd need to do with a lexer for D. So, if you have any
> > questions about what my lexer does (or is supposed to do) with regards to
> > the spec, that should answer it. If my lexer doesn't match the spec or
> > dmd when I'm done (aside from specific exceptions relating to stuff like
> > deprecated symbols), then I screwed up.
> That's a lofty goal, but that would indeed be quite good to have an officially integrated lexer in Phobos that would (as Andrei said) "be it". The token spec would be the lexer.

Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually switch to using the lexer and parser from Phobos, and in that sort of situation, it's that much more imperative that they follow the spec exactly.

- Jonathan M Davis
August 01, 2012
"Jonathan M Davis" , dans le message (digitalmars.D:173860), a écrit :
> struct Token
> {
>  TokenType type;
>  string str;
>  LiteralValue value;
>  SourcePos pos;
> }
> 
> struct SourcePos
> {
>  size_t line;
>  size_t col;
>  size_t tabWidth = 8;
> }

The occurrence of tabWidth surprises me.
What is col supposed to be? An index (code unit), a character number (code point), or an estimate of where the character is supposed to be printed on the line, given the provided tabWidth?

I don't think the lexer can really try to calculate at what column the character is printed, since it depends on the editor (if you want to use the lexer for syntax highlighting, for example) and how it supports combining characters, zero-width or multi-column characters, etc. (which you may not want to have to decode).

You may want to provide the number of tabs met so far. Note that there are other whitespace characters that you may want to count, but you shouldn't have a very complicated SourcePos structure. It might be easier to have whitespace, end-of-line and end-of-file tokens, and let the user filter out or take into account whatever they want. Or just let the user look into the original string...

August 01, 2012
On Wednesday, August 01, 2012 07:12:30 Christophe Travert wrote:
> "Jonathan M Davis" , dans le message (digitalmars.D:173860), a écrit :
> > struct Token
> > {
> > 
> >  TokenType type;
> >  string str;
> >  LiteralValue value;
> >  SourcePos pos;
> > 
> > }
> > 
> > struct SourcePos
> > {
> > 
> >  size_t line;
> >  size_t col;
> >  size_t tabWidth = 8;
> > 
> > }
> 
> The occurrence of tabWidth surprises me.
> What is col supposed to be? An index (code unit), a character number
> (code point), or an estimate of where the character is supposed to be
> printed on the line, given the provided tabWidth?

col counts code points. tabWidth is the number of code points that '\t' is considered to be. That's it. So, in theory, an editor should be able to use it to indicate where on the line the token starts.

If the code using the lexer wants to treat tabs as having the width of a single code point, then all they have to do is pass in a SourcePos with a tabWidth of 1. But if the lexer doesn't have a way to count tabs differently, then there's no way for the code using the lexer to figure out the tabs without going back and lexing the whitespace itself. But counting tabs as a different width than everything else is so common that it seemed prudent to add it. Given that Phobos doesn't support graphemes and that ranges use code points, a code point is the closest to a character that the lexer would be counting, and it just makes sense to count code points.
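In other words, the counting amounts to something like this (sketch only):

// Column of the character that follows linePrefix, counting code points,
// with '\t' counted as a fixed tabWidth columns (1-based, like SourcePos).
size_t columnAfter(string linePrefix, size_t tabWidth = 8)
{
    size_t col = 1;
    foreach (dchar c; linePrefix) // decoded as code points
        col += (c == '\t') ? tabWidth : 1;
    return col;
}

unittest
{
    assert(columnAfter("") == 1);
    assert(columnAfter("\t") == 9);
    assert(columnAfter("ab\t", 4) == 7); // 1 + 1 + 1 + 4
}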

Now, the one thing that might be a problem with treating tabs as a fixed width is that it's not uncommon to treat tabs as being up to a particular width rather than having a fixed width, such that if there are other characters before a tab, then the tab's width is the max tab width minus the number of characters since the last tab stop. E.g. if tabs had a max width of 8, then

\t123\t

could have the first tab with a width of 8 and the second as only having a width of 5. But that's too complicated to deal with in the lexer IMHO.

Maybe the tab width isn't worth having in SourcePos and it will ultimately be removed, but it struck me as a useful feature, so I added it.

- Jonathan M Davis