August 01, 2012
On 2012-07-31 23:20, Jonathan M Davis wrote:

> I'm actually quite far along with one now - one which is specifically written
> and optimized for lexing D. I'll probably be done with it not too long after
> the 2.060 release (though we'll see). Writing it has been going surprisingly
> quickly actually, and I've already found some bugs in the spec as a result
> (some of which have been fixed, some of which I still need to create pull
> requests for). So, regardless of what happens with my lexer, at least the spec
> will be more accurate.
>
> - Jonathan M Davis
>

That's awesome. I'm really looking forward to this. Keep up the good work.

-- 
/Jacob Carlborg
August 01, 2012
On 2012-08-01 00:38, Jonathan M Davis wrote:

> I don't have the code with me at the moment, but I believe that the token type
> looks something like
>
> struct Token
> {
>   TokenType type;
>   string str;
>   LiteralValue value;
>   SourcePos pos;
> }
>
> struct SourcePos
> {
>   size_t line;
>   size_t col;
>   size_t tabWidth = 8;
> }
>

What about the end/length of a token? Token.str.length would give the number of bytes (code units?) instead of the number of characters (code points?). I'm not entirely sure what's needed when, for example, doing syntax highlighting. I assume you would know the length in characters of a given token internally inside the lexer?
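
To illustrate what I mean, here's a minimal sketch (.length counts code units, while std.range.walkLength counts code points):

import std.range : walkLength;

void main()
{
    string s = "héllo";           // 'é' is two code units in UTF-8
    assert(s.length == 6);        // code units (bytes, since string is UTF-8)
    assert(s.walkLength() == 5);  // code points
}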

-- 
/Jacob Carlborg
August 01, 2012
On 2012-08-01 07:44, Philippe Sigaud wrote:

> Does syntax highlighting need more than a token stream? Without having
> thought a lot about it, it seems to me IDE tend to highlight based
> just on the token type, not on a parse tree. So that means your lexer
> can be used directly by interested people, that's nice.

Some IDEs do more advanced syntax highlighting based on semantic analysis. For example, Eclipse highlights instance variables differently.

-- 
/Jacob Carlborg
August 01, 2012
On Wednesday, August 01, 2012 10:25:18 Jacob Carlborg wrote:
> On 2012-08-01 00:38, Jonathan M Davis wrote:
> > I don't have the code with me at the moment, but I believe that the token type looks something like
> > 
> > struct Token
> > {
> >   TokenType type;
> >   string str;
> >   LiteralValue value;
> >   SourcePos pos;
> > }
> > 
> > struct SourcePos
> > {
> >   size_t line;
> >   size_t col;
> >   size_t tabWidth = 8;
> > }
> 
> What about the end/length of a token? Token.str.length would give the number of bytes (code units?) instead of the number of characters (code points?). I'm not entirely sure what's needed when, for example, doing syntax highlighting. I assume you would know the length in characters of a given token internally inside the lexer?

I'm not sure. I don't think so. It doesn't really keep track of code points. It operates on code units as much as possible, and pos doesn't really help, because any newline inside the token would make subtracting the start col from the end col completely bogus (tabs would mess that up pretty thoroughly too, though as Christophe pointed out, the whole tabWidth thing may not have been a good idea anyway).

It could certainly be added, but unless the lexer already knows it (and I'm pretty sure that it doesn't), keeping track of that entails extra overhead. Maybe it's worth that overhead, though. I'll have to look at what I have and see. Worst case, the caller can just use walkLength on str, but if it has to do that all the time, that's not exactly conducive to good performance.
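
To be concrete, "adding it" would just mean an extra field along these lines (hypothetical, not what the code currently has):

struct Token
{
    TokenType type;
    string str;
    LiteralValue value;
    SourcePos pos;
    size_t charCount;  // length of str in code points, tracked while lexing
}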

- Jonathan M Davis
August 01, 2012
On 2012-08-01 08:11, Jonathan M Davis wrote:

> I'm not using regexes at all. It's using string mixins to reduce code
> duplication, but it's effectively hand-written. If I do it right, it should be
> _very_ difficult to make it any faster than it's going to be. It even
> specifically avoids decoding unicode characters and operates on ASCII
> characters as much as possible.

That's a good idea. Most code can be treated as ASCII (I assume most people code in English). It would basically only be string literals that contain characters outside the ASCII table.
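
About the string mixins: just to check that I follow, I imagine something in this spirit (completely made up, not your code):

import std.stdio;

enum TokenType { plus, minus, star }

// Single-character lexemes and their corresponding token names.
enum chars = "+-*";
enum names = ["plus", "minus", "star"];

// Build the switch cases as one string at compile time.
string makeCases()
{
    string s;
    foreach (i, c; chars)
        s ~= "case '" ~ c ~ "': return TokenType." ~ names[i] ~ ";\n";
    return s;
}

TokenType lexSingleChar(char c)
{
    switch (c)
    {
        mixin(makeCases());
        default: assert(0, "not a single-character token");
    }
}

void main()
{
    writeln(lexSingleChar('+'));  // prints: plus
}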

BTW, have you seen this:

http://woboq.com/blog/utf-8-processing-using-simd.html

-- 
/Jacob Carlborg
August 01, 2012
On 2012-08-01 08:39, Jonathan M Davis wrote:

> Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be
> in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually
> switch to using the lexer and parser from Phobos, and in that sort of
> situation, it's that much more imperative that they follow the spec exactly.

:)

-- 
/Jacob Carlborg
August 01, 2012
On Wednesday, August 01, 2012 11:14:52 Jacob Carlborg wrote:
> On 2012-08-01 08:11, Jonathan M Davis wrote:
> > I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.
> 
> That's a good idea. Most code can be treated as ASCII (I assume most people code in English). It would basically only be string literals that contain characters outside the ASCII table.

What's of particular importance is the fact that _all_ of the language constructs are ASCII. So, Unicode comes in exclusively with identifiers, string literals, char literals, and whitespace. And with those, ASCII is still going to be far more common, so coding it in a way that makes ASCII faster at a slight cost to performance for Unicode is perfectly acceptable.
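
To give a rough idea of what I mean (just a sketch, not the actual code), the hot path checks for ASCII and only decodes when it has to:

import std.uni : isAlpha;
import std.utf : decode;

// Sketch only: is the code unit at source[index] the start of an identifier
// character?
bool isIdentChar(string source, size_t index)
{
    immutable char c = source[index];
    if (c < 0x80)  // ASCII fast path: no decoding needed
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9');
    // Non-ASCII: decode the full code point and hit the Unicode tables.
    immutable dchar d = decode(source, index);  // index is a local copy here
    return isAlpha(d);  // rough stand-in for the spec's universal alphas
}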

> BTW, have you seen this:
> 
> http://woboq.com/blog/utf-8-processing-using-simd.html

No, I'll have to take a look. I know pretty much nothing about SIMD though. I've only heard of it, because Walter implemented some SIMD stuff in dmd not too long ago.

- Jonathan M Davis
August 01, 2012
On 2012-08-01 10:40, Jonathan M Davis wrote:

> It could certainly be added, but unless the lexer always knows it (and I'm
> pretty sure that it doesn't), then keeping track of that entails extra
> overhead. But maybe it's worth that overhead. I'll have to look at what I have
> and see. Worst case, the caller can just use walkLength on str, but if it has
> to do that all the time, then that's not exactly conducive to good
> performance.

Writing a syntax highlighter that calculates the length (walkLength) for each token would most likely slow it down a bit.

-- 
/Jacob Carlborg
August 01, 2012
On 2012-07-31 23:20, Jonathan M Davis wrote:

> I'm actually quite far along with one now - one which is specifically written
> and optimized for lexing D. I'll probably be done with it not too long after
> the 2.060 release (though we'll see). Writing it has been going surprisingly
> quickly actually, and I've already found some bugs in the spec as a result
> (some of which have been fixed, some of which I still need to create pull
> requests for). So, regardless of what happens with my lexer, at least the spec
> will be more accurate.

BTW, do you have the code online somewhere?

-- 
/Jacob Carlborg
August 01, 2012
On Wed, Aug 1, 2012 at 8:39 AM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> It was never intended to be even vaguely generic. It's targeting D specifically. If someone can take it and make it generic when I'm done, then great. But its goal is to lex D as efficiently as possible, and it'll do whatever it takes to do that.

That's exactly what I had in mind. Anyway, we need a D lexer. We also need a generic lexer generator, but that's a distant second priority, and we can accept it being less efficient. Of course, any trick used in the D lexer can most probably be reused for Algol-family lexers.


>> I don't get it. Say I have a literal with non-UTF-8 chars, how will it be stored inside the .str field as a string?
>
> The literal is written in whatever encoding the range is in. If it's UTF-8, it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set of characters that UTF-32 can. Your range could be UTF-32, but the string literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when the literal is UTF-32. The characters themselves are in the encoding type of the range regardless. It's just the values that the compiler generates which change.
>
> "hello world"
> "hello world"c
> "hello world"w
> "hello world"d
>
> are absolutely identical as far as lexing goes save for the trailing character. It would be the same regardless of the characters in the strings or the encoding used in the source file.

Every time I think I understand D strings, you prove me wrong. So, I *still* don't get how that works:

say I have

auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d;

Then, the "..." part is lexed as a string literal. How can the string field in the Token magically contain UTF-32 characters? Or are they automatically cut into four nonsense chars each? What about comments containing non-ASCII chars? How can code coming after the lexer make sense of it?

As Jacob says, many people code in English. That's right, but

1- they most probably use their own language for internal documentation
2- any i18n part of a code base will have non-ASCII chars
3- D is supposed to accept UTF-16 and UTF-32 source code.

So, wouldn't it make sense to at least provide an option on the lexer to specifically store identifier lexemes and comments as a dstring?
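
Right now, I guess a consumer that really wants UTF-32 has to transcode by hand, something like this sketch (std.conv.to does the transcoding):

import std.conv : to;

void main()
{
    // The lexer can slice the UTF-8 source directly; non-ASCII characters
    // are just multi-byte sequences inside the string.
    string str = "αβγ 漢字";
    assert(str.length == 13);      // UTF-8 code units
    // Transcode on demand when UTF-32 is really needed:
    dstring dstr = str.to!dstring;
    assert(dstr.length == 6);      // one code unit per code point in UTF-32
}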