September 12, 2013
On Wednesday, 11 September 2013 at 20:28:06 UTC, H. S. Teoh wrote:
> On Wed, Sep 11, 2013 at 10:18:12PM +0200, Dicebot wrote:
>> On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:
>> >On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
>> >>On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
>> >>>I disagree. I think it's more readable to use a consistent prefix,
>> >>>like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's
>> >>>clear you're referring to token types, not the actual keyword.
>> >>
>> >>Not unless you want to change the style guide and break existing
>> >>Phobos code ;)
>> >
>> >How would that break Phobos code? Phobos code doesn't even use
>> >std.d.lexer right now.
>> 
>> Phobos code must conform to its style guide. You can't change it
>> without changing existing Phobos code that relies on it.
>> Inconsistent style is the worst of all options.
>
> This doesn't violate Phobos style guidelines:
>
> 	enum TokenType {
> 		kwInt,
> 		kwFloat,
> 		kwDouble,
> 		...
> 		kwFunction,
> 		kwScope,
> 		... // etc.
> 	}
>

Int, Function, Scope, Import are all valid identifiers.

Random abbreviations like kw are really bad. It is even worse when they don't make anything shorter.
September 12, 2013
On 9/11/2013 6:30 PM, deadalnix wrote:
> Indeed. What solution do you have in mind?

The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
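
A rough sketch of that kind of layer, for illustration only. This is not dmd's actual code; `Token`, `Lookahead`, `peek` and `next` are made-up names, and the nodes are simply left to the GC. Tokens pulled from the lexer are parked in a singly linked list until the parser consumes them, so each token is lexed only once no matter how far ahead the parser looks:

```d
import std.range.primitives : isInputRange, ElementType, front, popFront, empty;

// Hypothetical token type, for illustration only.
struct Token { string text; }

// Lookahead layer over any input range of tokens (the "lexer").
struct Lookahead(R) if (isInputRange!R)
{
    private static struct Node { ElementType!R value; Node* next; }

    private R source;   // the underlying lexer
    private Node* head; // oldest buffered token (the current one)
    private Node* tail; // newest buffered token

    this(R source) { this.source = source; }

    // Pointer to the token n positions ahead (0 = current token),
    // or null if the input ends before that position.
    ElementType!R* peek(size_t n = 0)
    {
        if (head is null && !fill()) return null;
        Node* node = head;
        foreach (_; 0 .. n)
        {
            if (node.next is null && !fill()) return null;
            node = node.next;
        }
        return &node.value;
    }

    // Consumes and returns the current token; must not be called at end of input.
    ElementType!R next()
    {
        auto p = peek();
        assert(p !is null, "no tokens left");
        auto value = *p;
        head = head.next;
        if (head is null) tail = null;
        return value;
    }

    // Pulls one token from the lexer and appends it to the list.
    private bool fill()
    {
        if (source.empty) return false;
        auto node = new Node(source.front, null);
        source.popFront();
        if (tail is null) head = node; else tail.next = node;
        tail = node;
        return true;
    }
}
```

With something like this, `peek(2)` looks two tokens past the current one without consuming anything, and a later `next()` hands back the already-buffered token instead of lexing again.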
September 12, 2013
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:
> On 9/11/2013 6:30 PM, deadalnix wrote:
>> Indeed. What solution do you have in mind?
>
> The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.

But then you have an extra step when looking up every token, plus memory management overhead. How big is the performance improvement?
September 12, 2013
On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
> Int, Function, Scope, Import are all valid identifiers.

All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H.S. Teoh would consider that a negative due to possible confusion with the keyword).

- Jonathan M Davis
September 12, 2013
On 09/12/2013 03:30 AM, deadalnix wrote:
>>
>> That's correct, but that implies re-lexing the tokens, which has
>> negative performance implications.
>
> Indeed. What solution do you have in mind?

Buffering the tokens would work. There are some ways to promote input ranges to forward ranges. But there are also some pitfalls, like the implicit save on copy.

I have two prototypes for a generic input range buffer.

https://gist.github.com/dawgfoto/2187220 - uses growing ring buffer
https://gist.github.com/dawgfoto/1257196 - uses ref counted lookahead buffers in a singly linked list

The lexer itself has a ring buffer for input ranges.
https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2278
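
To make the ring buffer idea concrete, here is a minimal sketch. It is not the code from either gist or from the lexer; `RingLookahead`, `has`, `peek` and `pop` are hypothetical names. It keeps the not-yet-consumed elements of an input range in an array used as a ring and doubles the array when the lookahead outgrows it. Copying is disabled because a copy would alias the buffer, which is one blunt way around the save-on-copy pitfall:

```d
import std.range.primitives : isInputRange, ElementType, front, popFront, empty;

// Hypothetical growing ring buffer that adds lookahead to an input range.
struct RingLookahead(R) if (isInputRange!R)
{
    private R source;
    private ElementType!R[] buf;
    private size_t start; // index of the current element in buf
    private size_t count; // number of buffered elements

    @disable this(this);  // a copy would share buf, so forbid copying

    this(R source, size_t capacity = 4)
    {
        assert(capacity > 0, "capacity must be positive");
        this.source = source;
        buf = new ElementType!R[](capacity);
    }

    // True if an element exists n positions ahead (0 = current element).
    // Buffers elements from the source as needed.
    bool has(size_t n = 0)
    {
        while (count <= n)
        {
            if (source.empty) return false;
            if (count == buf.length) grow();
            buf[(start + count) % buf.length] = source.front;
            source.popFront();
            ++count;
        }
        return true;
    }

    // Element n positions ahead; the caller should check has(n) first.
    ElementType!R peek(size_t n = 0)
    {
        immutable ok = has(n);
        assert(ok, "lookahead past end of input");
        return buf[(start + n) % buf.length];
    }

    // Drops the current element.
    void pop()
    {
        immutable ok = has(0);
        assert(ok, "pop on empty input");
        start = (start + 1) % buf.length;
        --count;
    }

    // Doubles the capacity, unwrapping buffered elements to the front.
    private void grow()
    {
        auto bigger = new ElementType!R[](buf.length * 2);
        foreach (i; 0 .. count)
            bigger[i] = buf[(start + i) % buf.length];
        buf = bigger;
        start = 0;
    }
}
```

Compared with a linked list, this trades per-token allocations for an occasional reallocation and copy when the lookahead distance exceeds the current capacity.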
September 12, 2013
On 09/12/2013 03:39 AM, Walter Bright wrote:
> On 9/11/2013 6:30 PM, deadalnix wrote:
>> Indeed. What solution do you have in mind?
>
> The solution dmd uses is to put in an intermediary layer that saves the
> lookahead tokens in a linked list.

Linked list sounds bad.
Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.
September 12, 2013
Jonathan M Davis wrote:

> You have to look ahead to figure out whether it's .. or a floating point literal.

This lookahead is introduced by using a petty grammar.

Please consider that lexing searches for the leftmost longest match in
the rest of the input. This means that introducing a pattern like
 `<int>\.\.'    	return TokenType.INTDOTDOT;
would eliminate the lookahead in the lexer.

In the parser an additional rule then has to be added:
  <range> ::= INT DOTDOT INT
           |  INTDOTDOT INT
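
A small sketch of how that could look in a hand-written lexer, under simplifying assumptions: only decimal digits are handled, the token names follow the rules above, and `lexNumber` and `isRange` are made-up helper names. The point is that the scanner only ever moves forward. After reading `<int>.` it either sees a second `.` and accepts INTDOTDOT, or carries on as a float, so it never has to un-read a character:

```d
import std.ascii : isDigit;

// Token kinds named after the rules above (hypothetical enum).
enum TokenType { INT, FLOAT, DOTDOT, INTDOTDOT }

struct Token { TokenType type; string text; }

// Lexes one token from the start of src, which must begin with a digit.
// Characters are consumed strictly left to right, with no backing up.
Token lexNumber(string src)
{
    assert(src.length > 0 && isDigit(src[0]), "expected a digit");
    size_t i = 0;
    while (i < src.length && isDigit(src[i])) ++i;          // <int>
    if (i == src.length || src[i] != '.')
        return Token(TokenType.INT, src[0 .. i]);
    ++i;                                                    // first '.'
    if (i < src.length && src[i] == '.')
        return Token(TokenType.INTDOTDOT, src[0 .. i + 1]); // <int>\.\.
    while (i < src.length && isDigit(src[i])) ++i;          // fraction digits
    return Token(TokenType.FLOAT, src[0 .. i]);
}

// <range> ::= INT DOTDOT INT | INTDOTDOT INT
bool isRange(const Token[] toks)
{
    if (toks.length == 3)
        return toks[0].type == TokenType.INT
            && toks[1].type == TokenType.DOTDOT
            && toks[2].type == TokenType.INT;
    if (toks.length == 2)
        return toks[0].type == TokenType.INTDOTDOT
            && toks[1].type == TokenType.INT;
    return false;
}
```

This deliberately ignores the other things that can follow an integer and a dot in D, so it is only meant to show the shape of the idea.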

-manfred

September 12, 2013
Brian Schott wrote:

>>> > Parsing D requires arbitrary lookahead.
> Yeah. D requires lookahead in both lexing and parsing.

Walter wrote about _arbitrary_ lookahead, i.e. unlimited lookahead.

-manfred
September 12, 2013
On Wed, Sep 11, 2013 at 10:06:11PM -0400, Jonathan M Davis wrote:
> On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
> > Int, Function, Scope, Import are all valid identifiers.
> 
> All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H.S. Teoh would consider that a negative due to possible confusion with the keyword).
[...]

Actually, the main issue I have is that some of the enum values end with _ while others don't. This is inconsistent. I'd rather have consistency than superficial resemblance to the keywords as typed. Either *all* of the enum values should end with _, or *none* of them should.  Having a mixture of both is an eyesore, and leads to people wondering, should I add a _ at the end or not?

If people insist that the 'default' keyword absolutely must be represented as TokenType.default_ (I really don't see why), then *all* TokenType values should end with _. But honestly, I find that really ugly. Writing something like kwDefault, or tokenTypeDefault, would be far better.

Sigh, Andrei was right. Once the bikeshed is up for painting, even the rainbow won't suffice. :-P


T

-- 
MASM = Mana Ada Sistem, Man!
September 12, 2013
On Thu, Sep 12, 2013 at 03:17:06AM +0200, Brian Schott wrote:
> On Thursday, 12 September 2013 at 00:13:36 UTC, H. S. Teoh wrote:
> >But then the code example proceeds to pass byLine() to it. Is that correct? If it is, then the docs need to be updated, because last time I checked, byLine() isn't a range of char, but a range of char *arrays*.
> >
> >
> >T
> 
> The example doesn't pass the result of byLine to the byToken function directly.

*facepalm* You're right, it's calling join() on it. Never mind what I
said then. :-P  Sorry for all the unnecessary noise, I don't know what
got into me that I didn't see the join().


T

-- 
May you live all the days of your life. -- Jonathan Swift