October 05, 2013
On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu wrote:
> Vision
> ======
>
> I'd been following the related discussions for a while, but I made up my mind today as I was working on a C++ lexer. The C++ lexer is for Facebook's internal linter; I'm translating it from C++.
>
> Before long I realized two simple things. First, I can't reuse anything from Brian's code (without copying it and doing surgery on it), although it is extremely similar to what I'm doing.
>
> Second, I figured that it is almost trivial to implement a simple, generic, and reusable (across languages and tasks) static trie searcher that takes a compile-time array with all tokens and keywords and returns the token at the front of a range with minimum comparisons.
>
> Such a trie searcher is not intelligent, but is very composable and extremely fast. It is just smart enough to do maximum munch (e.g. interprets "==" and "foreach" as one token each, not two), but is not smart enough to distinguish an identifier "whileTrue" from the keyword "while" (it claims "while" was found and stops right at the beginning of "True" in the stream). This is for generality so applications can define how identifiers work (e.g. Lisp allows "-" in identifiers but D doesn't etc). The trie finder doesn't do numbers or comments either. No regexen of any kind.
>
> The beauty of it all is that all of these more involved bits (many of which are language specific) can be implemented modularly and trivially as a postprocessing step after the trie finder. For example the user specifies "/*" as a token to the trie finder. Whenever a comment starts, the trie finder will find and return it; then the user implements the alternate grammar of multiline comments.
>
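
Something along these lines, I imagine (matchFront and dTokens are made-up names, and a longest-first linear scan stands in for the actual trie, which would share common prefixes instead of scanning the whole list):

import std.algorithm : startsWith;

// Returns the longest token from `tokens` matching a prefix of `s`,
// or null if none matches (maximum munch).
string matchFront(string[] tokens)(string s)
{
    // Order the token list once at compile time, longest first,
    // so the first hit is automatically the maximal munch.
    static immutable string[] byLength = {
        auto t = tokens.dup;
        // simple insertion sort by descending length (runs in CTFE)
        foreach (i; 1 .. t.length)
            for (auto j = i; j > 0 && t[j].length > t[j - 1].length; --j)
            {
                auto tmp = t[j]; t[j] = t[j - 1]; t[j - 1] = tmp;
            }
        return t;
    }();
    foreach (tok; byLength)
        if (s.startsWith(tok))
            return tok;
    return null;
}

enum dTokens = ["=", "==", "while", "foreach", "/*"];

void main()
{
    assert(matchFront!dTokens("== 1") == "==");         // "==" is one token, not two "="s
    assert(matchFront!dTokens("whileTrue") == "while"); // stops right before "True"
    assert(matchFront!dTokens("/* comment") == "/*");
}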

That is more or less how SDC's lexer works. You pass it two AAs: one mapping strings to token types, and one mapping strings to the names of functions that return the actual token (for instance, to handle /*). Finally, you pass a fallback for when nothing matches.

A giant three-headed monster of a mixin is generated from this data.

That has been really handy so far.
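
Roughly the shape of it, with made-up names (genLexerBody, lexBlockComment, fallback); the real generator is far more involved:

import std.conv : to;

enum TokenType { Eq, EqEq, BlockComment, Invalid }

// Simple cases: string to token type.
enum operators = ["=" : TokenType.Eq, "==" : TokenType.EqEq];
// Involved cases: string to the name of the function that lexes the rest.
enum handlers = ["/*" : "lexBlockComment"];

// Build one big switch over all known prefixes at compile time.
string genLexerBody()
{
    string code = "switch (prefix) {\n";
    foreach (s, t; operators)
        code ~= `case "` ~ s ~ `": return TokenType.` ~ t.to!string ~ ";\n";
    foreach (s, fun; handlers)
        code ~= `case "` ~ s ~ `": return ` ~ fun ~ "(input);\n";
    return code ~ "default: return fallback(input);\n}";
}

TokenType lexBlockComment(string input) { return TokenType.BlockComment; } // stub
TokenType fallback(string input) { return TokenType.Invalid; }             // stub

TokenType lex(string prefix, string input)
{
    mixin(genLexerBody());
}

void main()
{
    assert(lex("==", "") == TokenType.EqEq);
    assert(lex("/*", " comment */") == TokenType.BlockComment);
    assert(lex("@@", "") == TokenType.Invalid);
}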

> If what we need at this point is a conventional lexer for the D language, std.d.lexer is the ticket. But I think it wouldn't be difficult to push our ambitions way beyond that. What say you?
>

Yup, I do agree.
October 05, 2013
On 10/4/2013 5:24 PM, Andrei Alexandrescu wrote:
> Such a trie searcher is not intelligent, but is very composable and extremely
> fast.

Well, boys, I reckon this is it — benchmark combat toe to toe with the cooders. Now look, boys, I ain't much of a hand at makin' speeches, but I got a pretty fair idea that something doggone important is goin' on around there. And I got a fair idea the kinda personal emotions that some of you fellas may be thinkin'. Heck, I reckon you wouldn't even be human bein's if you didn't have some pretty strong personal feelin's about benchmark combat. I want you to remember one thing, the folks back home is a-countin' on you and by golly, we ain't about to let 'em down. I tell you something else, if this thing turns out to be half as important as I figure it just might be, I'd say that you're all in line for some important promotions and personal citations when this thing's over with. That goes for ever' last one of you regardless of your race, color or your creed. Now let's get this thing on the hump - we got some benchmarkin' to do.
October 05, 2013
On 2013-10-05 02:24, Andrei Alexandrescu wrote:

> Thanks all involved for the work, first of all Brian.
>
> I have the proverbial good news and bad news. The only bad news is that
> I'm voting "no" on this proposal.
>
> [Snip]

Is this something in between a hand-written lexer and an automatically generated one?

I think we can have both: a hand-written lexer specifically targeted at D that is very fast, and a more general lexer that can be used for many languages.

I have to say I think it's a bit unfair to dump this huge thing in the voting thread. You haven't made a single post in the discussion thread, and now you're coming in with this big suggestion in the voting thread.

-- 
/Jacob Carlborg
October 05, 2013
On 10/05/13 13:45, Jacob Carlborg wrote:
> I think we can have both: a hand-written lexer specifically targeted at D that is very fast, and a more general lexer that can be used for many languages.

The assumption that a hand-written lexer will be much faster than a generated
one is wrong.
If there's any significant perf difference, it's just a matter of improving
the generator. An automatically generated lexer will be much more flexible (the
source spec can be reused without a single modification for anything from an
intelligent LOC-like counter or a syntax highlighter to a compiler), easier to
maintain/review, and less buggy.

Compare the perf numbers previously posted here for the various lexers with:

$ time ./tokenstats stats std/datetime.d
Lexed 1589336 bytes, found 461315 tokens, 13770 keywords, 65946 identifiers.
Comments:  Line: 958 @ ~40.16  Block: 1 @ ~16  Nesting: 534 @ ~441.7 [count @ avg_len]
0m0.010s user   0m0.001s system   0m0.011s elapsed   99.61% CPU
$ time ./tokenstats dump-no-io std/datetime.d
0m0.013s user   0m0.001s system   0m0.014s elapsed   99.78% CPU

'tokenstats' is built from a PEG-like spec plus a bit of CT magic. The generator
supports inline rules written in D too, but the only ones actually written in D
are for defining what an identifier is, matching EOLs and handling DelimitedStrings.
Initially, performance was not a consideration at all, and there's some very low
hanging fruit in there; there's still room for improvement.
Unfortunately, the language and compiler situation has prevented me from doing
any work on this for the last half year or so. The code won't work with any
current compiler and needs a lot of cleanups (which I have been planning to do
/after/ updating the tooling, which seems very unlikely to be possible now), hence
it's not in a releasable state. [1]

artur

[1] If anyone wants to play with it, use as a reference etc and isn't
    afraid of running a binary, a linux x86 one can be gotten from
    http://d-h.st/xtX
    The only really useful functionality is 'tokenstats dump file.d',
    which will dump all found tokens with line and column numbers.
    It's just a tool I've been using for identifying regressions and benchmarking.
October 05, 2013
Jacob Carlborg <doob@me.com> wrote:
> On 2013-10-05 02:24, Andrei Alexandrescu wrote:
> 
>> Thanks all involved for the work, first of all Brian.
>> 
>> I have the proverbial good news and bad news. The only bad news is that I'm voting "no" on this proposal.
>> 
>> [Snip]
> 
> Is this something in between a hand-written lexer and an automatically generated one?

I don't understand this question.

> I think we can have both: a hand-written lexer specifically targeted at
> D that is very fast, and a more general lexer that can be used for many languages.

I agree with Artur that this is a fallacy.

> I have to say I think it's a bit unfair to dump this huge thing in the voting thread. You haven't made a single post in the discussion thread, and now you're coming in with this big suggestion in the voting thread.

The way I see it, it's unfair of you to claim that. All I did was vote and explain that vote. I was very explicit that I don't want to pull rank or anything. Besides, it was an idea, and such things are hard to time.

I think std.d.lexer is a fine product that works as advertised. But I also believe very strongly that it doesn't exploit D's advantages and that adopting it would lock us into a suboptimal API. I have strengthened this opinion only since yesterday morning.


Andrei
October 06, 2013
On Saturday, 5 October 2013 at 11:45:47 UTC, Jacob Carlborg wrote:
> On 2013-10-05 02:24, Andrei Alexandrescu wrote:
>
>> Thanks all involved for the work, first of all Brian.
>>
>> I have the proverbial good news and bad news. The only bad news is that
>> I'm voting "no" on this proposal.
>>
>> [Snip]
>
> Is this something in between a hand-written lexer and an automatically generated one?
>
> I think we can have both: a hand-written lexer specifically targeted at D that is very fast, and a more general lexer that can be used for many languages.
>
> I have to say I think it's a bit unfair to dump this huge thing in the voting thread. You haven't made a single post in the discussion thread, and now you're coming in with this big suggestion in the voting thread.

I asked the same question about supporting any grammar, not just the D grammar, but Brian did not respond:
http://forum.dlang.org/post/itlyubosepuqcchhuwdh@forum.dlang.org
October 06, 2013
Yes


I see the token type discussion as a matter of taste and one that could cleanly be changed via a deprecation path (I wouldn't mind the Tok!str approach, though). I also see no fundamental reason why the API forbids extension for shared string tables or table-less lexing. And pure implementation details (IMO) shouldn't be a primary voting concern. Much more important is that we don't defer this for another year or more for no good reason (changing pure implementation details, or purely extending the API, are not good reasons when there is a solid implementation/API already).

(sorry for the additional rationale)
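
For illustration, a Tok!str-style interface could look something like this; a hypothetical sketch, not the proposed std.d.lexer API:

// Each distinct token spelling gets its own type, so code can dispatch
// on token kinds at compile time instead of via runtime enum values.
struct Tok(string s)
{
    enum spelling = s;
}

void main()
{
    Tok!"while" w;
    static assert(w.spelling == "while");
    // Distinct spellings are distinct types:
    static assert(!is(typeof(w) == Tok!"foreach"));
}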
October 06, 2013
On 2013-10-05 19:52, Artur Skawina wrote:

> The assumption that a hand-written lexer will be much faster than a generated
> one is wrong.

I never said that the generated one would be slow. I only said that the hand-written one would be fast :)

-- 
/Jacob Carlborg
October 06, 2013
On 2013-10-05 20:45, Andrei Alexandrescu wrote:

> I don't understand this question.
>
>> I think we can have both: a hand-written lexer specifically targeted at
>> D that is very fast, and a more general lexer that can be used for many languages.
>
> I agree with Artur that this is a fallacy.

I never said that the generated one would be slow. I only said that the hand-written one would be fast :)

>> I have to say I think it's a bit unfair to dump this huge thing in the
>> voting thread. You haven't made a single post in the discussion thread,
>> and now you're coming in with this big suggestion in the voting thread.
>
> The way I see it, it's unfair of you to claim that. All I did was vote
> and explain that vote. I was very explicit that I don't want to pull rank
> or anything. Besides, it was an idea, and such things are hard to time.
>
> I think std.d.lexer is a fine product that works as advertised. But I also
> believe very strongly that it doesn't exploit D's advantages and that
> adopting it would lock us into a suboptimal API. I have strengthened this
> opinion only since yesterday morning.

I just think that if you were not completely satisfied with the current API or implementation, you could have said so in the discussion thread. It would at least have given Brian a chance to do something about it before the voting began.

-- 
/Jacob Carlborg
October 06, 2013
On 2013-10-05 02:24, Andrei Alexandrescu wrote:

> Such a trie searcher is not intelligent, but is very composable and
> extremely fast. It is just smart enough to do maximum munch (e.g.
> interprets "==" and "foreach" as one token each, not two), but is not
> smart enough to distinguish an identifier "whileTrue" from the keyword
> "while" (it claims "while" was found and stops right at the beginning of
> "True" in the stream). This is for generality so applications can define
> how identifiers work (e.g. Lisp allows "-" in identifiers but D doesn't
> etc). The trie finder doesn't do numbers or comments either. No regexen
> of any kind.

Would it be able to lex Scala and Ruby? Method names in Scala can contain many symbols that are not usually allowed in other languages; you can have a method named "==". In Ruby, method names are allowed to end with "=", "?" or "!".
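
Presumably the applications would plug in their own identifier rules, something like this (identifierLength and both predicates are made-up names, and the Ruby rule is simplified, since Ruby only allows "=", "?" and "!" at the end of a name):

import std.algorithm : canFind;
import std.ascii : isAlphaNum;

// Each language supplies its own "may this character continue an identifier?" rule.
bool dIdentChar(dchar c)    { return isAlphaNum(c) || c == '_'; }
bool rubyIdentChar(dchar c) { return isAlphaNum(c) || c == '_' || "=?!".canFind(c); }

// Length of the identifier at the front of `s`, under the given rule
// (ASCII-only for brevity).
size_t identifierLength(alias identChar)(string s)
{
    size_t n;
    while (n < s.length && identChar(s[n]))
        ++n;
    return n;
}

void main()
{
    assert(identifierLength!dIdentChar("empty?") == 5);    // D stops before '?'
    assert(identifierLength!rubyIdentChar("empty?") == 6); // Ruby takes the '?'
}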

-- 
/Jacob Carlborg