October 06, 2013 Re: std.d.lexer : voting thread
Posted in reply to Jacob Carlborg

On Sunday, 6 October 2013 at 08:59:57 UTC, Jacob Carlborg wrote:
> I just think that if you were not completely satisfied with the current API or implementation you could have said so in the discussion thread. It would have at least given Brian a chance to do something about it, before the voting began.
Maybe we went to the vote too fast, and somebody did not have enough time to read the documentation and write an opinion?
Maybe we should wait at least 1-2 weeks after the last review before starting a new vote? Maybe we should announce an upcoming vote one week before starting the new voting thread? I believe that would draw additional attention to the new module and help avoid situations like this.
October 06, 2013 Re: std.d.lexer : voting thread
Posted in reply to Andrei Alexandrescu

On 05.10.2013 02:24, Andrei Alexandrescu wrote:
> Instead of associating token types with small integers, we associate
> them with string addresses. (For efficiency we may use pointers to
> zero-terminated strings, but I don't think that's necessary).

Would it also be more efficient to generate one big string out of the token list, containing all tokens concatenated, and use generated string slices for the associated string accesses?

immutable string generated_flat_token_stream = "...publicprivateclass...";
"public" = generated_flat_token_stream[3..9]

Or would that kill caching on today's machines?
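The suggestion above can be sketched in D as follows (a minimal illustration; the names and the example offsets are hypothetical, not from any actual proposal):

```d
// Minimal sketch of the flat-token-string idea.
// All tokens are concatenated into one immutable string; each token is
// then a slice into that string, so no separate allocations are needed
// and all token text lives in one contiguous block of memory.
immutable string flatTokens = "publicprivateclass";

immutable string tokPublic  = flatTokens[0 .. 6];   // "public"
immutable string tokPrivate = flatTokens[6 .. 13];  // "private"
immutable string tokClass   = flatTokens[13 .. 18]; // "class"

void main()
{
    assert(tokPublic == "public");
    assert(tokPrivate == "private");
    // The slices share the flat string's memory rather than copying it.
    assert(tokPublic.ptr is flatTokens.ptr);
}
```

Whether this wins over separate string literals would come down to cache behavior, as the poster asks; the slices themselves are just (pointer, length) pairs either way.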
October 06, 2013 Re: std.d.lexer : voting thread
Posted in reply to Andrei Alexandrescu

On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu wrote:
> On 10/2/13 7:41 AM, Dicebot wrote:
>> After brief discussion with Brian and gathering data from the review
>> thread, I have decided to start voting for `std.d.lexer` inclusion into
>> Phobos.
>
> Thanks all involved for the work, first of all Brian.
>
> I have the proverbial good news and bad news. The only bad news is that I'm voting "no" on this proposal.
>
> But there's plenty of good news.
>
> 1. I am not attempting to veto this, so just consider it a normal vote when tallying.
>
> 2. I do vote for inclusion in the /etc/ package for the time being.
>
> 3. The work is good and the code valuable, so even in the case my suggestions (below) are followed, virtually all the code pulp that gets work done can be reused.
>
> Vision
> ======
>
> I'd been following the related discussions for a while, but I made up my mind today as I was working on a C++ lexer. The C++ lexer is for Facebook's internal linter. I'm translating the lexer from C++.
>
> Before long I realized two simple things. First, I can't reuse anything from Brian's code (without copying it and doing surgery on it), although it is extremely similar to what I'm doing.
>
> Second, I figured that it is almost trivial to implement a simple, generic, and reusable (across languages and tasks) static trie searcher that takes a compile-time array with all tokens and keywords and returns the token at the front of a range with minimum comparisons.
>
> Such a trie searcher is not intelligent, but is very composable and extremely fast. It is just smart enough to do maximum munch (e.g. interprets "==" and "foreach" as one token each, not two), but is not smart enough to distinguish an identifier "whileTrue" from the keyword "while" (it claims "while" was found and stops right at the beginning of "True" in the stream). This is for generality so applications can define how identifiers work (e.g. Lisp allows "-" in identifiers but D doesn't etc). The trie finder doesn't do numbers or comments either. No regexen of any kind.
>
> The beauty of it all is that all of these more involved bits (many of which are language specific) can be implemented modularly and trivially as a postprocessing step after the trie finder. For example the user specifies "/*" as a token to the trie finder. Whenever a comment starts, the trie finder will find and return it; then the user implements the alternate grammar of multiline comments.
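As a sketch of that postprocessing step (hypothetical D code, not from the proposed module or from Andrei's implementation): once the trie finder reports a "/*" token, the caller consumes the alternate comment grammar by hand:

```d
import std.string : indexOf;

// Hypothetical postprocessing step: the trie finder has just reported "/*",
// so `input` starts right after the comment opener. Consume up to "*/".
// A real lexer would also count newlines and report unterminated comments
// instead of asserting.
string skipBlockComment(ref string input)
{
    auto end = input.indexOf("*/");
    assert(end >= 0, "unterminated block comment");
    auto text = input[0 .. end];
    input = input[end + 2 .. $]; // resume lexing after "*/"
    return text;
}

unittest
{
    string src = " a comment */int x;";
    assert(skipBlockComment(src) == " a comment ");
    assert(src == "int x;");
}
```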
>
> To encode the tokens returned by the trie, we must do away with definitions such as
>
> enum TokenType : ushort { invalid, assign, ... }
>
> These are fine for a tokenizer written in C, but are needless duplication from a D perspective. I think a better approach is:
>
> struct TokenType {
> string symbol;
> ...
> }
>
> TokenType tok(string s)() {
> static immutable string interned = s;
> return TokenType(interned);
> }
>
> Instead of associating token types with small integers, we associate them with string addresses. (For efficiency we may use pointers to zero-terminated strings, but I don't think that's necessary). Token types are interned by design, i.e. to compare two tokens for equality it suffices to compare the strings with "is" (this can be extended to general identifiers, not only statically-known tokens). Then, each token type has a natural representation that doesn't require the user to remember the name of the token. The left shift token is simply tok!"<<" and is application-global.
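A minimal sketch of how that identity comparison plays out, using the TokenType/tok definitions quoted above (the driver code is illustrative, not Andrei's):

```d
struct TokenType
{
    string symbol;
}

TokenType tok(string s)()
{
    // One interned copy per distinct instantiation of tok!s.
    static immutable string interned = s;
    return TokenType(interned);
}

void main()
{
    auto a = tok!"<<";
    auto b = tok!"<<";
    // Both calls hit the same template instantiation, hence the same
    // interned address: comparing with `is` suffices, no integer enum
    // member names to remember.
    assert(a.symbol is b.symbol);
    assert(tok!"<<".symbol !is tok!"<<=".symbol);
}
```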
>
> The static trie finder does not even build a trie - it simply generates a bunch of switch statements. The signature I've used is:
>
> Tuple!(size_t, size_t, Token)
> staticTrieFinder(alias TokenTable, R)(R r) {
>
> It returns a tuple with (a) whitespace characters before token, (b) newlines before token, and (c) the token itself, returned as tok!"whatever". To use for C++:
>
> alias CppTokenTable = TypeTuple!(
> "~", "(", ")", "[", "]", "{", "}", ";", ",", "?",
> "<", "<<", "<<=", "<=", ">", ">>", ">>=", "%", "%=", "=", "==", "!", "!=",
> "^", "^=", "*", "*=",
> ":", "::", "+", "++", "+=", "&", "&&", "&=", "|", "||", "|=",
> "-", "--", "-=", "->", "->*",
> "/", "/=", "//", "/*",
> "\\",
> ".",
> "'",
> "\"",
> "#", "##",
> "and",
> "and_eq",
> "asm",
> "auto",
> ...
> );
>
> Then the code uses staticTrieFinder!([CppTokenTable])(range). Of course, it's also possible to define the table itself as an array. I'm exploring right now in search of the most advantageous choices.
>
> I think the above would be a true lexer in the D spirit:
>
> - exploits D's string templates to essentially define non-alphanumeric symbols that are easy to use and understand, not confined to predefined tables (that enum!) and cheap to compare;
>
> - exploits D's code generation abilities to generate really fast code using inlined trie searching;
>
> - offers an API that is generic, flexible, and infinitely reusable.
>
> If what we need at this point is a conventional lexer for the D language, std.d.lexer is the ticket. But I think it wouldn't be difficult to push our ambitions way beyond that. What say you?
How quickly do you think this vision could be realized? If soon, I'd say it's worth delaying a decision on the current proposed lexer, if not ... well, jam tomorrow, perfect is the enemy of good, and all that ...
October 06, 2013 Re: std.d.lexer : voting thread
Posted in reply to Jacob Carlborg

On 10/6/13 2:10 AM, Jacob Carlborg wrote:
> On 2013-10-05 02:24, Andrei Alexandrescu wrote:
>
>> Such a trie searcher is not intelligent, but is very composable and
>> extremely fast. It is just smart enough to do maximum munch (e.g.
>> interprets "==" and "foreach" as one token each, not two), but is not
>> smart enough to distinguish an identifier "whileTrue" from the keyword
>> "while" (it claims "while" was found and stops right at the beginning of
>> "True" in the stream). This is for generality so applications can define
>> how identifiers work (e.g. Lisp allows "-" in identifiers but D doesn't
>> etc). The trie finder doesn't do numbers or comments either. No regexen
>> of any kind.
>
> Would it be able to lex Scala and Ruby? Method names in Scala can
> contain many symbols that are not usually allowed in other languages. You
> can have a method named "==". In Ruby method names are allowed to end
> with "=", "?" or "!".
Yes, easily. Have the trie matcher stop upon whatever symbol it detects and then handle the tail with Ruby-specific code.
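For instance, the Ruby-specific tail handling might look like this sketch (hypothetical code, not part of any proposal):

```d
import std.algorithm.searching : canFind;

// Hypothetical: a generic matcher stopped at the end of an identifier;
// Ruby-specific code then absorbs a trailing '?', '!' or '=' into the
// method name, since Ruby allows those as final characters.
string extendRubyMethodName(string name, ref string rest)
{
    if (rest.length && "?!=".canFind(rest[0]))
    {
        name ~= rest[0];
        rest = rest[1 .. $];
    }
    return name;
}

unittest
{
    string rest = "? foo";
    assert(extendRubyMethodName("empty", rest) == "empty?");
    assert(rest == " foo");
}
```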
Andrei
October 06, 2013 Re: std.d.lexer : voting thread
Posted in reply to Joseph Rushton Wakeling

On 10/6/13 5:40 AM, Joseph Rushton Wakeling wrote:
> How quickly do you think this vision could be realized? If soon, I'd say
> it's worth delaying a decision on the current proposed lexer, if not ...
> well, jam tomorrow, perfect is the enemy of good, and all that ...

I'm working on related code, and got all the way there in one day (Friday) with a C++ tokenizer for linting purposes (doesn't open #includes or expand #defines etc; it wasn't meant to).

The core generated fragment that does the matching is at https://dpaste.de/GZY3.

The surrounding switch statement (also in library code) handles whitespace and line counting. The client code needs to handle by hand things like parsing numbers (note how the matcher stops upon the first digit), identifiers, comments (the matcher stops upon detecting "//" or "/*") etc. Such things can be achieved with hand-written code (as I do), other similar tokenizers, DFAs, etc. The point is that the core loop that looks at every character looking for a lexeme is fast.

Andrei
October 06, 2013 Re: std.d.lexer : voting thread
Posted in reply to Jacob Carlborg

On 10/6/13 1:59 AM, Jacob Carlborg wrote:
>> I think std.d.lexer is a fine product that works as advertised. But I
>> also believe very strongly that it doesn't exploit D's advantages and
>> that adopting it would lock us into a suboptimal API. I have
>> strengthened this opinion only since yesterday morning.
>
> I just think that if you were not completely satisfied with the current
> API or implementation you could have said so in the discussion thread.
> It would have at least given Brian a chance to do something about it,
> before the voting began.

I've always thought we must invest effort into generic lexers and parsers as opposed to ones for dedicated languages, and I have said so several times, most strongly in http://forum.dlang.org/thread/jii1gk$76s$1@digitalmars.com.

When discussion and voting had started, I had acquiesced to not interfere because I thought I shouldn't discuss a working design against a hypothetical one. *That* would have been unfair. But now that such a design exists, I think it's fair to bring it up.

Andrei
October 06, 2013 Re: std.d.lexer : voting thread
Posted in reply to Andrei Alexandrescu

On 06/10/13 18:07, Andrei Alexandrescu wrote:
> I'm working on related code, and got all the way there in one day (Friday) with
> a C++ tokenizer for linting purposes (doesn't open #includes or expand #defines
> etc; it wasn't meant to).
>
> The core generated fragment that does the matching is at https://dpaste.de/GZY3.
>
> The surrounding switch statement (also in library code) handles whitespace and
> line counting. The client code needs to handle by hand things like parsing
> numbers (note how the matcher stops upon the first digit), identifiers, comments
> (matcher stops upon detecting "//" or "/*") etc. Such things can be achieved
> with hand-written code (as I do), other similar tokenizers, DFAs, etc. The point
> is that the core loop that looks at every character looking for a lexeme is fast.
What I'm getting at is that I'd be prepared to give a vote "no to std, yes to etc" for Brian's d.lexer, _if_ I were reasonably certain that we'd see an alternative lexer module submitted to Phobos within the next month :-)
October 06, 2013 etc vs. package managers
Posted in reply to Andrei Alexandrescu

On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu wrote:
> 2. I do vote for inclusion in the /etc/ package for the time being.
What is your vision for the future of etc.*, assuming that we are also going to promote DUB (or another package manager) to "official" status soon?
Personally, I always found etc.* to be on some strange middle ground between official and non-official. Can I expect these modules to stay around for a longer amount of time? Will they keep API compatibility according to Phobos policies? The fact that e.g. the libcurl C API modules are also in there makes it seem like a grab-bag of random stuff we didn't quite want to put anywhere else, at least to me.
The docs aren't really helpful either: »Modules in etc are not standard D modules. They are here because they are experimental, or for some other reason are not quite suitable for std, although they are still useful.«
David
October 06, 2013 Re: etc vs. package managers
Posted in reply to David Nadlinger

On 06/10/13 18:57, David Nadlinger wrote:
> The docs aren't really helpful either: »Modules in etc are not standard D
> modules. They are here because they are experimental, or for some other reason
> are not quite suitable for std, although they are still useful.«
I actually realized I had no idea what etc was until the last couple of days, and then I thought: isn't this really what has just been discussed under the proposed name of stdx?
... and if so, why isn't it being used?
October 06, 2013 Re: etc vs. package managers
Posted in reply to Joseph Rushton Wakeling

On Sunday, 6 October 2013 at 17:08:25 UTC, Joseph Rushton Wakeling wrote:
> isn't this really what has just been discussed under the proposed name of stdx?
>
> ... and if so, why isn't it being used?
This is exactly why I'm not too thrilled to make another attempt at establishing something like that. ;)
David
Copyright © 1999-2021 by the D Language Foundation