October 09, 2013 Re: std.d.lexer : voting thread

Posted in reply to Andrei Alexandrescu

On Tuesday, 8 October 2013 at 00:16:45 UTC, Andrei Alexandrescu wrote:
> To put my money where my mouth is, I have a proof-of-concept tokenizer for C++ in working state.
>
> http://dpaste.dzfl.pl/d07dd46d
Why do you use "\0" as end-of-stream token:
/**
* All token types include regular and reservedTokens, plus the null
* token ("") and the end-of-stream token ("\0").
*/
We can have a situation where "\0" is a valid token, for example in binary formats. Is it possible to indicate end-of-stream another way, maybe via an "empty" property for a range-based API?

October 09, 2013 Re: std.d.lexer : voting thread

Posted in reply to ilya-stromberg

On 10/8/13 11:11 PM, ilya-stromberg wrote:
> On Tuesday, 8 October 2013 at 00:16:45 UTC, Andrei Alexandrescu wrote:
>> To put my money where my mouth is, I have a proof-of-concept tokenizer
>> for C++ in working state.
>>
>> http://dpaste.dzfl.pl/d07dd46d
>
> Why do you use "\0" as end-of-stream token:
>
> /**
> * All token types include regular and reservedTokens, plus the null
> * token ("") and the end-of-stream token ("\0").
> */
>
> We can have a situation where "\0" is a valid token, for example in
> binary formats. Is it possible to indicate end-of-stream another way,
> maybe via an "empty" property for a range-based API?
I'm glad you asked. It's simply a decision by convention. I know no C++ source can contain a "\0", so I append it to the input and use it as a sentinel.
A general lexer should take the EOF symbol as a parameter.
One more thing: the trie matcher knows a priori (statically) what the maximum lookahead is - it's the maximum length of all symbols. That can be used to pre-fill the input buffer such that there's never an out-of-bounds access, even with input ranges.
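For illustration only, here is a rough sketch of that idea with made-up names (symbols, maxLength, Lexer are hypothetical, not the code in the paste):

// The maximum lookahead is the length of the longest symbol, computable
// at compile time from the token strings alone.
enum string[] symbols = ["<<", ">>", "<", ">", "+"];

size_t maxLength(in string[] ss)
{
    size_t m = 0;
    foreach (s; ss)
        if (s.length > m)
            m = s.length;
    return m;
}

enum size_t maxLookahead = maxLength(symbols); // evaluated via CTFE

// The sentinel is a parameter instead of a hard-coded "\0".
struct Lexer(string eofSymbol = "\0")
{
    const(char)[] input;

    this(const(char)[] text)
    {
        // Appending the sentinel (and, with input ranges, pre-filling up to
        // maxLookahead characters) guarantees the matcher never reads out
        // of bounds.
        input = text ~ eofSymbol;
    }

    @property bool empty() const
    {
        // Clients can test emptiness without ever comparing against
        // eofSymbol themselves.
        return input.length <= eofSymbol.length;
    }
}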
Andrei

October 09, 2013 Re: std.d.lexer : voting thread

Posted in reply to Andrei Alexandrescu

On Wednesday, 9 October 2013 at 07:49:55 UTC, Andrei Alexandrescu wrote:
> On 10/8/13 11:11 PM, ilya-stromberg wrote:
>> On Tuesday, 8 October 2013 at 00:16:45 UTC, Andrei Alexandrescu wrote:
>>> To put my money where my mouth is, I have a proof-of-concept tokenizer
>>> for C++ in working state.
>>>
>>> http://dpaste.dzfl.pl/d07dd46d
>>
>> Why do you use "\0" as end-of-stream token:
>>
>> /**
>> * All token types include regular and reservedTokens, plus the null
>> * token ("") and the end-of-stream token ("\0").
>> */
>>
>> We can have a situation where "\0" is a valid token, for example in
>> binary formats. Is it possible to indicate end-of-stream another way,
>> maybe via an "empty" property for a range-based API?
>
> I'm glad you asked. It's simply a decision by convention. I know no C++ source can contain a "\0", so I append it to the input and use it as a sentinel.
>
> A general lexer should take the EOF symbol as a parameter.
>
> One more thing: the trie matcher knows a priori (statically) what the maximum lookahead is - it's the maximum length of all symbols. That can be used to pre-fill the input buffer such that there's never an out-of-bounds access, even with input ranges.
>
>
> Andrei
So, it would be interesting to see a new, improved API, because we need a really generic lexer. I don't think it's that difficult.

October 09, 2013 Re: std.d.lexer : voting thread

Posted in reply to Andrei Alexandrescu

On 10/7/13 5:16 PM, Andrei Alexandrescu wrote:
> On 10/4/13 5:24 PM, Andrei Alexandrescu wrote:
>> On 10/2/13 7:41 AM, Dicebot wrote:
>>> After brief discussion with Brian and gathering data from the review
>>> thread, I have decided to start voting for `std.d.lexer` inclusion into
>>> Phobos.
>>
>> Thanks all involved for the work, first of all Brian.
>>
>> I have the proverbial good news and bad news. The only bad news is that
>> I'm voting "no" on this proposal.
>>
>> But there's plenty of good news.
>>
>> 1. I am not attempting to veto this, so just consider it a normal vote
>> when tallying.
>>
>> 2. I do vote for inclusion in the /etc/ package for the time being.
>>
>> 3. The work is good and the code valuable, so even in case my
>> suggestions (below) are followed, virtually all of the code that
>> gets work done can be reused.
> [snip]
>
> To put my money where my mouth is, I have a proof-of-concept tokenizer
> for C++ in working state.
>
> http://dpaste.dzfl.pl/d07dd46d

I made an improvement to the way tokens are handled. In the paste above, "tk" is a function: a CTFE-able function that just returns a compile-time constant, but a function nevertheless. To actually reduce "tk" to a compile-time constant in all cases, I changed it as follows:

template tk(string symbol)
{
    import std.algorithm, std.range;
    static if (symbol == "")
    {
        // Token ID 0 is reserved for "unrecognized token".
        enum tk = TokenType2(0);
    }
    else static if (symbol == "\0")
    {
        // Token ID max is reserved for "end of input".
        enum tk = TokenType2(
            cast(TokenIDRep) (1 + tokens.length + reservedTokens.length));
    }
    else
    {
        //enum id = chain(tokens, reservedTokens).countUntil(symbol);
        // Find the id within the regular tokens realm
        enum idTokens = tokens.countUntil(symbol);
        static if (idTokens >= 0)
        {
            // Found, regular token. Add 1 because 0 is reserved.
            enum id = idTokens + 1;
        }
        else
        {
            // Not found, so the only chance is within the reserved tokens realm
            enum idResTokens = reservedTokens.countUntil(symbol);
            enum id = idResTokens >= 0 ? tokens.length + idResTokens + 1 : -1;
        }
        static assert(id >= 0 && id < TokenIDRep.max, "Invalid token: " ~ symbol);
        enum tk = TokenType2(id);
    }
}

This is even better now because token types are simple static constants.

Andrei
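P.S. A quick illustration of what that buys at the use site; this snippet is not in the paste, and it assumes the token table contains "<<":

// Since tk!"<<" now folds to a plain TokenType2 constant, it can be used
// anywhere a compile-time value is required, e.g. as a template argument
// or to initialize another constant:
enum shiftLeft = tk!"<<";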

October 10, 2013 Re: std.d.lexer : voting thread

Posted in reply to Andrei Alexandrescu

On 10/08/2013 05:05 PM, Andrei Alexandrescu wrote:
> But no matter. My most significant bit is, we need a trie lexer
> generator ONLY from the token strings, no TK_XXX user-provided symbols
> necessary. If all we need is one language (D) this is a non-issue
> because the library writer provides the token definitions. If we need to
> support user-provided languages, having the library manage the string ->
> small integer mapping becomes essential.
It's good to get rid of the symbol names.
You should try to map the strings onto an enum so that final switch works.
final switch (t.type_)
{
case tk!"<<": break;
// ...
}
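For example, the library could generate that enum from the token strings alone. Everything below (TokenKind, kind, handle) is made up to illustrate the idea; it is not the module under review:

import std.conv : to;

enum string[] tokens = ["<<", ">>", "+"];

// Generates "enum TokenKind : ushort { invalid, tok0, tok1, tok2, eof }".
string makeEnum(in string[] toks)
{
    string code = "enum TokenKind : ushort { invalid, ";
    foreach (i, t; toks)
        code ~= "tok" ~ to!string(i) ~ ", ";
    return code ~ "eof }";
}

mixin(makeEnum(tokens));

// Maps a token string back to its enum member at compile time.
template kind(string symbol)
{
    enum kind = () {
        foreach (i, t; tokens)
            if (t == symbol)
                return cast(TokenKind) (i + 1); // +1 because 0 is 'invalid'
        assert(0, "unknown token: " ~ symbol);
    }();
}

void handle(TokenKind k)
{
    final switch (k)
    {
        case TokenKind.invalid: break;
        case kind!"<<": break;
        case kind!">>": break;
        case kind!"+": break;
        case TokenKind.eof: break;
    }
}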

October 10, 2013 Re: std.d.lexer : voting thread

Posted in reply to Sönke Ludwig

On 10/06/2013 10:18 AM, Sönke Ludwig wrote:
> I also see no fundamental reason why the API forbids
> extension for shared string tables or table-less lexing.
The current API requires copying slices of the const(ubyte)[] input into string values in every token. This can't be done efficiently without a
string table. But a string table is unnecessary for many use cases,
so the API has a built-in performance/memory issue.
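To make the cost concrete, here are two hypothetical helpers (not the reviewed API) showing the difference between slicing immutable input and copying from const bytes:

// Immutable input: the token text is just a slice, no allocation.
string tokenText(string input, size_t start, size_t end)
{
    return input[start .. end];
}

// const(ubyte)[] input: a string cannot alias possibly-mutable bytes,
// so the text must be copied (or interned in a shared string table).
string tokenText(const(ubyte)[] input, size_t start, size_t end)
{
    return cast(string) input[start .. end].idup;
}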

October 10, 2013 Re: std.d.lexer : voting thread

Posted in reply to Jonathan M Davis

On 10/08/2013 07:22 AM, Jonathan M Davis wrote:
> We've had great stuff reviewed and merged thus far,
> but we also tend to end up having to make minor tweaks to the API or later
> come to regret including it at all (e.g. std.net.curl). Having some sort of
> intermediate step prior to full inclusion for at least one or two releases
> would be a good move IMHO.
It usually takes me a few months until I get to try a new module, at which point it has typically already been voted on and included.
So the current approach doesn't work for me at all.

October 10, 2013 Re: std.d.lexer : voting thread

Posted in reply to Dicebot

On Wednesday, October 02, 2013 16:41:54 Dicebot wrote:
> After brief discussion with Brian and gathering data from the review thread, I have decided to start voting for `std.d.lexer` inclusion into Phobos.
I'm going to have to vote no.
While Brian has done some great work, I think that it's clear from the discussion that there are still some potential issues (e.g. requiring a string table) that need further discussion and possibly API changes. Also, while I question that a generated lexer can beat a hand-written one, I think that we really should look at what Andrei's proposing and look at adjusting what Brian has done accordingly - or at least do enough so that we can benchmark the two approaches. As such, accepting the lexer right now doesn't really make sense.
However, we may want to make it so that the lexer is in some place of prominence (outside of Phobos - probably on dub but mentioned somewhere on dlang.org) as an _experimental_ module which is clearly marked as not finalized but which is ready for people to use and bang on. That way, we may be able to get some better feedback generated from more real world use.
- Jonathan M Davis

October 10, 2013 Re: std.d.lexer : voting thread

Posted in reply to Martin Nowak

On 10.10.2013 03:25, Martin Nowak wrote:
> On 10/06/2013 10:18 AM, Sönke Ludwig wrote:
>> I also see no fundamental reason why the API forbids
>> extension for shared string tables or table-less lexing.
>
> The current API requires copying slices of the const(ubyte)[] input into
> string values in every token. This can't be done efficiently without a
> string table. But a string table is unnecessary for many use cases,
> so the API has a built-in performance/memory issue.
But it could be extended later to accept immutable input as a special case, thus removing that requirement, if I'm not overlooking something. In that case it is still purely an implementation detail.
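If I'm reading it right, that could be little more than a static if on the input type inside the lexer. A rough, hypothetical sketch (TokenRange and frontText are made-up names, not the proposed API):

struct TokenRange(Input)
{
    Input input;

    string frontText(size_t start, size_t end)
    {
        static if (is(Input == string))
        {
            // Immutable input: slicing is safe, no copy needed.
            return input[start .. end];
        }
        else
        {
            // Const or mutable bytes: copy (or intern) the token text.
            return cast(string) input[start .. end].idup;
        }
    }
}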

October 10, 2013 Re: std.d.lexer : voting thread

Posted in reply to Jonathan M Davis

On Thursday, 10 October 2013 at 04:33:15 UTC, Jonathan M Davis wrote:
> On Wednesday, October 02, 2013 16:41:54 Dicebot wrote:
>> After brief discussion with Brian and gathering data from the
>> review thread, I have decided to start voting for `std.d.lexer`
>> inclusion into Phobos.
>
> I'm going to have to vote no.
>
> While Brian has done some great work, I think that it's clear from the
> discussion that there are still some potential issues (e.g. requiring a string
> table) that need further discussion and possibly API changes. Also, while I
> question that a generated lexer can beat a hand-written one, I think that we
> really should look at what Andrei's proposing and look at adjusting what Brian
> has done accordingly - or at least do enough so that we can benchmark the two
> approaches. As such, accepting the lexer right now doesn't really make sense.
>
> However, we may want to make it so that the lexer is in some place of
> prominence (outside of Phobos - probably on dub but mentioned somewhere on
> dlang.org) as an _experimental_ module which is clearly marked as not finalized
> but which is ready for people to use and bang on. That way, we may be able to
> get some better feedback generated from more real world use.
>
> - Jonathan M Davis
Vote: No.
Same reason as Jonathan above.