January 27, 2013
27-Jan-2013 23:48, Walter Bright wrote:
> On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
>> Walter seems to think if a lexer is not able to vomit thousands
>> of tokens a second, then it's not good.
>
> Speed is critical for a lexer.
>
> This means, for example, you'll need to squeeze pretty much all storage
> allocation out of it. A lexer that does an allocation per token is not
> going to do very well at all.

I concur. One of the biggest reasons* there is a separate lexer step is that this stage can be made very, very fast. The rest of the parser then greatly benefits from this underlying speed.

*Otherwise we could just as well have added the lexer stage as simple rules to a grammar that treats all code points as terminals.

-- 
Dmitry Olshansky
January 27, 2013
27-Jan-2013 23:46, Walter Bright wrote:
> On 1/27/2013 1:51 AM, Brian Schott wrote:
>> I'm interested in ideas on the API design and other high-level issues
>> at the
>> moment. I don't consider this ready for inclusion. (The current module
>> being
>> reviewed for inclusion in Phobos is the new std.uni.)
>
> Just a quick comment: byToken() should not accept a filename. Its input
> should be via an InputRange, not a file.

An InputRange of char/wchar/dchar, I guess, as the lexer should have the ability to avoid decoding.
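
Something along these lines, a hypothetical signature just to illustrate the constraint (byToken is Brian's name; the template guard here is my own sketch):

import std.range : isInputRange, ElementEncodingType;
import std.traits : Unqual;

// Accept any input range of char/wchar/dchar and lex the code units
// directly; never decode to dchar behind the user's back.
auto byToken(R)(R source)
    if (isInputRange!R &&
        (is(Unqual!(ElementEncodingType!R) == char) ||
         is(Unqual!(ElementEncodingType!R) == wchar) ||
         is(Unqual!(ElementEncodingType!R) == dchar)))
{
    return source; // stub: a real lexer would return a range of Tokens
}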

-- 
Dmitry Olshansky
January 27, 2013
On Sun, Jan 27, 2013 at 8:48 PM, Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
>>
>> Walter seems to think if a lexer is not able to vomit thousands of tokens a second, then it's not good.
>
>
> Speed is critical for a lexer.

Something I never thought about: given a 10k-line module, how many tokens does that represent, roughly? Ten times as many?

> This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.

How does one do that? Honest question; I'm not really concerned with extreme speed most of the time, and I have lots to learn in this area.
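
If I had to guess, the standard trick is to make each token a slice of the source buffer, so producing a token allocates nothing. A minimal sketch of my guess, not necessarily what Walter has in mind:

enum TokenType { identifier, number, eof }

struct Token
{
    TokenType type;
    string    text; // a slice of the original source: no copy, no allocation
}

// Lex one identifier from the front of `src`, advancing `src` past it.
Token lexIdentifier(ref string src)
{
    size_t i = 0;
    while (i < src.length &&
           (src[i] == '_' ||
            (src[i] >= 'a' && src[i] <= 'z') ||
            (src[i] >= 'A' && src[i] <= 'Z') ||
            (i > 0 && src[i] >= '0' && src[i] <= '9')))
        ++i;
    auto tok = Token(TokenType.identifier, src[0 .. i]);
    src = src[i .. $];
    return tok;
}

void main()
{
    string src = "foo42 = 1;";
    auto tok = lexIdentifier(src);
    assert(tok.text == "foo42" && src == " = 1;");
}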
January 27, 2013
On 27.01.2013 20:48, Walter Bright wrote:
> On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
>> Walter seems to think if a lexer is not able to vomit thousands
>> of tokens a second, then it's not good.
>
> Speed is critical for a lexer.
>
> This means, for example, you'll need to squeeze pretty much all storage
> allocation out of it. A lexer that does an allocation per token is not
> going to do very well at all.

I second that.
January 27, 2013
> I concur. One of the biggest reasons* there is a separate lexer step is that this stage can be made very, very fast. The rest of the parser then greatly benefits from this underlying speed.
>
> *Otherwise we could just as well have added the lexer stage as simple rules to a grammar that treats all code points as terminals.

... which is exactly what parsing expression grammars (and other
scannerless parsers) do.

AFAICT, one interesting consequence is the ability to have composition
of grammars, which I certainly have with Pegged. But maybe it's not that
closely related; it's not something I've stopped to think about.
In any case, grammar composition is something I've learnt to like quite a lot.
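
For a concrete picture, here is roughly what that looks like in Pegged (a sketch from memory; grammar and rule names are invented):

import pegged.grammar;

// The "tokens" are ordinary rules over code points...
mixin(grammar(`
Lex:
    Identifier <- [a-zA-Z_] [a-zA-Z_0-9]*
    Number     <- [0-9]+
    Spacing    <- (' ' / '\t' / '\n')*
`));

// ...and a second grammar composes with the first through
// qualified references to its rules.
mixin(grammar(`
Expr:
    Sum  <- Atom (Lex.Spacing '+' Lex.Spacing Atom)*
    Atom <- Lex.Identifier / Lex.Number
`));

void main()
{
    auto tree = Expr("a + 42");
    assert(tree.successful);
}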
January 27, 2013
28-Jan-2013 00:18, Philippe Sigaud wrote:
>> I concur. One of the biggest reasons* there is a separate lexer step is
>> that this stage can be made very, very fast. The rest of the parser then
>> greatly benefits from this underlying speed.
>>
>> *Otherwise we could just as well have added the lexer stage as simple rules
>> to a grammar that treats all code points as terminals.
>
> ... which is exactly what parsing expression grammars (and other
> scannerless parsers) do.
>

That is only happenstance, and it is highly overrated. "What does it buy you?" is the right question to ask. After all, every LL parser can be written without the notion of a lexer, and strangely I see no fuss about that.
So is it just hype? I dunno.

> AFAICT, one interesting consequence is the ability to have composition
> of grammars, which I certainly have with Pegged. But maybe it's not that
> closely related; it's not something I've stopped to think about.
> In any case, grammar composition is something I've learnt to like quite a lot.
>

You can still have composability of grammars. In fact, if I were you, I'd define a couple of less powerful but better-optimized variations.

PEG-style memoization is still possible as long as the underlying grammars are context-free, and all the useful ones usually are.
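
A minimal sketch of what I mean, with a hypothetical hand-written rule: cache each rule's outcome per input position, so a backtracking revisit costs O(1):

import std.ascii : isAlpha, isAlphaNum;

struct MemoEntry { bool success; size_t end; }

// One memo table per rule: input position -> cached outcome.
MemoEntry[size_t] identMemo;

MemoEntry parseIdentifier(string input, size_t pos)
{
    if (auto cached = pos in identMemo)
        return *cached;                       // already tried here: O(1)
    auto r = MemoEntry(false, pos);
    if (pos < input.length && (isAlpha(input[pos]) || input[pos] == '_'))
    {
        auto i = pos + 1;
        while (i < input.length && (isAlphaNum(input[i]) || input[i] == '_'))
            ++i;
        r = MemoEntry(true, i);
    }
    return identMemo[pos] = r;
}

void main()
{
    assert(parseIdentifier("foo+bar", 0) == MemoEntry(true, 3));
    assert(parseIdentifier("foo+bar", 0) == MemoEntry(true, 3)); // memo hit
}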

LR grammars are hardly composable (I haven't seen a single LR tool that composes them).
But all top-down ones always are, unless there is some explicit work spent to undermine this ability ;)

-- 
Dmitry Olshansky
January 27, 2013
>> ... which is exactly what parsing expression grammars (and other
>> scannerless parsers) do.
>>
>
> That is only happenstance, and it is highly overrated. "What does it buy
> you?" is the right question to ask. After all, every LL parser can be
> written without the notion of a lexer, and strangely I see no fuss about that.
> So is it just hype? I dunno.

I don't know. I discovered what scannerless parsing was after
discovering (and getting interested in) PEGs.
That's a somewhat unusual path, as I'm only now discovering standard
lexer+parser parsing :)

Anyway, let's not derail the OP thread too much.

>> AFAICT, one interesting consequence is the ability to have composition
>> of grammars, which I certainly have with Pegged. But maybe it's not that
>> closely related; it's not something I've stopped to think about.
>> In any case, grammar composition is something I've learnt to like quite
>> a lot.
>>
>
> You can still have composability of grammars. In fact, if I were you, I'd define a couple of less powerful but better-optimized variations.

Yes, getting an automaton to deal with the regular part would be good. But then, there already is std.regex ;)
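
E.g. the compile-time engine already generates a specialized matcher for a token-shaped pattern (whether it beats a hand-written automaton, I haven't measured):

import std.regex;

void main()
{
    // Compile-time-generated matcher for one "regular" token shape.
    auto identifier = ctRegex!`^[a-zA-Z_][a-zA-Z_0-9]*`;
    auto m = matchFirst("foo42 + bar", identifier);
    assert(m.hit == "foo42");
}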

It's good to know composition is possible elsewhere. I really like it: it makes me use grammars (and the parsers generated from them) the way I use functions in a programming language: slowly building layers of abstraction and reuse.

> PEG-style memoization is still possible as long as the underlying grammars are context-free, and all the useful ones usually are.

Yes, I agree.


> LR grammars are hardly composable (I haven't seen a single LR tool that
> composes them).
> But all top-down ones always are, unless there is some explicit work spent
> to undermine this ability ;)

What I still can't get an easy answer on is: is the D grammar LR(1),
LR(0), LALR?
January 27, 2013
28-Jan-2013 00:45, Philippe Sigaud wrote:
[snip]
>>> AFAICT, one interesting consequence is the ability to have composition
>>> of grammars, which I certainly have with Pegged. But maybe it's not that
>>> closely related; it's not something I've stopped to think about.
>>> In any case, grammar composition is something I've learnt to like quite
>>> a lot.
>>>
>>
>> You can still have composability of grammars. In fact, if I were you, I'd
>> define a couple of less powerful but better-optimized variations.
>
> Yes, getting an automaton to deal with the regular part would be good.
> But then, there already is std.regex ;)

Reminds me to tweak/refactor/optimize the CT part of it.
[snip]
> What I still can't get an easy answer on is: is the D grammar LR(1),
> LR(0), LALR?
>

I *think* you can shoehorn it into any one of these: LALR(1), LL(k) plus some lookahead, LL(*) (a generalization of it), or PEG. As usual, you might need semantic predicates to iron out some quirks, though. We are getting OT quickly ;)

The hint is that your question is a bit ill-posed: by "the D grammar", do you mean the exact one listed on the website, or any equivalent grammar that parses the same language (including ones obtained by simple transformations)?

-- 
Dmitry Olshansky
January 27, 2013
On 1/27/2013 12:11 PM, Dmitry Olshansky wrote:
> 27-Jan-2013 23:46, Walter Bright wrote:
>> On 1/27/2013 1:51 AM, Brian Schott wrote:
>>> I'm interested in ideas on the API design and other high-level issues
>>> at the
>>> moment. I don't consider this ready for inclusion. (The current module
>>> being
>>> reviewed for inclusion in Phobos is the new std.uni.)
>>
>> Just a quick comment: byToken() should not accept a filename. Its input
>> should be via an InputRange, not a file.
>
> An InputRange of char/wchar/dchar, I guess, as the lexer should have the
> ability to avoid decoding.

Just char would suffice. If a user needs wchar/dchar, they can put a map in between the InputRange and the lexer.
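
Something like this sketch; std.utf.byChar is one such ready-made "map" (it lazily re-encodes any character range to a range of char):

import std.utf : byChar;

// The lexer consumes any input range of char.
void lex(R)(R chars) { /* ... */ }

void main()
{
    dstring source = "int x = 42;"d; // the user's UTF-32 text
    lex(source.byChar);              // lazily re-encoded to a char range
}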

January 27, 2013
On 1/27/2013 12:18 PM, Philippe Sigaud wrote:
>> I concur. One of the biggest reasons* there is a separate lexer step is
>> that this stage can be made very, very fast. The rest of the parser then
>> greatly benefits from this underlying speed.
>>
>> *Otherwise we could just as well have added the lexer stage as simple rules
>> to a grammar that treats all code points as terminals.
>
> ... which is exactly what parsing expression grammars (and other
> scannerless parsers) do.
>
> AFAICT, one interesting consequence is the ability to have composition
> of grammars, which I certainly have with Pegged. But maybe it's not that
> closely related; it's not something I've stopped to think about.
> In any case, grammar composition is something I've learnt to like quite a lot.
>

If you do such a thing when designing a language, I guarantee that you will find it impossible to resist having custom "tokens" in various parts of the grammar. That will make it impossible to write tools that only need the tokens, such as those syntax-highlighting editors that really are token highlighters.