July 11, 2012
On 07/11/2012 11:55 PM, David Piepgrass wrote:
> On Tuesday, 10 July 2012 at 23:49:58 UTC, Timon Gehr wrote:
>> On 07/11/2012 01:16 AM, deadalnix wrote:
>>> On 09/07/2012 10:14, Christophe Travert wrote:
>>>> deadalnix, in message (digitalmars.D:171330), wrote:
>>>>> D isn't 100% CFG. But it is close.
>>>> What makes D fail to be a CFG?
>>> type[something] <= something can be a type or an expression.
>>> typeid(something) <= same here
>>> identifier!(something) <= again
>>
>> 'something' is context-free:
>>
>> something ::= type | expression.
>
> I don't see how "type | expression" is context free. The input "Foo"
> could be a type or expression, you can't tell which without looking at
> the context.

'Context-free' means that the production rules' left-hand sides have no
context, i.e. each consists of a single non-terminal.

What you are describing is the fact that the grammar is ambiguous. The
parser can just assume whichever alternative works, and the semantic
analyzer takes care of the interpretation of the resulting AST. The
parser does not have to give a meaning to the code; it just turns it
into a representation that is easy to work with.
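
For instance, the parser can emit a single node covering both readings
of `a[b]` and let semantic analysis pick one later. A minimal sketch
(the class names are made up; this is not how any actual D front end
lays out its AST):

class Node {}

// One node kind covers both readings of `a[b]`: a static-array type
// (as in `int[3]`) and an index expression. Semantic analysis resolves
// it once it knows whether `a` names a type or a value.
class BracketSuffix : Node
{
    Node operand; // `a`: type or value, unknown at parse time
    Node inside;  // `b`: dimension or index, unknown at parse time
}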
July 23, 2012
On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias@pankrath.net> wrote:

> If you are interested in parsing, then I wouldn't recommend the dragon book, but this one
>
> It really is an awesome book.

This little post is just to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into the details and limits of the many parsing algos. Very cool!

I'm playing with fully-backtracking recursive-descent parsers right now, to deal with ambiguous grammars, and the small parser works beautifully. Hey, I just saw he gives the code on his website, nice!
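
For the curious, the core mechanism is tiny. A toy sketch of my own
(grammar and names invented for illustration): for the ambiguous
grammar S -> 'a' S 'a' | 'a', backtracking just means collecting every
way a rule can succeed.

// Returns every position where S can end, starting at pos.
// Several results = ambiguity; no results = parse failure.
size_t[] parseS(string input, size_t pos)
{
    size_t[] results;
    if (pos < input.length && input[pos] == 'a')
    {
        // alternative 1: S -> 'a'
        results ~= pos + 1;
        // alternative 2: S -> 'a' S 'a', tried for every sub-result
        foreach (mid; parseS(input, pos + 1))
            if (mid < input.length && input[mid] == 'a')
                results ~= mid + 1;
    }
    return results;
}

unittest
{
    // "aaa" parses as S in two ways: just the first 'a', or all three
    assert(parseS("aaa", 0) == [1, 3]);
}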

Now, on to LR. Nick, here I come. I'll at last know how Goldie works.


Philippe
July 23, 2012
On 7/23/12 9:22 AM, Philippe Sigaud wrote:
> On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias@pankrath.net> wrote:
>
>> If you are interested in parsing, then I wouldn't recommend the dragon book,
>> but this one
>>
>> It really is an awesome book.
>
> This little post is just to say a big 'Thanks' to Tobias. I'm reading
> Grune's Parsing Techniques and find it absolutely wonderful: easy to
> read, going into the details and limits of the many parsing algos. Very cool!
>
> I'm playing with fully-backtracking recursive-descent parsers right
> now, to deal with ambiguous grammars, and the small parser works
> beautifully. Hey, I just saw he gives the code on his website, nice!
>
> Now, on to LR. Nick, here I come. I'll at last know how Goldie works.

When will we see the fruits of all that in the form of D libraries?

Andrei


July 23, 2012
On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote:
> On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias@pankrath.net> wrote:
>
>> If you are interested in parsing, then I wouldn't recommend the dragon book,
>> but this one
>>
>> It really is an awesome book.
>
> This little post is just to say a big 'Thanks' to Tobias. I'm reading
> Grune's Parsing Techniques and find it absolutely wonderful: easy to
> read, going into the details and limits of the many parsing algos. Very cool!

Yes, I'm also reading it and find it remarkably high-quality.
July 23, 2012
On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote:
> On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias@pankrath.net> wrote:
>
>> If you are interested in parsing, then I wouldn't recommend the dragon book,
>> but this one
>>
>> It really is an awesome book.
>
> This little post is just to say a big 'Thanks' to Tobias. I'm reading
> Grune's Parsing Techniques and find it absolutely wonderful: easy to
> read, going into the details and limits of the many parsing algos. Very cool!
>

I'm glad that you like it.
July 31, 2012
On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
> There are more and more projects requiring parsing D code (IDE plugins,
> DCT and dfmt now).
>
> We definitely need a _standard_ fast D parser (suitable as tokenizer).
> We already have a good one at least in Visual D. Maybe dmd's parser is
> faster. If so, it can be (easily?) rewritten in D. We also already have
> some other parsers (in C# from Mono-D etc.).
>
> What about picking one and calling it standard?
>

I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based on a defined grammar.

As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed.
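
Something of this shape, I imagine (all the names here are
hypothetical, just to show what a Range of tokens could look like):

import std.range : isInputRange;

enum TokType { identifier, number, operator, eof }

struct Token
{
    TokType type;
    string  text; // slice of the source
    size_t  line;
}

// An input range of tokens: consumers only see front/popFront/empty.
struct TokenRange
{
    string src;
    Token  current;
    bool   done;

    @property bool empty() { return done; }
    @property Token front() { return current; }
    void popFront()
    {
        // a real implementation would scan the next token out of src here
        done = true;
    }
}

static assert(isInputRange!TokenRange);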

With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for.

https://github.com/kai4785/firstfront

I haven't had time to look through Pegged, but I can immediately see some differences between my approach and Pegged's.

1) Pegged does compile-time or run-time code generation for your parser. Mine is run-time only and regex-based.
2) Pegged uses PEG format, which I have been introduced to through this thread. One of my goals was to actually invent something like PEG. This will save me time :)

I would love to receive some critical feedback on my approach as well as the actual code base.

To build it, you'll need cmake, and cmaked2 from here:
http://code.google.com/p/cmaked2/wiki/GettingStarted

Or just dmd *.d :) I haven't tried to build it on Windows yet, but I don't see anything that immediately jumps out as not cross-platform. I've been working on it on both Fedora and CentOS.

-Kai Meyer
July 31, 2012
On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:

> I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based on a defined grammar.

Every helping hand is appreciated :-)

> As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed.

Yes. To make this statement more precise: We need a lexer that provides
a range of tokens and we need a parser which makes it possible to build
an AST. Pegged combines both approaches but imposes an overhead if you
just need a token list. However I'm not sure if this is a problem.
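
To illustrate the split with deliberately minimal, made-up types (not
a proposed API): the lexer yields tokens, and the parser consumes them
into a tree.

struct Token { string text; }

class Node
{
    Token tok;
    Node left, right;
    this(Token t, Node l = null, Node r = null) { tok = t; left = l; right = r; }
}

// Parse a left-associative sum such as `a + b + c` from a token slice.
// Assumes toks starts with an operand.
Node parseSum(ref Token[] toks)
{
    auto node = new Node(toks[0]);
    toks = toks[1 .. $];
    while (toks.length >= 2 && toks[0].text == "+")
    {
        node = new Node(toks[0], node, new Node(toks[1]));
        toks = toks[2 .. $];
    }
    return node;
}

unittest
{
    auto toks = [Token("a"), Token("+"), Token("b")];
    auto tree = parseSum(toks);
    assert(tree.tok.text == "+" && tree.left.tok.text == "a");
}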

There are already some working D parsers buried in this thread.

> With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for.
>
> https://github.com/kai4785/firstfront

I've only glanced at your code. For most languages lexing is far more expensive than parsing, so the lexer has to be very fast. I wouldn't pursue your approach; instead, use something like Ragel. It already has D output, but needs a separate build step.
July 31, 2012
On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias@pankrath.net> wrote:
> On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:
>
>> I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based on a defined grammar.
>
>
> Every helping hand is appreciated :-)

Hi Kai, welcome here!

>> As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed.
>
>
> Yes. To make this statement more precise: We need a lexer that provides a range of tokens and we need a parser which makes it possible to build an AST. Pegged combines both approaches but imposes an overhead if you just need a token list. However I'm not sure if this is a problem.

I think we need a tokenizer generator and a parser generator. They can be grouped or not, but we need both, in Phobos.

We also need to define what's needed in a token. Kai's approach is OK,
but what's the _type field for an operator or an identifier?
Also, a lexer can fill a symbol table and produce identifier tokens
that refer to a particular entry in the symbol table.
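
Something like this, say (names hypothetical):

// The lexer interns identifiers; tokens then carry an index into the
// table instead of the raw string, so comparisons become integer
// comparisons.
struct SymbolTable
{
    size_t[string] entry; // name -> index
    string[] names;

    size_t intern(string name)
    {
        if (auto p = name in entry)
            return *p;
        entry[name] = names.length;
        names ~= name;
        return names.length - 1;
    }
}

unittest
{
    SymbolTable tab;
    assert(tab.intern("foo") == 0);
    assert(tab.intern("bar") == 1);
    assert(tab.intern("foo") == 0); // same entry on re-occurrence
}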

I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.


> There are already some working D parsers buried in this thread.

Yes. We also need a D parser (not a generic one), but no one has pushed
one for Phobos inclusion yet.
Anyway, we can put generic parsers in Phobos too (LALR, LL(*), etc),
but I'd say the first priority would be to have a D lexer (producing a
range of tokens) and a D parser (consuming a range of tokens, and
executing semantic actions, like the building of an AST). Generic
parsers can come later on. Having a D parser in the standard
distribution would create much goodwill. Many people want that.


> I've only glanced at your code. For most languages lexing is far more expensive than parsing

Is that so?

> so the lexer has to be very fast. I wouldn't
> pursue your approach; instead, use something like Ragel. It already has D
> output, but needs a separate build step.

Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.

The only problem I see in Kai's approach (which is the standard one: a prioritized list of regexes) is how to recognize tokens that are not regular (I mean, tokens that cannot be encoded as a regex), like nesting comments. How does the lexer know when to stop producing a 'comment' token?
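
I suppose the usual escape hatch is a small hand-written scanner with
a depth counter, invoked when the regex machinery sees the opening
delimiter. A sketch for D's nesting /+ ... +/ comments:

// Returns the index just past the closing `+/` of the nesting comment
// starting at pos (which must point at the opening "/+"), or
// input.length if the comment is unterminated.
size_t skipNestingComment(string input, size_t pos)
{
    size_t depth = 0;
    while (pos + 1 < input.length)
    {
        if (input[pos .. pos + 2] == "/+") { depth++; pos += 2; }
        else if (input[pos .. pos + 2] == "+/")
        {
            pos += 2;
            if (--depth == 0) return pos;
        }
        else pos++;
    }
    return input.length;
}

unittest
{
    assert(skipNestingComment("/+ a /+ b +/ c +/ rest", 0) == 17);
}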

Philippe
July 31, 2012
On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.

I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate.

- Jonathan M Davis
July 31, 2012
On 31-Jul-12 20:10, Kai Meyer wrote:
> On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
>> There are more and more projects requiring parsing D code (IDE plugins,
>> DCT and dfmt now).
>>
>> We definitely need a _standard_ fast D parser (suitable as tokenizer).
>> We already have a good one at least in Visual D. Maybe dmd's parser is
>> faster. If so, it can be (easily?) rewritten in D. We also already have
>> some other parsers (in C# from Mono-D etc.).
>>
>> What about picking one and calling it standard?
>>
>
> I know I'm a little late coming into this conversation. This seems like
> a nice thread to toss myself into. I've started working on a generic
> lexer that is based on a defined grammar.
>
> As I read through the thread (I unfortunately don't have enough time to
> read every post, but I skimmed through as many as I could, and read the
> ones that seemed important), it seems like we need a parser in D that
> can lex D, and provide a Range of tokens that could be consumed.
>
> With some very minor tweaks, and a properly written Grammar class, I
> basically have it already done. D was going to be one of the first
> languages I would have written a definition for.
>
> https://github.com/kai4785/firstfront
>
> I haven't had time to look through Pegged, but I can immediately see
> some differences between my approach and Pegged's.
>
> 1) Pegged does compile-time or run-time code generation for your parser.
> Mine is run-time only and regex-based.
> 2) Pegged uses PEG format, which I have been introduced to through this
> thread. One of my goals was to actually invent something like PEG. This
> will save me time :)
>

> I would love to receive some critical feedback on my approach as well as
> the actual code base.

Okay, my critique goes as follows:

First, as the author of std.regex, I'm pleased that you find it suitable for use in a lexer. :)

Still, there is a BIG word of warning:

A lexer based on trying out an array of regexes until one matches is NOT fast, and not even close to the fast ones. It is acceptable only as a proof-of-concept/prototype kind of thing.

To help this use case I thought of making multi-regex matching a la:

	match(text, regex1, regex2, regex3...);

Then lexing would effectively be a loop of the form:

	// suppose match returns a tuple: N, the index of the regex that
	// matched, plus the usual piece of matched text
	foreach (tok; match(text, regex1, regex2, regex3...))
	{
		switch (tok[0])
		{
			...
		}
	}

(with some glitches on /+ ... +/ and other non-regular stuff).

But until such a time, it's not to be used seriously in this context.

The reason is that each call to match has some overhead for setting up the regex engine; it's constant, but it's huge compared to what usual lexers typically do per token. The other thing is that you should use ctRegex for this kind of job if you can (i.e. if the compiler doesn't explode on it).
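
For instance (the identifier pattern is just for illustration):

import std.regex;

unittest
{
    // the matching engine is generated at compile time instead of the
    // pattern being interpreted at run time
    auto ident = ctRegex!(`[a-zA-Z_][a-zA-Z_0-9]*`);
    auto m = matchFirst("foo42 = bar;", ident);
    assert(!m.empty && m.hit == "foo42");
}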


(keep in mind I only skimmed the source; I will post more feedback later)

>
> To build it, you'll need cmake, and cmaked2 from here:
> http://code.google.com/p/cmaked2/wiki/GettingStarted
>
> Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
> don't see anything that immediately jumps out as not cross-platform.
> I've been working on it on both Fedora and CentOS.
>

rdmd for the win!


-- 
Dmitry Olshansky