March 13, 2012
On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
> On 12.03.2012 16:43, bls wrote:
>> On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
>>> Hello,
>>>
>>> I created a new Github project, Pegged, a Parsing Expression Grammar
>>> (PEG) generator in D.
>>>
>>> https://github.com/PhilippeSigaud/Pegged
>>>
>>> docs: https://github.com/PhilippeSigaud/Pegged/wiki
>>
>> Just WOW!
>>
>> Nice to have on your WIKI would be an EBNF to PEG sheet.
>>
>> Wirth EBNF Pegged
>> A = BC. A <- B C
>> A = B|C. A <- C / C
>
> Maybe A <- B / C. And even then it's not exactly equivalent if the
> grammar was ambiguous.
> Imagine: B <- a, C <- aa
PEG is pretty new to me. Can you elaborate a bit?
>

My mistake.. cleaned up stuff..

Pegged				Wirth EBNF

Sequence
A <- B C			A = BC.

B or C
A <- B / C			A = B|C.

Zero or one B
A <- B?				A = [B].

Zero or more Bs
A <- B*				A = {B}.

One or more Bs
A <- B+				Not available

PEG description of EBNF

EBNF <- Production+
Production <- Identifier '=' Expression '.'
Expression <- Term ( '|' Term)*
Term <- Factor Factor*
Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' / '(' Expression ')'
lowerCase  <- [a-z]
upperCase  <- [A-Z]
Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
Literal <- ("'" .+ "'") /  ('"' .+ '"')

Still not sure if this is correct, especially:
Term <- Factor Factor*


Another thing I never really understood is the "production" order. In other words: why not top down?
Start:
lowerCase  <- [a-z]
upperCase  <- [A-Z]
Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
....

End:
EBNF <- Production+

where End is Root..

TIA, Bjoern
March 13, 2012
On 12-03-2012 13:43, bls wrote:
> On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
>> Hello,
>>
>> I created a new Github project, Pegged, a Parsing Expression Grammar
>> (PEG) generator in D.
>>
>> https://github.com/PhilippeSigaud/Pegged
>>
>> docs: https://github.com/PhilippeSigaud/Pegged/wiki
>
> Just WOW!
>
> Nice to have on your WIKI would be an EBNF to PEG sheet.
>
> Wirth EBNF Pegged
> A = BC. A <- B C
> A = B|C. A <- C / C
> A = [B]. A <- B?
> A = {B}. A <- B*
>
>
> Having EBNF expressed in Pegged ...
>
> EBNF <- Production+
> Production <- Identifier '=' Expression '.'
> Expression <- Term ( '|' Term)*
> Term <- Factor Factor*
> Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' /
> '(' Expression ')'
> lowerCase <- [a-z]
> upperCase <- [A-Z]
> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
> Literal <- ("'" .+ "'" / '"' .+ '"')
>
> Due to the fact that EBNF can be expressed in EBNF it should be possible
> to parse an arbitrary EBNF file and generate PEG output.
> What do you think ?
>
> BTW.. Is my EBNF PEG description correct ? TIA Bjoern

The thing is, PEG cannot represent an ambiguous grammar. It is fully ordered, so it's just plain impossible.

That's not to say you can't turn an EBNF grammar into PEG, but it won't necessarily be useful in all cases.
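Alex's point can be illustrated with Dmitry's B <- a, C <- aa example. Below is a toy sketch in Python (an illustration of the semantics only, not Pegged; the function name is invented for the demo) showing that a PEG choice commits to the first matching alternative, so the ambiguity a CFG would have simply disappears:

```python
def choice_ba(s):
    # A <- B / C with B <- 'a' and C <- 'aa' (Dmitry's example above).
    # PEG '/' tries alternatives in order and commits to the first match.
    for alt in ("a", "aa"):        # B first, then C
        if s.startswith(alt):
            return alt             # commit: remaining alternatives never run
    return None

# In an ambiguous CFG, "aa" could start with either B or C.
# The PEG choice deterministically picks B every time:
print(choice_ba("aa"))  # 'a' -- C is never even tried
```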

-- 
- Alex
March 13, 2012
On 12.03.2012 17:45, bls wrote:
> On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
>> On 12.03.2012 16:43, bls wrote:
>>> On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
>>>> Hello,
>>>>
>>>> I created a new Github project, Pegged, a Parsing Expression Grammar
>>>> (PEG) generator in D.
>>>>
>>>> https://github.com/PhilippeSigaud/Pegged
>>>>
>>>> docs: https://github.com/PhilippeSigaud/Pegged/wiki
>>>
>>> Just WOW!
>>>
>>> Nice to have on your WIKI would be an EBNF to PEG sheet.
>>>
>>> Wirth EBNF Pegged
>>> A = BC. A <- B C
>>> A = B|C. A <- C / C
>>
>> Maybe A <- B / C. And even then it's not exactly equivalent if the
>> grammar was ambiguous.
>> Imagine: B <- a, C <- aa
> PEG is pretty new to me. Can you elaborate a bit?

PEG defines an order among alternatives, pretty much the way a top-down recursive descent parser would parse it. Alternatives are tried from left to right; if the first one fails, it tries the next, and so on.
In the example I gave, B is always picked first, so C is never looked at.

Somewhat less artificial example:
Literal <- IntL / FloatL
FloatL  <- [0-9]+ ('.' [0-9]+)?
IntL    <- [0-9]+

If you change it to Literal <- FloatL / IntL, then integer literals would get parsed as floating point.
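A quick way to see this ordering effect concretely is a toy Python sketch of the two literal rules (illustration only, not Pegged; the function names are invented). Each "rule" returns how many characters it consumes at the start of the input, or None on failure:

```python
def int_lit(s):
    # IntL <- [0-9]+
    n = 0
    while n < len(s) and s[n].isdigit():
        n += 1
    return n if n > 0 else None

def float_lit(s):
    # FloatL <- [0-9]+ ('.' [0-9]+)?
    n = int_lit(s)
    if n is None:
        return None
    if n < len(s) and s[n] == '.':        # optional fractional part
        m = int_lit(s[n + 1:])
        if m is not None:
            n += 1 + m
    return n

def ordered_choice(s, *parsers):
    # PEG '/': try alternatives left to right, commit to the first match
    for p in parsers:
        n = p(s)
        if n is not None:
            return n
    return None

# Literal <- IntL / FloatL : IntL wins on "3.14", consuming only "3"
print(ordered_choice("3.14", int_lit, float_lit))   # 1
# Literal <- FloatL / IntL : FloatL wins, consuming all of "3.14"
print(ordered_choice("3.14", float_lit, int_lit))   # 4
```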

>>
>
> My mistake.. cleaned up stuff..
>
> Pegged          Wirth EBNF
>
> Sequence
> A <- B C        A = BC.
>
> B or C
> A <- B / C      A = B|C.
>
> Zero or one B
> A <- B?         A = [B].
>
> Zero or more Bs
> A <- B*         A = {B}.
>
> One or more Bs
> A <- B+         Not available
>
> PEG description of EBNF
>
> EBNF <- Production+
> Production <- Identifier '=' Expression '.'
> Expression <- Term ( '|' Term)*
> Term <- Factor Factor*
> Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}'
> / '(' Expression ')'
> lowerCase <- [a-z]
> upperCase <- [A-Z]
> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

Why not:
Identifier <- [a-zA-Z]+

> Literal <- ("'" .+ "'") / ('"' .+ '"')
>
This needs escaping. A plain '.+' in a pattern asks for trouble 99% of the time.

> Still not sure if this is correct. Especially :
> Term <- Factor Factor*
>
>
> Another thing I never really understood is the "production" order. In
> other words: why not top down?
> Start:
> lowerCase <- [a-z]
> upperCase <- [A-Z]
> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

> ....
>
> End :
> EBNF <- Production+
>
> where End is Root..

In fact grammars are usually devised the other way around, e.g.
Start:
 Program -> ...
Ehm... what is the whole program, exactly? OK, let it be Declaration* for now. What kinds of declarations do we have? ...and so on. Later the grammar gets tweaked and extended numerous times.

At any rate, production order has no effect on the grammar; it's still the same. The only thing of importance is which non-terminal is considered final (or the start symbol, if you are LL-centric).

>
> TIA, Bjoern


-- 
Dmitry Olshansky
March 13, 2012
On 13-03-2012 17:17, Dmitry Olshansky wrote:
> On 12.03.2012 17:45, bls wrote:
>> On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
>>> On 12.03.2012 16:43, bls wrote:
>>>> On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
>>>>> Hello,
>>>>>
>>>>> I created a new Github project, Pegged, a Parsing Expression Grammar
>>>>> (PEG) generator in D.
>>>>>
>>>>> https://github.com/PhilippeSigaud/Pegged
>>>>>
>>>>> docs: https://github.com/PhilippeSigaud/Pegged/wiki
>>>>
>>>> Just WOW!
>>>>
>>>> Nice to have on your WIKI would be an EBNF to PEG sheet.
>>>>
>>>> Wirth EBNF Pegged
>>>> A = BC. A <- B C
>>>> A = B|C. A <- C / C
>>>
>>> Maybe A <- B / C. And even then it's not exactly equivalent if the
>>> grammar was ambiguous.
>>> Imagine: B <- a, C <- aa
>> PEG is pretty new to me. Can you elaborate a bit?
>
> PEG defines an order among alternatives, pretty much the way a top-down
> recursive descent parser would parse it. Alternatives are tried from
> left to right; if the first one fails, it tries the next, and so on.
> In the example I gave, B is always picked first, so C is never
> looked at.
>
> Somewhat less artificial example:
> Literal <- IntL / FloatL
> FloatL <- [0-9]+ ('.' [0-9]+)?
> IntL <- [0-9]+
>
> If you change it to Literal <- FloatL / IntL, then integer literals would
> get parsed as floating point.
>
>>>
>>
>> My mistake.. cleaned up stuff..
>>
>> Pegged          Wirth EBNF
>>
>> Sequence
>> A <- B C        A = BC.
>>
>> B or C
>> A <- B / C      A = B|C.
>>
>> Zero or one B
>> A <- B?         A = [B].
>>
>> Zero or more Bs
>> A <- B*         A = {B}.
>>
>> One or more Bs
>> A <- B+         Not available
>>
>> PEG description of EBNF
>>
>> EBNF <- Production+
>> Production <- Identifier '=' Expression '.'
>> Expression <- Term ( '|' Term)*
>> Term <- Factor Factor*
>> Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}'
>> / '(' Expression ')'
>> lowerCase <- [a-z]
>> upperCase <- [A-Z]
>> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
>
> Why not:
> Identifier <- [a-zA-Z]+

That was an illustrative example from the Pegged docs. But yeah, you should just use a range; it reads nicer.

>
>> Literal <- ("'" .+ "'") / ('"' .+ '"')
>>
> This needs escaping. A plain '.+' in a pattern asks for trouble 99% of the time.
>
>> Still not sure if this is correct. Especially :
>> Term <- Factor Factor*
>>
>>
>> Another thing I never really understood is the "production" order. In
>> other words: why not top down?
>> Start:
>> lowerCase <- [a-z]
>> upperCase <- [A-Z]
>> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
>
>> ....
>>
>> End :
>> EBNF <- Production+
>>
>> where End is Root..
>
> In fact grammars are usually devised the other way around, e.g.
> Start:
> Program -> ...
> Ehm... what is the whole program, exactly? OK, let it be Declaration*
> for now. What kinds of declarations do we have? ...and so on. Later the
> grammar gets tweaked and extended numerous times.
>
> At any rate, production order has no effect on the grammar; it's still
> the same. The only thing of importance is which non-terminal is considered
> final (or the start symbol, if you are LL-centric).
>
>>
>> TIA, Bjoern
>
>


-- 
- Alex
March 13, 2012
I am impressed. That's a really nice showcase for D's compile-time features.

Can I use PEG to parse languages like Python and Haskell, where indentation matters, without preprocessing?

Will you make it work with input ranges of dchar, so that I can easily plug in some preprocessing steps?


March 13, 2012
On Tue, Mar 13, 2012 at 18:05, Alex Rønne Petersen <xtzgzorex@gmail.com> wrote:

>>> lowerCase <- [a-z]
>>> upperCase <- [A-Z]
>>> Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
>>
>>
>> Why not:
>> Identifier <- [a-zA-Z]+
>
>
> That was an illustrative example from the Pegged docs. But yeah, you should just use a range; reads nicer.

The docs are for teaching PEG :) (btw, the docs describe C-like
identifiers; that's why I chose the longer approach)
There's always this tension between inlining and refactoring.
[a-zA-Z]+ is shorter and more readable, but if you decide to extend
your grammar to UTF-32, it'd be easier to just change the 'letter'
rule.
March 13, 2012
On Tue, Mar 13, 2012 at 17:17, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> PEG defines order of alternatives, that is pretty much like a top-down recursive descent parser would parse it. Alternatives are tried from left to right, if first one fails, it tries next and so on.

Yes.

> In an example I give B is always picked first and so C is never ever looked at.

That's one of the caveats of PEG. That, and greedy operators.

'a*a' never succeeds because 'a*' consumes all the available a's.
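This can be sketched in a few lines of Python (a toy model of the semantics, not Pegged; the function name is invented): PEG repetition is possessive, i.e. the star never gives characters back, unlike a backtracking regex engine.

```python
import re

def peg_star_a_then_a(s):
    # PEG 'a* a': the star consumes greedily and never backtracks.
    i = 0
    while i < len(s) and s[i] == "a":
        i += 1                     # a* eats every 'a'
    # the sequence then requires one more 'a', but none is left
    return i + 1 if i < len(s) and s[i] == "a" else None

print(peg_star_a_then_a("aaa"))                   # None: the PEG rule fails
print(re.fullmatch(r"a*a", "aaa") is not None)    # True: the regex backtracks
```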


>> Literal <- ("'" .+ "'") / ('"' .+ '"')
>>
> This needs escaping. A plain '.+' in a pattern asks for trouble 99% of the time.

This is an example where PEG would munch everything till the end and then
fail (since the closing '"' is not found in the empty remaining string).
The standard PEG way to do that is

(!EndMarker .)+ EndMarker

It's a common enough pattern that I tend to define a parameterized rule for it:

Until(Expr) <- (!Expr .)* Expr


Gotta love parameterized rules! 15 minutes to code, a never-ending stream of joy.
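The Until pattern is easy to sketch in Python for a literal end marker (a toy model of the PEG semantics, not Pegged's implementation; the helper name is invented):

```python
def until(s, end):
    # Until(Expr) <- (!Expr .)* Expr, for a literal end marker 'end'.
    # The negative lookahead stops the star right before the marker,
    # so the marker itself is still available to be matched.
    i = 0
    while i < len(s):
        if s.startswith(end, i):   # lookahead !Expr would fail here: stop
            return i + len(end)    # ... then match Expr itself
        i += 1                     # '.': consume one character
    return None                    # marker never found: the whole rule fails

# Consume a single-quoted string body up to and including the closing quote:
print(until("abc' rest", "'"))      # 4 (consumed "abc'")
print(until("no end marker", "'"))  # None, instead of munching to the end
```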


> In fact grammars are usually devised the other way around, e.g.
> Start:
>  Program -> ...
> Ehm... what is the whole program, exactly? OK, let it be Declaration* for
> now. What kinds of declarations do we have? ...and so on. Later the
> grammar gets tweaked and extended numerous times.
>
> At any rate, production order has no effect on the grammar; it's still the same. The only thing of importance is which non-terminal is considered final (or the start symbol, if you are LL-centric).

Yes. The PEG convention seems to be that the first rule is the start
(root) rule. Pegged lets you decide which rule you use as the start; the
rest is automatic.
I might change that.
March 13, 2012
On Mon, Mar 12, 2012 at 13:43, bls <bizprac@orange.fr> wrote:

> Just WOW!

Thanks! Don't be too excited, it's still quite slow as a parser. But it's a fun project :)

> Nice to have on your WIKI would be an EBNF to PEG sheet.
>
> Wirth EBNF      Pegged
> A = BC.         A <- B C
> A = B|C.        A <- C / C
> A = [B].        A <- B?
> A = {B}.        A <- B*

Fact is, I don't know EBNF that much. I basically learned everything I know about parsing and grammars while coding Pegged in February :) I probably made every mistake in the book.

Hey, it's a public GitHub wiki, I guess you can create a new page?
March 14, 2012
On 14.03.2012 2:49, Philippe Sigaud wrote:
> On Tue, Mar 13, 2012 at 17:17, Dmitry Olshansky<dmitry.olsh@gmail.com>  wrote:
>
>> PEG defines an order among alternatives, pretty much the way a top-down
>> recursive descent parser would parse it. Alternatives are tried from left to
>> right; if the first one fails, it tries the next, and so on.
>
> Yes.
>
>> In the example I gave, B is always picked first, so C is never looked
>> at.
>
> That's one of the caveats on PEG. That and greedy operators.
>
> 'a*a' never succeeds because 'a*' consumes all the available a's.
>

Hey, wait, I thought there had to be backtracking here, i.e. when the second 'a' finally fails?

>
>>> Literal <- ("'" .+ "'") / ('"' .+ '"')
>>>
>> This needs escaping. A plain '.+' in a pattern asks for trouble 99% of the time.
>
> This is an example where PEG would munch everything till the end and then
> fail (since the closing '"' is not found in the empty remaining string).
> The standard PEG way to do that is
>
> (!EndMarker .)+ EndMarker
>
> It's a common enough pattern I tend to define a parameterized rule for that:
>
> Until(Expr) <- (!Expr .)* Expr
>
>
> Gotta love parameterized rules! 15 minutes to code, a never-ending stream of joy.

Nice!

>
>
>> In fact grammars are usually devised the other way around, e.g.
>> Start:
>>   Program -> ...
>> Ehm... what is the whole program, exactly? OK, let it be Declaration* for
>> now. What kinds of declarations do we have? ...and so on. Later the
>> grammar gets tweaked and extended numerous times.
>>
>> At any rate, production order has no effect on the grammar; it's still the
>> same. The only thing of importance is which non-terminal is considered
>> final (or the start symbol, if you are LL-centric).
>
> Yes. The PEG convention seems to be that the first rule is the start
> (root) rule. Pegged lets you decide which rule you use as the start; the
> rest is automatic.
> I might change that.


-- 
Dmitry Olshansky
March 14, 2012
On Saturday, 10 March 2012 at 23:28:42 UTC, Philippe Sigaud wrote:
> * Grammars can be dumped in a file to create a D module.
>


In reading the D spec I've seen a few instances where there are inferred items, such as auto for variables and various parameters in templates. I presume your D grammar will have to have some rules to be able to infer these as well.

It seems like it would not be a big step to output the inferred, fully explicit statements as the .capture output. Is that correct? I think it would be helpful in some cases to be able to view the fully expanded expression.