April 15, 2011
I used your tool yesterday. I ran it on a simple C file with the ANSI C grammar from the GOLD website. It does seem to work fine, but yeah, I have to preprocess a C file first (I've spent so much time with D that I almost completely forgot about the C preprocessor in the first place).

I've tried a file with your ParseAnything sample. It works ok as long as all the types are defined. If not I usually get a Token exception of some sort. Is this considered the semantic pass stage?

Btw, is there a grammar file for C99? What about C++? I haven't seen a grammar for it on the GOLD website (well, C++ is a monster, I know...).

I'm also trying to figure out whether to go with the static or dynamic approach (I've looked at your docs). The static examples seem quite complex, but perhaps they're more reliable. I think I'll do a few tryouts with dynamic style since it looks much easier to do. If I get anything done you'll know about it. :)
April 15, 2011
"Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message news:mailman.3531.1302884207.4748.digitalmars-d-announce@puremagic.com...
> I've used your tool yesterday. I used it on a simple C file with the ANSI C grammar from the gold website. It does seem to work fine, but yeah I have to preprocess a C file first (I've spent so much time with D that I almost completely forgot about the C preprocessor in the first place).
>
> I've tried a file with your ParseAnything sample. It works ok as long as all the types are defined. If not I usually get a Token exception of some sort. Is this considered the semantic pass stage?
>

Like any generalized parsing tool (AFAIK), Goldie doesn't really have a semantic stage (because language semantics isn't something that's easily formalized).

Probably the C grammar just considers something in your source to be either a syntax or grammatical error. (This could be a bug or limitation in the C grammar.) Goldie currently handles syntax/grammatical errors by throwing a ParseException once it has detected all the errors it can find. The exception's message is the "filename(line:col): Error: Description of error" message that you'd normally expect a compiler to output. Most of the apps in Goldie catch this exception and just output the message, but I guess I didn't do that in ParseAnything.
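
To illustrate the catch-and-print pattern described above, here is a minimal Python sketch. It is not Goldie's API: the ParseError class, the toy parse() rule (reject two identifiers in a row) and run() are all hypothetical names made up for this example. Only the "filename(line:col): Error: ..." message shape comes from the description above.

```python
class ParseError(Exception):
    """Hypothetical stand-in for a parser's error exception (not Goldie's API)."""

    def __init__(self, filename, line, col, description):
        self.filename = filename
        self.line = line
        self.col = col
        self.description = description
        super().__init__(self.format())

    def format(self):
        # Compiler-style message: filename(line:col): Error: description
        return f"{self.filename}({self.line}:{self.col}): Error: {self.description}"


def parse(filename, tokens):
    """Toy 'parser': rejects two identifiers in a row.

    tokens is a list of (kind, text, line, col) tuples.
    """
    prev_kind = None
    for kind, text, line, col in tokens:
        if kind == "Id" and prev_kind == "Id":
            raise ParseError(filename, line, col, f"Unexpected Id: '{text}'")
        prev_kind = kind
    return "ok"


def run(filename, tokens):
    """Catch the error and return just its message, as a command-line tool would print it."""
    try:
        return parse(filename, tokens)
    except ParseError as err:
        return str(err)
```

A front end written this way shows the user a one-line diagnostic instead of an uncaught exception with a stack trace.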

Of course, it could also be a bug in either ParseAnything or Goldie. Can you send one of the C files that's getting an error? I'll take a look and see what's going on.

You may want to try "goldie-parse" instead of "goldie-parseAnything" (I really should rename one of them, it's probably confusing). "goldie-parseAnything" is mainly intended as an example of how to use Goldie (like the Calculator examples). "goldie-parse" is the one that outputs JSON.


> Btw, is there a grammar file for C99? What about C++, I haven't seen a grammar on the Gold website? (well, C++ is a monster, I know..).
>

Not that I'm aware of. But if you know the differences between ANSI C and C99 you should be able to modify the ANSI C grammar and turn it into a C99 grammar. The grammar description language should be very easy to understand if you're familiar with BNF and regex (in fact, the grammar definition language doesn't even use the barely-readable Perl regex syntax - it uses a far more readable equivalent instead). BTW, a tip on the grammar language: everything enclosed in angle brackets is a nonterminal.

And yea, C++ is a beast. And one of C++'s biggest issues is that, not only does it have the preprocessor, but what's worse: the parsing is dependent on the semantics pass. I'd say that any generalized parsing tool that can do C++ properly is doing an *incredibly* damn good job.


> I'm also trying to figure out whether to go with the static or dynamic approach (I've looked at your docs). The static examples seem quite complex, but perhaps they're more reliable. I think I'll do a few tryouts with dynamic style since it looks much easier to do.

The general recommendation is to use static whenever you just have one specific grammar you're trying to deal with (because it provides better protection against mistakes). But you're right, the dynamic style may be an easier way to learn Goldie.

If you haven't already, you may want to look at the source for the calculator examples. They're both the exact same program, but one does it the static way, and the other does it the dynamic way.

> If I get anything done you'll know about it. :)

Cool, appreciated :)


April 15, 2011
What I meant was that code like this will throw if MyType isn't defined anywhere:

int main(int x)
{
    MyType var;
}

goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
test.c(3:12): Unexpected Id: 'var'

It looks like valid C /syntax/, except that MyType isn't defined. But
this will work:
struct MyType {
       int field;
};
int main(int x)
{
    struct MyType var;
}

So either Goldie or ParseAnything needs to have all types defined. Maybe this is obvious, but I wouldn't know since I've never used a parser before. :p

Oddly enough, this one will throw:
typedef struct {
    int field;
} MyType;
int main(int x)
{
    MyType var;
}

goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
test.c(7:12): Unexpected Id: 'var'

This one will throw as well:
struct SomeStruct {
    int field;
};
typedef struct SomeStruct MyType;
int main(int x)
{
    MyType var;
}

goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
test.c(13:12): Unexpected Id: 'myvar'

Isn't typedef a part of ANSI C?
April 16, 2011
Andrej Mitrovic Wrote:

> What I meant was that code like this will throw if MyType isn't defined anywhere:
> 
> int main(int x)
> {
>     MyType var;
> }
> 
> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
> test.c(3:12): Unexpected Id: 'var'
> 
> It looks like valid C /syntax/, except that MyType isn't defined. But
> this will work:
> struct MyType {
>        int field;
> };
> int main(int x)
> {
>     struct MyType var;
> }
> 
> So either Goldie or ParseAnything needs to have all types defined. Maybe this is obvious, but I wouldn't know since I've never used a parser before. :p
> 
> Oddly enough, this one will throw:
> typedef struct {
>     int field;
> } MyType;
> int main(int x)
> {
>     MyType var;
> }
> 
> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
> test.c(7:12): Unexpected Id: 'var'
> 
> This one will throw as well:
> struct SomeStruct {
>     int field;
> };
> typedef struct SomeStruct MyType;
> int main(int x)
> {
>     MyType var;
> }
> 
> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
> test.c(13:12): Unexpected Id: 'myvar'
> 
> Isn't typedef a part of ANSI C?

I'm not at my computer right now, so I can't check, but it sounds like the grammar follows the really old C-style of requiring structs to be declared with "struct StructName varName". Apparently it doesn't take into account the possibility of typedefs being used to eliminate that. When I get home, I'll check. I think it may be an easy change to the grammar.

April 16, 2011
"Nick Sabalausky" <a@a.a> wrote in message news:ioanmi$82c$1@digitalmars.com...
> Andrej Mitrovic Wrote:
>
>> What I meant was that code like this will throw if MyType isn't defined anywhere:
>>
>> int main(int x)
>> {
>>     MyType var;
>> }
>>
>> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
>> test.c(3:12): Unexpected Id: 'var'
>>
>> It looks like valid C /syntax/, except that MyType isn't defined. But
>> this will work:
>> struct MyType {
>>        int field;
>> };
>> int main(int x)
>> {
>>     struct MyType var;
>> }
>>
>> So either Goldie or ParseAnything needs to have all types defined. Maybe this is obvious, but I wouldn't know since I've never used a parser before. :p
>>
>> Oddly enough, this one will throw:
>> typedef struct {
>>     int field;
>> } MyType;
>> int main(int x)
>> {
>>     MyType var;
>> }
>>
>> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
>> test.c(7:12): Unexpected Id: 'var'
>>
>> This one will throw as well:
>> struct SomeStruct {
>>     int field;
>> };
>> typedef struct SomeStruct MyType;
>> int main(int x)
>> {
>>     MyType var;
>> }
>>
>> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
>> test.c(13:12): Unexpected Id: 'myvar'
>>
>> Isn't typedef a part of ANSI C?
>
> I'm not at my computer right now, so I can't check, but it sounds like the grammar follows the really old C-style of requiring structs to be declared with "struct StructName varName". Apparently it doesn't take into account the possibility of typedefs being used to eliminate that. When I get home, I'll check. I think it may be an easy change to the grammar.
>

Yea, turns out that grammar just doesn't support using user-defined types without preceding them with "struct", "union", or "enum". You can see that here:

<Var Decl>     ::= <Mod> <Type> <Var> <Var List>  ';'
                 |       <Type> <Var> <Var List>  ';'
                 | <Mod>        <Var> <Var List>  ';'

<Mod>      ::= extern
             | static
             | register
             | auto
             | volatile
             | const

<Type>     ::= <Base> <Pointers>

<Base>     ::= <Sign> <Scalar>  ! Ie, the built-ins like char, signed int, etc...
             | struct Id
             | struct '{' <Struct Def> '}'
             | union Id
             | union '{' <Struct Def> '}'
             | enum Id

So when you use "MyType" instead of "struct MyType": It sees "MyType", assumes it's a variable since it doesn't match any of the <Type> forms above, and then barfs on "var" because "variable1 variable2" isn't valid C code. Normally, you'd just add another form to <Base> (Ie, add a line after "  | enum Id" that says "  | Id "). Except, the problem is...

C is notorious for types and variables being ambiguous with each other. So the distinction pretty much has to be done in the semantic phase (ie, outside of the formal grammar). But this grammar seems to be trying to make that distinction anyway. So trying to fix it by just simply adding a "<Base> ::= Id" leads to ambiguity problems with types versus variables/expressions. That's probably why they didn't enhance the grammar that far - their "separation of type and variable" approach doesn't really work for C.
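
For what it's worth, the usual way C compilers push that distinction outside the grammar is to keep a table of typedef names and have the lexer consult it (often called the "lexer hack"). Below is a rough Python sketch of the idea, not anything Goldie or GOLD does today. The names tag_tokens, KEYWORDS and the "TypeName" token kind are made up for this example, and it only understands the simple `typedef ... Name;` form.

```python
import re

# Tiny keyword set, just enough for the examples in this thread.
KEYWORDS = {"typedef", "struct", "union", "enum", "int", "char", "void"}
IDENT = re.compile(r"[A-Za-z_]\w*")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tag_tokens(source):
    """Return (kind, text) pairs, tagging known typedef names as 'TypeName'.

    A name is registered when the ';' that ends its typedef is seen, so
    the occurrence inside the typedef itself is still a plain 'Id'.
    """
    typedef_names = set()
    tagged = []
    in_typedef = False
    depth = 0          # brace depth relative to the start of the typedef
    candidate = None   # last identifier seen outside any braces
    for text in TOKEN.findall(source):
        if text in KEYWORDS:
            kind = "Keyword"
            if text == "typedef":
                in_typedef, depth, candidate = True, 0, None
        elif IDENT.fullmatch(text):
            kind = "TypeName" if text in typedef_names else "Id"
            if in_typedef and depth == 0:
                candidate = text
        else:
            kind = "Symbol"
            if in_typedef:
                if text == "{":
                    depth += 1
                elif text == "}":
                    depth -= 1
                elif text == ";" and depth == 0:
                    if candidate is not None:
                        typedef_names.add(candidate)
                    in_typedef = False
        tagged.append((kind, text))
    return tagged
```

With tokens tagged like this, the grammar could use a separate terminal for type names in <Base> and keep Id for <Value>, so the two no longer collide. The price is that the lexer now depends on declarations seen earlier in the file, which is exactly the kind of feedback a purely grammar-driven tool doesn't have.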

I'll have to think a bit on how best to adjust it. You can also check the GOLD mailing lists here to see if anyone has another C grammar:

http://www.devincook.com/goldparser/contact.htm



April 16, 2011
Nick Sabalausky Wrote:

> Yea, turns out that grammar just doesn't support using user-defined types without preceding them with "struct", "union", or "enum". You can see that here:
> 
> <Var Decl>     ::= <Mod> <Type> <Var> <Var List>  ';'
>                  |       <Type> <Var> <Var List>  ';'
>                  | <Mod>        <Var> <Var List>  ';'
> 
> <Mod>      ::= extern
>              | static
>              | register
>              | auto
>              | volatile
>              | const
> 
> <Type>     ::= <Base> <Pointers>
> 
> <Base>     ::= <Sign> <Scalar>  ! Ie, the built-ins like char, signed int, etc...
>              | struct Id
>              | struct '{' <Struct Def> '}'
>              | union Id
>              | union '{' <Struct Def> '}'
>              | enum Id
> 
> So when you use "MyType" instead of "struct MyType": It sees "MyType", assumes it's a variable since it doesn't match any of the <Type> forms above, and then barfs on "var" because "variable1 variable2" isn't valid C code. Normally, you'd just add another form to <Base> (Ie, add a line after "  | enum Id" that says "  | Id "). Except, the problem is...
> 
> C is notorious for types and variables being ambiguous with each other.

As I understand, <Type> is a type, <Var> is a variable. There should be no problem here.
April 16, 2011
"Kagamin" <spam@here.lot> wrote in message news:iod552$rbe$1@digitalmars.com...
> Nick Sabalausky Wrote:
>
>> Yea, turns out that grammar just doesn't support using user-defined types without preceding them with "struct", "union", or "enum". You can see that here:
>>
>> <Var Decl>     ::= <Mod> <Type> <Var> <Var List>  ';'
>>                  |       <Type> <Var> <Var List>  ';'
>>                  | <Mod>        <Var> <Var List>  ';'
>>
>> <Mod>      ::= extern
>>              | static
>>              | register
>>              | auto
>>              | volatile
>>              | const
>>
>> <Type>     ::= <Base> <Pointers>
>>
>> <Base>     ::= <Sign> <Scalar>  ! Ie, the built-ins like char, signed int, etc...
>>              | struct Id
>>              | struct '{' <Struct Def> '}'
>>              | union Id
>>              | union '{' <Struct Def> '}'
>>              | enum Id
>>
>> So when you use "MyType" instead of "struct MyType": It sees "MyType", assumes it's a variable since it doesn't match any of the <Type> forms above, and then barfs on "var" because "variable1 variable2" isn't valid C code. Normally, you'd just add another form to <Base> (Ie, add a line after "  | enum Id" that says "  | Id "). Except, the problem is...
>>
>> C is notorious for types and variables being ambiguous with each other.
>
> As I understand, <Type> is a type, <Var> is a variable. There should be no problem here.

First of all, the name <Var> up there is misleading. That only refers to the "name of the variable" in the variable's declaration. When actually *using* a variable, that's a <Value>, which is defined like this:

<Value>      ::= OctLiteral
               | HexLiteral
               | DecLiteral
               | StringLiteral
               | CharLiteral
               | FloatLiteral
               | Id '(' <Expr> ')'   ! Function call
               | Id '(' ')'             ! Function call
               | Id   ! Use a variable
               | '(' <Expr> ')'

So we have a situation like this:

<Type> ::= <Base>
<Base> ::= Id
<Value> ::= Id

So when the parser encounters an Id, how does it know whether to reduce it to a <Base> or a <Value>? Since they can both appear in the same place (Ex: Immediately after a left curly-brace, such as at the start of a function body), there's no way to tell.

Worse, suppose it comes across this:

x*y

If x is a variable, then that's a multiplication. If x is a type then it's a pointer declaration. Is it supposed to be multiplication or a declaration? Could be either. They're both permitted in the same place.
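
To make that concrete, here is a small Python sketch in which the same four tokens produce two different results depending only on a set of known type names. The function name interpret and the tuple shapes it returns are invented for this illustration. It is not a real C parser.

```python
def interpret(tokens, type_names):
    """Interpret `a * b ;` as a pointer declaration or a multiplication.

    type_names is the set of identifiers currently known to be types.
    The token sequence alone cannot decide this; only that set does.
    """
    if len(tokens) != 4 or tokens[1] != "*" or tokens[3] != ";":
        raise ValueError("expected tokens of the form: a * b ;")
    left, _, right, _ = tokens
    if left in type_names:
        # `x` names a type, so `x *y;` declares y as a pointer to x.
        return ("declaration", {"type": left + "*", "name": right})
    # Otherwise `x` is a value, so `x * y;` is a multiplication.
    return ("expression", ("mul", left, right))
```

Calling interpret(["x", "*", "y", ";"], {"x"}) gives the declaration, and the same call with an empty set gives the multiplication. The input text is identical in both cases.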



April 16, 2011
"Nick Sabalausky" <a@a.a> wrote in message news:iod6fn$tch$1@digitalmars.com...
> "Kagamin" <spam@here.lot> wrote in message news:iod552$rbe$1@digitalmars.com...
>>
>> As I understand, <Type> is a type, <Var> is a variable. There should be no problem here.
>
> First of all, the name <Var> up there is misleading. That only refers to the "name of the variable" in the variable's declaration. When actually *using* a variable, that's a <Value>, which is defined like this:
>
> <Value>      ::= OctLiteral
>               | HexLiteral
>               | DecLiteral
>               | StringLiteral
>               | CharLiteral
>               | FloatLiteral
>               | Id '(' <Expr> ')'   ! Function call
>               | Id '(' ')'             ! Function call
>               | Id   ! Use a variable
>               | '(' <Expr> ')'
>
> So we have a situation like this:
>
> <Type> ::= <Base>
> <Base> ::= Id
> <Value> ::= Id
>
> So when the parser encounters an Id, how does it know whether to reduce it to a <Base> or a <Value>? Since they can both appear in the same place (Ex: Immediately after a left curly-brace, such as at the start of a function body), there's no way to tell.
>
> Worse, suppose it comes across this:
>
> x*y
>
> If x is a variable, then that's a multiplication. If x is a type then it's a pointer declaration. Is it supposed to be multiplication or a declaration? Could be either. They're both permitted in the same place.
>

In other words, we basically have a form of this:

<A> ::= <B> | <C>
<B> ::= X
<C> ::= X

Can't be done. No way to tell if X is <B> or <C>.
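
A quick way to spot this pattern in a grammar is to list the nonterminals that share an identical right-hand side. The Python sketch below does only that. It is a warning-sign check, not a real LALR(1) conflict test (a real generator also compares lookahead sets), and the dict-of-tuples grammar representation is just something assumed for this example.

```python
from collections import defaultdict


def identical_rhs(grammar):
    """Map each right-hand side to the nonterminals that share it.

    grammar: dict of nonterminal -> list of alternatives, where each
    alternative is a tuple of symbol names. Only right-hand sides used
    by more than one nonterminal are returned.
    """
    by_rhs = defaultdict(set)
    for head, alternatives in grammar.items():
        for rhs in alternatives:
            by_rhs[tuple(rhs)].add(head)
    return {rhs: sorted(heads) for rhs, heads in by_rhs.items() if len(heads) > 1}


# The grammar from the post above: <A> ::= <B> | <C>, <B> ::= X, <C> ::= X
GRAMMAR = {
    "A": [("B",), ("C",)],
    "B": [("X",)],
    "C": [("X",)],
}
```

For GRAMMAR, identical_rhs reports that X can reduce to either B or C. Since B and C appear in exactly the same position under A, whatever follows X is the same in both cases, so one token of lookahead can't pick between them.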


April 17, 2011
"Nick Sabalausky" <a@a.a> wrote in message news:iobh9o$1d04$1@digitalmars.com...
> "Nick Sabalausky" <a@a.a> wrote in message news:ioanmi$82c$1@digitalmars.com...
>> Andrej Mitrovic Wrote:
>>
>>> What I meant was that code like this will throw if MyType isn't defined anywhere:
>>>
>>> int main(int x)
>>> {
>>>     MyType var;
>>> }
>>>
>>> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
>>> test.c(3:12): Unexpected Id: 'var'
>>>
>>> It looks like valid C /syntax/, except that MyType isn't defined. But
>>> this will work:
>>> struct MyType {
>>>        int field;
>>> };
>>> int main(int x)
>>> {
>>>     struct MyType var;
>>> }
>>>
>>> So either Goldie or ParseAnything needs to have all types defined. Maybe this is obvious, but I wouldn't know since I've never used a parser before. :p
>>>
>>> Oddly enough, this one will throw:
>>> typedef struct {
>>>     int field;
>>> } MyType;
>>> int main(int x)
>>> {
>>>     MyType var;
>>> }
>>>
>>> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
>>> test.c(7:12): Unexpected Id: 'var'
>>>
>>> This one will throw as well:
>>> struct SomeStruct {
>>>     int field;
>>> };
>>> typedef struct SomeStruct MyType;
>>> int main(int x)
>>> {
>>>     MyType var;
>>> }
>>>
>>> goldie.exception.UnexpectedTokenException@src\goldie\exception.d(35):
>>> test.c(13:12): Unexpected Id: 'myvar'
>>>
>>> Isn't typedef a part of ANSI C?
>>
>> I'm not at my computer right now, so I can't check, but it sounds like the grammar follows the really old C-style of requiring structs to be declared with "struct StructName varName". Apparently it doesn't take into account the possibility of typedefs being used to eliminate that. When I get home, I'll check. I think it may be an easy change to the grammar.
>>
>
> Yea, turns out that grammar just doesn't support using user-defined types without preceding them with "struct", "union", or "enum". You can see that here:
>
> <Var Decl>     ::= <Mod> <Type> <Var> <Var List>  ';'
>                 |       <Type> <Var> <Var List>  ';'
>                 | <Mod>        <Var> <Var List>  ';'
>
> <Mod>      ::= extern
>             | static
>             | register
>             | auto
>             | volatile
>             | const
>
> <Type>     ::= <Base> <Pointers>
>
> <Base>     ::= <Sign> <Scalar>  ! Ie, the built-ins like char, signed int, etc...
>             | struct Id
>             | struct '{' <Struct Def> '}'
>             | union Id
>             | union '{' <Struct Def> '}'
>             | enum Id
>
> So when you use "MyType" instead of "struct MyType": It sees "MyType", assumes it's a variable since it doesn't match any of the <Type> forms above, and then barfs on "var" because "variable1 variable2" isn't valid C code. Normally, you'd just add another form to <Base> (Ie, add a line after "  | enum Id" that says "  | Id "). Except, the problem is...
>
> C is notorious for types and variables being ambiguous with each other. So the distinction pretty much has to be done in the semantic phase (ie, outside of the formal grammar). But this grammar seems to be trying to make that distinction anyway. So trying to fix it by just simply adding a "<Base> ::= Id" leads to ambiguity problems with types versus variables/expressions. That's probably why they didn't enhance the grammar that far - their "separation of type and variable" approach doesn't really work for C.
>
> I'll have to think a bit on how best to adjust it. You can also check the GOLD mailing lists here to see if anyone has another C grammar:
>
> http://www.devincook.com/goldparser/contact.htm
>

Unfortunately, I think this may require LALR(k). Goldie and GOLD are only LALR(1) right now.

I had been under the impression that LALR(1) was sufficient because, according to the oh-so-useful-in-the-real-world formal literature, any LR(k) grammar can *technically* be converted into a *cough* "equivalent" LR(1) grammar. But not only is the algorithm to do this hidden behind the academic ivory wall, but word on the street is that the resulting grammar is gigantic and bears little or no resemblance to the original structure (and is therefore essentially useless in the real world).

Seems I'm gonna have to add some backtracking or stack-cloning to Goldie, probably along with some sort of cycle-detection. (I think I'm starting to understand why Walter said he doesn't like to bother with parser generators, unngh...)
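
Roughly, the idea is that whenever the parser can't choose, it tries every alternative on its own copy of the state and keeps whichever ones survive. Here is a toy Python illustration of that "try them all" approach at the level of a single statement. It is not an LR algorithm and not a design for Goldie, and parse_declaration, parse_expression and all_parses are names invented for this sketch.

```python
def is_operand(tok):
    """An operand here is just an identifier or a decimal number."""
    return tok.isidentifier() or tok.isdigit()


def parse_declaration(tokens):
    """Type '*' Name ';' -> pointer declaration, or None if it doesn't fit."""
    if len(tokens) == 4 and tokens[1] == "*" and tokens[3] == ";":
        if tokens[0].isidentifier() and tokens[2].isidentifier():
            return ("declaration", tokens[0] + "*", tokens[2])
    return None


def parse_expression(tokens):
    """Operand '*' Operand ';' -> multiplication, or None if it doesn't fit."""
    if len(tokens) == 4 and tokens[1] == "*" and tokens[3] == ";":
        if is_operand(tokens[0]) and is_operand(tokens[2]):
            return ("expression", ("mul", tokens[0], tokens[2]))
    return None


def all_parses(tokens):
    """Try every alternative from the same starting point; keep the survivors."""
    results = []
    for alternative in (parse_declaration, parse_expression):
        tree = alternative(list(tokens))  # each attempt gets its own copy
        if tree is not None:
            results.append(tree)
    return results
```

For `x * 2 ;` only the expression survives, because a declared name can't be a number. For `x * y ;` both survive, so something after the parser still has to pick one, and in C that something needs to know which names are types.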


April 19, 2011
Nick Sabalausky Wrote:

> In other words, we basically have a form of this:
> 
> <A> ::= <B> | <C>
> <B> ::= X
> <C> ::= X
> 
> Can't be done. No way to tell if X is <B> or <C>.

A hairy grammar can be used here; Goldie's output needs postprocessing anyway, right?