July 08, 2012
On Saturday, 7 July 2012 at 22:40:05 UTC, Andrei Alexandrescu
wrote:

> http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers

Thanks, nice article.

Also found another post:
http://www.antlr.org/wiki/display/~admin/2008/11/30/Example+tree+rewriting+with+patterns,
which is related to some other discussion in this thread :)

July 08, 2012
On Saturday, 7 July 2012 at 22:35:37 UTC, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:
>> This is all true, but forgetting a brace commonly results in a barrage of error messages anyway. Code that guesses what you meant obviously won't work all the time, and phase 3 would have to take care not to emit an error message about a "{" token that doesn't actually exist (that was merely "guessed-in"). But at least it's nice for a parser to be /able/ to guess what you meant; for a typical parser it would be out of the question, upon detecting an error, to back up four source lines, insert a brace and try again.
>
> So you simply admit that error recovery is difficult to implement. For me, it is a must-have, and thus throwing away information is bad.

I'm not seeing any tremendous error-handling difficulty in my idea. Anyway, I missed the part about information being thrown away...?
July 08, 2012
On 07/07/2012 21:17, Dmitry Olshansky wrote:
> On 07-Jul-12 22:23, Andrei Alexandrescu wrote:
>> On 7/7/12 6:24 AM, Roman D. Boiko wrote:
>>> On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:
>>>> http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk
>>>>
>>>
>>> So far it looks like LALR parsers may have lower constant factors than
>>> Packrat.
>>>
>>> The difference could be minimized by paying attention to parsing of
>>> terminal symbols, which was in my plans already. It is not necessary to
>>> strictly follow Packrat parsing algorithm.
>>>
>>> The benefits of Pegged, in my view, are its support of Parsing
>>> Expression Grammar (PEG) and compile-time evaluation. It is easily
>>> extensible and modifiable.
>>
>> Isn't also the fact that lexing and parsing are integrated? With
>> traditional LALR you need a separate tokenizer.
>>
>
> I'll have to point out that the whole point about integrated lexing is
> moot. It's more of a liability than a benefit. At the very least it's
> just an implementation curiosity, not an advantage.
>

Many use cases only need a lexer.
July 08, 2012
>> Yeah, with a tree-transforming parser, I imagine the same thing, except
>> my current [fantasy] is to convert a certain subset of D to multiple other
>> languages automatically. Then I could write libraries that can easily be
>> used by an astonishingly large audience. I certainly would like to see D
>> targeting Android, but that's best done directly from D to ARM.
>>
> That does sound very cool.  Possibly difficult though, due to having to cater to the lowest-common-denominator in all of your API designs.  No templated functions or ranges in your API, that's for sure.  I'm sure there are some things where this is very doable though; it probably depends on what kind of libraries you are writing.

Well, for templates, in general, it would be necessary to instantiate a particular set of templates and explicitly give them names in the target language. So for instance, I could define a Point!T struct in D, sure, but then I'd have to tell the language converter to create target-language specializations: in C#, PointD=Point!double, PointI=Point!int, etc. If the target were C++, the template could be translated to a C++ template, Point<T>, as long as there aren't any "static ifs" or other things that can't be translated. Notably, if a template P!T depends on another template Q!T, then P!T cannot be translated to a C++/C# P<T> unless Q!T was also translated as Q<T>.
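For instance, something along these lines (just a sketch of the idea; the converter itself is hypothetical, only the aliases are plain D):

struct Point(T)
{
    T x;
    T y;

    T dot(Point!T other) { return x * other.x + y * other.y; }
}

// Explicitly named instantiations that a hypothetical converter would
// emit as concrete types in the target language (e.g. PointD and PointI
// in C#), since the target cannot instantiate the D template on demand.
alias PointD = Point!double;
alias PointI = Point!int;

unittest
{
    auto p = PointD(1.0, 2.0);
    assert(p.dot(PointD(3.0, 4.0)) == 11.0);
}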

Adapting standard libraries could no doubt be a gigantic problem. I don't know how to begin to think about doing that.

But for ranges in particular, I think the concept is too important to leave out of public interfaces. So I'd port the major range data structures to the target languages, most likely by hand, so that they could be used by converted code.
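Concretely, what each hand-ported range type would have to reproduce in the target language is the minimal input-range interface (a sketch; the Countdown type is made up):

import std.range.primitives : isInputRange;

// The minimal input-range interface: empty, front, popFront.
struct Countdown
{
    int current;

    @property bool empty() const { return current <= 0; }
    @property int front() const { return current; }
    void popFront() { --current; }
}

static assert(isInputRange!Countdown);

unittest
{
    import std.algorithm.comparison : equal;
    assert(Countdown(3).equal([3, 2, 1]));
}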

> As for D targeting Android, my intent is really to target X where X is any CPU/OS combo you can think of.  I want to be able to get D, the language, not necessarily phobos or other niceties, to work on any platform, and to do so without much work.  Cross-compiling to a new platform that has never been cross-compiled before should require zero coding.

I understand. Conversion to C is an effective last resort. And, well, I hear a lot of compilers have even used it as a standard practice. I guess you'd be stuck with refcounting, though.

> I think that the D-directly-to-ARM is the current approach for cross-compiling.  I critique it for its underwhelming lack of results.

Yeah. I assume it involves weird object-file formats, calling conventions and ABIs. I guess very few want to get involved with that stuff, and very few have the slightest clue where to begin, myself included.

> (2) suffer from integration problems if you try to compile the expressions in separate files before compiling the rest of the front-end.

Absolutely, I love language-integrated metaprogramming. Without it you end up with complicated build environments, and I hate those, cuz there isn't a single standard build environment that everybody likes. I think people should be able to just load up their favorite IDE and add all source files to the project and It Just Works. Or on the command line, do dmd *.d or whatever. Oh, and the ability to run the same code at meta-compile-time, compile-time and run-time, also priceless.
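D's CTFE already gives a taste of that last point. A small sketch (plain D, nothing specific to the converter idea):

// An ordinary function, usable both during compilation (CTFE) and at run time.
ulong fib(uint n)
{
    ulong a = 0, b = 1;
    foreach (i; 0 .. n)
    {
        immutable t = a + b;
        a = b;
        b = t;
    }
    return a;
}

// Evaluated at compile time: the result is baked into the binary.
enum fib10 = fib(10);
static assert(fib10 == 55);

unittest
{
    import std.stdio : writeln;
    writeln(fib(10)); // the very same function, evaluated at run time
}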
July 08, 2012
"Jonathan M Davis" <jmdavisProg@gmx.com> wrote in message news:mailman.163.1341683651.31962.digitalmars-d@puremagic.com...
> [snip]
>
> In all honesty though, I would be surprised if you could convince Walter
> to
> switch dmd's frontend to Phobos' lexer and parser even once they've been
> implemented. So, while I agree that there are benefits in doing so, I'm
> not
> sure how much chance you have of ever getting any traction with that.
>
> - Jonathan M Davis

A more realistic alternative is to modify dmd to optionally (at build/link time) use Phobos' lexer and/or parser, which would allow them to be run against the test suite.  Parser and lexer changes are relatively rare in the compiler, so keeping two implementations in sync is not really that big a deal.


July 08, 2012
On Sunday, 8 July 2012 at 01:15:31 UTC, David Piepgrass wrote:
> I'm not seeing any tremendous error-handling difficulty in my idea. Anyway, I missed the part about information being thrown away...?

(I will use the word "you" to refer to some abstract person or compiler.)

Error handling could mean either error reporting and stopping after the first error, or reporting several errors and continuing parsing + semantic analysis whenever possible, so that the user would get partial functionality like code completion, information about method overloads, or "go to definition" / find all usages, etc.

The first method is the less powerful option, but still requires constructing a meaningful error message which could help the user.

There are many possible error recovery strategies. For example, you might decide to insert some token, based on the parsing information available up to the moment the error is discovered, if that would fix the problem and allow parsing to continue. Another option is to skip a series of further tokens (treat them as whitespace, for example) until the parser is able to continue its work. There are also many heuristics for typical error situations.
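Roughly, those two strategies look like this (a sketch only; Token, TokenKind and Parser are made-up names, not taken from any existing lexer or parser):

enum TokenKind { identifier, semicolon, rbrace, eof /* ... */ }

struct Token { TokenKind kind; size_t offset; }

struct Parser
{
    Token[] tokens;
    size_t pos;
    string[] errors;

    // Strategy 1: insertion. If the expected token is missing, report it
    // and behave as if it were there, so parsing can continue.
    Token expect(TokenKind kind)
    {
        if (pos < tokens.length && tokens[pos].kind == kind)
            return tokens[pos++];
        errors ~= "expected token was missing; inserted it to recover";
        return Token(kind, pos < tokens.length ? tokens[pos].offset : 0);
    }

    // Strategy 2: panic mode. Skip tokens (treat them as whitespace)
    // until a synchronization point such as ';' or '}'.
    void skipToSync()
    {
        while (pos < tokens.length
               && tokens[pos].kind != TokenKind.semicolon
               && tokens[pos].kind != TokenKind.rbrace)
            ++pos;
    }
}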

All of these can only be performed if the parser gathers the syntactic information that can be inferred from the source code according to the grammar rules.

In stage 2 you have only performed some basic analysis, e.g., matched braces to define some hierarchy. This means that at the time you find a missing brace, for example, you cannot tell anything more than that the braces don't match. Or, if the user inserts a brace in an incorrect location, it is only possible to say that either a brace is missing in one place and another is missing somewhere else, or that one is missing in one place and some other brace is redundant. In many cases you won't even notice that a brace is incorrectly placed, and will pass the resulting tree to the 3rd stage. You don't take any hint from the grammar about the exact locations where some token should be present.

Now, stage 3 heavily depends on the output of stage 2. As I demonstrated in some examples, it could receive output that implies an incorrect structure, even if that was not detected in the previous stage. You would need to analyze so much information to find the real root of a problem that it would effectively involve (implicitly) duplicating the logic of the previous stage, but with combinatorial growth in complexity.

The problems you would need to deal with are much more complex than I have described here. So you wouldn't be able to deliver error recovery at all, or (if you could) it would either be trivial or require needlessly complex logic. Breaking the system at the wrong boundary (where the number of dependencies that cross the boundary is large) always introduces artificial complexity.

What I described above is an illustration of what I call information loss. I may not have described everything as clearly as needed, but my goal was not to provide a thorough and verifiable analysis; I speculated and simplified a lot. If you decide to ignore this, it is not a problem, and I don't claim that you will fail any of your goals. Just accept that __for me__ this approach is not a viable option. Everything written above is IMHO, and there may be many ways to resolve the problems with various degrees of success.
July 08, 2012
>> Yup, LL(*) is my favorite so far.
>
> That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.

We should also consider using GLR if LL(*) doesn't work.
July 08, 2012
On 08-Jul-12 13:05, Tobias Pankrath wrote:
>>> Yup, LL(*) is my favorite so far.
>>
>> That's Terence Parr's discovery, right? I've always liked ANTLR, so if
>> PEGs turn out to have issues LL(*) sounds like a promising alternative.
>
> We should also consider using GLR if LL(*) doesn't work.

GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.

-- 
Dmitry Olshansky


July 08, 2012
On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:
> On 08-Jul-12 13:05, Tobias Pankrath wrote:
>>>> Yup, LL(*) is my favorite so far.
>>>
>>> That's Terence Parr's discovery, right? I've always liked ANTLR, so if
>>> PEGs turn out to have issues LL(*) sounds like a promising alternative.
>>
>> We should also consider using GLR if LL(*) doesn't work.
>
> GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.

Since GLR can parse any CFG, the worst-case bound is O(n^3), but the actual performance seems to depend on the grammar.

From Elkhound [1]
> It should be competitive with Bison for LALR (fragments of) grammars, and degrade gracefully from there. On the scale of grammar nondeterminism, from none
> (LALR) to some to lots, "some" is the niche Elkhound is going after.
> These goals are driven by Elkhound's primary application, the Elsa C++ Parser. In essence, Elkhound came about because I wanted to apply automatic parsing technology to parsing C++, but found existing systems inadequate.

So it seems to me that it is not worse than PEG if you have a grammar with a reasonable number of ambiguities.

Performance is one thing, but you should be able to express your language in the underlying formalism without too many hurdles.



July 08, 2012
> From Elkhound [1]

http://scottmcpeak.com/elkhound/sources/elkhound/algorithm.html