August 13, 2019
On Tue, Aug 13, 2019 at 11:19:16AM +0200, Jacob Carlborg via Digitalmars-d wrote:
> On 2019-08-12 21:58, H. S. Teoh wrote:
> 
> > This is a big part of why C++'s must-be-parsed-before-it-can-be-lexed syntax is a big hindrance to meaningful progress.  The only way such a needlessly over-complex syntax can be handled is a needlessly over-complex lexer/parser combo, which necessarily results in needlessly over-complex corner cases and other such gotchas.  Part of this nastiness is the poor choice of template syntax (overloading '<' and '>' to be delimiters in addition to their original roles of comparison operators), among several other things.
> 
> I don't know how this is implemented in a C++ compiler but can't the lexer use a more abstract token that includes both the usage for templates and for comparison operators? The parser can then figure out exactly what it is.

It's not so simple.  The problem is that in C++, the *structure* of the parse tree changes depending on previous declarations. I.e., the lexical structure is not context-free.  For example, given this C++ code:

	int main() {
		A a;
		B b;

		// What do these lines do?
		fun<A, B>(a, b);
		gun<T, U>(a, b);
	}

What do you think the parse tree should be?

On the surface, it would appear that main() contains two variable
declarations, followed by calling two template functions with (a, b) as
the arguments.

Unfortunately, this is not true. The way the last two lines of main() are parsed can be *wildly divergent* depending on what declarations came before.  To see how this can be so, here's the full code (which I posted a while back in a discussion on template syntax):

-----------------------------------snip------------------------------------
// Totally evil example of why C++ template syntax and free-for-all operator
// overloading is a Bad, Bad Idea.
#include <iostream>

struct Bad { };

struct B { };

struct A {
	Bad operator,(B b) { return Bad(); }
};

struct D { };

struct Ugly {
	D operator>(Bad b) { return D(); }
} U;

struct Terrible { } T;

struct Evil {
	~Evil() {
		std::cout << "Hard drive reformatted." << std::endl;
	}
};

struct Nasty {
	Evil operator,(D d) { return Evil(); }
};

struct Idea {
	void operator()(A a, B b) {
		std::cout << "Good idea, data saved." << std::endl;
	}
	Nasty operator<(Terrible t) { return Nasty(); }
} gun;

template<typename T, typename U>
void fun(A a, B b) {
	std::cout << "Have fun!" << std::endl;
}

int main() {
	A a;
	B b;

	// What do these lines do?
	fun<A, B>(a, b);
	gun<T, U>(a, b);
}
-----------------------------------snip------------------------------------


Note that `gun` is not a template, and not even a function. It's a global struct instance with a completely-abusive series of operator overloads.

While I admit that this example is contrived, it does prove my point that you simply cannot parse C++ code in any straightforward way.  You have to use nasty hacks in both the lexer and the parser just to get the thing to parse at all, and this is not even touching the more pertinent topic of C++ semantic analysis, which in many places is even worse (SFINAE and Koenig Lookup, anyone? -- thanks to which, the meaning of your code can change simply by adding an #include line at the top of the file without touching anything else. Symbol hijacking galore!).


> DMD is doing something similar, but at a later stage. For example, in the following code snippet: "int a = foo;", "foo" is parsed as an identifier expression. Then the semantic analyzer figures out if "foo" is a function call or a variable.
[...]

There's no such thing as an 'identifier expression'. `foo` is parsed as an expression, period.  The parse tree is pretty straightforward.  Of course, there *is* a hack later that turns it into an implicit function call, but that's already long past the parsing stage. Unlike the C++ example above, the *structure* of the parse tree doesn't change, just the meaning of a leaf node.  You don't end up with a completely unrelated parse tree structure just because of some strange declarations elsewhere in the source code.


T

-- 
Windows 95 was a joke, and Windows 98 was the punchline.
August 13, 2019
On Tue, Aug 13, 2019 at 08:21:56AM -0700, H. S. Teoh via Digitalmars-d wrote:
> On Tue, Aug 13, 2019 at 11:19:16AM +0200, Jacob Carlborg via Digitalmars-d wrote:
[...]
> > I don't know how this is implemented in a C++ compiler but can't the lexer use a more abstract token that includes both the usage for templates and for comparison operators? The parser can then figure out exactly what it is.
> 
> It's not so simple.  The problem is that in C++, the *structure* of the parse tree changes depending on previous declarations. I.e., the lexical structure is not context-free.
[...]

Not to mention, in the more recent C++ revisions, it's not just the parse tree that changes, even the tokenization changes. I.e.:

	fun<gun<A, B>>(c, d);

can be tokenized as either:

	fun < gun < A , B >> ( c , d ) ;

(i.e., '>>' is the right-shift operator), or:

	fun < gun < A , B > > ( c , d );

(i.e., '>>' is *two* closing template argument list delimiters).


There is simply no way you can write a straightforward, context-free lexer for C++.  Such a thing simply doesn't exist, because C++ must be parsed before it can be lexed.  The lexer has to somehow know when '>>' should be lexed as two tokens, or when it should be lexed as a single token.  The only way it can know this is if the parser informs it what parse tree it's currently expecting.  But that means the parser has to be running *before* the lexer has completely lexified the input. Furthermore, how does the parser know when it's expecting a template argument list?  From my previous example, you see that even when an input statement looks like a template function call, it may not actually be one.  Which means *semantic analysis* has to have already begun (at least partially), enough to recognize certain identifiers as templates, with a feedback loop to the parser, which in turn has a feedback loop to the lexer so that it knows whether '>>' should be two tokens or one.

You can't get around this inherent complexity without becoming non-compliant with the C++ spec.

So you see, the seemingly insignificant choice of <> as template argument list delimiters has far-reaching consequences.  In retrospect, it was a bad design decision.  '<' and '>' should have been left alone as comparison operators only, not overloaded with a completely unrelated meaning that leads to all sorts of pathological ambiguities and needless parser complexity.


T

-- 
Why ask rhetorical questions? -- JC
August 13, 2019
On Tuesday, 13 August 2019 at 09:13:50 UTC, Jacob Carlborg wrote:
> On 2019-08-12 11:47, Atila Neves wrote:
>
>> [...]
>
> I suggest you look into incremental compilation, if you haven't done that already. I'm not talking about recompiling a whole file and relink. I'm talking incremental lexing, parsing, semantic analyzing and code generation. That is, recompile only those characters that have changed in a source file and what depends on it.

That's basically what I want.

August 13, 2019
On Tue, Aug 13, 2019 at 06:06:48PM +0000, Atila Neves via Digitalmars-d wrote:
> On Tuesday, 13 August 2019 at 09:13:50 UTC, Jacob Carlborg wrote:
[...]
> > I suggest you look into incremental compilation, if you haven't done that already. I'm not talking about recompiling a whole file and relink.  I'm talking incremental lexing, parsing, semantic analyzing and code generation. That is, recompile only those characters that have changed in a source file and what depends on it.
> 
> That's basically what I want.

I suppose in C++, with its overcomplex lexing/parsing, this could potentially be significant savings.  But based on what Walter has said about storing binary forms of the source code (which is basically what you'll end up doing if you really want to implement incremental compilation in the above sense), it's not much more efficient than (re)lexing and (re)parsing the entire file.

Plus, the way dmd currently works, the AST is mutated by the various semantic passes, so you can't really do something like caching the AST and rewriting subtrees of it based on what changed in the source file, either.  And even if you could, I'm not sure it will be much more efficient than just recompiling the entire file. (Assuming reasonable file sizes, that is -- obviously, if you have a 100,000-line source file, then you might want to look into refactoring that into smaller chunks, for many other reasons.)

Templates & CTFE, though, are well-known and acknowledged performance hogs.  If anything, I'd focus on improving templates and CTFE instead of worrying too much about incremental compilation in the sense Jacob describes. (Speaking of which, when is newCTFE landing in master? :-/)


T

-- 
Жил-был король когда-то, при нём блоха жила.
August 13, 2019
On Tuesday, 13 August 2019 at 15:21:56 UTC, H. S. Teoh wrote:
> On Tue, Aug 13, 2019 at 11:19:16AM +0200, Jacob Carlborg via Digitalmars-d wrote:
>> [...]
>
> It's not so simple.  The problem is that in C++, the *structure* of the parse tree changes depending on previous declarations. I.e., the lexical structure is not context-free.  For example, given this C++ code:
>
> [...]

I love the

    fon< fun< 1 >>::three >::two >::one

expression in C++ from Jens Gustedt's blog [1]

There the expression means something different in C++98 than in C++11

Let’s have a look how this expression is parsed
	
1  fon< fun< 1 >>::three >::two >::one    // in C++98
2            -----------
3       -----------------------
4  -----------------------------------
5
	
1  fon< fun< 1 >>::three >::two >::one    // in C++11
2       --------
3  ---------------------
4  ----------------------------
5  -----------------------------------


[1]: https://gustedt.wordpress.com/2013/12/18/right-angle-brackets-shifting-semantics/#more-2083
August 13, 2019
On Tue, Aug 13, 2019 at 06:51:10PM +0000, Patrick Schluter via Digitalmars-d wrote:
> On Tuesday, 13 August 2019 at 15:21:56 UTC, H. S. Teoh wrote:
[...]
> > It's not so simple.  The problem is that in C++, the *structure* of the parse tree changes depending on previous declarations. I.e., the lexical structure is not context-free.
[...]
> I love the
> 
>     fon< fun< 1 >>::three >::two >::one
> 
> expression in C++ from Jens Gustedt's blog [1]
> 
> There the expression means something different in C++98 than in C++11
> 
> Let’s have a look how this expression is parsed
> 
> 1  fon< fun< 1 >>::three >::two >::one    // in C++98
> 2            -----------
> 3       -----------------------
> 4  -----------------------------------
> 5
> 
> 1  fon< fun< 1 >>::three >::two >::one    // in C++11
> 2       --------
> 3  ---------------------
> 4  ----------------------------
> 5  -----------------------------------
> 
> 
> [1]: https://gustedt.wordpress.com/2013/12/18/right-angle-brackets-shifting-semantics/#more-2083

Yeah, it's things like this that convinced me that C++ is hopelessly and needlessly over-complex, and that it was time for me to find a better language. Like D. :-D

Using <> as delimiters for template arguments must have been one of the biggest blunders of C++, among many other things.


T

-- 
Recently, our IT department hired a bug-fix engineer. He used to work for Volkswagen.
August 14, 2019
On 2019-08-13 20:33, H. S. Teoh wrote:

> I suppose in C++, with its overcomplex lexing/parsing, this could
> potentially be significant savings.  But based on what Walter has said
> about storing binary forms of the source code (which is basically what
> you'll end up doing if you really want to implement incremental
> compilation in the above sense), it's not much more efficient than
> (re)lexing and (re)parsing the entire file.

I'm guessing the compiler would run as a daemon in the background storing everything in memory.

> Plus, the way dmd currently works, the AST is mutated by the various
> semantic passes, so you can't really do something like caching the AST
> and rewriting subtrees of it based on what changed in the source file,
> either.

Yeah, that's quite annoying, for other reasons as well.

-- 
/Jacob Carlborg
August 14, 2019
On 2019-08-13 17:21, H. S. Teoh wrote:
> On Tue, Aug 13, 2019 at 11:19:16AM +0200, Jacob Carlborg via Digitalmars-d wrote:
>> DMD is doing something similar, but at a later stage. For example, in
>> the following code snippet: "int a = foo;", "foo" is parsed as an
>> identifier expression. Then the semantic analyzer figures out if "foo"
>> is a function call or a variable.
> [...]
> 
> There's no such thing as an 'identifier expression'. `foo` is parsed as
> an expression, period.

I'm not sure if you're still talking about DMD here, but there's definitely an "identifier expression", it's right here [1]. It's created in the parser here [2]. Then during the semantic phase it's turned into something else.

[1] https://github.com/dlang/dmd/blob/b2522da8566783491648bc104a29b42dc2dc569e/src/dmd/expression.d#L2056

[2] https://github.com/dlang/dmd/blob/b2522da8566783491648bc104a29b42dc2dc569e/src/dmd/parse.d#L7633

-- 
/Jacob Carlborg
August 14, 2019
On Wed, Aug 14, 2019 at 12:40:00PM +0200, Jacob Carlborg via Digitalmars-d wrote:
> On 2019-08-13 17:21, H. S. Teoh wrote:
> > On Tue, Aug 13, 2019 at 11:19:16AM +0200, Jacob Carlborg via Digitalmars-d wrote:
> > > DMD is doing something similar, but at a later stage. For example, in the following code snippet: "int a = foo;", "foo" is parsed as an identifier expression. Then the semantic analyzer figures out if "foo" is a function call or a variable.
> > [...]
> > 
> > There's no such thing as an 'identifier expression'. `foo` is parsed as an expression, period.
> 
> I'm not sure if you're still talking about DMD here, but there's definitely an "identifier expression", it's right here [1]. It's created in the parser here [2]. Then during the semantic phase it's turned into something else.
[...]

Whoa.  OK, I stand corrected.  Didn't know that was how DMD did it! That's pretty ... weird.  And explains some of the quirkiness around certain bits of D syntax, like some of the ugly corner cases of `alias`.


T

-- 
It always amuses me that Windows has a Safe Mode during bootup. Does that mean that Windows is normally unsafe?
August 14, 2019
On Wed, Aug 14, 2019 at 12:32:11PM +0200, Jacob Carlborg via Digitalmars-d wrote:
> On 2019-08-13 20:33, H. S. Teoh wrote:
> 
> > I suppose in C++, with its overcomplex lexing/parsing, this could potentially be significant savings.  But based on what Walter has said about storing binary forms of the source code (which is basically what you'll end up doing if you really want to implement incremental compilation in the above sense), it's not much more efficient than (re)lexing and (re)parsing the entire file.
> 
> I'm guessing the compiler would run as a daemon in the background storing everything in memory.
[...]

In theory, I suppose it could work.  But again, it comes back to this point of whether it's worth the trouble to maintain the necessary data structures that allows you to reliably map sections of a modified source file to the AST.  Some parts of the AST aren't 100% context-free. Think, for example, of static ifs or static foreach that generate parts of the AST based on some CTFE computation taking as input some enums defined elsewhere in the file, or in another module.  You'll need some pretty complex data structures to keep track of which part(s) of the AST depends on which other part(s), and you'll need some clever algorithms to update everything consistently when you detect that one of the parts of the source file corresponding to these AST subtrees is modified.

I suspect the complexity required to maintain consistency and update the AST will be not much better than the cost of a straightforward re-parse of the entire source file and rebuild of the AST from scratch.  It probably only starts to be superior when your source file becomes unreasonably large, in which case you should be splitting it up into more manageable modules anyway.

But then again, this is all in the hypothetical.  Somebody should run a proof-of-concept experiment and profile it to compare the actual performance characteristics of such a scheme.


T

-- 
If you want to solve a problem, you need to address its root cause, not just its symptoms. Otherwise it's like treating cancer with Tylenol...