xdc: A hypothetical D cross-compiler and AST manipulation tool.
Chad Joan
July 18, 2013
I'd like to present my vision for a new D compiler.  I call it xdc, a loose abbreviation for "Cross D Compiler" (if confused, see http://english.stackexchange.com/questions/37394/why-do-some-words-have-x-as-a-substitute).  It could also mean other fun things like "Crossbone D Compiler" (I imagine a logo with some crossbones having a metal D atop where the skull normally goes), "eXperimental D Compiler", or something sounding like "ecstasy" ;)

We usually think of language features as being rewritten into simpler features.  The simple features eventually get rewritten into machine instructions.  Compilers are, fundamentally, responsible for performing "lowering" operations.

It makes sense to me, then, to make a compiler whose internals /look/ like a bunch of these rewrites and lowering operations.  There should be a bunch of input patterns matched to the desired results.  This has the happy side-effect of giving us a pleasant way to do AST manipulations from within D code.

I've also had a long-standing desire to see D on many more platforms.  It should make an appearance on console platforms and on smartphones.  I've tried doing this with a retargetable compiler like GCC before, and the work was surprisingly large.  Even if the compiler already emits code for the target system's CPU, there are still a large number of details involving calling conventions, executable format, and any number of CPU/OS-specific tweaks to object file output.  It makes a lot more sense to me to just output C/C++ code and feed that to a native toolchain.  That would skip a lot of the platform-specific nonsense that creates a barrier to entry for people who, say, just want to write a simple game for Android/iPhone/PS(3|4)/etc in D, and don't want to become compiler experts first.  Ideally, some day, this compiler would also emit code or bytecode for JavaScript, AS3/Flash, Java, and any other popular targets that the market desires.  This can probably be done with DMD, but I'd like to make the process more approachable, and make backend authoring as easy as possible.  It should be possible (and easy) to tell the compiler exactly what lowerings should be applied before the AST hits the backend.

xdc should bring all of that cross-platform targeting together with a compiler infrastructure that can blow everything else away (I hope!).

xdc is my dream for a D compiler that gives us a first step (of a few) towards having what haXe already has (http://haxe.org/): a compiler that can land code just about anywhere.

What follows is a collection of my thoughts written down as notes.

== Ideal Outcomes ==

.- D to C/C++ compiler that can easily reach target platforms that are
.    currently either unsupported or poorly supported by current D
.    compilers.
.   - Useful for game developers wishing to write D code on the
.       next big console platform.
.   - Useful for embedded systems developers wishing to write D code
.       on obscure or potentially proprietary microcontrollers.

.- Other backends (ex: llvm, Java bytecode, AS3/Flash bytecode, etc)
.    possible in the future.  Community desires considered when
.    selecting new backend targets.

.- Interpreter backend: a notable backend that would be implemented as
.    a necessity for making CTFE work.  A dedicated interpreter
.    backend would hopefully be much faster and easier on memory than
.    DMD's souped-up constant folder.  (Even though DMD has become
.    considerably better at this in the past year or two.)

.- Abstract Syntax Tree (AST) manipulation toolchain, possibly usable
.    in CTFE.  It would work like this:
.    (1) Pass some D code or Domain Specific Language (DSL) of your
.          choice (as text) into xdc at compile-time.
.    (2) xdc returns an AST.
.    (3) Use xdc's pattern-matching and substitution DSL to
.          manipulate the AST.
.    (4) xdc consumes the AST and emits modified D code.
.    (5) mixin(...) the result.
.   - If xdc is the compiler used to execute itself in CTFE, then
.       it might be possible to optimize this by having it expose
.       itself as a set of intrinsics.

.- Reference counting available by default on all platforms.
.   - Gets you into the action with minimal effort and little or no
.       compiler hacking. (More complete GC tends to require platform
.       specific ASM and/or operating system API support).

.- Full garbage collection available if supported.
.   - Ex: The C backends would default to ref-counting until the ASM
.       and OS-level code is written to support full GC.
.   - Ex: A Java backend would probably use the Java JVM by default.

.- Threading model determined by compiler configuration or available
.    platform hints.
.   - Ex: The user may have a posix-threads implementation available,
.       but know little other details about the target system.  It
.       should be possible for xdc to use pthreads to emulate the
.       TLS and synchronization mechanisms needed to make D tick.
.       (Or at least emulate as many features as possible.)
.   - Ex: Possible "no threading" target for people who don't need
.       threading but DO need other D features to be available NOW
.       on an alien platform.  Errors when the source code passed
.       into xdc assumes that threading features are present.

.- D compiler that is easy to hack on.
.   - "Looks like the problem it solves."
.       (To quote Walter's DConf2013 keynote.)
.   - Made of a bunch of patterns that describe
.       code rewrites/lowerings.
.   - Few or no null value checks necessary.
.      - null checks don't look like pattern matching or lowering.
.   - Few or no convoluted if-while-if-for-etc nests.
.      - These also don't look like pattern matching or lowering.
.   - It should be largely made of "pattern handlers" (see below).
.   - Each pattern handler will have one part that closely resembles
.       the AST fragment for the D code that it recognizes, and
.       another part that resembles the lowered form that it outputs.
.   - Dependency analysis that prevents your AST manipulation from
.       happening either too early or too late.
.   - Because the code that actually does lowerings is generated from
.       a DSL, it is possible to make it automate a lot of tedious
.       tasks, like updating the symbol table when nodes are added or
.       removed from the AST.
.   - This makes it easier to test experimental features.

.- A step-by-step view of what the compiler is doing to your code.
.   - Since the semantic analysis of xdc would be composed of
.      "pattern handlers" (see below), then each time one of them
.      completes the compiler could output the result of calling
.      .toString() (or .toDCode() or whatever) on the entire AST.
.   - This could be attached to an ncurses interface that would be
.      activated by passing a flag to the compiler, which would then
.      proceed to show the AST at every stage of compilation.
.      Press ENTER to see the next step, etc.
.   - This could also be exposed as API functionality that IDEs could
.      use to show developers how the compiler sees their code.

.- D code analysis engine that might be usable to automatically
.    translate D1 code into D2 code, or maybe D2 into D3 in the far
.    future.

== Architectural Overview ==

.- xdc will largely consist of "pattern handlers" that recognize
.    patterns in its AST and replace them with AST fragments that
.    contain successively fewer high-level features (lowering).
.   - These pattern handlers would feature a DSL that should make
.       the whole task fairly easy.
.   - The DSL proposed would be similar to regular expressions in
.       semantics but different in syntax.
.      - It will have operators for choice, repetition, optional
.          matches, capturing, and so on.
.      - The DSL must support nested structures well.
.      - The DSL must support vertical layout of patterns well.
.      - Because of the vertical patterns, most operators will either
.          be prefix or will be written in block style:
.          some_block_header { block_stmt1; block_stmt2; etc; }
.      - Actions like entering and leaving nodes are given their own
.          special syntax.  The machine will treat them like tokens
.          that can be matched the same as any AST node.  Notably,
.          node-entry and node-exit do not require introducing
.          non-regular elements to the DSL.  node-entry and node-exit
.          may be subsumed into Deterministic Finite Automatons (DFAs).
.   - An example pattern handler might look like this:

const lowerWhileStatement =
{
	// Apologies in advance if this isn't actually valid D code:
	//   This is a design sketch and I currently don't have a way to compile it.
	//
	// The Pattern template, PatternMatch template, and PatternHandler class
	//   have not yet been written.  This is an example of how I might expect
	//   them to be used.
	//

	auto consumes = ["while_statement"];
	auto produces = ["if_statement", "goto", "label"];
	
	auto recognizer = Pattern!
		"WhileStatement has
		{
			// Capture the conditional expression (call it \"expr\") and
			//   capture the loop body (call it \"statement\").
			.expression $expr;
			.statement  $statement has
			{
				// Capture any continue/break statements.
				any_amount_of {
					any_amount_of .; // Same as .* in regexes.
					one_of
					{
						ContinueStatement $continues;
						BreakStatement    $breaks;
					}
				}
				any_amount_of .;
			}
		}";
	
	auto action = (PatternMatch!(recognizer) m)
	{
		m.captures.add("uniqLoopAgain", getUniqueLabel(syntaxNode.enclosingScope));
		m.captures.add("uniqExitLoop", getUniqueLabel(syntaxNode.enclosingScope));
		
		// The "recognizer" clause defines m.getCapture!"continues" with:
		//   "ContinueStatement $continues;"
		// That line appears in a repetition context ("any_amount_of") and is
		//   therefore typed as an array.
		foreach( ref node; m.getCapture!"continues" )
			node.replaceWith( m, "GotoStatement has $uniqLoopAgain" );
		
		// Ditto for m.getCapture!"breaks" and "BreakStatement $breaks;".
		foreach( ref node; m.getCapture!"breaks" )
			node.replaceWith( m, "GotoStatement has $uniqExitLoop" );
	};
	
	auto synthesizer = Pattern!
		"Label has $uniqLoopAgain
		IfStatement has
		{
			OpNegate has $expr
			GotoStatement has $uniqExitLoop
		}
		$statement
		GotoStatement has $uniqLoopAgain
		Label has $uniqExitLoop
		";

	return new PatternHandler(consumes, produces, recognizer, action, synthesizer);
};

(Also available at: http://pastebin.com/0mBQxhLs )
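
For readers who want to see the intended semantics without the (not-yet-written) Pattern/PatternHandler machinery, here is a toy sketch in Python of what the action and synthesizer above would do.  Every name here is invented for illustration; it is not an xdc API, just the while-to-label/if/goto rewrite acted out on tuples.

```python
# Toy AST nodes as tuples; all names invented for illustration.
counter = 0
def unique_label(prefix):
    # Stand-in for getUniqueLabel(): returns a fresh label name.
    global counter
    counter += 1
    return f"{prefix}_{counter}"

def lower_while(node):
    """Rewrite ('while', cond, body) into the label/if/goto sequence
    that the 'synthesizer' pattern above describes."""
    assert node[0] == "while"
    _, cond, body = node
    again = unique_label("uniqLoopAgain")
    done  = unique_label("uniqExitLoop")
    # The 'action' step: rewrite captured break/continue into gotos.
    def rewrite(stmt):
        if stmt == ("continue",): return ("goto", again)
        if stmt == ("break",):    return ("goto", done)
        return stmt
    body = [rewrite(s) for s in body]
    # The 'synthesizer' step: emit the lowered statement sequence.
    return [
        ("label", again),
        ("if", ("not", cond), [("goto", done)]),
        *body,
        ("goto", again),
        ("label", done),
    ]

lowered = lower_while(("while", ("var", "x"), [("call", "f"), ("break",)]))
```

The output is the familiar loop skeleton: loop-again label, negated conditional exit, the body with its break turned into a goto, a back-edge goto, and the exit label.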

.- Dispatch to pattern handlers is performed by the execution of a
.    DFA/Packrat hybrid instead of the traditional OOP inheritance
.    with method calls.
.   - Each pattern handler's recognizer gets treated like a regex
.       or Parsing Expression Grammar (PEG) fragment.
.   - All of the recognizers in the same semantic pass are pasted
.       together in an ordered-choice expression.  The ordering is
.       determined by dependency analysis.
.   - A recognizer's pattern handler is invoked when the recognizer's
.       AST expression is matched.
.   - Once any substitutions are completed, then the machine executing
.       the pattern engine will set its cursor to the beginning of
.       the newly substituted AST nodes and continue running.
.   - Executing potentially hundreds of pattern handlers in a single
.       ordered-choice expression would be obnoxious for a packrat
.       parser (packrat machine?).  Thankfully, ordered-choice is
.       possible in regular grammars, so it can be lowered into regex
.       operations and the whole thing turned into a DFA.
.   - If pattern recognizers end up needing recursive elements,
.       then they will probably not appear at the very beginning of
.       the pattern.  Patterns with enough regular elements at the
.       start will be able to merge those regular elements into the
.       DFA with the rest of the pattern recognizers, and it all
.       becomes very fast table lookups in small tables.
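
The dispatch described above can be sketched as a rewrite-to-fixpoint driver: try each recognizer in order at the cursor (ordered choice), apply the first handler that matches, and restart the cursor at the substituted nodes.  A minimal Python illustration, with invented handler shapes standing in for compiled recognizers:

```python
def run_pass(nodes, handlers):
    """handlers: ordered list of (recognize, rewrite) pairs.
    recognize(nodes, i) returns the match length at position i, or 0.
    After a substitution the cursor is reset to the start of the newly
    substituted nodes, as described above.  Handlers must lower (their
    output must not re-match), or this loop would not terminate."""
    i = 0
    while i < len(nodes):
        for recognize, rewrite in handlers:      # ordered choice
            n = recognize(nodes, i)
            if n:
                nodes[i:i+n] = rewrite(nodes[i:i+n])
                break                            # cursor stays at i
        else:
            i += 1                               # no handler matched
    return nodes

# Example handler: lower ("inc", x) into ("add", x, 1).
handlers = [
    (lambda ns, i: 1 if ns[i][0] == "inc" else 0,
     lambda matched: [("add", matched[0][1], 1)]),
]
out = run_pass([("inc", "x"), ("nop",), ("inc", "y")], handlers)
```

A real implementation would compile the ordered choice of recognizers into one DFA rather than looping over them, but the control flow is the same.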

.- This compiler would involve the creation of a parser-generator
.    API that allows code to programmatically create grammars, and
.    to do so without a bunch of clumsy string formatting and string
.    concatenation.
.   - These grammars could be written such that things like AST nodes
.       are seen as terminals.  This expands possibilities and allows
.       all of the pattern handlers to be coalesced into a grammar
.       that operates on ASTs and fires off semantic actions whenever
.       one of the recognizer patterns gets tickled by the right AST
.       fragment.
.   - Using strings as terminals is still cool; and necessary for
.       xdc's text/D-code parser.
.   - A simple parser-generator API example:

---------------------------------------
string makeParser()
{
	auto builder = new ParserBuilder!char;
	builder.pushSequence();
		builder.literal('x');
		builder.pushMaybe();
			builder.literal('y');
		builder.pop();
	builder.pop();
	return builder.toDCode("callMe");
}

const foo = makeParser();

pragma(msg, foo);
---------------------------------------
Current output:
http://pastebin.com/V3E0Ubbc
---------------------------------------

.   - Humans would probably never directly write grammars using this
.       API; it is intended for use by code that needs to write
.       grammars.  xdc would be such code: it's given a bunch of
.       pattern handlers and needs to turn them into a grammar.
.   - This API could also make it easier to write the parser
.       generators that humans /would/ use. For example, it could be
.       used as an experimental backend for a regular expression
.       engine that can handle limited recursion.
.   - The packrats usually generated from PEGs are nice and all, but
.       I'd really like to generate DFAs whenever possible, because
.       those seem to be regarded as being /very fast/.
.   - DFAs can't handle the recursive elements of PEGs, but they
.       should be able to handle everything non-recursive that
.       precedes or follows the recursive elements.
.   - The parser-generator API would be responsible for aggressively
.       converting PEG-like elements into regex/DFA elements whenever
.       possible.
.   - Regular expressions can be embedded in PEGs as long as you tell
.       them how much text to match.  You have to give them concrete
.       success/failure conditions that can be determined without
.       help from the rest of the PEG: things like "match as many
.       characters as possible" or "match as few characters as
.       possible".  Without that, the regex's backtracking (DFA'd
.       or otherwise) won't mesh with the PEG.  Give it a concrete
.       win/fail condition, however, and the embedded regex becomes
.       just another PEG building block that chews through some
.       source material and yields a yes/no result.  Such regular
.       expressions allow DFAs to be introduced into a recursive
.       descent or packrat parser.
.   - Many PEG elements can be converted into these well-behaved
.       regular expressions.
.      - PEG repetition is just regular expression repetition with
.          a wrapper around it that says "match as many characters
.          as possible".
.      - PEG ordered choice can be lowered into regular expression
.          unordered choice, which can then be converted into DFAs:
.          I suspect that this is true: (uv/xy)c == (uv|(^(uv)&xy))c
.          (or, by De Morgan's law: (uv/xy)c == (uv|(^(uv|^(xy))))c )
.          & is intersection.
.          ^ is negation.
.          Each letter (u,v,x,y,c) can be a subexpression
.            (non-recursive).
.      - PEG label matching can be inlined up to the point where
.          recursion occurs, thus allowing more elements to be
.          considered for DFA conversion.
.      - etc.
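
The "concrete win/fail condition" idea can be illustrated with Python's re module, whose anchored match() already means "match as many characters as possible at this position".  Wrapped that way, a regex becomes an ordinary possessive PEG terminal (the helper names are invented for this sketch):

```python
import re

def regex_terminal(pattern):
    """Wrap a regex as a PEG building block: at position pos it either
    consumes a definite number of characters (greedy, never revisited
    by later backtracking) or fails.  Returns new position or None."""
    compiled = re.compile(pattern)
    def parse(text, pos):
        m = compiled.match(text, pos)   # anchored at pos, longest match
        return m.end() if m else None
    return parse

def sequence(*parts):
    # Possessive PEG sequence: every part must consume in turn.
    def parse(text, pos):
        for p in parts:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

# Digits, then a unit suffix; the digits terminal is DFA-friendly.
number = sequence(regex_terminal(r"[0-9]+"), regex_terminal(r"(px|em)"))
```

Note the caveat from above still applies: because the terminal commits to its longest match, a pattern like `[0-9a-z]+` followed by `px` would over-consume and fail on "42px"; the embedded regexes have to be chosen so that greedy matching is the right answer.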

.- The parser would be defined using a PEG (most likely using Pegged
.    specifically).
.   - Although Pegged is an awesome achievement, I suspect its output
.       could be improved considerably.  The templated code it
.       generates is slow to compile and ALWAYS allocates parse
.       tree nodes at every symbol.
.   - I want to experiment with making Pegged (or a branch of it) emit
.       DFA/Packrat parser hybrids.  This could be done by making a
.       version of Pegged that uses the aforementioned
.       parser-generator API to create its parsers.
.   - Design principle:  avoid memory allocations like the plague.
.       The output should be a well-pruned AST, and not just a parse
.       tree that causes a bunch of allocations and needs massaging to
.       become useful.
.   - I really like Pegged and would contribute this stuff upward, if
.       accepted.

.- When hacking on xdc, you don't need to be aware of WHEN your
.    code gets executed in semantic analysis.  The dependency analysis
.    will guarantee that it always gets performed both
.    (a) when it's needed, and (b) when it has what it needs.
.   - This is what the "consumes" and "produces" variables are all
.       about in the above example.

.- Successfully lowering a D AST into the target backend's input will
.    almost certainly require multiple passes.  xdc's dependency
.    analyzer would automatically minimize the number of passes by
.    looking for patterns that are "siblings" in the dependency graph
.    (eg. neither depends on the other) and bunching as many such
.    patterns as possible into each pass.
.   - It really shouldn't generate very many more than the number of
.       passes that DMD has coded into it.  Ideally: no more than DMD,
.       if not fewer.
.   - I'd like to make the dependency analyzer output a graph that
.       can be used to track which patterns cause which passes to
.       exist, and show which patterns are in which passes.
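
The pass-minimization idea can be sketched as a topological leveling of the consumes/produces graph: a handler is ready once nothing it consumes can still be produced by another remaining handler, and all ready handlers ("siblings") share a pass.  A small Python illustration; the three handlers and their feature names are invented:

```python
def schedule_passes(handlers):
    """handlers: dict of name -> (consumes, produces), both sets of
    feature names.  Handler B must run after handler A when A produces
    a feature that B consumes.  Returns a list of passes, bunching as
    many independent handlers as possible into each one."""
    remaining = dict(handlers)
    passes = []
    while remaining:
        ready = []
        for name, (consumes, _) in remaining.items():
            # Features still producible by *other* remaining handlers.
            deps = set()
            for other, (_, produces) in remaining.items():
                if other != name:
                    deps |= produces
            if not (consumes & deps):
                ready.append(name)
        if not ready:
            raise ValueError("cyclic handler dependencies")
        passes.append(sorted(ready))
        for name in ready:
            del remaining[name]
    return passes

handlers = {
    "lower_foreach": ({"foreach_statement"}, {"for_statement"}),
    "lower_for":     ({"for_statement"},     {"if_statement", "goto", "label"}),
    "lower_while":   ({"while_statement"},   {"if_statement", "goto", "label"}),
}
passes = schedule_passes(handlers)
```

Here lower_foreach and lower_while are siblings and share pass 1, while lower_for must wait for pass 2 because lower_foreach can still produce for_statement nodes.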

.- Planned availability of backends.
.   - My first pick for a backend would be an ANSI C89 target.  I feel
.       that this would give it the most reach.
.   - The interpreter backend is along for the ride, as mentioned.
.   - Because the semantic analysis is composed of distinct and
.       loosely-coupled patterns, it is possible for xdc to generate
.       an analysis chain with the minimum number of lowerings needed
.       for a given backend.
.      - The interpreter backend would benefit from having the most
.          lowerings.  By requiring a lot of lowering, the interpreter
.          would only need to support a small number of constructs:
.         - if statements
.         - gotos
.         - function calls
.         - arithmetic expression evaluation
.         - builtin types (byte, short, int, long, float, double, etc)
.         - pointers
.         - Even structs are unnecessary: they can be seen as
.             typed dereferencing of untyped pointers.
.      - The C backend would benefit from slightly less lowering than
.         the interpreter backend.  It is useful for debugging if
.         you can mostly-sorta read the resulting C code, and your
.         C compiler will appreciate the extra optimization
.         opportunities.
.         - Looping constructs like while and for are welcome here.
.         - structs would be more readable.
.      - In the future, a Java or C# backend might use an entirely
.          different set of lowerings in later passes.
.         - Pointers are no longer considered "low".
.         - Classes should be kept as long as possible;
.             I'm pretty sure the bytecode (at least for Java)
.             has opcodes dedicated to classes.  Removing them
.             may cause pessimisation.
.      - The backend writer should not have to worry about rewriting
.          the semantic analysis to suit their needs.  They just define
.          some features and say which ones they need available in the
.          AST, and xdc's semantic-analysis-generator will handle the
.          rest.
.   - Notably, a backend should just be more lowerings, with the
.       result being text or binary code instead of AST nodes.
.      - Backends are essentially defined by the set of AST/language
.          features that they consume and any special lowerings needed
.          to convert generic AST/language features into
.          backend-specific AST/language features.
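
The "structs are just typed dereferencing of untyped pointers" lowering mentioned above can be acted out with Python's struct module standing in for raw memory: after lowering, a field access is nothing but offset arithmetic plus a typed read.  The Point layout is invented for illustration:

```python
import struct

# A hypothetical D struct:  struct Point { int x; int y; }
# After lowering, 'p.y' becomes: read 4 bytes at (base + 4) as an int.
POINT_LAYOUT = {"x": (0, "<i"), "y": (4, "<i")}   # field -> (offset, format)

def field_read(memory, base, field):
    # Typed dereference of an untyped buffer at a computed offset.
    offset, fmt = POINT_LAYOUT[field]
    return struct.unpack_from(fmt, memory, base + offset)[0]

memory = bytearray(16)                  # untyped "heap"
struct.pack_into("<i", memory, 0, 10)   # p.x = 10
struct.pack_into("<i", memory, 4, 20)   # p.y = 20
```

This is exactly the form an interpreter backend would need, whereas the C backend would keep the struct declaration for readability.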


== Closing Thoughts ==

I am realizing that there are multiple reasons that compel me to write this document:
- To share my ideas with others, on the off-chance that someone else might see this vision too and be better equipped to deliver.
- To suggest capabilities that any community-endorsed compiler tool (ex: compiler-as-a-ctfe-library) should have.
- To see if I might be able to get the help I need to make it a reality.

I just can't decide which reasons are more important.  But there is a common thread: I want this vision to become reality and do really cool things while filling a bunch of missing links in D's ecosystem.

I have to ask:

Would you pay for this?
If so, then I might be able to do a kickstarter at some point.
I am not independently wealthy or retired (or both?) like Walter, nor am I able to survive on zero hours of sleep each night like Andrei, and this would be a big project.  I think it would need full-time attention or it would never become useful in a reasonable timeframe.

Also, assuming you understand the design, are there any gaping holes in this?
This is my first attempt to share these ideas with a larger group, and thus an opportunity to anticipate troubles.

...

Well, I'm anxious to see how well the venerable D community receives this bundle of ideas.  Be chatty.  I'll try to keep up.

Thank you for reading.
July 18, 2013
Crud, I miscalculated the line wrap on the web reader a lot.  Sorry about that.
July 18, 2013
For the web forum interface users, I have re-wrapped the text in my outline.  Hopefully this will look better!  If not, please try this pastebin version: http://pastebin.com/Twc9ZUnQ

I'd like to present my vision for a new D compiler.  I call it xdc, a loose abbreviation for "Cross D Compiler" (if confused, see http://english.stackexchange.com/questions/37394/why-do-some-words-have-x-as-a-substitute).  It could also mean other fun things like "Crossbone D Compiler" (I imagine a logo with some crossbones having a metal D atop where the skull normally goes), "eXperimental D Compiler", or something sounding like "ectasy" ;)

We usually think of language features as being rewritten into simpler features.  The simple features eventually get rewritten into machine instructions.  Compilers are, fundamentally, responsible for performing "lowering" operations.

It makes sense to me, then, to make a compiler whose internals /look/ like a bunch of these rewrites and lowering operations.  There should be a bunch of input patterns matched to the desired results.  This has the happy side-effect of giving us a pleasant way to do AST manipulations from within D code.

I've also had a long-standing desire to see D on many more platforms.  It should make an appearance on console platforms and on smartphones.  I've tried doing this with a retargetable compiler like GCC before, and the work was surprisingly large.  Even if the compiler already emits code for the target system's CPU, there are still a large number of details involving calling conventions, executable format, and any number of CPU/OS specific tweaks to object file output.  It makes a lot more sense to me to just output C/C++ code and feed that to a native toolchain.  That would skip a lot of the platform-specific nonsense that creates a barrier to entry for people who, say, just want to write a simple game for Android/iPhone/PS(3|4)/etc in D, and don't want to become compiler experts first.  Ideally, some day, this compiler would also emit code or bytecode for Javascript, AS3/Flash, Java, and any other popular targets that the market desires.  This can probably be done with DMD, but I'd like to make the process more approachable, and make backend authoring as easy as possible.  It should be possible (and easy) to tell the compiler exactly what lowerings should be applied before the AST hits the backend.

xdc should bring all of that cross-platform targeting together with a compiler infrastructure that can blow everything else away (I hope!).

xdc is my dream for a D compiler that gives us our first step (of few) towards having what haXe has already (http://haxe.org/) : a compiler that can land code just about anywhere.

What follows is a collection of my thoughts written down as notes.

== Ideal Outcomes ==

.- D to C/C++ compiler that can easily reach target platforms that
.    are currently either unsupported or poorly supported by
.    current D compilers.
.   - Useful for game developers wishing to write D code on the
.       next big console platform.
.   - Useful for embedded systems developers wishing to write
.       D code on obscure or potentially proprietary
.       microcontrollers.

.- Other backends (ex: llvm, Java bytecode, AS3/Flash bytecode,
.    etc) possible in the future.  Community desires considered
.    when selecting new backend targets.

.- Interpreter backend: a notable backend that would be
.    implemented as a necessity for making CTFE work.  A dedicated
.    interpreter backend would hopefully be much faster and easier
.    on memory than DMD's souped-up constant folder.  (Even though
.    DMD has become considerably better at this in the past year
.    or two.)

.- Abstract Syntax Tree (AST) manipulation toolchain, possibly
.    usable in CTFE.  It would work like this:
.    (1) Pass some D code or Domain Specific Language (DSL) of
.          your choice (as text) into xdc at compile-time.
.    (2) xdc returns an AST.
.    (3) Use xdc's pattern-matching and substitution DSL to
.          manipulate the AST.
.    (4) xdc consumes the AST and emits modified D code.
.    (5) mixin(...) the result.
.   - If xdc is the compiler used to execute itself in CTFE, then
.       it might be possible to optimize this by having it expose
.       itself as a set of intrinsics.

.- Reference counting available by default on all platforms.
.   - Gets you into the action with minimal effort and little or
.       no compiler hacking. (More complete GC tends to require
.       platform specific ASM and/or operating system API
.       support).

.- Full garbage collection available if supported.
.   - Ex: The C backends would default to ref-counting until the
.       ASM and OS-level code is written to support full GC.
.   - Ex: A Java backend would probably use the Java JVM by
.       default.

.- Threading model determined by compiler configuration or
.    available platform hints.
.   - Ex: The user may have a posix-threads implementation
.       available, but know little other details about the target
.       system.  It should be possible for xdc to use pthreads to
.       emulate the TLS and synchronization mechanisms needed to
.       make D tick.  (Or at least emulate as many features as
.       possible.)
.   - Ex: Possible "no threading" target for people who don't need
.       threading but DO need other D features to be available NOW
.       on an alien platform.  Errors when the source code passed
.       into xdc assumes that threading features are present.

.- D compiler that is easy to hack on.
.   - "Looks like the problem it solves."
.       (To quote Walter's DConf2013 keynote.)
.   - Made of a bunch of patterns that describe
.       code rewrites/lowerings.
.   - Few or no null value checks necessary.
.      - null checks don't look like pattern matching or lowering.
.   - Few or no convoluted if-while-if-for-etc nests.
.      - These also don't look like pattern matching or lowering.
.   - It should be largely made of "pattern handlers" (see below).
.   - Each pattern handler will have one part that closely
.       resembles the AST fragment for the D code that it
.       recognizes, and another part that resembles the lowered
.       form that it outputs.
.   - Dependency analysis that prevents your AST manipulation from
.       happening either too early or too late.
.   - Because the code that actually does lowerings is generated
.       from a DSL, it is possible to make it automate a lot of
.       tedious tasks, like updating the symbol table when nodes
.       are added or removed from the AST.
.   - This makes it easier to test experimental features.

.- A step-by-step view of what the compiler is doing to your code.
.   - Since the semantic analysis of xdc would be composed of
.      "pattern handlers" (see below), then each time one of them
.      completes the compiler could output the result of calling
.      .toString() (or .toDCode() or whatever) on the entire AST.
.   - This could be attached to an ncurses interface that would be
.      activated by passing a flag to the compiler, which would
.      then proceed to show the AST at every stage of compilation.
.      Press ENTER to see the next step, etc.
.   - This could also be exposed as API functionality that IDEs
.      could use to show developers how the compiler sees their
.      code.

.- D code analysis engine that might be usable to automatically
.    translate D1 code into D2 code, or maybe D2 into D3 in the
.    far future.

== Architectural Overview ==

.- xdc will largely consist of "pattern handlers" that recognize
.    patterns in its AST and replace them with AST fragments that
.    contain successively fewer high-level features (lowering).
.   - These pattern handlers would feature a DSL that should make
.       the whole task fairly easy.
.   - The DSL proposed would be similar to regular expressions in
.       semantics but different in syntax.
.      - It will have operators for choice, repetition, optional
.          matches, capturing, and so on.
.      - The DSL must support nested structures well.
.      - The DSL must support vertical layout of patterns well.
.      - Because of the vertical patterns, most operators will
.          either be prefix or will be written in block style:
.          some_block_header { block_stmt1; block_stmt2; etc; }
.      - Actions like entering and leaving nodes are given their
.          own special syntax.  The machine will treat them like
.          tokens that can be matched the same as any AST node.
.          Notably, node-entry and node-exit do not require
.          introducing non-regular elements to the DSL.
.          node-entry and node-exit may be subsumed into
.          Deterministic Finite Automatons (DFAs).
.   - An example pattern handler might look like this:

const lowerWhileStatement =
{
	// Apologies in advance if this isn't actually valid D code:
	//   This is a design sketch and I currently don't have a way to compile it.
	//
	// The Pattern template, PatternMatch template, and PatternHandler class
	//   have not yet been written.  This is an example of how I might expect
	//   them to be used.
	//

	auto consumes = "while_statement";
	auto produces = ["if_statement", "goto", "label"];
	
	auto recognizer = Pattern!
		"WhileStatement has
		{
			// Capture the conditional expression (call it \"expr\") and
			//   capture the loop body (call it \"statement\").
			.expression $expr;
			.statement  $statement has
			{
				// Capture any continue/break statements.
				any_amount_of {
					any_amount_of .; // Same as .* in regexes.
					one_of
					{
						ContinueStatement $continues;
						BreakStatement    $breaks;
					}
				}
				any_amount_of .;
			}
		}";
	
	auto action = (PatternMatch!(recognizer) m)
	{
		m.captures.add("uniqLoopAgain", getUniqueLabel(m.syntaxNode.enclosingScope));
		m.captures.add("uniqExitLoop", getUniqueLabel(m.syntaxNode.enclosingScope));
		
		// The recognizer defines m.getCapture!"continues" with:
		//   "ContinueStatement $continues;"
		// That line appears in a repetition context ("any_amount_of") and is
		//   therefore typed as an array.
		foreach( ref node; m.getCapture!"continues" )
			node.replaceWith( m, "GotoStatement has $uniqLoopAgain" );
		
		// Ditto for m.getCapture!"breaks" and "BreakStatement $breaks;".
		foreach( ref node; m.getCapture!"breaks" )
			node.replaceWith( m, "GotoStatement has $uniqExitLoop" );
	};
	
	auto synthesizer = Pattern!
		"Label has $uniqLoopAgain
		IfStatement has
		{
			OpNegate has $expr
			GotoStatement has $uniqExitLoop
		}
		$statement
		GotoStatement has $uniqLoopAgain
		Label has $uniqExitLoop
		";

	return new PatternHandler(produces, consumes, recognizer, action, synthesizer);
};

(Also available at: http://pastebin.com/0mBQxhLs )
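The rewrite the synthesizer describes can also be sketched concretely.  Below is a toy Python model of the while -> if/goto lowering; the tuple node shapes, lower_while(), and unique_label() are all invented for illustration and are not xdc's API:

```python
# Toy model of the while -> if/goto lowering described by the
# synthesizer pattern above.  AST nodes are plain tuples.

_counter = 0

def unique_label(prefix):
    """Stand-in for getUniqueLabel(): return a fresh label name."""
    global _counter
    _counter += 1
    return "%s_%d" % (prefix, _counter)

def lower_while(expr, body):
    """Rewrite `while (expr) body` into a loop label, a negated if,
    gotos, and an exit label, mirroring the synthesizer pattern."""
    again = unique_label("uniqLoopAgain")
    exit_ = unique_label("uniqExitLoop")

    # The action's job: captured continue/break statements
    # become gotos to the fresh labels.
    def rewrite(stmt):
        if stmt == ("continue",):
            return ("goto", again)
        if stmt == ("break",):
            return ("goto", exit_)
        return stmt

    return ([("label", again),
             ("if", ("negate", expr), ("goto", exit_))]
            + [rewrite(s) for s in body]
            + [("goto", again),
               ("label", exit_)])

result = lower_while(("var", "x"), [("call", "f"), ("break",)])
```

A real pattern engine would of course produce AST nodes rather than tuples, but the label plumbing is the same.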

.- Dispatch to pattern handlers is performed by the execution of a
.    DFA/Packrat hybrid instead of the traditional OOP inheritance
.    with method calls.
.   - Each pattern handler's recognizer gets treated like a regex
.       or Parsing Expression Grammar (PEG) fragment.
.   - All of the recognizers in the same semantic pass are pasted
.       together in an ordered-choice expression.  The ordering is
.       determined by dependency analysis.
.   - A recognizer's pattern handler is invoked when the
.       recognizer's AST expression is matched.
.   - Once any substitutions are completed, then the machine
.       executing the pattern engine will set its cursor to the
.       beginning of the newly substituted AST nodes and continue
.       running.
.   - Executing potentially hundreds of pattern handlers in a
.       single ordered-choice expression would be obnoxious for a
.       packrat parser (packrat machine?).  Thankfully, ordered-
.       choice is possible in regular grammars, so it can be
.       lowered into regex operations and the whole thing turned
.       into a DFA.
.   - If pattern recognizers end up needing recursive elements,
.       then they will probably not appear at the very beginning
.       of the pattern.  Patterns with enough regular elements at
.       the start will be able to merge those regular elements
.       into the DFA with the rest of the pattern recognizers, and
.       it all becomes very fast table lookups in small tables.
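The "pasted together in an ordered-choice expression" dispatch can be approximated with an ordinary regex engine, since Python's `re` alternation is also leftmost-preferring.  A sketch with made-up handler names, using a flat string of node tags as a stand-in for the AST token stream:

```python
import re

# Several pattern handlers' recognizers merged into ONE alternation,
# compiled once; the engine dispatches on whichever branch matched.
handlers = [
    ("while_stmt", r"WhileStatement"),
    ("if_stmt",    r"IfStatement"),
    ("goto_stmt",  r"GotoStatement"),
]

# Ordered choice: earlier alternatives win, as in a PEG.
dispatch = re.compile("|".join("(?P<%s>%s)" % (name, pat)
                               for name, pat in handlers))

def run(stream):
    """Scan the stream, reporting which handler fires at which offset."""
    return [(m.lastgroup, m.start()) for m in dispatch.finditer(stream)]

fired = run("WhileStatement GotoStatement IfStatement")
```

In xdc the "characters" would instead be AST node tags plus node-entry/node-exit tokens, but the dispatch principle is the same.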

.- This compiler would involve the creation of a parser-generator
.    API that allows code to programmatically create grammars, and
.    to do so without a bunch of clumsy string formatting and
.    string concatenation.
.   - These grammars could be written such that things like AST
.       nodes are seen as terminals.  This expands possibilities
.       and allows all of the pattern handlers to be coalesced
.       into a grammar that operates on ASTs and fires off
.       semantic actions whenever one of the recognizer patterns
.       gets tickled by the right AST fragment.
.   - Using strings as terminals is still cool; and necessary for
.       xdc's text/D-code parser.
.   - A simple parser-generator API example:

---------------------------------------
string makeParser()
{
	auto builder = new ParserBuilder!char;
	builder.pushSequence();
		builder.literal('x');
		builder.pushMaybe();
			builder.literal('y');
		builder.pop();
	builder.pop();
	return builder.toDCode("callMe");
}

const foo = makeParser();

pragma(msg, foo);
---------------------------------------
Current output:
http://pastebin.com/V3E0Ubbc
---------------------------------------
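For comparison, here is a hedged Python re-imagining of the same builder call sequence; the class below only mimics the push/pop nesting style of the hypothetical ParserBuilder, and emits a compiled regex instead of D source:

```python
import re

class ParserBuilder:
    """Toy nesting builder: sequences and optional ('maybe') groups."""
    def __init__(self):
        self.stack = [[]]
    def push_sequence(self):
        self.stack.append([])
    def push_maybe(self):
        self.stack.append(["?"])     # marker: this frame is optional
    def literal(self, ch):
        self.stack[-1].append(re.escape(ch))
    def pop(self):
        frame = self.stack.pop()
        if frame and frame[0] == "?":
            self.stack[-1].append("(?:%s)?" % "".join(frame[1:]))
        else:
            self.stack[-1].append("".join(frame))
    def to_matcher(self):
        return re.compile("".join(self.stack[0]) + r"$")

b = ParserBuilder()
b.push_sequence()
b.literal("x")
b.push_maybe()
b.literal("y")
b.pop()
b.pop()
matcher = b.to_matcher()   # recognizes "x" optionally followed by "y"
```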

.   - Humans would probably never directly write grammars using
.       this API; it is intended for use by code that needs to
.       write grammars.  xdc would be such code: it's given a
.       bunch of pattern handlers and needs to turn them into a
.       grammar.
.   - This API could also make it easier to write the parser
.       generators that humans /would/ use. For example, it could
.       be used as an experimental backend for a regular
.       expression engine that can handle limited recursion.
.   - The packrats usually generated from PEGs are nice and all,
.       but I'd really like to generate DFAs whenever possible,
.       because those seem to be regarded as being /very fast/.
.   - DFAs can't handle the recursive elements of PEGs, but they
.       should be able to handle everything non-recursive that
.       precedes or follows the recursive elements.
.   - The parser-generator API would be responsible for
.       aggressively converting PEG-like elements into regex/DFA
.       elements whenever possible.
.   - Regular expressions can be embedded in PEGs as long as you
.       tell them how much text to match.  You have to give them
.       concrete success/failure conditions that can be determined
.       without help from the rest of the PEG: things like "match
.       as many characters as possible" or "match as few
.       characters as possible".  Without that, the regex's
.       backtracking (DFA'd or otherwise) won't mesh with the PEG.
.       Give it a concrete win/fail condition, however, and the
.       embedded regex becomes just another PEG building block
.       that chews through some source material and yields a
.       yes/no result.  Such regular expressions allow DFAs to be
.       introduced into a recursive descent or packrat parser.
.   - Many PEG elements can be converted into these well-behaved
.       regular expressions.
.      - PEG repetition is just regular expression repetition with
.          a wrapper around it that says "match as many characters
.          as possible".
.      - PEG ordered choice can be lowered into regular expression
.          unordered choice, which can then be converted into
.          DFAs:
.          I suspect that this is true:
.             (uv/xy)c == (uv|(^(uv)&xy))c
.          or, by De Morgan's law:
.             (uv/xy)c == (uv|(^(uv|^(xy))))c
.          & is intersection.
.          ^ is negation.
.          Each letter (u,v,x,y,c) can be a subexpression
.            (non-recursive).
.      - PEG label matching can be inlined up to the point where
.          recursion occurs, thus allowing more elements to be
.          considered for DFA conversion.
.      - etc.
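That suspected identity can at least be brute-force checked in the simplest case, where u, v, x, y, and c are single literal tokens.  A hedged Python sketch (note that PEG choice commits once uv succeeds, so overlapping subexpressions would need a more careful argument):

```python
import itertools

# Brute-force check of  (uv/xy)c == (uv|(^(uv)&xy))c  for literals.

def peg_match(s):
    # (uv/xy)c with PEG semantics: try "uv" first; once it matches a
    # prefix the choice is committed, and "c" must match the rest.
    if s.startswith("uv"):
        return s == "uvc"
    if s.startswith("xy"):
        return s == "xyc"
    return False

def regex_match(s):
    # (uv | (^(uv) & xy)) c  as plain language operations:
    # (xy minus uv) unioned with uv, then concatenated with c.
    choice = {"uv"} | ({"xy"} - {"uv"})
    return any(s == w + "c" for w in choice)

alphabet = "uvxyc"
all_strings = ("".join(t)
               for n in range(6)
               for t in itertools.product(alphabet, repeat=n))
mismatches = [s for s in all_strings if peg_match(s) != regex_match(s)]
```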

.- The parser would be defined using a PEG (most likely using
.    Pegged specifically).
.   - Although Pegged is an awesome achievement, I suspect its
.       output could be improved considerably.  The templated code
.       it generates is slow to compile and ALWAYS allocates parse
.       tree nodes at every symbol.
.   - I want to experiment with making Pegged (or a branch of it)
.       emit DFA/Packrat parser hybrids.  This could be done by
.       making a version of Pegged that uses the aforementioned
.       parser-generator API to create its parsers.
.   - Design principle:  avoid memory allocations like the plague.
.       The output should be a well-pruned AST, and not just a
.       parse tree that causes a bunch of allocations and needs
.       massaging to become useful.
.   - I really like Pegged and would contribute this stuff upward,
.       if accepted.

.- When hacking on xdc, you don't need to be aware of WHEN your
.    code gets executed in semantic analysis.  The dependency
.    analysis will guarantee that it always gets performed both
.    (a) when it's needed, and (b) when it has what it needs.
.   - This is what the "consumes" and "produces" variables are all
.       about in the above example.

.- Successfully lowering a D AST into the target backend's input
.    will almost certainly require multiple passes.  xdc's
.    dependency analyzer would automatically minimize the number
.    of passes by looking for patterns that are "siblings" in the
.    dependency graph (eg. neither depends on the other) and
.    bunching as many such patterns as possible into each pass.
.   - It really shouldn't generate very many more than the number
.       of passes that DMD has coded into it.  Ideally: no more
.       than DMD, if not fewer.
.   - I'd like to make the dependency analyzer output a graph that
.       can be used to track which patterns cause which passes to
.       exist, and show which patterns are in which passes.
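The pass-minimization can be sketched as a dependency-depth schedule: a handler that consumes what another produces must run in a later pass, and independent "siblings" share one.  The handler names and features below are invented for illustration and are not xdc's real dependency graph:

```python
# Group pattern handlers into passes by longest dependency path.
handlers = {
    "lower_foreach": {"consumes": {"foreach"}, "produces": {"for"}},
    "lower_for":     {"consumes": {"for"},     "produces": {"while"}},
    "lower_while":   {"consumes": {"while"},   "produces": {"if", "goto"}},
    "lower_scope":   {"consumes": {"scope"},   "produces": {"try"}},
}

def schedule(handlers):
    """Return a list of passes; each pass is a list of handler names."""
    def depth(name, seen=()):
        if name in seen:
            raise ValueError("cyclic dependency at " + name)
        wants = handlers[name]["consumes"]
        deps = [other for other, h in handlers.items()
                if h["produces"] & wants]
        return 1 + max((depth(d, seen + (name,)) for d in deps), default=0)
    passes = {}
    for name in handlers:
        passes.setdefault(depth(name), []).append(name)
    return [sorted(passes[d]) for d in sorted(passes)]

passes = schedule(handlers)
```

Here lower_foreach and lower_scope are siblings, so they share the first pass.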

.- Planned availability of backends.
.   - My first pick for a backend would be an ANSI C89 target.
.       I feel that this would give it the most reach.
.   - The interpreter backend is along for the ride, as mentioned.
.   - Because the semantic analysis is composed of distinct and
.       loosely-coupled patterns, it is possible for xdc to
.       generate an analysis chain with the minimum number of
.       lowerings needed for a given backend.
.      - The interpreter backend would benefit from having the
.          most lowerings.  By requiring a lot of lowering, the
.          interpreter would only need to support a small number
.          of constructs:
.         - if statements
.         - gotos
.         - function calls
.         - arithmetic expression evaluation
.         - builtin types (byte, short, int, long, float, etc)
.         - pointers
.         - Even structs are unnecessary: they can be seen as
.             typed dereferencing of untyped pointers.
.      - The C backend would benefit from slightly less
.         than the interpreter backend.  It is useful for
.         debugging if you can mostly-sorta read the resulting
.         C code, and your C compiler will appreciate the extra
.         optimization opportunities.
.         - Looping constructs like while and for are welcome
.             here.
.         - structs would be more readable.
.      - In the future, a Java or C# backend might use an entirely
.          different set of lowerings in later passes.
.         - Pointers are no longer considered "low".
.         - Classes should be kept as long as possible;
.             I'm pretty sure the bytecode (at least for Java)
.             has opcodes dedicated to classes.  Removing them
.             may cause pessimization.
.      - The backend writer should not have to worry about
.          rewriting the semantic analysis to suit their needs.
.          They just define some features and say which ones they
.          need available in the AST, and xdc's semantic-analysis-
.          generator will handle the rest.
.   - Notably, a backend should just be more lowerings, with the
.       result being text or binary code instead of AST nodes.
.      - Backends are essentially defined by the set of
.          AST/language features that they consume and any special
.          lowerings needed to convert generic AST/language
.          features into backend-specific AST/language features.


== Closing Thoughts ==

I am realizing that there are multiple reasons that compel me to write this document:
- To share my ideas with others, on the off-chance that someone else might see this vision too and be better equipped to deliver.
- To suggest capabilities that any community-endorsed compiler tool (ex: compiler-as-a-ctfe-library) should have.
- To see if I might be able to get the help I need to make it a reality.

I just can't decide which reasons are more important.  But there is a common thread: I want this vision to become reality and do really cool things while filling a bunch of missing links in D's ecosystem.

I have to ask:

Would you pay for this?
If so, then I might be able to do a kickstarter at some point.
I am not independently wealthy or retired (or both?) like Walter, nor am I able to survive on zero hours of sleep each night like Andrei, and this would be a big project.  I think it would need full-time attention or it would never become useful in a reasonable timeframe.

Also, assuming you understand the design, are there any gaping holes in this?
This is my first attempt to share these ideas with a larger group, and thus an opportunity to anticipate troubles.

...

Well, I'm anxious to see how well the venerable D community receives this bundle of ideas.  Be chatty.  I'll try to keep up.

Thank you for reading.
July 18, 2013
llvm should be platform-independent enough, so you have ldc. But the problem with D is that it has more features than C/C++, so if you output C code you can only use C features, or you have to implement druntime+phobos for the target platform, which doesn't come for free.
July 19, 2013
On Thursday, 18 July 2013 at 10:26:24 UTC, Kagamin wrote:
> llvm should be platform-independent enough, so you have ldc. But the problem with D is that it has more features than C/C++, so if you output C code you can only use C features, or you have to implement druntime+phobos for the target platform, which doesn't come for free.

I think a C backend would get us farther than an LLVM backend.  Imagine targeting one of the ubiquitous ARM targets like Android, and its gazillion variants.  LLVM has an ARM target, I'm pretty sure, but the Android community uses GCC as their compiler.  This puts LLVM/LDC on Android into a "may or may not work" category (unless they've done it already while I wasn't looking).  Emitting C/C++ code and feeding it to the Android NDK, though: that will work for sure, or at least land incredibly close.  Rinse and repeat for PIC, Cypress, iPhone, XBox (any of them), Wii, Ouya, PS3, PS4, 3DS, and whatever else might come out a year from now.  I suspect that LLVM will be missing support for a bunch of those and will lag new platforms significantly, unless support just happens to exist in its "ecosystem".

I think that LLVM might have a maintained C backend again.  It had been unmaintained/unsupported for a while.  If someone thinks that using LLVM's C backend makes more sense, then they should do it with LDC.  I'd be really happy about it, because I want that functionality.

I am not interested in using LLVM as my /first/ backend in xdc for a combination of the above reasons and their implications:
.- LLVM emitting native code for xdc will require someone
.    knowledgable with LLVM to set things up and potentially
.    spend time on it if the target platform requires tweaking.
.    Getting handed a file full of C code requires only knowledge
.    of how to use the C compiler on the target system.
.- LLVM emitting C code should probably be done by ldc instead.
.- LLVM emitting C code would also add unnecessary layers:
.    D->AST->LLVM_IR->C vs D->AST->C.
.    This extra layer can lose information.

You are right that druntime+phobos don't come for free.  Doing complete ports of those to platform X is outside the scope of this project.  I do intend to salvage whatever parts of them I can in a portable manner.  Moreover, I intend xdc to have the capability to toggle druntime/phobos features on and off easily based on existing support for the intended platform.  The C-Windows target would probably have more druntime/phobos features enabled than the C-Android target.  I'd like to make sure that anyone can use what's available, and if not, still be allowed to access the platform.  I don't think it makes sense that I'd be completely unable to write iPhone code just because there isn't GC/threading code available for it in the runtime yet; I should be able to do some non-trivial things without those.  Note that a reference-counting implementation would be portable everywhere that doesn't have a native GC, so you'd never have anything less than that (unless you intentionally ditched it, ex: memory-limited microcontrollers where things are statically allocated anyway).

To be more concrete, I intend to make it possible to invoke xdc with "hints", like so:
xdc --emit=C89 --hostc=gcc --os=windows --threads=pthreads main.d
# Breakdown:
# --emit=C89 causes C89 compliant code to be emitted.
# --hostc=gcc causes xdc to feed the C89 code to gcc (mingw).
# --os=windows causes xdc to use WinAPI by default for
#     implementing druntime and platform-specific language
#     features.
# --threads=pthreads overrides the use of Windows threads;
#     posix threads are used instead.
# This combination could probably land GC+threading and mesh well
#   with a mingw environment.

Alternatively, it could be invoked like this:
xdc --emit=C89 main.d
# This will generate some very conservative C89 code.
# You will have refcounting instead of GC, and thread spawning
#   won't work.
# However, you'll have some really freakin' portable code!

Even with a conservative target like C89-only, there are still an incredibly large number of extremely useful D features (OOP, templates, scope(exit), CTFE, mixins, ranges, op-overloading, etc) that DO come for free.  These can be lowered into C code that does the same thing but looks uglier.  Such lowered code would take much longer for a human to write and become tedious to maintain, but a compiler can do it tirelessly.
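As a taste of what such tireless lowering looks like, here is a hedged Python sketch that prints the C89-ish shape a scope(exit) might lower to; only the `return` case is handled, and none of this is xdc's real emitter:

```python
# Lower a body with a single scope(exit)-style cleanup into C89-ish
# text: every return is rewritten to set retval and jump to cleanup,
# so the cleanup runs on every exit path.

def lower_scope_exit(body_stmts, cleanup_stmts):
    lines = ["{"]
    for s in body_stmts:
        if s.startswith("return "):
            lines.append("    retval = %s goto _cleanup;"
                         % s[len("return "):])
        else:
            lines.append("    " + s)
    lines.append("_cleanup:")
    lines.extend("    " + s for s in cleanup_stmts)
    lines.append("    return retval;")
    lines.append("}")
    return "\n".join(lines)

c_code = lower_scope_exit(
    ['f = fopen(name, "r");', "return parse(f);"],
    ["if (f) fclose(f);"])
```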
July 19, 2013
On Friday, 19 July 2013 at 13:38:12 UTC, Chad Joan wrote:
>
> Even with a conservative target like C89-only, there are still an incredibly large number of extremely useful D features (OOP, templates, scope(exit), CTFE, mixins, ranges, op-overloading, etc) that DO come for free.

I love the idea behind xdc, but I would go with C99 instead, even MS as the last vendor(?), with VS2013 now finally committed to supporting C99, variable length arrays alone would make it worth it.
July 19, 2013
On Friday, 19 July 2013 at 13:38:12 UTC, Chad Joan wrote:
> Imagine targeting one of the ubiquitous ARM targets like Android, and its gazillion variants.  LLVM has an ARM target I'm pretty sure, but the Android community uses GCC as their compiler.  This puts LLVM/LDC on Android into a "may or may not work" category (unless they've done it already while I wasn't looking).
A small correction, the Android NDK added clang/llvm support last November in revision 8c:

http://developer.android.com/tools/sdk/ndk/index.html

gcc is still the default, but clang is an alternative option.  I don't think ldc has been ported to use whatever llvm libraries the NDK is providing, but there is some official support for llvm on Android now.
July 20, 2013
On Friday, 19 July 2013 at 15:39:36 UTC, Tove wrote:
> On Friday, 19 July 2013 at 13:38:12 UTC, Chad Joan wrote:
>>
>> Even with a conservative target like C89-only, there are still an incredibly large number of extremely useful D features (OOP, templates, scope(exit), CTFE, mixins, ranges, op-overloading, etc) that DO come for free.
>
> I love the idea behind xdc, but I would go with C99 instead, even MS as the last vendor(?), with VS2013 now finally committed to supporting C99, variable length arrays alone would make it worth it.

I am, in fact, totally willing to make --emit=C99 do cool things once I discover what those cool things are.  Otherwise it will probably emit code that is both C89 and C99 compliant.

I feel that most of the D features that can be implemented with a C99 compiler can be implemented with C89 as well, and C89 might give more reach into esoteric targets like microcontrollers or legacy systems.

Maybe I should ask this: what D features are you afraid of losing due to lowering into C89?

...

I hate to sour whatever cheerful expectations you might have, but, AFAIK, D doesn't have VLAs.  Consequently, I don't think xdc would need them.  I hear that VLA support is becoming optional in C11 as well, which doesn't bode well for its continued existence; see http://en.wikipedia.org/wiki/Variable-length_array

"Programming languages that support VLAs include [...] C99 (and subsequently in C11 relegated to a conditional feature which implementations aren't required to support; ..."

It is important to note that even if D /did/ have VLAs, then it would be possible to lower them into other constructs:

void foo(int n)
{
	char[n] vla;
	...
}

-= can be emulated by =-

void foo(int n)
{
	char[] vla = (cast(char*)std.c.stdlib.malloc(n))[0..n];
	scope(exit) std.c.stdlib.free(vla.ptr);
	...
}

This would be important for emitting to any language without VLAs.  It could be used if someone wanted to add VLAs to xdc as an (off-by-default) experimental feature.  And of course it should use "new T[n]" with core.memory.GC.free if the array contains pointers, or stdlib is unavailable (ex: Java).  I would also expect --emit=C99 to avoid the heap allocation.  alloca might be usable with C89 in some cases, but is very dangerous (in my experience).

----

A friend of mine, an expert C programmer, also poked me about this C89 vs C99 thing.  It's strange to me because I thought that emitting C89 was actually a strong selling point: you're getting D, which is so much more than C99, AND it will reach all of the old C89 compilers.

If there are still things that you (community inclusive) are afraid of missing, then I am pretty willing to do C99 instead and skip C89.  I will need to know what they are though, or it won't make a difference anyway (I can't implement what I don't know about!).

HTH.
July 20, 2013
On Friday, 19 July 2013 at 16:42:32 UTC, Joakim wrote:
> On Friday, 19 July 2013 at 13:38:12 UTC, Chad Joan wrote:
>> Imagine targeting one of the ubiquitous ARM targets like Android, and its gazillion variants.  LLVM has an ARM target I'm pretty sure, but the Android community uses GCC as their compiler.  This puts LLVM/LDC on Android into a "may or may not work" category (unless they've done it already while I wasn't looking).
> A small correction, the Android NDK added clang/llvm support last November in revision 8c:
>
> http://developer.android.com/tools/sdk/ndk/index.html
>
> gcc is still the default, but clang is an alternative option.  I don't think ldc has been ported to use whatever llvm libraries the NDK is providing, but there is some official support for llvm on Android now.

Cool, thanks for the info.

It's been a while since I've looked at the Android NDK.  It's just not fun to me without D being there ;)
July 20, 2013
On Thursday, 18 July 2013 at 03:26:10 UTC, Chad Joan wrote:
[...]

Is the input to xdc a semantically-analyzed D AST, or does semantic analysis occur during pattern-matching/lowering?