Writing a compiler in CTFE

July 01, 2018
Posted by MysteryMan
I would like to create a compiler. I like D, but I hate it! I want to migrate to a new compiler, possibly a personal one that I can easily customize and tweak to my heart's content.

For speed of development, instead of having to compile a compiler that then compiles the program, I figured D's CTFE and string `import` could work. For a monolithic compiler file, `import` is used to pull the source in. This is all easily within D's grasp.
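A minimal sketch of the idea, under some assumptions: `countStatements` is a hypothetical stand-in for a real compiler front end, and the source text is a literal so the example is self-contained. With a real file one would presumably write `import("program.src")` instead, which requires passing `-J<dir>` to dmd.

```d
import std.stdio : writeln;

// Hypothetical stand-in for a compiler front end: it only counts the
// non-blank lines of the source text it is given.
size_t countStatements(string source)
{
    size_t n = 0;
    bool lineHasText = false;
    foreach (char c; source)
    {
        if (c == '\n')
        {
            if (lineHasText) ++n;
            lineHasText = false;
        }
        else if (c != ' ' && c != '\t' && c != '\r')
        {
            lineHasText = true;
        }
    }
    if (lineHasText) ++n; // last line without a trailing newline
    return n;
}

// Assumption: with a real file this would be
//     enum source = import("program.src");   // needs dmd -J<dir>
// A literal is used here to keep the sketch self-contained.
enum source = "let x = 1\n\nprint x\n";

// Initializing an enum forces the call to be evaluated at compile time.
enum statements = countStatements(source);
static assert(statements == 2);

void main()
{
    writeln(statements);
}
```

The same function can also be called at run time; nothing in it is CTFE-specific, which is part of what makes this approach attractive for bootstrapping.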


The process is as follows:

We write D code (compiled with dmd) utilizing all the power of the D language, but keep the complexity minimal, since it is for bootstrapping only. Ideally it exists only at version 0 of the compiler: it parses our new language's grammar, and from it we build our new compiler written in its own language.


We can break the process up into 4 stages:

[Our new compiler's source code written in its own language]      ->
[D source code that compiles sources in our new language at CTFE] ->
[DMD] ->
[Have the binary run on the source code from stage 1]

After these steps are done, one has a binary: the bootstrap compiler, which can be used as the "core" compiler for the new language. It accepts the core language, which should be minimally specified to avoid complexity, bugs, etc., but still be fully expressive.

To move the next version of the compiler away from dmd, one must then alter the source code to supply new binary code generators in place of the ones relied on in stage 2. This is a lot of work, as all semantics must be remapped from dmd's design to the new language's design.

This last stage is where all the thought must be put in so we can minimize design time.


So, we start with a well specified but arbitrary programming language that has symbols and semantics for those symbols.

For example, we have the super tiny compiler, which is written in JavaScript: https://github.com/jamiebuilds/the-super-tiny-compiler/blob/master/the-super-tiny-compiler.js

To make life more interesting, just assume this is done in D.

This could be our input to dmd's CTFE engine, for which we would need a D parser that can parse the source code. That maps D constructs to D constructs, so it is very easy; in fact, we can just `mixin` the code directly. Imagine a mixinjs that mixes in JS source code after converting it to D: a bit more complicated, but still doable.
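As a rough illustration of translating at CTFE and then mixing in the result, here is a sketch that handles the same kind of input the super tiny compiler does, Lisp-style calls such as `(add 2 (subtract 4 2))`. The names `translate` and `lispToD` are made up for this example, and there is no error handling for malformed input.

```d
import std.stdio : writeln;

// Target functions the generated D code will call.
int add(int a, int b) { return a + b; }
int subtract(int a, int b) { return a - b; }

// Hypothetical helper: translates one Lisp-style expression starting at
// src[pos] into D call syntax and advances pos past it.
string translate(string src, ref size_t pos)
{
    while (pos < src.length && src[pos] == ' ') ++pos;

    if (pos < src.length && src[pos] == '(')
    {
        ++pos; // consume '('
        size_t nameStart = pos;
        while (pos < src.length && src[pos] != ' ' && src[pos] != ')') ++pos;
        string result = src[nameStart .. pos] ~ "(";

        bool first = true;
        while (true)
        {
            while (pos < src.length && src[pos] == ' ') ++pos;
            if (pos >= src.length) break;           // unbalanced input: give up
            if (src[pos] == ')') { ++pos; break; }  // end of this call
            if (!first) result ~= ", ";
            result ~= translate(src, pos);
            first = false;
        }
        return result ~ ")";
    }

    // Atom: a number or a name.
    size_t atomStart = pos;
    while (pos < src.length && src[pos] != ' ' && src[pos] != '(' && src[pos] != ')') ++pos;
    return src[atomStart .. pos];
}

string lispToD(string src)
{
    size_t pos = 0;
    return translate(src, pos);
}

// The translation runs entirely at compile time.
enum dCode = lispToD("(add 2 (subtract 4 2))");
static assert(dCode == "add(2, subtract(4, 2))");

void main()
{
    int value = mixin(dCode); // compiled as ordinary D code
    writeln(value);           // prints 4
}
```

The generated string is checked with `static assert` before it is mixed in, which is a cheap way to debug a CTFE translator without running anything.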


What's interesting about this method is that one can always (assuming no broken compatibilities) use D to generate a new bootstrap, and also use the last version to bootstrap itself.

The bootstrapped compiler automatically has all the features that dmd has; for example, all of its architectures are available (though this does require recompiling the bootstrap with the new dmd args).

What's more, if we already had a CTFE compiler for our language, we could use it inside any D program. As I've already shown with mixinjs, we could have a mixin(import!(js)(file)) which converts the JS code to D code and mixes it in directly. Some plumbing may be required, but it would allow us to import not only D code into D but also other languages (those that can easily be represented in D).


For example, suppose we had a C to D compiler in the above sense. import!C(C_file) would take any C file and map the source to D source (most of the syntax is identical, so it is an easy mapping).

Some work is required; for example, it would have to map #include X directives to import!C(X);. With some plumbing work we can use any C code with D.
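A sketch of just the directive rewriting, under loud assumptions: `mapIncludes` and the target form `importFrom!"C"(...)` are hypothetical names invented here (`import` itself is a D keyword, so the notation above cannot be used literally as a template name). The function only generates text; it does not translate the rest of the C source or mix anything in.

```d
import std.stdio : writeln;

// Hypothetical helper: rewrites each line that begins with `#include `
// into a call to an imaginary importFrom!"C" template and leaves every
// other line untouched. Works at CTFE and at run time.
string mapIncludes(string cSource)
{
    enum prefix = "#include ";
    string result;
    size_t lineStart = 0;

    foreach (i; 0 .. cSource.length + 1)
    {
        if (i == cSource.length || cSource[i] == '\n')
        {
            string line = cSource[lineStart .. i];
            if (line.length > prefix.length && line[0 .. prefix.length] == prefix)
                result ~= "mixin(importFrom!\"C\"(" ~ line[prefix.length .. $] ~ "));";
            else
                result ~= line;
            if (i < cSource.length) result ~= "\n";
            lineStart = i + 1;
        }
    }
    return result;
}

enum rewritten = mapIncludes("#include \"util.h\"\nint x;");
static assert(rewritten == "mixin(importFrom!\"C\"(\"util.h\"));\nint x;");

void main()
{
    writeln(rewritten);
}
```

Angle-bracket includes such as `<stdio.h>` would need their own handling, since the text after the prefix is copied through unchanged.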


Such a concept would be very powerful indeed! But to accomplish this in a general way, we need a very general way to specify a compiler framework in D (one that works in CTFE for rapid production) that makes it easy to represent most popular languages.


Most of the work is in translating one grammar to the other, and therefore this new framework must make translation easy.

E.g., the for loop in C is identical to the for loop in D, so a direct mapping can be used. In MATLAB code the for loop looks like for i = 1:10. For the most part this is just a rearrangement of the for loop in C, so it too has a direct mapping.
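To make the MATLAB case concrete, here is a small sketch that rewrites a loop header of the form `for i = lo:hi` into a D `foreach` header at CTFE. The helper names (`find`, `trim`, `matlabForToD`) are made up for the example, and stepped ranges such as `1:2:10` are not handled. Note the `+ 1`: MATLAB's range includes its upper bound, while D's `..` excludes it.

```d
import std.stdio : writeln;

// Index of the first occurrence of c in s, or s.length if absent.
size_t find(string s, char c)
{
    foreach (i, ch; s)
        if (ch == c) return i;
    return s.length;
}

// Strips leading and trailing spaces.
string trim(string s)
{
    while (s.length && s[0] == ' ') s = s[1 .. $];
    while (s.length && s[$ - 1] == ' ') s = s[0 .. $ - 1];
    return s;
}

// Hypothetical translator for headers shaped like "for i = 1:10".
string matlabForToD(string header)
{
    enum keyword = "for ";
    assert(header.length > keyword.length && header[0 .. keyword.length] == keyword,
           "expected a header of the form: for i = lo:hi");

    string rest = header[keyword.length .. $]; // e.g. "i = 1:10"
    size_t eq = find(rest, '=');
    size_t colon = find(rest, ':');
    assert(eq < colon && colon < rest.length, "expected '=' followed by ':'");

    string name = trim(rest[0 .. eq]);
    string lo = trim(rest[eq + 1 .. colon]);
    string hi = trim(rest[colon + 1 .. $]);

    // MATLAB's upper bound is inclusive, D's is exclusive.
    return "foreach (" ~ name ~ "; " ~ lo ~ " .. (" ~ hi ~ ") + 1)";
}

static assert(matlabForToD("for i = 1:10") == "foreach (i; 1 .. (10) + 1)");

void main()
{
    int total = 0;
    // The translated header plus a D body is mixed in as a statement.
    mixin(matlabForToD("for i = 1:10") ~ " { total += i; }");
    writeln(total); // prints 55
}
```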

The best I can understand it, we have our input language's grammar and we want to map it to the D language grammar. Hence we have a mapping between grammars.

This is a very complex issue because of several corner cases. What I am proposing here is a discussion of ways to express this problem so as to maximize expressivity while minimizing effort (the good old min/max problem we all know and love).

I will start by expressing my two current positions on this problem:

One of the first problems is to settle on terminology and discuss the pathological issues that exist.