Thread overview
Compile time regex matching
Jul 14, 2014
Jason den Dulk
Jul 14, 2014
Philippe Sigaud
Jul 15, 2014
Jason den Dulk
Jul 15, 2014
Philippe Sigaud
Jul 14, 2014
Artur Skawina
Jul 15, 2014
Philippe Sigaud
July 14, 2014
Hi

I am trying to write some code that matches regular expressions at compile time, but the compiler won't let me because matchFirst and matchAll make use of malloc().

Is there an alternative that I can use that can be run at compile time?

Thanks in advance.
Jason
July 14, 2014
> I am trying to write some code that matches regular expressions at compile time, but the compiler won't let me because matchFirst and matchAll make use of malloc().
>
> Is there an alternative that I can use that can be run at compile time?

You can try Pegged, a parser generator that works at compile-time (both the generator and the generated parser).

https://github.com/PhilippeSigaud/Pegged

docs:

https://github.com/PhilippeSigaud/Pegged/wiki/Pegged-Tutorial

It's also on dub:

http://code.dlang.org/packages/pegged

It takes a grammar as input, not a single regular expression, but the syntax is not too different.


  import pegged.grammar;

  mixin(grammar(`
  MyRegex:
      foo <- "abc"* "def"?
  `));

  void main()
  {
      enum result = MyRegex("abcabcdefFOOBAR"); // compile-time parsing

      // everything can be queried and tested at compile-time, if need be.
      static assert(result.matches == ["abc", "abc", "def"]);
      static assert(result.begin == 0);
      static assert(result.end == 9);

      pragma(msg, result.toString()); // parse tree
  }


It probably does not implement all those nifty regex features, but it has all the usual power of Parsing Expression Grammars. It gives you an entire parse result, though: matches, children, subchildren, etc. As you can see, matches are accessible at the top level.
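For a pattern as small as the one above, a plain function can also be evaluated at compile time, since CTFE only requires that the code avoids things like malloc(). The following is a minimal sketch, not Pegged code: matchFoo is a hypothetical hand-written equivalent of the rule foo <- "abc"* "def"? and only returns the list of matches, not a parse tree.

```d
/// Hypothetical hand-written equivalent of the rule  foo <- "abc"* "def"?
/// Returns the pieces matched at the start of `input`.
string[] matchFoo(string input)
{
    string[] matches;

    // "abc"* : consume as many leading "abc" as possible
    while (input.length >= 3 && input[0 .. 3] == "abc")
    {
        matches ~= "abc";
        input = input[3 .. $];
    }

    // "def"? : optionally consume one "def"
    if (input.length >= 3 && input[0 .. 3] == "def")
        matches ~= "def";

    return matches;
}

// Initializing an enum forces the call to be evaluated through CTFE.
enum found = matchFoo("abcabcdefFOOBAR");
static assert(found == ["abc", "abc", "def"]);

void main() {}
```

This obviously does not scale to real grammars, which is where a generator like Pegged earns its keep.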

One thing to keep in mind, which comes from the language and not from this library: in the previous code, since 'result' is an enum, it is 'pasted' in place every time it's used in code, so all those static asserts get an entire copy of the parse tree. It's a bit wasteful. Using 'immutable' directly does not work here, but this is OK:

    enum res = MyRegex("abcabcdefFOOBAR"); // compile-time parsing
    immutable result = res; // to avoid copying the enum value everywhere

The static asserts then work (not the toString, though). Maybe someone more knowledgeable than me on DMD internals could confirm that it indeed avoids re-allocating those parse results.
July 14, 2014
On 07/14/14 13:42, Philippe Sigaud via Digitalmars-d-learn wrote:
> asserts get an entire copy of the parse tree. It's a bit wasteful, but using 'immutable' directly does not work here, but this is OK:
> 
>     enum res = MyRegex("abcabcdefFOOBAR"); // compile-time parsing
>     immutable result = res; // to avoid copying the enum value everywhere

   static immutable result = MyRegex("abcabcdefFOOBAR"); // compile-time parsing


> The static asserts then work (not the toString, though). Maybe

diff --git a/pegged/peg.d b/pegged/peg.d
index 98959294c40e..307e8a14b1dd 100644
--- a/pegged/peg.d
+++ b/pegged/peg.d
@@ -55,7 +55,7 @@ struct ParseTree
     /**
     Basic toString for easy pretty-printing.
     */
-    string toString(string tabs = "")
+    string toString(string tabs = "") const
     {
         string result = name;

@@ -262,7 +262,7 @@ Position position(string s)
 /**
 Same as previous overload, but from the begin of P.input to p.end
 */
-Position position(ParseTree p)
+Position position(const ParseTree p)
 {
     return position(p.input[0..p.end]);
 }

[completely untested; just did a git clone and fixed the two
 errors the compiler was whining about. Hmm, did pegged get
 faster? Last time i tried (years ago) it was unusably slow;
 right now, compiling your example, i didn't notice the extra
 multi-second delay that was there then.]

artur
July 15, 2014
On Mon, Jul 14, 2014 at 3:19 PM, Artur Skawina via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:
> On 07/14/14 13:42, Philippe Sigaud via Digitalmars-d-learn wrote:
>> asserts get an entire copy of the parse tree. It's a bit wasteful, but using 'immutable' directly does not work here, but this is OK:
>>
>>     enum res = MyRegex("abcabcdefFOOBAR"); // compile-time parsing
>>     immutable result = res; // to avoid copying the enum value everywhere
>
>    static immutable result = MyRegex("abcabcdefFOOBAR"); // compile-time parsing

Ah, static!
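To make the difference concrete, here is a small sketch that is independent of Pegged. The names pasted and stored are made up for the illustration, and the pointer comparison on the enum reflects how DMD behaves as far as I know (each use of an array-literal enum allocates a new array):

```d
// An enum is a manifest constant: the literal is re-created at each use.
enum int[] pasted = [1, 2, 3];

// A static immutable is a single object, initialized at compile time
// and stored once in the binary.
static immutable int[] stored = [1, 2, 3];

void main()
{
    auto a = pasted;          // allocates a fresh array
    auto b = pasted;          // allocates another one
    assert(a == b);           // same contents
    assert(a.ptr !is b.ptr);  // but different storage (assumption: DMD behaviour)

    auto c = stored;
    auto d = stored;
    assert(c.ptr is d.ptr);   // both slices refer to the same data
}
```

The same reasoning should apply to a parse tree: with static immutable there is one tree, and every use refers to it.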


>
>> The static asserts then work (not the toString, though). Maybe
(snip diff)

I'll push that to the repo, thanks! I should sprinkle some const and pure everywhere...
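For readers wondering why the const matters: a member function that is not marked const cannot be called on an immutable value, so the compiler refuses it even if the function never modifies anything. A minimal sketch with a made-up Node struct (not Pegged's ParseTree):

```d
struct Node
{
    string name;

    // Without `const` here, calling describe() on an immutable Node
    // is a compile error.
    string describe() const
    {
        return "node " ~ name;
    }
}

static immutable Node root = Node("root");

// Accepted because describe() promises not to modify the Node.
static assert(root.describe() == "node root");

void main() {}
```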

> [completely untested; just did a git clone and fixed the two
>  errors the compiler was whining about. Hmm, did pegged get
>  faster? Last time i tried (years ago) it was unusably slow;
>  right now, compiling your example, i didn't notice the extra
>  multi-second delay that was there then.]

It's still slower than some handcrafted parsers. At one point, I could get it on par with std.regex (between 1.3 and 1.8 times slower), but that meant losing some other properties. I have other parsing engines partially implemented, with either a larger range of supported grammars or better speed (but not both!). I hope the coming holidays will let me go back to it.
July 15, 2014
On Monday, 14 July 2014 at 11:43:01 UTC, Philippe Sigaud via Digitalmars-d-learn wrote:

> You can try Pegged, a parser generator that works at compile-time
> (both the generator and the generated parser).

I did, and I got it to work. Unfortunately, the code used in CTFE is left in the final executable even though it is not used at runtime. So now the question is: is there a way to get rid of the excess baggage?

BTW, here is the code I am playing with.

  import std.stdio;

  string get_match()
  {
    import pegged.grammar;
    mixin(grammar(`
      MyRegex:
          foo <- "abc"* "def"?
      `));

    auto result = MyRegex(import("config-file.txt")); // compile-time parsing
    return "writeln(\""~result.matches[0]~"\");";
  }

  void main()
  {
    mixin(get_match());
  }

July 15, 2014
> I did, and I got it to work. Unfortunately, the code used in CTFE is left in the final executable even though it is not used at runtime. So now the question is: is there a way to get rid of the excess baggage?

Not that I know of. Once code is injected, it's compiled into the executable.

>   auto result = MyRegex(import("config-file.txt")); // compile-time parsing
>   return "writeln(\""~result.matches[0]~"\");";


>   mixin(get_match());

I never tried that; I'm happy it works.

Another solution would be to move these actions to runtime, by using a small script in place of your compilation command. This script can be written in D.

- The script takes a file name as input.
- It opens the file.
- It uses regex to parse it.
- It extracts the values you want and writes them to a temporary file.
- It invokes the compiler (with std.process) on your main file, with the -Jpath flag pointing to the location of the temporary file. Inside your real code, you can thus use mixin(import("temp file")) happily.
- It deletes the temporary file once the previous step is finished.

Compile the script once and for all; it should execute quite rapidly. It's an unusual pre-processor, in a way.
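The steps above could look roughly like this. It is an untested sketch under several assumptions: dmd is on the PATH, the main file contains mixin(import("generated.txt")), and the names makeSnippet and generated.txt are made up for the example. The regex mirrors the grammar from earlier in the thread, but it emits the whole matched prefix rather than only the first "abc".

```d
import std.file : readText, remove, tempDir, write;
import std.path : buildPath;
import std.process : execute;
import std.regex : matchFirst, regex;
import std.stdio : writeln;

/// Build the statement that the main program will mix in.
/// The pattern can match the empty string, so the match never fails.
string makeSnippet(string config)
{
    auto m = matchFirst(config, regex(`^(abc)*(def)?`));
    return `writeln("` ~ m.hit ~ `");`;
}

int main(string[] args)
{
    if (args.length < 3)
    {
        writeln("usage: prebuild <config-file> <main.d>");
        return 1;
    }

    // Write the generated code where the compiler can find it via -J.
    immutable dir = tempDir();
    immutable tmp = buildPath(dir, "generated.txt");
    write(tmp, makeSnippet(readText(args[1])));
    scope (exit) remove(tmp); // delete the temporary file when done

    // Compile the real program; -J names the string-import directory.
    auto result = execute(["dmd", "-J" ~ dir, args[2]]);
    writeln(result.output);
    return result.status;
}
```

Since the parsing now happens in the script, neither std.regex nor Pegged needs to be imported by the final program.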