Q about Phobos regex's architecture

I'll admit this is a bit unorthodox, but...I've been wondering about something regarding Phobos regex's implementation and internal architecture: Just how compartmentalized is the parsing of standard PCRE regex syntax vs actual usage of regexes once parsed?

Or more to the point, (and again, I realize how unorthodox this is), what would it take to implement an alternate (ie, non-PCRE) syntax for regexes that still *uses* the rest of the Phobos regex implementation once the regex string is parsed?

Is it currently coupled enough that the only realistic option is to translate the alternate syntax into standard PCRE regex syntax?

Is there a (perhaps "protected", but maybe even "public" if I'm really lucky) manual interface to Phobos regex implementation that bypasses the PCRE parsing?

Any tips/pointers on where to start with this?

April 13, 2017

Re: Q about Phobos regex's architecture

Posted by Dmitry Olshansky
in reply to Nick Sabalausky (Abscissa)

On 4/13/17 7:53 AM, Nick Sabalausky (Abscissa) wrote:
> I'll admit this is a bit unorthodox, but...I've been wondering about
> something regarding Phobos regex's implementation and internal
> architecture: Just how compartmentalized is the parsing of standard PCRE
> regex syntax vs actual usage of regexes once parsed?

Essentially, the regex is parsed into bytecode, which is encapsulated in the Regex!Char struct. The match family of functions then uses that bytecode to construct one of the engines; even the CT engine uses the same bytecode under the hood. Thus matching has nothing to do with the parser.
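
To see this from the user's side, here is a minimal sketch using only the public API (`regex`, `ctRegex`, `matchFirst`). The helper names `firstHitRuntime` and `firstHitCompileTime` are made up for illustration. It does not expose the bytecode; it only shows that the same match function accepts a pattern compiled at run time or at compile time:

```d
import std.regex : ctRegex, matchFirst, regex;

// Hypothetical helpers; only regex, ctRegex and matchFirst are Phobos API.
string firstHitRuntime(string input)
{
    // The pattern string is parsed when this call executes.
    auto re = regex(`(\d+)-(\d+)`);
    auto m = matchFirst(input, re);
    return m.empty ? null : m.hit;
}

string firstHitCompileTime(string input)
{
    // The same pattern, processed during compilation instead.
    auto re = ctRegex!(`(\d+)-(\d+)`);
    auto m = matchFirst(input, re);
    return m.empty ? null : m.hit;
}

void main()
{
    // Both paths go through the same match function and agree on the result.
    assert(firstHitRuntime("range 10-20") == "10-20");
    assert(firstHitCompileTime("range 10-20") == "10-20");
    assert(firstHitRuntime("no digits here") is null);
}
```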

> Or more to the point, (and again, I realize how unorthodox this is),
> what would it take to implement an alternate (ie, non-PCRE) syntax for
> regexes that still *uses* the rest of the Phobos regex implementation
> once the regex string is parsed?

It would take a new parser; the rest is transparently reused. It may also take a new entry point for CT-regex, because it blindly uses the `regex` function inside. Even the bytecode generator _might_ be reusable.
In the worst case, the new parser will have to generate the bytecode itself.
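
As a rough illustration of the idea, here is a toy sketch with two front ends that emit the same instruction list for one shared matcher. Everything in it (`Op`, `Instr`, `parseRegexLike`, `parseGlobLike`, `matchAll`) is hypothetical and far simpler than the real thing; it is not the Phobos bytecode or parser, and it handles ASCII input only:

```d
// Toy illustration only: NOT the bytecode from std/regex/internal/ir.d.
enum Op { chr, any }

struct Instr
{
    Op op;
    dchar ch;   // used only by Op.chr
    bool star;  // true: match this atom zero or more times
}

// Front end 1: regex-like syntax with literals, '.', and postfix '*'.
// A leading '*' has nothing to repeat and is kept as a literal.
Instr[] parseRegexLike(string pat)
{
    Instr[] code;
    foreach (dchar c; pat)
    {
        if (c == '*' && code.length > 0)
            code[$ - 1].star = true;
        else if (c == '.')
            code ~= Instr(Op.any);
        else
            code ~= Instr(Op.chr, c);
    }
    return code;
}

// Front end 2: glob-like syntax; '?' is any one char, '*' is any run.
Instr[] parseGlobLike(string pat)
{
    Instr[] code;
    foreach (dchar c; pat)
    {
        if (c == '*')
            code ~= Instr(Op.any, dchar.init, true);
        else if (c == '?')
            code ~= Instr(Op.any);
        else
            code ~= Instr(Op.chr, c);
    }
    return code;
}

// Shared back end: a backtracking matcher that only ever sees Instr[].
// It must consume the whole input to succeed.
bool matchAll(const(Instr)[] code, string input)
{
    if (code.length == 0)
        return input.length == 0;
    const ins = code[0];
    bool atomMatches(char c) { return ins.op == Op.any || ins.ch == c; }
    if (ins.star)
    {
        // Zero repetitions: skip the starred atom.
        if (matchAll(code[1 .. $], input))
            return true;
        // One more repetition: consume a char, stay on the same atom.
        return input.length > 0 && atomMatches(input[0])
            && matchAll(code, input[1 .. $]);
    }
    return input.length > 0 && atomMatches(input[0])
        && matchAll(code[1 .. $], input[1 .. $]);
}

void main()
{
    // Different surface syntax, identical instruction list.
    assert(parseRegexLike("ab.*z") == parseGlobLike("ab*z"));
    assert(matchAll(parseGlobLike("ab*z"), "abcdz"));
    assert(!matchAll(parseRegexLike("ab.*z"), "abzq"));
}
```

The matcher never looks at the pattern string, so adding a third syntax means adding a third front end and nothing else.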

> Is it currently coupled enough that the only realistic option is to
> translate the alternate syntax into standard PCRE regex syntax?
>

No, the bytecode nicely decouples matching from compiling.

> Is there a (perhaps "protected", but maybe even "public" if I'm really
> lucky) manual interface to Phobos regex implementation that bypasses the
> PCRE parsing?
>
> Any tips/pointers on where to start with this?

At the moment it's all internal. That doesn't stop you from doing `import std.regex.internal.ir`, for instance. Still, I think you'd need to hack on Phobos to bypass the protection levels.

Look at std/regex/internal/parser.d, in particular the makeRegex function, which is the culmination of the parsing step. There you'll see all the bits and pieces needed to populate the Regex!Char struct.

A brief description of the bytecode is in std/regex/internal/ir.d; you can also trace its use in the CodeGen struct.

Also, if you are interested, I'd fully support building a public API to generate regexes without touching the parser.

---
Dmitry Olshansky
