View mode: basic / threaded / horizontal-split · Log in · Help
April 15, 2007
Fast regular expressions
Here is an alpha release of a regular expression engine i'm working on:
http://mainia.de/tdfaregexp-0.1-alpha.zip
(artistic license 2.0)

It implements a slightly modified version of Ville Laurikari's tagged
NFA method that allows for regular expression parsing in linear time as
opposed to the very common backtracking method that requires exponential
time (worst case).

As of now, the engine supports greedy and non-greedy operators,
submatches and character classes. It compiles a regexp to a TDFA from
which it can either be applied directly to an input string or compiled
to D code.
Supported operators: . | ( ) ? * + ?? *? +?

The regexp-to-D compilation cannot be done at compile-time, though,
because the required data structures are too complex to be represented
in tuples or strings.

Compile regexp.d and run it for example like this:
regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234
This will output the TNFA, TDFA and D Code and try to match the 3 input
strings.
The file regextest.d is an example of how to use the generated code.

There is still a lot of room for optimization, but i think that
categorically, such a natively compiled, linear time regexp matcher
should be the fastest possible.

Comments and bug reports are encouraged ;)
April 15, 2007
Re: Fast regular expressions
Hi Jascha,

Since I found an important feature lacking in Phobos' regex  
implementation, I want to ask you: does your library support matching over  
new-lines?

In any case, good job on writing an improved regex library!

Regards.
April 15, 2007
Re: Fast regular expressions
"Aziz K." <aziz.kerim@gmail.com> wrote in message 
news:op.tqtketww545v36@vaio...
> Hi Jascha,
>
> Since I found an important feature lacking in Phobos' regex 
> implementation, I want to ask you: does your library support matching over 
> new-lines?

I think phobos' version does support this by using the attribute 'g':
char[] s = "strap a \nbla to bla\n";

auto ss = sub(s, " \nb", "ZZ", "g"); // result: strap aZZla to bla

Or I must have understood what you meant?

L.
April 15, 2007
Re: Fast regular expressions
> Or I must have understood what you meant?

Or I must have MISunderstood.. blergh :S

L.
April 15, 2007
Re: Fast regular expressions
thanks!

what exactly do you mean by "over new-lines"?
phobos' regexps will match newlines if stated explicitly. it can also
treat the input as multiple lines.

my engine doesn't care about special chars, yet. the point is merely to
improve on the algorithmic side. all the details with assertions,
special chars and charsets will be filled in when all the operators work.

Aziz K. wrote:
> Hi Jascha,
> 
> Since I found an important feature lacking in Phobos' regex
> implementation, I want to ask you: does your library support matching
> over new-lines?
> 
> In any case, good job on writing an improved regex library!
> 
> Regards.
April 15, 2007
Re: Fast regular expressions
I wonder if the Tango team would be interested in incorporating this 
work into their library? I was disappointed to see they've adopted the 
Phobos regular expression module, given I had some pretty serious gripes 
with it (seemingly erroneous behavior - I don't think this was ever 
resolved).
April 15, 2007
Re: Fast regular expressions
Jeff wrote:

> I wonder if the Tango team would be interested in incorporating this
> work into their library? I was disappointed to see they've adopted the
> Phobos regular expression module, given I had some pretty serious gripes
> with it (seemingly erroneous behavior - I don't think this was ever
> resolved).

I'd say we're likely to be interested once Jascha's library reaches a more
stable state. We are aware that current implementation (which is the same
in both Phobos and Tango) has issues. As with everything that goes into
Tango, someone needs to write it, and mantain it.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
April 15, 2007
Re: Fast regular expressions
this is also somewhat meant to be a response to Walter's post a few
weeks ago:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=47838

Jeff wrote:
> I wonder if the Tango team would be interested in incorporating this
> work into their library? I was disappointed to see they've adopted the
> Phobos regular expression module, given I had some pretty serious gripes
> with it (seemingly erroneous behavior - I don't think this was ever
> resolved).
April 15, 2007
Re: Fast regular expressions
> what exactly do you mean by "over new-lines"?
> phobos' regexps will match newlines if stated explicitly. it can also
> treat the input as multiple lines.

What I mean is that the dot meta character in Phobos' regexps doesn't  
match a new-line.
For example:

char[] s = "<p>ab\ncd</p>";
auto m = search(s, "<p>(.+)</p>", "gmi");
writefln(m.match(1)); // no match...

> my engine doesn't care about special chars, yet. the point is merely to
> improve on the algorithmic side. all the details with assertions,
> special chars and charsets will be filled in when all the operators work.

Yeah, I noticed that when I looked through your code a bit.
April 15, 2007
Re: Fast regular expressions
ah, ic. that will be parameterized later on so you can choose whether
to have the dot match it or not.
you can easily patch it into the alpha by changing charclass.d line 76 to
static const CharClass!(char_t) any_char = {parts: [{l:32, r:127},
{l:10,r:10}, {l:13,r:13}]};


Aziz K. wrote:
>> what exactly do you mean by "over new-lines"?
>> phobos' regexps will match newlines if stated explicitly. it can also
>> treat the input as multiple lines.
> 
> What I mean is that the dot meta character in Phobos' regexps doesn't
> match a new-line.
> For example:
> 
> char[] s = "<p>ab\ncd</p>";
> auto m = search(s, "<p>(.+)</p>", "gmi");
> writefln(m.match(1)); // no match...
> 
>> my engine doesn't care about special chars, yet. the point is merely to
>> improve on the algorithmic side. all the details with assertions,
>> special chars and charsets will be filled in when all the operators work.
> 
> Yeah, I noticed that when I looked through your code a bit.
« First   ‹ Prev
1 2
Top | Discussion index | About this forum | D home