Fast regular expressions (page 2)

April 16, 2007

Re: Fast regular expressions

Posted by Davidl
in reply to Jascha Wetzel

Permalink

Davidl

Posted in reply to Jascha Wetzel

Permalink

will u please add some feature of fuzzy matching? like
with user definable error char number.
for example,
in fuzzy match
people can match ab in abc or acb with 1char errorness
and they can match ab in acdb or aefb with 2 char errorness

> Here is an alpha release of a regular expression engine i'm working on:
> http://mainia.de/tdfaregexp-0.1-alpha.zip
> (artistic license 2.0)
>
> It implements a slightly modified version of Ville Laurikari's tagged
> NFA method that allows for regular expression parsing in linear time as
> opposed to the very common backtracking method that requires exponential
> time (worst case).
>
> As of now, the engine supports greedy and non-greedy operators,
> submatches and character classes. It compiles a regexp to a TDFA from
> which it can either be applied directly to an input string or compiled
> to D code.
> Supported operators: . | ( ) ? * + ?? *? +?
>
> The regexp-to-D compilation cannot be done at compile-time, though,
> because the required data structures are too complex to be represented
> in tuples or strings.
>
> Compile regexp.d and run it for example like this:
> regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234
> This will output the TNFA, TDFA and D Code and try to match the 3 input
> strings.
> The file regextest.d is an example of how to use the generated code.
>
> There is still a lot of room for optimization, but i think that
> categorically, such a natively compiled, linear time regexp matcher
> should be the fastest possible.
>
> Comments and bug reports are encouraged ;)

It seems to me that supporting fuzzy matching in a general regular expression library would add a whole lot of complexity that wouldn't be used by many people. I think it would be hard to justify this. Were you really adamant about having regular expression matching, or more focussed on simple fuzzy matching? There are probably some libraries already that can search for fuzzy matches pretty well. If you can't find anything that meets your needs, you might consider having a look at http://en.wikipedia.org/wiki/Bitap_algorithm and http://en.wikipedia.org/wiki/Levenshtein_distance - they could be useful if you decide to write your own.

great work! Jascha Wetzel <[firstname]@mainia.de> Wrote: > Here is an alpha release of a regular expression engine i'm working on: > http://mainia.de/tdfaregexp-0.1-alpha.zip > (artistic license 2.0) > > It implements a slightly modified version of Ville Laurikari's tagged NFA method that allows for regular expression parsing in linear time as opposed to the very common backtracking method that requires exponential time (worst case). > > As of now, the engine supports greedy and non-greedy operators, > submatches and character classes. It compiles a regexp to a TDFA from > which it can either be applied directly to an input string or compiled > to D code. > Supported operators: . | ( ) ? * + ?? *? +? > > The regexp-to-D compilation cannot be done at compile-time, though, because the required data structures are too complex to be represented in tuples or strings. > > Compile regexp.d and run it for example like this: > regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234 > This will output the TNFA, TDFA and D Code and try to match the 3 input > strings. > The file regextest.d is an example of how to use the generated code. > > There is still a lot of room for optimization, but i think that categorically, such a natively compiled, linear time regexp matcher should be the fastest possible. > > Comments and bug reports are encouraged ;)

Forums