February 06, 2007
Walter Bright <newshound@digitalmars.com> wrote

> If it fails, then fall back to the original method.

It is a ridiculous business to incorporate additional algorithms just because the current implementation is broken.

The well-known chain of converting RE -> NFA -> DFA -> minimal DFA
should handle that problem quickly, because the minimal DFA for an RE of
the form ((a?)^n a^n) for fixed n is a chain of 2n+1 nodes, which
corresponds to one node per character of the longest acceptable input
(2n characters) plus the final state.

In this chain the first n nodes are non-final and all the others are final.
The first node is the start node.
Every node except the last has a transition to the next node on "a".

That is, recognition time is linear in the length of the input: no exponential behaviour.
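
To make that concrete, here is a small sketch in Python (just an illustration, nothing from std.regexp; the function name is made up for this example) of that chain and of the single left-to-right pass over the input:

  # Minimal DFA for (a?)^n a^n: a chain of states 0 .. 2n, where state i
  # means "i a's read so far".  States n .. 2n are final.
  def match_chain(text, n):
      state = 0
      for ch in text:
          if ch != 'a' or state == 2 * n:
              return False            # no outgoing transition: reject
          state += 1                  # the only transition: next node on 'a'
      return state >= n               # the last n+1 nodes are final

  # The input that makes a backtracking matcher blow up is handled in O(n):
  n = 30
  print(match_chain('a' * n, n))      # True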

-manfred
February 06, 2007
Manfred Nowak wrote:
> Walter Bright <newshound@digitalmars.com> wrote
> 
>> If it fails, then fall back to the original method.

My understanding is that backtracking is difficult to do with the DFA.
February 06, 2007
Walter Bright <newshound@digitalmars.com> wrote

> My understanding is that backtracking is difficult to do with the DFA.

When converting _one_ regular expression to a DFA there is never any need to backtrack.

The example considered in the article is roughly equivalent to
a^n (a?)^n. There is no need for backtracking, and the normal tool chain
handles this.

Also, for maximal munching over several regular expressions there is no need for backtracking.
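
As a sketch of that (Python; the DFA table and token names below are invented, standing in for the combined DFA of the token expressions): run the DFA once over the remaining input, remember the position of the last final state that was reached, and cut the token there, so alternatives are never retried.

  # Combined DFA for two token REs: the keyword "if" and identifiers [a-z]+.
  import string

  LETTERS = set(string.ascii_lowercase)

  def step(state, ch):
      if ch not in LETTERS:
          return None                          # no transition
      if state == 0:
          return 1 if ch == 'i' else 3
      if state == 1:
          return 2 if ch == 'f' else 3
      return 3                                 # states 2 and 3: any letter

  ACCEPT = {1: 'IDENT', 2: 'IF', 3: 'IDENT'}   # final states and their tokens

  def tokenize(text):
      i, tokens = 0, []
      while i < len(text):
          if text[i] == ' ':
              i += 1
              continue
          state, j, last = 0, i, None          # last = (token, end) at last final state
          while j < len(text):
              nxt = step(state, text[j])
              if nxt is None:
                  break
              state, j = nxt, j + 1
              if state in ACCEPT:
                  last = (ACCEPT[state], j)
          if last is None:
              raise ValueError('no token at position %d' % i)
          kind, end = last
          tokens.append((kind, text[i:end]))
          i = end                              # maximal munch: longest match wins
      return tokens

  print(tokenize('if iffy x'))  # [('IF', 'if'), ('IDENT', 'iffy'), ('IDENT', 'x')]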

-manfred

February 08, 2007
Walter Bright wrote:
> Tomas Lindquist Olsen wrote:
>> http://swtch.com/~rsc/regexp/regexp1.html
> 
> That looks pretty cool. Anyone care to take a stab at implementing this in std.regexp? It could be done by examining the generated bytecode and attempting to convert it to an NFA. If that succeeds, then great, do the NFA. If it fails, then fall back to the original method.

A little while back, at my job, I implemented a regex parser (in Java) that builds an NFA and then transforms it into a DFA, re-serializing the DFA back into a regex for consumption by any old regex engine.

For example, this regex takes on NFA performance characteristics when compiled by a typical (Java, Perl, Python, etc.) regex engine:

  (?:cat|car|cut|cute|cumbersome)

This regex is semantically identical (meaning that it will produce exactly the same set of matches as the expression above), but it takes on DFA performance characteristics in the regex engine:

  c(?:a[tr]|u(?:te?|mbersome))

Since I needed my regular expressions to execute in an ordinary regex engine, I couldn't tweak the engine internals, but I could trick the regex engine into behaving like a DFA.
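
A much smaller sketch of the same kind of transformation (Python; this is not the Java tool described above, it only handles a flat alternation of literal words by sharing prefixes through a trie, and it assumes the words contain no regex metacharacters, but it yields the same style of factored expression, with (?:t|r) in place of the [tr] shorthand):

  import re

  def build_trie(words):
      root = {}
      for w in words:
          node = root
          for ch in w:
              node = node.setdefault(ch, {})
          node[''] = {}                        # '' marks end-of-word
      return root

  def emit(node):
      """Serialize a trie node back into a regex fragment."""
      end = '' in node
      branches = [ch + emit(child) for ch, child in node.items() if ch != '']
      if not branches:
          return ''
      if len(branches) == 1:
          body = branches[0]
          if end:                              # a word may stop here: rest is optional
              return '(?:%s)?' % body if len(body) > 1 else body + '?'
          return body
      body = '(?:%s)' % '|'.join(branches)
      return body + '?' if end else body

  words = ['cat', 'car', 'cut', 'cute', 'cumbersome']
  factored = emit(build_trie(words))
  print(factored)                              # c(?:a(?:t|r)|u(?:te?|mbersome))

  # Sanity check: same set of matches as the flat alternation.
  flat = re.compile('^(?:%s)$' % '|'.join(words))
  fact = re.compile('^%s$' % factored)
  for w in words + ['cu', 'ca', 'cuumbersome']:
      assert bool(fact.match(w)) == bool(flat.match(w))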

The major caveat to this approach is that parenthetical groups get jumbled. If you use parentheses to capture text (rather than just to match it), the numbering of your backreferences is going to get all screwed up. When I use this approach, I almost always use noncapturing parentheses (?: ... ) rather than ordinary parens.

However, if you had control of the regex engine itself (rather than just feeding tweaked expressions into an ordinary engine), you could easily implement DFA behavior characteristics without changing the capturing semantics.

I'd implement this myself for D, but I'm afraid my NDA with my employer would probably prevent me from doing so. The theory is straightforward, though, so it should be do-able by anyone with the appropriate RE experience.

--Benji Smith
February 08, 2007
Manfred Nowak wrote:
> Also, for maximal munching over several regular expressions there is no need for backtracking.

Maximal munch, non-greedy matching and negative matches would usually
use backtracking. Doing it without backtracking would require using
several DFAs - or is there a simpler method?

AFAIK, the textbook way to implement backtracking is with deterministic
pushdown automata (PDAs), which can also be interpreted efficiently or
compiled to bytecode.
So I don't see why there shouldn't be backtracking in the regexp
implementation.
February 09, 2007
Jascha Wetzel wrote:
> Manfred Nowak wrote:
>> Also, for maximal munching over several regular expressions there is no need for backtracking.
> 
> Maximal munch, non-greedy matching and negative matches would usually
> use backtracking. Doing it without backtracking would require using
> several DFAs - or is there a simpler method?
> 
> AFAIK, the textbook way to implement backtracking is with deterministic
> pushdown automata (PDAs), which can also be interpreted efficiently or
> compiled to bytecode.
> So I don't see why there shouldn't be backtracking in the regexp
> implementation.

The problem is that most regex engines (in fact, all of them, as far as I know) use backtracking even when a DFA implementation could be used to determine (much earlier in the search process) that a match is impossible.

A backtracking approach might be necessary with positive and negative lookaround assertions (though I still suspect that a DFA could be used even in these cases), but every mainstream regex engine uses a backtracking approach even for simple choice situations like (cat|car) where a DFA would be much more appropriate.

And, by the way, it's possible to create a DFA implementation that degrades into a backtracking matcher for certain subexpressions.

--benji
February 09, 2007
Benji Smith wrote:
> The problem is that most regex engines (in fact, all of them, as far as I know) use backtracking even when a DFA implementation could be used to determine (much earlier in the search process) that a match is impossible.

This has been linked to in this thread, but Ken Thompson’s algorithm does /not/ use backtracking, and there are open-source implementations around; see <http://www.swtch.com/~rsc/regexp/regexp1.html>.
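
To give a feel for how that works, here is a tiny sketch of the state-set simulation (Python; a hand-built NFA for the (a?)^n a^n example from earlier in the thread, not code from the linked article): keep the set of all states the NFA could currently be in and advance the whole set one character at a time, so the run time is O(input length * number of states) and there is no backtracking.

  def make_nfa(n):
      """NFA for (a?)^n a^n: states 0 .. 2n, final state 2n."""
      eps = {i: {i + 1} for i in range(n)}           # a? can be skipped: epsilon edge
      on_a = {i: {i + 1} for i in range(2 * n)}      # every a? and a can consume 'a'
      return eps, on_a, 0, {2 * n}

  def eps_closure(states, eps):
      stack, seen = list(states), set(states)
      while stack:
          s = stack.pop()
          for t in eps.get(s, ()):
              if t not in seen:
                  seen.add(t)
                  stack.append(t)
      return seen

  def nfa_match(text, nfa):
      eps, on_a, start, accept = nfa
      current = eps_closure({start}, eps)
      for ch in text:
          if ch != 'a' or not current:
              return False
          moved = {t for s in current for t in on_a.get(s, ())}
          current = eps_closure(moved, eps)
      return bool(current & accept)

  n = 30
  print(nfa_match('a' * n, make_nfa(n)))             # True, with no blow-up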

--Joel