Thread overview
[Issue 12690] std.regex BacktrackingMatcher bmatch is faster than ThompsonMatcher but discouraged
May 02, 2014
Dmitry Olshansky
May 04, 2014
Diez
May 04, 2014
Dmitry Olshansky
Sep 08, 2017
Dmitry Olshansky
Dec 17, 2022
Iain Buclaw
May 02, 2014
https://issues.dlang.org/show_bug.cgi?id=12690

Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com

--- Comment #1 from Dmitry Olshansky <dmitry.olsh@gmail.com> ---
For the test case in question on Win32 I get:

//With bmatch
D:\D>dvt4a
Time=3 secs, 802 ms, 707 μs, and 7 hnsecs, matches per second=1577823

//With match
D:\D>dvt4b
Time=4 secs, 274 ms, 43 μs, and 7 hnsecs, matches per second=1403822

Which is around 10%. Note that proper flags for optimized build are:

dmd -O -release -inline


On ldc proper flags (IIRC) are:

ldc -O3 -release

--- 

I would not even consider result w/o inlining as showing anything. Please compile your real-world program with the above flags.

---

On the test and results of such.

The provided test measures the overhead of starting an engine (memory allocation/deallocation eats ~25% of total time in both cases). Overall indeed BT has a bit less of work to do at start-up.

Generally it is known that BT matchers do have an advantage on patterns that tend to match, the problem with them is low stability (small change in pattern may change performance radically) and bad performance on ambiguous patterns.

I imagine the problem you are solving involves a number of patterns that is close to the number of haystacks. Even then a 10% slower start-up is no more then ~1.21% of total time.

P.S. There is an enhancement that may be of interest: https://issues.dlang.org/show_bug.cgi?id=12227

--
May 04, 2014
https://issues.dlang.org/show_bug.cgi?id=12690

--- Comment #2 from Diez <yxcvbasdfgqwert02@gmx.de> ---
Created attachment 1351
  --> https://issues.dlang.org/attachment.cgi?id=1351&action=edit
Source code for more complex test case (UTF-8)

Thank you for your help. I verified the compilation options of my real-world program and in fact the -inline option was missing (-O was already set). However, it didn't change anything at the overall performance, it's still 5 seconds (backtracking) versus 47 seconds (Thompson). With this magnitude in performance difference there's no chance to switch to the Thompson matcher.

The attached source is almost the same as before, but with average expression and haystack from my real-world program (confidential informations scrambled). Now it's not a small performance difference, but a factor of more than 5.

As I understand the meaning of the word "discouraged", the backtracking matcher is to be removed in some future. Please don't. As long as the Thompson matcher has this poor performance (for certain use cases), there's need for both matchers. In fact, using the Thompson matcher is way slower than an equivalent Java 1.7 program, while using the backtracking matcher is slightly faster. Of course, optimally there's only one matcher which combines the advantages of all underlaying matchers.

In my real-world program, there's about 100 times more haystacks than patterns (940 vs. 112_000). As soon as a pattern matches, the next haystack is examined. Each haystack needed about 23 tries (avg) until a pattern matched, actually there were 2_584_925 match calls in total of which 111_675 matched the haystack (a "few" haystacks didn't match any pattern).

Note that the attached test case source file needs to be saved in UTF-8 encoding, since the regular expression string contains UTF-8 characters (hopefully, they don't get destroyed during upload into the issue tracking system). The UTF-8 signature comment line should help Windows editors to automatically switch to UTF-8 encoding when opening the source file (like an UTF-8 BOM).

The idea of matching multiple patterns with only one call of the matcher is very interesting (though I didn't test it yet). Consider a 5 times faster matcher with 5 matches at once ...

--
May 04, 2014
https://issues.dlang.org/show_bug.cgi?id=12690

--- Comment #3 from Dmitry Olshansky <dmitry.olsh@gmail.com> ---
(In reply to Diez from comment #2)
> Created attachment 1351 [details]
> Source code for more complex test case (UTF-8)
> 
> Thank you for your help. I verified the compilation options of my real-world program and in fact the -inline option was missing (-O was already set). However, it didn't change anything at the overall performance, it's still 5 seconds (backtracking) versus 47 seconds (Thompson). With this magnitude in performance difference there's no chance to switch to the Thompson matcher.
> 
> The attached source is almost the same as before, but with average expression and haystack from my real-world program (confidential informations scrambled). Now it's not a small performance difference, but a factor of more than 5.
> 

I see. The fact that -inline doesn't change a thing is troublesome but lets leave that to compiler guys.

The new pattern makes things much clearer - as I suspected the preparation stage of the Thompson engine is the one to blame. Preparing an internal freelist alone takes ~1/3 of total run-time in this case. The figures are much higher this time about x3 difference.

I have a hunch that it allocates much more then needed in this case, might be a bug in counting up repetitions. Need to investigate further.


> As I understand the meaning of the word "discouraged", the backtracking matcher is to be removed in some future. Please don't. As long as the Thompson matcher has this poor performance (for certain use cases), there's need for both matchers. In fact, using the Thompson matcher is way slower than an equivalent Java 1.7 program, while using the backtracking matcher is slightly faster. Of course, optimally there's only one matcher which combines the advantages of all underlaying matchers.

Discouraged means exactly what you say - picking the type of underlying matcher
should be done internally based on pattern properties (or read that as one
meta-matcher). It occurred to me that
- There are more types of matchers, some are very limited in what they can.
Thus should't push it on user to understand when he/she can use them.
- More matchers - more functions doesn't scale.
- Also replace functions have the same NxM combinatorics (n methods, m
functions), there is also splitter.

I envision that when std.regex turns package, it will gradually expose internals for those who wants to make customize/their own matcher.

> In my real-world program, there's about 100 times more haystacks than patterns (940 vs. 112_000). As soon as a pattern matches, the next haystack is examined. Each haystack needed about 23 tries (avg) until a pattern matched, actually there were 2_584_925 match calls in total of which 111_675 matched the haystack (a "few" haystacks didn't match any pattern).
> 
> Note that the attached test case source file needs to be saved in UTF-8 encoding, since the regular expression string contains UTF-8 characters (hopefully, they don't get destroyed during upload into the issue tracking system). The UTF-8 signature comment line should help Windows editors to automatically switch to UTF-8 encoding when opening the source file (like an UTF-8 BOM).
> 
> The idea of matching multiple patterns with only one call of the matcher is very interesting (though I didn't test it yet).

Well, it's not yet implemented ;) But I expect to get to it done somewhere in May.

> Consider a 5 times faster
> matcher with 5 matches at once ...

--
September 08, 2017
https://issues.dlang.org/show_bug.cgi?id=12690

Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|nobody@puremagic.com        |dmitry.olsh@gmail.com

--
December 17, 2022
https://issues.dlang.org/show_bug.cgi?id=12690

Iain Buclaw <ibuclaw@gdcproject.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P4

--
December 01
https://issues.dlang.org/show_bug.cgi?id=12690

--- Comment #4 from dlangBugzillaToGithub <robert.schadek@posteo.de> ---
THIS ISSUE HAS BEEN MOVED TO GITHUB

https://github.com/dlang/phobos/issues/9632

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB

--