[Issue 7471] New: Improve performance of std.regex
February 09, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471

           Summary: Improve performance of std.regex
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: Jesse.K.Phillips+D@gmail.com


--- Comment #0 from Jesse Phillips <Jesse.K.Phillips+D@gmail.com> 2012-02-09 09:27:58 PST ---
The previous implementation is said to do some caching of the last used engine. The english.dic file used for these timings has 134,950 entries.

Test code
----------
import std.file;
import std.string;
import std.datetime;
import std.regex;

private int[string] model;

void main() {
   auto name = "english.dic";
   foreach(w; std.file.readText(name).toLower.splitLines)
      model[w] += 1;

   // Both regexes are deliberately rebuilt on every iteration to avoid caching.
   foreach(w; std.string.split(readText(name)))
      if(!match(w, regex(r"\d")).empty)
      {}
      else if(!match(w, regex(r"\W")).empty)
      {}
}
-------

I'm trying to avoid the caching here, but I still see better performance in 2.056. These timings were taken with MinGW on Windows. I find it odd that user time is low while real time is the slow part; does MinGW have access to the proper information?

$ time ./test2.056.exe

real    0m0.860s
user    0m0.047s
sys     0m0.000s

$ time ./test2.058.exe

real    0m55.500s
user    0m0.031s
sys     0m0.000s
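
Since the shell's `time` figures are in doubt here, one way to cross-check them is to measure wall-clock time inside the program itself. The following is only a sketch: it assumes a recent Phobos that provides `std.datetime.stopwatch`, and `timeMatches` and the three sample words are hypothetical stand-ins for the dictionary loop above.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.regex;
import std.stdio;

// Hypothetical helper: runs the same per-iteration regex construction
// as the test code and returns the elapsed wall-clock time.
Duration timeMatches(string[] words)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (w; words)
    {
        // Regexes are rebuilt on every iteration, as in the original test.
        if (!match(w, regex(`\d`)).empty)
        {}
        else if (!match(w, regex(`\W`)).empty)
        {}
    }
    return sw.peek;
}

void main()
{
    // Sample input only; the report used the words of english.dic.
    auto elapsed = timeMatches(["hello", "r2d2", "it's"]);
    writeln("wall-clock: ", elapsed.total!"msecs", " ms");
}
```

This reports elapsed real time only, so it does not by itself explain the gap between the user and real figures.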

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
February 24, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com


--- Comment #1 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-02-24 11:14:52 PST ---
I'm willing to investigate the issue. Can you attach the english.dic file?

February 24, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471


dawg@dawgfoto.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dawg@dawgfoto.de


--- Comment #2 from dawg@dawgfoto.de 2012-02-24 13:48:57 PST ---
You are compiling two different regexes, so a single-entry cache will only solve part of your problem.
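
To illustrate the point, here is a minimal sketch of a cache with one slot per pattern, so that two alternating patterns do not evict each other. `cachedRegex` and `compileCount` are hypothetical names introduced for this example; this is not how std.regex itself is implemented.

```d
import std.regex;

// Hypothetical counter, used only to show how often a pattern is compiled.
size_t compileCount;

// Hypothetical helper: keeps one compiled Regex per distinct pattern string.
Regex!char cachedRegex(string pattern)
{
    static Regex!char[string] cache; // thread-local associative array
    if (auto hit = pattern in cache)
        return *hit;                 // reuse the already compiled engine
    auto compiled = regex(pattern);
    ++compileCount;
    cache[pattern] = compiled;
    return compiled;
}

void main()
{
    foreach (w; ["abc", "a1", "a-b"])
    {
        // The two patterns alternate, yet each is compiled only once.
        if (!matchFirst(w, cachedRegex(`\d`)).empty)
        {}
        else if (!matchFirst(w, cachedRegex(`\W`)).empty)
        {}
    }
    assert(compileCount == 2);
}
```

Each lookup still costs a hash of the pattern string plus a copy of the `Regex` struct on every iteration.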

February 25, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471


Jesse Phillips <Jesse.K.Phillips+D@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Jesse.K.Phillips+D@gmail.com


--- Comment #3 from Jesse Phillips <Jesse.K.Phillips+D@gmail.com> 2012-02-24 18:02:05 PST ---
The exact file isn't important, and I can't get it right now, but you could grab a similar one from http://www.winedt.org/Dict/

I realize that the example given avoids the benefit of single-entry caching, but 2.056 does perform better on it, and that level of performance is probably worth working towards.

February 26, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471



--- Comment #4 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-02-26 02:22:02 PST ---
Profiling shows that about 99% of the time is spent in the GC, ouch.
What's at work here is that the new regex engine is more costly to create and
allocates a bunch of structures on the heap. The biggest of them, e.g. Tries,
are cached, but the others are not.
I think I'll spend some time on introducing more caching and probably seek out
some GC-unfriendly stuff in the parser.
Still, I should point out that \d and \W in the new engine are Unicode-aware and
correspond to MUCH broader character classes than in the previous engine (that
belongs in the ddocs somewhere).
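
A small sketch of what the Unicode-aware `\d` means in practice, assuming a std.regex version with the new engine. `hasDigit` and `hasAsciiDigit` are hypothetical helpers written for this example.

```d
import std.regex;

// Hypothetical helper: true if the string contains any Unicode decimal digit.
bool hasDigit(string s)
{
    return !matchFirst(s, regex(`\d`)).empty;
}

// Hypothetical helper: true only for the ASCII digits 0 through 9.
bool hasAsciiDigit(string s)
{
    return !matchFirst(s, regex(`[0-9]`)).empty;
}

void main()
{
    assert(hasDigit("a1"));
    // U+0663 ARABIC-INDIC DIGIT THREE is a digit to \d but not to [0-9].
    assert(hasDigit("\u0663"));
    assert(!hasAsciiDigit("\u0663"));
}
```

If only ASCII digits matter, an explicit class such as `[0-9]` states that intent directly.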

February 26, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471



--- Comment #5 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-02-26 06:32:30 PST ---
Anyway, how do 2.056 and 2.058 compare when you don't create regex objects
inside a tight loop?
It is a strange thing to do under any circumstances; even with N-slot caching
you pay some extra on each iteration to look up and copy out the compiled
regex needed.

I'm dreaming that one day the compiler can just see it's a loop invariant and
move it out for you.
Hm.. that could happen sometime soon: if 'regex' is pure and its result is
immutable, the compiler would have the guarantees it needs to go ahead and
optimize.
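
Until the compiler can do that, the hoisting has to be done by hand. A minimal sketch follows; `Counts`, `classify` and the sample words are hypothetical and stand in for the dictionary loop from the report.

```d
import std.regex;
import std.stdio;

// Hypothetical result type for this example.
struct Counts
{
    size_t withDigit;
    size_t withNonWord;
}

// Hypothetical helper: compiles both regexes once, then reuses them
// for every word instead of calling regex() inside the loop.
Counts classify(string[] words)
{
    auto digit = regex(`\d`);   // loop invariant, built once
    auto nonWord = regex(`\W`); // loop invariant, built once

    Counts c;
    foreach (w; words)
    {
        if (!matchFirst(w, digit).empty)
            ++c.withDigit;
        else if (!matchFirst(w, nonWord).empty)
            ++c.withNonWord;
    }
    return c;
}

void main()
{
    auto c = classify(["hello", "r2d2", "it's"]);
    writeln(c.withDigit, " ", c.withNonWord);
}
```

With the regexes built once, the per-iteration cost is the match itself, with no engine construction and no cache lookup.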

February 27, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471



--- Comment #6 from Jesse Phillips <Jesse.K.Phillips+D@gmail.com> 2012-02-26 18:03:06 PST ---
After moving the regex outside the loop, along with (I think) some other changes, performance improved immensely. Declaring them as module variables didn't seem to gain anything more. I didn't have time to play with it much further; it was acceptable, though I hope to do more with regex and just need to watch out for tight loops.
