February 08, 2012 std.regex performance
I've finally moved to the new regex for some real code. I'm seeing a major change in performance when checking whether a large number of words contain a digit.
The english.dic file contains 134,950 entries.
With
2.056: 0.22sec
2.058: 7.65sec
I don't expect a correction for this would make it in 2.058 as it is likely an issue in 2.057.
--------
import std.file;
import std.string;
import std.datetime;
import std.regex;

private int[string] model;

void main() {
    auto name = "english.dic";
    foreach(w; std.file.readText(name).toLower.splitLines)
        model[w] += 1;

    foreach(w; std.string.split(readText(name)))
        if(!match(w, regex(r"\d")).empty)
        {}
}

February 08, 2012 Re: std.regex performance
Posted in reply to Jesse Phillips | On 2/8/12 10:44 PM, Jesse Phillips wrote:
> foreach(w; std.string.split(readText(name)))
> if(!match(w, regex(r"\d")).empty)
> {}
> }
Could it be that you are rebuilding the regex engine on every iteration here?
David

February 08, 2012 Re: std.regex performance
Posted in reply to Jesse Phillips | Here's some runs on win32, -release -O -inline (I just generated 134,950 duplicate words):
2.054: 82 msecs (467KB exe)
2.055: 77 msecs (505KB exe)
2.056: 84 msecs (1095KB exe)
2.057: 3380 msecs (1179KB exe)
2.058: 3373 msecs (630KB exe)
Compile times are different too:
2.054: timeit dmd -release -O -inline test.d Elapsed Time: 0:00:01.281
2.055: timeit dmd -release -O -inline test.d Elapsed Time: 0:00:01.500
2.056: same
2.057: timeit dmd -release -O -inline test.d Elapsed Time: 0:00:06.296
2.057: timeit dmd -release -O -inline test.d Elapsed Time: 0:00:08.093

February 08, 2012 Re: std.regex performance
Posted in reply to David Nadlinger | On 2/8/12, David Nadlinger <see@klickverbot.at> wrote:
> On 2/8/12 10:44 PM, Jesse Phillips wrote:
>> foreach(w; std.string.split(readText(name)))
>> if(!match(w, regex(r"\d")).empty)
>> {}
>> }
>
> Could it be that you are rebuilding the regex engine on every iteration here?
Yup. This one is a steady 80 msecs (ctRegex seems to be a bit slower in this case for some reason):
import std.file;
import std.datetime;
import std.regex;
import std.stdio;
import std.string; // needed for std.string.split

private int[string] model;

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    // Build the regex once, outside the loop.
    auto rg = regex(r"\d");
    foreach (w; std.string.split(readText("uk.dic")))
        if (!match(w, rg).empty)
        {
        }
    writeln(sw.peek.msecs);
}
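For anyone who wants to reproduce the difference without the dictionary file, here is a rough sketch that times both variants side by side. It is not from the thread: it assumes a recent Phobos where StopWatch lives in std.datetime.stopwatch, and the word list is synthetic stand-in data. Timings will vary by compiler and library version, and newer std.regex releases may narrow the gap.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.regex : match, regex;
import std.stdio : writeln;

void main()
{
    // Synthetic stand-in for the dictionary file (hypothetical data):
    // every tenth word contains a digit.
    string[] words;
    foreach (i; 0 .. 10_000)
        words ~= (i % 10 == 0) ? "word1" : "word";

    size_t hitsInside, hitsHoisted;

    // Variant 1: regex() is called on every iteration.
    auto sw = StopWatch(AutoStart.yes);
    foreach (w; words)
        if (!match(w, regex(r"\d")).empty)
            ++hitsInside;
    auto inside = sw.peek;

    // Variant 2: regex() is called once, before the loop.
    sw.reset(); // the stopwatch keeps running after reset
    auto re = regex(r"\d");
    foreach (w; words)
        if (!match(w, re).empty)
            ++hitsHoisted;
    auto hoisted = sw.peek;

    // Both variants must agree on the number of matching words.
    assert(hitsInside == hitsHoisted);
    writeln("regex() inside loop: ", inside.total!"msecs", " ms");
    writeln("regex() hoisted:     ", hoisted.total!"msecs", " ms");
}
```

Building with the same flags as the runs above (-release -O -inline) keeps the numbers comparable.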

February 08, 2012 Re: std.regex performance
Posted in reply to David Nadlinger | David Nadlinger:
> Could it be that you are rebuilding the regex engine on every iteration here?
This isn't the first time I've seen this problem.
CPython caches the RE engines to avoid this problem, so after the first time you use it, it doesn't create the same engine again and again.
Another solution is to change the API, and use a long and ugly function name like buildREengine() that makes it clear to the user that it's better to pull it out of loops :-)
Bye,
bearophile
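
The caching idea mentioned above could be sketched in user code roughly like this. It is only an illustration: cachedRegex is a hypothetical helper, not part of std.regex, and the cache is a plain thread-local associative array with no size limit.

```d
import std.regex : Regex, match, regex;
import std.stdio : writeln;

alias Re = Regex!char;

// Hypothetical helper (not a Phobos function): compile each pattern
// once and return the stored engine on later calls.
Re cachedRegex(string pattern)
{
    static Re[string] cache; // static locals are thread-local in D
    if (auto hit = pattern in cache)
        return *hit;
    auto re = regex(pattern);
    cache[pattern] = re;
    return re;
}

void main()
{
    // The pattern is compiled on the first call only, even though
    // cachedRegex() is called on every iteration.
    foreach (w; ["abc", "a1c", "42"])
        writeln(w, ": ", !match(w, cachedRegex(`\d`)).empty);
}
```

With a helper like this the call can stay inside the loop at the cost of one associative-array lookup per iteration. Pulling the regex out of the loop is still the simpler fix.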

February 08, 2012 Re: std.regex performance
Posted in reply to bearophile | On 02/08/2012 11:46 PM, bearophile wrote:
> David Nadlinger:
>
>> Could it be that you are rebuilding the regex engine on every iteration
>> here?
>
> This isn't the first time I see this problem.
> CPython caches the RE engines to avoid this problem, so after the first time you use it, it doesn't create the same engine again and again.
>
> Another solution is to change the API, and use a long and ugly function name like buildREengine() that makes it clear to the user that it's better to pull it out of loops :-)
>
> Bye,
> bearophile
The compiler should do loop invariant code motion for pure functions.

February 08, 2012 Re: std.regex performance
Posted in reply to Timon Gehr | Timon Gehr:
> The compiler should do loop invariant code motion for pure functions.
Right, that too. But in DMD 2.058alpha this doesn't compile, regex() isn't pure:
import std.regex;
void main() pure {
    auto re = regex(r"\d");
}
Bye,
bearophile

February 08, 2012 Re: std.regex performance
Posted in reply to David Nadlinger | On Wednesday, 8 February 2012 at 22:21:35 UTC, David Nadlinger wrote:
> On 2/8/12 10:44 PM, Jesse Phillips wrote:
>> foreach(w; std.string.split(readText(name)))
>> if(!match(w, regex(r"\d")).empty)
>> {}
>> }
>
> Could it be that you are rebuilding the regex engine on every iteration here?
>
> David
That is the case. The older regex apparently cached the last regex. I will be more careful in the future.

February 09, 2012 Re: std.regex performance
Posted in reply to Jesse Phillips | On Wed, 08 Feb 2012 22:44:25 +0100, Jesse Phillips <jessekphillips+D@gmail.com> wrote:
> I've finally moved to the new regex for some real code. I'm seeing a major change in performance when checking if a large number of words contain a digit.
>
> The english.dic file contains 134,950 entries
>
> With
> 2.056: 0.22sec
> 2.058: 7.65sec
>
> I don't expect a correction for this would make it in 2.058 as it is likely an issue in 2.057.
>
> --------
> import std.file;
> import std.string;
> import std.datetime;
> import std.regex;
>
> private int[string] model;
>
> void main() {
> auto name = "english.dic";
> foreach(w; std.file.readText(name).toLower.splitLines)
> model[w] += 1;
>
> foreach(w; std.string.split(readText(name)))
> if(!match(w, regex(r"\d")).empty)
> {}
> }
>
There are some more performance issues. D has a nice built-in profiler to find such issues.
----------
import std.algorithm, std.stdio, std.string, std.path, std.regex;

private int[string] model;

int main(string[] args)
{
    if (args.length != 2)
    {
        std.stdio.stderr.writefln("usage: %s <file>", std.path.baseName(args[0]));
        return 1;
    }

    // Build the regex once, outside the loops.
    auto re = std.regex.regex(r"\d");
    // Read line by line instead of loading the whole file twice.
    foreach(line; std.stdio.File(args[1], "r").byLine())
    {
        // Bug 6791: splitter is UTF-8 unsafe
        foreach(w; std.algorithm.splitter(line))
        {
            if(!std.regex.match(w, re).empty)
            {
            }
        }
        std.string.toLowerInPlace(line);
        // byLine reuses its buffer, so copy the line before using it as a key.
        model[line.idup] += 1;
    }
    return 0;
}

February 09, 2012 Re: std.regex performance
Posted in reply to Jesse Phillips | On 09.02.2012 3:35, Jesse Phillips wrote:
> On Wednesday, 8 February 2012 at 22:21:35 UTC, David Nadlinger wrote:
>> On 2/8/12 10:44 PM, Jesse Phillips wrote:
>>> foreach(w; std.string.split(readText(name)))
>>> if(!match(w, regex(r"\d")).empty)
>>> {}
>>> }
>>
>> Could it be that you are rebuilding the regex engine on every
>> iteration here?
>>
>> David
>
> That is the case. The older regex apparently cached the last regex. I will
> be more careful in the future.
I suggest filing this as an enhancement request, as the new std.regex should have been backwards compatible.
--
Dmitry Olshansky
Copyright © 1999-2021 by the D Language Foundation