Thread overview
Poor regex performance?
Apr 04, 2019
Julian
Apr 04, 2019
rikki cattermole
Apr 04, 2019
Julian
Apr 04, 2019
Stefan Koch
Apr 04, 2019
Jon Degenhardt
Apr 04, 2019
XavierAP
Apr 04, 2019
H. S. Teoh
April 04, 2019
The following code, which just runs a regex against a large exim log
to report on top senders, is 140 times slower than similar C code using
PCRE when compiled with just -O. With a bunch of other flags I got it
down to only 13x slower than C code that's using libc regcomp/regexec.

  import std.stdio, std.string, std.regex, std.array, std.algorithm;

  T min(T)(T a, T b) {
          if (a < b) return a;
          return b;
  }

  void main() {
          ulong[string] emailcounts;
          auto re = ctRegex!(r"(?:\S+ ){3,4}<= ([^@]+@(\S+))");

          foreach (line; File("exim_mainlog").byLine()) {
                  auto m = line.match(re);
                  if (m) {
                          ++emailcounts[m.front[1].idup];
                  }
          }

          string[] senders = emailcounts.keys;
          sort!((a, b) { return emailcounts[a] > emailcounts[b]; })(senders);
          foreach (i; 0 .. min(senders.length, 5)) {
                  writefln("%5s %s", emailcounts[senders[i]], senders[i]);
          }
  }

The other code is available at https://github.com/jrfondren/topsender-bench
With PCRE and getline(), I get D down to 1.2x slower.

I wrote this part of the way through chapter 1 of "The D Programming Language",
so my question is mainly: is this a fair result? Is std.regex really this slow,
so that I should reach for PCRE if regex speed matters? Or is this code severely
flawed somehow? I'm using a random production log; I'm not trying to make things
difficult.

Relatedly, how can I add custom compiler flags to rdmd, in a D script?
For example, -L-lpcre
April 04, 2019
If you need performance, use ldc rather than dmd (which I assume you are using).

LLVM's optimizer produces code that is many times better than what dmd produces.
April 04, 2019
On Thursday, 4 April 2019 at 09:53:06 UTC, Julian wrote:
>
> Relatedly, how can I add custom compiler flags to rdmd, in a D script?
> For example, -L-lpcre

Use the "DFLAGS" configuration variable. On Windows you can set it in the sc.ini file; on Linux, in dmd.conf (see https://dlang.org/dmd-linux.html).
April 04, 2019
On Thursday, 4 April 2019 at 09:57:26 UTC, rikki cattermole wrote:
> If you need performance, use ldc rather than dmd (which I assume you are using).
>
> LLVM's optimizer produces code that is many times better than what dmd produces.

Thanks! I already had dmd installed from a brief look at D a long
time ago, so I missed the details at https://dlang.org/download.html

ldc2 -O3 does a lot better, but the result is still 30x slower
without PCRE.
April 04, 2019
On Thursday, 4 April 2019 at 10:31:43 UTC, Julian wrote:
> On Thursday, 4 April 2019 at 09:57:26 UTC, rikki cattermole wrote:
>> If you need performance, use ldc rather than dmd (which I assume you are using).
>>
>> LLVM's optimizer produces code that is many times better than what dmd produces.
>
> Thanks! I already had dmd installed from a brief look at D a long
> time ago, so I missed the details at https://dlang.org/download.html
>
> ldc2 -O3 does a lot better, but the result is still 30x slower
> without PCRE.

You need to disable the GC by importing core.memory : GC
and calling GC.disable().

The next thing is to avoid the .idup and cast to string instead.
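
A minimal sketch of the GC suggestion, for illustration only. The helper name countLines and the sample addresses are made up, and instead of the cast mentioned above it swaps in a different technique: look the key up first and call .idup only when the key is new, which also cuts allocations while keeping the stored keys valid if the input buffer is reused.

```d
import core.memory : GC;
import std.stdio : writeln;

// Hypothetical helper (not from the original post): counts how often
// each line occurs while automatic GC collections are suspended.
ulong[string] countLines(const(char[])[] lines)
{
    GC.disable();              // suspend automatic collections
    scope (exit) GC.enable();  // re-enable them when the function returns

    ulong[string] counts;
    foreach (line; lines)
    {
        // Look up with the borrowed slice; copy it only for a new key.
        if (auto p = line in counts)
            ++*p;
        else
            counts[line.idup] = 1;
    }
    return counts;
}

void main()
{
    // Sample data is invented for the example.
    auto counts = countLines(["a@x.org", "b@y.org", "a@x.org"]);
    writeln(counts["a@x.org"]); // prints 2
}
```

Whether this helps depends on how much of the run time is actually spent in allocation and collection, so it is worth measuring before and after.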

April 04, 2019
On Thu, Apr 04, 2019 at 09:53:06AM +0000, Julian via Digitalmars-d-learn wrote: [...]
>           auto re = ctRegex!(r"(?:\S+ ){3,4}<= ([^@]+@(\S+))");
[...]

ctRegex is a crock; use regex() instead and it might actually work
better.
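
For reference, a small sketch of what the regex() variant could look like. The helper name senderOf and the sample log line are invented for illustration; only the pattern comes from the original post. The regex is built once at run time with regex() and then reused for every line.

```d
import std.regex : regex, matchFirst;
import std.stdio : writeln;

// Hypothetical helper: returns the sender address captured by group 1,
// or null when the line does not match.
string senderOf(R)(const(char)[] line, R re)
{
    auto m = matchFirst(line, re);
    return m.empty ? null : m[1].idup;
}

void main()
{
    // Compiled at run time instead of via ctRegex!(...).
    auto re = regex(r"(?:\S+ ){3,4}<= ([^@]+@(\S+))");

    // Made-up line in roughly the shape the pattern expects.
    auto line = "2019-04-04 09:53:06 1hBxyz-0001-AB <= alice@example.org H=host";
    writeln(senderOf(line, re));
}
```

Switching between the two is a one-line change, so timing both on the real log is the easiest way to see which one is faster for this pattern.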


T

-- 
Stop staring at me like that! It's offens... no, you'll hurt your eyes!
April 04, 2019
On Thursday, 4 April 2019 at 10:31:43 UTC, Julian wrote:
> On Thursday, 4 April 2019 at 09:57:26 UTC, rikki cattermole wrote:
>> If you need performance, use ldc rather than dmd (which I assume you are using).
>>
>> LLVM's optimizer produces code that is many times better than what dmd produces.
>
> Thanks! I already had dmd installed from a brief look at D a long
> time ago, so I missed the details at https://dlang.org/download.html
>
> ldc2 -O3 does a lot better, but the result is still 30x slower
> without PCRE.

Try:
    ldc2 -O3 -release -flto=thin -defaultlib=phobos2-ldc-lto,druntime-ldc-lto -enable-inlining

This will improve inlining and optimization across the runtime library boundaries. This can help in certain types of code.