Poor regex performance?
April 04
The following code, which just runs a regex against a large exim log
to report on the top senders, is 140 times slower than similar C code using
PCRE when compiled with just -O. With a bunch of other flags I got it
down to only 13x slower than C code that uses libc regcomp/regexec.

  import std.stdio, std.string, std.regex, std.array, std.algorithm;

  T min(T)(T a, T b) {
          if (a < b) return a;
          return b;
  }

  void main() {
          ulong[string] emailcounts;
          auto re = ctRegex!(r"(?:\S+ ){3,4}<= ([^@]+@(\S+))");

          foreach (line; File("exim_mainlog").byLine()) {
                  auto m = line.match(re);
                  if (m) {
                          ++emailcounts[m.front[1].idup];
                  }
          }

          string[] senders = emailcounts.keys;
          sort!((a, b) { return emailcounts[a] > emailcounts[b]; })(senders);
          foreach (i; 0 .. min(senders.length, 5)) {
                  writefln("%5s %s", emailcounts[senders[i]], senders[i]);
          }
  }

The other code is available at https://github.com/jrfondren/topsender-bench
With PCRE and getline(), I get D down to 1.2x slower.

I wrote this partway through chapter 1 of "The D Programming Language",
so my question is mainly: is this a fair result? Is std.regex very slow, and
should I reach for PCRE if regex speed matters? Or is this code severely
flawed somehow? I'm using a random production log; I'm not trying to make
things difficult.

Relatedly, how can I add custom compiler flags to rdmd, in a D script?
For example, -L-lpcre
April 04
If you need performance, use ldc rather than dmd (which I assume you are using).

LLVM's code optimizers are many times better than dmd's.
April 04
On Thursday, 4 April 2019 at 09:53:06 UTC, Julian wrote:
>
> Relatedly, how can I add custom compiler flags to rdmd, in a D script?
> For example, -L-lpcre

Use the configuration variable "DFLAGS". On Windows you can set it in the sc.ini file; on Linux, in dmd.conf (see https://dlang.org/dmd-linux.html).
April 04
On Thursday, 4 April 2019 at 09:57:26 UTC, rikki cattermole wrote:
> If you need performance, use ldc rather than dmd (which I assume you are using).
>
> LLVM's code optimizers are many times better than dmd's.

Thanks! I already had dmd installed from a brief look at D a long
time ago, so I missed the details at https://dlang.org/download.html

ldc2 -O3 does a lot better, but the result is still 30x slower
without PCRE.
April 04
On Thursday, 4 April 2019 at 10:31:43 UTC, Julian wrote:
> On Thursday, 4 April 2019 at 09:57:26 UTC, rikki cattermole wrote:
>> If you need performance, use ldc rather than dmd (which I assume you are using).
>>
>> LLVM's code optimizers are many times better than dmd's.
>
> Thanks! I already had dmd installed from a brief look at D a long
> time ago, so I missed the details at https://dlang.org/download.html
>
> ldc2 -O3 does a lot better, but the result is still 30x slower
> without PCRE.

You need to disable the GC by importing core.memory : GC
and calling GC.disable().

The next thing is to avoid the .idup and cast to string instead.
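
A minimal sketch of both suggestions, under some assumptions: `bump` is a hypothetical helper name, and the buffer in `main` only stands in for the one that byLine hands out. Since byLine reuses its buffer between lines, a bare cast to string would leave the keys pointing at memory that later gets overwritten, so this sketch takes a middle road and calls .idup only the first time a key is seen.

```d
import core.memory : GC;
import std.stdio : writeln;

// Hypothetical helper: bump the count for `key`, copying it only when it is new.
void bump(ref ulong[string] counts, const(char)[] key) {
    if (auto p = key in counts)
        ++*p;                   // key already present: no allocation
    else
        counts[key.idup] = 1;   // first sighting: store an immutable copy
}

void main() {
    GC.disable();   // no collections for the rest of this short-lived run

    ulong[string] counts;
    char[] buf = "alice@example.com".dup;   // stands in for byLine's reused buffer
    bump(counts, buf);
    bump(counts, buf);
    buf[0 .. 5] = "carol";                  // overwrite, as reading the next line would
    bump(counts, buf);
    writeln(counts);
}
```

With the GC disabled, memory only grows until the program exits, which is usually acceptable for a one-shot log report but not for a long-running process.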

April 04
On Thu, Apr 04, 2019 at 09:53:06AM +0000, Julian via Digitalmars-d-learn wrote: [...]
>           auto re = ctRegex!(r"(?:\S+ ){3,4}<= ([^@]+@(\S+))");
[...]

ctRegex is a crock; use regex() instead and it might actually work
better.
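
A rough sketch of what that swap could look like, keeping the pattern from the original post but compiling it at run time with regex(). The helper name `senderOf` and the two log lines are made up for illustration; whether this is actually faster on a real log would need to be measured.

```d
import std.regex : Regex, matchFirst, regex;
import std.stdio : writeln;

// Hypothetical helper: return the address captured by group 1, or null if no match.
string senderOf(const(char)[] line, Regex!char re) {
    auto m = line.matchFirst(re);
    return m.empty ? null : m[1].idup;
}

void main() {
    // Same pattern as in the original post, built at run time instead of with ctRegex.
    auto re = regex(r"(?:\S+ ){3,4}<= ([^@]+@(\S+))");

    // Made-up log lines, for illustration only.
    auto received  = "2019-04-04 09:53:06 1hBxyz-0001Ab-Cd <= alice@example.com H=mail.example.com";
    auto delivered = "2019-04-04 09:53:07 1hBxyz-0001Ab-Cd => bob@example.org R=dnslookup";

    writeln(senderOf(received, re));
    writeln(senderOf(delivered, re) is null);
}
```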


T

-- 
Stop staring at me like that! It's offens... no, you'll hurt your eyes!
April 04
On Thursday, 4 April 2019 at 10:31:43 UTC, Julian wrote:
> On Thursday, 4 April 2019 at 09:57:26 UTC, rikki cattermole wrote:
>> If you need performance, use ldc rather than dmd (which I assume you are using).
>>
>> LLVM's code optimizers are many times better than dmd's.
>
> Thanks! I already had dmd installed from a brief look at D a long
> time ago, so I missed the details at https://dlang.org/download.html
>
> ldc2 -O3 does a lot better, but the result is still 30x slower
> without PCRE.

Try:
    ldc2 -O3 -release -flto=thin -defaultlib=phobos2-ldc-lto,druntime-ldc-lto -enable-inlining

This will improve inlining and optimization across the runtime library boundaries, which can help with certain types of code.