Tips on making regex more performant?

Below is an example snippet of code to test for performance of regex matches. I need to parse a large log and extract data from it and i've noticed a huge increase in time of the loop when reading and using regex. ... auto alert = regex(r"^Alert ([0-9]+)"); while ((line = file.readln()) !is null) { auto m = match(line, alert); if (m) { alerts++; } counter++; } ... Using the above example i parse about 700K lines per second (i'm reading from an SSD). If i comment out the regex match function, i read at 4.5M lines per second. Considering i need to use about 8 regex matches and extract data, this figure further drops to about 100K lines per second. Is there anything i can do to speed up regex matching in such a scenario? Are there any tips you can share to speed things up? Thanks.

On 06/18/2013 01:53 PM, Gary Willoughby wrote: > Below is an example snippet of code to test for performance of regex > matches. I need to parse a large log and extract data from it and i've > noticed a huge increase in time of the loop when reading and using regex. > > ... > auto alert = regex(r"^Alert ([0-9]+)"); > > while ((line = file.readln()) !is null) > { > auto m = match(line, alert); > > if (m) > { > alerts++; > } > > counter++; > } > ... > > Using the above example i parse about 700K lines per second (i'm reading > from an SSD). If i comment out the regex match function, i read at 4.5M > lines per second. Considering i need to use about 8 regex matches and > extract data, this figure further drops to about 100K lines per second. > > Is there anything i can do to speed up regex matching in such a > scenario? Are there any tips you can share to speed things up? > > Thanks. enum alert = ctRegex!r"^Alert ([0-9]+)"; And then use it the same way.

Gary Willoughby: > Any more tips? Try using the LDC2 compiler, with the ctRegex, and using the correct compilation flags (like ldmd2 -O -release -inline -noboundscheck). If that fails one solution is to write your own finite state machine where the state is encoded by the program counter, using gotos. There is a well known simple to use C library that generates such code automatically. But it's probably better to rethink your problem and re-frame it. Bye, bearophile

On 2013-06-18 21:22, Gary Willoughby wrote: > Thanks. Hmmm.. i get 500K (worse performance) using that. :/ D has basically the fastest regular expression library/module. It's faster than V8. > Any more tips? How about reading in larger chunks than single lines? -- /Jacob Carlborg

June 18, 2013

Re: Tips on making regex more performant?

Posted by Dmitry Olshansky
in reply to Gary Willoughby

Permalink

Dmitry Olshansky

Posted in reply to Gary Willoughby

Permalink

18-Jun-2013 22:53, Gary Willoughby пишет:
> Below is an example snippet of code to test for performance of regex
> matches. I need to parse a large log and extract data from it and i've
> noticed a huge increase in time of the loop when reading and using regex.
>
>      ...
>      auto alert = regex(r"^Alert ([0-9]+)");
>
>      while ((line = file.readln()) !is null)
>      {
>          auto m = match(line, alert);
>
>          if (m)
>          {
>              alerts++;
>          }
>
>          counter++;
>      }
>      ...
>

Depending on you data size reading more into memory in one go and then use regex  with "g" option on it. The difference is that each time you call `match` regex engine has to allocate state for the matcher. This leads to basically doing malloc/free per line. Millions of allocations per second are surely out of reach.

I've been long wondering what I can do to speed up such one-off matches as most of tuning had gone into the machine itself, not start/stop paths. What can be done to fix this is adding TLS cache of memory in std.regex, so that it reuses the same memory on each call to match.
Partially this was postponed till supposedly soon to come Allocators design but it never did.

Alternatives.

Simplest alternative would be to use some Appender!char object and fill it up with say 100 lines.

Second option is read up a big buffer, find a newline from the end of it and run global match on this chunk. Copy the (hopefully small) tail to the beginning of buffer and read into the rest.

The example below assumes you can't have line bigger then some buffer size.

(it's a sketch, untested)

alert = regex(..., "gm");
ubyte[] buffer = new ubyte[64*1024];
ubyte[] slice = buffer[];
auto hasData = true;
while(hasData){
	size_t got = file.rawRead(slice).length;
	// + tail from prev iteration
	auto usable = got + buffer.length - slice.length;
	if(got < slice.length)
		hasData = false;
	auto idx = indexOf(retro(buffer[0..got]), '\n');
	if(idx < 0){
		slice = buffer[];
		continue;
	}
	size_t lastLF = got-idx;
	char[] piece = cast(char[])buffer[0..lastLF];
	foreach(m; match(piece, alert))
	{
		//old code here
	}
	copy(buffer[lastLF..$], buffer[]); //copy leftover to front
	slice = buffer[$-lastLF..$]; //read into the rest of buffer
}

This is how I'd go about it ATM if speed is important.

> Using the above example i parse about 700K lines per second (i'm reading
> from an SSD). If i comment out the regex match function, i read at 4.5M
> lines per second. Considering i need to use about 8 regex matches and
> extract data, this figure further drops to about 100K lines per second.

Like test 8 separate patterns? This would multiply aforementioned malloc/free by 8. One idea is to try putting them all in a single regex with alternations '|'.


> Is there anything i can do to speed up regex matching in such a
> scenario? Are there any tips you can share to speed things up?
>

There are also compiler flags to fiddle with (dmd):
 -O -release -inline -noboundscheck

> Thanks.


-- 
Dmitry Olshansky

18-Jun-2013 23:22, Gary Willoughby пишет: >> enum alert = ctRegex!r"^Alert ([0-9]+)"; >> >> And then use it the same way. > > Thanks. Hmmm.. i get 500K (worse performance) using that. :/ My bet would be allocations are taking the bulk of time. Any chance to compile it with -profile? > > Any more tips? -- Dmitry Olshansky

19-Jun-2013 00:34, Jacob Carlborg пишет: > On 2013-06-18 21:22, Gary Willoughby wrote: > >> Thanks. Hmmm.. i get 500K (worse performance) using that. :/ > > D has basically the fastest regular expression library/module. It's > faster than V8. > As much as I'm appeased to hear this it's isn't simply "faster then V8". V8 one is very fast in general case (JIT compiler aids in that) and I think would come out as winer more often then std.regex. What is closer to truth is that typically std.regex is within about 5 top well-known engines and it's beats the rest of competition on some patterns including in certain popular benchmarks such as regex-dna. For instance it typically obliterates PCRE, infinitely so on Unicode patterns (I never had patience to let it finish) . With that being said there are many things to improve in std.regex speed-wise, sadly I haven't been able to tell about it at DConf. -- Dmitry Olshansky

On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote: > Below is an example snippet of code to test for performance of regex matches. I need to parse a large log and extract data from it and i've noticed a huge increase in time of the loop when reading and using regex. > > ... > auto alert = regex(r"^Alert ([0-9]+)"); > > while ((line = file.readln()) !is null) > { > auto m = match(line, alert); > > if (m) > { > alerts++; > } > > counter++; > } > ... > > Using the above example i parse about 700K lines per second (i'm reading from an SSD). If i comment out the regex match function, i read at 4.5M lines per second. Considering i need to use about 8 regex matches and extract data, this figure further drops to about 100K lines per second. > > Is there anything i can do to speed up regex matching in such a scenario? Are there any tips you can share to speed things up? > > Thanks. I'm working with some string-heavy applications so I was curious about this myself. I'm new to D, but I did some heavy data analysis on chat files a while back. Not knowing anything about your data or what other queries you might want to do on it, matching the first part of the string with std.algorithm.startsWith() and splitting the line on a delimiter outperforms regex matching on my admittedly arbitrary test code. I tested two extremes at 10,000,000 rounds: Match everything: ~39 seconds for match, ~8 seconds for startsWith/split. Match fails at start of string: ~10 seconds for match, ~1 second for startsWith/split. Match fails at end of string: ~15 seconds for match, ~1 second for startsWith/split. Even if you need a regex to match the middle, it might be worthwhile to filter the list with startsWith if you're matching a fixed string at the start of the line. Again, it depends on the frequency of hits and how the data is structured. Split is probably not the best way to slice the match, but I don't have time tonight to try other slicing methods.

On 2013-06-18 23:29, Dmitry Olshansky wrote: > As much as I'm appeased to hear this it's isn't simply "faster then V8". > V8 one is very fast in general case (JIT compiler aids in that) and I > think would come out as winer more often then std.regex. > > What is closer to truth is that typically std.regex is within about 5 > top well-known engines and it's beats the rest of competition on some > patterns including in certain popular benchmarks such as regex-dna. > > For instance it typically obliterates PCRE, infinitely so on Unicode > patterns (I never had patience to let it finish) . > > With that being said there are many things to improve in std.regex > speed-wise, sadly I haven't been able to tell about it at DConf. I see, thanks for the correction. -- /Jacob Carlborg

Forums