Thread overview
Degenerate Regex Case
Apr 24, 2015
Guillaume
Apr 24, 2015
TheFlyingFiddle
Apr 25, 2015
Dmitry Olshansky
Apr 26, 2015
Guillaume
April 24, 2015
Hello, I'm trying to make a regex comparison with D, based off of this article: https://swtch.com/~rsc/regexp/regexp1.html

I've written my code like so:

import std.stdio, std.regex;

void main(string argv[]) {

	string m = argv[1];
	auto p = ctRegex!("a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
	if (match(m, p)) {
		writeln("match");
	} else {
		writeln("no match");
	}

}

And the compiler goes into swap. Doing it at runtime is no better. I was under the impression that this particular regex was used for showcasing the Thompson NFA which D claims to be using.

The golang code version of this runs fine, which makes me think that maybe D isn't using the correct regex engine for this particular regex. Or perhaps I'm using this wrong?
April 24, 2015
On Friday, 24 April 2015 at 18:28:16 UTC, Guillaume wrote:
> Hello, I'm trying to make a regex comparison with D, based off of this article: https://swtch.com/~rsc/regexp/regexp1.html
>
> I've written my code like so:
>
> import std.stdio, std.regex;
>
> void main(string argv[]) {
>
> 	string m = argv[1];
> 	auto p = ctRegex!("a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
> 	if (match(m, p)) {
> 		writeln("match");
> 	} else {
> 		writeln("no match");
> 	}
>
> }
>
> And the compiler goes into swap. Doing it at runtime is no better. I was under the impression that this particular regex was used for showcasing the Thompson NFA which D claims to be using.

The regex "a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" can be simplified to "a{30,60}" (if i counted correctly).

The regex "a{30,60}" works fine.

[Speculation]
I don't have a good understanding of how D's regex engine work but I am guessing that it does not do any simplification of the regex input causing it to generate larger engines for each additional ? symbol. Thus needing more memory. Eventually as in this case the compiler runs out of memory.




April 25, 2015
On Friday, 24 April 2015 at 18:28:16 UTC, Guillaume wrote:
> Hello, I'm trying to make a regex comparison with D, based off of this article: https://swtch.com/~rsc/regexp/regexp1.html
>
> I've written my code like so:
>
> import std.stdio, std.regex;
>
> void main(string argv[]) {
>
> 	string m = argv[1];
> 	auto p = ctRegex!("a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
> 	if (match(m, p)) {
> 		writeln("match");
> 	} else {
> 		writeln("no match");
> 	}
>
> }

>
> And the compiler goes into swap. Doing it at runtime is no better. I was under the impression that this particular regex was used for showcasing the Thompson NFA which D claims to be using.
>

A quick investigation shows that it gets stuck at the end of pattern compilation stage.

The problem is that as a last pass D's regex goes to optimize the pattern to construct simple bit-scanning engine as approximation for prefix of original pattern. And that process is a lot like Thompson NFA ... _BUT_ the trick of merging equivalent threads wasn't applied there.

So in short: file a bug, optimizer absolutely should do de-duplication of threads.


> The golang code version of this runs fine, which makes me think that maybe D isn't using the correct regex engine for this particular regex. Or perhaps I'm using this wrong?

It uses 2 kinds of engines, run-time one is Thompson NFA. Compile-time is (for now) still backtracking.

---
Dmitry Olshansky
April 26, 2015
On Saturday, 25 April 2015 at 09:30:55 UTC, Dmitry Olshansky wrote:
>
> A quick investigation shows that it gets stuck at the end of pattern compilation stage.
>
> The problem is that as a last pass D's regex goes to optimize the pattern to construct simple bit-scanning engine as approximation for prefix of original pattern. And that process is a lot like Thompson NFA ... _BUT_ the trick of merging equivalent threads wasn't applied there.
>
> So in short: file a bug, optimizer absolutely should do de-duplication of threads.
>
> ---
> Dmitry Olshansky

Thanks for your help, I'll go file a bug.