Thread overview
Very Stupid Regex question
Aug 07, 2014
seany
Aug 07, 2014
Marc Schütz
Aug 07, 2014
Justin Whear
Aug 07, 2014
seany
Aug 07, 2014
H. S. Teoh
Aug 07, 2014
Justin Whear
Aug 07, 2014
H. S. Teoh
Aug 07, 2014
H. S. Teoh
Aug 07, 2014
seany
August 07, 2014
Please consider the following:

string s1 = "PREabcdPOST";
string s2 = "PREabPOST";


string[] srar = ["ab", "abcd"];
// the order of the entries in this array cannot be guaranteed

foreach(sr; srar)
{

  auto r = regex(sr, "g");
  auto m = matchFirst(s1, r);
  if (!m.empty) break; // stop at the first pattern that matches
  // this one matches ab
  // but I want this to match abcd
  // and for s2 I want to match ab

}

obviously there are ways like counting the match length, and then using the maximum length, instead of breaking as soon as a match is found.

Are there any other better ways?
August 07, 2014
On Thursday, 7 August 2014 at 16:05:17 UTC, seany wrote:
> Please consider the following:
>
> string s1 = "PREabcdPOST";
> string s2 = "PREabPOST";
>
>
> string[] srar = ["ab", "abcd"];
> // the order of the entries in this array cannot be guaranteed
>
> foreach(sr; srar)
> {
>
>   auto r = regex(sr, "g");
>   auto m = matchFirst(s1, r);
>   if (!m.empty) break; // stop at the first pattern that matches
>   // this one matches ab
>   // but I want this to match abcd
>   // and for s2 I want to match ab
>
> }
>
> obviously there are ways like counting the match length, and then using the maximum length, instead of breaking as soon as a match is found.
>
> Are there any other better ways?

It's not clear to me what exactly you want, but:

Are the regexes in `srar` related? That is, does one regex always include the previous one as a prefix? Then you can use optional matches:

    /ab(cd)?/

This will match "abcd" if it is there, but will also match "ab" otherwise.
August 07, 2014
On Thu, 07 Aug 2014 16:05:16 +0000, seany wrote:

> obviously there are ways like counting the match length, and then using the maximum length, instead of breaking as soon as a match is found.
> 
> Are there any other better ways?

You're not really using regexes properly.  You want to greedily match as much as possible in this case, e.g.:

void main()
{
	import std.regex;
	auto re = regex("ab(cd)?");
	assert("PREabcdPOST".matchFirst(re).hit == "abcd");
	assert("PREabPOST".matchFirst(re).hit == "ab");

}
August 07, 2014
On Thursday, 7 August 2014 at 16:12:59 UTC, Justin Whear wrote:
> On Thu, 07 Aug 2014 16:05:16 +0000, seany wrote:
>
>> obviously there are ways like counting the match length, and then using
>> the maximum length, instead of breaking as soon as a match is found.
>> 
>> Are there any other better ways?
>
> You're not really using regexes properly.  You want to greedily match as
> much as possible in this case, e.g.:
>
> void main()
> {
> 	import std.regex;
> 	auto re = regex("ab(cd)?");
> 	assert("PREabcdPOST".matchFirst(re).hit == "abcd");
> 	assert("PREabPOST".matchFirst(re).hit == "ab");
>
> }

The thing is, abcd is read from a file, and at compile time I don't know whether cd will be there at all, or whether it should be ab(ef) instead.
August 07, 2014
On Thu, Aug 07, 2014 at 04:49:05PM +0000, seany via Digitalmars-d-learn wrote:
> On Thursday, 7 August 2014 at 16:12:59 UTC, Justin Whear wrote:
> >On Thu, 07 Aug 2014 16:05:16 +0000, seany wrote:
> >
> >>obviously there are ways like counting the match length, and then using the maximum length, instead of breaking as soon as a match is found.
> >>
> >>Are there any other better ways?
> >
> >You're not really using regexes properly.  You want to greedily match as much as possible in this case, e.g.:
> >
> >void main()
> >{
> >	import std.regex;
> >	auto re = regex("ab(cd)?");
> >	assert("PREabcdPOST".matchFirst(re).hit == "abcd");
> >	assert("PREabPOST".matchFirst(re).hit == "ab");
> >
> >}
> 
> The thing is, abcd is read from a file, and at compile time I don't know whether cd will be there at all, or whether it should be ab(ef) instead.

So basically you have a file containing regex patterns, and you want to find the longest match among them?

One way to do this is to combine them at runtime:

	string[] patterns = ... /* read from file, etc. */;

	// Longer patterns match first
	patterns.sort!((a,b) => a.length > b.length);

	// Build regex
	string regexStr = "%((%(%c%))%||%)".format(patterns);
	auto re = regex(regexStr);

	...

	// Run matches against input
	char[] input = ...;
	auto m = input.match(re);
	auto matchedString = m.captures[0];
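
A self-contained sketch of the same idea, for illustration only. It assumes every pattern is a plain literal with no regex metacharacters, and the helper name `buildAlternation` is made up for this example. It joins the patterns with `map` and `join` instead of the nested format string, which should yield the same `(abcd)|(ab)` style expression.

```d
import std.algorithm : map, sort;
import std.array : join;
import std.regex;

/// Hypothetical helper: builds a single alternation from literal
/// patterns, longest literal first.
/// Assumption: the patterns contain no regex metacharacters.
auto buildAlternation(string[] patterns)
{
    auto sorted = patterns.dup;                  // leave the caller's array alone
    sorted.sort!((a, b) => a.length > b.length); // longer literals come first
    // Wrap each literal in a group and join the groups with '|'.
    return regex(sorted.map!(p => "(" ~ p ~ ")").join("|"));
}

void main()
{
    auto re = buildAlternation(["ab", "abcd"]);
    assert("PREabcdPOST".matchFirst(re).hit == "abcd");
    assert("PREabPOST".matchFirst(re).hit == "ab");
}
```

Sorting by pattern length only helps because a longer literal always produces a longer hit; that no longer holds once the patterns contain quantifiers.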


T

-- 
When solving a problem, take care that you do not become part of the problem.
August 07, 2014
On Thu, 07 Aug 2014 10:22:37 -0700, H. S. Teoh via Digitalmars-d-learn wrote:

> 
> So basically you have a file containing regex patterns, and you want to find the longest match among them?

> 	// Longer patterns match first
> 	patterns.sort!((a,b) => a.length > b.length);
> 
> 	// Build regex
> 	string regexStr = "%((%(%c%))%||%)".format(patterns);
> 	auto re = regex(regexStr);

This only works if the patterns are simple literals.  E.g. the pattern 'a+' might match a longer sequence than 'aaa'.  If you're out for the longest possible match, iteratively testing each pattern is probably the way to go.
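
A minimal sketch of what iterative testing could look like, assuming the patterns arrive as a `string[]`. The function name `longestMatch` is invented for this example and is not part of std.regex.

```d
import std.regex;
import std.stdio;

/// Hypothetical helper: tries every pattern in turn and returns the
/// longest hit, or an empty string if no pattern matches.
string longestMatch(string input, string[] patterns)
{
    string best;
    foreach (p; patterns)
    {
        auto m = input.matchFirst(regex(p));
        // Keep this hit only if it is strictly longer than the best so far,
        // so the array order does not affect the result except on ties.
        if (!m.empty && m.hit.length > best.length)
            best = m.hit;
    }
    return best;
}

void main()
{
    writeln(longestMatch("PREabcdPOST", ["ab", "abcd"]));
    writeln(longestMatch("PREabPOST", ["ab", "abcd"]));
    writeln(longestMatch("PREaaaaPOST", ["aaa", "a+"]));
}
```

Note that `matchFirst` reports only the leftmost match of each pattern, so this compares one candidate per pattern rather than every possible match in the input. It also compiles each regex on every call; caching the compiled regexes would be the obvious refinement if the pattern list is reused.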
August 07, 2014
On Thu, Aug 07, 2014 at 05:33:42PM +0000, Justin Whear via Digitalmars-d-learn wrote:
> On Thu, 07 Aug 2014 10:22:37 -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> 
> > 
> > So basically you have a file containing regex patterns, and you want to find the longest match among them?
> 
> > 	// Longer patterns match first
> > 	patterns.sort!((a,b) => a.length > b.length);
> > 
> > 	// Build regex
> > 	string regexStr = "%((%(%c%))%||%)".format(patterns);
> > 	auto re = regex(regexStr);
> 
> This only works if the patterns are simple literals.  E.g. the pattern 'a+' might match a longer sequence than 'aaa'.  If you're out for the longest possible match, iteratively testing each pattern is probably the way to go.

Hmm, you're right. I was a bit disappointed to find out that the | operator in std.regex (and also in Perl's regex) doesn't do longest-match but first-match. :-( I had always thought it did longest-match, like in lex/flex.

I wish we could extend std.regex to allow longest-match for alternations... but there may be performance consequences.
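
A small sketch of the first-match behaviour described above, using the literals from the start of the thread. The expected hits follow from the ordered-alternation semantics discussed here.

```d
import std.regex;

void main()
{
    // Alternatives are tried left to right, and the first one that
    // succeeds wins, even though a later one would match more text.
    assert("PREabcdPOST".matchFirst(regex("ab|abcd")).hit == "ab");

    // Listing the longer literal first yields the longer hit.
    assert("PREabcdPOST".matchFirst(regex("abcd|ab")).hit == "abcd");
}
```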


T

-- 
There's light at the end of the tunnel. It's the oncoming train.
August 07, 2014
On Thu, Aug 07, 2014 at 10:42:13AM -0700, H. S. Teoh via Digitalmars-d-learn wrote: [...]
> Hmm, you're right. I was a bit disappointed to find out that the | operator in std.regex (and also in Perl's regex) doesn't do longest-match but first-match. :-( I had always thought it did longest-match, like in lex/flex.
> 
> I wish we could extend std.regex to allow longest-match for alternations... but there may be performance consequences.

https://issues.dlang.org/show_bug.cgi?id=13268


T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
August 07, 2014
On Thursday, 7 August 2014 at 18:16:11 UTC, H. S. Teoh via Digitalmars-d-learn wrote:

>
> https://issues.dlang.org/show_bug.cgi?id=13268
>
>
> T

Thank you soooooooooo much!!