Thread overview
Nested RegEx
Apr 21, 2012
nrgyzer
Apr 21, 2012
Dmitry Olshansky
Apr 21, 2012
H. S. Teoh
Apr 21, 2012
Dmitry Olshansky
Apr 24, 2012
nrgyzer
April 21, 2012
Hi guys,

I'm trying to use std.regex to parse a string like the following:

string myString = "preOuter {if condition1} content1 {if condition2} content2 {elseif condition3} content3 {else}any other content{/if}{/if} postOuter";

Is there any chance to use std.regex to parse the string above? I currently used the following expression:

auto r = regex(`(.*?)\{if:(?P<condition>(.+?))\}(?P<content>(.*))(\{/if\})
(.*)`, "g");

but it doesn't fit my nested if- and else-statements correctly.

Thanks in advance for any suggestions!
April 21, 2012
On 21.04.2012 21:24, nrgyzer wrote:
> Hi guys,
>
> I'm trying to use std.regex to parse a string like the following:
>
> string myString = "preOuter {if condition1} content1 {if condition2} content2
> {elseif condition3} content3 {else}any other content{/if}{/if} postOuter";
>

Simply put pure regex is incapable of arbitrary-nested if-else statements. In a sense if... /if is a case of balanced parens problem and it's widely know to form non-regular language.

The way out is to either preconstruct regex pattern for a given maximum depth or just use handwritten parser (it may use regex for parts of input internally).

One day I may add push-pop stack extensions (that allow this kind of thing) into regex but I'd have to think hard to make it efficient.

(look at .NET regex they kind of have syntax for this but it's too underpowered for my taste)

> Is there any chance to use std.regex to parse the string above? I currently
> used the following expression:
>
> auto r = regex(`(.*?)\{if:(?P<condition>(.+?))\}(?P<content>(.*))(\{/if\})
> (.*)`, "g");
>
> but it doesn't fit my nested if- and else-statements correctly.
>
> Thanks in advance for any suggestions!


-- 
Dmitry Olshansky
April 21, 2012
On Sat, Apr 21, 2012 at 09:41:18PM +0400, Dmitry Olshansky wrote:
> On 21.04.2012 21:24, nrgyzer wrote:
> >Hi guys,
> >
> >I'm trying to use std.regex to parse a string like the following:
> >
> >string myString = "preOuter {if condition1} content1 {if condition2} content2 {elseif condition3} content3 {else}any other content{/if}{/if} postOuter";
> >
> 
> Simply put pure regex is incapable of arbitrary-nested if-else statements. In a sense if... /if is a case of balanced parens problem and it's widely know to form non-regular language.

This is of theoretical interest to me. Very often I find myself wanting a concise way to express patterns with nested matching delimiters, but regexes can't do it. But to jump to a full-blown stack-based language seems like an overkill to me: stack languages can express *much* more than just nested delimiters, most of which is difficult to encapsulate in a nice, concise syntax like regexes. All I want is a simple way to express the kind of simple nesting that matching delimiters give.


[...]
> One day I may add push-pop stack extensions (that allow this kind of thing) into regex but I'd have to think hard to make it efficient.
[...]

Plus, have a nice concise syntax for it (otherwise I might as well just write a recursive descent parser for it in the first place).


T

-- 
The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners... -- Slashdotter
April 21, 2012
On 21.04.2012 22:46, H. S. Teoh wrote:
> On Sat, Apr 21, 2012 at 09:41:18PM +0400, Dmitry Olshansky wrote:
>> On 21.04.2012 21:24, nrgyzer wrote:
>>> Hi guys,
>>>
>>> I'm trying to use std.regex to parse a string like the following:
>>>
>>> string myString = "preOuter {if condition1} content1 {if condition2} content2
>>> {elseif condition3} content3 {else}any other content{/if}{/if} postOuter";
>>>
>>
>> Simply put pure regex is incapable of arbitrary-nested if-else
>> statements. In a sense if... /if is a case of balanced parens
>> problem and it's widely know to form non-regular language.
>
> This is of theoretical interest to me. Very often I find myself wanting
> a concise way to express patterns with nested matching delimiters, but
> regexes can't do it. But to jump to a full-blown stack-based language
> seems like an overkill to me: stack languages can express *much* more
> than just nested delimiters, most of which is difficult to encapsulate
> in a nice, concise syntax like regexes. All I want is a simple way to
> express the kind of simple nesting that matching delimiters give.
>

Recursive descent is not particularly bad unless minimal "grammar descent depth" is high. Example:
a+b-c
uses a quite a lot of recursive calls for grammar with e.g. 10+ operator precedence levels.


I'm thinking of merging operator precedence parser with regex might be a happy marriage you know.

Back to OP topic something along the lines of this will do it (beware of stack overflow):

void parseIf(){
	static int ifNest;
	if(input.startWith("{if")){
		ifNest++;
		scope(exit) ifNest--;
		enforce(ifNest < 10000, "conservative stack overflow");
		parseCond(input[2..$-1]);//regex that does condition
		enforce(input[$-1] == '}', "close that if");
		parseIf();//parse body or another nested if
		//parseElse();//goes here, as does elif
	}
	else
		parseBody();// the regex you used before
}


>
> [...]
>> One day I may add push-pop stack extensions (that allow this kind of
>> thing) into regex but I'd have to think hard to make it efficient.
> [...]
>
> Plus, have a nice concise syntax for it (otherwise I might as well just
> write a recursive descent parser for it in the first place).
>

Concise syntax & lots of power is a priority actually, because I can detect if push-pop stuff is used or not and pick the right engine for the job. So different big-oh complexity is not important.

-- 
Dmitry Olshansky
April 24, 2012
== Auszug aus Dmitry Olshansky (dmitry.olsh@gmail.com)'s Artikel
> On 21.04.2012 22:46, H. S. Teoh wrote:
> > On Sat, Apr 21, 2012 at 09:41:18PM +0400, Dmitry Olshansky wrote:
> >> On 21.04.2012 21:24, nrgyzer wrote:
> >>> Hi guys,
> >>>
> >>> I'm trying to use std.regex to parse a string like the following:
> >>>
> >>> string myString = "preOuter {if condition1} content1 {if condition2}
content2
> >>> {elseif condition3} content3 {else}any other content{/if}{/if} postOuter";
> >>>
> >>
> >> Simply put pure regex is incapable of arbitrary-nested if-else statements. In a sense if... /if is a case of balanced parens problem and it's widely know to form non-regular language.
> >
> > This is of theoretical interest to me. Very often I find myself wanting a concise way to express patterns with nested matching delimiters, but regexes can't do it. But to jump to a full-blown stack-based language seems like an overkill to me: stack languages can express *much* more than just nested delimiters, most of which is difficult to encapsulate in a nice, concise syntax like regexes. All I want is a simple way to express the kind of simple nesting that matching delimiters give.
> >
> Recursive descent is not particularly bad unless minimal "grammar
> descent depth" is high. Example:
> a+b-c
> uses a quite a lot of recursive calls for grammar with e.g. 10+ operator
> precedence levels.
> I'm thinking of merging operator precedence parser with regex might be a
> happy marriage you know.
> Back to OP topic something along the lines of this will do it (beware of
> stack overflow):
> void parseIf(){
> 	static int ifNest;
> 	if(input.startWith("{if")){
> 		ifNest++;
> 		scope(exit) ifNest--;
> 		enforce(ifNest < 10000, "conservative stack overflow");
> 		parseCond(input[2..$-1]);//regex that does condition
> 		enforce(input[$-1] == '}', "close that if");
> 		parseIf();//parse body or another nested if
> 		//parseElse();//goes here, as does elif
> 	}
> 	else
> 		parseBody();// the regex you used before
> }
> >
> > [...]
> >> One day I may add push-pop stack extensions (that allow this kind of thing) into regex but I'd have to think hard to make it efficient.
> > [...]
> >
> > Plus, have a nice concise syntax for it (otherwise I might as well just write a recursive descent parser for it in the first place).
> >
> Concise syntax & lots of power is a priority actually, because I can detect if push-pop stuff is used or not and pick the right engine for the job. So different big-oh complexity is not important.

Thanks guys... I just solved the problem by using indexOf() from std.string for parsing my if-statements. So... in principle, I can have unlimited nested if- statements, but I think it's much more difficult (and slower) than using any regex-expression.