Template regexes, version 2 (page 2) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » Template regexes, version 2 (page 2)

February 21, 2006

Re: Template regexes, version 2

Posted by Craig Black
in reply to Ivan Senji

Craig Black

Posted in reply to Ivan Senji

> That looks scary :)

Probably you are joking, but I feel the need to defend my code anyway.  To someone who doesn't know the syntax it is not going to be very readable. <g> However, my code is way more readable than a lot of the regular expressions that I've seen.  If you don't believe me, google for regular expressions that can be used to parse C++.  One benefit of my language over regular expressions is that you can define subroutines.  Thus, it's not all one big mess.  It's a bunch of smaller, little messes.  This makes a long regular expression much more managable.

-Craig

February 21, 2006

Re: Template regexes, version 2

Posted by Ivan Senji
in reply to Craig Black

Ivan Senji

Posted in reply to Craig Black

Craig Black wrote:
>>That looks scary :)
> 
> 
> Probably you are joking

I was joking (but it does look scary)

> , but I feel the need to defend my code anyway.  To someone who doesn't know the syntax it is not going to be very readable. <g> However, my code is way more readable than a lot of the regular expressions 

I guess i'll believe you.

> that I've seen.  If you don't believe me, google for regular expressions that can be used to parse C++.  

Well I would be surprised to find that because C++ grammar isn't regular, it isn't even LR, it is something awfull.

> One benefit of my language over regular expressions is that you can define subroutines.  

Interesting? What class of languages can it parse?

> Thus, it's not all one big mess.  It's a bunch of smaller, little messes.  This makes a long regular expression much more managable.
>

February 22, 2006

Re: Template regexes, version 2

Posted by Don Clugston
in reply to Walter Bright

Don Clugston

Posted in reply to Walter Bright

Attachments:

regexp2.d

Walter Bright wrote:
> "Don Clugston" <dac@nospam.com.au> wrote in message news:dtev42$1g98$1@digitaldaemon.com...
>> Perl-ish languages deal with the last case by allowing embedded variables
>> in strings. The template regexes Ive been developing can cope with them by
>> using additional parameters and extended syntax, eg @1 for the string
>> value of the first parameter.
>> eg
>>
>> char [] var1 = "here";
>> char [] input = " a56aherea";
>> char [] x = firstmatch!("a@1a")(input, var1);
>> var1 = "56";
>> char [] y = firstmatch!("a@1a")(input, var1);
>>
>> would give x == "aherea", y=="aherea".
> 
> Great! But can I suggest using the format used by std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.

OK. Except ... I think replace() doesn't have to worry about $ colliding
with the end of line marker? Anyway, I've used $n for now.
(Doesn't SQL use @ for parameters in stored procedures? Can't quite
remember.)

>> How good is DMD at performing this optimisation?
> 
> Not good enough. But I wouldn't worry about that for the moment.

I'm not really surprised -- I imagine the back-end doesn't have much that's specific to D.

I've attached what I've done. It's not complete -- doesn't do ranges, or
any of the escape characters, or beginning and end of line.
But it does do the full set of repetition operators, parentheses,
and unions (arbitrarily nestable). It also does the $n parameters.
Most of the remaining stuff is straightforward, and not very exciting.

I feel as though I've finally worked out what mixins are for. They are a perfect match for this application.

In any case, this should give you a good idea for what you'll able to use in your presentation.

February 22, 2006

Re: Template regexes, version 2

Posted by Craig Black
in reply to Ivan Senji

Craig Black

Posted in reply to Ivan Senji

"Ivan Senji" <ivan.senji_REMOVE_@_THIS__gmail.com> wrote in message news:dtg4am$2si9$1@digitaldaemon.com...
> Craig Black wrote:
>>>That looks scary :)
>>
>>
>> Probably you are joking
>
> I was joking (but it does look scary)
>
>> , but I feel the need to defend my code anyway.  To someone who doesn't know the syntax it is not going to be very readable. <g> However, my code is way more readable than a lot of the regular expressions
>
> I guess i'll believe you.
>
>> that I've seen.  If you don't believe me, google for regular expressions that can be used to parse C++.
>
> Well I would be surprised to find that because C++ grammar isn't regular, it isn't even LR, it is something awfull.

No, the regular expression doesn't actually parse C++, but it helps to parse it.  A friend of mine was writing a C++ parser and he showed me a regular expression he found on the internet (I would show you, but I just looked but couldn't find it.)  It was hideously long because it has a lot of repetitous stuff.

When you add subroutines to the syntax, the repetition disappears.

>> One benefit of my language over regular expressions is that you can define subroutines.
>
> Interesting? What class of languages can it parse?

Theoretically it can parse anything.  It includes syntax for pushing and poping text on a stack and populating text values in a simple parse tree format.

>> Thus, it's not all one big mess.  It's a bunch of smaller, little messes. This makes a long regular expression much more managable.
>>

February 23, 2006

Re: Template regexes, version 2

Posted by Don Clugston
in reply to Don Clugston

Don Clugston

Posted in reply to Don Clugston

I should mention that the code I posted doesn't do real regexps, because it has no backtracking. It doesn't do *, +, ? properly, they are always greedy. It can never match ".*a", for example.
I was too focused on getting access to variables.
However, the mixin technique means that the regexp is actually turned into a binary tree of nested functions.

eg (ab.*|c)a
becomes something like:

bool test(char [] searchstr)
{
  int p=0;

  bool seq() {
    bool union() {
       bool seq() {
           bool char_a {}
           bool seq() {
             bool char_b {}
             bool star() {
                 bool dot()
             }
           }
        }
        bool char_c() {}
    }
    bool char_a() {}
  }
  return seq();
}

so that in every case (including unions), the next function that needs to be called is always in scope. Intriguingly, it also has access to the local variables all the way down the tree. (Since of these levels can contain its own stack as an array, I'm pretty sure that a Turing machine can be created, it's likely to be more powerful than the state machines normally used for regexps).

Obviously this can be used to do regexps properly. I was hoping to use a clever naming scheme to avoid the alias issues from pragma's version, but it seems to be inevitable.

It seems hard to do this properly without creating a highly optimised regexp parser!

Don Clugston wrote:
> Walter Bright wrote:
>> "Don Clugston" <dac@nospam.com.au> wrote in message news:dtev42$1g98$1@digitaldaemon.com...
>>> Perl-ish languages deal with the last case by allowing embedded variables in strings. The template regexes Ive been developing can cope with them by using additional parameters and extended syntax, eg @1 for the string value of the first parameter.
>>> eg
>>>
>>> char [] var1 = "here";
>>> char [] input = " a56aherea";
>>> char [] x = firstmatch!("a@1a")(input, var1);
>>> var1 = "56";
>>> char [] y = firstmatch!("a@1a")(input, var1);
>>>
>>> would give x == "aherea", y=="aherea".
>>
>> Great! But can I suggest using the format used by std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.
> 
> OK. Except ... I think replace() doesn't have to worry about $ colliding with the end of line marker? Anyway, I've used $n for now.
> (Doesn't SQL use @ for parameters in stored procedures? Can't quite remember.)
> 
>>> How good is DMD at performing this optimisation?
>>
>> Not good enough. But I wouldn't worry about that for the moment.
> 
> I'm not really surprised -- I imagine the back-end doesn't have much that's specific to D.
> 
> I've attached what I've done. It's not complete -- doesn't do ranges, or any of the escape characters, or beginning and end of line.
> But it does do the full set of repetition operators, parentheses,
> and unions (arbitrarily nestable). It also does the $n parameters.
> Most of the remaining stuff is straightforward, and not very exciting.
> 
> I feel as though I've finally worked out what mixins are for. They are a perfect match for this application.
> 
> In any case, this should give you a good idea for what you'll able to use in your presentation.
>

February 23, 2006

Re: Template regexes, version 2

Posted by Georg Wrede
in reply to Don Clugston

Georg Wrede

Posted in reply to Don Clugston

Don Clugston wrote:

> I should mention that the code I posted doesn't do real regexps, because it has no backtracking. It doesn't do *, +, ? properly, they are always greedy. It can never match ".*a", for example.

> It seems hard to do this properly without creating a highly optimised regexp parser!

LOL!

(Self-reference.)

February 23, 2006

Re: Template regexes, version 2

Posted by pragma
in reply to Don Clugston

pragma

Posted in reply to Don Clugston

Don, I love this rendition.

In article <dtjv60$1tc9$1@digitaldaemon.com>, Don Clugston says...
>
>I should mention that the code I posted doesn't do real regexps, because
>it has no backtracking. It doesn't do *, +, ? properly, they are always
>greedy. It can never match ".*a", for example.
>I was too focused on getting access to variables.

I'm scratching my head as to how to fix the greedy-matching problem.  (my apologies if you've already mulled this over)

One way to go is provide a switch for greedy and non-greedy behaviors at the interface level, such that a programmer can select the behavior.  Another is to adopt a different token sequence in the expression itself (I think some engines use "*?", "??", "{m,n}?" and "+?" for non-greedy matching).

A third option, is to redesign the expression engine itself to work with multiple matches, such that each predicate is understood to match the left-hand side against n right-hand sides (as returned from other predicates).  This way, the expression ".*a" will match "a" in each sucessive substring following each successive attempt at matching "."; the result would be a "char[][]" containing zero or more matches of the "*" predicate expression.


>However, the mixin technique means that the regexp is actually turned into a binary tree of nested functions.
>
>eg (ab.*|c)a
>becomes something like:
>
>bool test(char [] searchstr)
>{
>   int p=0;
>
>   bool seq() {
>     bool union() {
>        bool seq() {
>            bool char_a {}
>            bool seq() {
>              bool char_b {}
>              bool star() {
>                  bool dot()
>              }
>            }
>         }
>         bool char_c() {}
>     }
>     bool char_a() {}
>   }
>   return seq();
>}
>
>so that in every case (including unions), the next function that needs to be called is always in scope. Intriguingly, it also has access to the local variables all the way down the tree. (Since of these levels can contain its own stack as an array, I'm pretty sure that a Turing machine can be created, it's likely to be more powerful than the state machines normally used for regexps).

Also, I'm curious to see what DMD would do to optimize that code.  Would some of those internals dissapear, or would they have to be anonymous functions in order for that to happen?  It might be worth the trouble to go the latter, given there are no function arguments used anywhere, as the resulting code might be pretty darn close to using raw goto statements (is DMD smart enough to *not* create a stack frame without use of the 'naked' keyword?).

>
>Obviously this can be used to do regexps properly. I was hoping to use a clever naming scheme to avoid the alias issues from pragma's version, but it seems to be inevitable.

Heh... finding that problem was the worst part.

Don't forget how my solution has a tendency to hammer and pollute the symbol namespace for large expressions.  It looks like you managed to work around some of that as well.  Again, I'm floored by how much more can be gained via nested functions rather than a more "traditional" approach.

>
>It seems hard to do this properly without creating a highly optimised regexp parser!

Hehe.. so the optimal solution is also the most complete? ;)

- Eric Anderton at yahoo

February 23, 2006

Re: Template regexes, version 2

Posted by David Medlock
in reply to pragma

David Medlock

Posted in reply to pragma

pragma wrote:
> Don, I love this rendition.
> 
> In article <dtjv60$1tc9$1@digitaldaemon.com>, Don Clugston says...
> 
>>I should mention that the code I posted doesn't do real regexps, because it has no backtracking. It doesn't do *, +, ? properly, they are always greedy. It can never match ".*a", for example.
>>I was too focused on getting access to variables.
> 
> 
> I'm scratching my head as to how to fix the greedy-matching problem.  (my
> apologies if you've already mulled this over)
> <snip> Hehe.. so the optimal solution is also the most complete? ;)
> 
> - Eric Anderton at yahoo


One option is something like the following:
(Assuming Regex is a linked list of atomic regular expressions, and it knows if it is nullable)

bool match( string txt, Regex reg, bool delegate() backtrack )
{
  string tail ;
  // attempt to match the regex against the text
  bool success = reg.match( txt, tail );

  // this function skips the current regex and tries the txt on the successor
  bool inner_skip()
  {
    bool result = match( txt, reg.next, backtrack );
    if ( result==false ) return backtrack();
    else return result;
  }

  if ( success )
  {
    if ( reg.nullable ) return match( tail, reg.next, &inner_skip );
    else return match( tail, reg.next, backtrack );
  }
  else return ( backtrack is null ? false : backtrack() );
}

//call with :

match( text, regex, null )

A nice addition would be a callback delegate for matches.

-DavidM

February 24, 2006

Re: Template regexes, version 2

Posted by Walter Bright
in reply to Don Clugston

Walter Bright

Posted in reply to Don Clugston

Thanks!

February 24, 2006

Re: Template regexes, version 2

Posted by Don Clugston
in reply to pragma

Don Clugston

Posted in reply to pragma

pragma wrote:
> Don, I love this rendition.

Thanks! I'm just amazed at how this stuff has progressed, it's so easy now.
I'm a little uncomfortable that people seem to be credited me for the regexp parser which you wrote. I want to make sure that you receive proper acknowledgment.

> In article <dtjv60$1tc9$1@digitaldaemon.com>, Don Clugston says...
>> I should mention that the code I posted doesn't do real regexps, because it has no backtracking. It doesn't do *, +, ? properly, they are always greedy. It can never match ".*a", for example.
>> I was too focused on getting access to variables.
> 
> I'm scratching my head as to how to fix the greedy-matching problem.  (my
> apologies if you've already mulled this over)

> 
> One way to go is provide a switch for greedy and non-greedy behaviors at the
> interface level, such that a programmer can select the behavior.  Another is to
> adopt a different token sequence in the expression itself (I think some engines
> use "*?", "??", "{m,n}?" and "+?" for non-greedy matching).  
> 
> A third option, is to redesign the expression engine itself to work with
> multiple matches, such that each predicate is understood to match the left-hand
> side against n right-hand sides (as returned from other predicates).  This way,
> the expression ".*a" will match "a" in each sucessive substring following each
> successive attempt at matching "."; the result would be a "char[][]" containing
> zero or more matches of the "*" predicate expression.
> 
> 
>> However, the mixin technique means that the regexp is actually turned into a binary tree of nested functions.
>>
>> eg (ab.*|c)a
>> becomes something like:
>>
>> bool test(char [] searchstr)
>> {
>>   int p=0;
>>
>>   bool seq() {
>>     bool union() {
>>        bool seq() {
>>            bool char_a {}
>>            bool seq() {
>>              bool char_b {}
>>              bool star() {
>>                  bool dot()
>>              }
>>            }
>>         }
>>         bool char_c() {}
>>     }
>>     bool char_a() {}
>>   }
>>   return seq();
>> }
>>
>> so that in every case (including unions), the next function that needs to be called is always in scope. Intriguingly, it also has access to the local variables all the way down the tree. (Since of these levels can contain its own stack as an array, I'm pretty sure that a Turing machine can be created, it's likely to be more powerful than the state machines normally used for regexps).
> 
> Also, I'm curious to see what DMD would do to optimize that code.  Would some of
> those internals dissapear, or would they have to be anonymous functions in order
> for that to happen?  It might be worth the trouble to go the latter, given there
> are no function arguments used anywhere, as the resulting code might be pretty
> darn close to using raw goto statements (is DMD smart enough to *not* create a
> stack frame without use of the 'naked' keyword?).

I suspect it's not very optimised, since nested functions are not relevant for C++. But there's a lot of potential. If it can convert CALL; RET; sequences into JMP, the whole thing will collapse into an optimal mass of goto spaghetti, something like

    xxx
    call part2
    xxx
part2:
    xxx
    call part3
    xxx
part3:
    xxx
    ret

> 
>> Obviously this can be used to do regexps properly. I was hoping to use a clever naming scheme to avoid the alias issues from pragma's version, but it seems to be inevitable.
> 
> Heh... finding that problem was the worst part. 
> 
> Don't forget how my solution has a tendency to hammer and pollute the symbol
> namespace for large expressions.  It looks like you managed to work around some
> of that as well.  Again, I'm floored by how much more can be gained via nested
> functions rather than a more "traditional" approach.

Me too. It's just awesome how many language features are coming together in this: real compile time constants, template string parameters, nested functions, having ~ concatenation built into the language, mixins, alias template parameters, static if, constant folding, the !() notation for templates, array slices...
And C++ can't do any of those things. It's incredible.

Update: I got the aliases to work in about 20 minutes, without difficulties. Now it does proper non-greedy matches. Amazingly, the code is barely any more complicated. It still doesn't need explicit backtracking!

Greedy matching can be implemented by repeatedly attempting non-greedy matches, and finally redoing the last one which was successful. Because each step in the process has its own local variables, the only backtracking is restoring a copy of the current position.
I think that greedy matching is an inherently inefficient operation, it would be much nicer to have non-greedy as the default.

Consider the case
(expr1)*(expr2)
I think that what most engines do is keep matching (expr1) until they get a failure. Then try (expr2). If it fails, wind back (expr1) and try (expr2) again. Keep going until you've wound all the way back to the beginning. This gives only one successful match to expr2, but many unnecessary matches to expr1.

So it's good for cases like "b*(really_complex_regexp)"
But it's bad for cases like "(really_complex_regexp)*b"
Also, backtracking expr1 is a real pain.

My method is the reverse, it has no unnecessary matches to (expr1), but may have many for (expr2). I'm not sure that it's really inferior.

Ideally, at compile time you'd determine which of the expressions was more complex (perhaps just by counting the number of *,+, ? in each string) and choose between the two methods that way.

if something that matches (expr1) can ever be a partial match for (expr2); and if it can't, then you know you can just keep matching (expr1) until it fails, and only then evaluate (expr2). But I suspect that comparing two regexps is one of those NP-Hard problems from CompSci that's just not feasible to solve.

>> It seems hard to do this properly without creating a highly optimised regexp parser!
> 
> Hehe.. so the optimal solution is also the most complete? ;)

It certainly looks that way. And it seems to be the simplest, too. For quantifiers "{m,n}, *,+, ?", I used the (I thought) quick-and-dirty method of dealing with all of them at once. But a side-effect is that it's more optimal.

Also, I used functions with no parameters because it was quick and dirty. But it's optimal, too, and it's really easy to add more features.
It's a bizarre experience.

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation