regexp suggestion (page 3)

February 10, 2002

Re: regexp suggestion

Posted by Karl Bochert
in reply to Pavel Minayev


On Sun, 10 Feb 2002 17:54:52 +0300, "Pavel Minayev" <evilone@omen.ru> wrote:
> "Walter" <walter@digitalmars.com> wrote in message news:a45e05$1m8o$1@digitaldaemon.com...
> 
> > RegExp, just set the "g" attribute. If you use one RegExp to search for two
> > different patterns, use parenthesized subexpressions, and the match[][] return will tell you which one was matched.
> 
> This will tokenize the string, but once I have all the tokens, there's - once again - the problem how to determine the type of each token, having its regexp. Once again suppose the token was "foo666". Once again I need to check all possible versions, and if I check for the number first, I'll have a match - "666"... of course a check can be done for starting position == 0 - which involves too many checks, IMO, or the regexp can have "^" inserted at the front... but even then, each token gets checked twice - first in the RegExp.match(), then by my type detection routine. Wouldn't it be slow?
> 
> I'm not asking for much... just the version of test() with for-loop
> removed.
> 

I may be missing the point here, but:

The power of regular expressions is their ability to search for multiple
patterns at once. If the next thing in the input is either a number or
a word which could have embedded digits, then

    "\w[\w\d]*"                matches a word
    "\d+"                      matches a number
    "(\w[\w\d]*)|(\d+)"        matches a word or a number

and

    "[\t ]*(\w[\w\d]*)|(\d+)"  matches any spaces followed by a word or a number.

In the last 2 cases, the result of the search is up to 3 substrings: the overall
match, and the substrings within the parentheses. Perform the search, and
the lengths of the substrings will tell you what you found.

Documentation on standard regular expressions can be found at: http://compy.ww.tu-berlin.de/doc/packages/pcre/pcre.html among many other places.
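
As a rough illustration of the idea in this post, here is a sketch in Python's re module rather than D's RegExp (an assumption made only so the example can be run). Python's \w also matches digits, so the word alternative below spells out its first character as [A-Za-z_], and the alternation is wrapped in a non-capturing group so that the leading [\t ]* applies to both branches. The name classify_first is hypothetical. Note that Python reports a group that took no part in the match as None rather than as an empty substring.

```python
import re

# Assumption: Python's re, not D's RegExp. The first character of a word is
# spelled out because \w in Python would also accept a digit there.
TOKEN = re.compile(r"[\t ]*(?:([A-Za-z_]\w*)|(\d+))")


def classify_first(text):
    """Return (kind, value) for the token at the start of text, or None."""
    m = TOKEN.match(text)
    if m is None:
        return None
    # Exactly one of the two groups took part in the match.
    if m.group(1) is not None:
        return ("word", m.group(1))
    return ("number", m.group(2))
```

A single search both finds the token and says which alternative produced it, so no second check per token is needed.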

"Karl Bochert" <kbochert@ix.netcom.com> wrote in message news:1103_1013361883@bose... > In the last 2 cases, the result of the search is up to 3 substrings : the overall > match, and the substrings within the parentheses. Perform the search and then the lengths of the substrings will tell you what you found. How can these lengths tell? Token type is determined by the forming characters (described by regexp in my case), not by the length - or am I missing something? Suppose the input was: foo bar123 456 baz Now I get the following tokens: "foo", "bar123", "baz", "123", "456" How do I know that "123" is not supposed to be here?

On Sun, 10 Feb 2002 20:32:11 +0300, "Pavel Minayev" <evilone@omen.ru> wrote:
> "Karl Bochert" <kbochert@ix.netcom.com> wrote in message news:1103_1013361883@bose...
>
> > In the last 2 cases, the result of the search is up to 3 substrings: the overall
> > match, and the substrings within the parentheses. Perform the search and then
> > the lengths of the substrings will tell you what you found.
>
> How can these lengths tell? Token type is determined by the forming characters (described by regexp in my case), not by the length - or am I missing something? Suppose the input was:
>
> foo bar123 456 baz
>
> Now I get the following tokens:
>
> "foo", "bar123", "baz", "123", "456"
>
> How do I know that "123" is not supposed to be here?
>

I probably have some details wrong here, but:

Declare a regular expression:

    p = Regexp( "(\w[\w\d]*)|(\d+)" )

then:

    p.match ("123test")

produces 3 substrings:

    "123"  -- the overall match
    ""     -- the match for the first set of parens
    "123"  -- the match for the second set of parens

In PCRE (the common C implementation) the substrings are returned as an array of pointers into the string (6 in this case). I suspect D returns an equivalent array of offsets (slices?) into the string?

The non-zero length of the third substring shows that a number ("\d+") was found.

In your example:

    p.exec ("foo bar123 baz 123");

produces:

    "foo"
    "foo"
    ""

and:

    p.exec ("bar123 baz 123")

produces:

    "bar123"
    "bar123"
    ""

and:

    p.exec ("123 456");

produces:

    "123"
    ""
    "123"

I have used exec() here because it is probably the same as PCRE's exec function. I have read the RegExp documentation but do not understand the difference between the exec() and match() methods. Maybe match() is just exec() anchored to the start of the text?

Karl
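
The token-by-token stepping described in this post can be sketched as follows. This is again Python's re standing in for D's RegExp, which is an assumption, and tokenize and KINDS are hypothetical names. Each match is anchored at the current offset and the offset then moves past the whole match, so the "123" inside "bar123" is never reported as a separate token. Python's Match.lastindex gives the number of the group that matched, which plays the role of checking which substring is non-empty.

```python
import re

# Assumption: Python's re, not D's RegExp. Group 1 is a word, group 2 a number.
TOKEN = re.compile(r"([A-Za-z_]\w*)|(\d+)")
KINDS = {1: "word", 2: "number"}  # group index -> hypothetical type name


def tokenize(text):
    """Yield (kind, value) pairs, consuming the input left to right."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # skip separators between tokens
            continue
        m = TOKEN.match(text, pos)  # anchored at pos, not a search
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        yield KINDS[m.lastindex], m.group(0)
        pos = m.end()  # continue after the whole token
```

For the input "foo bar123 456 baz" this yields three words and one number, each classified by the same match that found it.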

"Karl Bochert" <kbochert@ix.netcom.com> wrote in message news:1103_1013375566@bose... > In your example: > p.exec (foo bar123 baz 123); > produces: > "foo" > "foo" > "" > > and: > p.exec ("bar123 baz 123") > produces: > "bar123" > "bar123" > "" > > and: > p.exec ("123 456"); > produces: > "123" > "" > "123" Yep, right. Now I have all the tokens, how do I determine the _type_ of each (identifier, number, string...), with regexp describing those types?

"Pavel Minayev" <evilone@omen.ru> wrote in message news:a461ka$1tv0$1@digitaldaemon.com... > "Walter" <walter@digitalmars.com> wrote in message news:a45e05$1m8o$1@digitaldaemon.com... > > > RegExp, just set the "g" attribute. If you use one RegExp to search for > two > > different patterns, use parenthesized subexpressions, and the math[][] return will tell you which one was matched. > > This will tokenize the string, but once I have all the tokens, there's - once again - the problem how to determine the type of each token, having its regexp. That's not a problem with parenthesized subexpressions. You can tell which one got the match by the index in match[][]. The second index 0 is the overall match, subsequent indices are the matches for each subexpression.

"Karl Bochert" <kbochert@ix.netcom.com> wrote in message news:1103_1013375566@bose... > I have used exec() here because it is probably the same as PCRE's exec > function. I have read the RegExp documentation but do not understand the > difference between the exec() and match() methods. Maybe match() is > just exec() anchored to the start of the text? There is no difference if the global attribute is set. If the global attribute is not set, then match returns an array of all the matches in the input.

On Sun, 10 Feb 2002 15:47:34 -0800, "Walter" <walter@digitalmars.com> wrote:
>
> "Karl Bochert" <kbochert@ix.netcom.com> wrote in message news:1103_1013375566@bose...
> > I have used exec() here because it is probably the same as PCRE's exec
> > function. I have read the RegExp documentation but do not understand the
> > difference between the exec() and match() methods. Maybe match() is
> > just exec() anchored to the start of the text?
>
> There is no difference if the global attribute is set. If the global attribute is not set, then match returns an array of all the matches in the input.
>

I think I understand. match() without the global attribute set finds all matches in the subject string, but loses the 'which substring' information. That might explain Pavel's problem -- to parse the next token and get its type info he should use exec() or global match().

Karl Bochert

"Walter" <walter@digitalmars.com> wrote in message news:a470kt$2art$1@digitaldaemon.com... > That's not a problem with parenthesized subexpressions. You can tell which one got the match by the index in match[][]. The second index 0 is the overall match, subsequent indices are the matches for each subexpression. Walter, where is that match[][] thing? match() returns char[][], which ain't what I need...

"Pavel Minayev" <evilone@omen.ru> wrote in message news:a47ir2$2i4j$1@digitaldaemon.com... > "Walter" <walter@digitalmars.com> wrote in message news:a470kt$2art$1@digitaldaemon.com... > > > That's not a problem with parenthesized subexpressions. You can tell which > > one got the match by the index in match[][]. The second index 0 is the overall match, subsequent indices are the matches for each subexpression. > > Walter, where is that match[][] thing? match() returns char[][], which > ain't what I need... It sounds like just what you need. I guess I just don't understand what's wrong.

"Walter" <walter@digitalmars.com> wrote in message news:a47r1i$2lhb$1@digitaldaemon.com... > It sounds like just what you need. I guess I just don't understand what's wrong. char[][] is the list of tokens, or, to be more exact, the list of their _values_. But how do I know their _types_ (string or number or ..)? Suppose the regexp was: ([A-Za-z_]+|0-9+) And I get 10 tokens. How do I tell if the first matched [A-Za-z_]+ part or the 0-9+ part, without checking it separately (which results in two checks per token)?
