regexp suggestion (page 2)

"Pavel Minayev" <evilone@omen.ru> wrote in message news:a443lq$147s$1@digitaldaemon.com...

> With my suggestion implemented, however, it'd look somewhat different. First I check for identifier, and get "foo123". Now I advance after the end of that token, and perform another check... when I get to "789", I check if it matches an identifier /\w.../ - it doesn't, so I check if it is a number /0-9+/ and succeed... that's how it is supposed to work.

If you're changing the regular expression you're searching for, which is what you're doing by switching from looking for an identifier to looking for a number, you'll need to create a new RegExp for each different regular expression. Then apply them as required to the remainder of the input string.

"Sean L. Palmer" <spalmer@iname.com> wrote in message news:a444t2$14qa$1@digitaldaemon.com...

> I think sscanf could do this if it could return a pointer to how far it got in the input string during processing in addition to how many fields were converted. sscanf as it exists in C is not so useful.

Also if sscanf would understand regexps... =)
That's why I suggest RegExp.scan();

February 09, 2002

Re: regexp suggestion

Posted by Pavel Minayev
in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:a446n4$15hm$1@digitaldaemon.com...

> If you're changing the regular expression you're searching for, which is what you're doing by switching from looking for an identifier to looking for a number, you'll need to create a new RegExp for each different regular expression. Then apply them as required to the remainder of the input string.

I pre-create them all in the form of an array:

    RegExp[] tokens;

    static this()
    {
        tokens =
            new RegExp('\w+', ""),    // word
            new RegExp('\d+', ""),    // number
            ...
    }

Now how do I apply them to the remainder of the input string (whatever this means)? I can of course first retrieve identifiers, and remove them from the array, then get rid of numbers, symbols... etc. But it would be damn slow.

This could also be done with a "regexp comparison" function, if there were one:

    // read a token
    for (int i = 0; i < tokens.length; i++)
    {
        // RegExp.cmp() returns the number of chars at the beginning
        // of given string that match the regexp, or 0 if no match
        int len = tokens[i].cmp(text[pos .. text.length]);
        if (len)
        {
            // match! tokens[i] tells us the type of the token
            token = text[pos .. pos + len];
            pos += len;
            break;
        }
    }

Regexp comparison is a good idea anyhow, IMO. Can be used for lots of different things.
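
A minimal sketch of the "regexp comparison" idea, written in Python rather than D because Python's `re` module already provides an anchored match: `Pattern.match(string, pos)` succeeds only if the match begins exactly at `pos`, which is what the proposed `RegExp.cmp()` would do. The token names, the patterns (in particular `[A-Za-z_]\w*` for identifiers) and the `tokenize` helper are assumptions made for illustration.

```python
import re

# Illustrative token table; the names and patterns are assumptions.
# Order matters: the first pattern that matches at the current position wins.
TOKENS = [
    ("identifier", re.compile(r"[A-Za-z_]\w*")),
    ("number", re.compile(r"\d+")),
    ("space", re.compile(r"\s+")),
]


def tokenize(text):
    """Split text into (kind, lexeme) pairs using anchored matches only."""
    pos = 0
    result = []
    while pos < len(text):
        for kind, pattern in TOKENS:
            # match() is anchored at pos: it never scans ahead, so it
            # plays the role of the proposed RegExp.cmp().
            m = pattern.match(text, pos)
            if m:
                if kind != "space":
                    result.append((kind, m.group()))
                pos = m.end()
                break
        else:
            # No pattern matched at pos.
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return result
```

With these assumed patterns, `tokenize("foo123 789")` returns `[("identifier", "foo123"), ("number", "789")]`. Each position is tested against each pattern at most once, and the input is never re-scanned from the start.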

>         tokens =
>             new RegExp('\w+', ""),    // word
>             new RegExp('\d+', ""),    // number
>             ...

Sorry =) This should of course look:

        tokens =
            new RegExp('\w+', "") ~    // word
            new RegExp('\d+', "") ~    // number
            ...

sscanf has a lot more power than most people realize. I myself didn't discover a lot of it until recently. But it won't tell you where it got to in the string.

Sean

"Pavel Minayev" <evilone@omen.ru> wrote in message news:a447tq$161o$1@digitaldaemon.com...

> "Sean L. Palmer" <spalmer@iname.com> wrote in message news:a444t2$14qa$1@digitaldaemon.com...
>
> > I think sscanf could do this if it could return a pointer to how far it got in the input string during processing in addition to how many fields were converted. sscanf as it exists in C is not so useful.
>
> Also if sscanf would understand regexps... =)
> That's why I suggest RegExp.scan();

On Sat, 9 Feb 2002 15:56:56 -0800, "Walter" <walter@digitalmars.com> wrote:

> All you have to do is:
>
>     r1 = new RegExp(...);
>
>     m1 = r1.match(input);
>     if (m1.length)
>         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);
>
> and so on...

Looks really awkward. Why doesn't the RegExp class have some query functions to hide the gore?

    r1 = new RegExp (...);
    r1.exec(input);
    x = r1.matches ();       //returns number of parenthesized matches
    tail = r1.tail ();       //returns portion of input after match
    m1 = r1.getMatch (n);    //returns the nth matching substring

Regular expressions are very powerful but can also be very complicated. Shouldn't the class help by providing well-named queries? In addition it would be more like PCRE, which is already well understood.

Karl Bochert
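
A rough sketch of what such query functions could look like, written in Python on top of its `re` module. The `Matcher` class and its method names are hypothetical; they mirror the queries proposed in this post and are not an existing API in D, PCRE or Python.

```python
import re


class Matcher:
    """Hypothetical wrapper exposing the queries suggested in the post."""

    def __init__(self, pattern):
        self._re = re.compile(pattern)
        self._input = ""
        self._m = None

    def exec(self, text):
        """Search text and remember the result; True if something matched."""
        self._input = text
        self._m = self._re.search(text)
        return self._m is not None

    def matches(self):
        """Number of parenthesized groups in the pattern (0 if no match)."""
        return len(self._m.groups()) if self._m else 0

    def tail(self):
        """Portion of the input after the match (whole input if no match)."""
        return self._input[self._m.end():] if self._m else self._input

    def get_match(self, n):
        """The nth matching substring; 0 is the whole match."""
        return self._m.group(n) if self._m else None
```

For example, after `r1 = Matcher(r"(\w+)=(\d+)")` and `r1.exec("x=42; y=7")`, `r1.matches()` is `2`, `r1.get_match(1)` is `"x"` and `r1.tail()` is `"; y=7"`, so the caller never has to do pointer arithmetic on the input.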

"Walter" <walter@digitalmars.com> wrote in message news:a44fdn$18t6$1@digitaldaemon.com...

> All you have to do is:
>
>     r1 = new RegExp(...);
>
>     m1 = r1.match(input);
>     if (m1.length)
>         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);
>
> and so on...

If the first token will be r2, and not r1, but there are some r1s further in the string, the first match() will skip the r2 and get the r1.

"Pavel Minayev" <evilone@omen.ru> wrote in message news:a45a2l$1kk4$1@digitaldaemon.com...

> "Walter" <walter@digitalmars.com> wrote in message news:a44fdn$18t6$1@digitaldaemon.com...
>
> > All you have to do is:
> >
> >     r1 = new RegExp(...);
> >
> >     m1 = r1.match(input);
> >     if (m1.length)
> >         m2 = r2.match(input[&m1[0][0] - &input[0] .. input.length]);
> >
> > and so on...
>
> If the first token will be r2, and not r1, but there are some r1s further in the string, the first match() will skip the r2 and get the r1.

Yes, but if you are using multiple RegExps on the same string, you need to decide which slices get searched for which patterns. If you are using one RegExp, just set the "g" attribute. If you use one RegExp to search for two different patterns, use parenthesized subexpressions, and the match[][] return will tell you which one was matched.

"Walter" <walter@digitalmars.com> wrote in message news:a45e05$1m8o$1@digitaldaemon.com...

> RegExp, just set the "g" attribute. If you use one RegExp to search for two different patterns, use parenthesized subexpressions, and the match[][] return will tell you which one was matched.

This will tokenize the string, but once I have all the tokens, there's - once again - the problem of how to determine the type of each token, having its regexp. Once again suppose the token was "foo666". Once again I need to check all possible versions, and if I check for the number first, I'll have a match - "666"... Of course a check can be done for starting position == 0 - which involves too many checks, IMO - or the regexp can have "^" inserted at the front... but even then, each token gets checked twice - first in RegExp.match(), then by my type detection routine. Wouldn't it be slow?

I'm not asking for much... just a version of test() with the for-loop removed.

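The single-RegExp approach with parenthesized subexpressions, and the worry above about checking each token twice, can be sketched in Python with one combined pattern. This assumes named groups as the "which subexpression matched" mechanism; the group names and the `classify` helper are illustrative and not part of the D RegExp class discussed in this thread.

```python
import re

# One combined pattern; the group names are illustrative assumptions.
# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(r"(?P<identifier>[A-Za-z_]\w*)|(?P<number>\d+)")


def classify(text):
    """Yield (kind, lexeme) for every token the combined pattern finds."""
    for m in TOKEN_RE.finditer(text):
        # lastgroup names the alternative that actually matched,
        # so no second pass over the token is needed.
        yield m.lastgroup, m.group()
```

Here `list(classify("foo666 42 bar"))` gives `[("identifier", "foo666"), ("number", "42"), ("identifier", "bar")]`: "foo666" is reported once, as an identifier, and the "666" inside it is never matched as a number. Note that `finditer` silently skips characters that match no alternative, much like a global search would.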