problems with regexps

Dec 06, 2003

Achim Schmitt

Dec 06, 2003

Adam Harper

Dec 07, 2003

Achim Schmitt

Dec 07, 2003

Adam Harper

Hi, this is my third or fourth program in "D". I want to split phobos.html (docs) into small chunks, to get something like the Python module index.

The output shows that one of the regexps does not match every time the filter "<h2>([^<]+)</h2>" should match. It finds "compiler" in "<a name="compiler"><h2>compiler</h2></a>", but it does not find "conv" in the next module's line...

    # ./split_module_docs
    Reading complete file...
    Slicing...
     len = 25
    num matches: 2 in: compiler
    2: <a name="compiler"><h2>compiler</h2></a>
    num matches: 0 in: 2: <a name="conv"><h2>conv</h2></a>
    num matches: 2 in: ctype
    2: <a name="ctype"><h2>ctype</h2></a>
    num matches: 0 in: 2: <a name="date"><h2>date</h2></a>
    num matches: 0 in: 2: <a name="file"><h2>file</h2></a>
    num matches: 2 in: gc
    2: <a name="gc"><h2>gc</h2></a>
    ..

Here is the source - what is going wrong? [and why do I have to call the dup method of an array without "()"?]

---------snip---------------
import std.file;
import std.regexp;

int main (char[][] args)
{
    char[] splitter = "<hr>";
    RegExp reSplitter   = new RegExp( splitter, "");
    RegExp reNewline    = new RegExp( "\n", "");
    RegExp rePhobosName = new RegExp( "<h2>([^<]+)</h2>", "");   // * FILTER *

    char[] phpbos_filename = "/data/tapo/dmd/html/d/phobos.html";

    printf("Reading complete file...\n");
    char[] buffer = cast(char[])std.file.read( phpbos_filename );   // :-o

    printf("Slicing...\n");
    char[][] slices = reSplitter.split( buffer );
    printf(" len = %d\n", slices.length );

    char[] line;

    // Now, for every slice we'll take line #2, and extract the module's name:
    for ( int i = 1; i < slices.length; i++ )
    {
        char[] element = slices[ i ];
        char[][] lines = reNewline.split( element );   // split into lines
        if (lines.length > 1)
        {
            line = lines[1].dup;
            char[][] matches = rePhobosName.match( line );   // extract name
            printf("num matches: %d in: ", matches.length);
            if (matches.length > 1)
            {
                printf("%.*s\n", matches[1]);   // print name
            }
            printf( "2: %.*s\n", lines[1] );   // no name???
        }
        else
        {
            printf("no lines in %d.\n", i);
        }
    }
    return 0;
}

December 06, 2003

Re: problems with regexps

Posted by Adam Harper
in reply to Achim Schmitt

On Sat, 06 Dec 2003 19:00:24 +0000, Achim Schmitt wrote:
> <snip>
> Here is the source - what is going wrong? [and why do I have to call the
> dup method of an array without "()"?]
> 
> <snip>

There's an issue with the current implementation of the RegExp class, which I'll explain in a bit. First, a fix for your program.

To get the expected behaviour you'll need to move the creation of the "rePhobosName" regular expression inside of the for loop (preferably within the "if" branch in which it actually gets used).

Now the explanation (which turned into a somewhat long incoherent ramble *sigh*, skip to the end for the [very] short of it).

You need to (re)create the regular expression before each test because of
the way the RegExp class works.  At present the RegExp class stores
(amongst others) two variables, "input" and "pmatch".  "input" is the string
it performed the last action on (be it match, search, etc.); "pmatch" is
an array of "regmatch_t"s (a struct containing two ints, one that
indicates the starting position within "input" of the match and one that
indicates the end).  Now, when you call "my_regex.match( somestr )" the
following happens (if the RegExp isn't configured with the global
attribute):

    - "input" gets set to "somestr"
    - "test( input, pmatch[0].rm_eo )" gets called; "pmatch[0].rm_eo" is
      the end position of the last match (its counterpart is
      "pmatch[0].rm_so", which is the start of the match)

This is fine and dandy, /if the RegExp hasn't been used before/, because
"pmatch[0].rm_eo" will be 0, thus telling "test" to start looking for a
match at the beginning of "input".  If the RegExp has been used before
then "pmatch[0].rm_eo" will be the end position of the last match /for the
previous input string/, which may be greater than the length of the
/new/ input string.  As a result, test will return 0, no match (after all
it can't find a match after the end of the string can it!).

In the context of your program, the following happens:

    - rePhobosName is initialised
      input = ""
      pmatch[0].rm_so, pmatch[0].rm_eo = 0

    - We loop through the headers from "phobos.html" trying to match each
      one against the "rePhobosName" regexp

    - We try and match '<a name="compiler"><h2>compiler</h2></a>', which
      succeeds. Now:

          input = '<a name="compiler"><h2>compiler</h2></a>'
          pmatch[0].rm_so = 19
          pmatch[0].rm_eo = 36

    - We try and match '<a name="conv"><h2>conv</h2></a>', which fails
      because the test tries to start at character 36 but the string is
      only 32 characters long! Now:

          input = '<a name="conv"><h2>conv</h2></a>'
          pmatch[0].rm_so = 0
          pmatch[0].rm_eo = 0

    - We try and match '<a name="ctype"><h2>ctype</h2></a>', which
      succeeds. Now:

          input = '<a name="ctype"><h2>ctype</h2></a>'
          pmatch[0].rm_so = 16
          pmatch[0].rm_eo = 30

    - We try and match '<a name="date"><h2>date</h2></a>', which fails
      because the test starts at character 30 and as a result only tries
      to find a match in the substring 'a>'.  Now:

          input = '<a name="date"><h2>date</h2></a>'
          pmatch[0].rm_so = -1
          pmatch[0].rm_eo = -1

      (Note: the pmatch values get set to -1 for no match but 0 when a
             test cannot be performed, as was the case with the "conv"
             header.)

    - We try and match '<a name="file"><h2>file</h2></a>', which fails
      because the start index (pmatch[0].rm_eo, which currently equals
      -1) is invalid.  Now:

          input = '<a name="file"><h2>file</h2></a>'
          pmatch[0].rm_so = 0
          pmatch[0].rm_eo = 0

The above sequence (one match followed by two failures then one match, etc.) then repeats until we run out of headers.

The fix for Phobos would be to reset pmatch[0].rm_so and pmatch[0].rm_eo to 0 every time the input value is changed.
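
For anyone who wants to poke at this, the walkthrough above can be reproduced with a toy model. This is only a sketch, not the actual Phobos code: "HeaderMatcher", "resetOnNewInput" and "show" are made-up names, the pattern <h2>([^<]+)</h2> is hard-coded, only the first "<h2>" after the start offset is tried, and "lastEnd" merely plays the role of pmatch[0].rm_eo. It is written against the std.string, std.algorithm and std.stdio modules of a present-day D compiler rather than std.regexp, which is an assumption on my part.

```d
import std.algorithm.searching : startsWith;
import std.stdio : writefln;
import std.string : indexOf;

// Toy stand-in for the RegExp state described above (hypothetical type).
struct HeaderMatcher
{
    bool resetOnNewInput;   // hypothetical switch: true models the proposed fix
    ptrdiff_t lastEnd = 0;  // plays the role of pmatch[0].rm_eo

    // Returns the captured module name, or null when nothing matched.
    string match(string input)
    {
        if (resetOnNewInput)
            lastEnd = 0;    // forget the offset left over from the previous input

        immutable start = lastEnd;
        if (start < 0 || cast(size_t) start > input.length)
        {
            lastEnd = 0;    // the test could not be performed at all
            return null;
        }

        // Look for "<h2>" only from the (possibly stale) start offset onwards.
        immutable open = input[start .. $].indexOf("<h2>");
        if (open >= 0)
        {
            immutable nameStart = start + open + 4;
            immutable nameLen = input[nameStart .. $].indexOf("<");
            if (nameLen > 0 && input[nameStart + nameLen .. $].startsWith("</h2>"))
            {
                lastEnd = nameStart + nameLen + 5;   // end of "</h2>"
                return input[nameStart .. nameStart + nameLen];
            }
        }
        lastEnd = -1;       // the test ran but found nothing
        return null;
    }
}

// Printable form of a match result.
string show(string s)
{
    return s is null ? "(none)" : s;
}

void main()
{
    string[] lines = [
        `<a name="compiler"><h2>compiler</h2></a>`,
        `<a name="conv"><h2>conv</h2></a>`,
        `<a name="ctype"><h2>ctype</h2></a>`,
        `<a name="date"><h2>date</h2></a>`,
        `<a name="file"><h2>file</h2></a>`,
        `<a name="gc"><h2>gc</h2></a>`,
    ];

    HeaderMatcher stale;                // one object reused, as in the original program
    auto reset = HeaderMatcher(true);   // one object reused, offset reset per input

    foreach (line; lines)
    {
        auto fresh = HeaderMatcher();   // workaround: a new matcher for every line
        writefln("stale: %-9s fresh: %-9s reset: %s",
                 show(stale.match(line)), show(fresh.match(line)),
                 show(reset.match(line)));
    }
}
```

With the six headers from the original output, the "stale" column should find only compiler, ctype and gc, while the "fresh" column (the workaround of recreating the object) and the "reset" column (the proposed fix) find every name.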

Thanks for your explanation, Adam Harper!

That is not good. A language must have easy regexps to be liked by me :-).

There's
    - char[][] match(char[] string)
and
    - char[][] exec(char[] string).

Besides finding a better name for exec [matchagain? :)], there's one method to start searching and another one to continue searching.

The first should reset the search.

Will someone correct that?
Say "yes", and I'll fall in love with "D"...

AS.

Achim Schmitt wrote:
> Thanks for your explanation, Adam Harper!
>
> That is not good. A language must have easy regexps to be liked by me :-).
>
> There's - char[][] match(char[] string)
> and
> - char[][] exec(char[] string).
>
> Besides finding a better name for exec [matchagain? :)], there's one method to
> start searching and another one to continue searching.
>
> The first should reset the search.

Agreed.

> Will someone correct that?
> Say "yes", and I'll fall in love with "D"...

I've started a new topic on this with a patch that works but has a workaround for a bug I can't identify. We'll have to see what others make of it; either way, I'm sure Walter will have it fixed shortly.

>
> AS.
