Posted by Adam Harper in reply to Achim Schmitt
On Sat, 06 Dec 2003 19:00:24 +0000, Achim Schmitt wrote:
> <snip>
> Here is the source - what is going wrong? [and why do I have to call the
> dup method of an array without "()"?]
>
> <snip>
There's an issue with the current implementation of the RegExp class, which I'll explain in a bit; first, a fix for your program.
To get the expected behaviour you'll need to move the creation of the "rePhobosName" regular expression inside of the for loop (preferably within the "if" branch in which it actually gets used).
Now the explanation (which turned into a somewhat long incoherent ramble *sigh*, skip to the end for the [very] short of it).
You need to (re)create the regular expression before each test because of
the way the RegExp class works. At present the RegExp class stores
(amongst others) two variables, "input" and "pmatch". "input" is the
string it performed the last action on (be it match, search, etc.);
"pmatch" is an array of "regmatch_t"s (a struct containing two ints, one
that indicates the starting position within "input" of the match and one
that indicates the end). Now, when you call "my_regex.match( somestr )" the
following happens (if the RegExp isn't configured with the global
attribute):
- "input" gets set to "somestr"
- "test( input, pmatch[0].rm_eo )" gets called; "pmatch[0].rm_eo" is
  the end position of the last match (its counterpart is
  "pmatch[0].rm_so", which is the start of the match)
This is fine and dandy, /if the RegExp hasn't been used before/, because
"pmatch[0].rm_eo" will be 0, thus telling "test" to start looking for a
match at the beginning of "input". If the RegExp has been used before
then "pmatch[0].rm_eo" will be the end position of the last match /for the
previous input string/, which may be greater than the length of the
/new/ input string. As a result, "test" will return 0, i.e. no match (after
all, it can't find a match after the end of the string, can it!).
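To make the stale-offset problem concrete, here is a minimal sketch in Python rather than D. It does not use the Phobos RegExp class at all; it only shows what happens when a search is started from the end offset of a match in a different, longer string. The pattern is an assumption (the original regexp was snipped); it was chosen because it reproduces the offsets quoted in the walk-through.

```python
import re

# Assumed pattern: the real "rePhobosName" regexp was snipped from the post.
header_re = re.compile(r"<h2>.*?</h2>")

first = '<a name="compiler"><h2>compiler</h2></a>'
second = '<a name="conv"><h2>conv</h2></a>'

# First use: search from offset 0 and remember where the match ended.
m = header_re.search(first, 0)
stale_end = m.end()

# Reusing that offset on a shorter string cannot find anything...
stale_result = header_re.search(second, stale_end)
# ...whereas starting again from 0 finds the header as expected.
fresh_result = header_re.search(second, 0)

print(stale_end, len(second))   # 36 32
print(stale_result)             # None
print(fresh_result.group(0))    # <h2>conv</h2>
```

The only point of the sketch is that the start offset belongs to the string it was computed from; carrying it over to a new input is what breaks the second match.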
In the context of your program, the following happens:
- rePhobosName is initialised
    input = ""
    pmatch[0].rm_so, pmatch[0].rm_eo = 0
- We loop through the headers from "phobos.html" trying to match each
  one against the "rePhobosName" regexp
  - We try and match '<a name="compiler"><h2>compiler</h2></a>', which
    succeeds. Now:
      input = '<a name="compiler"><h2>compiler</h2></a>'
      pmatch[0].rm_so = 19
      pmatch[0].rm_eo = 36
  - We try and match '<a name="conv"><h2>conv</h2></a>', which fails
    because the test tries to start at character 36 but the string is
    only 32 characters long! Now:
      input = '<a name="conv"><h2>conv</h2></a>'
      pmatch[0].rm_so = 0
      pmatch[0].rm_eo = 0
  - We try and match '<a name="ctype"><h2>ctype</h2></a>', which
    succeeds. Now:
      input = '<a name="ctype"><h2>ctype</h2></a>'
      pmatch[0].rm_so = 16
      pmatch[0].rm_eo = 30
  - We try and match '<a name="date"><h2>date</h2></a>', which fails
    because the test starts at character 30 and as a result only tries
    to find a match in the substring 'a>'. Now:
      input = '<a name="date"><h2>date</h2></a>'
      pmatch[0].rm_so = -1
      pmatch[0].rm_eo = -1
    (Note: the pmatch values get set to -1 for no match but 0 when a
    test cannot be performed, as was the case with the "conv"
    header.)
  - We try and match '<a name="file"><h2>file</h2></a>', which fails
    because the start index (pmatch[0].rm_eo, which currently equals
    -1) is invalid. Now:
      input = '<a name="file"><h2>file</h2></a>'
      pmatch[0].rm_so = 0
      pmatch[0].rm_eo = 0
The above sequence (one match followed by two failures then one match, etc.) then repeats until we run out of headers.
The fix for Phobos would be to reset pmatch[0].rm_so and pmatch[0].rm_eo to 0 every time the input value is changed.
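The walk-through and the proposed fix can be modelled with a small toy class, again in Python and again only as a sketch. "ToyRegExp", its "reset_on_new_input" flag and the pattern are all hypothetical names invented for this illustration; the rules for when the offsets become 0 or -1 are inferred from the trace above, not taken from the Phobos source.

```python
import re

class ToyRegExp:
    """Toy model of the behaviour described above; NOT the Phobos code."""

    def __init__(self, pattern, reset_on_new_input=False):
        self.re = re.compile(pattern)
        self.reset_on_new_input = reset_on_new_input
        self.input = ""
        self.rm_so = 0  # stands in for pmatch[0].rm_so
        self.rm_eo = 0  # stands in for pmatch[0].rm_eo

    def match(self, s):
        # Proposed fix: forget the old offsets when the input changes.
        if self.reset_on_new_input and s != self.input:
            self.rm_so = self.rm_eo = 0
        self.input = s
        start = self.rm_eo
        if start < 0 or start > len(s):
            # The test cannot be performed: offsets fall back to 0.
            self.rm_so = self.rm_eo = 0
            return False
        m = self.re.search(s, start)
        if m is None:
            # The test ran but found nothing: offsets become -1.
            self.rm_so = self.rm_eo = -1
            return False
        self.rm_so, self.rm_eo = m.start(), m.end()
        return True

PATTERN = r"<h2>.*?</h2>"  # assumed; the original regexp was snipped
HEADERS = [
    '<a name="compiler"><h2>compiler</h2></a>',
    '<a name="conv"><h2>conv</h2></a>',
    '<a name="ctype"><h2>ctype</h2></a>',
    '<a name="date"><h2>date</h2></a>',
    '<a name="file"><h2>file</h2></a>',
]

def trace(rx):
    # One (matched, rm_so, rm_eo) tuple per header, in order.
    return [(rx.match(h), rx.rm_so, rx.rm_eo) for h in HEADERS]

buggy = trace(ToyRegExp(PATTERN))
fixed = trace(ToyRegExp(PATTERN, reset_on_new_input=True))

for row in buggy:
    print(row)
# (True, 19, 36)
# (False, 0, 0)
# (True, 16, 30)
# (False, -1, -1)
# (False, 0, 0)
print(all(matched for matched, _, _ in fixed))  # True
```

Without the reset, the model reproduces the same offsets as the walk-through; with it, every header matches. Creating a fresh object for each header (the workaround suggested for your program) has the same effect, since a new object always starts with both offsets at 0.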