std.regex with multiple matches

Apr 21, 2011

David Gileadi

Apr 21, 2011

Dmitry Olshansky

Apr 21, 2011

David Gileadi

Apr 21, 2011

Kai Meyer

Apr 21, 2011

David Gileadi

I was using std.regex yesterday, matching a regular expression against a string with the "g" flag to find multiple matches. As the example from the docs shows (BTW I think the example may be wrong; I think it needs the "g" flag added to the regex call), you can do a foreach loop on the matches like:

foreach(m; match("abcabcabab", regex("ab")))
{
    writefln("%s[%s]%s", m.pre, m.hit, m.post);
}

Each match "m" is a RegexMatch, which includes .pre, .hit, and .post properties to return ranges of everything before, inside, and after the match.

However what I really wanted was a way to get the range between matches, i.e. since I had multiple matches I wanted something like m.upToNextMatch.

Since I'm not very familiar with ranges, am I missing some obvious way of doing this with the existing .pre, .hit and .post properties?

-Dave

On 21.04.2011 21:43, David Gileadi wrote:
> I was using std.regex yesterday, matching a regular expression against a string with the "g" flag to find multiple matches. As the example from the docs shows (BTW I think the example may be wrong; I think it needs the "g" flag added to the regex call), you can do a foreach loop on the matches like:
>
> foreach(m; match("abcabcabab", regex("ab")))
> {
>     writefln("%s[%s]%s", m.pre, m.hit, m.post);
> }
>
> Each match "m" is a RegexMatch, which includes .pre, .hit, and .post properties to return ranges of everything before, inside, and after the match.
>
> However what I really wanted was a way to get the range between matches, i.e. since I had multiple matches I wanted something like m.upToNextMatch.

I might be wrong but I think that you are looking for std.regex.splitter:

auto s1 = ", abc, de, fg, hi,";
assert(equal(splitter(s1, regex(", *")),
    ["","abc","de","fg","hi",""][]));

Simply put, it gets you a range of slices of the input, separated by the regex matches.

>
> Since I'm not very familiar with ranges, am I missing some obvious way of doing this with the existing .pre, .hit and .post properties?
>
> -Dave

--
Dmitry Olshansky
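For readers who want to try the splitter suggestion, here is a minimal self-contained sketch. It assumes a reasonably current Phobos; the helper name splitPieces is made up for illustration and is not part of std.regex.

```d
import std.array : array;
import std.regex : regex, splitter;
import std.stdio : writefln;

// Hypothetical helper (not in std.regex): collects the slices of `input`
// that lie between matches of `pattern` into an array.
string[] splitPieces(string input, string pattern)
{
    // splitter returns a lazy range of slices; .array makes it eager.
    return splitter(input, regex(pattern)).array;
}

void main()
{
    // Same input and pattern as the assert in the post above.
    foreach (piece; splitPieces(", abc, de, fg, hi,", ", *"))
        writefln("'%s'", piece);
}
```

Note that splitter hands back only the text between matches; the matches themselves, including any captures, are not available this way.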

On 4/21/11 11:36 AM, Dmitry Olshansky wrote:
> On 21.04.2011 21:43, David Gileadi wrote:
>> I was using std.regex yesterday, matching a regular expression against
>> a string with the "g" flag to find multiple matches. As the example
>> from the docs shows (BTW I think the example may be wrong; I think it
>> needs the "g" flag added to the regex call), you can do a foreach loop
>> on the matches like:
>>
>> foreach(m; match("abcabcabab", regex("ab")))
>> {
>>     writefln("%s[%s]%s", m.pre, m.hit, m.post);
>> }
>>
>> Each match "m" is a RegexMatch, which includes .pre, .hit, and .post
>> properties to return ranges of everything before, inside, and after
>> the match.
>>
>> However what I really wanted was a way to get the range between
>> matches, i.e. since I had multiple matches I wanted something like
>> m.upToNextMatch.
> I might be wrong but I think that you are looking for std.regex.splitter:
>
> auto s1 = ", abc, de, fg, hi,";
> assert(equal(splitter(s1, regex(", *")),
>     ["","abc","de","fg","hi",""][]));
>
> Simply put it gets you range of slices of input separated by regex matches.

I considered that but I also need the content of the matches--the captures, etc.

April 21, 2011

Re: std.regex with multiple matches

Posted by Kai Meyer
in reply to David Gileadi


On 04/21/2011 11:43 AM, David Gileadi wrote:
> I was using std.regex yesterday, matching a regular expression against a
> string with the "g" flag to find multiple matches. As the example from
> the docs shows (BTW I think the example may be wrong; I think it needs
> the "g" flag added to the regex call), you can do a foreach loop on the
> matches like:
>
> foreach(m; match("abcabcabab", regex("ab")))
> {
> writefln("%s[%s]%s", m.pre, m.hit, m.post);
> }
>
> Each match "m" is a RegexMatch, which includes .pre, .hit, and .post
> properties to return ranges of everything before, inside, and after the
> match.
>
> However what I really wanted was a way to get the range between matches,
> i.e. since I had multiple matches I wanted something like m.upToNextMatch.
>
> Since I'm not very familiar with ranges, am I missing some obvious way
> of doing this with the existing .pre, .hit and .post properties?
>
> -Dave

There are two ways I can think of off the top of my head.

I don't think D supports "look ahead", but if it did you could match something, then capture the portion afterwards (in m.captures[1]) that matches everything up until the look ahead (which is what you matched in the first place).

Otherwise, you could manually capture the ranges like this (captures the first word character after each word boundary, then prints the remaining portion of the word until the next word boundary followed by a word character):

import std.stdio;
import std.regex;

void main()
{
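    size_t last_pos;  // start offset of the previous match
    size_t last_size; // length of the previous match
    string abc = "the quick brown fox jumped over the lazy dog";
    // The "g" flag asks for every match rather than only the first one.
    foreach(m; match(abc, regex(r"\b\w", "g")))
    {
        // m.pre.length is the start offset of the current match, so this
        // slice is the text between the previous match and this one.
        writefln("between: '%s'", abc[last_pos + last_size..m.pre.length]);
        writefln("%s[%s]%s", m.pre, m.hit, m.post);
        last_size = m.hit.length;
        last_pos = m.pre.length;
    }
    // Whatever follows the final match.
    writefln("between: '%s'", abc[last_pos + last_size..$]);
}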
// Prints:
// between: ''
// [t]he quick brown fox jumped over the lazy dog
// between: 'he '
// the [q]uick brown fox jumped over the lazy dog
// between: 'uick '
// the quick [b]rown fox jumped over the lazy dog
// between: 'rown '
// the quick brown [f]ox jumped over the lazy dog
// between: 'ox '
// the quick brown fox [j]umped over the lazy dog
// between: 'umped '
// the quick brown fox jumped [o]ver the lazy dog
// between: 'ver '
// the quick brown fox jumped over [t]he lazy dog
// between: 'he '
// the quick brown fox jumped over the [l]azy dog
// between: 'azy '
// the quick brown fox jumped over the lazy [d]og
// between: 'og'

If you replace '\b\w' with '\s' it should help illuminate the way it works:

between: 'the'
the[ ]quick brown fox jumped over the lazy dog
between: 'quick'
the quick[ ]brown fox jumped over the lazy dog
between: 'brown'
the quick brown[ ]fox jumped over the lazy dog
between: 'fox'
the quick brown fox[ ]jumped over the lazy dog
between: 'jumped'
the quick brown fox jumped[ ]over the lazy dog
between: 'over'
the quick brown fox jumped over[ ]the lazy dog
between: 'the'
the quick brown fox jumped over the[ ]lazy dog
between: 'lazy'
the quick brown fox jumped over the lazy[ ]dog
between: 'dog'
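The same bookkeeping can be wrapped in a small helper that returns each match's captures together with the text that precedes it, which is roughly the m.upToNextMatch idea asked about at the top of the thread. This is only a sketch: it assumes a current Phobos where std.regex.matchAll is available, and the names Segment and segments are invented for illustration, not part of std.regex.

```d
import std.regex : matchAll, regex;
import std.stdio : writefln;

// Hypothetical helper type, not part of std.regex.
struct Segment
{
    string between;  // text between the previous match and this one
    string[] groups; // groups[0] is the whole hit, groups[1 .. $] the captures
}

// Hypothetical helper: pairs every match with the text that precedes it.
// The final element has no groups and holds the text after the last match.
Segment[] segments(string text, string pattern)
{
    Segment[] result;
    size_t prevEnd = 0; // index just past the previous match

    foreach (m; matchAll(text, regex(pattern)))
    {
        // m.pre is everything before this match, so its length is the
        // offset at which the match starts.
        immutable start = m.pre.length;

        // Copy the whole hit and each capture group out of the match.
        string[] groups;
        foreach (i; 0 .. m.length)
            groups ~= m[i];

        result ~= Segment(text[prevEnd .. start], groups);
        prevEnd = start + m.hit.length;
    }

    result ~= Segment(text[prevEnd .. $], null);
    return result;
}

void main()
{
    foreach (seg; segments("the quick brown fox", r"\b(\w)"))
        writefln("between: '%s' groups: %s", seg.between, seg.groups);
}
```

Keeping the offsets as plain indices into the original string, rather than storing the .pre and .post slices, means each "between" slice is taken once and nothing is copied.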

On 4/21/11 1:29 PM, Kai Meyer wrote:
> On 04/21/2011 11:43 AM, David Gileadi wrote:
>> I was using std.regex yesterday, matching a regular expression against a
>> string with the "g" flag to find multiple matches. As the example from
>> the docs shows (BTW I think the example may be wrong; I think it needs
>> the "g" flag added to the regex call), you can do a foreach loop on the
>> matches like:
>>
>> foreach(m; match("abcabcabab", regex("ab")))
>> {
>>     writefln("%s[%s]%s", m.pre, m.hit, m.post);
>> }
>>
>> Each match "m" is a RegexMatch, which includes .pre, .hit, and .post
>> properties to return ranges of everything before, inside, and after the
>> match.
>>
>> However what I really wanted was a way to get the range between matches,
>> i.e. since I had multiple matches I wanted something like
>> m.upToNextMatch.
>>
>> Since I'm not very familiar with ranges, am I missing some obvious way
>> of doing this with the existing .pre, .hit and .post properties?
>>
>> -Dave
>
> There are two ways I can think of off the top of my head.
>
> I don't think D supports "look ahead", but if it did you could match
> something, then capture the portion afterwards (in m.captures[1]) that
> matches everything up until the look ahead (which is what you matched in
> the first place).
>
> Otherwise, you could manually capture the ranges like this (captures the
> first word character after each word boundary, then prints the remaining
> portion of the word until the next word boundary followed by a word
> character):

(snip an excellent explanation)

Ahh yes, that's a good way of doing it--track the lengths and slice the original array to get the "betweens". Thanks for the insight!
