Empty subexpressions captures in std.regex
July 11, 2010
Hi, I've been lurking in this group for a few months, have read through TDPL (which is great, Andrei) and have started using D for some small programs. So far it's been a joy to use (you may have a C++ convert on your hands), and with the convenience of rdmd I've been using it where I'd normally use a scripting language.

It's been pretty good for this, especially as Phobos has covered almost everything I've wanted to do. I have run into some issues with std.regex matching empty subexpressions, though (dmd 2.047, win32):

    auto r1 = regex( "(a*)b" );
    auto m = match( "b", r1 );
    writefln( "captures = %s, empty = %s", m.captures.length, m.empty );

=> captures = 0, empty = true

If I disable the call to optimize, it gives the expected results:

=> captures = 2, empty = false
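
For anyone wanting to reproduce this, here is a minimal self-contained sketch of the first case with the expected values from above written as assertions. It assumes a Phobos build where `match`, `captures` and `empty` behave as expected; on an affected build (such as the 2.047 one described here) the assertions should fail instead.

```d
import std.regex : regex, match;
import std.stdio : writefln;

void main()
{
    auto r1 = regex("(a*)b");
    auto m = match("b", r1);

    // Expected: the match succeeds, group 0 is the whole match "b"
    // and group 1 is the empty string matched by (a*).
    assert(!m.empty);
    assert(m.captures.length == 2);
    assert(m.captures[1].length == 0);

    writefln("captures = %s, empty = %s", m.captures.length, m.empty);
}
```

Saving this to a file and running it with rdmd should be enough to check a given compiler/Phobos combination.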

Also, with optimize disabled:

    auto r = regex("([^,]*),([^,]*),([^,]*)");
    m = match( ",,", r );
    writefln( "captures = %s, empty = %s", m.captures.length, m.empty );

=> captures = 3, empty = false

I noticed in Captures:

        @property size_t length()
        {
            foreach (i; 0 .. matches.length)
            {
                if (matches[i].startIdx >= input.length) return i;
            }
            return matches.length;
        }

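In this case input.length is 2, and matches[3].startIdx = 2 and matches[3].endIdx = 2, so the check returns 3 even though the last group did match (an empty string at the end of the input). Should this line be: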

     if (matches[i].startIdx > input.length) return i;
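
To make the off-by-one concrete, here is a small standalone model of that loop. It is only a sketch: `Span` and `countCaptures` are hypothetical names, not Phobos code, and the spans for groups 0 to 2 are assumed (only the values for matches[3] are taken from the actual run).

```d
import std.stdio : writefln;

// Hypothetical stand-in for the internal match record; only the two
// indices discussed above are modeled.
struct Span
{
    size_t startIdx;
    size_t endIdx;
}

// Mirrors the quoted length() loop. With strict == false it uses the
// current >= comparison, with strict == true the proposed > comparison.
size_t countCaptures(const Span[] matches, size_t inputLength, bool strict)
{
    foreach (i; 0 .. matches.length)
    {
        immutable pastEnd = strict
            ? matches[i].startIdx > inputLength
            : matches[i].startIdx >= inputLength;
        if (pastEnd) return i;
    }
    return matches.length;
}

void main()
{
    // Assumed spans for "([^,]*),([^,]*),([^,]*)" against ",,":
    // the whole match, then three empty groups at offsets 0, 1 and 2.
    Span[] matches = [Span(0, 2), Span(0, 0), Span(1, 1), Span(2, 2)];

    writefln(">= gives %s, > gives %s",
        countCaptures(matches, 2, false),   // 3: the last empty group is dropped
        countCaptures(matches, 2, true));   // 4: the last empty group is counted
}
```

With >= an empty group that sits exactly at the end of the input is treated as unmatched, which is where the 3 above comes from.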


Anyway, kudos to everyone involved with D. I'm certainly going to be using it a lot in the future.
July 12, 2010
Hi PC,


Thanks for your kind words.

Regarding regex, we need to get a report into bugzilla so we keep track of the problem. When you say "disable the call to optimize", are you referring to the -O compiler flag? If so, it's a compiler problem; otherwise it might be a library issue. Could you please clarify?


Thanks,

Andrei

July 12, 2010
Sorry about the lack of clarity in the last post. I actually commented out the call to Regex.optimize in Regex.compile.

    auto r1 = regex( "(a*)b" );
    r1.printProgram();

Prints out:

printProgram()
  0: 	REtestbit 98, 13
 18: 	REparen len=15 n=0, pc=>42
 27: 	REnm  len=2, n=0, m=4294967295, pc=>42
 40: 	REchar 'a'
 42: 	REchar 'b'
 44: 	REend

With optimize(buf); commented out I get:

printProgram()
  0: 	REparen len=15 n=0, pc=>24
  9: 	REnm  len=2, n=0, m=4294967295, pc=>24
 22: 	REchar 'a'
 24: 	REchar 'b'
 26: 	REend

I don't understand why the optimize routine inserts REtestbit at the start of the program, but with it in place the regex will not match if there is no "a" at the start of the input (e.g. "b").

I think I need to spend some more time looking through the regex.d source to understand it better.

- Pete