February 20, 2009
Andrei Alexandrescu wrote to me on February 18 at 21:35:
> I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired.
> 
> First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop.
> 
> In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine).
> 
> The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte; handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating-point numbers, structs, what have you. Any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity, and there is nothing in principle opposed to it.
> 
> The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g.
> 
> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>     writeln(e.pre, e.hit, e.post);
> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
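Spelled out, the proposed interface might look like the sketch below. The names regex(), match(), sub(), and the e.pre/e.hit/e.post properties are taken from the post; the exact signatures are an assumption, so treat this as a sketch of the proposal rather than compilable code:

```d
import std.stdio;

void main()
{
    // The compiled pattern depends only on the pattern string and
    // options, so the implementation can cache it; constructing it
    // inside a loop would still hit the cache.
    auto re = regex("a([b-e])", "g");

    // The match state implements the range interface; with the "g"
    // option it iterates over every match in the input.
    foreach (e; match("abracazoo", re))
        writeln(e.pre, " | ", e.hit, " | ", e.post);

    // Substitution with a backreference to the first capture.
    auto s = sub("abracazoo", re, "A$1");
    writeln(s);
}
```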

BTW, why are the flags passed as string and not as an integer mask? For
example:
auto s = regex.sub("abracazoo", regex.regex("a([b-e])", regex.G), "A$1");

This way you can catch a few errors at compile-time.


-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
February 20, 2009
>> And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
>>
> 
> Perhaps string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase) might be better? I don't find "gmi" immediately clear nor self-documenting.

I prefer the enum options too. But not vociferously. I could live with the single-char flags.

--benji
February 20, 2009
Christopher Wright wrote:
> Denis Koroskin wrote:
>> On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright <dhasenan@gmail.com> wrote:
>>
>>> Denis Koroskin wrote:
>>>> "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it over the '~' version. 'in' is also fine (both ways).
>>>
>>> This isn't so good for two reasons.
>>> First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated.
>>>
>>> Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
>>
>> auto re = regex("a[b-e]", "g");
>> foreach (e; "abracazoo".match(re)) {
>>     // what's wrong with that?
>> }
> 
> Your first example was:
> auto match (char[] source, char[] pattern, char[] options);
> 
> Your second example was:
> auto match (char[] source, regex expression);
> 
> The second is good, but more typing than you said originally. The first is problematic.

Why is it problematic? Is the name "match" too common?

Andrei
February 20, 2009
Some of the things I'd like to see in the regex implementation:

All functions accepting a compiled regex object/struct should also accept a string version of the pattern (and vice versa). Some implementations (Java) only accept the compiled version in some places and the string pattern in other places. That's annoying.

Just like with ordinary string-searching functions, you should be able to specify a start position (and maybe an end position) for the search. Even if the match exists somewhere in the string, it fails if not found within the target slice. Something like this:

   auto text = "ABCDEFG";
   auto pattern = regex("[ABCEFG]");

   // returns false, because the char at position 3 does not match
   auto result = match(text, pattern, 3);

   // this should be exactly equivalent (but the previous version
   // uses less memory, and ought to work with infinite ranges, whereas
   // the slice version wouldn't make any sense)
   auto equivalent = match(text[3..$], pattern);

I've needed to use this technique in a few cases to implement a simple lexical scanner, and it's a godsend, if the regex engine supports it (though most don't).
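As a sketch of that lexical-scanner use case (the three-argument match() with a start position is the extension proposed above, not an existing API, and the m.hit property is likewise assumed, so this won't compile as-is):

```d
import std.stdio;

void lex(string text)
{
    // Hypothetical token pattern: identifiers or integer literals.
    auto token = regex("[a-zA-Z_]+|[0-9]+");
    size_t pos = 0;
    while (pos < text.length)
    {
        // Match anchored at pos; no slicing of the input, so no copying
        // and no confusion about absolute positions in the source.
        auto m = match(text, token, pos);
        if (!m) break;
        writeln("token: ", m.hit);
        pos += m.hit.length;  // advance past the token just matched
    }
}
```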

Finally, it'd be extremely cool if the regex compiler automatically eliminated redundant nodes from its NFA, converting as much of it as possible to a DFA. I did some work on this a few years ago, and it's actually remarkably simple to implement using prefix trees.

   // These two expressions produce an identical set of matches,
   // but the first one is functionally an NFA, while the second
   // one is a DFA.
   auto a = regex("(cat|car|cry|dog|door|dry)");
   auto b = regex("(c(?:a[tr]|ry)|d(?:o(?:g|or)|ry))");

In cases where the expression can only be partially simplified, you can leave some NFA nodes deep within the tree, while still DFA-ifying the rest of it:

   auto a = regex("(attitude|attribute|att.+ion)");
   auto b = regex("(att(?:itude|ribute|.+ion))");

It's a very simple transformation, it increases speed (dramatically) for complex regular expressions (especially those produced dynamically at runtime by combining large sets of unrelated target expressions), and it reliably produces results equivalent to those of the inefficient version.

The only really tricky part is if the subexpressions have their own capturing groups, in which case the DFA transformation screws up the ordinal-numbering of the resultant captures.
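For illustration, a minimal sketch of the prefix-tree idea applied to plain literal alternatives: build a trie from the words, then print it back out with shared prefixes factored into non-capturing groups. (The real optimization would operate on the compiled NFA rather than on strings, and emitting (?:r|t) instead of the equivalent [tr] class is a simplification.)

```d
import std.algorithm.sorting : sort;
import std.array : join;
import std.stdio;

// One trie node: outgoing edges keyed by character, plus a flag
// marking that some alternative ends exactly here.
struct Trie
{
    Trie*[char] next;
    bool terminal;
}

void insert(Trie* t, string word)
{
    foreach (c; word)
    {
        if (c !in t.next)
            t.next[c] = new Trie;
        t = t.next[c];
    }
    t.terminal = true;
}

// Walk the trie and emit it as a pattern, factoring shared prefixes
// into non-capturing groups.
string emit(Trie* t)
{
    string[] alts;
    auto ks = t.next.keys;
    sort(ks);                      // deterministic output order
    foreach (c; ks)
        alts ~= c ~ emit(t.next[c]);
    if (alts.length == 0)
        return "";                 // leaf: nothing left to match
    auto grouped = alts.length == 1 ? alts[0] : "(?:" ~ alts.join("|") ~ ")";
    // If a word ends at this node, the remaining suffixes are optional.
    return t.terminal ? "(?:" ~ grouped ~ ")?" : grouped;
}

void main()
{
    auto root = new Trie;
    foreach (w; ["cat", "car", "cry", "dog", "door", "dry"])
        insert(root, w);
    writeln(emit(root));
    // prints: (?:c(?:a(?:r|t)|ry)|d(?:o(?:g|or)|ry))
}
```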

Anyhoo...

I don't have any strong feelings about the function names (though I'd rather have functions than operators, like "~", for searching and matching).

And I don't have any strong feelings about whether the compiled regex is an object or a struct (though I prefer reference semantics over value semantics for regexen, and right now, I think that makes objects the (slightly) better choice).

Thanks for your hard work! I've implemented a small regex engine before, so I know it's no small chunk of effort. Regular expressions are my personal favorite "tiny language", and I'm glad to see them get some special attention in phobos2.

--benji
February 20, 2009

Benji Smith wrote:
>>> And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
>>>
>>
>> Perhaps string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase) might be better? I don't find "gmi" immediately clear nor self-documenting.
> 
> I prefer the enum options too. But not vociferously. I could live with the single-char flags.
> 
> --benji

I dislike enum options because they dramatically bloat the code in terms of how much typing they take.

This is another thing that Visual Basic actually got right [1]; instead of:

Match(string, "a[b-e]", Regex.Repeat | Regex.IgnoreCase)

you could use this:

Match(string, "a[b-e]", Repeat | IgnoreCase)

Since the compiler knew the type of that third argument, it allowed you to omit the prefix.

If D did that, I would completely reverse my dislike of enums and DEFINITELY prefer them over strings; you could always have this:

enum Regex
{
  Repeat = 1,
  IgnoreCase = 2,
  R = Repeat,
  I = IgnoreCase,
}

And then instead of "ri" you have R|I which is actually shorter AND safer!
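As a sketch of that combination in use (assuming the proposed regex()/sub() API accepted an integer mask in place of the option string, which it did not at the time):

```d
// "gi" becomes Regex.R | Regex.I: checked at compile time, no string
// parsing, and a typo in a flag name is a compile error.
auto s = sub("abracazoo", regex("a([b-e])", Regex.R | Regex.I), "A$1");
```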

But yeah; I think Walter said he wanted to do this ages and ages ago, but it never happened.

  -- Daniel


[1] It's funny how many things that poor language *did* get right.  I mean, yeah, it's a terrible language from a design standpoint, but boy did it ever let you just get shit done.
February 20, 2009
Benji Smith:
> It's a very simple transformation, increases speed (dramatically) for complex regular expressions (especially those produced dynamically at runtime by combining large sets of unrelated target expressions), and it reliably produces equivalent results with the inefficient version.

See:
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/List.pm
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Optimizer.pm

Something like that can be implemented as a small pre-processing layer over the re module.

Bye,
bearophile
February 20, 2009
Andrei Alexandrescu wrote:
> Christopher Wright wrote:
>> Your first example was:
>> auto match (char[] source, char[] pattern, char[] options);
>>
>> Your second example was:
>> auto match (char[] source, regex expression);
>>
>> The second is good, but more typing than you said originally. The first is problematic.
> 
> Why is it problematic? Is the name "match" too common?
> 
> Andrei

No. What is the difference between those two? The first builds a regex internally and doesn't let me store it, and it forces me to pass two parameters rather than one. The second takes a regex as a parameter, which I can store, and I can set both the pattern and the match options once and for all.
February 20, 2009
Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:

> The argument could go both ways:
> 
> "Organize the set of 2-char strings starting with 'c' and ending with 'a' to 'z' into a structured haystack, then look for substrings of "conoco" in that haystack."
> 
> versus
> 
> "Given the unstructured haystack conoco, look for a structured needle in it that is any 2-char string starting with 'c' and ending with 'a' to 'z'."
> 
> What is the most natural way?

I think calling a regex a 'haystack' is a far-fetched metaphor.  A haystack is a pile of stuff, and a needle is a precise thing you're looking for.  I think those roles are unambiguous.

Also, the in operator doesn't leave you guessing whether you should put a haystack or a needle first.
February 20, 2009
On Fri, Feb 20, 2009 at 9:59 PM, Sergey Gromov <snake.scaly@gmail.com> wrote:
> Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:
>
>> The argument could go both ways:
>>
>> "Organize the set of 2-char strings starting with 'c' and ending with 'a' to 'z' into a structured haystack, then look for substrings of "conoco" in that haystack."
>>
>> versus
>>
>> "Given the unstructured haystack conoco, look for a structured needle in it that is any 2-char string starting with 'c' and ending with 'a' to 'z'."
>>
>> What is the most natural way?
>
> I think calling a regex a 'haystack' is a far-fetched metaphor.  A haystack is a pile of stuff, and a needle is a precise thing you're looking for.  I think they're unambiguous.

I thought so too.  It's a stretch.  And I also agree with the various posts people have made about allowing plain char[] search strings to be interchangeable with regexes as much as possible.  And in that light, saying you're looking for [big-string] inside of [sub-string] just sounds ridiculous.  But a string is certainly a kind of special-case regex that describes a set consisting of 1 element.  So yeh, you /could/ say you're looking for matches inside that set, but it's quite a stretch.

> Also, the in operator doesn't leave you guessing whether you should put a haystack or a needle first.

That's a good point in its favor.  As long as you aren't one of those folks who thinks "looking for matches inside a regular grammar" is just as reasonable as "looking for a pattern inside a string".

--bb