February 19, 2009 Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired.
First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop.
In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine).
The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it.
The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g.
foreach (e; match("abracazoo", regex("a[b-e]", "g")))
writeln(e.pre, e.hit, e.post);
auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
Two other syntactic options are available:
"abracazoo".match(regex("a[b-e]", "g")))
"abracazoo".match("a[b-e]", "g")
I could have made match a member of regex:
regex("a[b-e]", "g")).match("abracazoo")
but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.
Now, match() is likely to be called very often so I'm considering:
foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
writeln(e);
In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
Andrei
| ||||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Thu, Feb 19, 2009 at 2:35 PM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote: > I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. > > First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. > > In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). > > The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. > > The interface is very simple, mainly consisting of the functions regex(), > match(), and sub(), e.g. > > foreach (e; match("abracazoo", regex("a[b-e]", "g"))) > writeln(e.pre, e.hit, e.post); > auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); > > Two other syntactic options are available: > > "abracazoo".match(regex("a[b-e]", "g"))) > "abracazoo".match("a[b-e]", "g") > > I could have made match a member of regex: > > regex("a[b-e]", "g")).match("abracazoo") > > but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. > > Now, match() is likely to be called very often so I'm considering: > > foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) > writeln(e); > > In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts? No. ~ means matching in Perl. In D it means concatenation. This special case is not special enough to warrant breaking D's convention, in my opinion. It also breaks D's convention that operators have an inherent meaning which shouldn't be subverted to do unrelated things. What about turning it around and using 'in' though? foreach(e; regex("a[b-e]", "g") in "abracazoo") writeln(e); The charter for "in" isn't quite as focused as that for ~, and anyway you could view this as finding instances of the regular expression "in" the string. --bb | |||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Bill Baxter | Hello Bill,
> What about turning it around and using 'in' though?
>
> foreach(e; regex("a[b-e]", "g") in "abracazoo")
> writeln(e);
vote += lots; // I had the same thought as well
| |||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to BCS |
BCS wrote:
> Hello Bill,
>
>> What about turning it around and using 'in' though?
>>
>> foreach(e; regex("a[b-e]", "g") in "abracazoo")
>> writeln(e);
>
> vote += lots; // I had the same thought as well
If a regex represents a set of strings, then wouldn't
"abracazoo" in regex("a[b-e]", "g")
make more sense? Of course, that's "match" semantics; if you turn it around and say that you're looking for elements from the set in the string, then it's
regex("a[b-e]", "g") in "abracazoo"
Hmm...
None the less, I do prefer the 'in' syntax over the '~' syntax. Please let's no go down the road of co-opting operators to do things other than what they're designed for.
If you REALLY want a custom operator, you could always convince Walter to let us define infix functions using Unicode characters. :D
-- Daniel
| |||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Thu, 19 Feb 2009 08:35:20 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote: > I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. > > First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. > > In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). > > The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. > > The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. > > foreach (e; match("abracazoo", regex("a[b-e]", "g"))) > writeln(e.pre, e.hit, e.post); > auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); > > Two other syntactic options are available: > > "abracazoo".match(regex("a[b-e]", "g"))) > "abracazoo".match("a[b-e]", "g") > > I could have made match a member of regex: > > regex("a[b-e]", "g")).match("abracazoo") > > but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. > > Now, match() is likely to be called very often so I'm considering: > > foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) > writeln(e); > > In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts? > > > Andrei "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways). | |||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Thu, 19 Feb 2009 08:35:20 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote: > I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. > > First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. > > In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). > > The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. > > The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. > > foreach (e; match("abracazoo", regex("a[b-e]", "g"))) > writeln(e.pre, e.hit, e.post); > auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); > > Two other syntactic options are available: > > "abracazoo".match(regex("a[b-e]", "g"))) > "abracazoo".match("a[b-e]", "g") > > I could have made match a member of regex: > > regex("a[b-e]", "g")).match("abracazoo") > > but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. > > Now, match() is likely to be called very often so I'm considering: > > foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) > writeln(e); > > In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts? > > > Andrei "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it over '~' version. 'in' is also fine (both ways). | |||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Wed, 18 Feb 2009 21:35:20 -0800, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote: >I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired. > >First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop. > >In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine). > >The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it. > >The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g. > >foreach (e; match("abracazoo", regex("a[b-e]", "g"))) > writeln(e.pre, e.hit, e.post); >auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1"); > >Two other syntactic options are available: > >"abracazoo".match(regex("a[b-e]", "g"))) >"abracazoo".match("a[b-e]", "g") > >I could have made match a member of regex: > >regex("a[b-e]", "g")).match("abracazoo") > >but most regex code I've seen mentions the string first and the regex second. So I dropped that idea. > >Now, match() is likely to be called very often so I'm considering: > >foreach (e; "abracazoo" ~ regex("a[b-e]", "g")) > writeln(e); > >In general I'm weary of unwitting operator overloading, but I think this case is more justified than others. Thoughts? > Please anything but ~. It would be fine, if it didn't follow an array. It could be 'in' as Bill suggested. Or maybe /, which reminds of sed and downs: foreach (e; "abracazoo"/regex("a[b-e]", "g")) writeln(e); If you made 'match' a member of Regex, another option would be the opCall: regex("a[b-e]", "g")( "abracazoo") | |||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Denis Koroskin | On Thu, 19 Feb 2009 11:31:42 +0300, Denis Koroskin wrote:
> "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~
> regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it
> over '~' version. 'in' is also fine (both ways).
i dont see a problem either with just using .match. if you use spaces around the ~ then its actually to more characters plus needing to hit shift. certainly not as simple as .match
| |||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu: >but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.< I like the following syntaxes (the one with .match() too): import std.re: regex; foreach (e; regex("a[b-e]", "g") in "abracazoo") writeln(e); foreach (e; regex("a[b-e]", "g").match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1.match("abracazoo")) writeln(e); auto re1 = regex("a[b-e]", "g"); foreach (e; re1 in "abracazoo") writeln(e); ---------------- I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again. This is an example of usual RE in Python: finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$") This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax): finder = re.compile(r""" ^ \s* # start at beginning+ opt spaces ( [\[\]] ) # Group 1: opening bracket \s* # optional spaces ( [-+]? \d+ ) # Group 2: first number \s* , \s* # opt spaces+ comma+ opt spaces ( [-+]? \d+ ) # Group 3: second number \s* # opt spaces ( [\[\]] ) # Group 4: closing bracket \s* $ # opt spaces+ end at the end """, flags=re.VERBOSE) As you can see it's often very positive to indent logically those lines just like code. ---------------- As the other people here, I don't like the following much, it's a misleading overload of the ~ operator: "abracazoo" ~ regex("a[b-e]", "g") ---------------- I don't like that "g" argument much, my suggestions: RE attributes: "repeat", "r": Repeat over the whole input string "ignorecase", "i": case insensitive "multiline", "m": treat as multiple lines separated by newlines "verbose", "v": ignores space outside [] and allows comments ---------------- If not already so, I'd like sub() to take as replacement a string or a callable. Bye, bearophile | |||
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good? | ||||
|---|---|---|---|---|
| ||||
Posted in reply to Denis Koroskin | Denis Koroskin wrote:
> "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways).
This isn't so good for two reasons.
First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated.
Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
| |||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply