February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to Michel Fortin |
Michel Fortin wrote:
> [snip]
>
> And if any of you complains about the extra verbosity, here's what I suggest:
>
> auto s = replace("abracazoo", re"a([b-e])"g, se"A$1");
>
> Yes, syntactic sugar for declaring regular expressions.
Didn't D previously have special regex literals that got dropped for being unpopular and/or hated?
-- Daniel
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to Leandro Lucarella | Leandro Lucarella wrote:
> Bill Baxter, el 19 de febrero a las 14:50 me escribiste:
> [snip]
>>> regex("a[b-e]", "g").match("abracazoo")
>>>
>>> but most regex code I've seen mentions the string first and the regex
>>> second. So I dropped that idea.
> [snip]
>>> Now, match() is likely to be called very often so I'm considering:
>>>
>>> foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
>>> writeln(e);
>>>
>>> In general I'm wary of unwitting operator overloading, but I think this
>>> case is more justified than others. Thoughts?
>> No. ~ means matching in Perl. In D it means concatenation. This
>> special case is not special enough to warrant breaking D's convention,
>> in my opinion. It also breaks D's convention that operators have an
>> inherent meaning which shouldn't be subverted to do unrelated things.
>> What about turning it around and using 'in' though?
>>
>> foreach(e; regex("a[b-e]", "g") in "abracazoo")
>> writeln(e);
>>
>> The charter for "in" isn't quite as focused as that for ~, and anyway
>> you could view this as finding instances of the regular expression
>> "in" the string.
>
> I think match is pretty short, I don't see any need for any shortcut which
> makes the code more obscure.
>
> BTW, in case Andrei was looking for a precedent, Python uses the syntax
> like:
> regex("a[b-e]", "g").match("abracazoo")
Yah, but since even bearophile admitted python kinda botched regexes, I better not consider this argument :o). The Unix toolchain invariably puts the string before the regex.
Andrei
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to Don | Don wrote:
> Andrei Alexandrescu wrote:
>> I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired.
>>
>> First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop.
>>
>> In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine).
>>
>> The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. Any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it.
>>
>> The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g.
>>
>> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>>     writeln(e.pre, e.hit, e.post);
>> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
>>
>> Two other syntactic options are available:
>>
>> "abracazoo".match(regex("a[b-e]", "g"))
>> "abracazoo".match("a[b-e]", "g")
>>
>> I could have made match a member of regex:
>>
>> regex("a[b-e]", "g").match("abracazoo")
>>
>> but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.
>>
>> Now, match() is likely to be called very often so I'm considering:
>>
>> foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
>>     writeln(e);
>>
>> In general I'm wary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
>>
>>
>> Andrei
>
> I agree with the comments against ~.
> I believe this Perl6 document is a must-read:
>
> http://dev.perl.org/perl6/doc/design/apo/A05.html
>
> There are some excellent observations there, especially near the beginning. By separating the engine from the state of the match, you open the possibility of subsequently providing cleaner regex syntax.

I'd read it a while ago, but a refresher is in order. Thanks!

> I do wonder though, how you'd deal with a regex which includes a match to a literal string provided as a variable. Would this be passed to the engine, or to the match state?

At the moment these are not supported. It's a good question.

> If the engine is using backtracking, there's no difference in the generated bytecode; but if it's creating an automaton, the compiled engine depends on the contents of the string variable.

The current engine is, to the best of my understanding, using backtracking. At least when there's an "or", it tries both matches as recursive calls and picks the longest.

Andrei
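For readers who want to try the string-first interface from this post, here is a minimal sketch. It assumes the present-day std.regex module, where the functions are called matchAll and replaceAll rather than the match() and sub() proposed in 2009; allHits is a hypothetical helper invented for the illustration.

```d
import std.regex : matchAll, regex, replaceAll;
import std.stdio : writeln;

// Hypothetical helper (not from the thread): collect every hit of
// `pattern` in `input`, string first and regex second.
string[] allHits(string input, string pattern)
{
    auto re = regex(pattern, "g");   // the compiled engine
    string[] hits;
    foreach (m; matchAll(input, re)) // the match state, iterated as a range
        hits ~= m.hit;
    return hits;
}

void main()
{
    // Counterpart of the match() loop quoted in the post.
    foreach (m; matchAll("abracazoo", regex("a[b-e]", "g")))
        writeln(m.pre, "|", m.hit, "|", m.post);

    // Counterpart of sub(): $1 refers to the first capture group.
    writeln(replaceAll("abracazoo", regex("a([b-e])", "g"), "A$1"));
    writeln(allHits("abracazoo", "a[b-e]"));
}
```

The engine/state split is visible in allHits: the regex value is built once and could be reused for other inputs, while each matchAll call produces a fresh range tied to one input string.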
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to Michel Fortin | Michel Fortin wrote:
> On 2009-02-19 00:50:06 -0500, Bill Baxter <wbaxter@gmail.com> said:
>
>> On Thu, Feb 19, 2009 at 2:35 PM, Andrei Alexandrescu
>>> In general I'm wary of unwitting operator overloading, but I think this
>>> case is more justified than others. Thoughts?
>>
>> No. ~ means matching in Perl. In D it means concatenation. This
>> special case is not special enough to warrant breaking D's convention,
>> in my opinion. It also breaks D's convention that operators have an
>> inherent meaning which shouldn't be subverted to do unrelated things.
>
> Indeed. That's why I don't like seeing `~` here.
>
>
>> What about turning it around and using 'in' though?
>>
>> foreach(e; regex("a[b-e]", "g") in "abracazoo")
>>     writeln(e);
>>
>> The charter for "in" isn't quite as focused as that for ~, and anyway
>> you could view this as finding instances of the regular expression
>> "in" the string.
>
> That seems reasonable, although if we support it it shouldn't be limited to regular expressions for coherency reasons. For instance:
>
> foreach(e; "co" in "conoco")
>     writeln(e);
>
> should work too. If we can't make that work in the most simple case, then I'd say it shouldn't with the more complicated ones either.

Well I'm a bit unhappy about that one. At least in current D and to yours truly, "in" means "fast membership lookup". The use above is linear lookup. I'm not saying that's bad, but I prefer the non-diluted semantics. For linear search, there's always find().

> By the way, regular expressions should work everywhere where we can search for a string. For instance (from std.string):
>
> auto firstMatchIndex = find("conoco", "co");
>
> should work with a regex too:
>
> auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));

If you mean typeof(firstMatchIndex) to be size_t, that's unlikely to be enough. When looking for a regular expression, you need more than just an index - you need captures, "pre" and "post" substrings, the works. That's why matching a string against a regex must return a richer structure that can't be easily integrated with std.algorithm.

Andrei
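To illustrate the point about an index not being enough, here is a small sketch. It assumes the present-day std.regex module and its matchFirst function (not the 2009 draft discussed here); firstMatchIndex is a hypothetical helper named after the variable in the quoted example.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

// Hypothetical helper: reduce a match to a bare index, or -1 when nothing matches.
ptrdiff_t firstMatchIndex(string input, string pattern)
{
    auto m = matchFirst(input, regex(pattern));
    if (m.empty)
        return -1;
    // The text before the hit is exactly as long as the hit's offset.
    return cast(ptrdiff_t) m.pre.length;
}

void main()
{
    // The richer structure: offset, whole hit, first capture, and remainder.
    auto m = matchFirst("abracazoo", regex("a([b-e])"));
    writeln(m.pre.length, " ", m.hit, " ", m[1], " ", m.post);

    // The index alone can be derived from it, but not the other way around.
    writeln(firstMatchIndex("abracazoo", "c[a-z]"));
}
```

The index falls out of the match result for free, whereas the captures and the "pre" and "post" slices cannot be recovered from a size_t, which is the asymmetry the reply describes.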
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to Max Samukha | Max Samukha wrote:
> On Thu, 19 Feb 2009 06:47:57 -0500, bearophile
> <bearophileHUGS@lycos.com> wrote:
>
>> If not already so, I'd like sub() to take as replacement a string or a callable.
>>
>> Bye,
>> bearophile
>
> I don't like 'sub' because it can denote anything. The most confusing
> is substring. 'replace' seems to be better.
Ok.
Andrei
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to bearophile | bearophile wrote:
> Andrei Alexandrescu:
>
>> but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
>
> I like the following syntaxes (the one with .match() too):
>
> import std.re: regex;
>
> foreach (e; regex("a[b-e]", "g") in "abracazoo")
>     writeln(e);
>
> foreach (e; regex("a[b-e]", "g").match("abracazoo"))
>     writeln(e);
>
> auto re1 = regex("a[b-e]", "g");
> foreach (e; re1.match("abracazoo"))
>     writeln(e);
>
> auto re1 = regex("a[b-e]", "g");
> foreach (e; re1 in "abracazoo")
>     writeln(e);

These all put the regex before the string, something many people would find unsavory.

> ----------------
>
> I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again.
>
> This is an example of usual RE in Python:
>
> finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")
>
>
> This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax):
>
> finder = re.compile(r"""
>     ^ \s*             # start at beginning+ opt spaces
>     ( [\[\]] )        # Group 1: opening bracket
>     \s*               # optional spaces
>     ( [-+]? \d+ )     # Group 2: first number
>     \s* , \s*         # opt spaces+ comma+ opt spaces
>     ( [-+]? \d+ )     # Group 3: second number
>     \s*               # opt spaces
>     ( [\[\]] )        # Group 4: closing bracket
>     \s* $             # opt spaces+ end at the end
>     """, flags=re.VERBOSE)
>
> As you can see it's often very positive to indent logically those lines just like code.

Yah, I saw that ECMA introduced comments in regexes too. At some point we'll implement that.

> ----------------
>
> As the other people here, I don't like the following much, it's a misleading overload of the ~ operator:
>
> "abracazoo" ~ regex("a[b-e]", "g")
>
> ----------------
>
> I don't like that "g" argument much, my suggestions:
>
> RE attributes:
> "repeat", "r": Repeat over the whole input string
> "ignorecase", "i": case insensitive
> "multiline", "m": treat as multiple lines separated by newlines
> "verbose", "v": ignores space outside [] and allows comments

And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.

> If not already so, I'd like sub() to take as replacement a string or a callable.

It does, I haven't mentioned it yet. Pass-by-alias of course :o).

Andrei
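A sketch of what a callable replacement passed by alias can look like, together with single-letter flags combined in one string. It assumes the present-day std.regex module, where the function is replaceAll rather than the sub() discussed here; shout and upcaseHits are hypothetical names chosen for the example.

```d
import std.regex : Captures, regex, replaceAll;
import std.stdio : writeln;
import std.uni : toUpper;

// Replacement callable: receives the captures of one match and
// returns the text that should replace it.
string shout(Captures!string m)
{
    return toUpper(m.hit);
}

// Hypothetical helper: the callable is passed by alias as a template
// argument, and the flags are combined by simple concatenation ("gi").
string upcaseHits(string input, string pattern, string flags)
{
    return replaceAll!shout(input, regex(pattern, flags));
}

void main()
{
    writeln(upcaseHits("Abracazoo", "a[b-e]", "gi"));
}
```

Combining options is just string concatenation here, which is the terseness argument made in the reply; whether that reads better than spelled-out names is the open question in the thread.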
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to Andrei Alexandrescu | On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:
> These all put the regex before the string, something many people would find unsavory.
I don't. To me the regex is what you are looking for so it's like saying "find this pattern in that string".
--
Derek Parnell
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to bearophile |
bearophile wrote:
> Simen Kjaeraas:
>> auto re = regex("a[b-e]", "g");
>> foreach (e; "abracazoo" / re) {
>> }
>
> D has operator overloading, which Java lacks, but this fact doesn't force you to use it even when the result is unreadable.
>
> For people that like the "in" there I'd like to remind how it can look once (if it will ever happen) the foreach uses "in" too:
>
> foreach (e in (re in "abracazoo")) {...}
>
> Bye,
> bearophile
But it doesn't, and I can't see how it could given how confusing it would make things. Besides which, we shouldn't be making judgements based on possible but unplanned syntax changes at some unspecified point in the future.
We have enough trouble with deciding on things as it is. :P
-- Daniel
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to Andrei Alexandrescu | On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> bearophile wrote:
>> Andrei Alexandrescu:
>>
>>> but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
>> I like the following syntaxes (the one with .match() too):
>> import std.re: regex;
>> foreach (e; regex("a[b-e]", "g") in "abracazoo")
>>     writeln(e);
>> foreach (e; regex("a[b-e]", "g").match("abracazoo"))
>>     writeln(e);
>> auto re1 = regex("a[b-e]", "g");
>> foreach (e; re1.match("abracazoo"))
>>     writeln(e);
>> auto re1 = regex("a[b-e]", "g");
>> foreach (e; re1 in "abracazoo")
>>     writeln(e);
>
> These all put the regex before the string, something many people would find unsavory.
>
>> ----------------
>> I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again.
>> This is an example of usual RE in Python:
>> finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")
>> This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax):
>> finder = re.compile(r"""
>>     ^ \s*             # start at beginning+ opt spaces
>>     ( [\[\]] )        # Group 1: opening bracket
>>     \s*               # optional spaces
>>     ( [-+]? \d+ )     # Group 2: first number
>>     \s* , \s*         # opt spaces+ comma+ opt spaces
>>     ( [-+]? \d+ )     # Group 3: second number
>>     \s*               # opt spaces
>>     ( [\[\]] )        # Group 4: closing bracket
>>     \s* $             # opt spaces+ end at the end
>>     """, flags=re.VERBOSE)
>> As you can see it's often very positive to indent logically those lines just like code.
>
> Yah, I saw that ECMA introduced comments in regexes too. At some point we'll implement that.
>
>> ----------------
>> As the other people here, I don't like the following much, it's a misleading overload of the ~ operator:
>> "abracazoo" ~ regex("a[b-e]", "g")
>> ----------------
>> I don't like that "g" argument much, my suggestions:
>> RE attributes:
>> "repeat", "r": Repeat over the whole input string
>> "ignorecase", "i": case insensitive
>> "multiline", "m": treat as multiple lines separated by newlines
>> "verbose", "v": ignores space outside [] and allows comments
>
> And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
>

Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be better? I don't find "gmi" immediately clear nor self-documenting.

>> If not already so, I'd like sub() to take as replacement a string or a callable.
>
> It does, I haven't mentioned it yet. Pass-by-alias of course :o).
>
>
> Andrei
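One way the Regex.Repeat | Regex.IgnoreCase idea from this post could be layered over string flags is sketched below. MatchOption and toFlagString are hypothetical names invented for the illustration, and the sketch assumes the present-day std.regex module, whose regex() takes the flags as a string.

```d
import std.regex : matchAll, regex;
import std.stdio : writeln;

// Hypothetical option bits, loosely following the spelling suggested
// in the post; these names are invented for this sketch.
enum MatchOption : uint
{
    repeat     = 1 << 0, // "g"
    ignoreCase = 1 << 1, // "i"
    multiLine  = 1 << 2, // "m"
}

// Translate the bit set into the flag string that regex() accepts.
string toFlagString(uint options)
{
    string flags;
    if (options & MatchOption.repeat)     flags ~= 'g';
    if (options & MatchOption.ignoreCase) flags ~= 'i';
    if (options & MatchOption.multiLine)  flags ~= 'm';
    return flags;
}

void main()
{
    auto flags = toFlagString(MatchOption.repeat | MatchOption.ignoreCase);
    writeln(flags);
    foreach (m; matchAll("Abracazoo", regex("a[b-e]", flags)))
        writeln(m.hit);
}
```

The named bits answer the "how do you combine them?" objection with the | operator, at the cost of more typing than "gi"; the translation function shows the two spellings carry the same information.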
February 19, 2009 Re: Is str ~ regex the root of all evil, or the leaf of all good?
Posted in reply to Derek Parnell | Derek Parnell wrote:
> On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:
>
>> These all put the regex before the string, something many people would find unsavory.
>
> I don't. To me the regex is what you are looking for so it's like saying
> "find this pattern in that string".
Yah, but to most others it's "match this string against that pattern". Again, regexes have a long history behind them. So probably we need to have both "find" and "match" with different order of arguments.
Anyway, std.algorithm defines find() like this:
find(haystack, needle)
In the least structured case, the haystack is a range and needle is either an element or another range. But then we can think, hey, we can think of efficient finds by using a more structured haystack and/or a more structured needle. So then:
string a = "conoco", b = "co";
// linear find
auto r1 = find(a, b[0]);
// quadratic find
auto r2 = find(a, b);
// organize a in a Boyer-Moore structure; sublinear find
auto r3 = find(boyerMoore(a), b);
I'll actually implement the above, it's pretty nice. Now the question is, what's the haystack and what's the needle in a regex find?
auto r3 = find("conoco", regex("c[a-z]"));
or
auto r3 = find(regex("c[a-z]"), "conoco");
?
The argument could go both ways:
"Organize the set of 2-char strings starting with 'c' and ending with 'a' to 'z' into a structured haystack, then look for substrings of "conoco" in that haystack."
versus
"Given the unstructured haystack conoco, look for a structured needle in it that is any 2-char string starting with 'c' and ending with 'a' to 'z'."
What is the most natural way?
Andrei
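For comparison, a sketch of the find(haystack, needle) family as it can be written today. It assumes the present-day Phobos modules std.algorithm.searching and std.regex; note that boyerMooreFinder there builds its tables from the needle, unlike the find(boyerMoore(a), b) spelling in this post, and that the regex case uses matchFirst rather than find.

```d
import std.algorithm.searching : boyerMooreFinder, find;
import std.regex : matchFirst, regex;
import std.stdio : writeln;

void main()
{
    string a = "conoco", b = "oc";

    // Element needle: scan for a single character.
    writeln(find(a, b[0]));
    // Range needle: plain substring search.
    writeln(find(a, b));
    // Structured needle: the Boyer-Moore tables are built from b, not from a.
    writeln(find(a, boyerMooreFinder(b)));
    // A regex treated as a structured needle, with the string first.
    writeln(matchFirst(a, regex("c[a-z]")).hit);
}
```

In every call the unstructured haystack comes first and the structure lives in the second argument, which corresponds to the second of the two readings offered in the post.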