February 19, 2009
On 2009-02-19 00:50:06 -0500, Bill Baxter <wbaxter@gmail.com> said:

> On Thu, Feb 19, 2009 at 2:35 PM, Andrei Alexandrescu
>> In general I'm wary of unwitting operator overloading, but I think this
>> case is more justified than others. Thoughts?
> 
> No.  ~ means matching in Perl.  In D it means concatenation.  This
> special case is not special enough to warrant breaking D's convention,
> in my opinion.  It also breaks D's convention that operators have an
> inherent meaning which shouldn't be subverted to do unrelated things.

Indeed. That's why I don't like seeing `~` here.


> What about turning it around and using 'in' though?
> 
>    foreach(e; regex("a[b-e]", "g") in "abracazoo")
>       writeln(e);
> 
> The charter for "in" isn't quite as focused as that for ~, and anyway
> you could view this as finding instances of the regular expression
> "in" the string.

That seems reasonable, although if we support it, it shouldn't be limited to regular expressions, for consistency's sake. For instance:

	foreach(e; "co" in "conoco")
		writeln(e);

should work too. If we can't make that work in the simplest case, then I'd say it shouldn't work in the more complicated ones either.

By the way, regular expressions should work wherever we can search for a string. For instance (from std.string):

	auto firstMatchIndex = find("conoco", "co");

should work with a regex too:

	auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

February 19, 2009
On Thu, 19 Feb 2009 06:47:57 -0500, bearophile <bearophileHUGS@lycos.com> wrote:

>If not already so, I'd like sub() to take as replacement a string or a callable.
>
>Bye,
>bearophile

I don't like 'sub' because it can stand for almost anything; the most confusing reading is substring. 'replace' seems better.
February 19, 2009
Andrei Alexandrescu wrote:
> I'm almost done rewriting the regular expression engine, and some pretty interesting things have transpired.
> 
> First, I separated the engine into two parts, one that is the actual regular expression engine, and the other that is the state of the match with some particular input. The previous code combined the two into a huge class. The engine (written by Walter) translates the regex string into a bytecode-compiled form. Given that there is a deterministic correspondence between the regex string and the bytecode, the Regex engine object is in fact invariant and cached by the implementation. Caching makes for significant time savings even if e.g. the user repeatedly creates a regular expression engine in a loop.
> 
> In contrast, the match state depends on the input string. I defined it to implement the range interface, so you can either inspect it directly or iterate it for all matches (if the "g" option was passed to the engine).
> 
> The new codebase works with char, wchar, and dchar and any random-access range as input (forward ranges to come, and at some point in the future input ranges as well). In spite of the added flexibility, the code size has shrunk from 3396 lines to 2912 lines. I plan to add support for binary data (e.g. ubyte - handling binary file formats can benefit a LOT from regexes) and also, probably unprecedented, support for arbitrary types such as integers, floating point numbers, structs, what have you. Any type that supports comparison and ranges is a good candidate for regular expression matching. I'm not sure how regular expression matching can be harnessed e.g. over arrays of int, but I suspect some pretty cool applications are just around the corner. We can introduce that generalization without adding complexity and there is nothing in principle opposed to it.
> 
> The interface is very simple, mainly consisting of the functions regex(), match(), and sub(), e.g.
> 
> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>     writeln(e.pre, e.hit, e.post);
> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
> 
> Two other syntactic options are available:
> 
> "abracazoo".match(regex("a[b-e]", "g"))
> "abracazoo".match("a[b-e]", "g")
> 
> I could have made match a member of regex:
> 
> regex("a[b-e]", "g").match("abracazoo")
> 
> but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.
> 
> Now, match() is likely to be called very often so I'm considering:
> 
> foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
>     writeln(e);
> 
> In general I'm wary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
> 
> 
> Andrei

I agree with the comments against ~.
I believe this Perl6 document is a must-read:

http://dev.perl.org/perl6/doc/design/apo/A05.html

There are some excellent observations there, especially near the beginning. By separating the engine from the state of the match, you open the possibility of subsequently providing cleaner regex syntax.

I do wonder though, how you'd deal with a regex which includes a match to a literal string provided as a variable. Would this be passed to the engine, or to the match state?
If the engine is using backtracking, there's no difference in the generated bytecode; but if it's creating an automaton, the compiled engine depends on the contents of the string variable.
February 19, 2009
Michel Fortin:
> 	foreach(e; "co" in "conoco")
> 		writeln(e);
> should work too.

Of course :-) I think eventually it will work, it's handy and natural.


> By the way, regular expressions should work everywhere where we can
> search for a string. For instance (from std.string):
> 	auto firstMatchIndex = find("conoco", "co");
> should work with a regex too:
> 	auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));

I agree, I have said the same thing regarding splitter()/xsplitter().

Bye,
bearophile
February 19, 2009
Bill Baxter, on February 19 at 14:50, you wrote: [snip]
> > regex("a[b-e]", "g").match("abracazoo")
> >
> > but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.
[snip]
> > Now, match() is likely to be called very often so I'm considering:
> >
> > foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
> >    writeln(e);
> >
> > In general I'm wary of unwitting operator overloading, but I think this case is more justified than others. Thoughts?
> 
> No.  ~ means matching in Perl.  In D it means concatenation.  This special case is not special enough to warrant breaking D's convention, in my opinion.  It also breaks D's convention that operators have an inherent meaning which shouldn't be subverted to do unrelated things. What about turning it around and using 'in' though?
> 
>    foreach(e; regex("a[b-e]", "g") in "abracazoo")
>       writeln(e);
> 
> The charter for "in" isn't quite as focused as that for ~, and anyway you could view this as finding instances of the regular expression "in" the string.

I think match is pretty short; I don't see any need for a shortcut which makes the code more obscure.

BTW, in case Andrei was looking for a precedent, Python uses syntax like:
regex("a[b-e]", "g").match("abracazoo")

-- 
Leandro Lucarella (luca) | Group blog: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
<Palmer> we were just with Vita... he got all excited, I saw it
<Luca> ???????????????????????????????????????????????????????
<Palmer> yes, yes, when he saw Josefina
<Luca> and who is Josefina?
<Palmer> My new computerrrrr
February 19, 2009
On 2009-02-19 00:35:20 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

I don't like `sub`, I mean the name. Makes me think of substring more than substitute. My choice would be to reuse what we have in std.string and augment it to work with regular expressions:

	auto s = replace("abracazoo", regex("a([b-e])", "g"), subex("A$1"));

This way it works consistently whether you're using a string or a regular expression: just replace any pattern string with regex(...) and any replacement string with subex(...) -- "substitution-expression" -- when you want them to be parsed as such. Omitting subex in the above would make it a plain string replacement, for instance (this way it's easy to use a variable there).
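
For anyone who wants to try the string-versus-regex distinction today, here is a minimal sketch. It assumes present-day Phobos (std.array.replace and std.regex.replaceAll), not the 2009 API proposed in this thread, and it does not implement subex. The helper names replacePlain and replacePattern are made up for illustration.

```d
import std.array : replace;            // plain substring replacement
import std.regex : regex, replaceAll;  // regex replacement with $1-style references
import std.stdio : writeln;

// Hypothetical helper: "from" and "to" are both treated as plain strings.
string replacePlain(string input, string from, string to)
{
    return input.replace(from, to);
}

// Hypothetical helper: "pattern" is compiled as a regex, and "format" may
// refer to capture groups such as $1.
string replacePattern(string input, string pattern, string format)
{
    return replaceAll(input, regex(pattern), format);
}

void main()
{
    writeln(replacePlain("abracazoo", "ac", "AC"));
    writeln(replacePattern("abracazoo", "a([b-e])", "A$1"));
}
```

In this sketch the choice between plain and pattern replacement is made by picking a function, whereas the proposal above makes it by the argument type (string versus regex(...) or subex(...)).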

These functions should allow easy substitution of any string or regex pattern with another algorithm for matching the pattern.

And there's no way to get a range of matches using std.string, but there should be, and it should follow the same rule as above: supporting strings and regexes consistently. (Using the `in` operator as suggested by Bill Baxter seems a good fit for this function.)

And if any of you complains about the extra verbosity, here's what I suggest:

	auto s = replace("abracazoo", re"a([b-e])"g, se"A$1");

Yes, syntactic sugar for declaring regular expressions.


> Two other syntactic options are available:
> 
> "abracazoo".match(regex("a[b-e]", "g"))
> "abracazoo".match("a[b-e]", "g")

I despise the second one, because if you omit regex(...) it makes me think you're checking for string matches, not expression matches. There's nothing in the name of the function telling you you're dealing with a regular expression, so it could easily get confusing.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

February 19, 2009
On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright <dhasenan@gmail.com> wrote:

> Denis Koroskin wrote:
>> "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it over the '~' version. 'in' is also fine (both ways).
>
> This isn't so good for two reasons.
> First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated.
>
> Second, I can't reuse regexes in your way, so I have to use a pair of string constants.

auto re = regex("a[b-e]", "g");
foreach (e; "abracazoo".match(re)) {
    // what's wrong with that?
}
February 19, 2009
On Thu, 19 Feb 2009 14:34:30 +0100, Denis Koroskin <2korden@gmail.com> wrote:

> On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright <dhasenan@gmail.com> wrote:
>
>> Denis Koroskin wrote:
>>> "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it over the '~' version. 'in' is also fine (both ways).
>>
>> This isn't so good for two reasons.
>> First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated.
>>
>> Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
>
> auto re = regex("a[b-e]", "g");
> foreach (e; "abracazoo".match(re)) {
>      // what's wrong with that?
> }

This:

auto re = regex("a[b-e]", "g");
foreach (e; "abracazoo" / re) {
}

--
Simen
February 19, 2009
Simen Kjaeraas:
> auto re = regex("a[b-e]", "g");
> foreach (e; "abracazoo" / re) {
> }

D has operator overloading, which Java lacks, but that doesn't oblige you to use it where the result is unreadable.

For people who like "in" there, I'd like to point out how it could look if foreach ever comes to use "in" too:

foreach (e in (re in "abracazoo")) {...}

Bye,
bearophile
February 19, 2009
Michel Fortin wrote:
> On 2009-02-19 00:35:20 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:
> 
>> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
> 
> I don't like `sub`, I mean the name. Makes me think of substring more than substitute. My choice would be to reuse what we have in std.string and augment it to work with regular expressions:
> 
>     auto s = replace("abracazoo", regex("a([b-e])", "g"), subex("A$1"));

Ok. Probably subex is a bit of a killer, but I see your point (subex is not an arbitrary string).

> This way it works consistently whether you're using a string or a regular expression: just replace any pattern string with regex(...) and any replacement string with subex(...) -- "substitution-expression" -- when you want them to be parsed as such. Omitting subex in the above would make it a plain string replacement, for instance (this way it's easy to use a variable there).

Indeed, that was part of the impetus for making regex a distinct type that participates in larger functions. The only problem is that regex does not work with std.algorithm in an obvious way, e.g. find() works very differently for strings and regexes. At one point I considered trying to integrate them, but decided not to spend that effort right now.

> These functions should allow easy substitution of any string or regex pattern with another algorithm for matching the pattern.
> 
> And there's no way to get a range of matches using std.string, but there should be, and it should follow the same rule as above: supporting strings and regexes consistently. (Using the `in` operator as suggested by Bill Baxter seems a good fit for this function.)

I defined the following in std.algorithm (signatures simplified):

// Split a range by a 1-element separator
Splitter!(...) splitter(Range, Element)(Range input, Element separator);
// Split a range by a subrange separator
Splitter!(...) splitter(Range)(Range input, Range separator);

I then defined this in std.regex:

// Split a range by a regex separator
Splitter!(...) splitter(Range)(Range input, Regex separator);

Now this is very nice because you get to switch from one to another very easily.

foreach (e; splitter(input, ',')) { ... }
foreach (e; splitter(input, ", ")) { ... }
foreach (e; splitter(input, regex(", *"))) { ... }

The speed/flexibility tradeoff is self-evident and under the control of the programmer without much fuss as it's very easy to switch from one form to another.
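
As a rough illustration of the three forms above, here is a minimal sketch that assumes present-day Phobos (std.algorithm.iteration.splitter and std.regex.splitter) rather than the 2009 code described in this post. The sample input, the local alias splitByRegex, and the helper names splitOnChar, splitOnString and splitOnRegex are made up for illustration.

```d
import std.algorithm.iteration : splitter;
import std.array : array;
import std.regex : regex, splitByRegex = splitter; // local alias, for clarity only
import std.stdio : writeln;

// Hypothetical helper: split on a single element (the cheapest form).
string[] splitOnChar(string input, char separator)
{
    return input.splitter(separator).array;
}

// Hypothetical helper: split on a subrange (here, a plain string).
string[] splitOnString(string input, string separator)
{
    return input.splitter(separator).array;
}

// Hypothetical helper: split on a regex (the most flexible form).
string[] splitOnRegex(string input, string pattern)
{
    return splitByRegex(input, regex(pattern)).array;
}

void main()
{
    auto input = "a, b,c,  d";
    writeln(splitOnChar(input, ','));     // pieces keep their leading spaces
    writeln(splitOnString(input, ", "));  // only splits on an exact ", "
    writeln(splitOnRegex(input, ", *"));  // a comma plus any run of spaces
}
```

The renamed import only keeps the two functions visibly apart in the sketch; the point of the post is that the call sites differ in nothing but the separator argument.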

> And if any of you complains about the extra verbosity, here's what I suggest:
> 
>     auto s = replace("abracazoo", re"a([b-e])"g, se"A$1");
> 
> Yes, syntactic sugar for declaring regular expressions.
> 
> 
>> Two other syntactic options are available:
>>
>> "abracazoo".match(regex("a[b-e]", "g"))
>> "abracazoo".match("a[b-e]", "g")
> 
> I despise the second one, because if you omit regex(...) it makes me think you're checking for string matches, not expression matches. There's nothing in the name of the function telling you you're dealing with a regular expression, so it could easily get confusing.

This is yet more proof that discussions of syntax, notation, and naming will never go out of fashion. I was half convinced by the others that we're in good shape with input.match(regex).


Andrei