February 19, 2009
On Thu, Feb 19, 2009 at 10:01 AM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

>> RE attributes:
>> "repeat", "r": Repeat over the whole input string
>> "ignorecase", "i": case insensitive
>> "multiline", "m": treat as multiple lines separated by newlines
>> "verbose", "v": ignores space outside [] and allows comments
>
> And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.

While we're on the subject I'd like to mention that an unbelievably overwhelming proportion of the time, when I use regexen, I want them to be global.  As in, I don't think I've ever used a non-global regex.

To that effect I'd like to propose that either "g" be the default attribute, or that it should be on _unless_ some other attribute ("o" for once?) is present.  I think this is one thing that Perl got terribly wrong.
February 19, 2009
Denis Koroskin wrote:
> On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> bearophile wrote:
>>> Andrei Alexandrescu:
>>>
>>>> but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
>>>  I like the following syntaxes (the one with .match() too):
>>>  import std.re: regex;
>>>
>>>  foreach (e; regex("a[b-e]", "g") in "abracazoo")
>>>      writeln(e);
>>>
>>>  foreach (e; regex("a[b-e]", "g").match("abracazoo"))
>>>      writeln(e);
>>>
>>>  auto re1 = regex("a[b-e]", "g");
>>>  foreach (e; re1.match("abracazoo"))
>>>      writeln(e);
>>>
>>>  auto re1 = regex("a[b-e]", "g");
>>>  foreach (e; re1 in "abracazoo")
>>>      writeln(e);
>>
>> These all put the regex before the string, something many people would find unsavory.
>>
>>> ----------------
>>>  I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again.
>>>  This is an example of usual RE in Python:
>>>  finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")
>>>   This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax):
>>>  finder = re.compile(r"""
>>>     ^ \s*             # start at beginning+ opt spaces
>>>     ( [\[\]] )        # Group 1: opening bracket
>>>         \s*           # optional spaces
>>>         ( [-+]? \d+ ) # Group 2: first number
>>>         \s* , \s*     # opt spaces+ comma+ opt spaces
>>>         ( [-+]? \d+ ) # Group 3: second number
>>>         \s*           # opt spaces
>>>     ( [\[\]] )        # Group 4: closing bracket
>>>     \s* $             # opt spaces+ end at the end
>>>     """, flags=re.VERBOSE)
>>>  As you can see it's often very positive to indent logically those lines just like code.
>>
>> Yah, I saw that ECMA introduced comments in regexes too. At some point we'll implement that.
>>
>>> ----------------
>>>  As the other people here, I don't like the following much, it's a misleading overload of the ~ operator:
>>>  "abracazoo" ~ regex("a[b-e]", "g")
>>>  ----------------
>>>  I don't like that "g" argument much, my suggestions:
>>>  RE attributes:
>>> "repeat", "r": Repeat over the whole input string
>>> "ignorecase", "i": case insensitive
>>> "multiline", "m": treat as multiple lines separated by newlines
>>> "verbose", "v": ignores space outside [] and allows comments
>>
>> And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
>>
> 
> Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be better? I don't find "gmi" immediately clear nor self-documenting.

I got disabused a very long time ago of the notion that everything about regexes is clear or self-documenting. Really. You just get to a level of understanding that's appropriate for your needs. On that scale, getting used to "gmi" is so low, it's not even worth discussing.


Andrei
February 19, 2009
Andrei Alexandrescu wrote:
> Derek Parnell wrote:
>> On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:
>>
>>> These all put the regex before the string, something many people would find unsavory.
>>
>> I don't. To me the regex is what you are looking for so it's like saying
>> "find this pattern in that string". 
> 
> Yah, but to most others it's "match this string against that pattern". Again, regexes have a long history behind them. So probably we need to have both "find" and "match" with different order of arguments, something .

... "I'm not thrilled about".

Andrei
February 19, 2009
Jarrett Billingsley wrote:
> On Thu, Feb 19, 2009 at 10:01 AM, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
> 
>>> RE attributes:
>>> "repeat", "r": Repeat over the whole input string
>>> "ignorecase", "i": case insensitive
>>> "multiline", "m": treat as multiple lines separated by newlines
>>> "verbose", "v": ignores space outside [] and allows comments
>> And how do you combine them? "repeat, ignorecase"? Writing and parsing such
>> options becomes a little adventure in itself. I think the "g", "i", and "m"
>> flags are popular enough if you've done any amount of regex programming. If
>> not, you'll look up the manual regardless.
> 
> While we're on the subject I'd like to mention that an unbelievably
> overwhelming proportion of the time, when I use regexen, I want them
> to be global.  As in, I don't think I've ever used a non-global regex.
> 
> To that effect I'd like to propose that either "g" be the default
> attribute, or that it should be on _unless_ some other attribute ("o"
> for once?) is present.  I think this is one thing that Perl got
> terribly wrong.

Well I agree for searches but not for substitutions.

In D searches, the lazy way of matching means you can always go with "g" and change your mind whenever you please. I think I'll simply eliminate "g" from the offered options for search.
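
As a rough illustration of why lazy matching makes an explicit "g" redundant for searches, here is a minimal sketch in Python rather than D. Python's re.finditer only stands in for the lazy D range being discussed, and the helper names all_matches and first_match are made up for this example. The search is always "global"; the caller decides how many matches to consume.

```python
import re
from itertools import islice

PATTERN = re.compile(r"a[b-e]")

def all_matches(text):
    """Lazily yield every match; nothing is scanned until the caller iterates."""
    for m in PATTERN.finditer(text):
        yield m.group(0)

def first_match(text):
    """Take only the first match from the same lazy stream (None if there is none)."""
    return next(all_matches(text), None)

if __name__ == "__main__":
    # "Once" and "global" are just different amounts of iteration.
    print(first_match("abracazoo"))
    print(list(islice(all_matches("abracazoo"), 2)))
```

Substitution is different because it has to decide up front whether to rewrite one occurrence or all of them, so a flag still makes sense there.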


Andrei
February 19, 2009
Reply to bearophile,

> Michel Fortin:
> 
>> foreach(e; "co" in "conoco")
>> writeln(e);
>> should work too.
> Of course :-) I think eventually it will work, it's handy and natural.
> 
>> By the way, regular expressions should work everywhere where we can
>> search for a string. For instance (from std.string):
>> auto firstMatchIndex = find("conoco", "co");
>> should work with a regex too:
>> auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));
> I agree, I have said the same thing regarding splitter()/xsplitter().
> 
> Bye,
> bearophile

If the overhead of regex(string) is small enough (vs having it in the find/split function) I'd go with overloads rather than different names.


February 19, 2009
BCS wrote:
> Reply to bearophile,
> 
>> Michel Fortin:
>>
>>> foreach(e; "co" in "conoco")
>>> writeln(e);
>>> should work too.
>> Of course :-) I think eventually it will work, it's handy and natural.
>>
>>> By the way, regular expressions should work everywhere where we can
>>> search for a string. For instance (from std.string):
>>> auto firstMatchIndex = find("conoco", "co");
>>> should work with a regex too:
>>> auto firstMatchIndex = find("abracazoo", regex("a[b-e]", "g"));
>> I agree, I have said the same thing regarding splitter()/xsplitter().
>>
>> Bye,
>> bearophile
> 
> If the overhead of regex(string) is small enough (vs having it in the find/split function) I'd go with overloads rather than different names.

The overhead is low because the last few used regexes are cached.
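
A minimal sketch of what "caching the last few used regexes" could look like, written in Python as an analogy and not taken from the D implementation. The names cached_regex and find, and the cache size of 8, are assumptions made for this example.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=8)  # assumption: "last few" means a small fixed number
def cached_regex(pattern, flags=0):
    """Compile a pattern once; repeated calls with the same arguments reuse it."""
    return re.compile(pattern, flags)

def find(haystack, needle):
    """Index of the first match of the regex `needle` in `haystack`, or -1."""
    m = cached_regex(needle).search(haystack)
    return m.start() if m else -1

if __name__ == "__main__":
    print(find("abracazoo", "a[b-e]"))
    # The second call with the same pattern returns the very same compiled object.
    print(cached_regex("a[b-e]") is cached_regex("a[b-e]"))
```

With a cache like this, a find() overload that accepts a pattern string pays the compilation cost only on the first call for each distinct pattern.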

Andrei
February 19, 2009
On Thu, 19 Feb 2009 07:46:47 -0800, Andrei Alexandrescu wrote:

> Derek Parnell wrote:
>> On Thu, 19 Feb 2009 07:01:56 -0800, Andrei Alexandrescu wrote:
>> 
>>> These all put the regex before the string, something many people would find unsavory.
>> 
>> I don't. To me the regex is what you are looking for so it's like saying "find this pattern in that string".
> 
> Yah, but to most others it's "match this string against that pattern".

I might not be normal ;-)

> Again, regexes have a long history behind them. So probably we need to have both "find" and "match" with different order of arguments, something .
> 
> Anyway, std.algorithm defines find() like this:
> 
> find(haystack, needle)

I use the Euphoria language a lot, and its routine API is find(needle, haystack), so I'm sure this is where my normality springs from.


> What is the most natural way?

Get your "personal assistant" to do it.

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell
February 19, 2009
Denis Koroskin wrote:
> On Thu, 19 Feb 2009 15:00:42 +0300, Christopher Wright <dhasenan@gmail.com> wrote:
> 
>> Denis Koroskin wrote:
>>> "abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it over the '~' version. "in" is also fine (both ways).
>>
>> This isn't so good for two reasons.
>> First, I can't reuse regexes in your way, so if there is any expensive initialization, that is duplicated.
>>
>> Second, I can't reuse regexes in your way, so I have to use a pair of string constants.
> 
> auto re = regex("a[b-e]", "g");
> foreach (e; "abracazoo".match(re)) {
>     // what's wrong with that?
> }

Your first example was:
auto match (char[] source, char[] pattern, char[] options);

Your second example was:
auto match (char[] source, regex expression);

The second is good, but more typing than you said originally. The first is problematic.
February 20, 2009
Andrei Alexandrescu wrote:
> Michel Fortin wrote:
>> That seems reasonable, although if we support it it shouldn't be limited to regular expressions for coherency reasons. For instance:
>>
>>     foreach(e; "co" in "conoco")
>>         writeln(e);
>>
>> should work too. If we can't make that work in the most simple case, then I'd say it shouldn't with the more complicated ones either.
> 
> Well I'm a bit unhappy about that one. At least in current D and to yours truly, "in" means "fast membership lookup". The use above is linear lookup. I'm not saying that's bad, but I prefer the non-diluted semantics. For linear search, there's always find().

At least, "in" refers to a look-up, whereas "~" refers to concatenation, which has nothing in common with the regex matching.

Furthermore, we can't make any complexity guarantees for operators; this always depends on the data structure you use the operator on. And, if I'm not mistaken, "in" is only used by the associative array at the moment. It's a "fast look-up" because of the associative array, but it doesn't have to be.

(Similarly, to me, ~ and ~= feel slow, O(n), but that shouldn't keep us from using it with other data structures that can do a similar concat/append operation with lower complexity.)
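
To make the point concrete, here is a small sketch in Python (not D) where the same "in" operator has a different cost depending on the container. The class names LinearBag and HashedBag are invented for this illustration.

```python
class LinearBag:
    """`in` walks a list, so a lookup is linear in the number of items."""
    def __init__(self, items):
        self._items = list(items)

    def __contains__(self, item):
        return any(x == item for x in self._items)

class HashedBag:
    """`in` is a hash lookup, so it is constant time on average."""
    def __init__(self, items):
        self._items = set(items)

    def __contains__(self, item):
        return item in self._items

if __name__ == "__main__":
    words = ["co", "no", "zoo"]
    # Same operator, same answer, different complexity.
    print("co" in LinearBag(words), "co" in HashedBag(words))
    # On strings, Python's `in` is a substring search.
    print("co" in "conoco")
```

The operator names the question ("is it in there?"); the container decides how expensive the answer is.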

L.
February 20, 2009
Denis Koroskin wrote:
> On Thu, 19 Feb 2009 18:01:56 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> bearophile wrote:
>>> Andrei Alexandrescu:
>>>
>>>> but most regex code I've seen mentions the string first and the regex second. So I dropped that idea.<
>>>  I like the following syntaxes (the one with .match() too):
>>>  import std.re: regex;
>>>
>>>  foreach (e; regex("a[b-e]", "g") in "abracazoo")
>>>      writeln(e);
>>>
>>>  foreach (e; regex("a[b-e]", "g").match("abracazoo"))
>>>      writeln(e);
>>>
>>>  auto re1 = regex("a[b-e]", "g");
>>>  foreach (e; re1.match("abracazoo"))
>>>      writeln(e);
>>>
>>>  auto re1 = regex("a[b-e]", "g");
>>>  foreach (e; re1 in "abracazoo")
>>>      writeln(e);
>>
>> These all put the regex before the string, something many people would find unsavory.
>>
>>> ----------------
>>>  I like the support of verbose regular expressions too, that ignore whitespace and comments (for example with //...) inserted into the regex itself. This simple thing is able to turn the messy world of regexes into programming again.
>>>  This is an example of usual RE in Python:
>>>  finder = re.compile("^\s*([\[\]])\s*([-+]?\d+)\s*,\s*([-+]?\d+)\s*([\[\]])\s*$")
>>>   This is the same RE in verbose mode, in Python still (# is the Python single-line comment syntax):
>>>  finder = re.compile(r"""
>>>     ^ \s*             # start at beginning+ opt spaces
>>>     ( [\[\]] )        # Group 1: opening bracket
>>>         \s*           # optional spaces
>>>         ( [-+]? \d+ ) # Group 2: first number
>>>         \s* , \s*     # opt spaces+ comma+ opt spaces
>>>         ( [-+]? \d+ ) # Group 3: second number
>>>         \s*           # opt spaces
>>>     ( [\[\]] )        # Group 4: closing bracket
>>>     \s* $             # opt spaces+ end at the end
>>>     """, flags=re.VERBOSE)
>>>  As you can see it's often very positive to indent logically those lines just like code.
>>
>> Yah, I saw that ECMA introduced comments in regexes too. At some point we'll implement that.
>>
>>> ----------------
>>>  As the other people here, I don't like the following much, it's a misleading overload of the ~ operator:
>>>  "abracazoo" ~ regex("a[b-e]", "g")
>>>  ----------------
>>>  I don't like that "g" argument much, my suggestions:
>>>  RE attributes:
>>> "repeat", "r": Repeat over the whole input string
>>> "ignorecase", "i": case insensitive
>>> "multiline", "m": treat as multiple lines separated by newlines
>>> "verbose", "v": ignores space outside [] and allows comments
>>
>> And how do you combine them? "repeat, ignorecase"? Writing and parsing such options becomes a little adventure in itself. I think the "g", "i", and "m" flags are popular enough if you've done any amount of regex programming. If not, you'll look up the manual regardless.
>>
> 
> Perhaps, string.match("a[b-e]", Regex.Repeat | Regex.IgnoreCase); might be better? I don't find "gmi" immediately clear nor self-documenting.

I think it's worth an overload! (I also keep forgetting those flags.)

In fact, *the first thing* the current RegExp.compile does is convert the string attributes to enum flags!
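
A minimal sketch of that string-to-flags step, in Python rather than D, to show that both spellings can share one code path. The function name parse_attributes and the letter table are assumptions for this example; they mirror the letters discussed in the thread, not the actual RegExp.compile source.

```python
import re

# Hypothetical mapping from one-letter attributes to Python's re flags.
# "g" has no flag in Python (finditer/findall are always global), so it maps to 0.
_ATTRIBUTES = {
    "g": 0,
    "i": re.IGNORECASE,
    "m": re.MULTILINE,
    "v": re.VERBOSE,
}

def parse_attributes(attributes):
    """Turn an attribute string such as "gmi" into a combined flag value."""
    flags = 0
    for ch in attributes:
        if ch not in _ATTRIBUTES:
            raise ValueError(f"unknown regex attribute: {ch!r}")
        flags |= _ATTRIBUTES[ch]
    return flags

if __name__ == "__main__":
    # The string form and the OR-ed flag form describe the same options.
    print(parse_attributes("mi") == (re.IGNORECASE | re.MULTILINE))
    print(re.compile("A[B-E]", parse_attributes("gi")).findall("abracazoo"))
```

An overload taking flags directly would simply skip the parsing loop and pass the value through.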

L.