March 13, 2012
[draft] New std.regex walkthrough
For a couple of releases now we have had a new, revamped std.regex that, 
as far as I'm concerned, works nicely, thanks to my GSoC commitment last 
summer. Yet there has been a certain dark trend around std.regex/std.regexp, 
as both had severe bugs, missing documentation and whatnot, enough to 
consider them unusable or to dismiss them prematurely.

It's about time to break this gloomy aura, and show that std.regex is 
actually easy to use, that it does the thing and has some nice extras.

Link: http://blackwhale.github.com/regular-expression.html

Comments are welcome from experts and newbies alike, in fact it should 
encourage people to try out a few tricks ;)

This is intended as replacement for an article on dlang.org
about outdated (and soon to disappear) std.regexp:
http://dlang.org/regular-expression.html

[Spoiler] one example relies on a parser bug being fixed (blush):
https://github.com/D-Programming-Language/phobos/pull/481
Well, it was a specific lookahead inside lookaround, so that's not a 
severe bug ;)

P.S. I've been following through a bunch of new bug reports recently, 
thanks to everyone involved :)


-- 
Dmitry Olshansky
March 13, 2012
Re: [draft] New std.regex walkthrough
On 3/13/12 2:27 PM, Dmitry Olshansky wrote:
> For a couple of releases we have a new revamped std.regex, that as far
> as I'm concerned works nicely, thanks to my GSOC commitment last summer.
> Yet there was certain dark trend around std.regex/std.regexp as both had
> severe bugs, missing documentation and what not, enough to consider them
> unusable or dismiss prematurely.
>
> It's about time to break this gloomy aura, and show that std.regex is
> actually easy to use, that it does the thing and has some nice extras.
>
> Link: http://blackwhale.github.com/regular-expression.html

Reddited: 
http://www.reddit.com/r/programming/comments/quyy1/walk_through_regexen_in_the_d_programming/


Andrei
March 13, 2012
Re: [draft] New std.regex walkthrough
"Dmitry Olshansky" <dmitry.olsh@gmail.com> wrote in message 
news:jjo73v$4gv$1@digitalmars.com...
> For a couple of releases we have a new revamped std.regex, that as far as 
> I'm concerned works nicely, thanks to my GSOC commitment last summer. Yet 
> there was certain dark trend around std.regex/std.regexp as both had 
> severe bugs, missing documentation and what not, enough to consider them 
> unusable or dismiss prematurely.
>
> It's about time to break this gloomy aura, and show that std.regex is 
> actually easy to use, that it does the thing and has some nice extras.
>
> Link: http://blackwhale.github.com/regular-expression.html
>
> Comments are welcome from experts and newbies alike, in fact it should 
> encourage people to try out a few tricks ;)
>
> This is intended as replacement for an article on dlang.org
> about outdated (and soon to disappear) std.regexp:
> http://dlang.org/regular-expression.html
>
> [Spoiler] one example relies on a parser bug being fixed (blush):
> https://github.com/D-Programming-Language/phobos/pull/481
> Well, it was a specific lookahead inside lookaround so that's not severe 
> bug ;)
>
> P.S. I've been following through a bunch of new bug reports recently, 
> thanks to everyone involved :)
>

Looks nice at an initial glance through. A few things I'll point out though:

- The bullet-list immediately after the text "Now, come to think of it, this 
tiny sample showed a lot of useful things already:" looks like it's 
outdented instead of indented. Just kinda looks a little odd.

- Speaking of the same line, I'd omit the "Now, come to think of it" part. 
It sounds too "stream-of-consciousness" and not very "professional article".

- I'm very much in favor of using backticked strings for regexes instead of 
r"", because with the latter, you can't include double-quotes, which I'd 
think would be a much more common need in a regex than a backtick. Although 
I understand that backticks aren't easy to make on some keyboards. (In the 
US layout I have, it's just an unshifted tilde, ie, the key just to the left 
of "1". I guess some people don't have a backtick key though?)
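
To show what I mean about double-quotes, here is a minimal sketch. The sample text and the pattern are made up for illustration; it only uses regex and match from std.regex:

```d
import std.regex : match, regex;
import std.stdio : writeln;

void main()
{
    // Goal: match a double-quoted word such as "hello".
    // r"..." cannot contain a double-quote at all, so without backticks
    // the pattern has to fall back to a normal string with escapes:
    auto escaped = regex("\"(\\w+)\"");

    // With a backticked (WYSIWYG) string the pattern is written as-is:
    auto backticked = regex(`"(\w+)"`);

    auto text = `say "hello" twice`;
    writeln(match(text, escaped).captures[1]);    // hello
    writeln(match(text, backticked).captures[1]); // hello
}
```

Both patterns are the same regex; the backticked one is just easier to read.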
March 13, 2012
Re: [draft] New std.regex walkthrough
On 13.03.2012 23:42, Nick Sabalausky wrote:
> "Dmitry Olshansky"<dmitry.olsh@gmail.com>  wrote in message
> news:jjo73v$4gv$1@digitalmars.com...
>> For a couple of releases we have a new revamped std.regex, that as far as
>> I'm concerned works nicely, thanks to my GSOC commitment last summer. Yet
>> there was certain dark trend around std.regex/std.regexp as both had
>> severe bugs, missing documentation and what not, enough to consider them
>> unusable or dismiss prematurely.
>>
>> It's about time to break this gloomy aura, and show that std.regex is
>> actually easy to use, that it does the thing and has some nice extras.
>>
>> Link: http://blackwhale.github.com/regular-expression.html
>>
>> Comments are welcome from experts and newbies alike, in fact it should
>> encourage people to try out a few tricks ;)
>>
>> This is intended as replacement for an article on dlang.org
>> about outdated (and soon to disappear) std.regexp:
>> http://dlang.org/regular-expression.html
>>
>> [Spoiler] one example relies on a parser bug being fixed (blush):
>> https://github.com/D-Programming-Language/phobos/pull/481
>> Well, it was a specific lookahead inside lookaround so that's not severe
>> bug ;)
>>
>> P.S. I've been following through a bunch of new bug reports recently,
>> thanks to everyone involved :)
>>
>
> Looks nice at an initial glance through. Few things I'll point out though:
>
> - The bullet-list immediately after the text "Now, come to think of it, this
> tiny sample showed a lot of useful things already:" looks like it's
> outdented instead of indented. Just kinda looks a little odd.
>
> - Speaking of the same line, I'd omit the "Now, come to think of it" part.
> It sounds too "stream-of-conciousness" and not very "professional article".

Thanks, these are the kind of things I intend to fix/improve/etc.
Hence the [draft] prefix.

>
> - I'm very much in favor of using backticked strings for regexes instead of
> r"", because with the latter, you can't include double-quotes, which I'd
> think would be a much more common need in a regex than a backtick. Although
> I understand that backticks aren't easy to make on some keyboards. (In the
> US layout I have, it's just an unshifted tilde, ie, the key just to the left
> of "1". I guess some people don't have a backtick key though?)
>

Same here, but I recall there is a movement (was it?) against backticked 
strings, including some of DPL's highly ranked members ;)
So I thought that maybe it's best to not impose my (perverted?) style on 
readers.

-- 
Dmitry Olshansky
March 13, 2012
Re: [draft] New std.regex walkthrough
Dmitry Olshansky:

> It's about time to break this gloomy aura, and show that std.regex is 
> actually easy to use, that it does the thing and has some nice extras.

This seems a good moment to ask people about this small problem, which we have already discussed a little in Bugzilla (there is a significant need to show some Bugzilla discussions here):

http://d.puremagic.com/issues/show_bug.cgi?id=7260

The problem is easy to show:

import std.stdio: write, writeln;
import std.regex: regex, match;

void main() {
   string text = "abc312de";

   foreach (c; text.match("1|2|3|4"))
       write(c, " ");
   writeln();

   foreach (c; text.match(regex("1|2|3|4", "g")))
       write(c, " ");
   writeln();
}


It outputs:

["3"] 
["3"] ["1"] ["2"]

In my code I have seen that usually the "g" option (that means "repeat over the
whole input") is what I want. So what do you think about making "g" the default?

This request is not as arbitrary as it looks, if you compare to the older API. See Bug 7260 for more info.

Bye,
bearophile
March 13, 2012
Re: [draft] New std.regex walkthrough
On Tuesday, 13 March 2012 at 19:27:59 UTC, Dmitry Olshansky wrote:
> For a couple of releases we have a new revamped std.regex, that 
> as far as I'm concerned works nicely, thanks to my GSOC 
> commitment last summer. Yet there was certain dark trend around 
> std.regex/std.regexp as both had severe bugs, missing 
> documentation and what not, enough to consider them unusable or 
> dismiss prematurely.

Thank you for the work Dmitry, I look forward to reading this and 
ultimately have been happy with the changes.

D has been getting a great number of face lifts on its many faces.
March 13, 2012
Re: [draft] New std.regex walkthrough
On 14.03.2012 0:05, bearophile wrote:
> Dmitry Olshansky:
>
>> It's about time to break this gloomy aura, and show that std.regex is
>> actually easy to use, that it does the thing and has some nice extras.
>
> This seems a good moment to ask people regarding this small problem, that we have already discussed a little in Bugizilla (there is a significant need to show here some Bugzilla discussions):
>
> http://d.puremagic.com/issues/show_bug.cgi?id=7260
>

Yeah, it's the prime thing I regret when thinking of the current API.

> The problem is easy to show:
>
> import std.stdio: write, writeln;
> import std.regex: regex, match;
>
> void main() {
>      string text = "abc312de";
>
>      foreach (c; text.match("1|2|3|4"))
>          write(c, " ");
>      writeln();
>
>      foreach (c; text.match(regex("1|2|3|4", "g")))
>          write(c, " ");
>      writeln();
> }
>
>
> It outputs:
>
> ["3"]
> ["3"] ["1"] ["2"]
>
> In my code I have seen that usually the "g" option (that means "repeat over the
> whole input") is what I want. So what do you think about making "g" the default?
>
I like the general idea of foreach on match working intuitively.
Yet I'm not convinced we should add an extra flag for "non-global".

I'd propose yanking the "g" flag entirely and assuming all regexes are 
global, but that breaks code in a lot of subtle ways. Problems with 
making the global flag the default:

1. Generic stuff:
// a normal regex silently becomes global, quite unexpectedly
assert(equal(match(...), someOtherRange));

2. replace would then have to be 2 funcs, replaceFirst and replaceAll, 
or we are back to the problem of an extra flag.

I'm thinking there is a path through opApply to allow foreach iteration 
over a non-global regex as if it had the global flag, yet without getting 
the full range interface. It's hackish, but so far it's as good as it gets.
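
A rough sketch of the opApply idea, done as a user-side wrapper rather than inside std.regex. AllMatches is a hypothetical name, not part of the library, and restarting the search on the remaining slice is only an approximation: anchors and lookbehind would see the slice, not the original input.

```d
import std.regex : Captures, Regex, match, regex;
import std.stdio : write, writeln;

// Hypothetical wrapper, NOT part of std.regex: lets foreach visit every
// match of a non-global regex without exposing a range interface.
struct AllMatches
{
    string input;
    Regex!char re;

    int opApply(scope int delegate(Captures!string) dg)
    {
        auto rest = input;
        while (true)
        {
            auto m = match(rest, re);   // non-global: finds one match
            if (m.empty)
                return 0;               // no more matches
            if (auto stop = dg(m.captures))
                return stop;            // the loop body used break/return
            if (m.hit.length == 0)
                return 0;               // sketch only: give up on empty matches
            rest = m.post;              // continue after the current match
        }
    }
}

void main()
{
    // Visits "3", "1" and "2" although the regex has no "g" flag.
    foreach (c; AllMatches("abc312de", regex("1|2|3|4")))
        write(c, " ");
    writeln();
}
```

Generic code that calls equal(match(...), someOtherRange) is untouched by this, since the wrapper is not a range.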

> This request is not as arbitrary as it looks, if you compare to the older API. See Bug 7260 for more info.
>



-- 
Dmitry Olshansky
March 13, 2012
Re: [draft] New std.regex walkthrough
On Tue, Mar 13, 2012 at 1:27 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> For a couple of releases we have a new revamped std.regex, that as far as
> I'm concerned works nicely, thanks to my GSOC commitment last summer. Yet
> there was certain dark trend around std.regex/std.regexp as both had severe
> bugs, missing documentation and what not, enough to consider them unusable
> or dismiss prematurely.
>
> It's about time to break this gloomy aura, and show that std.regex is
> actually easy to use, that it does the thing and has some nice extras.
>
> Link: http://blackwhale.github.com/regular-expression.html
>
> Comments are welcome from experts and newbies alike, in fact it should
> encourage people to try out a few tricks ;)
>
> This is intended as replacement for an article on dlang.org
> about outdated (and soon to disappear) std.regexp:
> http://dlang.org/regular-expression.html
>
> [Spoiler] one example relies on a parser bug being fixed (blush):
> https://github.com/D-Programming-Language/phobos/pull/481
> Well, it was a specific lookahead inside lookaround so that's not severe
> bug ;)
>
> P.S. I've been following through a bunch of new bug reports recently,
> thanks to everyone involved :)
>
>
> --
> Dmitry Olshansky
>

Second paragraph:
- "..,expressions, though one though one should..." has too many "though
one"s

Third paragraph:
- "...keeping it's implementation..." should be "its"
- "We'll see how close to built-ins one can get this way." was kind of
confusing.  I'd consider just doing away with the distinction between built
in and non-built in regex since it's an implementation detail most
programmers who use it don't even need to know about.  Maybe say that it is
not built in and explain why that is a neat thing to have (meaning, the
language itself is powerful enough to express it in user code).

Fourth paragraph:
- "...article you'd have..." should probably be "you'll" or, preferably,
"you will".
- "...utilize it's API..." should be "its"
- "yet it's not required to get an understanding of the API." I'd probably
change this to "...yet it's not required to understand the API"

Lost track of which paragraph:
- "... that allows writing a regex pattern in it's natural notation"
another "its"
- "trying to match special characters like" I'd write "trying to match
special regex characters like" for clarity
- "over input like e.g. search or simillar" I'd remove the e.g., write
search as "search()" to show it's a function in other languages and fix the
spelling of similar :P
- "An element type is Captures for the string type being used, it is a
random access range." I just found this confusing.  Not sure what it's
trying to say.
- "I won't go into full detail of the range conception, suffice to say,"
I'd change "conception" to "concept" and remove "suffice to say". (It's a
shame we don't have a range article we can link to.)
- "At that time ancors like" misspelled "anchors"
- "Needless to say, one need not" I'd remove the "Needless to say," because
I think it's actually important to say :P
- "replace(text, regex(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})","g"),
"--");" Is this code example correct?  It references $1, $2, etc. in the
explanatory paragraph below but they are nowhere to be found.
- When you are explaining named captures it sounds like you are about to
show them in the subsequent code example but you are actually showing what
it'd look like without them which was a bit confusing.
- Maybe some more words on what lookaround/lookahead do as I was lost.
- "Amdittedly, barrage of ? and ! makes regex rather obscure, more then
it's actually is. However" should be "Admittedly, the barrage of ? and !
makes the regex rather obscure, more than it actually is.".  Maybe change
"obscure" to a different adjective. Perhaps "complex looking" or
"complicated". (Note I've removed the "However" as the upcoming sentence
isn't contradicting what you just said.)
- "Needless to say it's", again, I think it's rather important to say :P
- "Run-time version took around 10-20us on my machine, admittedly no
statistics." here, borrow this "µ" :P.  Also, I'd get rid of "admittedly no
statistics".
- "meaningful tasks, it's features" another "its"
- "together it's major" and another :P
- "...flexible tools: match, replace, spliter" should be spelled "splitter"
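
On the replace item above: I don't know what the format string in the article was meant to be, so the "$3-$1-$2" below is only my guess at the kind of thing the explanatory paragraph describes, and the sample text is made up. A minimal sketch:

```d
import std.regex : regex, replace;
import std.stdio : writeln;

void main()
{
    auto text = "Due 3/13/2012, reviewed 12/1/2011";
    auto re = regex(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})", "g");

    // Hypothetical format string: $1, $2 and $3 refer to the first,
    // second and third capture groups of the pattern.
    writeln(replace(text, re, "$3-$1-$2"));
    // Due 2012-3-13, reviewed 2011-12-1
}
```

With "--" as the format string every date would simply become "--", which is why the example looked wrong to me.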


Great article.  I didn't even know about the replacement delegate feature
which is something I've often wished I could use in other regex systems.  D
and Phobos need more articles like this.  We should have a link to it from
the std.regex documentation once this is added to the website.

Regards,
Brad Anderson
March 13, 2012
Re: [draft] New std.regex walkthrough
On Tue, Mar 13, 2012 at 11:27:57PM +0400, Dmitry Olshansky wrote:
> For a couple of releases we have a new revamped std.regex, that as
> far as I'm concerned works nicely, thanks to my GSOC commitment last
> summer. Yet there was certain dark trend around std.regex/std.regexp
> as both had severe bugs, missing documentation and what not, enough
> to consider them unusable or dismiss prematurely.
> 
> It's about time to break this gloomy aura, and show that std.regex
> is actually easy to use, that it does the thing and has some nice
> extras.
> 
> Link: http://blackwhale.github.com/regular-expression.html
> 
> Comments are welcome from experts and newbies alike, in fact it
> should encourage people to try out a few tricks ;)
[...]

Yay! Updated docs are always a good thing. I'd like to do some
copy-editing to make it nicer to read. (Hope you don't mind my extensive
revisions, I'm trying to make the docs as professional as possible.)
My revisions are in straight text under the quoted sections, and inline
comments are enclosed in [].


> Introduction
> 
> String processing is a kind of daily routine that most applications do
> in a one way or another.  It should come as no wonder that many
> programming languages have standard libraries stoked with specialized
> functions for common needs.

String processing is a common task performed by many applications. Many
programming languages come with standard libraries that are equipped
with a variety of functions for common string processing needs.


> The D programming language standard library among others offers a nice
> assortment in std.string and generic ones from std.algorithm.

The D programming language standard library also offers a nice
assortment of such functions in std.string, as well as generic functions
in std.algorithm that can also work with strings.


> Still no amount of fixed functionality could cover all needs, as
> naturally flexible text data needs flexible solutions. 

Still, no amount of predefined string functions could cover all needs.
Text data is very flexible by nature, and so needs flexible solutions.


> Here is where regular expressions come in handy, often succinctly
> called as regexes.

This is where regular expressions, or regexes for short, come in.


> Simple yet powerful language for defining patterns of strings, put
> together with a substitution mechanism, forms a Swiss Army knife of
> text processing.

Regexes are a simple yet powerful language for defining patterns of
strings, and when integrated with a substitution mechanism, they form a
Swiss Army knife of text processing.


> It's considered so useful that a number of languages provides built-in
> support for regular expressions, though one though one should not jump
> to conclusion that built-in implies faster processing or more
> features. It's all about getting more convenient and friendly syntax
> for typical operations and usage patterns. 

It's considered so useful that a number of languages provide built-in
support for regular expressions. (Built-in doesn't necessarily imply
faster processing or more features, however.  It's more a matter of
providing a more convenient and friendly syntax for typical operations
and usage patterns.) 

[I think it's better to put the second part in parentheses, since it's
not really the main point of this doc.]


> The D programming language provides a standard library module
> std.regex.

[OK]


> Being a highly expressive systems language, it opens a possibility to
> get a good look and feel via core features, while keeping it's
> implementation within the language.

Being a highly expressive systems language, D allows regexes to be
implemented within the language itself, yet still have the same level of
readability and usability that a built-in implementation would provide.


> We'll see how close to built-ins one can get this way. 

We will see below how close to built-in regexes we can get.


> By the end of article you'd have a good understanding of regular
> expression capabilities in this library, and how to utilize it's API
> in a most straightforward way.

By the end of this article, you will have a good understanding of the
regular expression capabilities offered by this library, and how to
utilize its API in the most straightforward way.



> Examples in this article assume the reader has fairly good
> understanding of regex elements, yet it's not required to get an
> understanding of the API.

Examples in this article assume that the reader has a fairly good
understanding of regex elements, but this is not required to get an
understanding of the API.

[I'll do this much for now. More to come later.]


T

-- 
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it. -- Brian W. Kernighan
March 13, 2012
Re: [draft] New std.regex walkthrough
On 14.03.2012 0:32, Brad Anderson wrote:
> On Tue, Mar 13, 2012 at 1:27 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
>
>     For a couple of releases we have a new revamped std.regex, that as
>     far as I'm concerned works nicely, thanks to my GSOC commitment last
>     summer. Yet there was certain dark trend around std.regex/std.regexp
>     as both had severe bugs, missing documentation and what not, enough
>     to consider them unusable or dismiss prematurely.
>
>     It's about time to break this gloomy aura, and show that std.regex
>     is actually easy to use, that it does the thing and has some nice
>     extras.
>
>     Link: http://blackwhale.github.com/regular-expression.html
>
>     Comments are welcome from experts and newbies alike, in fact it
>     should encourage people to try out a few tricks ;)
>
>     This is intended as replacement for an article on dlang.org
>     about outdated (and soon to disappear) std.regexp:
>     http://dlang.org/regular-expression.html
>
>     [Spoiler] one example relies on a parser bug being fixed (blush):
>     https://github.com/D-Programming-Language/phobos/pull/481
>     Well, it was a specific lookahead inside lookaround so that's not
>     severe bug ;)
>
>     P.S. I've been following through a bunch of new bug reports
>     recently, thanks to everyone involved :)
>
>
>     --
>     Dmitry Olshansky
>
>
> Second paragraph:
> - "..,expressions, though one though one should..." has too many "though
> one"s
>
> Third paragraph:
> - "...keeping it's implementation..." should be "its"
> - "We'll see how close to built-ins one can get this way." was kind of
> confusing.  I'd consider just doing away with the distinction between
> built in and non-built in regex since it's an implementation detail most
> programmers who use it don't even need to know about.  Maybe say that it
> is not built in and explain why that is a neat thing to have (meaning,
> the language itself is powerful enough to express it in user code).
>
> Fourth paragraph:
> - "...article you'd have..." should probably be "you'll" or, preferably,
> "you will".
> - "...utilize it's API..." should be "its"
> - "yet it's not required to get an understanding of the API." I'd
> probably change this to "...yet it's not required to understand the API"
>
> Lost track of which paragraph:
> - "... that allows writing a regex pattern in it's natural notation"
> another "its"
> - "trying to match special characters like" I'd write "trying to match
> special regex characters like" for clarity
> - "over input like e.g. search or simillar" I'd remove the e.g., write
> search as "search()" to show it's a function in other languages and fix
> the spelling of similar :P
> - "An element type is Captures for the string type being used, it is a
> random access range." I just found this confusing.  Not sure what it's
> trying to say.
> - "I won't go into full detail of the range conception, suffice to say,"
> I'd change "conception" to "concept" and remove "suffice to say". (It's
> a shame we don't a range article we can link to).
> - "At that time ancors like" misspelled "anchors"
> - "Needless to say, one need not" I'd remove the "Needless to say,"
> because I think it's actually important to say :P
> - "replace(text, regex(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})","g"),
> "--");" Is this code example correct?  It references $1, $2, etc. in the
> explanatory paragraph below but they are no where to be found.
> - When you are explaining named captures it sounds like you are about to
> show them in the subsequent code example but you are actually showing
> what it'd look like without them which was a bit confusing.
> - Maybe some more words on what lookaround/lookahead do as I was lost.
> - "Amdittedly, barrage of ? and ! makes regex rather obscure, more then
> it's actually is. However" should be "Admittedly, the barrage of ? and !
> makes the regex rather obscure, more than it actually is.".  Maybe
> change "obscure" to a different adjective. Perhaps "complex looking" or
> "complicated". (note I've removed the "However" as the upcoming sentence
> isn't contradicting what you just said.
> - "Needless to say it's", again, I think it's rather important to say :P
> - "Run-time version took around 10-20us on my machine, admittedly no
> statistics." here, borrow this "µ" :P.  Also, I'd get rid of "admittedly
> no statistics".
> - "meaningful tasks, it's features" another "its"
> - "together it's major" and another :P
> - "...flexible tools: match, replace, spliter" should be spelled "splitter"
>

Wow, thanks a lot, that sure was a thorough read. I'm going to carefully 
work through this list tomorrow.

>
> Great article.  I didn't even know about the replacement delegate
> feature which is something I've often wished I could use in other regex
> systems.  D and Phobos need more articles like this.  We should have a
> link to it from the std.regex documentation once this is added to the
> website.
>
> Regards,
> Brad Anderson


-- 
Dmitry Olshansky