Questions about builtin RegExp (page 4)

Andrew Fedoniouk,

What he's saying is... essentially... please take this string:

char[] some_text = "The email address Walter is posting from is newshound@digitalmars.com. The headers for your message have <news@terrainformatica.com>, so I would assume that is your address. My address can be found in this HTML: <a href=\"mailto:unknown@simplemachines.org\">my email</a>";

Now use strtok to output just the email addresses. I would expect the output to be like this:

1: newshound@digitalmars.com
2: news@terrainformatica.com
3: unknown@simplemachines.org

How many lines will it take to grab those addresses, without using a regular expression? You can use "like()" all you like, and strtok(), or even strpos()...

He does not mean a whitespace separated list of addresses, why would you need to work to parse that? Most people would not use a regular expression for that, it'd be silly.

I think you're looking at this from a different angle than Walter is.

Just illustrating,
-[Unknown]

> "Walter Bright" <newshound@digitalmars.com> wrote in message news:dt9ho8$20e4$3@digitaldaemon.com...
>
>>>> Writing a real lexer takes a lot of effort. That's why people invented regex, it'll handle most jobs without having to write a lexer. C's strtok() is embarrassingly inadequate.
>>> Why?
>> I'd like to see strtok() parse an email address out of a body of text.
>>
>
> I don't really understand "parse an email address out of a body of text."
>
> Do you mean something like this:
>
> char* pw = text;
> url u;
>
> forever
> {
>   pw = strtok( pw, " \t\n\r" ); if( !pw ) return;
>   if( !u.parse(pw) ) continue;
>   if( u.protocol() == url::MAILTO )
>     //found - do something here
>     ;
> };
>
> ?
>
> Andrew.
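As a rough illustration of why whitespace tokenizing alone falls short on the sample text, here is a minimal sketch in present-day D (D2, not the D of this thread). It assumes Phobos `std.array.split`, which with no separator splits on whitespace much as `strtok(text, " \t\n\r")` would; `tokensWithAt` and `someText` are names made up for this example.

```d
import std.algorithm : canFind, filter;
import std.array : array, split;
import std.stdio : writeln;

// Hypothetical helper: split on whitespace and keep only the tokens
// that contain an '@'. No further clean-up is attempted.
string[] tokensWithAt(string text)
{
    return text.split().filter!(t => t.canFind('@')).array;
}

void main()
{
    // The sample text from the post, as an immutable D2 string.
    string someText = "The email address Walter is posting from is "
        ~ "newshound@digitalmars.com. The headers for your message have "
        ~ "<news@terrainformatica.com>, so I would assume that is your address. "
        ~ "My address can be found in this HTML: "
        ~ "<a href=\"mailto:unknown@simplemachines.org\">my email</a>";

    // Print each candidate token exactly as the tokenizer produced it.
    foreach (i, tok; tokensWithAt(someText))
        writeln(i + 1, ": ", tok);
}
```

On the sample, each token that contains an `@` still carries debris around the address: a trailing period, angle brackets plus a comma, or a whole `href="mailto:...">my` fragment. Some per-character scanning or a pattern match is needed either way, which seems to be the point being made about strtok().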

On Sun, 19 Feb 2006 14:47:43 -0800, Unknown W. Brackets <unknown@simplemachines.org> wrote:
> Andrew Fedoniouk,
>
> What he's saying is... essentially... please take this string:
>
> char[] some_text = "The email address Walter is posting from is newshound@digitalmars.com. The headers for your message have <news@terrainformatica.com>, so I would assume that is your address. My address can be found in this HTML: <a href=\"mailto:unknown@simplemachines.org\">my email</a>";
>
> Now use strtok to output just the email addresses. I would expect the output to be like this:
>
> 1: newshound@digitalmars.com
> 2: news@terrainformatica.com
> 3: unknown@simplemachines.org
>
> How many lines will it take to grab those addresses, without using a regular expression? You can use "like()" all you like, and strtok(), or even strpos()...

Here's how I'd do it:

import std.stdio;
import std.string;

char[] some_text = "The email address Walter is posting from is newshound@digitalmars.com. The headers for your message have <news@terrainformatica.com>, so I would assume that is your address. My address can be found in this HTML: <a href=\"mailto:unknown@simplemachines.org\">my email</a>";

void main()
{
    char[][] res;
    res = parse_string(some_text);
    foreach(int i, char[] r; res)
        writefln("%d. %s",i+1,r);
}

bool valid_email_char(char c)
{
    char* special = "<>()[]\\.,;:@\"";
    if (c == '.') return true;
    if (c <= 0x1F) return false;
    if (c == 0x7F) return false;
    if (c == ' ') return false;
    if (strchr(special,c)) return false;
    return true;
}

char[][] parse_string(char[] text)
{
    char[][] res;
    char* raw = toStringz(text);
    char* p;
    char* e;
    for(p = strchr(raw,'@'); p; p = strchr(e,'@')) {
        for(e = p+1; valid_email_char(*e); e++) {}
        if (e > raw && *(e-1) == '.') e--;
        for(; p > raw && valid_email_char(*(p-1)); p--) {}
        res ~= p[0..(e-p)]; //add .dup if required
    }
    return res;
}

Regan

On Sun, 19 Feb 2006 18:52:19 -0800, Walter Bright <newshound@digitalmars.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote in message
> news:ops48ur1em23k2f5@nrage.netwin.co.nz...
>> Here's how I'd do it:
>
> Yours is a lot of code to do what a regex does.

This is true, though my code is likely faster.

> Now recognize a url <g>.

Nah. You've made your point.. in fact I was secretly trying to help. <g>

Regex is a good general purpose string parsing facility. I personally find composing a regex can be complicated, likely it's easier with practice. A custom piece of code is probably faster and I find it easier to tweak. In the end, unless it was performance critical or has resisted my initial efforts at composing a regex, I'd probably use a regex.

Regan

Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com>
>> "Andrew Fedoniouk" <news@terrainformatica.com>
>>> In general shortcuts are good but in this particular case it has
>>> hidden side effects in creation of new RegExp object on each test
>>> invocation.
>>
>> Yes, but why is that a bad thing?
>
> You need to explain very well what is going on under the hood of this
> ~~ - it is stateful operator (if it is /g).
>
> <ot>
>
> I am using stream tokenizer in Harmonia instead of this /g. (class
> TokenizerT(CHAR) // harmonia/string.d)
>
> Simple like(pattern) method is enough in 90% of cases.
>
> Perl is completely different story - it is built around RegExp. And
> it is typeless.
>
> </ot>
>
> BTW: Have you seen Nemerle and its way of meta-programming? http://nemerle.org/

Had I to do stuff on the M$ "platform", I'd definitely look long and hard at Nemerle, before even touching C#.

The macro thing looks quite a bit like what I had in mind last winter when we were discussing whether the high level (that is, metaprogramming) features of D should be implemented in a syntax distinct from the "normal" language syntax or not.

Seems I lost. :-) (No hard feelings, Walter and Don are really amazing me, over and over again!)

Still, there's a lot of obvious stuff that seems trivial with a separate syntax, while either impossible or cumbersome with the current one. (But hey, with the rate W&D are going, all that will also be fixed by D 1.5.)

February 21, 2006

Re: Questions about builtin RegExp

Posted by Georg Wrede
in reply to Regan Heath

Regan Heath wrote:
> Walter Bright <newshound@digitalmars.com> wrote:
>> "Regan Heath" <regan@netwin.co.nz> wrote 
>>
>>> Here's how I'd do it:
>>
>> Yours is a lot of code to do what a regex does.
> 
> This is true, though my code is likely faster.
> 
>> Now recognize a url <g>.
> 
> Nah. You've made your point.. in fact I was secretly trying to help. <g>

DISCLAIMER INSERTED WHEN PROOFREADING:

I'm not attacking you, or anybody's opinion here, I'm just thinking aloud -- mostly to sort out my own opinion on this issue!  :-)

> Regex is a good general purpose string parsing facility. I personally find composing a regex can be complicated, likely it's easier with practice. A custom piece of code is probably faster and I find it easier to tweak. In the end, unless it was performance critical or has resisted my initial efforts at composing a regex, I'd probably use a regex.

Heh, interestingly, I have the same feeling about all three!! (I.e. composing nontrivial regexes is hard, custom code is faster and easier to tweak.)

But I can't help but wonder whether I'm wrong on all three!

In other words, writing custom code to do the same as a nontrivial regexp might feel the easier choice at the outset, but the sheer number of lines required (for example for the url recognition task) makes the code error prone and unobvious.

And I too _feel_ that the custom code would be faster but, on second thought, I'd probably have to go through some intensive optimizing cycles if I were up against an average regexp implementation. ;-( This regexp stuff is "well understood" and has been polished over decades, after all.

As to "easier to tweak": suppose the Boss comes to you 2 months later and wants this Url Recognizer (which you had to write in a hurry to compete with the regexp guy in the next cubicle) to only accept top-level domains in country specific urls. You'd be hard put to know where to start tweaking, while the other guy gets it right in 30 seconds flat by tweaking his regexp.

(The boss' tweak accepts foo.fi but not foo.bar.fi nor foo.com)
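To make the boss's tweak concrete, here is a minimal sketch in present-day D (D2) using Phobos `std.regex`. It rests on assumptions that are not in the post: that the check is applied to a bare host name rather than a full url, and that "country specific" means any two-letter top-level domain. The name `isCountryTopLevel` is made up for this example.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

// Hypothetical check: exactly one label followed by a two-letter
// top-level domain, e.g. "foo.fi". The two-letter rule is an assumption
// standing in for "country specific".
bool isCountryTopLevel(string host)
{
    auto re = regex(`^[A-Za-z0-9-]+\.[A-Za-z]{2}$`);
    return !matchFirst(host, re).empty;
}

void main()
{
    // The three hosts named in the post.
    foreach (host; ["foo.fi", "foo.bar.fi", "foo.com"])
        writeln(host, " -> ", isCountryTopLevel(host));
}
```

Under those assumptions the whole tweak lives in one pattern: the label class excludes dots, so a second label cannot sneak in, and `{2}` followed by `$` rejects longer top-level domains.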

> Here's how I'd do it:
> 
> import std.stdio;
> import std.string;
> 
> char[] some_text = "The email address Walter is posting from is  newshound@digitalmars.com.  The headers for your message have  <news@terrainformatica.com>, so I would assume that is your address.  My  address can be found in this HTML: <a  href=\"mailto:unknown@simplemachines.org\">my email</a>";
> 
> void main()
> {
>     char[][] res;
>     res = parse_string(some_text);
>     foreach(int i, char[] r; res)
>         writefln("%d. %s",i+1,r);
> }
> 
> bool valid_email_char(char c)
> {
>     char* special = "<>()[]\\.,;:@\"";
>     if (c == '.') return true;
>     if (c <= 0x1F) return false;
>     if (c == 0x7F) return false;
>     if (c == ' ') return false;
>     if (strchr(special,c)) return false;
>     return true;
> }
> 
> char[][] parse_string(char[] text)
> {
>     char[][] res;
>     char* raw = toStringz(text);
>     char* p;
>     char* e;
>     for(p = strchr(raw,'@'); p; p = strchr(e,'@')) {
>         for(e = p+1; valid_email_char(*e); e++) {}
>         if (e > raw && *(e-1) == '.') e--;
>         for(; p > raw && valid_email_char(*(p-1)); p--) {}
>         res ~= p[0..(e-p)]; //add .dup if required
>     }
>     return res;
> }
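For comparison with the hand-written scanner quoted above, a regex version might look roughly like the sketch below. It is written in present-day D (D2) with Phobos `std.regex`, which is not the regexp facility that existed when this thread was written. The pattern is deliberately loose and is not a full address validator, and `findEmails` and `someText` are names made up for this example.

```d
import std.regex : matchAll, regex;
import std.stdio : writefln;

// Hypothetical helper: collect everything that looks like
// local-part@domain, where the domain has at least one dot.
// A trailing sentence period is not swallowed, because every dot in the
// pattern must be followed by at least one more allowed character.
string[] findEmails(string text)
{
    auto re = regex(
        `[A-Za-z0-9_+-]+(\.[A-Za-z0-9_+-]+)*@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+`);
    string[] res;
    foreach (m; matchAll(text, re))
        res ~= m.hit; // m.hit is the slice of text covered by the whole match
    return res;
}

void main()
{
    // The sample text from the thread, as an immutable D2 string.
    string someText = "The email address Walter is posting from is "
        ~ "newshound@digitalmars.com. The headers for your message have "
        ~ "<news@terrainformatica.com>, so I would assume that is your address. "
        ~ "My address can be found in this HTML: "
        ~ "<a href=\"mailto:unknown@simplemachines.org\">my email</a>";

    foreach (i, r; findEmails(someText))
        writefln("%d. %s", i + 1, r);
}
```

The scanning logic shrinks to one pattern and a loop, which is the trade-off under discussion: fewer lines to get wrong, at the cost of having to read the pattern.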
