Questions about builtin RegExp (page 3) - D Programming Language Discussion Forum

February 18, 2006

Re: Questions about builtin RegExp

Posted by kris
in reply to Walter Bright

Walter Bright wrote:
[snip]
> regex is a large reason people use scripting languages. 

Really? Do you have some kind of data to back that assertion?

February 18, 2006

Re: Questions about builtin RegExp

Posted by Andrew Fedoniouk
in reply to Regan Heath

>> I believe there is a misunderstanding about what scripting is and
>> why there are scripting (typeless) languages, compiled bytecode languages and
>> compiled native languages.
>> These three groups have their own niches. D as a compiled language will
>> never reach the flexibility of e.g. prototype-based JavaScript or Ruby.
>> There are just different definitions of flexibility for these groups:
>> different and sometimes even orthogonal tasks.
>
> I think there is some overlap, i.e. some scripting tasks do not require
> the flexibility you mention, instead the important factor may be one or
> more of:
>  - how fast can I code the solution
>  - how easily can I code the solution
>  - how easily can I maintain the solution
>  - how likely is my solution to contain bugs
>  - how easy will it be to find those bugs
>
> Assuming you're a D programmer and assuming the D std lib contains the tools to achieve your task, why not use D?
>

1) Scripting languages are usually used embedded in some other
environment. This use case is quite different from D's execution model:
a different life cycle.
2) Scripting languages are safe. It takes tremendous effort to cause a GPF
in a scripting environment. In D, causing a GPF is a piece of cake. I don't
mean because of bugs in the language or libs, but because you can, for
example, dereference a null object pointer.
3) Scripting languages provide a very high-level and convenient set of
ready-to-use, task-oriented classes/objects.
Example: for building D projects you would rather use make or build scripts
than D itself, right? Even if you had something like std.build, I bet you
would use some scripting tool for your builds.

What I want to say:

Writing a fast scripting engine in D is possible, and this is what D is best
for (among other things).
But writing something D-ish in a scripting language... In short, these are
completely different areas of use.

February 18, 2006

Re: Questions about builtin RegExp

Posted by Andrew Fedoniouk
in reply to Walter Bright

"Walter Bright" <newshound@digitalmars.com> wrote in message news:dt6nug$1lhe$1@digitaldaemon.com...
>> Or as in Harmonia:
>>
>> string s = ....
>> bool r = s.like("str*");
>
> That doesn't give the match results, though.

Who cares in most cases?
Think of user input validation or simple filename matching...

When you need match results you will use regexp
or something more efficient, like a tokenizer.

>
>
>>> Sure. Create your own MyReg object, and use it like:
>>>    MyReg("*.ext") ~~ filename
>> But I want my own function for char[] ~~ char[] !
>
> Consider overloading the '+' in '1+2'? To overload operators, one of the operands must be a user defined type.
>
>
>> I don't understand why not allow this:
>> bool opMatch(char[] a, char[] b) ?
>
> For the same reason opAdd(int a, int b) is not allowed. Such a function would apply globally, all the library code will break, etc.
>
>
>> BTW: Have you seen Nemerle and its way of meta-programming? http://nemerle.org/
>
> I don't know anything about it. I'll take a look at the link.

Take a look. It is a bit ugly for my taste, but
some ideas from Nemerle macros could be reused.
They let you add your own problem-specific notation and
syntax to the language.
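
For readers trying this in present-day D, the two ideas argued over in this post can be sketched without any language change. This is only an illustration under assumptions: `like` and `Glob` are hypothetical names (not Harmonia's or Phobos's API), `std.path.globMatch` does the wildcard matching, and the existing `in` operator stands in for the proposed `~~`, which D never had. It also shows Walter's point that one operand must be a user-defined type.

```d
import std.path : globMatch;
import std.stdio : writeln;

// Hypothetical helper mirroring s.like("str*"). A free function at module
// scope can be called with method syntax through UFCS.
bool like(string s, string pattern)
{
    return globMatch(s, pattern);
}

// Hypothetical wrapper type. Operators cannot be overloaded for two built-in
// operands, so one side has to be a user-defined type such as this struct.
struct Glob
{
    string pattern;

    // Implements `name in Glob(...)`; `in` stands in for the proposed `~~`.
    bool opBinaryRight(string op : "in")(string name) const
    {
        return globMatch(name, pattern);
    }
}

void main()
{
    string s = "string";
    writeln(s.like("str*"));               // prints true
    writeln("photo.ext" in Glob("*.ext")); // prints true
    writeln("notes.txt" in Glob("*.ext")); // prints false
}
```

As in the original `like` example, this only answers yes or no; it gives no match results.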

February 18, 2006

Re: Questions about builtin RegExp

Posted by Walter Bright
in reply to kris

"kris" <fu@bar.org> wrote in message news:dt7m3l$2hc5$1@digitaldaemon.com...
> Walter Bright wrote:
> [snip]
>> regex is a large reason people use scripting languages.
> Really? Do you have some kind of data to back that assertion?

Peer reviewed statistical research studies? Nope. But it's a pretty good impression one gets by reading the examples in manuals for scripting languages, listening to what people say about those languages, and looking at a sampling of actual scripts.

Here's a quote from "Programming Perl"'s preface by Larry Wall: "Perl is no longer just for text processing." That means, to me, that Perl was DESIGNED to be a text processing language. Why would the backbone of that, regex, not be why a large number of people use Perl?

Perl stands for "Practical Extraction and Report Language", i.e. text manipulation. Larry goes out of his way to say that Perl is a superset of sed and awk, which are regex string manipulation scripting languages.

February 18, 2006

Re: Questions about builtin RegExp

Posted by Walter Bright
in reply to Andrew Fedoniouk

"Andrew Fedoniouk" <news@terrainformatica.com> wrote in message news:dt7qm4$2kn0$1@digitaldaemon.com...
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:dt6nug$1lhe$1@digitaldaemon.com...
>>> string s = ....
>>> bool r = s.like("str*");
>> That doesn't give the match results, though.
> Who cares in most cases?

In a very large fraction of cases, it matters. After all, if you are searching a posting for an embedded email address, it doesn't do much good to only know that one is/isn't there. One is searching for it so one can do something with it.

> When you need match results you will use regexp
> or something more effective like tokenizers.

Writing a real lexer takes a lot of effort. That's why people invented regex; it'll handle most jobs without having to write a lexer. C's strtok() is embarrassingly inadequate.

February 18, 2006

Re: Questions about builtin RegExp

Posted by Lucas Goss
in reply to Walter Bright

Walter Bright wrote:
> ... but regex is a large reason people use scripting languages.

I've never used scripting languages for that purpose. The only reason I've used scripting languages is that they are often easier and quicker, and have a huge library for writing portable code. D almost matches them in being as easy and as quick, but lacks the huge standard library.

February 18, 2006

Re: Questions about builtin RegExp

Posted by Andrew Fedoniouk
in reply to Walter Bright

"Walter Bright" <newshound@digitalmars.com> wrote in message news:dt80n7$2qiu$3@digitaldaemon.com...
>
> "Andrew Fedoniouk" <news@terrainformatica.com> wrote in message news:dt7qm4$2kn0$1@digitaldaemon.com...
>> "Walter Bright" <newshound@digitalmars.com> wrote in message news:dt6nug$1lhe$1@digitaldaemon.com...
>>>> string s = ....
>>>> bool r = s.like("str*");
>>> That doesn't give the match results, though.
>> Who cares in most cases?
>
> In a very large fraction of cases, it matters. After all, if you are searching a posting for an embedded email address, it doesn't do much good to only know that one is/isn't there. One is searching for it so one can do something with it.

Probably in some Perl-ish use cases this is really needed.

In my http://blocknote.net, hyperlink auto-recognition starts working
on each complete non-whitespace sequence, so I already know the position.
But this is a particular use case.


>
>> When you need match results you will use regexp
>> or something more effective like tokenizers.
>
> Writing a real lexer takes a lot of effort. That's why people invented regex; it'll handle most jobs without having to write a lexer. C's strtok() is embarrassingly inadequate.

Why?

Here is a simple tokenizer for C/C++/D-like texts:

module harmonia.string;

class TokenizerT(CHAR)
{
  enum token { EOT, SPACE, WORD, QUOTE, DELIMETER, COMMENT }  ...
}

And

module harmonia.html.scanner;

is a simple HTML/XML push parser (scanner).

----------------------
I mean that the std lib should have multiple text-handling tools. RegExp is not the only possible one.

I would like to see something like the customizable
TokenizerT above in the std lib.
Frequently such a tokenizer is what is really needed, rather than
regexp or scripting-style poor man's tokenizing using array.split and the
like.

Andrew.
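
To make the tokenizer idea concrete, here is a minimal sketch in D of the kind of scanner described above. It is not Harmonia's TokenizerT (whose body is elided in the post); `Kind`, `Token` and `tokenize` are hypothetical names, only three of the token kinds are handled (no quotes or comments), and the input is assumed to be ASCII.

```d
import std.ascii : isAlphaNum, isWhite;
import std.stdio : writeln;

// Token kinds loosely modeled on the enum quoted above; hypothetical names.
enum Kind { space, word, delimiter }

struct Token
{
    Kind kind;
    string text;
}

// Splits `src` into runs of whitespace, runs of word characters, and
// single-character delimiters. Quotes and comments are not handled.
Token[] tokenize(string src)
{
    Token[] tokens;
    size_t i = 0;
    while (i < src.length)
    {
        immutable start = i;
        Kind kind;
        if (isWhite(src[i]))
        {
            kind = Kind.space;
            while (i < src.length && isWhite(src[i])) ++i;
        }
        else if (isAlphaNum(src[i]) || src[i] == '_')
        {
            kind = Kind.word;
            while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_')) ++i;
        }
        else
        {
            kind = Kind.delimiter;
            ++i; // every other character is a one-character token
        }
        tokens ~= Token(kind, src[start .. i]);
    }
    return tokens;
}

void main()
{
    // Print every non-space token with its kind.
    foreach (t; tokenize("x = foo(42);"))
        if (t.kind != Kind.space)
            writeln(t.kind, " ", t.text);
}
```

Unlike array.split, each token carries its kind and is a slice of the input, so the caller can tell a word from punctuation without re-scanning.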

February 19, 2006

Re: Questions about builtin RegExp

Posted by Walter Bright
in reply to Andrew Fedoniouk

"Andrew Fedoniouk" <news@terrainformatica.com> wrote in message news:dt87fd$314d$1@digitaldaemon.com...
>
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:dt80n7$2qiu$3@digitaldaemon.com...
>> Writing a real lexer takes a lot of effort. That's why people invented regex; it'll handle most jobs without having to write a lexer. C's strtok() is embarrassingly inadequate.
>
> Why?

I'd like to see strtok() parse an email address out of a body of text.

February 19, 2006

Re: Questions about builtin RegExp

Posted by Andrew Fedoniouk
in reply to Walter Bright

"Walter Bright" <newshound@digitalmars.com> wrote in message news:dt9ho8$20e4$3@digitaldaemon.com...

>>> Writing a real lexer takes a lot of effort. That's why people invented regex; it'll handle most jobs without having to write a lexer. C's strtok() is embarrassingly inadequate.
>>
>> Why?
>
> I'd like to see strtok() parse an email address out of a body of text.
>

I don't really understand "parse an email address out of a body of text."

Do you mean something like this:

char* pw = strtok( text, " \t\n\r" ); // first call takes the buffer itself
url u;

for( ; pw; pw = strtok( NULL, " \t\n\r" ) ) // later calls take NULL to continue
{
  if( !u.parse(pw) ) continue;
  if( u.protocol() == url::MAILTO )
     //found - do something here
     ;
}

?

Andrew.

February 19, 2006

Re: Questions about builtin RegExp

Posted by Chris Sauls
in reply to Andrew Fedoniouk

Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:dt9ho8$20e4$3@digitaldaemon.com...
> 
> 
>>>>Writing a real lexer takes a lot of effort. That's why people invented regex; it'll handle most jobs without having to write a lexer. C's strtok() is embarrassingly inadequate.
>>>
>>>Why?
>>
>>I'd like to see strtok() parse an email address out of a body of text.
>>
> 
> 
> I don't really understand "parse an email address out of a body of text."
> 
> Do you mean something like this:
> 
> char* pw = strtok( text, " \t\n\r" ); // first call takes the buffer itself
> url u;
> 
> for( ; pw; pw = strtok( NULL, " \t\n\r" ) ) // later calls take NULL to continue
> {
>   if( !u.parse(pw) ) continue;
>   if( u.protocol() == url::MAILTO )
>      //found - do something here
>      ;
> }
> 
> ?
> 
> Andrew. 
> 
> 

I think he meant something more like (using MatchExpr, sorry):

# char[] text = ...;
# char[] addr, user, host, tld;
# if (`([_a-z0-9]+)@([_a-z0-9]+)\.([_a-z0-9]+)` ~~ text) {
#   addr = _match[0];
#   user = _match[1];
#   host = _match[2];
#   tld  = _match[3];
#
#   // do something
# }

Granted, I just tossed that together in five seconds flat, so it's probably not quite right.  I'm just recently starting to lean into the RegExp camp myself.  It's made parsing of Lyra scripts a dream.

One thing I miss from a scripting language in doing the above, is PHP's lovely list() construct.  Pretending we had this in D:

# char[] text = ...;
# char[] addr, user, host, tld;
# if (`([_a-z0-9]+)@([_a-z0-9]+)\.([_a-z0-9]+)` ~~ text) {
#   list(addr,user,host,tld) = _match;
#   // do something
# }

-- Chris Nicholson-Sauls
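
The `~~` operator and `_match` in the snippets above were proposed syntax, not something D provides. A rough equivalent with the library module `std.regex` might look like the sketch below. `Address` and `findAddress` are hypothetical names introduced for illustration, and the pattern is as simplistic as the one in the post (it is not a real email validator); the `"i"` flag makes it case-insensitive.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

// Hypothetical holder for the pieces of a matched address.
struct Address
{
    string addr, user, host, tld;
}

// Returns true and fills `result` when `text` contains something that looks
// like user@host.tld. The pattern is deliberately simplistic.
bool findAddress(string text, out Address result)
{
    auto re = regex(`([_a-z0-9]+)@([_a-z0-9]+)\.([_a-z0-9]+)`, "i");
    auto m = matchFirst(text, re);
    if (m.empty)
        return false;
    // m[0] is the whole match, m[1]..m[3] are the parenthesized groups.
    result = Address(m[0], m[1], m[2], m[3]);
    return true;
}

void main()
{
    Address a;
    if (findAddress("Contact: someone@example.com for details", a))
        writeln(a.user, " at ", a.host, " dot ", a.tld);
}
```

Filling a struct from the captures plays roughly the role that PHP's list() plays in the second snippet.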
