What is a regular expression? - D Programming Language Discussion Forum


February 16, 2005

What is a regular expression?

Posted by Georg Wrede


So the guru unrolls a long scroll and
holds it in front of my eyes.

"What do you see here?"
I saw a long D program.

"What do you see in the code?"
I see an Unquoted Regular Expression.

"What do you reckon that is?"
A kind of string literal, I guess.

"Why?"
Er, the compiler knows what it is, so it's unquoted.

"What does the compiler think it is?"
Uhhh, a regular expression?

"What does the compiler think that is?"
Ehhh, well, a string literal?

"I see you have a problem. You have a cluttered mind."
????

"Your mind has been tortured for decades by this
man in Washington state. Your brain cells have been
fried by a man originally from Denmark. And your
perception is impaired, and this has been brought
upon you by those who type with two fingers, whose
books are stained with mayo, who use the floor as
their ashtray, and who do not entertain respect."

Head down and tail between my legs, I went home.
I decided to find out what a Regular Expression is.

---------

program lines

/unquoted-regular-expression/

program lines


Okay, let's start: this regular expression, what
does it want? It wants a string.

What does it give? Sometimes it gives a few other
strings.

Anything else? It could indicate whether it is happy.

Anything more? Well, I hope it tells me where it
stopped. Like when it didn't consume the entire
string.

Now, suppose we remove the words "regular expression"
from this? Then it should become obvious what we
are talking about.

We are talking about a function.

This function takes a string, and returns a group
of strings, and a pointer to where it stopped.

We also want to know whether it succeeded, which
might either be explicitly returned as a boolean
value, or we could infer it from the group of (or
lack of) strings.
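For readers who want to see that function written down, here is a minimal sketch in present-day D, assuming Phobos's std.regex. The names MatchResult and applyRegex are invented for illustration; they are not part of the proposal or of any library.

```d
import std.regex;
import std.stdio : writeln;

/// Everything the post asks a regular expression to report.
struct MatchResult
{
    bool ok;            // "whether it is happy"
    string[] groups;    // the "few other strings" (capture groups)
    size_t stoppedAt;   // index just past the matched text
}

/// Hypothetical helper: apply `pattern` to `s` and report the outcome.
MatchResult applyRegex(string s, string pattern)
{
    auto m = matchFirst(s, regex(pattern));
    if (m.empty)
        return MatchResult(false, null, 0);

    string[] groups;
    foreach (i; 1 .. m.length)   // m[0] is the whole match, so skip it
        groups ~= m[i];

    // m.pre is the text before the match, m.hit the matched text itself.
    return MatchResult(true, groups, m.pre.length + m.hit.length);
}

void main()
{
    auto r = applyRegex("John Smith, aged 42", `(\w+) (\w+)`);
    writeln(r.ok, " ", r.groups, " ", r.stoppedAt);
}
```

Returning a struct is only one way to bundle the three results; out parameters would serve equally well.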

---------

To be precise, it is an anonymous function.

Using this notion, we could use it in D already,
with no changes to the language -- other than
having the compiler understand them, of course.

February 16, 2005

Re: What is a regular expression?

Posted by Norbert Nemec
in reply to Georg Wrede

Georg Wrede wrote:
[...]
> 
> To be precise, it is an anonymous function.
> 
> Using this notion, we could use it in D already,
> with no changes to the language -- other than
> having the compiler understand them, of course.

So then, what would your proposed syntax look like?

February 16, 2005

Re: What is a regular expression?

Posted by Russ Lewis
in reply to Georg Wrede

I am intrigued by your suggestion, but, like Norbert, wonder what syntax you would propose.



P.S. You can "find out where a regular expression stopped" by appending (.*) to the end of your expression and then looking at the string that group captures; it holds whatever the rest of the pattern left unconsumed.
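A minimal sketch of that trick in present-day D, assuming Phobos's std.regex and single-line input (by default `.` does not match newlines there). The helper name restAfter is made up for illustration.

```d
import std.regex;
import std.stdio : writeln;

/// Hypothetical helper: append (.*) to `pattern` and return what that
/// extra group captures, i.e. the input left over after the match.
string restAfter(string s, string pattern)
{
    auto m = matchFirst(s, regex(pattern ~ `(.*)`));
    return m.empty ? null : m[m.length - 1];   // the appended group is the last one
}

void main()
{
    // The pattern consumes "width=640"; the trailing group holds the rest.
    writeln("[", restAfter("width=640 height=480", `(\w+)=(\d+)`), "]");
}
```

std.regex also exposes the same remainder directly as the `post` property of a match, so the extra group is not strictly needed there.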

February 16, 2005

Re: What is a regular expression?

Posted by Georg Wrede
in reply to Russ Lewis

I am reposting my answer to this. It seems to have ended
up in the message tree in a place where nobody noticed.  :-(

The answer is at the end.

Russ Lewis wrote:
> I am intrigued by your suggestion, but, like Norbert, wonder what syntax you would propose.

##########################################################

Norbert Nemec wrote:

> Ben Hinkle wrote:
>
>> "Norbert Nemec" <Norbert@Nemec-online.de>
>>
>>> Matthew wrote:
>>>
>>>> All we need now is to use that reserved $ for built-in regex, and D will be on its way to top place. :-)
>>>
>>>
>>> I've been thinking along that line as well. There is a clear advantage of compile-time regexps. Just consider the fact that in Python, you have to call 'compile' on a regexp before you can use it. This already makes clear that the translation from a regexp in string form to a regexp in executable representation basically is a compilation step which costs run-time performance.
>>
>>
>> I could see another string prefix for patterns that looks like the r"..." for raw strings. Something like p"..." would compile the pattern at compile-time instead of run-time. I bet, though, that the time spent in compiling a pattern at run-time is small compared to the time spent "running" the pattern. The downside as other people mentioned is that the format of the compiled pattern would have to match whatever the library expects.
>
>
> Depends on what you call 'compiling'. Your words remind me more of 'parsing' - and returning the pattern in some pre-digested format that can then be used by the matching library.
>
> What I was thinking of would be to produce actual executable code from the string-representation. And here, the compiler could do all kinds of optimizations on a given pattern.

-----------------------------------------
georg:

What if we consider an Unquoted Regular Expression as
just another piece of program code? It is, after all, only
program code, albeit written with a syntax of its own.

Then we might decide on a fixed signature for such. This would
make it easy for all kinds of libraries to interact with
regexps without hassles. What if the unquoted regex were
treated exactly the same as

bit justAnotherRegex(in char[] s,
                     inout int pos,
                     out char[][] subStrings)
{
  //implementation in D.
}

Then we could write

ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);

with no problem.

And, if we code responsibly and so Mother would be proud,
we might make it a habit to write in the following style:

enum {firstNam, lastNam}
if ( /lkjslkjlskjdlksjdf/ (currentLine, where, S) )
{
    victim = S[lastNam] ~ ", " ~ S[firstNam];
}

Also, if a bare regular expression is considered an
anonymous function, then it could be passed around
with Delegates, Function Pointers, and Dynamic Closures.
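The unquoted-regex syntax does not exist in D, but the fixed signature can be approximated in present-day D with a delegate. The sketch below assumes Phobos's std.regex; the alias Matcher and the factory makeMatcher are hypothetical names standing in for what the compiler would generate.

```d
import std.regex;
import std.stdio : writeln;

// The fixed signature from the post, written with present-day types
// (bool for bit, ref for inout, string for char[]).
alias Matcher = bool delegate(string s, ref size_t pos, out string[] subStrings);

/// Hypothetical factory standing in for the compiler: turns a pattern
/// into a closure with the fixed signature.
Matcher makeMatcher(string pattern)
{
    auto re = regex(pattern);
    return (string s, ref size_t pos, out string[] subStrings) {
        if (pos > s.length)
            return false;
        auto m = matchFirst(s[pos .. $], re);
        if (m.empty)
            return false;
        foreach (i; 1 .. m.length)
            subStrings ~= m[i];
        pos += m.pre.length + m.hit.length;   // resume point for the next call
        return true;
    };
}

void main()
{
    enum { firstNam, lastNam }
    Matcher nameRegex = makeMatcher(`(\w+) (\w+)`);

    size_t where = 0;
    string[] S;
    if (nameRegex("John Smith", where, S))
        writeln(S[lastNam] ~ ", " ~ S[firstNam]);
}
```

Because a Matcher is an ordinary delegate, it can be stored, passed to other functions, or kept in arrays like any other callable.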

February 16, 2005

Re: What is a regular expression?

Posted by Charlie Patterson
in reply to Georg Wrede

"Georg Wrede" <georg.wrede@nospam.org> wrote in message news:42139061.5060708@nospam.org...
> Then we might decide on a fixed signature for such. This would make it easy for all kinds of libraries to interact with regexps without hassles. What if the unquoted regex were treated exactly the same as
>
> bit justAnotherRegex(in char[] s,
>                      inout int pos,
>                      out char[][] subStrings)
> {
>   //implementation in D.
> }
>
> Then we could write
>
> ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);
>
> with no problem.

Sounds good!  But to simplify, why not just assume that a failure to match returns an empty set of "subStrings"?  Also, I don't know what pos is supposed to do, since the code can always process the entire input s at once. This leaves the equivalent of

    char[][] justAnotherRegex(in char[] s )

and is instantiated thusly

    theDollars = /lkajsdlkjaksdf/ (aDataLine);

This is similar to Perl except I like your function-looking call.  The Perl way would have been

    # note: approximate Perl. It's been a while
    theDollars = aDataLine =~ /lkajsdlkjaksdf/

I find yours clever and easier to read.

February 16, 2005

Re: What is a regular expression?

Posted by Kris
in reply to Charlie Patterson

In article <cv0480$pid$1@digitaldaemon.com>, Charlie Patterson says...
>
>
>"Georg Wrede" <georg.wrede@nospam.org> wrote in message news:42139061.5060708@nospam.org...
>> Then we might decide on a fixed signature for such. This would make it easy for all kinds of libraries to interact with regexps without hassles. What if the unquoted regex were treated exactly the same as
>>
>> bit justAnotherRegex(in char[] s,
>>                      inout int pos,
>>                      out char[][] subStrings)
>> {
>>   //implementation in D.
>> }
>>
>> Then we could write
>>
>> ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);
>>
>> with no problem.
>
>Sounds good!  But to simplify, why not just assume that a failure to match returns an empty set of "subStrings"?  Also, I don't know what pos is supposed to do, since the code can always process the entire input s at once. This leaves the equivalent of
>
>    char[][] justAnotherRegex(in char[] s )
>
>and is instantiated thusly
>
>    theDollars = /lkajsdlkjaksdf/ (aDataLine);
>
>This is similar to Perl except I like your function-looking call.  The Perl way would have been
>
>    # note: approximate Perl. It's been a while
>    theDollars = aDataLine =~ /lkajsdlkjaksdf/
>
>I find yours clever and easier to read.
>
>

.. and encapsulated too

(assuming there's a RegExp class instance to contain the char[][], or struct of char[]'s or whatever).

- Kris

February 16, 2005

Re: What is a regular expression?

Posted by pragma
in reply to Georg Wrede

In article <42139061.5060708@nospam.org>, Georg Wrede says...
>
>Then we might decide on a fixed signature for such. This would make it easy for all kinds of libraries to interact with regexps without hassles. What if the unquoted regex were treated exactly the same as
>
>bit justAnotherRegex(in char[] s,
>                      inout int pos,
>                      out char[][] subStrings)
>{
>   //implementation in D.
>}
>
>Then we could write
>
>ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);
>
>with no problem.
>
>And, if we code responsibly and so Mother would be proud, we might make it a habit to write in the following style:
>
>enum {firstNam, lastNam}
>if ( /lkjslkjlskjdlksjdf/ (currentLine, where, S) )
>{
>     victim = S[lastNam] ~ ", " ~ S[firstNam];
>}
>
>Also, if a bare regular expression is considered an
>anonymous function, then it could be passed around
>with Delegates, Function Pointers, and Dynamic Closures.

I, for one, like this idea.  I can also envision the following:

> alias bit function(in char[] s,
>                      inout int pos,
>                      out char[][] subStrings) Regexp;
>
> Regexp myExpression = /lkjslkjlskjdlksjdf/;
> if (myExpression(currentLine, where, S))
> {
>     victim = S[lastNam] ~ ", " ~ S[firstNam];
> }

..which looks a touch cleaner.

Another possibility is to tie the language a little closer to Phobos by way of the RegExp class; although that throws all the talk of compiler-based optimization out the window.

- EricAnderton at yahoo

February 16, 2005

Re: What is a regular expression?

Posted by Georg Wrede
in reply to Kris

Kris wrote:
> In article <cv0480$pid$1@digitaldaemon.com>, Charlie Patterson says...
>>Sounds good!  But to simplify, why not just assume that a failure to match returns an empty set of "subStrings"?

Sometimes you might consider it a success even if no substrings
are found.
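A small sketch of the point, in present-day D with Phobos's std.regex (the helper name captures is made up for illustration): a pattern without capture groups can match and still produce no substrings, so an empty result alone cannot signal failure.

```d
import std.regex;

/// Hypothetical simplified form: return only the capture groups.
string[] captures(string s, string pattern)
{
    string[] groups;
    auto m = matchFirst(s, regex(pattern));
    if (!m.empty)
        foreach (i; 1 .. m.length)   // m[0] is the whole match
            groups ~= m[i];
    return groups;
}

void main()
{
    // Both calls return an empty array, yet only the second one failed.
    assert(captures("error: disk full", `^error:`).length == 0);  // matched, no groups
    assert(captures("all good", `^error:`).length == 0);          // did not match
}
```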

>>Also, I don't know what pos is supposed to do, since the code can always process the entire input s at once.

I can imagine several scenarios where you scan, say through
an entire file, and depending on what you find, you might
want to treat the immediately following part differently.

Like reading source code in a folding text editor, or such.
Or creating an interpreter for a new language.

>>This leaves the equivalent of
>>
>>   char[][] justAnotherRegex(in char[] s )
>>
>>and is instantiated thusly
>>
>>   theDollars = /lkajsdlkjaksdf/ (aDataLine);
>>
>>This is similar to Perl except I like your function-looking call.  The Perl way would have been
>>
>>   # note: approximate Perl. It's been a while
>>   theDollars = aDataLine =~ /lkajsdlkjaksdf/
>>
>>I find yours clever and easier to read.
> 
> .. and encapsulated too 
> 
> (assuming there's a RegExp class instance to contain the char[][], or struct of
> char[]'s or whatever).

There are actually two different situations for using a
regexp. One is scanning, the other is search-and-replace.

So, actually we need two signatures.

function bit (in char[] s,
              inout int pos,
              out char[][] subStrings)
{}

function bit (in char[] stringIn,
              inout int posStringIn,
              out char[] stringOut,
              inout int posStringOut,
              out char[][] subStrings)
{}

The first one takes the string to be examined, a position
in it, and a pointer to an array of strings. It returns
false if it did not find anything, or if the string
to search was null.

The array of strings is the set of "$1 ..." substrings,
that are so often used in Perl, and the like.

As I said, one probably uses this function several times
over on a given textstring, so saving the position is
essential. And we want this function to be re-entrant,
so the position can't be saved "within the function."

The second one is for when we are doing a
search-and-replace. The new parameters are the output
string, and a position in it.

Since speed is a main point in using regular expressions,
returning the output via a parameter instead of as the
return value works well. Especially since we have two
things, the output "string" and the position in it.

Normally, instead of concatenating (which is slow), one
would reserve an empty output buffer up front, maybe
1.5 * the initial guess of the output size.

Search-and-replace is hardly ever done in place (even
if we are all used to thinking so -- but that is because
every day we see text editors do it "in place" in the
text!). So this gives a natural reason to have the two
signatures.

When the compiler sees the regexp, it immediately sees
whether it is an outputting (replacing) regexp or just
a searching one. At that point it decides which signature
it gives to the regexp, and goes on compiling the binary
for it.

If we (or actually Walter!) feel industrious, we might
even have two more signatures. One for the case where
no substrings are generated, and another for when we
don't even care about the end position.

function bit (in char[] s,
              inout int pos)
{}

function bit (in char[] s)
{}

But I guess these come for "free" once the first two are
written.  :-)
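To make the search-and-replace case concrete, here is a rough sketch in present-day D of what the second signature has to do: walk the input with a position, and write to a separately reserved output buffer instead of concatenating. It assumes Phobos's std.regex and std.array; replaceInto is a hypothetical name, and anchors such as ^ would behave differently here because each search runs on a slice of the input.

```d
import std.array : appender;
import std.regex;
import std.stdio : writeln;

/// Sketch: copy `input` to an output buffer, substituting `replacement`
/// for each non-empty match of `pattern`.
string replaceInto(string input, string pattern, string replacement)
{
    auto re = regex(pattern);
    auto output = appender!string();
    output.reserve(input.length + input.length / 2);   // the "1.5 *" guess

    size_t pos = 0;                        // position in the input
    while (true)
    {
        auto m = matchFirst(input[pos .. $], re);
        if (m.empty || m.hit.length == 0)  // stop on failure or an empty match
            break;
        output.put(m.pre);                 // unmatched text before the hit
        output.put(replacement);
        pos += m.pre.length + m.hit.length;
    }
    output.put(input[pos .. $]);           // tail after the last match
    return output.data;
}

void main()
{
    writeln(replaceInto("a1b22c", `\d+`, "#"));   // prints "a#b#c"
}
```

In everyday code one would simply call std.regex's replaceAll; the loop is spelled out only to show where the two positions and the output buffer come in.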

February 16, 2005

Re: What is a regular expression?

Posted by Georg Wrede
in reply to pragma

pragma wrote:
> In article <42139061.5060708@nospam.org>, Georg Wrede says...
> I, for one, like this idea.  I can also envision the following:
> 
> 
>>alias bit function(in char[] s,
>>                     inout int pos,
>>                     out char[][] subStrings) Regexp;
>>
>>Regexp myExpression = /lkjslkjlskjdlksjdf/;
>>if (myExpression(currentLine, where, S))
>>{
>>    victim = S[lastNam] ~ ", " ~ S[firstNam];
>>}
> 
> 
> ..which looks a touch cleaner.
> 
> Another possibility is to tie the language a little closer to Phobos by way of
> the RegExp class; although that throws all the talk of compiler-based
> optimization out the window.

That looks really good! This actually means that we could
use regexps now in several cool ways!

Ah, and about the RegExp class you mention, we still need a
runtime regexp compiling facility. That would naturally be
in the library. That would be used when the regexp to be
parsed is only known at runtime.

Also, anybody willing could write an OO interface to the
"hard-coded-regexps" thing discussed in this thread. That
too should go to a library.
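For comparison, present-day Phobos offers both facilities as library features rather than language syntax: std.regex's ctRegex for patterns fixed in the source and regex() for patterns only known at run time. A minimal sketch, in which the function names yearOf and matchesAnywhere are invented for illustration:

```d
import std.regex;
import std.stdio : writeln;

/// Pattern known in the source: ctRegex generates the matcher at compile time.
string yearOf(string s)
{
    auto datePattern = ctRegex!(`(\d{4})-(\d{2})-(\d{2})`);
    auto m = matchFirst(s, datePattern);
    return m.empty ? null : m[1];
}

/// Pattern only known while running: regex() parses and compiles it then.
bool matchesAnywhere(string s, string userPattern)
{
    return !matchFirst(s, regex(userPattern)).empty;
}

void main(string[] args)
{
    writeln(yearOf("posted 2005-02-16"));

    // Take the pattern from the command line if one was given.
    string userPattern = args.length > 1 ? args[1] : `\d+`;
    writeln(matchesAnywhere("posted 2005-02-16", userPattern));
}
```

Both kinds of pattern go through the same matchFirst call, so code that uses a regex need not care which way it was built.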

February 16, 2005

Re: What is a regular expression?

Posted by John Reimer
in reply to Georg Wrede

Georg Wrede wrote:
> pragma wrote:
> 
>> In article <42139061.5060708@nospam.org>, Georg Wrede says...
>> I, for one, like this idea.  I can also envision the following:
>>
>>
>>> alias bit function(in char[] s,
>>>                     inout int pos,
>>>                     out char[][] subStrings) Regexp;
>>>
>>> Regexp myExpression = /lkjslkjlskjdlksjdf/;
>>> if (myExpression(currentLine, where, S))
>>> {
>>>    victim = S[lastNam] ~ ", " ~ S[firstNam];
>>> }
>>
>>
>>
>> ..which looks a touch cleaner.
>>
>> Another possibility is to tie the language a little closer to Phobos by way of
>> the RegExp class; although that throws all the talk of compiler-based
>> optimization out the window.
> 
> 
> That looks really good! This actually means that we could
> use regexps now in several cool ways!
> 
> Ah, and about the RegExp class you mention, we still need a
> runtime regexp compiling facility. That would naturally be
> in the library. That would be used when the regexp to be
> parsed is only known at runtime.
> 
> Also, anybody willing could write an OO interface to the
> "hard-coded-regexps" thing discussed in this thread. That
> too should go to a library.

Must admit... looks good.
