February 16, 2005
If D had better support for RegEx's this would expedite a self-hosting compiler, no?  And that's the real test of a programming language.  You know you're getting somewhere when you are self-hosting.

-Craig


February 16, 2005
On Wed, 16 Feb 2005 21:33:55 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Kris wrote:
>> In article <cv0480$pid$1@digitaldaemon.com>, Charlie Patterson says...
>>> Sounds good!  But to simplify, why not just assume that a failure to match returns an empty set of "subStrings".
>
> Sometimes you might consider it a success even if no substrings
> are found.

I prefer to ask "can this function ever return false?".

I can only think of 1 time, a malformed regular expression. If the regexp is 'compiled' then the compiler can check that at compile time, and we dont need a 'false' return value.

If the regexp is formed at runtime and malformed then we should be throwing an exception.

So either way, no false return is required.

function char[][] (int char[] s, out int rest);


>>> This leaves the equivalent of
>>>
>>>   char[][] justAnotherRegex(in char[] s )
>>>
>>> and is instantiated thusly
>>>
>>>   theDollars = /lkajsdlkjaksdf/ (aDataLine);
>>>
>>> This is similar to Perl except I like your function-looking call.  The Perl way would have been
>>>
>>>   # note approximate Perl. it's been a while
>>>   theDollars = aDataline ~= /lkajsdkljaksdf/
>>>
>>> I find yours clever and easier to read.
>>  .. and encapsulated too  (assuming there's a RegExp class instance to contain the char[][], or struct of
>> char[]'s or whatever).
>
> There are actually two different situations for using a
> regexp. One is scanning, the other is search-and-replace.
>
> So, actually we need two signatures.
>
> function bit (in char[] s,
>                inout int pos,
>                out char[][] subStrings)
> {}
>
> function bit (in char[] stringIn,
>                inout int posStringIn,
>                out char[] stringOut,
>                inout int posStringOut,
>                out char[][] subStrings)
> {}

Can't we do search and replace with the original function eg.

char[] searchReplace(char[] input, char[] search, char[] replace) {
  char[][] s;

  //this regexp is really a run-time one
  s = /regexp/(input);

  //find and replace strings
  foreach(int i, inout char[] r; s) {
    if (r == search) r = replace;
  }

  //generic function which appends all strings in an array into 1 string
  return combine(s);
}

The above function assumes the results in s are non overlapping, non repeated slices of the input. Basically I don't see why we need more than the 1 regexp function, writing a generic search and replace function is trivial.


> The first one takes the string to be examined, a position
> in it, and a pointer to an array of strings. It returns
> false if it did not find anything, or if the string
> to search was null.
>
> The array of strings is the set of "$1 ..." substrings,
> that are so often used in Perl, and the like.
>
> As I said, one probably uses this function several times
> over on a given textstring, so saving the position is
> essential. And we want this function to be re-entrant,
> so the position can't be saved "within the function."
>
> The second one is for when we are doing a search-and-
> replace. The new parameters are the output string,
> and a position in it.
>
> Since speed is a main point in using regular expressions,
> returning the output via a parameter instead of as the
> return value works well. Especially since we have two
> things, the output "string" and the position in it.

If we need to store state, then IMO this is an indication we should be using a class (or struct).

function RegExp (char[] input);

class RegExp {
  char[] input;
  int rest;

  char[][] results;

  void continue() {}
}

RegExp r;

r = /regexp/(input);   //initial use/construction
..etc..
r.continue();          //reuse


> Normally, instead of concatenating (which is slow), one
> would reserve an empty string, maybe 1.5 * the initial
> guess of size, for output.

Modifying my above search replace, to optimise the memory allocation, I get...

char[] searchReplace(char[] input, char[] search, char[] replace) {
  char[] newstring;
  int length = 0;
  RegExp r;

  //this regexp is really a run-time one
  r = /regexp/(input);

  //find and replace strings (does inout work here?)
  foreach(inout char[] s; r.results) {
    if (s == search) s = replace;
    length += s.length;
  }

  //generic function which copies all strings into array provided
  newstring.length = length;
  combine(r.results,newstring);
  return newstring;
}


> Search-and-replace are hardly ever done in-place. (even
> if we all are used to thinking this --

In D, given that a char[][] is an array of arrays, the regexp results can be slices into a existing array, it actually seems/looks like we _are_ doing it in place :)

Even tho in reality of course we're not.


All up I think this is a great idea, but I don't see the need for any more than the 1 function/form.

Regan
February 16, 2005
In article <cv06f4$s5u$1@digitaldaemon.com>, pragma says...
>
>In article <42139061.5060708@nospam.org>, Georg Wrede says...
>>
>>Then we might decide on a fixed signature for such. This would make it easy for all kinds of libraries to interact with regexps without hassles. What if the unquoted regex were treated exactly the same as
>>
>>bit justAnotherRegex(in char[] s,
>>                      inout int pos,
>>                      out char[][] subStrings)
>>{
>>   //implementation in D.
>>}
>>
>>Then we could write
>>
>>ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);
>>
>>with no problem.
>>
>>And, if we code responsibly and so Mother would be proud, we might make it a habit to write in the following style:
>>
>>enum {firstNam, lastNam}
>>if ( /lkjslkjlskjdlksjdf/ (currentLine, where, S) )
>>{
>>     victim = S[lastNam] ~ " ," ~ S[firstNam];
>>}
>>
>>Also, if a bare regular expression is considered an
>>anonymous function, then it could be passed around
>>with Delegates, Function Pointers, and Dynamic Closures.
>
>I, for one, like this idea.  I can also envision the following:
>
>> alias bit function(in char[] s,
>>                      inout int pos,
>>                      out char[][] subStrings) Regexp;
>>
>> Regexp myExpression = /lkjslkjlskjdlksjdf/;
>> if (myExpression(currentLine, where, S))
>> {
>>     victim = S[lastNam] ~ " ," ~ S[firstNam];
>> }
>
>..which looks a touch cleaner.
>
>Another possibility is to tie the language a little closer to Phobos by way of the RegExp class; although that throws all the talk of compiler-based optimization out the window.
>
>- EricAnderton at yahoo


So yeah, one of the points was to have the compiler convert the search pattern at compile-time. This could be done within static constructors, but let's suppose the compiler did it instead -- this would then require an ABI for the compiled-regex representation, and the runtime executor would have to abide by that. Right?

Alternatively, the compiler could compile the regexp into an object file. I mean, that's what compilers do, right? Presumeably one mightleverage such things via an Interface or a set of delegates (same thing), for match, replace, etc. For such patterns, there would be no need for a runtime execution model. This is, indeed, similar to what some regex implementations do (compile to x86 code).

That's all well and good. But then what about patterns not known at compile time? Perhaps the part of the compiler responsible for compiling the regex might be exposed at runtime also?

Yet another alternative would be to support "compile time extension functions" .. these would have to accept const data and emit const data. The compiler might invoke them via some new 'static' or 'const' syntax? At that stage, one could extend the compiler in a number of interesting ways. The point is that the current *runtime* regex compiler could be invoked at *compile time* instead. Not a huge gain, given static constructors, but interesting perhaps.

Regardless; given the functionality that RegExp exposes, there really needs to be a context maintained for each instantiation (via a class, struct, or some other means). Again, there's that encapsulation I keep harping on about.

- Kris


February 16, 2005
In article <cv0iap$1bss$1@digitaldaemon.com>, Kris says...
>Alternatively, the compiler could compile the regexp into an object file. I mean, that's what compilers do, right? Presumeably one mightleverage such things via an Interface or a set of delegates (same thing), for match, replace, etc. For such patterns, there would be no need for a runtime execution model. This is, indeed, similar to what some regex implementations do (compile to x86 code).
>
>That's all well and good. But then what about patterns not known at compile time? Perhaps the part of the compiler responsible for compiling the regex might be exposed at runtime also?

(poor form, replying to ones own post ...)

Thought I'd note that one *could* invoke the compiler, as a subprocess, to support regex compilation at runtime. The output (for the compiled regex) would need to be a shared-lib/DLL ... but if each regex were compiled to a class anyway, that would work just fine. This would:

- alleviate issues relating to a regex ABI
- remove the need for a RegExp library
- would presumably take care of code bloat via char/wchar/dchar RegExp templates
(they'd be in the compiler rather than in the runtime code)
- arm D with the fastest damn regex functionality around (worth a lot in some
environments)

Just a thought;



February 17, 2005
Kris wrote:
> In article <cv0iap$1bss$1@digitaldaemon.com>, Kris says...
> 
>>Alternatively, the compiler could compile the regexp into an object file. I
>>mean, that's what compilers do, right? Presumeably one mightleverage such things
>>via an Interface or a set of delegates (same thing), for match, replace, etc.
>>For such patterns, there would be no need for a runtime execution model. This
>>is, indeed, similar to what some regex implementations do (compile to x86 code).
>>
>>That's all well and good. But then what about patterns not known at compile
>>time? Perhaps the part of the compiler responsible for compiling the regex might
>>be exposed at runtime also?
> 
> 
> (poor form, replying to ones own post ...)
> 
> Thought I'd note that one *could* invoke the compiler, as a subprocess, to
> support regex compilation at runtime. The output (for the compiled regex) would
> need to be a shared-lib/DLL ... but if each regex were compiled to a class
> anyway, that would work just fine. This would:
> 
> - alleviate issues relating to a regex ABI - remove the need for a RegExp library - would presumably take care of code bloat via char/wchar/dchar RegExp templates
> (they'd be in the compiler rather than in the runtime code)


> - arm D with the fastest damn regex functionality around (worth a lot in some
> environments)

Yessss! :-)

> Just a thought;

((( FTR: I better change tone, I just read all my regex
posts of lately, and some of them came across as a bit
arrogant. Hope I was the only one who thought that! :-(
But back to the issue.)))

Assuming compile time regexps were implemented as recently
discussed, then we need to think about how we could get
runtime regexps to behave smoothly, too.

Maybe, if this is ok,

if ( /aCompileTimeRegexHere/ (instr, i, S) )
{
   // do something
}

then, the question is: How does a _runtime_ regex "enter"
into our program?

Presumably it would start its life as the contents of a
string?

userregex = askUser("give me the regex");

Or any of a thousand other ways, like as found in a file.

Now, what would be really nice is to end up using this
regex the same way as the compile time regexes. Like

if ( aRuntimeCompiledRegex (instr, i, S) )
{
   // do something
}

So, we need to transform this "userregex" to this
"aRuntimeCompiledRegex" thing.

Skipping a few steps here, my suggestion is:
can we afford a new type in the language?
Then we could simply write

regex myRegex;
myRegex = RunTimeCompileRegex(regexAsString);

if ( myRegex (instr, i, S) )
{
   // do something
}

And the myRegex could then be used in Delegates,
Function Pointers, and Dynamic Closures just as
well as the compile-time ones.

Maybe we could make the compilation implicit?
Like:

regex myRegex = new regex(regexAsString);

That of course assuming code on the heap can be
executed in both Windows and Linux. Gurus: help!

As to the UTF issue, can we just decide that
the resulting myRegex will work on the same kind
of strings from which it was made? Or is there
some reason to make this more complicated?

---

Somebody pointed out earlier that the regex type
should only have one signature. I'm starting to
believe this. It would surely help when passing
them around and interacting with libraries.

At the same time it would be nice to be able to
use regexps like they had many signatures:

if ( aRegex (thestring) ) {xxxxxx}
if ( aRegex (str, pos) ) {xxxxxx}
if ( aRegex (str, outSubstArray) ) {xxxxxx}
if ( aRegex (str, pos, outSubstrArray) ) {xxxxxx}
if ( aRegex (instr, outstr) ) {xxxxxx}
if ( aRegex (instr, outstr, outSubArr) ) {xxxxxx}
... and a couple of others?

Is this an easy or hard problem to fix?

---

Somebody pointed out that we do not necessarily
need the regex to return a boolean value. I assume
that is because the outcome can be inferred from
the parameter returns.

I think however, that returning a boolean makes
the code overall look nicer. False would mean
not found, and the like, and exceptions would
be raised only for more serious conditions. Indeed,
if a regexp can even theoretically need to do that?

Malformed regexps of course would be caught at
instantiation.
February 19, 2005
> Maybe, if this is ok,
> 
> if ( /aCompileTimeRegexHere/ (instr, i, S) )
> {
>    // do something
> }
> 
> then, the question is: How does a _runtime_ regex "enter"
> into our program?

I like your idea of the type. A runtime regex could also be something like this:

char[] my_regex = "/aRuntimeRegexHere/";
if ( my_regex(instr, i, S) )
{
   // do something
}

This will require a check to see if there is parenthesis after an array of char (may be more trouble than it's worth, so perhaps, adding .regex() as a compiler interface would be a good way of doing it (similar to .length).

--------------------------

In the case of optimizing, there isn't a lot of good ways to do it at runtime. There are two cases I can foresee. (the performance enhancement might only be negligible though) The main advantage of non-runtime regexes is that they are compiled ahead of time. This could be provided to the programmer for runtime regexes too. It might be highly impractical to compile into machine code using the static compiler, as that would require executable memory... The other option would be bytecode. Again, not sure how much of a boost that would give though, if the programmer is allowed to compile into bytecode ahead of time.

myregex.compileregex;
// do stuff
if( myregex.regex(instr, i, S) )

The other possible performance increase could be caching the regex' bytecode or asm with a hash or something (not md5 -- it's too big. something smaller and fast like two 32 bit hashes or something). When the same regex is done twice, it doesn't need compilation again. I suppose this could be added to phobos or something.

That also can be done with your 'type' method as well.


It just seems so much for such a small performance increase. Someone should write a test program and benchmark it. The real performance will definitely be in staticly compiled regexes.

tough subject :) I like the ideas a lot though...
February 19, 2005
How about the following syntax:

char[][] = /<putYourRegexHere>/(<stringToMatch>);

Then, we devise a preprocessor which would turn it into something like this:

char[][] = function char[][](char[] arg) { ... } (<stringToMatch>);

The advantage here being that the preprocessor can compile the regex into a function, and detect any regex errors at build time.

Of course, if we got this functionality working well, then we could suggest that Walter integrate it into the compiler, and thus get rid of the preprocessor utility.
February 19, 2005
In article <cv827q$c8s$1@digitaldaemon.com>, Russ Lewis says...
>
>How about the following syntax:
>
>char[][] = /<putYourRegexHere>/(<stringToMatch>);
>
>Then, we devise a preprocessor which would turn it into something like this:
>
>char[][] = function char[][](char[] arg) { ... } (<stringToMatch>);
>
>The advantage here being that the preprocessor can compile the regex into a function, and detect any regex errors at build time.
>
>Of course, if we got this functionality working well, then we could suggest that Walter integrate it into the compiler, and thus get rid of the preprocessor utility.

There's just one hole in the syntax that I can think of: comments will not play nice with the slashes:

>char[][] = /*is this a comment?/(<stringToMatch>);
>char[][] = /+is this a comment?/(<stringToMatch>);

I, for one, cannot forsee a way to make this syntax compatible with the rest of the D language.  Perhaps a string decorator should be used instead:

>char[][] = r"<putYourRegexHere>"(<stringToMatch>);
>char[][] = $"<putYourRegexHere>"(<stringToMatch>);

Its not as pretty, but using / is going to cause trouble.

- EricAnderton at yahoo
February 19, 2005
pragma wrote:
> There's just one hole in the syntax that I can think of: comments will not play
> nice with the slashes:
> 
> 
>>char[][] = /*is this a comment?/(<stringToMatch>);
>>char[][] = /+is this a comment?/(<stringToMatch>);

Well, I'd say it's a comment since it doesn't make any sense as a regexp :) * or + will always follow some other part of a regular expression. Here they don't. It's a comment

Tom
1 2
Next ›   Last »