February 15, 2005
Regan Heath wrote:
> On Tue, 15 Feb 2005 13:39:16 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
>> Matthew wrote:
....
> syntax may play a part in making it more or less likely, neither of these suggested syntaxes, to me, seems to make any difference in this respect.

I'm completely with you here!

I've got an idea, but I'll have to check it first.
February 16, 2005
Eeew.... so I'd have to do a struct of ugly proportions for something this simple? (yes I know it could also likely be done with sscanf and indeed it would probably be faster...)

// Check memcached return value...
if (stream.readLine() ~= /^VALUE [^\s]+ (\d+)/i)
{
   value = stream.readString(atoi(matches[1]));

   // Skip the \r\n...
   assert(stream.getc() == '\r');
   assert(stream.getc() == '\n');
}

So some sort of struct like:

struct SomeHatefulDummyStruct
{
   char[] matchedData; // $0
   char[] firstMatch;  // $1
}

I don't know about you, but I can't stand that.  The code could be so simple, so readable (assuming you understand my basic code practices, which I would of course have documented elsewhere and would be common enough that most people would recognize them.)  Instead I have to have a dummy struct?  Why?

I suppose it'd be nice, for the really long ugly cases where you probably shouldn't be using a regular expression anyway... but what about the simple ones where introducing identifiers the reader has to look up is only going to complicate things?

-[Unknown]
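
For comparison, here is a minimal sketch of the same header check written as library code. It assumes present-day Phobos `std.regex`, which postdates this thread, and `valueHeaderNumber` is a hypothetical name. The pattern uses `\S+`, which is equivalent to the `[^\s]+` in the post.

```d
import std.conv : to;
import std.regex : matchFirst, regex;

/// Returns the first numeric field after the key in a "VALUE ..." header
/// line (the field the post treats as the length to read), or -1 when the
/// line does not match. Hypothetical helper, not part of any library.
long valueHeaderNumber(string line)
{
    // The captures live in a local value, not in global $1-style variables.
    auto m = matchFirst(line, regex(`^VALUE \S+ (\d+)`, "i"));
    if (m.empty)
        return -1;
    return m[1].to!long; // m[0] is the whole match, m[1] the first group
}
```

No dummy struct is declared here; the match object itself plays that role and goes out of scope with the function.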

> Yeah. 
> 
> One major problem, as I see it, with this syntax is the anonymous nature of the
> $ references. I presume these are supposed to refer to numbered sub-patterns,
> yes?
> 
> From my perspective, this would be much better served using structs with <gasp>
> named members instead:
> 
> ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
> 12            3  4          5       6  7        8 9
> 
> struct Uri_RFC2396
> {
> char[] uri,
> protocol,
> scheme,
> resource,
> authority,
> path,
> args,
> query,
> tail,
> fragment;
> }
> 
> If one could pass this off to the RegExp and have it filled in, then so much the
> better. One might make the struct a union if need be (with a char[][10]). One
> could also give the members really obtuse names if so desired (a, b, c, d, e, f
> ..), and end up with vague 3 letter abbreviations instead of 2
> 
> In addition to readability & maintainability, the primary benefit is via the
> lack of implicit global variables (or thread-locals), coupled with the privacy,
> utility, and speedy access of stack-based structures. Heck, you could place the
> struct on the heap if you needed to; or even make it a global if you feel lucky.
> The point is that you have a choice. 
> 
> I sure hope D does not end up looking like, ahh, APL :-)
> 
> 
February 16, 2005
Georg Wrede wrote:
> 
> 
> John Reimer wrote:
> 
>> Georg Wrede wrote:
>>
>>>>>>>>>>  if line =~ /unquoted-regex-pattern-here/
>>>>>>>>>>      gr1 = $1
>>>>>>>>>>      gr2 = $2
>>>>>>>>>>      etc ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just off hand, suggesting this syntax for D scares the
>>>>>>>>> living daylights out of me.
>>>
>>>
>>>
>>>
>>> if file1line =~ /unquoted-regex-pattern-here/
>>> mystring = $3 ~ "bla" ~ $5;
>>> if ($3<$8 && $9==$1 || $3>$2 && $5=<($2 ~ foo)
>>>     mystring ~= $9 ~ "xx" ~ $3;
>>> else {
>>>     if file2line =~ /unquoted-regex-pattern-here/
>>>     mystring ~= $2 ~ $3 ~ " got out of hand!";
>>> }
>>> mystring ~= "\nAnd he never knew why.";
>>
>>
>>
>> Actually, it doesn't look all that bad.
> 
> 
> Yeah, but there is one bug already in this code, made by
> the "author" (i.e. me, writing the code). And there is
> another that'll hit him the second he starts writing the
> rest (of this imagined example).
> 
> My problem is that neither you, nor Matthew, noticed it.
> So probably the average user, or the next guy won't either.
> 
> Thus, I'm not happy with the syntax.
> 
> ---------
> 
> D can brag about being very easy to understand. Just browsing
> someone's code gives you immediately an idea of what's
> going on. But the above code takes a pencil and paper,
> and peace and quiet, before one can figure it out.
> 
> I really see worms and anacondas taking over, if we
> unlock the door and let Larry Wall in.

Frankly, I didn't look at this as an exercise in bug-hunting.  I thought the point of the post was to show how ugly the $ sign notation was. Thus my response to the contrary.  So the fact that I didn't notice the problem is of no relevance.

I'm not too surprised that you could make a buggy example out of a not-yet-defined D syntax.  Someone else could make a non-buggy example too, I'm sure, with a nice, clear presentation.

I was just musing that, aesthetically, it looked tolerable.  This doesn't rule out the possibility of a better way, of course; and, by far, I'm in no position to offer any expert opinion on what that way might be.

Matthew on the other hand...

Later,

John
February 16, 2005
Unknown W. Brackets wrote:
> Eeew.... so I'd have to do a struct of ugly proportions for something this simple? (yes I know it could also likely be done with sscanf and indeed it would probably be faster...)
> 
> // Check memcached return value...
> if (stream.readLine() ~= /^VALUE [^\s]+ (\d+)/i)
> {
>    value = stream.readString(atoi(matches[1]));
> 
>    // Skip the \r\n...
>    assert(stream.getc() == '\r');
>    assert(stream.getc() == '\n');
> }
> 
> So some sort of struct like:
> 
> struct SomeHatefulDummyStruct
> {
>    char[] matchedData; // $0
>    char[] firstMatch;  // $1
> }
> 
> I don't know about you, but I can't stand that.  The code could be so simple, so readable (assuming you understand my basic code practices, which I would of course have documented elsewhere and would be common enough that most people would recognize them.)  Instead I have to have a dummy struct?  Why?

You don't have to, if you don't want to. You might use a predefined struct with anonymous names. Perhaps it might be something like this:

(hope the spaces don't get mashed ...)

module std.RegExp;

struct Patterns
{
char[] p1,
       p2,
       p3,
       etc ... // or a, b, c, d ...
}

Followed by:

import std.RegExp;

Patterns patterns;
...
with (patterns)
     {
     if (p1)
         ...
     }

and so on. The 'with' statement can be quite handy for such things.
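
A rough sketch of how the named-member idea could look for the URI pattern quoted earlier in the thread. It assumes present-day Phobos `std.regex` rather than the library under discussion, and `UriParts` and `parseUri` are made-up names for illustration.

```d
import std.regex : matchFirst, regex;

/// Named slots for the sub-patterns of the URI pattern quoted earlier in
/// the thread (group numbers in comments). Names are illustrative only.
struct UriParts
{
    string scheme;    // group 2
    string authority; // group 4
    string path;      // group 5
    string query;     // group 7
    string fragment;  // group 9
}

/// Fills `parts` from `uri`; returns false if the pattern does not match.
bool parseUri(string uri, out UriParts parts)
{
    auto m = matchFirst(uri,
        regex(`^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?`));
    if (m.empty)
        return false;
    with (parts) // member names resolve against `parts` inside this block
    {
        scheme = m[2];
        authority = m[4];
        path = m[5];
        query = m[7];
        fragment = m[9];
    }
    return true;
}
```

The caller decides where the struct lives (stack, heap, or a global), which is the choice the earlier post argues for.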


> I suppose it'd be nice, for the really long ugly cases where you probably shouldn't be using a regular expression anyway... 

Why ever not? Complex pattern matching is what regex typically excels at. A URL is a fairly jolly example, is it not? If speed is the #1 priority, then I would unwind the parse into dedicated and hand-tweaked code. Where convenience is king, regex can be awesome.

> but what about the simple ones where introducing identifiers the reader has to look up is only going to complicate things?

Are you saying that using '$3' somehow magically transcends the need to locate and mentally parse the original regex pattern? To see what '$3' refers to? The point of using descriptive names is to avoid that :-)

This is the same principle as used when folk apply descriptive names to program variables; so one doesn't have to search out the last assignment to see what the dashed thing represents. These sub-patterns *are* program variables; they just happen to be related to one particular type of functionality (all the more reason why containment should be applied). Still, one can always use the predefined 'anonymous' struct instead; in the same fashion as we often use 'int i'. If $1 and $7 are so blindingly obvious, then:

with (patterns)
     {
     p1 ~= p7;
     ...
     }

should be just as obvious. Yes?
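
One way descriptive names and a terse pattern can coexist is a named group inside the pattern itself. The sketch below assumes the `(?P<name>...)` syntax of present-day Phobos `std.regex`, not anything proposed in this thread; `hostOf` is a hypothetical helper.

```d
import std.regex : matchFirst, regex;

/// Returns the host part of an http URL, or null if `url` is not one.
/// Hypothetical helper for illustration.
string hostOf(string url)
{
    // The group is named where it is defined, so the reader does not
    // have to count parentheses to learn what a number refers to.
    auto m = matchFirst(url, regex(`^http://(?P<host>[^/]+)(?P<rest>.*)$`));
    return m.empty ? null : m["host"];
}
```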

And let's not forget opCall ... one can use that quite effectively when trying to reduce syntactic sugar.

These are just suggestions. There's a whole raft of issues to be resolved if this stuff were to be built-in instead. I think we should explore the possibilities open to us with the current compiler and, again, this approach allows a set of choices over how the sub-patterns are managed, stored, and named. If we keep chipping away at it, it can only get better.
February 16, 2005
You just succeeded in raping and pillaging my example.  I said, in simple cases.... for one, "p7" would never happen in simple cases.

There's no way to tell what p1 is, you're right.  There's no regular expression.  Kris, if I gave you this code:

int grabTheURL()
{
   return globals_definedURL;
}

What does it do?  Got any idea?  Well, for one thing the code is missing documentation (and uses icky globals.)  What makes programmers think they can get away with not commenting their code will never make sense to me; I always comment my code, and hence I *ALWAYS* get comments about how readable, clean, and organized my code is.  And, yes, for your information I do use $1, i, j, k, etc. - and yet people still say this.  Funny.

Context and commenting are what REALLY matter, even in complicated examples.  We could keep trading examples and get nowhere.  The fact is, this:

if (url ~= /^http:\/\/(.+)$/)
   url = $1;

Is just as self explanatory as:

for (int i = 0; i < url.length; i++)

If it's not, the person at fault is not the programmer, but rather the reader.  I'm sorry if that sounds harsh.  However, it should of course be commented when at all necessary:

// Strip off the http:// because we know it's HTTP.
if (url ~= /^http:\/\/(.+)$/)
   url = $1;

If you really think this is MORE readable and MORE self explanatory:

// Strip off the http:// because we know it's HTTP.
Parameters parameters;
if (new Regexp(`^http://(.+)$`, "").test(url, parameters))
{
   with (parameters)
      url = p1;
}

Or, perhaps:

// Strip off the http:// because we know it's HTTP.
HTTPStripper parameters;
if (new Regexp(`^http://(.+)$`, "").test(url, parameters))
   url = parameters.urlWithoutHttp;

Then we must simply think differently.  For one thing, the second example looks more complicated and "scary".  What's this "HTTPStripper" thing?  Gotta go look it up... oh, it's just a struct with fullMatch and urlWithoutHttp in it.  What's fullMatch?  Doesn't appear to be used... err... weird... I don't understand it.  That's what people reading my code will say to the last two examples.

Then again, I'm more of an open source guy.  I want my code to be fast, efficient, clean, understandable, structured, and most importantly commented.  I don't think adding identifiers or using longer names where completely unnecessary (e.g. when the variable is only used the once and a comment can make it 100% clear) achieves any of the above goals.

Yes, it's just syntactical sugar, perhaps, but arguing against it because it's less readable... well, that's like saying driving a car (whatever your age) should be illegal because doing so can kill people.  Okay, that's true, but should it be illegal?

-[Unknown]


> with (patterns)
>      {
>      p1 ~= p7;
>      ...
>      }
> 
> should be just as obvious. Yes?
> 
> And let's not forget opCall ... one can use that quite effectively when trying to reduce syntactic sugar.
February 16, 2005
Ben Hinkle schrieb:
> "Norbert Nemec" <Norbert@Nemec-online.de> wrote in message news:cupkc2$30jf$1@digitaldaemon.com...
> 
>>Matthew schrieb:
>>
>>>All we need now is to use that reserved $ for built-in regex, and D will be on its way to top place. :-)
>>
>>I've been thinking along that line as well. There is a clear advantage of compile-time regexps. Just consider the fact that in Python, you have to call 'compile' on regexps before you can use them. This already makes clear that the translation from a regexp in string form to a regexp in executable representation basically is a compilation step which costs run-time performance.
> 
> 
> I could see another string prefix for patterns that looks like the r"..." for raw strings. Something like p"..." would compile the pattern at compile-time instead of run-time. I bet, though, that the time spent in compiling a pattern at run-time is small compared to the time spent "running" the pattern. The downside as other people mentioned is that the format of the compiled pattern would have to match whatever the library expects.

Depends on what you call 'compiling'. Your words remind me more of 'parsing' - and returning the pattern in some pre-digested format that can then be used by the matching library.

What I was thinking of would be to produce actual executable code from the string-representation. And here, the compiler could do all kinds of optimizations on a given pattern.
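
For readers who want to see what that could look like, here is a small sketch using `ctRegex` from present-day Phobos `std.regex`, which generates matching code from the pattern during compilation. That library postdates this thread, and `stripHttp` is a made-up name.

```d
import std.regex : ctRegex, matchFirst;

/// Strips a leading "http://" using a pattern processed at build time.
/// Hypothetical helper for illustration.
string stripHttp(string url)
{
    // The pattern is parsed and turned into D code while compiling, so a
    // malformed pattern should fail the build instead of failing at run time.
    auto re = ctRegex!(`^http://(.+)$`);
    auto m = matchFirst(url, re);
    return m.empty ? url : m[1];
}
```

The compiled pattern is used exactly like a run-time one, so switching between the two is a one-line change.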

>>An alternative approach to the issue would be, though, to enhance the template-meta language in such a way that regexps can be translated at compile-time.
> 
> 
> That sounds interesting and fun to try. Do you think it would address the syntax issues, though?

Not without the language allowing some more flexibility. So far, Walter runs a rather strict line on limiting the use of operators. There is some good justification for that, but it also means that for every additional operator used by the library, you'll need Walter's blessing first. In that situation, it will never be easy to do fundamental extensions purely in the library.
February 16, 2005
Norbert Nemec wrote:
> Ben Hinkle schrieb:
>> "Norbert Nemec" <Norbert@Nemec-online.de>
>>> Matthew schrieb:
>>>
>>>> All we need now is to use that reserved $ for built-in regex, and D will be on its way to top place. :-)
>>>
>>> I've been thinking along that line as well. There is a clear advantage of compile-time regexps. Just consider the fact that in Python, you have to call 'compile' on regexps before you can use them. This already makes clear that the translation from a regexp in string form to a regexp in executable representation basically is a compilation step which costs run-time performance.
>>
>> I could see another string prefix for patterns that looks like the r"..." for raw strings. Something like p"..." would compile the pattern at compile-time instead of run-time. I bet, though, that the time spent in compiling a pattern at run-time is small compared to the time spent "running" the pattern. The downside as other people mentioned is that the format of the compiled pattern would have to match whatever the library expects.
> 
> Depends on what you call 'compiling'. Your words remind me more of 'parsing' - and returning the pattern in some pre-digested format that can then be used by the matching library.
> 
> What I was thinking of would be to produce actual executable code from the string-representation. And here, the compiler could do all kinds of optimizations on a given pattern.

What if we just consider an Unquoted Regular Expression as
just another piece of program code? It is, after all, only
program code, albeit written with a syntax of its own.

Then we might decide on a fixed signature for such. This would
make it easy for all kinds of libraries to interact with
regexps without hassles. What if the unquoted regex were
treated exactly the same as

bit justAnotherRegex(in char[] s,
                     inout int pos,
                     out char[][] subStrings)
{
  //implementation in D.
}

Then we could write

ok = /lkajsdlkjaksdf/ (aDataLine, i, theDollars);

with no problem.

And, if we code responsibly and so Mother would be proud,
we might make it a habit to write in the following style:

enum {firstNam, lastNam}
if ( /lkjslkjlskjdlksjdf/ (currentLine, where, S) )
{
    victim = S[lastNam] ~ ", " ~ S[firstNam];
}

Also, if a bare regular expression is considered an
anonymous function, then it could be passed around
with Delegates, Function Pointers, and Dynamic Closures.
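
The fixed-signature idea can be approximated in library code today. This is only a sketch under assumptions: it uses present-day Phobos `std.regex`, spells `bit` as `bool` and `inout` as `ref`, and assumes the substring array holds only the parenthesised groups so that the enum from the post indexes it directly. `Matcher` and `makeMatcher` are hypothetical names.

```d
import std.regex : matchFirst, regex;

/// The signature proposed above, in present-day D spelling.
alias Matcher = bool delegate(string s, ref size_t pos, out string[] subStrings);

/// Wraps a pattern in a closure with that signature.
Matcher makeMatcher(string pattern)
{
    auto re = regex(pattern);
    return (string s, ref size_t pos, out string[] subStrings) {
        auto m = matchFirst(s[pos .. $], re);
        if (m.empty)
            return false;
        // Assumption: store only groups 1..n, not the whole match.
        foreach (i; 1 .. m.length)
            subStrings ~= m[i];
        pos += m.pre.length + m.hit.length; // advance past the match
        return true;
    };
}

enum { firstNam, lastNam } // indices into subStrings, named as in the post
```

Because the result is an ordinary delegate, it can be stored and passed around like any other callable.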
February 16, 2005
In article <cuo83m$1mjn$1@digitaldaemon.com>, Walter says...
>
>http://www.tiobe.com/tpci.htm
>
>We've risen to number 29!
>

Doesn't anyone else find it strange that D is higher than Objective-C?

Ant


February 16, 2005
"Georg Wrede" <georg.wrede@nospam.org> wrote in message news:42125F89.2070109@nospam.org...

>>> if ($3<$8 && $9==$1 || $3>$2 && $5=<($2 ~ foo)
>>>     mystring ~= $9 ~ "xx" ~ $3;
>>> else {
>>>     if file2line =~ /unquoted-regex-pattern-here/
>>>     mystring ~= $2 ~ $3 ~ " got out of hand!";
>>> }

>> Actually, it doesn't look all that bad.

> D can brag about being very easy to understand. Just browsing
> someone's code gives you immediately an idea of what's
> going on. But the above code takes a pencil and paper,
> and peace and quiet, before one can figure it out.

I'm not pushing for regex in D, but...

Sometimes a simplified line of code may be worth a paper and pencil to "decipher".  Think of math.  One integral equation can take a book to explain in detail, but it is nice to have a perfect and complete description in one line.  In other words, an obtuse-looking regular expression may parse an entire, complicated line of input.  The alternative is to write 50 lines of code with trims and splices and ifs and substrs.  It might look easier at first glance (less like line-noise), but it will take just as much time to work through the intricacies of those 50 lines of trim(toupper(substr( blah, 3, 9))).  Once you get more comfortable with regexes, you will see lots of useful idioms and you can actually read complex parsing code faster.



February 16, 2005
In article <cuuvfm$2j2a$1@digitaldaemon.com>, Unknown W. Brackets says...
>
>You just succeeded in raping and pillaging my example.  I said, in simple cases.... for one, "p7" would never happen in simple cases.

Sorry ~ wasn't intentional.


>Then again, I'm more of an open source guy.  I want my code to be fast, efficient, clean, understandable, structured, and most importantly commented.  I don't think adding identifiers or using longer names where completely unnecessary (e.g. when the variable is only used the once and a comment can make it 100% clear) achieves any of the above goals.

There's perhaps a vague implication in there that I, personally, take a different stand. Which, of course, is simply not true.


>Yes, it's just syntactical sugar, perhaps, but arguing against it because it's less readable... well, that's like saying driving a car (whatever your age) should be illegal because doing so can kill people.
>  Okay, that's true, but should it be illegal?


I think this aspect of the thread is rapidly veering off the road. I'm certainly not against a built-in Regex based solely upon naming conventions. That's simply one observation on a long-standing tradition (regex itself is a bit cryptic, so why shouldn't the identifiers be, too?).

No; the real issue is with respect to global variables, thread-locals, encapsulation, and whatever else one wishes to throw into that basket. The fact is that match() is a function emitting a (potential) set of sub-matches, an indication of success, and an index into the matched content. In other words, it is a multi-return function.
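
To make the multi-return point concrete, here is a minimal sketch that bundles the three results into one value the caller owns. It assumes present-day Phobos `std.regex` underneath; `MatchResult` and `tryMatch` are hypothetical names, not an existing API.

```d
import std.regex : matchFirst, regex;

/// Everything a match produces, bundled in one caller-owned value.
struct MatchResult
{
    bool ok;         // did the pattern match?
    size_t index;    // offset of the match within the input
    string[] groups; // groups[0] is the whole match, then the sub-matches
}

/// Hypothetical wrapper returning all results of a match at once.
MatchResult tryMatch(string input, string pattern)
{
    MatchResult r;
    auto m = matchFirst(input, regex(pattern));
    if (m.empty)
        return r; // ok stays false, groups stays empty
    r.ok = true;
    r.index = m.pre.length;
    foreach (i; 0 .. m.length)
        r.groups ~= m[i];
    return r;
}
```

Nothing here is global or thread-local, so two matches in flight cannot overwrite each other's results.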

This thread was started with a suggestion to give that particular function very, very, special standing within D by making its multi-return values inherent within the language spec. I think that's overlooking what D can support right now for such cases, and it would introduce a raft of other issues related to that lack of encapsulation noted previously.

I'm really not even vaguely interested in debating syntactical or stylistic preferences with anyone. What I will bitch and whine about is a notion to introduce special syntax into the language (along with a bunch of potential scoping, liveness, and concurrency problems) when the functionality can be supported perfectly well with the language as it is, without creating further special-case scenarios.

Further, given the 'positive' argument for a built-in syntax, there's been a conspicuous absence of resolution for the lack of encapsulation and so on.

Sure, one can argue all day long about how some built-in syntax for such a notion can be made to look more attractive (or more cryptic) than leveraging the existing capabilities. But why bother? If the primary purpose of D were to be a regex language, then perhaps there'd be a point to all this. Yet that is simply not the case.

Thus, I feel we should look for a way to make RegExp 'acceptable' to all concerned, whilst taking advantage of what we already have with the current language constructs. If you don't think that's even possible or desirable, then we should abate right now :-)

What are your thoughts, Matthew?

- Kris