compile-time regex redux (page 3) - D Programming Language Discussion Forum

February 07, 2007

Re: compile-time regex redux

Posted by Andrei Alexandrescu (See Website For Email)
in reply to janderson

janderson wrote:
> Walter Bright wrote:
>> String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:
>>
>> At some point, this will prove a barrier to large scale use of this feature.
> 
> While I'm a fan of regex I'm not sure it meets the goal of scale.  I can imagine that regex expressions + templates will get unreadable (or at least very slow to read) as programs get larger.

Symbols and modularity will definitely go a long way to help this. A one-shot long regexp is pretty intimidating, but one containing appropriate symbols all of a sudden starts looking like... a clean sequence of tokens.

Andrei
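
A minimal sketch of the "symbols" idea, for readers following along today. It is not from the original post: the sub-pattern names (`ident`, `integer`, `ws`) and the toy assignment grammar are assumptions made for illustration, and it uses the run-time `std.regex` module from present-day Phobos rather than the compile-time `~~` operator discussed in this thread.

```d
import std.conv : to;
import std.regex : matchFirst, regex;

// Hypothetical named sub-patterns (illustrative, not taken from the thread).
enum ident   = `[A-Za-z_][A-Za-z0-9_]*`;
enum integer = `[0-9]+`;
enum ws      = `[ \t]*`;

// Composed pattern; it reads as a token sequence: ident '=' integer ';'
enum assignment =
    `^` ~ ws ~ `(` ~ ident ~ `)` ~ ws ~ `=` ~ ws ~ `(` ~ integer ~ `)` ~ ws ~ `;`;

void main()
{
    auto m = matchFirst("  answer = 42 ;", regex(assignment));
    assert(!m.empty);
    assert(m[1] == "answer");   // first capturing group: the identifier
    assert(m[2].to!int == 42);  // second capturing group: the integer
}
```

The concatenation happens at compile time because every operand is a manifest constant, so the one long pattern never has to be written out by hand.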

February 07, 2007

Re: compile-time regex redux

Posted by Kyle Furlong
in reply to Walter Bright

Walter Bright wrote:
> Bill Baxter wrote:
>> That would help I suppose, but at the same time regexps themselves have a tendency to end up being 'write-only' code.  The heavy use of them in Perl is I think a large part of what gives it a rep as a write-only language.   Heh heh.  I just found this regexp for matching RFC 822 email addresses:
>>     http://www.regular-expressions.info/email.html
>> (the one at the bottom of the page)
> 
> I agree that non-trivial regexes can be pretty intimidating - but writing templates to do the same will be even more intimidating.

I disagree that any D code could be more intimidating than a 6k+ character regex string. <g>

February 07, 2007

Re: compile-time regex redux

Posted by Andrei Alexandrescu (See Website For Email)
in reply to Bill Baxter

Bill Baxter wrote:
> Walter Bright wrote:
>> String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:
>>
>> 1) indexed access
>> 2) slicing
>> 3) comparison
>> 4) getting the length
>> 5) concatenation
>>
>> Any other functionality can be built up from these using template metaprogramming.
>>
>> The problem is that parsing strings using templates generates a large number of template instantiations, is (relatively) very slow, and consumes a lot of memory (at compile time, not runtime). For example, ParseInteger would need 4 template instantiations to parse 5678, and each template instantiation would also include the rest of the input as part of the template instantiation's mangled name.
>>
>> At some point, this will prove a barrier to large scale use of this feature.
>>
>> Andrei suggested using compile time regular expressions to shoulder much of the burden, reducing parsing of any particular token to one instantiation.
> 
> That would help I suppose, but at the same time regexps themselves have a tendency to end up being 'write-only' code.  The heavy use of them in Perl is I think a large part of what gives it a rep as a write-only language.   Heh heh.  I just found this regexp for matching RFC 822 email addresses:
>     http://www.regular-expressions.info/email.html
> (the one at the bottom of the page)

I think this must be qualified and understood in context. First, Perl's reputation for write-only code has much to do with its implicit variables and generous syntax. The Perl regexps are a standard that all other regexp packages emulate and compare against.

Showcasing the raw RFC 822 email parsing regexp is not very telling. Notice there's a lot of repetition. With symbols, the grammar is very easy to implement with readable regular expressions - and this is how anyone in their right mind would do it.


Andrei

February 07, 2007

Re: compile-time regex redux

Posted by Walter Bright
in reply to kris

kris wrote:
> compile-time regex is only part of the picture. A small one too. I rather expect we'd wind up finding the manner it was exposed was just too limiting in one way or another. Exposing, as was apparently suggested, the full API of RegExp inside the compiler sounds a tad distasteful.

I tend to agree with that.

> You'll perhaps forgive me if I question whether this is driven primarily from an academic interest?  What I mean is this: if and when D goes mainstream, perhaps just one in ten-thousand developers will actually use this kind of feature more than 5 times (and still find themselves limited). Perhaps I'm being generous with those numbers also?
> 
> What is wrong with runtime execution anyway? It sure is easier to write and maintain clean D code than (for many ppl) complex concepts that are, what amount to, nothing more than runtime optimizations. Isn't that true?
> 
> It would seem that adding such features does not address the type of things that would be useful to 80% of developers? Surely that should be far more important?

Good question. The simple answer is: look at what Ruby on Rails did for Ruby. Ruby's a good language, but the killer app for it was RoR. RoR is what drove adoption of Ruby through the roof. Enabling ways for sophisticated DSLs to interoperate with D will enable such applications. Probably the only killer C++ Boost library is the Spirit library, which I have looked over with envious eyes. The way it works (expression templates) is incredibly difficult for library writers to create, and even the result is pretty quirky. But there's no denying how useful people find it.

So I feel that by enabling easy DSL writing, we open the door to a much wider range of libraries to be written for D, and making libraries easy to write is (I think we all agree) key to success.

It isn't at all about runtime optimization. It's about, for example, the ability to create a specialized matrix manipulation language, complete with user defined operators, etc., and have it 'compile' to regular D code.
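
To make "have it 'compile' to regular D code" concrete, here is a deliberately tiny sketch, far smaller than the matrix language described above. The mini-DSL (`name:type` pairs) and the helper `genFields` are hypothetical, invented for this illustration. It relies on a string mixin plus compile-time evaluation of an ordinary function as found in present-day D, not on any facility that existed when this was posted.

```d
// Hypothetical mini-DSL: "name:type, name:type" describes struct fields.
// Assumes well-formed input; malformed specs produce code that fails to compile.
string genFields(string spec)
{
    string code;
    size_t i = 0;
    while (i < spec.length)
    {
        // Skip separators between entries.
        while (i < spec.length && (spec[i] == ' ' || spec[i] == ',')) ++i;
        if (i >= spec.length) break;

        // Field name: everything up to the colon.
        size_t start = i;
        while (i < spec.length && spec[i] != ':') ++i;
        string name = spec[start .. i];
        if (i < spec.length) ++i; // step over ':'

        // Field type: everything up to the next separator.
        start = i;
        while (i < spec.length && spec[i] != ',' && spec[i] != ' ') ++i;
        string type = spec[start .. i];

        code ~= type ~ " " ~ name ~ ";\n";
    }
    return code;
}

struct Point
{
    // The DSL text is translated to D declarations during compilation.
    mixin(genFields("x:int, y:double"));
}

void main()
{
    Point p;
    p.x = 3;
    p.y = 1.5;
    assert(p.x == 3 && p.y == 1.5);
}
```

Nothing here is about run-time speed: the point is that the user writes the short specification once and the compiler sees ordinary declarations.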

> And, no ... I'm not just pooh poohing the idea ... I'm really serious about D getting some realistic market traction, and I don't see how adding more compile-time 'specialities' can help in any way other than generating a little bit of 'novelty' interest. Isn't this a good example of "premature optimization" ?

No, I don't think it is at all about optimization.

> Surely some of the other long-term concerns, such as solid debugging support, simmering code/dataseg bloat, lib support for templates, etc, etc, should deserve full attention instead? Surely that is a more successful approach to getting D adopted in the marketplace?

Those are all extremely important, too.

February 08, 2007

Re: compile-time regex redux

Posted by Hasan Aljudy
in reply to Walter Bright

Walter Bright wrote:
> String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:
> 
> 1) indexed access
> 2) slicing
> 3) comparison
> 4) getting the length
> 5) concatenation
> 
> Any other functionality can be built up from these using template metaprogramming.
> 
> The problem is that parsing strings using templates generates a large number of template instantiations, is (relatively) very slow, and consumes a lot of memory (at compile time, not runtime). For example, ParseInteger would need 4 template instantiations to parse 5678, and each template instantiation would also include the rest of the input as part of the template instantiation's mangled name.
> 
> At some point, this will prove a barrier to large scale use of this feature.
> 
> Andrei suggested using compile time regular expressions to shoulder much of the burden, reducing parsing of any particular token to one instantiation.
> 
> The last time I introduced core regular expressions into D, it was soundly rejected by the community and was withdrawn, and for good reasons.
> 
> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
> 
>     ("aa|b" ~~ "ababb") would evaluate to "ab"
> 
> I expect one would generally only see this kind of thing inside templates, not user code.

How about a much simpler and more general approach: allow compile-time evaluation of functions.

A function would carry a specific attribute such that, if you pass it constant arguments, it gets evaluated/unfolded at compile time.

This attribute could be called "meta", for example (since it will mainly be used for metaprogramming):

/// Definition
meta int add( int x, int y )
{
    return x+y;
}

/// Usage:
int y = add( 10, 12 );

/// and that would get unfolded/interpreted at compile time to:

int y = 22;

Basically, you execute the "meta" function at compile time: you feed it the input, take its output, and replace the function call with that compile-time output.
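
For comparison, a sketch in present-day D, which is not the `meta` attribute proposed above: ordinary functions are evaluated at compile time whenever their result is needed in a constant context such as an `enum` initializer or a `static assert`. The `parseInteger` function and the `ParseInteger` template are illustrative reconstructions of the kind of code Walter describes, not his actual implementation.

```d
// An ordinary function: no special attribute is needed.
int add(int x, int y)
{
    return x + y;
}

// Parses the leading decimal digits of s with a plain loop.
int parseInteger(string s)
{
    int value = 0;
    foreach (c; s)
    {
        if (c < '0' || c > '9') break;
        value = value * 10 + (c - '0');
    }
    return value;
}

// Template-recursive form for contrast: it instantiates itself once per
// character consumed, carrying the rest of the input as a template argument.
template ParseInteger(string s, int acc = 0)
{
    static if (s.length > 0 && s[0] >= '0' && s[0] <= '9')
        enum ParseInteger = ParseInteger!(s[1 .. $], acc * 10 + (s[0] - '0'));
    else
        enum ParseInteger = acc;
}

// A constant context forces evaluation during compilation.
enum y = add(10, 12);
static assert(y == 22);
static assert(parseInteger("5678") == 5678);
static assert(ParseInteger!"5678" == 5678);

void main()
{
    // The same functions remain callable at run time.
    assert(add(1, 2) == 3);
    assert(parseInteger("42abc") == 42);
}
```

The function version does its work in a single call, whereas the template version creates a chain of instantiations, which is the cost discussed at the top of the thread.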

February 08, 2007

Re: compile-time regex redux

Posted by Andrei Alexandrescu (See Website For Email)
in reply to kris

kris wrote:
> Walter Bright wrote:
>> String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:
>>
>> 1) indexed access
>> 2) slicing
>> 3) comparison
>> 4) getting the length
>> 5) concatenation
>>
>> Any other functionality can be built up from these using template metaprogramming.
>>
>> The problem is that parsing strings using templates generates a large number of template instantiations, is (relatively) very slow, and consumes a lot of memory (at compile time, not runtime). For example, ParseInteger would need 4 template instantiations to parse 5678, and each template instantiation would also include the rest of the input as part of the template instantiation's mangled name.
>>
>> At some point, this will prove a barrier to large scale use of this feature.
>>
>> Andrei suggested using compile time regular expressions to shoulder much of the burden, reducing parsing of any particular token to one instantiation.
>>
>> The last time I introduced core regular expressions into D, it was soundly rejected by the community and was withdrawn, and for good reasons.
>>
>> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
>>
>>     ("aa|b" ~~ "ababb") would evaluate to "ab"
>>
>> I expect one would generally only see this kind of thing inside templates, not user code.
> 
> compile-time regex is only part of the picture. A small one too. I rather expect we'd wind up finding the manner it was exposed was just too limiting in one way or another. Exposing, as was apparently suggested, the full API of RegExp inside the compiler sounds a tad distasteful.

Au contraire, I think it's a definite step in the right direction. Writing programs that write programs is a great way of doing more with less effort. Various languages can do that to various extents, and it's very heartening that D is taking steps in that direction. Allowing the programmer to manipulate strings during compilation is definitely a good step.

> You'll perhaps forgive me if I question whether this is driven primarily from an academic interest?  What I mean is this: if and when D goes mainstream, perhaps just one in ten-thousand developers will actually use this kind of feature more than 5 times (and still find themselves limited). Perhaps I'm being generous with those numbers also?

Perhaps, just like me, you simply aren't in the position to evaluate them. I will note, however, a few historical trends. C++ got a shot in the arm from the STL. STL = advanced programming. Interesting. The STL did much to educate the C++ community towards code generation, which continues to be the reason why many influential gurus hang out with C++.

Java tried to radically simplify things. It did get many complicated things right (safety, security), particularly those that were in the requirements early on. As for the features that Java initially stayed away from, a pattern I noticed in the Java circles is that pundits condemn, ridicule, or demean a feature or technique until Java implements it. Of course, implementing it while the language already has immovable parts is less clean. The net result is that now Java does have many of the advanced features that once were deemed uninteresting, and a history-based prediction is that it will continue to move in that direction.

C# also started simple, only to add even more advanced and (it would appear) more exotic features than Java. Again, it's natural to predict that the language will move towards recognizing and integrating advanced features.

To survive, D must compensate for its relative lack of clout and publicity by offering above and beyond what more mainstream languages offer.

> What is wrong with runtime execution anyway? It sure is easier to write and maintain clean D code than (for many ppl) complex concepts that are, what amount to, nothing more than runtime optimizations. Isn't that true?

No. Accommodating DSLs and generating code has more to do with correctness and avoiding duplication of source code, than anything else.

> It would seem that adding such features does not address the type of things that would be useful to 80% of developers? Surely that should be far more important?

No. You are missing a key point: some code is more influential than other code. 2% of programmers may write libraries that work for 90% of programmers.

> And, no ... I'm not just pooh poohing the idea ... I'm really serious about D getting some realistic market traction, and I don't see how adding more compile-time 'specialities' can help in any way other than generating a little bit of 'novelty' interest. Isn't this a good example of "premature optimization" ?

No. As I said above, optimization has exceedingly little to do with it.

Consider as an example the "white hole" and "black hole" pattern. Given an interface:

interface A
{
  int foo();
  void bar(int);
  float baz(char[]);
}

a "white hole" class is an implementation of A that implements all methods to throw, and a "black hole" class is an implementation of A that implements all methods to return the default value of the return type.

This pattern is very useful for either quick starting points for writing true classes implementing A, or as standalone degenerate implementations.

To some programmers, black and white holes might not even raise a "duplicated code" flag. They sit down and write:

class WhiteHoleA : A
{
  // Every method throws.
  int foo()
  {
    throw new Exception("foo not implemented");
  }
  void bar(int)
  {
    throw new Exception("bar(int) not implemented");
  }
  float baz(char[])
  {
    throw new Exception("baz(char[]) not implemented");
  }
}

and

class BlackHoleA : A
{
  // Every method returns the default value of its return type.
  int foo()
  {
    return int.init;
  }
  void bar(int)
  {
  }
  float baz(char[])
  {
    return float.init;
  }
}

But if the language is advanced enough, it readily offers such rapid development goodies as library elements:

alias black_hole!(A) BlackHoleA;
alias white_hole!(A) WhiteHoleA;

This has nothing to do with optimization. It is all about abstraction, saving duplication, and allowing expressive code.
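
As a present-day illustration: Phobos now provides `std.typecons.BlackHole` and `std.typecons.WhiteHole`, which postdate this post. The sketch below assumes a reasonably recent Phobos. One difference from the hand-written classes above: the library's `WhiteHole` throws an `Error` subclass rather than an `Exception`, so the check catches `Error`.

```d
import std.exception : assertThrown;
import std.math : isNaN;
import std.typecons : BlackHole, WhiteHole;

interface A
{
    int foo();
    void bar(int);
    float baz(char[]);
}

// Generated implementations; no method bodies are written by hand.
alias BlackHoleA = BlackHole!A;
alias WhiteHoleA = WhiteHole!A;

void main()
{
    auto black = new BlackHoleA;
    assert(black.foo() == int.init);
    black.bar(1);                   // does nothing
    assert(black.baz(null).isNaN);  // float.init is NaN

    auto white = new WhiteHoleA;
    assertThrown!Error(white.foo()); // every method throws
}
```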

> Surely some of the other long-term concerns, such as solid debugging support, simmering code/dataseg bloat, lib support for templates, etc, etc, should deserve full attention instead? Surely that is a more successful approach to getting D adopted in the marketplace?
>
> Lots of questions, and I hope you can give them serious consideration, Walter.

I think it's good to be sure only when there's a solid basis.

Andrei

February 08, 2007

Re: compile-time regex redux

Posted by Andrei Alexandrescu (See Website For Email)
in reply to Walter Bright

Walter Bright wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
>> Walter Bright wrote:
>>> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
>>>
>>>     ("aa|b" ~~ "ababb") would evaluate to "ab"
>>>
>>> I expect one would generally only see this kind of thing inside templates, not user code.
>>
>> The more traditional way is to mention the string first and pattern second, so:
>>
>> ("ababb" ~~ "aa|b") // match this guy against this pattern
>>
>> And I think it returns "b" - juxtaposition has a higher priority than "|", so your pattern is "either two a's or one b". :o)
> 
> My bad. Some more things to think about:
> 
> 1) Returning the left match, the match, the right match?

Perl does allow that (has IIRC $` and $' to mark the left and right surrounding substrings), but the recommended style is to use capturing parens if you need the left and right portion; this makes all matching code more efficient.

So if you want to match the left- and right-substrings you say:

("ababb" ~~ "(.*)(aa|b)(.*)")

and you get in return three juicy strings: left, match, and right.
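
The matching behaviour discussed here can be tried with the run-time `std.regex` module in present-day Phobos. This is only an approximation of the proposal: the `~~` operator never shipped in this form, and `matchFirst` is the library's function, not the syntax from this thread.

```d
import std.regex : matchFirst, regex;

void main()
{
    // "aa|b" means "two a's, or one b": the first match in "ababb"
    // is the single 'b' at index 1.
    auto m = matchFirst("ababb", regex("aa|b"));
    assert(m.pre == "a");    // left of the match
    assert(m.hit == "b");    // the match itself
    assert(m.post == "abb"); // right of the match

    // Capturing groups give left, match, and right in one result.
    // The leading (.*) is greedy, so it selects the LAST 'b'.
    auto g = matchFirst("ababb", regex("(.*)(aa|b)(.*)"));
    assert(g[1] == "abab" && g[2] == "b" && g[3] == "");

    // A lazy (.*?) selects the first 'b' instead.
    auto l = matchFirst("ababb", regex("(.*?)(aa|b)(.*)"));
    assert(l[1] == "a" && l[2] == "b" && l[3] == "abb");
}
```

Note the greedy-versus-lazy difference: with capturing parens the "left" portion depends on how the leading group is quantified.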

> 2) Returning values of parenthesized expressions?

Probably it's easiest to always return const char[][]. If you don't have capturing parens, you could return const char[].

> 3) Some sort of sed-like replacement syntax?

Definitely; otherwise it's a pain to express it, particularly because you can't mutate things during compilation.

("ababb" ~~ s/"(.*)(aa|b)(.*)"/"$1 here was an aa|b $2"/i)

(This doesn't make 's' a keyword; it's just used as punctuation.) Probably a more D-like syntax could be devised, but that could be also seen as gratuitous incompatibility with sed, perl etc.

The last "/" is useful because flags could follow it, as is the case here (i = ignore case).
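
A rough run-time equivalent of the replacement above, sketched with `std.regex.replaceFirst` from present-day Phobos. The `s/.../.../i` syntax itself is only the proposal from this post; here the flag is passed as the second argument of `regex`.

```d
import std.regex : regex, replaceFirst;

void main()
{
    // "i" requests case-insensitive matching.
    auto re = regex("(.*)(aa|b)(.*)", "i");

    // $1 and $2 refer to the first and second capturing groups.
    auto result = replaceFirst("ababb", re, "$1 here was an aa|b $2");

    // The pattern consumes the whole input; the greedy first group takes "abab".
    assert(result == "abab here was an aa|b b");
}
```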

> An alternative is to have the compiler recognize std.Regexp names as being built-in.

Blech. :o)

Andrei

February 08, 2007

Re: compile-time regex redux

Posted by Andrei Alexandrescu (See Website For Email)
in reply to Robby

Robby wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
>> Walter Bright wrote:
> [snipped]
>>> The last time I introduced core regular expressions into D, it was soundly rejected by the community and was withdrawn, and for good reasons.
>>>
>>> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
>>>
>>>     ("aa|b" ~~ "ababb") would evaluate to "ab"
>>>
>>> I expect one would generally only see this kind of thing inside templates, not user code.
>>
>> The more traditional way is to mention the string first and pattern second, so:
>>
>> ("ababb" ~~ "aa|b") // match this guy against this pattern
>>
>> And I think it returns "b" - juxtaposition has a higher priority than "|", so your pattern is "either two a's or one b". :o)
>>
>> One program I highly recommend for playing with regexes is The Regex Coach: http://weitz.de/regex-coach/.
>>
>>
>> Andrei
> 
> I wasn't here during the first round of regexes so bear with me. Though I can assume that with D's growing visibility the past few months, I'm probably not the only one. Having used Ruby for years I'm quite fond of them! Some general questions:
> 
> What version of regex would be the target? There's a few variations out there.

Ionno.

> Probably a couple of green questions, what is the benefit of having a compile time regex implementation over a "really fast" implementation? Or having the expression compiled and the string allowed runtime?

It's not about speed. Compile-time regexes will be most useful for parsing and subsequently generating code. Of course, it's great to support the same regex syntax and power for both realms.

> Assuming ~~ will be the syntax used?
> 
> Sidebar chuckle :
> http://dev.perl.org/perl6/doc/design/syn/S05.html
> Um, wow. They've really turned 5's implementation on its head.. could be interesting to watch the reaction from that.. could be louder than the vb guys when vb.net was first released..

Interesting how a thorough cleanup is more important than keeping everybody happy. :o)


Andrei

February 08, 2007

Re: compile-time regex redux

Posted by Walter Bright
in reply to Walter Bright

Walter Bright wrote:
> kris wrote:
>> Surely some of the other long-term concerns, such as solid debugging support, simmering code/dataseg bloat, lib support for templates, etc, etc, should deserve full attention instead? Surely that is a more successful approach to getting D adopted in the marketplace?
> 
> Those are all extremely important, too.

I wish to add that if you look at the changelog, the bread and butter issues (see the list of bugs fixed) get a solid share of attention.

February 08, 2007

Re: compile-time regex redux

Posted by Bill Baxter
in reply to Andrei Alexandrescu (See Website For Email)

Andrei Alexandrescu (See Website For Email) wrote:
> Bill Baxter wrote:
>> Walter Bright wrote:
>>> String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:
>>>
>>> 1) indexed access
>>> 2) slicing
>>> 3) comparison
>>> 4) getting the length
>>> 5) concatenation
>>>
>>> Any other functionality can be built up from these using template metaprogramming.
>>>
>>> The problem is that parsing strings using templates generates a large number of template instantiations, is (relatively) very slow, and consumes a lot of memory (at compile time, not runtime). For example, ParseInteger would need 4 template instantiations to parse 5678, and each template instantiation would also include the rest of the input as part of the template instantiation's mangled name.
>>>
>>> At some point, this will prove a barrier to large scale use of this feature.
>>>
>>> Andrei suggested using compile time regular expressions to shoulder much of the burden, reducing parsing of any particular token to one instantiation.
>>
>> That would help I suppose, but at the same time regexps themselves have a tendency to end up being 'write-only' code.  The heavy use of them in Perl is I think a large part of what gives it a rep as a write-only language.   Heh heh.  I just found this regexp for matching RFC 822 email addresses:
>>     http://www.regular-expressions.info/email.html
>> (the one at the bottom of the page)
> 
> I think this must be qualified and understood in context. First, Perl's reputation for write-only code has much to do with its implicit variables and generous syntax. The Perl regexps are a standard that all other regexp packages emulate and compare against.

Agreed.  Implicit variables also make things tough to follow.  Regexps also contribute to Perl's reputation for looking like line-noise.  But I like Perl actually.  And regular expressions are ok too, but I feel like they're not optimal for writing maintainable code.  They tend to look like line noise.  They're difficult to comment effectively.  And they're certainly not suited for certain tasks, and if you try to use them for something they're not particularly good at, they get very messy.

Unfortunately, a lot of what they're not good at is exactly the kind of thing you *need* them to be good at for parsing/generating code.  Like parenthesis balancing, or nested comment parsing, or quoted string munching.

They can be a good tool, but if they're the only tool, or even the main tool, I think we're in trouble.
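
To illustrate the parenthesis-balancing case: a regular expression in the classical sense cannot track arbitrary nesting depth, while a plain function with a counter can. The sketch below uses a hypothetical helper `balanced`, and it leans on compile-time function evaluation as found in present-day D, which is an assumption relative to the compiler discussed in this thread.

```d
// Hypothetical helper: true if every ')' closes an earlier '(' and
// nothing is left open at the end.
bool balanced(string s)
{
    int depth = 0;
    foreach (c; s)
    {
        if (c == '(')
            ++depth;
        else if (c == ')' && --depth < 0)
            return false; // closing paren with nothing open
    }
    return depth == 0;
}

// Usable during compilation, e.g. to validate DSL text before a mixin.
static assert(balanced("f(a, g(b))"));
static assert(!balanced("f(a))("));

void main()
{
    assert(balanced("((()))"));
    assert(!balanced("(()"));
}
```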

> Showcasing the raw RFC 822 email parsing regexp is not very telling. Notice there's a lot of repetition. With symbols, the grammar is very easy to implement with readable regular expressions - and this is how anyone in their right mind would do it.

True, it's not a realistic example.  The page says as much, and includes several versions that are more realistic.  Here's the recommended one:
  \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
Ok, that's not so bad, but throw in a few sets of capturing parentheses here and there, and it starts to look pretty messy.

My opinion about regexps is that they're too dense and full of abbreviations.  And the typical methods for creating them don't encourage encapsulation and abstraction, which are the foundations of software.  For instance, every time you look at the above you have to re-interpret what [A-Z0-9._%-] really means.  When I'm writing regular expressions I always have to have that chart next to me to remember all those \s \b \w \S \W \ codes, and then again when trying to figure out what the code does later.  There has to be a better way.  Apparently the Perl guys think so too, because they're redoing regular expressions completely for Perl 6.

--bb

