February 17, 2009
Jarrett Billingsley:
>I'm talking 60-70 seconds to compile a more complex regex.<

A modern CPU is able to do something like 60*2*2E9 operations in that time, DMD needs 6 seconds or less to compile about 60000-80000 lines of my D code, so I think it's a bit too much time (probably 100 or 1000 times too much).

Bye,
bearophile
February 17, 2009
Reply to Jarrett,

> On Tue, Feb 17, 2009 at 2:47 PM, Daniel de Kok <me@danieldk.org>
> wrote:
> 
>> Actually, I was wondering why nobody is considering real regular
>> languages anymore, that can be compiled to a normal finite state
>> recognizer or transducer. While this may not be as fancy as Perl-like
>> extensions, they are much faster, and it's easier to do fun stuff
>> such as composition.
>> 
> Tango's regex engine is just that.  It uses a tagged NFA method.
> http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.ht
> ml
> 
> The problem with this method is that while it's certainly faster to
> match, it's MUCH slower to compile.  There are no pathological
> matches; only pathological compiles ;)  I'm talking 60-70 seconds to
> compile a more complex regex.

could this be transitioned to CTFE? you could even have a debug mode that delays till runtime

RegEx mather = new CTFERegEx!("some regex");


class CTFERegEx(char[] regex) : RegEx
{
      debug(NoCTFE)  static char[] done;
      else     static const char[] done = CTFECompile(regex);

      public this()
      {
         debug(NoCTFE) if(done == null) done = CTFECompile(regex);

         base(done)
      }
}


February 17, 2009
On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao@pathlink.com> wrote:
>
> could this be transitioned to CTFE? you could even have a debug mode that delays till runtime
>
> RegEx mather = new CTFERegEx!("some regex");
>
>
> class CTFERegEx(char[] regex) : RegEx
> {
>      debug(NoCTFE)  static char[] done;
>      else     static const char[] done = CTFECompile(regex);
>
>      public this()
>      {
>         debug(NoCTFE) if(done == null) done = CTFECompile(regex);
>
>         base(done)
>      }
> }

For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled.  So you _could_ precompile the regex into D code and use that.

But seriously, man - if something takes 60 seconds to complete at _runtime_, making it CTFE would simply make your computer explode.
February 17, 2009
On Tue, Feb 17, 2009 at 8:57 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote:
> The problem with this method is that while it's certainly faster to match, it's MUCH slower to compile.  There are no pathological matches; only pathological compiles ;)  I'm talking 60-70 seconds to compile a more complex regex. This might be an acceptable tradeoff for when you need to compile a regex in a long-running app like a server, but it's completely unacceptable for most small, Perl-like text munging programs.

Hmmm, define "complex", I suppose it's ok for the general line-splitting/matching stuff? I got into trouble (time-wise) when we compiled a part of speech tagger into a transducer. In those cases we generally pre-compile stuff, and output it as a large struct in the target language. Of course, it would be fun if we can do it at compile-time ;).

Besides that, if we'd have a good general recognizer/transducer implementation it could also be used for compact dictionary storage, perfect hashing automata, etc.

Take care,
Daniel
February 17, 2009
On Tue, Feb 17, 2009 at 9:26 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote:
> For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled.  So you _could_ precompile the regex into D code and use that.

I have only been tinkering with Phobos, but that's good to hear, thanks!
February 17, 2009
On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org> wrote:
>
> Hmmm, define "complex"

\w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)*

This is a simple email regexp.  This takes about 4 or 5 seconds to
compile on my lappy (Pentium M).

It only goes up from there.
February 17, 2009
Reply to Jarrett,

> On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao@pathlink.com> wrote:
> 
>> could this be transitioned to CTFE? you could even have a debug mode
>> that delays till runtime
>> 
>> RegEx mather = new CTFERegEx!("some regex");
>> 
>> class CTFERegEx(char[] regex) : RegEx
>> {
>> debug(NoCTFE)  static char[] done;
>> else     static const char[] done = CTFECompile(regex);
>> public this()
>> {
>> debug(NoCTFE) if(done == null) done = CTFECompile(regex);
>> base(done)
>> }
>> }
> For what it's worth the Tango regexes actually have a method to output
> a D function that will implement the regex after it's compiled.  So
> you _could_ precompile the regex into D code and use that.
> 
> But seriously, man - if something takes 60 seconds to complete at
> _runtime_, making it CTFE would simply make your computer explode.
> 

For any kind of debug, yeah, that's a problem. OTOH for release, as long as it /does/ compile, who cares? How many real release builds does anyone do a week?


February 17, 2009
Reply to Jarrett,

> On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org>
> wrote:
> 
>> Hmmm, define "complex"
>> 
> \w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)*
> 
> This is a simple email regexp.  This takes about 4 or 5 seconds to
> compile on my lappy (Pentium M).
> 
> It only goes up from there.
> 

I wonder how well it would work on this:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

:b


February 17, 2009
On Tue, Feb 17, 2009 at 9:50 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote:
> On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org> wrote:
>>
>> Hmmm, define "complex"
>
> \w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)*
>
> This is a simple email regexp.  This takes about 4 or 5 seconds to
> compile on my lappy (Pentium M).

Hmm, odd. I have translated that regexp to the syntax of the tool that we used, that is written in Prolog (it is generally a constant factor slower than C/C++/D equivalents). Generating a minimized DFA takes far less than a second. I used the following expression (abstracted a bit with macros):

---
macro(letter, {a..z, 'A'..'Z'}).
macro(punctlet,[{-,+,.},letter+]).
macro(dompunctlet,[{-,.},letter+]).
macro(email,[letter+,punctlet*,@,letter+,dompunctlet*,.,letter+,dompunctlet*]).
---

The software is available from: http://www.let.rug.nl/~vannoord/Fsa/fsa.html

Take care,
Daniel
February 17, 2009
BCS wrote:
> Reply to Andrei,
> 
>> I'm quite unhappy with the API of std.regexp. It's a chaotic design
>> that provides a hodgepodge of functionality and tries to use as many
>> synonyms of "to find" in the dictionary (e.g. search, match). I could
>> swear Walter never really cared for using regexps, and that is felt
>> throughout the design: it fills the bullet point but it's asinine to
>> use.
>>
>> Besides std.regexp only works with (narrow) strings and we want it to
>> work on streams of all widths and structures. One pet complaint I have
>> is that std.regexp puts a class around it all as if everybody's
>> favorite pastime would be to inherit Regexp and override some random
>> function in it.
>>
>> In the upcoming releases of D 2.0 there will be rather dramatic
>> breaking changes of phobos. I just wanted to ask whether y'all could
>> stomach yet another rewritten API or you'd rather use std.regexp as it
>> is for the time being.
>>
>> Andrei
>>
> 
> For what it's worth, I have a partial clone of the .NET API built on top of PCRE. I would have to ask my boss but I expect I could donate it if anyone want to use it as a basis.

That would be cool; I find the engine in std.regexp rather hard to understand.

Andrei