earthquake changes of std.regexp to come (page 2)

Jarrett Billingsley: >I'm talking 60-70 seconds to compile a more complex regex.< A modern CPU is able to do something like 60*2*2E9 operations in that time, DMD needs 6 seconds or less to compile about 60000-80000 lines of my D code, so I think it's a bit too much time (probably 100 or 1000 times too much). Bye, bearophile

Reply to Jarrett, > On Tue, Feb 17, 2009 at 2:47 PM, Daniel de Kok <me@danieldk.org> > wrote: > >> Actually, I was wondering why nobody is considering real regular >> languages anymore, that can be compiled to a normal finite state >> recognizer or transducer. While this may not be as fancy as Perl-like >> extensions, they are much faster, and it's easier to do fun stuff >> such as composition. >> > Tango's regex engine is just that. It uses a tagged NFA method. > http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.ht > ml > > The problem with this method is that while it's certainly faster to > match, it's MUCH slower to compile. There are no pathological > matches; only pathological compiles ;) I'm talking 60-70 seconds to > compile a more complex regex. could this be transitioned to CTFE? you could even have a debug mode that delays till runtime RegEx mather = new CTFERegEx!("some regex"); class CTFERegEx(char[] regex) : RegEx { debug(NoCTFE) static char[] done; else static const char[] done = CTFECompile(regex); public this() { debug(NoCTFE) if(done == null) done = CTFECompile(regex); base(done) } }

On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao@pathlink.com> wrote: > > could this be transitioned to CTFE? you could even have a debug mode that delays till runtime > > RegEx mather = new CTFERegEx!("some regex"); > > > class CTFERegEx(char[] regex) : RegEx > { > debug(NoCTFE) static char[] done; > else static const char[] done = CTFECompile(regex); > > public this() > { > debug(NoCTFE) if(done == null) done = CTFECompile(regex); > > base(done) > } > } For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled. So you _could_ precompile the regex into D code and use that. But seriously, man - if something takes 60 seconds to complete at _runtime_, making it CTFE would simply make your computer explode.

On Tue, Feb 17, 2009 at 8:57 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote: > The problem with this method is that while it's certainly faster to match, it's MUCH slower to compile. There are no pathological matches; only pathological compiles ;) I'm talking 60-70 seconds to compile a more complex regex. This might be an acceptable tradeoff for when you need to compile a regex in a long-running app like a server, but it's completely unacceptable for most small, Perl-like text munging programs. Hmmm, define "complex", I suppose it's ok for the general line-splitting/matching stuff? I got into trouble (time-wise) when we compiled a part of speech tagger into a transducer. In those cases we generally pre-compile stuff, and output it as a large struct in the target language. Of course, it would be fun if we can do it at compile-time ;). Besides that, if we'd have a good general recognizer/transducer implementation it could also be used for compact dictionary storage, perfect hashing automata, etc. Take care, Daniel

On Tue, Feb 17, 2009 at 9:26 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote: > For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled. So you _could_ precompile the regex into D code and use that. I have only been tinkering with Phobos, but that's good to hear, thanks!

On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org> wrote: > > Hmmm, define "complex" \w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)* This is a simple email regexp. This takes about 4 or 5 seconds to compile on my lappy (Pentium M). It only goes up from there.

Reply to Jarrett, > On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao@pathlink.com> wrote: > >> could this be transitioned to CTFE? you could even have a debug mode >> that delays till runtime >> >> RegEx mather = new CTFERegEx!("some regex"); >> >> class CTFERegEx(char[] regex) : RegEx >> { >> debug(NoCTFE) static char[] done; >> else static const char[] done = CTFECompile(regex); >> public this() >> { >> debug(NoCTFE) if(done == null) done = CTFECompile(regex); >> base(done) >> } >> } > For what it's worth the Tango regexes actually have a method to output > a D function that will implement the regex after it's compiled. So > you _could_ precompile the regex into D code and use that. > > But seriously, man - if something takes 60 seconds to complete at > _runtime_, making it CTFE would simply make your computer explode. > For any kind of debug, yeah, that's a problem. OTOH for release, as long as it /does/ compile, who cares? How many real release builds does anyone do a week?

Reply to Jarrett, > On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org> > wrote: > >> Hmmm, define "complex" >> > \w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)* > > This is a simple email regexp. This takes about 4 or 5 seconds to > compile on my lappy (Pentium M). > > It only goes up from there. > I wonder how well it would work on this: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html :b

On Tue, Feb 17, 2009 at 9:50 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote: > On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org> wrote: >> >> Hmmm, define "complex" > > \w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)* > > This is a simple email regexp. This takes about 4 or 5 seconds to > compile on my lappy (Pentium M). Hmm, odd. I have translated that regexp to the syntax of the tool that we used, that is written in Prolog (it is generally a constant factor slower than C/C++/D equivalents). Generating a minimized DFA takes far less than a second. I used the following expression (abstracted a bit with macros): --- macro(letter, {a..z, 'A'..'Z'}). macro(punctlet,[{-,+,.},letter+]). macro(dompunctlet,[{-,.},letter+]). macro(email,[letter+,punctlet*,@,letter+,dompunctlet*,.,letter+,dompunctlet*]). --- The software is available from: http://www.let.rug.nl/~vannoord/Fsa/fsa.html Take care, Daniel

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Andrei Alexandrescu
in reply to BCS

Permalink

Andrei Alexandrescu

Posted in reply to BCS

Permalink

BCS wrote:
> Reply to Andrei,
> 
>> I'm quite unhappy with the API of std.regexp. It's a chaotic design
>> that provides a hodgepodge of functionality and tries to use as many
>> synonyms of "to find" in the dictionary (e.g. search, match). I could
>> swear Walter never really cared for using regexps, and that is felt
>> throughout the design: it fills the bullet point but it's asinine to
>> use.
>>
>> Besides std.regexp only works with (narrow) strings and we want it to
>> work on streams of all widths and structures. One pet complaint I have
>> is that std.regexp puts a class around it all as if everybody's
>> favorite pastime would be to inherit Regexp and override some random
>> function in it.
>>
>> In the upcoming releases of D 2.0 there will be rather dramatic
>> breaking changes of phobos. I just wanted to ask whether y'all could
>> stomach yet another rewritten API or you'd rather use std.regexp as it
>> is for the time being.
>>
>> Andrei
>>
> 
> For what it's worth, I have a partial clone of the .NET API built on top of PCRE. I would have to ask my boss but I expect I could donate it if anyone want to use it as a basis.

That would be cool; I find the engine in std.regexp rather hard to understand.

Andrei

Forums