February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | Jarrett Billingsley:
>I'm talking 60-70 seconds to compile a more complex regex.<
A modern CPU is able to do something like 60*2*2E9 operations in that time, DMD needs 6 seconds or less to compile about 60000-80000 lines of my D code, so I think it's a bit too much time (probably 100 or 1000 times too much).
Bye,
bearophile
|
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | Reply to Jarrett,
> On Tue, Feb 17, 2009 at 2:47 PM, Daniel de Kok <me@danieldk.org>
> wrote:
>
>> Actually, I was wondering why nobody is considering real regular
>> languages anymore, that can be compiled to a normal finite state
>> recognizer or transducer. While this may not be as fancy as Perl-like
>> extensions, they are much faster, and it's easier to do fun stuff
>> such as composition.
>>
> Tango's regex engine is just that. It uses a tagged NFA method.
> http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.ht
> ml
>
> The problem with this method is that while it's certainly faster to
> match, it's MUCH slower to compile. There are no pathological
> matches; only pathological compiles ;) I'm talking 60-70 seconds to
> compile a more complex regex.
could this be transitioned to CTFE? you could even have a debug mode that delays till runtime
RegEx mather = new CTFERegEx!("some regex");
class CTFERegEx(char[] regex) : RegEx
{
debug(NoCTFE) static char[] done;
else static const char[] done = CTFECompile(regex);
public this()
{
debug(NoCTFE) if(done == null) done = CTFECompile(regex);
base(done)
}
}
|
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
Posted in reply to BCS | On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao@pathlink.com> wrote:
>
> could this be transitioned to CTFE? you could even have a debug mode that delays till runtime
>
> RegEx mather = new CTFERegEx!("some regex");
>
>
> class CTFERegEx(char[] regex) : RegEx
> {
> debug(NoCTFE) static char[] done;
> else static const char[] done = CTFECompile(regex);
>
> public this()
> {
> debug(NoCTFE) if(done == null) done = CTFECompile(regex);
>
> base(done)
> }
> }
For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled. So you _could_ precompile the regex into D code and use that.
But seriously, man - if something takes 60 seconds to complete at _runtime_, making it CTFE would simply make your computer explode.
|
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
On Tue, Feb 17, 2009 at 8:57 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote: > The problem with this method is that while it's certainly faster to match, it's MUCH slower to compile. There are no pathological matches; only pathological compiles ;) I'm talking 60-70 seconds to compile a more complex regex. This might be an acceptable tradeoff for when you need to compile a regex in a long-running app like a server, but it's completely unacceptable for most small, Perl-like text munging programs. Hmmm, define "complex", I suppose it's ok for the general line-splitting/matching stuff? I got into trouble (time-wise) when we compiled a part of speech tagger into a transducer. In those cases we generally pre-compile stuff, and output it as a large struct in the target language. Of course, it would be fun if we can do it at compile-time ;). Besides that, if we'd have a good general recognizer/transducer implementation it could also be used for compact dictionary storage, perfect hashing automata, etc. Take care, Daniel |
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
On Tue, Feb 17, 2009 at 9:26 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote: > For what it's worth the Tango regexes actually have a method to output a D function that will implement the regex after it's compiled. So you _could_ precompile the regex into D code and use that. I have only been tinkering with Phobos, but that's good to hear, thanks! |
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org> wrote: > > Hmmm, define "complex" \w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)* This is a simple email regexp. This takes about 4 or 5 seconds to compile on my lappy (Pentium M). It only goes up from there. |
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | Reply to Jarrett,
> On Tue, Feb 17, 2009 at 3:16 PM, BCS <ao@pathlink.com> wrote:
>
>> could this be transitioned to CTFE? you could even have a debug mode
>> that delays till runtime
>>
>> RegEx mather = new CTFERegEx!("some regex");
>>
>> class CTFERegEx(char[] regex) : RegEx
>> {
>> debug(NoCTFE) static char[] done;
>> else static const char[] done = CTFECompile(regex);
>> public this()
>> {
>> debug(NoCTFE) if(done == null) done = CTFECompile(regex);
>> base(done)
>> }
>> }
> For what it's worth the Tango regexes actually have a method to output
> a D function that will implement the regex after it's compiled. So
> you _could_ precompile the regex into D code and use that.
>
> But seriously, man - if something takes 60 seconds to complete at
> _runtime_, making it CTFE would simply make your computer explode.
>
For any kind of debug, yeah, that's a problem. OTOH for release, as long as it /does/ compile, who cares? How many real release builds does anyone do a week?
|
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | Reply to Jarrett, > On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org> > wrote: > >> Hmmm, define "complex" >> > \w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)* > > This is a simple email regexp. This takes about 4 or 5 seconds to > compile on my lappy (Pentium M). > > It only goes up from there. > I wonder how well it would work on this: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html :b |
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
On Tue, Feb 17, 2009 at 9:50 PM, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote: > On Tue, Feb 17, 2009 at 3:30 PM, Daniel de Kok <me@danieldk.org> wrote: >> >> Hmmm, define "complex" > > \w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)* > > This is a simple email regexp. This takes about 4 or 5 seconds to > compile on my lappy (Pentium M). Hmm, odd. I have translated that regexp to the syntax of the tool that we used, that is written in Prolog (it is generally a constant factor slower than C/C++/D equivalents). Generating a minimized DFA takes far less than a second. I used the following expression (abstracted a bit with macros): --- macro(letter, {a..z, 'A'..'Z'}). macro(punctlet,[{-,+,.},letter+]). macro(dompunctlet,[{-,.},letter+]). macro(email,[letter+,punctlet*,@,letter+,dompunctlet*,.,letter+,dompunctlet*]). --- The software is available from: http://www.let.rug.nl/~vannoord/Fsa/fsa.html Take care, Daniel |
February 17, 2009 Re: earthquake changes of std.regexp to come | ||||
---|---|---|---|---|
| ||||
Posted in reply to BCS | BCS wrote:
> Reply to Andrei,
>
>> I'm quite unhappy with the API of std.regexp. It's a chaotic design
>> that provides a hodgepodge of functionality and tries to use as many
>> synonyms of "to find" in the dictionary (e.g. search, match). I could
>> swear Walter never really cared for using regexps, and that is felt
>> throughout the design: it fills the bullet point but it's asinine to
>> use.
>>
>> Besides std.regexp only works with (narrow) strings and we want it to
>> work on streams of all widths and structures. One pet complaint I have
>> is that std.regexp puts a class around it all as if everybody's
>> favorite pastime would be to inherit Regexp and override some random
>> function in it.
>>
>> In the upcoming releases of D 2.0 there will be rather dramatic
>> breaking changes of phobos. I just wanted to ask whether y'all could
>> stomach yet another rewritten API or you'd rather use std.regexp as it
>> is for the time being.
>>
>> Andrei
>>
>
> For what it's worth, I have a partial clone of the .NET API built on top of PCRE. I would have to ask my boss but I expect I could donate it if anyone want to use it as a basis.
That would be cool; I find the engine in std.regexp rather hard to understand.
Andrei
|
Copyright © 1999-2021 by the D Language Foundation