Bad performance of simple regular expression - why?? - D Programming Language Discussion Forum

Bad performance of simple regular expression - why??
Feb 05, 2007 MarcL
Feb 05, 2007 Bill Lear
Feb 06, 2007 MarcL
Feb 05, 2007 Bill Baxter
Feb 06, 2007 Tomas Lindquist Olsen
Feb 06, 2007 Bill Baxter
Feb 06, 2007 Tomas Lindquist Olsen
Feb 06, 2007 Walter Bright
Feb 06, 2007 Manfred Nowak
Feb 06, 2007 Walter Bright
Feb 06, 2007 Manfred Nowak
Feb 08, 2007 Jascha Wetzel
Feb 09, 2007 Benji Smith
Feb 09, 2007 Joel C. Salomon
Feb 08, 2007 Benji Smith
Feb 06, 2007 Joel C. Salomon
Feb 06, 2007 MarcL

February 05, 2007

Bad performance of simple regular expression - why??

Posted by MarcL

Permalink

MarcL

Permalink

hi everyone,

first of all i want to say that i'm not a professional programmer - so the problem i have might be caused by my own lack of experience. Nevertheless i want to describe it, hoping that someone in this group might be able to tell me what's wrong. I am a molecular biologist and i often have to deal with larger amounts of DNA and protein sequence data (which is in principle text). I am mainly using Perl to process these DNA files, and Perl generally performs very well (regular expressions are actually the killer tool for working with DNA sequences). Unfortunately not everything in Perl is a fast as the regular expressions and so i started trying to learn a language that can be compiled to produce fast executables : C++ - and went crazy because everything is so complicated. All the ugly details that Perl takes care of for the user have to be organized manually and that really gave me the creeps. Then i learned about D and it sounded like it the solution to my problem: A compilable language that supports associative arrays, garbage collection and (most importantly for me) regular expressions! Great! I experimented a bit and actually managed to write small working programlet directly. I was delighted! But now comes the reason why i write all this: Being enthusiastic about this new nice language i started to write a module that should implement basic functions for working with the most common DNA sequence file formats. To parse these files i planned to use regular expressions. So far so good. When testing my module with a small DNA file everything seemed OK -then i tried to use it to parse a more real world-sized DNA file (~155000 characters of DNA sequence plus about the same amount of textual information) and had to find out that a simple std.regexp.split call took about 59 seconds!!! I could not believe it and wrote a little Perl script doing the same thing and it took less than 1s!! What's wrong here??? This can't really be true, can it? Is the implementation of regular expressions in the phobos library so bad or preliminary that it is so much less performant than the Perl regex engine? It's actually not usable for me like this (which is a sad think because i really like the other features of D and would like to use it). Am i making mistakes or do i simply have to wait for a better version of phobos?

Any comments or suggestions would be great.

cheers

February 05, 2007

Re: Bad performance of simple regular expression - why??

Posted by Bill Lear
in reply to MarcL

Permalink

Bill Lear

Posted in reply to MarcL

Permalink

MarcL <lohse@mpimp-golm.mpg.de> writes:

> .... Am i making mistakes or do i simply have to wait for a better version of phobos?
> 
> Any comments or suggestions would be great.

Please post a representative sample so we can help you, and do please try to post lines < 80 characters long, if possible.

Bill
--
Bill Lear
r * e * @ * o * y * a * c * m
* a * l * z * p * r * . * o *

February 05, 2007

Re: Bad performance of simple regular expression - why??

Posted by Bill Baxter
in reply to MarcL

Permalink

Bill Baxter

Posted in reply to MarcL

Permalink

MarcL wrote:
> hi everyone, 
> 
> first of all i want to say that i'm not a professional programmer  - so the problem i have might be caused by my own  lack of experience. Nevertheless i want to describe it, hoping that someone in this group might be able to tell me what's wrong. I am a molecular biologist and i often have to deal with larger amounts of DNA and protein sequence data (which is in principle text). I am mainly using Perl to process these DNA files, and Perl generally performs very well (regular expressions are actually the killer tool for working with DNA sequences). Unfortunately not everything in Perl is a fast as the regular expressions and so i started trying to learn a language that can be compiled to produce fast executables : C++ - and went crazy because everything is so complicated. All the ugly details that Perl takes care of for the user have to be organized manually and that really gave me the creeps. Then i learned about D and it sounded like it the solution to my problem: A compilable language that supports associative arrays, garbage collection and (most importantly for me) regular expressions! Great! I experimented a bit and actually managed to write small working programlet directly. I was delighted! But now comes the reason why i write all this: Being enthusiastic about this new nice language i started to write a module that should implement basic functions for working with the most common DNA sequence file formats. To parse these files i planned to use regular expressions. So far so good. When testing my module with a small DNA file everything seemed OK -then i tried to use it to parse a more real world-sized DNA file (~155000 characters of DNA sequence plus about the same amount of textual information) and had to find out that a simple std.regexp.split call took about 59 seconds!!! I could not believe it and wrote a little Perl script doing the same thing and it took less than 1s!! What's wrong here??? This can't really be true, can it? Is the implementation of regular expressions in the phobos library so bad or preliminary that it is so much less performant than the Perl regex engine? It's actually not usable for me like this (which is a sad think because i really like the other features of D and would like to use it). Am i making mistakes or do i simply have to wait for a better version of phobos? 
> 
> Any comments or suggestions would be great. 
> 
> cheers 
> 

Like Bill said, a sample would be helpful.

The most common mistake is to not precompile the regexp.
For instance using std.regexp.search with the same pattern string is not optimal.

If you're going to be using the same pattern multiple times, then you should create a RegExp object once, and then apply it multiple times using its search, match etc methods.

Regardless, perl was basically created as a convenient way to use regular expressions, so it's implementation could very likely be more efficient than D's.

Do perl regexp's handle unicode/utf8 properly these days?
(Actually, do D's for that matter?)

--bb

February 06, 2007

Re: Bad performance of simple regular expression - why??

Posted by Tomas Lindquist Olsen
in reply to Bill Baxter

Permalink

Tomas Lindquist Olsen

Posted in reply to Bill Baxter

Permalink

On Tue, 06 Feb 2007 08:47:53 +0900, Bill Baxter wrote:
> Regardless, perl was basically created as a convenient way to use regular expressions, so it's implementation could very likely be more efficient than D's.

I stumbled across this link last night. It says something slightly different...

February 06, 2007

Re: Bad performance of simple regular expression - why??

Posted by Bill Baxter
in reply to Tomas Lindquist Olsen

Permalink

Bill Baxter

Posted in reply to Tomas Lindquist Olsen

Permalink

Tomas Lindquist Olsen wrote:
> On Tue, 06 Feb 2007 08:47:53 +0900, Bill Baxter wrote:
>> Regardless, perl was basically created as a convenient way to use regular
>> expressions, so it's implementation could very likely be more efficient
>> than D's.
> 
> I stumbled across this link last night. It says something slightly
> different...

I'm all ears...

February 06, 2007

Re: Bad performance of simple regular expression - why??

Posted by Tomas Lindquist Olsen
in reply to Bill Baxter

Permalink

Tomas Lindquist Olsen

Posted in reply to Bill Baxter

Permalink

On Tue, 06 Feb 2007 10:35:06 +0900, Bill Baxter wrote:

> Tomas Lindquist Olsen wrote:
>> On Tue, 06 Feb 2007 08:47:53 +0900, Bill Baxter wrote:
>>> Regardless, perl was basically created as a convenient way to use regular expressions, so it's implementation could very likely be more efficient than D's.
>> 
>> I stumbled across this link last night. It says something slightly different...
> 
> I'm all ears...

Oups.. Here you go:

http://swtch.com/~rsc/regexp/regexp1.html

February 06, 2007

Re: Bad performance of simple regular expression - why??

Posted by Walter Bright
in reply to Tomas Lindquist Olsen

Permalink

Walter Bright

Posted in reply to Tomas Lindquist Olsen

Permalink

Tomas Lindquist Olsen wrote:
> http://swtch.com/~rsc/regexp/regexp1.html

That looks pretty cool. Anyone care to take a stab at implementing this in std.regexp? It could be done by examining the generated bytecode and attempting to convert it to an NFA. If that succeeds, then great, do the NFA. If it fails, then fallback to the original method.

February 06, 2007

Re: Bad performance of simple regular expression - why??

Posted by Joel C. Salomon
in reply to Tomas Lindquist Olsen

Permalink

Joel C. Salomon

Posted in reply to Tomas Lindquist Olsen

Permalink

Tomas Lindquist Olsen wrote:
> I stumbled across this link last night. It says something slightly different...
> 
> http://swtch.com/~rsc/regexp/regexp1.html

The regexp library isn’t the only really useful bit of software packed into Plan 9.  I’m implementing a compiler for the first time (C-ish, not D) under Plan 9, and it’s a constant temptation to add “just one more feature” because there’s a library to support it.

--Joel

February 06, 2007

Re: Bad performance of simple regular expression - why??

Posted by MarcL
in reply to Bill Lear

Permalink

MarcL

Posted in reply to Bill Lear

Permalink

Bill Lear Wrote:

> MarcL <lohse@mpimp-golm.mpg.de> writes:
> 
> > .... Am i making mistakes or do i simply have to wait for a better version of phobos?
> > 
> > Any comments or suggestions would be great.
> 
> Please post a representative sample so we can help you, and do please try to post lines < 80 characters long, if possible.
> 
> 
> Bill
> --
> Bill Lear
> r * e * @ * o * y * a * c * m
> * a * l * z * p * r * . * o *

first of all thanks a lot to everyone who answered on my posting i did not expect to receive feedback so quickly.

This is the piece of code that runs so slowly:

<code>

	char[] raw_sequence, stripped_sequence, header_segment, feature_segment;
	char[][] gb_segments, seq_segments;

	raw_sequence = cast(char[])std.file.read("/Users/marc/Desktop/sequences.gb");
	gb_segments = std.regexp.split(raw_sequence, "FEATURES", "");
	writefln("split into ", gb_segments.length, " segments");
	seq_segments = std.regexp.split(gb_segments[1], "ORIGIN", "");
	header_segment = gb_segments[0];
	feature_segment = std.regexp.sub(seq_segments[0], "^.*location/qualifiers\n", "", "i");
	stripped_sequence = std.string.toupper(std.regexp.sub(seq_segments[1], "[0-9\t \n/]", "", "g"));

</code>

I did not precompile the regular expression, so maybe this
is one of the reasons why it's slow, but it doesn't contain any loops
so the expressions are only used once. If anyone wants to try that
on his own machine - i attached the "sequences.gb" file.

I have tested the code on a PowerBook (G4 1,5GHz) using gdc and also on a Linux machine (1,8GHz PentiumM) using the original dmd compiler - but the results were the same

February 06, 2007

Re: Bad performance of simple regular expression - why??

Posted by MarcL
in reply to Bill Baxter

Permalink

MarcL

Posted in reply to Bill Baxter

Permalink

> 
> Like Bill said, a sample would be helpful.
> 
> The most common mistake is to not precompile the regexp.
> For instance using std.regexp.search with the same pattern string is not
> optimal.
> 
> If you're going to be using the same pattern multiple times, then you should create a RegExp object once, and then apply it multiple times using its search, match etc methods.
> 
> Regardless, perl was basically created as a convenient way to use regular expressions, so it's implementation could very likely be more efficient than D's.
> 
> Do perl regexp's handle unicode/utf8 properly these days?
> (Actually, do D's for that matter?)
> 
> --bb


here's a link to the sequences.gb file (attaching it didn't work) http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=76559634

Top | Forum index | About this forum

Forums