Bad performance of simple regular expression

February 05, 2007

Posted by MarcL

hi everyone,

First of all I want to say that I'm not a professional programmer, so the problem I have might be
caused by my own lack of experience. Nevertheless I want to describe it, hoping that someone in this
group might be able to tell me what's wrong. I am a molecular biologist and I often have to deal with
large amounts of DNA and protein sequence data (which is in principle text). I am mainly using Perl to
process these DNA files, and Perl generally performs very well (regular expressions are actually the killer
tool for working with DNA sequences). Unfortunately not everything in Perl is as fast as the regular
expressions, so I started trying to learn a language that can be compiled to produce fast
executables: C++. I went crazy because everything is so complicated. All the ugly details that Perl
takes care of for the user have to be organized manually, and that really gave me the creeps. Then I
learned about D and it sounded like the solution to my problem: a compilable language that supports
associative arrays, garbage collection and (most importantly for me) regular expressions! Great! I
experimented a bit and actually managed to write a small working program right away. I was delighted!
But now comes the reason why I write all this: being enthusiastic about this nice new language, I started
to write a module that should implement basic functions for working with the most common DNA
sequence file formats. To parse these files I planned to use regular expressions. So far so good. When
testing my module with a small DNA file everything seemed OK. Then I tried to use it to parse a more
real-world-sized DNA file (~155000 characters of DNA sequence plus about the same amount of textual
information) and found that a simple std.regexp.split call took about 59 seconds! I could not
believe it and wrote a little Perl script doing the same thing, and it took less than 1 s! What's wrong
here? This can't really be true, can it? Is the implementation of regular expressions in the Phobos
library so bad or preliminary that it performs so much worse than the Perl regex engine? It's actually
not usable for me like this (which is a sad thing, because I really like the other features of D and would
like to use it). Am I making mistakes or do I simply have to wait for a better version of Phobos?
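
For anyone who wants to reproduce this kind of measurement, a minimal sketch of a timing harness follows. It is written in Python purely for illustration and is not the original D or Perl code. The synthetic file layout, the `\n+` split pattern, and the helper names `make_synthetic_record` and `time_split` are all assumptions, since the real input and regex are not shown here.

```python
import random
import re
import time


def make_synthetic_record(n_bases=155_000, line_width=60, seed=0):
    """Build a fake sequence file: annotation lines, a '//' marker, DNA lines.

    The layout is invented for illustration; it is not a real file format.
    """
    rng = random.Random(seed)
    dna = "".join(rng.choice("ACGT") for _ in range(n_bases))
    dna_lines = [dna[i:i + line_width] for i in range(0, n_bases, line_width)]
    # Roughly as much annotation text as sequence (an assumption about the input).
    note = "FEATURE hypothetical annotation line of about sixty characters"
    note_lines = [note] * (n_bases // len(note))
    return "\n".join(note_lines) + "\n//\n" + "\n".join(dna_lines) + "\n"


def time_split(pattern, text):
    """Return (number of pieces, elapsed seconds) for one regex split."""
    regex = re.compile(pattern)  # compile outside the timed region
    start = time.perf_counter()
    pieces = regex.split(text)
    elapsed = time.perf_counter() - start
    return len(pieces), elapsed


if __name__ == "__main__":
    text = make_synthetic_record()
    # Hypothetical pattern: split on runs of newlines.
    count, seconds = time_split(r"\n+", text)
    print(f"{len(text)} characters, {count} pieces, {seconds:.3f} s")
```

Running the same pattern over the same input in each language, and timing only the split itself rather than file reading or program startup, would help show whether the regex engine is really where the time goes.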

Any comments or suggestions would be great.

cheers

If you look at this benchmark you will see two D regex programs: one is good and the other one is not.

http://shootout.alioth.debian.org/debian/benchmark.php?test=regexdna&lang=all

Ratio  Program & Logs           Full CPU Time (s)  Memory Use (KB)  GZip Bytes
1.0    Tcl #2                                3.60           26,420         357
1.8    C++ g++ #3                            6.45           13,408        1572
1.9    C gcc #2                              6.92           12,696        1083
2.0    Python                                7.33           23,592         326
2.1    OCaml #2                              7.56           47,768         599
2.4    C++ g++ #2                            8.45           21,132         619
2.8    Scheme MzScheme                      10.04          182,448         819
2.8    D Digital Mars #3                    10.23           45,360        1006
3.0    Ruby                                 10.79           31,756         307
3.1    Lua #3                               11.26           30,252         418
3.2    Java JDK -server                     11.65           78,204         641
3.6    Scala                                12.85           76,960         647
3.8    Perl #2                              13.66           25,048         444
4.2    Lisp SBCL                            15.06          305,796         709
4.3    Lisp SBCL #2                         15.31          305,576         548
5.0    JavaScript SpiderMonkey              18.14          180,884         349
6.7    Ada 95 GNAT #4                       24.11           15,264        1336
8.2    C# Mono                              29.39          142,272         608
11     Ada 95 GNAT #3                       40.28           47,160        1217
17     Smalltalk GST #2                     61.09          201,836         567
28     Haskell GHC #3                      100.41          121,252        2788
33     D Digital Mars #2                   117.40          143,940         490
1,208  Erlang                            4,342.63           79,080         606
