Speeding up text file parser (BLAST tabular format)
September 14, 2015
Hi,

This is my first post on the Dlang forums and I don't have a lot of experience with D (yet). I mainly write bioinformatics code in Python in my day-to-day job, but I've been toying with D for a couple of years now. I had the idea that it'd be fun to write a parser for a text-based tabular data format I read a lot of in my programs, but I was a bit stumped that the D implementation I created was slower than my Python version. I tried running `dmd -profile` on it but didn't really understand what I can do to make it go faster. I guess there are some unnecessary dynamic array extensions being made, but I can't figure out how to do without them; maybe someone can help me out? I tried making the examples as small as possible.

Here's the D code: http://dpaste.com/2HP0ZVA
Here's my Python code for comparison: http://dpaste.com/0MPBK67

Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670 with RAID6 SAS disks and 192 GB of RAM), the D version runs in about 20 seconds and the Python version in less than 16 seconds. I've repeated the runs at least three times when testing. This holds true even if the D version is compiled with -O.

The file being parsed is the output of a DNA/protein sequence mapping algorithm called BLAT, but the tabular output format is originally known from the famous BLAST algorithm.
Here's a short example of what the input files look like: http://dpaste.com/017N58F
The format is TAB-delimited: query, target, percent_identity, alignment_length, mismatches, gaps, query_start, query_end, target_start, target_end, e-value, bitscore
In the example the output is sorted by query, but this cannot be assumed to hold true in all cases. The input files range in size from several hundred megabytes to several gigabytes (10+ GiB).

A brief explanation of what the code does:
Parse each line,
Only accept records with percent_identity >= min_identity (90.0) and alignment_length >= min_matches (10),
Store all such records as tuples (in the D code this is a struct) in an array in an associative array indexed by 'query',
For each query, remove any records whose percent_identity is more than 5 percentage points below the highest value observed for that query,
Write results to stdout (in my real code the data is subject to further downstream processing)

This was all just for me learning to do some basic stuff in D, e.g. file handling, streaming data from disk, etc. I'm really curious what I can do to improve the D code. My original idea was that maybe I should compile the performance-critical parts of my Python codebase to D and call them with PyD or something, but now I'm not so sure any more. Help and suggestions appreciated!
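
For reference, a rough sketch of the approach described above (the actual code is in the dpaste link; the Hit struct, the column handling, and all other details here are assumptions based on this description, not taken from the real program):

import std.stdio;
import std.array : split;
import std.conv : to;

struct Hit
{
    string target;
    float pid;    // percent_identity
    int length;   // alignment_length
    // remaining columns omitted in this sketch
}

void main(string[] args)
{
    enum min_identity = 90.0;
    enum min_matches = 10;

    Hit[][string] hitlists;

    foreach (line; File(args[1]).byLine)
    {
        auto fields = line.split("\t");
        auto pid = fields[2].to!float;
        auto len = fields[3].to!int;
        if (pid >= min_identity && len >= min_matches)
            // byLine reuses its buffer, so dup the strings we keep
            hitlists[fields[0].idup] ~= Hit(fields[1].idup, pid, len);
    }
    // per-query filtering on (max pid - 5) and writing to stdout follow here
}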

September 14, 2015
On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
> [...]

Example output might be useful for you to see as well:

10009.1.1:5.2e-02_13: 16
10014.1.1:2.9e-03_11: 44
10017.1.1:4.1e-02_13: 16
10026.1.1:5.8e-03_12: 27
10027.1.1:6.6e-04_13: 16
10060.1.1:2.7e-03_14: 2
10061.1.1:5.1e-07_13: 41


Worth noting is that it is essentially impossible to predict how many "hits"/records there are for each query; it varies wildly, from 0 to 1000+ in some cases.
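
Given that the number of hits per query is unpredictable, one common way to cut down on repeated reallocations when appending is std.array.Appender. A sketch (not from the linked code; Hit and the variable names are assumed as in the sketch above):

import std.array : appender, Appender;

Appender!(Hit[])[string] hitlists;

// inside the parsing loop, instead of hitlists[query] ~= hit:
if (auto app = query in hitlists)
    app.put(hit);
else
{
    auto newList = appender!(Hit[]);
    newList.put(hit);
    hitlists[query] = newList;
}

// the accumulated array for a query is then available as hitlists[query].data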
September 14, 2015
On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
> Hi,
>
> Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670 with RAID6 SAS disks and 192GB of RAM), the D version runs in about 20 seconds and the Python version less than 16 seconds. I've repeated runs at least thrice when testing. This holds true even if the D version is compiled with -O.
>

Sounds like this program is actually IO bound. In that case I would not really expect an improvement from using D. What is the CPU usage like when you run this program?

Also, which dmd version are you using? I think there were some performance improvements for file reading in the latest version (2.068).
September 14, 2015
On Monday, 14 September 2015 at 12:44:22 UTC, Edwin van Leeuwen wrote:
> Sounds like this program is actually IO bound. In that case I would not really expect an improvement from using D. What is the CPU usage like when you run this program?
>
> Also, which dmd version are you using? I think there were some performance improvements for file reading in the latest version (2.068).

Hi Edwin, thanks for your quick reply!

I'm using v2.068.1; I actually got inspired to try this out after skimming the changelog :).

Regarding whether it is IO-bound: I actually expected it would be, but both the Python and the D version consume 100% CPU while running, and just copying the file around only takes a few seconds (cf. the 15-20 second runtimes of the two programs). There's bound to be some aggressive file caching going on, but I figure that would rather settle both programs' runtimes at lower values after running them a few times, and I see nothing indicating that.
September 14, 2015
On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
> [...]

Also, if the problem probably is I/O related, have you tried compiling with:
-O -inline -release -noboundscheck
?

Anyway, I think it's a good idea to test it with gdc and ldc, which are known to generate faster executables.

Andrea



September 14, 2015
On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote:
> On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
>> [...]
>
> Also, if the problem probably is I/O related, have you tried compiling with:
> -O -inline -release -noboundscheck
> ?
>
> Anyway, I think it's a good idea to test it with gdc and ldc, which are known to generate faster executables.
>
> Andrea

s/also/even
September 14, 2015
On Monday, 14 September 2015 at 12:50:03 UTC, Fredrik Boulund wrote:
> On Monday, 14 September 2015 at 12:44:22 UTC, Edwin van Leeuwen wrote:
>> Sounds like this program is actually IO bound. In that case I would not really expect an improvement from using D. What is the CPU usage like when you run this program?
>>
>> Also, which dmd version are you using? I think there were some performance improvements for file reading in the latest version (2.068).
>
> Hi Edwin, thanks for your quick reply!
>
> I'm using v2.068.1; I actually got inspired to try this out after skimming the changelog :).
>
> Regarding whether it is IO-bound: I actually expected it would be, but both the Python and the D version consume 100% CPU while running, and just copying the file around only takes a few seconds (cf. the 15-20 second runtimes of the two programs). There's bound to be some aggressive file caching going on, but I figure that would rather settle both programs' runtimes at lower values after running them a few times, and I see nothing indicating that.

Two things that you could try:

First hitlists.byKey can be expensive (especially if hitlists is big). Instead use:

foreach( key, value ; hitlists )

Also the filter.array.length is quite expensive. You could use count instead.
import std.algorithm : count;
value.count!(h => h.pid >= (max_pid - max_pid_diff));
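
Put together, the reporting loop might look roughly like this (a sketch using the names from the snippets above; the writefln output format is an assumption):

import std.stdio : writefln;
import std.algorithm : count, map, reduce, max;

foreach (query, hits; hitlists)
{
    auto max_pid = hits.map!(h => h.pid).reduce!max;
    auto n = hits.count!(h => h.pid >= (max_pid - max_pid_diff));
    writefln("%s: %s", query, n);
}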

September 14, 2015
On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote:
> On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
>> [...]
>
> Also, if the problem probably is I/O related, have you tried compiling with:
> -O -inline -release -noboundscheck
> ?

-inline in particular is likely to have a strong impact here

> Anyway, I think it's a good idea to test it with gdc and ldc, which are known to generate faster executables.
>
> Andrea

+1 I would expect ldc or gdc to strongly outperform dmd on this code.
September 14, 2015
On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote:
> On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
>> [...]
>
> Also, if the problem probably is I/O related, have you tried compiling with:
> -O -inline -release -noboundscheck
> ?
>
> Anyway, I think it's a good idea to test it with gdc and ldc, which are known to generate faster executables.
>
> Andrea

Thanks for the suggestions! I'm not too familiar with compiled languages like this; I've mainly written small programs in D and run them via `rdmd` in a scripting-language fashion. I'll read up on what the different compile flags do (I knew about -O, but I'm not sure what the others do).
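
For reference, with the flags suggested above the full command would look something like this (file name hypothetical):

dmd -O -inline -release -noboundscheck blast_parser.d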

Unfortunately I cannot get LDC working on my system. It seems to fail to find some shared library when I download the binary release, and I can't figure out how to build it myself. I haven't really given GDC a try yet. I'll see what I can do.

Running the original D code I posted before with the flags you suggested reduced the runtime by about 2 seconds on average.
September 14, 2015
On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen wrote:
> Two things that you could try:
>
> First hitlists.byKey can be expensive (especially if hitlists is big). Instead use:
>
> foreach( key, value ; hitlists )
>
> Also the filter.array.length is quite expensive. You could use count instead.
> import std.algorithm : count;
> value.count!(h => h.pid >= (max_pid - max_pid_diff));

I didn't know that hitlists.byKey was that expensive; that's just the kind of feedback I was hoping for. I'm just grasping at straws in the online documentation when I want to do things. With my Python background, it feels as if I can still get things that work that way.

I realize the filter.array.length thing is indeed expensive. I find it especially horrendous that the code I've written needs to allocate a big dynamic array that will most likely be cut down quite drastically in this step. Unfortunately I haven't figured out a good way to do this without storing the intermediate results, since I cannot know whether there might be yet another hit for any encountered "query": the input file might not be sorted. But the main reason I didn't just count the values like you suggest is actually that I need the filtered hits in later downstream analysis. The filtered hits for each query are used as input to a lowest common ancestor algorithm on the taxonomic tree (of life).
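
If the filtered hits themselves are needed downstream, one way to avoid the throwaway allocation from filter.array is to shrink each list in place instead (a sketch, assuming hitlists is an associative array of Hit[] as in the earlier snippets, reusing max_pid_diff from above):

import std.algorithm : map, reduce, max, remove, SwapStrategy;

foreach (query, ref hits; hitlists)
{
    auto max_pid = hits.map!(h => h.pid).reduce!max;
    // drop records below the cutoff without allocating a new array;
    // SwapStrategy.unstable does not preserve the original order
    hits = hits.remove!(h => h.pid < max_pid - max_pid_diff, SwapStrategy.unstable);
    // hits now holds only the records to pass on to the LCA step
}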