September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to H. S. Teoh

On Monday, 14 September 2015 at 14:40:29 UTC, H. S. Teoh wrote:
> If performance is a problem, the first thing I'd recommend is to use a
> profiler to find out where the hotspots are. (More often than not, I have
> found that the hotspots are not where I expected them to be; sometimes a
> 1-line change to an unanticipated hotspot can result in a huge performance
> boost.)

I agree with you on that. I used Python's cProfile module to find the performance bottleneck in the Python version I posted, and shaved 8-10 seconds off the runtime by removing an extraneous str.split() I had missed.

I tried using the built-in profiler in DMD on the D program, but to no avail. I couldn't really make any sense of the output, other than that there were enormous numbers of calls to lots of functions I couldn't find a way to remove from the code. Here's a paste of the trace output from the version I posted in the original post: http://dpaste.com/1AXPK9P

> The next thing I'd try is to use gdc instead of dmd. ;-) IME, code produced
> by `gdc -O3` is at least 20-30% faster than code produced by `dmd -O
> -inline`. Sometimes the difference can be up to 40-50%, depending on the
> kind of code you're compiling.

Yes, it really seems that gdc or ldc is the way to go.
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to Fredrik Boulund

On Monday, 14 September 2015 at 14:35:26 UTC, Fredrik Boulund wrote:
> On Monday, 14 September 2015 at 14:28:41 UTC, John Colvin wrote:
>> Yup, glibc is too old for those binaries.
>>
>> What does "ldd --version" say?
>
> It says "ldd (GNU libc) 2.12". Hmm... The most recent version in RHEL's
> repo is "2.12-1.166.el6_7.1", which is what is installed. Can this be
> side-loaded without too much hassle and manual effort?

I've had nothing but trouble when using different versions of libc. It would be easier to do this instead: http://wiki.dlang.org/Building_LDC_from_source

I'm running a build of LDC git HEAD right now on an old server with glibc 2.11; I'll upload the result somewhere once it's done, if it might be useful.
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to Fredrik Boulund

On Monday, 14 September 2015 at 14:54:34 UTC, Fredrik Boulund wrote:
> On Monday, 14 September 2015 at 14:40:29 UTC, H. S. Teoh wrote:
> I agree with you on that. I used Python's cProfile module to find the
> performance bottleneck in the Python version I posted, and shaved off 8-10
> seconds of runtime on an extraneous str.split() I had missed.
> I tried using the built-in profiler in DMD on the D program but to no
> avail. I couldn't really make any sense of the output other than that there
> were enormous amounts of calls to lots of functions I couldn't find a way
> to remove from the code. Here's a paste of the trace output from the
> version I posted in the original post: http://dpaste.com/1AXPK9P

See this link for clarification on what the columns/numbers in the profile file mean: http://forum.dlang.org/post/f9gjmo$2gce$1@digitalmars.com

It is still difficult to parse, though. I myself often use sysprof (only available on Linux), which automatically ranks by time spent.
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to Fredrik Boulund

On 15/09/15 12:30 AM, Fredrik Boulund wrote:
> Hi,
>
> This is my first post on Dlang forums and I don't have a lot of experience
> with D (yet). I mainly code bioinformatics-stuff in Python on my day-to-day
> job, but I've been toying with D for a couple of years now. I had this idea
> that it'd be fun to write a parser for a text-based tabular data format I
> tend to read a lot of in my programs, but I was a bit stumped that the D
> implementation I created was slower than my Python version. I tried running
> `dmd -profile` on it but didn't really understand what I can do to make it
> go faster. I guess there's some unnecessary dynamic array extensions being
> made but I can't figure out how to do without them, maybe someone can help
> me out? I tried making the examples as small as possible.
>
> Here's the D code: http://dpaste.com/2HP0ZVA
> Here's my Python code for comparison: http://dpaste.com/0MPBK67
>
> Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670
> with RAID6 SAS disks and 192GB of RAM), the D version runs in about 20
> seconds and the Python version in less than 16 seconds. I've repeated runs
> at least thrice when testing. This holds true even if the D version is
> compiled with -O.
>
> The file being parsed is the output of a DNA/protein sequence mapping
> algorithm called BLAT, but the tabular output format is originally known
> from the famous BLAST algorithm.
> Here's a short example of what the input files look like:
> http://dpaste.com/017N58F
> The format is TAB-delimited: query, target, percent_identity,
> alignment_length, mismatches, gaps, query_start, query_end, target_start,
> target_end, e-value, bitscore
> In the example the output is sorted by query, but this cannot be assumed
> to hold true for all cases. The input file varies in size from several
> hundred megabytes to several gigabytes (10+ GiB).
>
> A brief explanation of what the code does:
> Parse each line.
> Only accept records with percent_identity >= min_identity (90.0) and
> alignment_length >= min_matches (10).
> Store all such records as tuples (in the D code this is a struct) in an
> array in an associative array indexed by 'query'.
> For each query, remove any records with percent_id more than 5
> percentage points below the highest value observed for that query.
> Write results to stdout (in my real code the data is subject to further
> downstream processing).
>
> This was all just for me learning to do some basic stuff in D, e.g. file
> handling, streaming data from disk, etc. I'm really curious what I can do
> to improve the D code. My original idea was that maybe I should compile
> the performance-critical parts of my Python codebase to D and call them
> with PyD or something, but now I'm not so sure any more. Help and
> suggestions appreciated!

A lot of this hasn't been covered, I believe.

http://dpaste.dzfl.pl/f7ab2915c3e1

1) You don't need to convert char[] to string via to. No. Too much. Cast it.

2) You don't need byKey; use the foreach key, value syntax. That way you won't go around doing unnecessary AA lookups.

Ok, I disabled the GC + reserved a bunch of memory. It probably won't help much, actually. In fact it may make the program fail, so keep that in mind.

Hmm, what else. I'm worried about that first foreach. I don't think it needs to exist as it does. I believe an input range would be far better. Use a buffer to store the Hit[]'s, with a subset per set of them.

If the first foreach is an input range, then things become slightly easier in the second. Now you can turn that into its own input range. Also, that .array usage concerns me. Many an allocation there! Hence why the input range should be the return from it.

The last foreach, let's assume, is a dummy. Keep in mind, stdout is expensive here. DO NOT USE. If you must write output, buffer it and emit in large quantities.
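Point 2 above (foreach over key and value together instead of byKey plus repeated lookups) might look roughly like this in isolation; this is a sketch, with a trimmed-down Hit struct and made-up data, not the original program:

```d
import std.algorithm : filter, maxElement;
import std.array : array;
import std.stdio : writeln;

// Trimmed-down version of the Hit struct from the original post.
struct Hit { string target; float pid; }

// Keep only hits within 5 percentage points of each query's best pid.
void prune(ref Hit[][string] hitlists)
{
    // One AA traversal: key and value come out together, and `ref`
    // lets us overwrite the stored array in place, with no
    // hitlists[query] lookups inside the loop body.
    foreach (query, ref hits; hitlists)
    {
        immutable best = hits.maxElement!(h => h.pid).pid;
        hits = hits.filter!(h => h.pid >= best - 5.0).array;
    }
}

void main()
{
    auto hitlists = ["q1": [Hit("t1", 99.0), Hit("t2", 91.0), Hit("t3", 96.0)]];
    hitlists.prune();
    writeln("q1: ", hitlists["q1"].length);  // prints "q1: 2"
}
```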
Based upon what I can see, you are definitely not able to use your CPUs to the max. There is no way they are the limiting factor here. Maybe your usage of a single core is, but not the CPUs themselves.

The thing is, you cannot use multiple threads on that first foreach loop to speed things up. No, that needs to happen all on one thread. Instead, after that thread you need to push the results into others. Perhaps, per thread, one lock (mutex) + buffer for hits. Go round robin over all the threads. If a mutex is still locked, you'll need to wait. In this situation a locked mutex means all your worker threads are busy, so you can't do anything more (anyway).

Of course after all this, the HDD may still be getting hit too hard. In which case I would recommend memory-mapping the file, which should allow the OS to handle reading it into memory more efficiently. But you'll need to rework .byLine for that.

Wow, that was a lot at 4:30am! So don't take it too seriously. I'm sure somebody else will rip it to shreds!
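The memory-mapping idea at the end could be sketched with std.mmfile; the file name and contents here are invented stand-ins (a real run would map the multi-gigabyte input instead):

```d
import std.algorithm : splitter;
import std.file : remove, write;
import std.mmfile : MmFile;
import std.stdio : writeln;

void main()
{
    // Invented two-line stand-in for the real BLAST tabular input.
    write("demo.blast8", "q1\tt1\t99.0\nq2\tt2\t88.0\n");
    scope (exit) remove("demo.blast8");

    // Memory-map the file: the OS pages it in on demand, and the whole
    // content becomes addressable as one slice -- no byLine, no
    // read-buffer copies.
    auto mm = new MmFile("demo.blast8");
    auto data = cast(const(char)[]) mm[];

    foreach (line; data.splitter('\n'))
        if (line.length)
            writeln(line.splitter('\t').front);  // first column: the query
}
```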
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to Edwin van Leeuwen

On Mon, Sep 14, 2015 at 04:13:12PM +0000, Edwin van Leeuwen via Digitalmars-d-learn wrote:
> On Monday, 14 September 2015 at 14:54:34 UTC, Fredrik Boulund wrote:
> >[...] I tried using the built-in profiler in DMD on the D program but to
> >no avail. I couldn't really make any sense of the output other than that
> >there were enormous amounts of calls to lots of functions I couldn't find
> >a way to remove from the code. Here's a paste of the trace output from
> >the version I posted in the original post: http://dpaste.com/1AXPK9P
>
> See this link for clarification on what the columns/numbers in the profile
> file mean: http://forum.dlang.org/post/f9gjmo$2gce$1@digitalmars.com
>
> It is still difficult to parse though. I myself often use sysprof (only
> available on linux), which automatically ranks by time spent.

Dmd's profiler has some limitations, especially if you're doing something that's CPU-bound for a long time (its internal counters are not wide enough and may overflow -- I have run into this before and it made it unusable for me). I highly recommend using `gdc -pg` with gprof instead.


T

--
Only boring people get bored. -- JM
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to Rikki Cattermole

On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote:
> On 15/09/15 12:30 AM, Fredrik Boulund wrote:
>>[...]
>
> A lot of this hasn't been covered I believe.
>
> http://dpaste.dzfl.pl/f7ab2915c3e1
>
> 1) You don't need to convert char[] to string via to. No. Too much. Cast it.
Not a good idea in general. Much better to ask for a string in the first place by using byLine!(immutable(char), immutable(char)). Alternatively just use char[] throughout.
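A minimal sketch of that suggestion; the demo file and its contents are made up. byLine's two template parameters are the terminator and character types, and passing immutable(char) for both sidesteps any question of their order:

```d
import std.file : remove, write;
import std.stdio : File, writeln;

void main()
{
    // Made-up stand-in for the real TSV input.
    write("demo.tsv", "q1\tt1\t99.0\nq1\tt2\t91.0\n");
    scope (exit) remove("demo.tsv");

    // Plain byLine reuses one char[] buffer per line; instantiated with
    // immutable(char) it hands out a fresh string per line instead, so
    // no cast or to!string is needed downstream.
    foreach (string line; File("demo.tsv").byLine!(immutable(char), immutable(char)))
        writeln(line);
}
```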
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to Rikki Cattermole

On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote:
> A lot of this hasn't been covered I believe.
>
> http://dpaste.dzfl.pl/f7ab2915c3e1
I believe that should be:

foreach (query, ref value; hitlists)

since an assignment is happening there..?
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to Fredrik Boulund

On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
> Hi,
>
> This is my first post on Dlang forums and I don't have a lot of experience
> with D (yet). I mainly code bioinformatics-stuff in Python on my day-to-day
> job, but I've been toying with D for a couple of years now. I had this idea
> that it'd be fun to write a parser for a text-based tabular data format I
> tend to read a lot of in my programs, but I was a bit stumped that the D
> implementation I created was slower than my Python version. I tried running
> `dmd -profile` on it but didn't really understand what I can do to make it
> go faster. I guess there's some unnecessary dynamic array extensions being
> made but I can't figure out how to do without them, maybe someone can help
> me out? I tried making the examples as small as possible.
>
> Here's the D code: http://dpaste.com/2HP0ZVA
> Here's my Python code for comparison: http://dpaste.com/0MPBK67
>
> clip
I am going to go off the beaten path here. If you really want speed
for a file like this, one way of getting it is to read the file
in as a single large binary array of ubytes (or in blocks if it's too big)
and parse the lines yourself. This should be fairly easy with D's array slicing.
I looked at the format and it appears that the lines are quite simple and use
a limited subset of the ASCII chars. If that is in fact true, then you
should be able to speed up reading using this technique. If you can have
UTF-8 chars in there, or if the format can be more complex than that shown
in your example, then please ignore my suggestion.
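A rough sketch of this approach, assuming plain ASCII input; the demo file and its contents are invented (a real run would read the actual multi-gigabyte file, possibly in blocks):

```d
import std.algorithm : splitter;
import std.file : read, remove, write;
import std.stdio : writeln;

void main()
{
    // Invented two-line stand-in for a BLAST tabular file.
    write("demo.blast8", "q1\tt1\t99.0\t50\nq2\tt2\t88.0\t12\n");
    scope (exit) remove("demo.blast8");

    // One read of the whole file, then pure slicing: splitter yields
    // views into `data`, so no per-line or per-field copies are made.
    auto data = cast(const(char)[]) read("demo.blast8");
    foreach (line; data.splitter('\n'))
    {
        if (line.length == 0) continue;
        writeln(line.splitter('\t').front);  // first column: the query id
    }
}
```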
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to CraigDillabaugh

On Monday, 14 September 2015 at 17:51:43 UTC, CraigDillabaugh wrote:
> On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
>> [...]
>
> I am going to go off the beaten path here. If you really want speed
> for a file like this one way of getting that is to read the file
> in as a single large binary array of ubytes (or in blocks if its too big)
> and parse the lines yourself. Should be fairly easy with D's array slicing.

My favourite for streaming a file:

enum chunkSize = 4096;
File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner()
September 14, 2015 Re: Speeding up text file parser (BLAST tabular format)

Posted in reply to NX

I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the hotspot is in std.array.split, contrary to expectations. :-) Here are the first few lines of the gprof output:

-----snip-----
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 19.77      0.43     0.43 76368364     0.00     0.00  nothrow @safe int std.array.split!(char[]).split(char[]).__foreachbody2(ref ulong, ref dchar)
 16.74      0.79     0.36                             nothrow void gc.gc.Gcx.mark(void*, void*, int)
 10.23      1.01     0.22                             _aApplycd2
  8.14      1.18     0.18                             _d_arrayappendcTX
  4.19      1.27     0.09                             nothrow ulong gc.gc.Gcx.fullcollect()
-----snip-----

As you can see, std.array.split takes up almost 20% of the total running time. The surprising (or not-so-surprising?) second runner-up for hotspot is actually the GC's marking algorithm. I'm guessing this is likely because of the extensive use of small GC-allocated arrays (splitting into strings, and the like). The 3rd entry, _aApplycd2, appears to be the druntime implementation of the foreach loop, so I'm not sure what could be done about it.

Anyway, just for kicks, I tried various ways of reducing the cost of std.array.split, but didn't get very far. Replacing it with std.regex.split didn't help. Looking at its implementation, while it does allocate a new array each time, it also slices over the input when constructing the substrings, so it didn't seem as inefficient as I first thought.

Next I tried disabling the GC with core.memory.GC.disable(). Immediately, I got a 20% performance boost. Of course, running with a disabled GC will soon eat up all memory and crash, which isn't an option in real-world usage, so the next best solution is to manually schedule GC collection cycles, say by calling GC.collect() every n iterations of the parsing loop, for some value of n.

I tried implementing a crude version of this (see code below), and found that manually calling GC.collect() even as frequently as once every 5000 loop iterations (for a 500,000-line test input file) still gives about a 15% performance improvement over completely disabling the GC. Since most of the arrays involved here are pretty small, the frequency could be reduced to once every 50,000 iterations and you'd pretty much get the 20% performance boost for free, and still not run out of memory too quickly.

Note that all measurements were done with `gdc -O3`. I did try the original code with `dmd -O`, and found that it was 20% slower than the gdc version, so I didn't look further.

Anyway, here's the code with the manual GC collection schedule (I modified main() slightly so that I could easily test the code with different input files, so you have to specify the input filename as an argument to the program when running it):

-------snip-------
#!/usr/bin/env rdmd
// Programmed in the D language
// Fredrik Boulund 2015-09-09
// Modified by H. S. Teoh 2015-09-14

import core.memory; // for GC control
import std.stdio;
import std.array;
import std.conv;
import std.algorithm;

struct Hit {
    string target;
    float pid;
    int matches, mismatches, gaps, qstart, qstop, tstart, tstop;
    double evalue, bitscore;
}

enum manualGcCount = 5_000;
ulong ticksToCollect = manualGcCount;

void gcTick()
{
    if (--ticksToCollect == 0)
    {
        GC.collect();
        ticksToCollect = manualGcCount;
    }
}

void custom_parse(string filename)
{
    float min_identity = 90.0;
    int min_matches = 10;

    Hit[][string] hitlists;
    foreach (record; filename
                     .File
                     .byLine
                     .map!split
                     .filter!(l => l[2].to!float >= min_identity &&
                                   l[3].to!int >= min_matches))
    {
        hitlists[record[0].to!string] ~= Hit(record[1].to!string,
                                             record[2].to!float,
                                             record[3].to!int,
                                             record[4].to!int,
                                             record[5].to!int,
                                             record[6].to!int,
                                             record[7].to!int,
                                             record[8].to!int,
                                             record[9].to!int,
                                             record[10].to!double,
                                             record[11].to!double);
        gcTick();
    }

    foreach (query; hitlists.byKey)
    {
        float max_pid = reduce!((a,b) => max(a, b.pid))(-double.max, hitlists[query]);
        float max_pid_diff = 5.00;
        hitlists[query] = hitlists[query].filter!(h => h.pid >= (max_pid - max_pid_diff)).array();
        writeln(query, ": ", hitlists[query].length);
        gcTick();
    }
}

void main(string[] args)
{
    auto filename = args[1];
    //auto filename = "blat.blast8";
    //auto filename = "./QE_150828_46.blast8";

    GC.disable();
    custom_parse(filename);
}
-------snip-------


T

--
Stop staring at me like that! It's offens... no, you'll hurt your eyes!
Copyright © 1999-2021 by the D Language Foundation