Speeding up text file parser (BLAST tabular format) (page 2) - D Programming Language Discussion Forum

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by Fredrik Boulund
in reply to John Colvin

On Monday, 14 September 2015 at 13:37:18 UTC, John Colvin wrote:
> On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote:
>> On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
>>> [...]
>>
>> Also, if the problem is probably I/O related, have you tried with:
>> -O -inline -release -noboundscheck
>> ?
>
> -inline in particular is likely to have a strong impact here
>

Why would -inline be particularly likely to make a big difference in this case? I'm trying to learn, but I don't see what inlining could be done in this case.

>> Anyway I think it's a good idea to test it against gdc and ldc that are known to generate faster executables.
>>
>> Andrea
>
> +1 I would expect ldc or gdc to strongly outperform dmd on this code.

Why is that? I would love to learn to understand why they could be expected to perform much better on this kind of code.

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by John Colvin
in reply to Fredrik Boulund

On Monday, 14 September 2015 at 13:50:22 UTC, Fredrik Boulund wrote:
> On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote:
>> [...]
>
> Thanks for the suggestions! I'm not too familiar with compiled languages like this, I've mainly written small programs in D and run them via `rdmd` in a scripting language fashion. I'll read up on what the different compile flags do (I knew about -O, but I'm not sure what the others do).
>
> Unfortunately I cannot get LDC working on my system. It seems to fail finding some shared library when I download the binary released, and I can't figure out how to make it compile. I haven't really given GDC a try yet. I'll see what I can do.
>
> Running the original D code I posted before with the flags you suggested reduced the runtime by about 2 seconds on average.

What system are you on? What error messages are you getting?

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by Laeeth Isharc
in reply to Fredrik Boulund

On Monday, 14 September 2015 at 13:55:50 UTC, Fredrik Boulund wrote:
> On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen wrote:
>> Two things that you could try:
>>
>> First hitlists.byKey can be expensive (especially if hitlists is big). Instead use:
>>
>> foreach( key, value ; hitlists )
>>
>> Also the filter.array.length is quite expensive. You could use count instead.
>> import std.algorithm : count;
>> value.count!(h => h.pid >= (max_pid - max_pid_diff));
>
> I didn't know that hitlists.byKey was that expensive; that's just the kind of feedback I was hoping for. I'm just grasping at straws in the online documentation when I want to do things. With my Python background it feels as if I can still get things that work that way.
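For readers following along, here is a minimal sketch of the two suggestions quoted above: iterating keys and values together, and counting with `count` instead of `filter.array.length`. The `Hit` struct, the function name `countNearBest`, and the sample values are assumptions made up for illustration; only `hitlists`, `pid`, `max_pid` and `max_pid_diff` come from the thread.

```d
import std.algorithm : count, filter;
import std.array : array;

// Hypothetical hit record; only the `pid` field is taken from the thread.
struct Hit
{
    string target;
    double pid;
}

// Count the hits whose pid is within `maxPidDiff` of `maxPid`.
// `count` walks the slice once and allocates nothing.
size_t countNearBest(const(Hit)[] hits, double maxPid, double maxPidDiff)
{
    return hits.count!(h => h.pid >= maxPid - maxPidDiff);
}

void main()
{
    // Made-up data standing in for the parsed BLAST hits.
    Hit[][string] hitlists;
    hitlists["query1"] = [Hit("a", 99.0), Hit("b", 95.0), Hit("c", 80.0)];
    hitlists["query2"] = [Hit("d", 90.0)];

    // Keys and values come out together, so no lookup per key is needed.
    foreach (query, hits; hitlists)
    {
        auto n = countNearBest(hits, 99.0, 5.0);
        // Same number as the allocating version, without the temporary array.
        assert(n == hits.filter!(h => h.pid >= 99.0 - 5.0).array.length);
    }
}
```

If the filtered hits themselves are needed later, as described further down the thread, counting alone is not enough and the filtered range still has to be stored or consumed somewhere.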

I picked up D to start learning maybe a couple of years ago.  I found Ali's book, Andrei's book, github source code (including for Phobos), and asking here to be the best resources.  The docs make perfect sense when you have got to a certain level (or perhaps if you have a computer sciencey background), but can be tough before that (though they are getting better).

You should definitely take a look at the dlangscience project organized by John Colvin and others.

If you like ipython/jupyter also see his pydmagic - write D inline in a notebook.

You may find this series of posts interesting too - another bioinformatics guy migrating from Python:
http://forum.dlang.org/post/akzdstfiwwzfeoudhshg@forum.dlang.org

> I realize the filter.array.length thing is indeed expensive. I find it especially horrendous that the code I've written needs to allocate a big dynamic array that will most likely be cut down quite drastically in this step. Unfortunately I haven't figured out a good way to do this without storing the intermediary results since I cannot know if there might be yet another hit for any encountered "query" since the input file might not be sorted. But the main reason I didn't just count the values like you suggest is actually that I need the filtered hits in later downstream analysis. The filtered hits for each query are used as input to a lowest common ancestor algorithm on the taxonomic tree (of life).

Unfortunately I haven't time to read your code, and others will do better.  But do you use .reserve()?  Also, this is a nice, fast container library based on Andrei Alexandrescu's allocator:

https://github.com/economicmodeling/containers
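A minimal sketch of what using `.reserve()` could look like, both on a built-in dynamic array and on `std.array.Appender`. The function names and the idea of knowing the expected number of elements up front are assumptions for illustration, not something taken from the code under discussion.

```d
import std.array : appender;

// Built-in dynamic array: reserve capacity once so that repeated
// appends do not have to reallocate as the array grows.
int[] collectReserved(size_t expected)
{
    int[] hits;
    hits.reserve(expected);
    assert(hits.capacity >= expected);
    foreach (i; 0 .. expected)
        hits ~= cast(int) i;
    return hits;
}

// Appender keeps track of its capacity itself, which usually makes
// appending in a loop cheaper than `~=` on a plain slice.
int[] collectWithAppender(size_t expected)
{
    auto app = appender!(int[])();
    app.reserve(expected);
    foreach (i; 0 .. expected)
        app.put(cast(int) i);
    return app.data;
}

void main()
{
    enum expected = 10_000; // hypothetical estimate of how many items to store
    assert(collectReserved(expected).length == expected);
    assert(collectWithAppender(expected).length == expected);
}
```

Whether this helps depends on how good the size estimate is; with an unsorted input file the estimate may only be a rough guess.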

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by John Colvin
in reply to Fredrik Boulund

On Monday, 14 September 2015 at 13:58:33 UTC, Fredrik Boulund wrote:
> On Monday, 14 September 2015 at 13:37:18 UTC, John Colvin wrote:
>> On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote:
>>> On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
>>>> [...]
>>>
>>> Also, if the problem is probably I/O related, have you tried with:
>>> -O -inline -release -noboundscheck
>>> ?
>>
>> -inline in particular is likely to have a strong impact here
>>
>
> Why would -inline be particularly likely to make a big difference in this case? I'm trying to learn, but I don't see what inlining could be done in this case.

Range-based code like you are using leads to *huge* numbers of function calls to get anything done. The advantage of inlining is twofold: 1) you don't have to pay the cost of the function call itself and 2) often more optimisation can be done once a function is inlined.

>>> Anyway I think it's a good idea to test it against gdc and ldc that are known to generate faster executables.
>>>
>>> Andrea
>>
>> +1 I would expect ldc or gdc to strongly outperform dmd on this code.
>
> Why is that? I would love to learn to understand why they could be expected to perform much better on this kind of code.

Because they are much better at inlining. dmd is quick to compile your code and is the most up-to-date, but ldc and gdc will produce somewhat faster code in almost all cases, and sometimes dramatically faster code.
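To make the point about function calls concrete, here is a small sketch comparing a range pipeline with a hand-written loop that does the same work. The functions `pipelineSum` and `loopSum` are made up for this illustration and are unrelated to the parser being discussed.

```d
import std.algorithm : filter, map, sum;
import std.range : iota;

// Range pipeline: every element passes through empty/front/popFront
// on each layer (iota, filter, map) unless the compiler inlines the calls.
long pipelineSum(int n)
{
    return iota(0, n)
        .filter!(x => x % 2 == 0)
        .map!(x => cast(long) x * x)
        .sum;
}

// Hand-written loop computing the same value with no calls in the body.
long loopSum(int n)
{
    long total = 0;
    foreach (x; 0 .. n)
        if (x % 2 == 0)
            total += cast(long) x * x;
    return total;
}

void main()
{
    // Both versions must agree; only the amount of call overhead differs.
    assert(pipelineSum(1_000) == loopSum(1_000));
}
```

How close the pipeline gets to the loop depends on the compiler and flags in use, so it is worth timing both on the actual workload rather than assuming.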

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by Fredrik Boulund
in reply to John Colvin

On Monday, 14 September 2015 at 14:14:18 UTC, John Colvin wrote:
> what system are you on? What are the error messages you are getting?

I really appreciate your will to try to help me out. This is what ldd shows on the latest binary release of LDC on my machine. I'm on a Red Hat Enterprise Linux 6.6 system.

[boulund@terra ~]$ ldd ~/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: /lib64/libc.so.6: version `GLIBC_2.15' not found (required by /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
        linux-vdso.so.1 =>  (0x00007fff623ff000)
        libconfig.so.8 => /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/libconfig.so.8 (0x00007f7f716e1000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7f714a3000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f7f7129f000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000032cde00000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f7f7101a000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000032cca00000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f7f70c86000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f7f718ec000)

As you can see it lacks something related to GLIBC, but I'm not sure how to fix that.

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by John Colvin
in reply to Fredrik Boulund

On Monday, 14 September 2015 at 14:25:04 UTC, Fredrik Boulund wrote:
> On Monday, 14 September 2015 at 14:14:18 UTC, John Colvin wrote:
>> what system are you on? What are the error messages you are getting?
>
> I really appreciate your will to try to help me out. This is what ldd shows on the latest binary release of LDC on my machine. I'm on a Red Hat Enterprise Linux 6.6 system.
>
> [boulund@terra ~]$ ldd ~/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2
> /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
> /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: /lib64/libc.so.6: version `GLIBC_2.15' not found (required by /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
>         linux-vdso.so.1 =>  (0x00007fff623ff000)
>         libconfig.so.8 => /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/libconfig.so.8 (0x00007f7f716e1000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7f714a3000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x00007f7f7129f000)
>         libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000032cde00000)
>         libm.so.6 => /lib64/libm.so.6 (0x00007f7f7101a000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000032cca00000)
>         libc.so.6 => /lib64/libc.so.6 (0x00007f7f70c86000)
>         /lib64/ld-linux-x86-64.so.2 (0x00007f7f718ec000)
>
> As you can see it lacks something related to GLIBC, but I'm not sure how to fix that.

Yup, glibc is too old for those binaries.

What does "ldd --version" say?

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by Fredrik Boulund
in reply to Laeeth Isharc

On Monday, 14 September 2015 at 14:15:25 UTC, Laeeth Isharc wrote:
> I picked up D to start learning maybe a couple of years ago.  I found Ali's book, Andrei's book, github source code (including for Phobos), and asking here to be the best resources.  The docs make perfect sense when you have got to a certain level (or perhaps if you have a computer sciencey background), but can be tough before that (though they are getting better).
>
> You should definitely take a look at the dlangscience project organized by John Colvin and others.
>
> If you like ipython/jupyter also see his pydmagic - write D inline in a notebook.
>

I saw the dlangscience project on GitHub the other day. I've yet to venture deeper. The inlining of D in jupyter notebooks sure is cool, but I'm not sure it's very useful for me, Python feels more succinct for notebook use. Still, I really appreciate the effort put into that, it's really cool!

> You may find this series of posts interesting too - another bioinformatics guy migrating from Python:
> http://forum.dlang.org/post/akzdstfiwwzfeoudhshg@forum.dlang.org
>

I'll have a look at that series of posts, thanks for the heads-up!

> Unfortunately I haven't time to read your code, and others will do better.  But do you use .reserve()?  Also, this is a nice, fast container library based on Andrei Alexandrescu's allocator:
>
> https://github.com/economicmodeling/containers

Not familiar with .reserve(), nor Andrei's allocator library. I'll put that in the stuff-to-read-about-queue for now. :) Thanks for your tips!

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by Fredrik Boulund
in reply to John Colvin

On Monday, 14 September 2015 at 14:18:58 UTC, John Colvin wrote:
> Range-based code like you are using leads to *huge* numbers of function calls to get anything done. The advantage of inlining is twofold: 1) you don't have to pay the cost of the function call itself and 2) often more optimisation can be done once a function is inlined.

Thanks for that explanation! Now that you mention it, it makes perfect sense. I never considered it, but of course *huge* numbers of function calls to e.g. next() and other range methods will be made.

> Because they are much better at inlining. dmd is quick to compile your code and is the most up-to-date, but ldc and gdc will produce somewhat faster code in almost all cases, and sometimes dramatically faster code.

Sure sounds like I could have more fun with LDC and GDC on my system in addition to DMD :).

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by Fredrik Boulund
in reply to John Colvin

On Monday, 14 September 2015 at 14:28:41 UTC, John Colvin wrote:
> Yup, glibc is too old for those binaries.
>
> What does "ldd --version" say?

It says "ldd (GNU libc) 2.12". Hmm... The most recent version in RHEL's repo is "2.12-1.166.el6_7.1", which is what is installed. Can this be side-loaded without too much hassle and manual effort?

September 14, 2015

Re: Speeding up text file parser (BLAST tabular format)

Posted by H. S. Teoh
in reply to Fredrik Boulund

On Mon, Sep 14, 2015 at 02:34:41PM +0000, Fredrik Boulund via Digitalmars-d-learn wrote:
> On Monday, 14 September 2015 at 14:18:58 UTC, John Colvin wrote:
> >Range-based code like you are using leads to *huge* numbers of function calls to get anything done. The advantage of inlining is twofold: 1) you don't have to pay the cost of the function call itself and 2) often more optimisation can be done once a function is inlined.
> 
> Thanks for that explanation! Now that you mention it it makes perfect sense.  I never considered it, but of course *huge* numbers of function calls to e.g. next() and other range-methods will be made.
> 
> >Because they are much better at inlining. dmd is quick to compile your code and is the most up-to-date, but ldc and gdc will produce somewhat faster code in almost all cases, and sometimes dramatically faster code.
> 
> Sure sounds like I could have more fun with LDC and GDC on my system in addition to DMD :).

If performance is a problem, the first thing I'd recommend is to use a profiler to find out where the hotspots are. (More often than not, I have found that the hotspots are not where I expected them to be; sometimes a 1-line change to an unanticipated hotspot can result in a huge performance boost.)
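As a rough first step before reaching for a real profiler, a suspected hotspot can be timed by hand. The sketch below assumes a reasonably recent Phobos that ships `std.datetime.stopwatch`; the function `countFields` and the sample line are made up as a stand-in for a parsing stage. This is manual timing, not profiling: DMD's `-profile` switch or an external profiler gives a per-function breakdown instead.

```d
import std.algorithm : count;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical stand-in for one stage of the parser:
// counts the tab-separated fields in a line.
size_t countFields(string line)
{
    return line.count('\t') + 1;
}

void main()
{
    auto line = "query1\ttarget7\t98.5\t120"; // made-up BLAST-like row

    // Time only the section under suspicion.
    auto sw = StopWatch(AutoStart.yes);
    size_t total = 0;
    foreach (i; 0 .. 1_000_000)
        total += countFields(line);
    sw.stop();

    assert(total == 4_000_000);
    writefln("counting fields took %s ms", sw.peek.total!"msecs");
}
```

Timing a few sections this way can at least show whether the time goes into reading, parsing, or the later filtering, which narrows down where a profiler should look.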

The next thing I'd try is to use gdc instead of dmd. ;-)  IME, code produced by `gdc -O3` is at least 20-30% faster than code produced by `dmd -O -inline`. Sometimes the difference can be up to 40-50%, depending on the kind of code you're compiling.

T

-- 
Lottery: tax on the stupid. -- Slashdotter
