February 06, 2013
I use both GDC and LDC with the "-w -O -release" flags.

The writeln inside the loop is never evaluated, as the computeLocal boolean is always false.


Thanks in any case; I'll continue to read all your answers :-)
February 06, 2013
On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote:
> I use both GDC and LDC with the "-w -O -release" flags.
>
> The writeln inside the loop is never evaluated, as the computeLocal boolean is always false.
>
>
> Thanks in any case; I'll continue to read all your answers :-)

Just to add more information about FASTQ:
http://www.biomedsearch.com/nih/Sanger-FASTQ-file-format-sequences/20015970.html

And here a set of fastq where parser should success or fail http://www.biomedsearch.com/attachments/00/20/01/59/20015970/gkp1137_nar-02248-d-2009-File005.gz

The problem is that a sequence line can be split across several lines, and the same goes for the quality line. I also think these lines are allowed to contain whitespace.

The @ character marks an identifier line, and the + character marks a description line, but either character can also appear as a quality value (ubyte).
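For reference, a minimal four-line Sanger FASTQ record looks like this (a hypothetical read):

    @SEQ_42 some optional description
    GATTTGGGGTTCAAAGCAGT
    +
    !''*((((***+))%%%++)

The quality string has the same length as the sequence, and since Sanger qualities span the printable ASCII range, '@' and '+' are perfectly valid quality characters -- which is exactly why a line starting with '@' or '+' cannot be assumed to be an identifier or description line.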

I agree the format spec is really bad, but it is heavily used in biology, so I would like a fast parser in order to develop some D applications instead of using C++.

I will try all the previous recommendations later; thanks to all.

In any case, it seems it is not easy to parse a file quickly in D.

Note: is it possible to lock a file, so as to be able to use a pure method?
February 06, 2013
this/these

sorry
February 06, 2013
On Wednesday, 6 February 2013 at 15:40:39 UTC, bioinfornatics wrote:
> In any case, it seems it is not easy to parse a file quickly in D.

I don't think that's true. D provides the same "FILE" primitive you'd get in C, so there is no reason for it to be slower than C.

It is the "range" approach that, as convenient as it is, is not well adapted for certain things.

As I said, I tried to write my own program. In it, I devised a range that, instead of exposing things to parse character by character, parses an entire "object" (a ... "genome" ... maybe? I called them "Q" in my program) at once. I decided to build it on the very simple "byLine" primitive.

From there, you can query the object for its name/sequence/quality. The irony is that by "parsing twice" (once to do the IO read, once to do the actual processing of the text), and even taking into account that I'm allocating each object individually, I'm running twice as fast as my original, already-improved implementation. Not only is it faster, it is also more convenient, since you can extract an entire Q object at once and then operate on it as you please: separation of algorithm and parsing.
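For illustration, the record-at-a-time range could look roughly like this (a sketch only -- the actual code is in the dpaste link below; the names Q, QRange, and the field layout here are made up):

    import std.stdio;

    // One parsed record (field names are illustrative).
    struct Q
    {
        string id;       // identifier line, without the leading '@'
        string sequence; // all sequence lines concatenated
        string quality;  // all quality lines concatenated
    }

    // An input range that parses one whole Q per front/popFront.
    struct QRange
    {
        File file;
        Q front;
        bool empty;

        this(File f) { file = f; popFront(); }

        void popFront()
        {
            char[] line;
            if (file.readln(line) == 0) { empty = true; return; }
            // assumes a trailing newline on each line
            front.id = line[1 .. $ - 1].idup; // strip '@' and '\n'
            front.sequence = front.quality = null;
            // the sequence may span several lines, up to the '+' line
            while (file.readln(line) > 0 && line[0] != '+')
                front.sequence ~= line[0 .. $ - 1];
            // quality lines, until they match the sequence length
            while (front.quality.length < front.sequence.length
                   && file.readln(line) > 0)
                front.quality ~= line[0 .. $ - 1];
        }
    }

From there, something like foreach (q; QRange(File("reads.fastq"))) would hand you one complete record per iteration.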

It correctly takes into account that a sequence can be multiple lines. It does not strip whitespace because according to http://maq.sourceforge.net/fastq.shtml whitespace is not a legal character.

Now: keep in mind that this approach allocates three new strings for each Q. You could *try* an approach with a pre-allocated, reusable buffer. This would mean you can only operate on one Q at a time, but you'd probably iterate over them faster.

In any case, you can try it out:
http://dpaste.dzfl.pl/8bdd0c84

February 06, 2013
On Wednesday, 6 February 2013 at 16:06:20 UTC, monarch_dodra wrote:
> It correctly takes into account that a sequence can be multiple lines. It does not strip whitespace because according to http://maq.sourceforge.net/fastq.shtml whitespace is not a legal character.

Hmm, I just read your example files. I guess you can have whitespace. In any case, that should not pose any real problem. http://dlang.org/phobos/std_string.html#.removechars

would come in handy here.
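For example (a small sketch; the sequence literal is made up):

    import std.stdio;
    import std.string : removechars;

    void main()
    {
        auto seq = "GATT TGGG\tGTTC AAAG";
        // remove space, tab, CR and LF from the sequence
        auto clean = removechars(seq, " \t\r\n");
        writeln(clean); // GATTTGGGGTTCAAAG
    }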
February 06, 2013
On 2013-02-04 15:04, bioinfornatics wrote:
> I am looking to parse a huge file efficiently, but I think D is lacking for this purpose.
> To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 min.

Haven't compared to fastxtoolkit, but I have some code for you.
I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and it expects this syntax (no multiline sequences or whitespace).
The file takes up almost 6 GB; processing took 1m45s -- twice as fast as the
fastest D solution so far -- all compiled with gdc -O3.
I bet your computer has better specs than mine.

The program uses a buffer that should be at least twice the size of the largest
sequence record (counting the id, comment and quality data). A chunk of the file is
read, then records are scanned in the buffer until the record start pointer passes
the middle of the buffer -- then memcpy is used to move all the rest to
the beginning of the buffer, and the remaining space at the end is filled with
another chunk read from the file.
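In outline, the scheme could be sketched like this (illustrative only -- FG's actual code is at the dpaste link below; scanRecord is a made-up placeholder for the record-scanning step):

    import std.stdio : File;
    import core.stdc.string : memcpy;

    enum bufferSize = 4 << 20; // assumed >= 2x the largest record

    // Placeholder: scans one record starting at data[0] and returns
    // its length in bytes (the real scanning logic is omitted here).
    size_t scanRecord(const(char)[] data);

    void scanFile(string filename)
    {
        auto f = File(filename, "r");
        auto buf = new char[bufferSize];
        size_t filled = f.rawRead(buf).length;
        size_t pos = 0; // start of the current record

        while (pos < filled)
        {
            pos += scanRecord(buf[pos .. filled]);

            // Once the record start passes the middle of the buffer,
            // copy the unread tail to the front and refill the end.
            // (The halves cannot overlap, so memcpy is safe here.)
            if (pos > filled / 2)
            {
                immutable rest = filled - pos;
                memcpy(buf.ptr, buf.ptr + pos, rest);
                filled = rest + f.rawRead(buf[rest .. $]).length;
                pos = 0;
            }
        }
    }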

The data contains both the sequence letters and the associated quality information.
The sequence ID and comment are slices of the buffer, so they hold valid data only
until you move to the next sequence (when the record number increments).

This is the code: http://dpaste.1azy.net/8424d4ac
Tell me what timings you can get now.
February 06, 2013
On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
> On 2013-02-04 15:04, bioinfornatics wrote:
>> I am looking to parse a huge file efficiently, but I think D is lacking for this purpose.
>> To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 min.
>
> Haven't compared to fastxtoolkit, but I have some code for you.
> I have processed the file SRR077487_1.filt.fastq from
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
> and it expects this syntax (no multiline sequences or whitespace).
> The file takes up almost 6 GB; processing took 1m45s -- twice as fast as the
> fastest D solution so far

Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).

> Data contains both sequence letter and associated quality information.
> Sequence ID and comment are slices of the buffer, so they have valid info
> until you move to the next sequence (and the number increments).

Hum. Mine allocates new slices, so they are never invalidated :)
Mine also takes newlines and lowercase sequences into account.

Still, it seems you and I took different approaches. I had mentioned using a reusable buffer. I'm going to try to borrow some of your code to see if I can't improve my implementation.

@bioinfornatics

I'm getting really interested in the subject. I'm going to try to write an actual library/framework for working with FASTQ files in a D environment.

This means I'll try to write robust and useable code, with both stability and performance in mind, as opposed to the proofs of concept so far.

For now, I'd like to keep it simple: Would something that only knows how to parse/write Sanger FASTQ files be of help to you?

If I write something, can I have you review it?
February 06, 2013
Thanks monarch and FG,
I will read your code to see where I'm failing :-)
And of course, if you are interested in bio formats, I would be really happy to work on them and review code together.

In any case, big thanks; this is a very interesting subject.
February 06, 2013
On 2013-02-06 21:43, monarch_dodra wrote:
> On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
>> I have processed the file SRR077487_1.filt.fastq from
>> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
>> and expect this syntax (no multiline sequences or whitespace).
>> File takes up almost 6 GB processing took 1m45s - twice as fast as the
>> fastest D solution so far
>
> Do you mean my solution above? I tried your solution with dmd, with -release -O
> -inline, and both gave about the same result (69s yours, 67s mine).

Yes. Maybe CPU is the bottleneck on my end.
With DMD32 2.060 on win7-64 compiled with same flags I got:
MD: 4m30 / FG: 1m55s - both using 100% of one core.
Quite similar results with GDC64.

Did you time the same file, SRR077487_1.filt.fastq, at 67s?


> I'm getting really interested in the subject. I'm going to try to write an actual
> library/framework for working with FASTQ files in a D environment.

Those fastq are contagious. ;)

> This means I'll try to write robust and useable code, with both stability and
> performance in mind, as opposed to the proofs of concept so far.

Yeah, but the big deal was that D is 5.5x slower than C++.

You mentioned something about using byLine. Well, I would have gladly used
it instead of looking for line ends myself and pushing stuff around with memcpy.
But the thing is that while the fgets(char *buf, int bufSize, FILE *f) in fastx
is fast at reading a file line by line, file.readln(buf) is unpredictable. :)
I mean that in DMD it's only a bit slower than file.rawRead(buf), but in GDC
it can be several times slower. For example, just reading in a loop:

    import std.stdio;
    enum uint bufferSize = 4096 - 16;
    void main(string[] args) {
        char[] tmp, buf = new char[bufferSize];
        size_t cnt;
        auto f = File(args[1], "r");
        switch(args[2]) {
            case "raw":
                do tmp = f.rawRead(buf); while (tmp.length);
                break;

            case "readln":
                do cnt = f.readln(buf); while (cnt);
                break;

            default: writeln("Use parameters: <filename> raw|readln");
        }
    }

Tested on a much smaller SRR077487.filt.fastq:
DMD32 -release -O -inline: raw 94ms / readln 450ms
GDC64 -O3:                 raw 94ms / readln 6.76s

Tested on SRR077487_1.filt.fastq:
DMD32 -release -O -inline: raw 1m44s / readln  1m55s
GDC64 -O3:                 raw 1m48s / readln 14m16s

Why such a big difference between the DMD and GDC (on Windows)?
(or have I missed some switch in GDC?)

February 06, 2013
On 02/06/2013 12:43 PM, monarch_dodra wrote:

> with dmd, with -release -O -inline

Going off topic a little: in a recent experiment, I noticed that adding -inline made a range solution twice as slow. -O -release still helped, but -inline was the culprit.

Ali