February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to monarch_dodra | I use both gdc and ldc with the "-w -O -release" flags. The writeln inside the loop is never evaluated, because the computeLocal boolean is always false. Thanks; in any case, I will continue reading all your answers :-) |
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to bioinfornatics | On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote: > I use both gdc and ldc with the "-w -O -release" flags. > > The writeln inside the loop is never evaluated, because the computeLocal boolean is always false. > > Thanks; in any case, I will continue reading all your answers :-) Just to add more information about FASTQ: http://www.biomedsearch.com/nih/Sanger-FASTQ-file-format-sequences/20015970.html And here is a set of FASTQ files on which a parser should succeed or fail: http://www.biomedsearch.com/attachments/00/20/01/59/20015970/gkp1137_nar-02248-d-2009-File005.gz The problem is that a sequence can be split across several lines, and the same goes for the quality string. I also think whitespace should be allowed in these lines. The @ is used to identify an identifier line and the + is used to identify a description line, but these characters can also appear as a quality value (ubyte). I agree the format spec is really bad, but it is heavily used in biology, so I would like a fast parser in order to develop some D applications instead of using C++. I will try all the previous recommendations later; thanks to all. In any case, it seems it is not easy to parse a file quickly in D. Note: is it possible to lock a file, so that a pure method can be used? |
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to bioinfornatics | this/these sorry |
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to bioinfornatics | On Wednesday, 6 February 2013 at 15:40:39 UTC, bioinfornatics wrote: > In any case, it seems it is not easy to parse a file quickly in D. I don't think that's true. D provides the same "FILE" primitive you'd get in C, so there is no reason for it to be slower than C. It is the "range" approach that, as convenient as it is, is not well adapted for certain things. As I had said, I tried to write my own program. In it, I devised a range that, instead of exposing things to parse character by character, parses an entire "object" (a ... "genome" ... maybe? I called them "Q" in my program) at once into an object. I decided to use the very simple "byLine" primitive. From there, you can query the object for its name/sequence/quality. The irony is that by "parsing twice" (once to do the I/O read, once to do the actual processing of the text), and taking into account that I'm allocating each object individually, I'm running twice as fast as my original, already improved implementation. Not only is it faster, it is also more convenient, since you can extract an entire Q object at once and then operate on it as you please: separation of algorithm and parsing. It correctly takes into account that a sequence can span multiple lines. It does not strip whitespace because, according to http://maq.sourceforge.net/fastq.shtml, whitespace is not a legal character. Now: keep in mind that this approach allocates three new strings for each Q. You could *try* an approach with a pre-allocated reusable buffer. This would mean you can only operate on one Q at a time, but you'd probably iterate over them faster. In any case, you can try it out: http://dpaste.dzfl.pl/8bdd0c84 |
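A minimal sketch of the record-at-a-time idea described above. It is not the code behind the dpaste link: the names FastqRecord and readRecord are hypothetical, and the rule of reading quality lines until their total length matches the sequence length is an assumption about how to cope with '@' and '+' showing up as quality characters. It takes any range of lines, for example the one returned by File.byLine.

```d
import std.array : appender;
import std.exception : enforce;
import std.range.primitives : empty, front, popFront;
import std.stdio;

/// One FASTQ entry (hypothetical type, field names chosen for illustration).
struct FastqRecord
{
    string id;        // text after the leading '@'
    string sequence;  // all sequence lines joined together
    string quality;   // all quality lines joined together
}

/// Reads the next record from a range of lines (terminators already removed).
/// Returns false once the input is exhausted.
bool readRecord(Lines)(ref Lines lines, out FastqRecord rec)
{
    // Skip blank lines between records.
    while (!lines.empty && lines.front.length == 0)
        lines.popFront();
    if (lines.empty)
        return false;

    enforce(lines.front[0] == '@', "expected '@' at the start of a record");
    rec.id = lines.front[1 .. $].idup;
    lines.popFront();

    // Sequence lines run until the '+' description line.
    auto seq = appender!string();
    while (!lines.empty && (lines.front.length == 0 || lines.front[0] != '+'))
    {
        seq.put(lines.front);
        lines.popFront();
    }
    enforce(!lines.empty, "missing '+' line");
    lines.popFront(); // the description line itself is ignored here

    // Quality lines are counted by length, so a line that starts with
    // '@' or '+' is still treated as quality data.
    auto qual = appender!string();
    while (!lines.empty && qual.data.length < seq.data.length)
    {
        qual.put(lines.front);
        lines.popFront();
    }
    enforce(qual.data.length == seq.data.length,
            "quality and sequence lengths differ");

    rec.sequence = seq.data;
    rec.quality = qual.data;
    return true;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("Usage: <program> <filename>");
        return;
    }
    auto lines = File(args[1], "r").byLine();
    FastqRecord rec;
    size_t count;
    while (readRecord(lines, rec))
        ++count;
    writeln(count, " records");
}
```

Because every field is copied into a fresh string, a record stays valid after the next call, at the cost of three allocations per record.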
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to monarch_dodra | On Wednesday, 6 February 2013 at 16:06:20 UTC, monarch_dodra wrote: > It correctly takes into account that a sequence can span multiple lines. It does not strip whitespace because, according to http://maq.sourceforge.net/fastq.shtml, whitespace is not a legal character. Hum, I just read your example files. I guess you can have whitespace. In any case, that should not pose any real problem. http://dlang.org/phobos/std_string.html#.removechars would come in handy here. |
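One possible way to drop whitespace from a sequence or quality line before storing it. This sketch deliberately uses a plain loop with std.ascii.isWhite instead of removechars, so that it does not depend on that function being available in a given Phobos release; stripAllWhite is a hypothetical helper name.

```d
import std.array : appender;
import std.ascii : isWhite;
import std.stdio : writeln;

/// Returns a copy of `line` with every ASCII whitespace character removed.
/// (Hypothetical helper, shown only to illustrate the idea.)
string stripAllWhite(const(char)[] line)
{
    auto result = appender!string();
    foreach (char c; line)   // iterate over code units, no decoding needed
        if (!isWhite(c))
            result.put(c);
    return result.data;
}

void main()
{
    writeln(stripAllWhite("AC GT\tNN ")); // prints ACGTNN
}
```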
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to bioinfornatics | On 2013-02-04 15:04, bioinfornatics wrote: > I am looking to parse huge files efficiently, but I think D is lacking for this purpose. > To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 min. Haven't compared to fastxtoolkit, but I have some code for you. I have processed the file SRR077487_1.filt.fastq from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ and I expect this syntax (no multiline sequences or whitespace). The file takes up almost 6 GB and processing took 1m45s, twice as fast as the fastest D solution so far, all compiled with gdc -O3. I bet your computer has better specs than mine. The program uses a buffer that should be twice the size of the largest sequence record (counting id, comment and quality data). A chunk of the file is read, then records are scanned in the buffer until the record start pointer passes the middle of the buffer; then memcpy is used to move all the rest to the beginning of the buffer, and the remaining space at the end is filled with another chunk read from the file. Data contains both the sequence letters and the associated quality information. Sequence ID and comment are slices of the buffer, so they hold valid info only until you move to the next sequence (and the number increments). This is the code: http://dpaste.1azy.net/8424d4ac Tell me what timings you can get now. |
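A simplified sketch of the chunked-buffer technique described above, reduced to plain lines instead of whole FASTQ records. It is not the code behind the dpaste link. The name eachLineChunked and the 1 MiB buffer size are assumptions, and unlike the description it compacts the buffer after every read rather than only once the scan passes the middle. Because of that the source and destination can overlap, so it uses memmove where the post uses memcpy.

```d
import core.stdc.string : memmove;
import std.exception : enforce;
import std.stdio;

/// Calls `onLine` for every line of `f`, reading the file in large chunks.
/// The slice passed to `onLine` points into the shared buffer and is only
/// valid during the call. (Hypothetical helper name.)
void eachLineChunked(File f, size_t bufSize,
                     scope void delegate(const(char)[]) onLine)
{
    auto buf = new char[bufSize];
    size_t filled = 0;              // valid bytes at the front of buf
    for (;;)
    {
        // rawRead needs a non-empty slice, so a full buffer without any
        // newline means a single line does not fit.
        enforce(filled < buf.length, "line longer than buffer");
        const got = f.rawRead(buf[filled .. $]).length;
        if (got == 0)
            break;                  // end of file
        const end = filled + got;

        size_t start = 0;           // start of the line not yet emitted
        foreach (i; filled .. end)  // earlier bytes were scanned already
        {
            if (buf[i] == '\n')
            {
                onLine(buf[start .. i]);
                start = i + 1;
            }
        }

        // Move the unfinished tail to the front and refill behind it.
        filled = end - start;
        if (start > 0 && filled > 0)
            memmove(buf.ptr, buf.ptr + start, filled);
    }
    if (filled > 0)
        onLine(buf[0 .. filled]);   // last line without a trailing newline
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("Usage: <program> <filename>");
        return;
    }
    size_t count;
    eachLineChunked(File(args[1], "rb"), 1 << 20,
                    (const(char)[] line) { ++count; });
    writeln(count, " lines");
}
```

As with the slices mentioned in the post, anything the callback wants to keep has to be copied (for example with .idup) before the next chunk overwrites the buffer.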
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to FG | On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote: > On 2013-02-04 15:04, bioinfornatics wrote: >> I am looking to parse huge files efficiently, but I think D is lacking for this purpose. >> To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 min. > > Haven't compared to fastxtoolkit, but I have some code for you. > I have processed the file SRR077487_1.filt.fastq from > ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ > and I expect this syntax (no multiline sequences or whitespace). > The file takes up almost 6 GB and processing took 1m45s, twice as fast as the > fastest D solution so far Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine). > Data contains both the sequence letters and the associated quality information. > Sequence ID and comment are slices of the buffer, so they hold valid info only > until you move to the next sequence (and the number increments). Hum. Mine allocates new slices, so they are never invalidated :) Mine also takes into account newlines and lowercase sequences. Still, it seems you and I took different approaches. I had mentioned using a reusable buffer. I'm going to try to consume some of your code to see if I can't improve my implementation. @bioinfornatics I'm getting really interested in the subject. I'm going to try to write an actual library/framework for working with FASTQ files in a D environment. This means I'll try to write robust and usable code, with both stability and performance in mind, as opposed to the "proofs of concept" so far. For now, I'd like to keep it simple: would something that only knows how to parse/write Sanger FASTQ files be of help to you? If I write something, can I have you review it? |
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to monarch_dodra | Thanks monarch and FG, I will read your code to see where I am failing :-) And of course, if you are interested in bio formats, I will be really happy to work on and review them together. In any case, big thanks; this is a very interesting subject. |
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to monarch_dodra | On 2013-02-06 21:43, monarch_dodra wrote: > On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote: >> I have processed the file SRR077487_1.filt.fastq from >> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/ >> and I expect this syntax (no multiline sequences or whitespace). >> The file takes up almost 6 GB and processing took 1m45s, twice as fast as the >> fastest D solution so far > > Do you mean my solution above? I tried your solution with dmd, with -release -O > -inline, and both gave about the same result (69s yours, 67s mine). Yes. Maybe the CPU is the bottleneck on my end. With DMD32 2.060 on win7-64, compiled with the same flags, I got: MD: 4m30s / FG: 1m55s, both using 100% of one core. Quite similar results with GDC64. You have timed the same file SRR077487_1.filt.fastq at 67s? > I'm getting really interested in the subject. I'm going to try to write an actual > library/framework for working with FASTQ files in a D environment. Those fastq are contagious. ;) > This means I'll try to write robust and usable code, with both stability and > performance in mind, as opposed to the "proofs of concept" so far. Yeah, but the big deal was that D is 5.5x slower than C++. You have mentioned something about using byLine. Well, I would have gladly used it instead of looking for line ends myself and pushing stuff around with memcpy. But the thing is that while fgets(char *buf, int bufSize, FILE *f) in fastx is fast at reading a file line by line, using file.readln(buf) is unpredictable. :) I mean that in DMD it's only a bit slower than file.rawRead(buf), but in GDC it can be several times slower. 
For example just reading in a loop:

    import std.stdio;
    enum uint bufferSize = 4096 - 16;
    void main(string[] args) {
        char[] tmp, buf = new char[bufferSize];
        size_t cnt;
        auto f = File(args[1], "r");
        switch(args[2]) {
            case "raw":
                do tmp = f.rawRead(buf); while (tmp.length);
                break;
            case "readln":
                do cnt = f.readln(buf); while (cnt);
                break;
            default:
                writeln("Use parameters: <filename> raw|readln");
        }
    }

Tested on a much smaller SRR077487.filt.fastq:
DMD32 -release -O -inline: raw 94ms / readln 450ms
GDC64 -O3: raw 94ms / readln 6.76s

Tested on SRR077487_1.filt.fastq:
DMD32 -release -O -inline: raw 1m44s / readln 1m55s
GDC64 -O3: raw 1m48s / readln 14m16s

Why such a big difference between the DMD and GDC (on Windows)? (or have I missed some switch in GDC?) |
February 06, 2013 Re: How to read fastly files ( I/O operation) | ||||
---|---|---|---|---|
| ||||
Posted in reply to monarch_dodra | On 02/06/2013 12:43 PM, monarch_dodra wrote: > with dmd, with -release -O -inline Going off topic a little: in a recent experiment, I noticed that adding -inline made a range solution twice as slow. -O -release still helped, but -inline was the culprit. Ali |