How to read files fast (I/O operation)
February 04, 2013
Dear all,

I am looking to parse huge files efficiently, but I think D is lacking for this purpose.
To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes.

My code is maybe not easy to read, as parsing a FASTQ file is not easy, and it is even harder when using a memory-mapped file.

I do not see where I can gain performance, as I do not do many copies and I use MmFile.
fastxtoolkit does not use a memory-mapped file and stores its results in a struct array for each sequence, but it is still faster!

Thanks for any help; I hope we can create a faster parser, otherwise D is too slow to use instead of C++.
February 04, 2013
code: http://dpaste.dzfl.pl/79ab0e17
fastxtoolkit: http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2
| - fastx_quality_stats.c ->  read_file()
| - libfastx/fastx.c      -> fastx_read_next_record()
February 04, 2013
On 2013-02-04 15:04, bioinfornatics wrote:
> I am looking to parse huge files efficiently, but I think D is lacking for this purpose.
> To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes.
>
> My code is maybe not easy to read, as parsing a FASTQ file is not easy, and it is even
> harder when using a memory-mapped file.

Why are you using mmap? Don't you just go through the file sequentially?
In that case it should be faster to read in chunks:

    foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
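For reference, a complete, minimal version of that chunked loop (the chunk size and command-line file name are arbitrary placeholders, not something from the thread):

    import std.stdio;

    void main(string[] args)
    {
        enum chunkSize = 64 * 1024;               // arbitrary buffer size (assumption)
        auto file = File(args[1], "rb");
        ulong newlines;
        foreach (ubyte[] buffer; file.byChunk(chunkSize))
        {
            // Work directly on the raw bytes of each chunk; counting
            // line breaks here just stands in for the real FASTQ parsing.
            foreach (b; buffer)
                if (b == '\n')
                    ++newlines;
        }
        writeln(newlines, " lines");
    }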

February 04, 2013
FG wrote:

> On 2013-02-04 15:04, bioinfornatics wrote:
>> I am looking to parse huge files efficiently, but I think D is lacking for this purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 minutes.
>>
>> My code is maybe not easy to read, as parsing a FASTQ file is not easy, and it is even harder when using a memory-mapped file.
> 
> Why are you using mmap? Don't you just go through the file sequentially? In that case it should be faster to read in chunks:
> 
>      foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }

I would go even further, and organise the file so N Data objects fit in one page, and read the file page by page. The page size can easily be obtained from the system. IMHO that would beat this fastxtoolkit. :)
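A minimal sketch of that page-by-page idea (not Dejan's code; the hard-coded 4096 is an assumption, on POSIX the real value could be queried with sysconf(_SC_PAGESIZE)):

    import std.stdio;

    void main(string[] args)
    {
        // 4096 bytes is a common page size; querying it from the OS is
        // left out to keep the sketch self-contained.
        enum pageSize = 4096;
        auto file = File(args[1], "rb");
        auto buffer = new ubyte[pageSize];
        while (true)
        {
            auto page = file.rawRead(buffer);    // fills at most one page
            if (page.length == 0)
                break;                           // end of file
            // ... hand `page` to the record parser here ...
        }
    }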

-- 
Dejan Lekic
dejan.lekic (a) gmail.com
http://dejan.lekic.org
February 04, 2013
On Monday, 4 February 2013 at 19:30:59 UTC, Dejan Lekic wrote:
> FG wrote:
>
>> On 2013-02-04 15:04, bioinfornatics wrote:
>>> I am looking to parse huge files efficiently, but I think D is lacking for this
>>> purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++)
>>> needs 2 minutes.
>>>
>>> My code is maybe not easy to read, as parsing a FASTQ file is not easy, and it is
>>> even harder when using a memory-mapped file.
>> 
>> Why are you using mmap? Don't you just go through the file sequentially?
>> In that case it should be faster to read in chunks:
>> 
>>      foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
>
> I would go even further, and organise the file so N Data objects fit one page,
> and read the file page by page. The page-size can easily be obtained from the
> system. IMHO that would beat this fastxtoolkit. :)

AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach.

I'm just wondering if maybe the reason the D code is slow is not just because of:
- unicode.
- front + popFront.

ranges in D are "notorious" for being slow to iterate on text, due to the "double decode".

If you are *certain* that the file contains nothing but ASCII (which should be the case for fastq, right?), you can get more bang for your buck if you attempt to iterate over it as an array of bytes, and convert the bytes to char on the fly, bypassing any and all unicode processing.
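A minimal sketch of that idea (not code from the thread): read raw bytes and cast them to char, so no UTF decoding ever happens:

    import std.stdio;

    void main(string[] args)
    {
        auto file = File(args[1], "rb");
        ulong chars, nonAscii;
        foreach (ubyte[] chunk; file.byChunk(64 * 1024))
        {
            // Casting ubyte[] to char[] costs nothing and, unlike iterating a
            // string range with front/popFront, performs no UTF decoding.
            auto text = cast(const(char)[]) chunk;
            foreach (c; text)            // foreach over a char array yields code units
            {
                ++chars;
                if (c >= 0x80)
                    ++nonAscii;          // sanity check: FASTQ should be pure ASCII
            }
        }
        writeln(chars, " chars, ", nonAscii, " non-ASCII");
    }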
February 05, 2013
On Mon, 4 Feb 2013, monarch_dodra wrote:

> AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach.
> 
> I'm just wondering if maybe the reason the D code is slow is not just because
> of:
> - unicode.
> - front + popFront.

First rule of performance analysis: don't guess, measure.
February 05, 2013
On 2013-02-04 20:39, monarch_dodra wrote:

> AFAIK, he is reading text data that needs to be parsed line by line, so
> byChunk may not be the best approach. Or at least, not the easiest
> approach.

He can still read a chunk from the file (or the whole file) and then process that chunk line by line.

> I'm just wondering if maybe the reason the D code is slow is not just
> because of:
> - unicode.
> - front + popFront.
>
> ranges in D are "notorious" for being slow to iterate on text, due to
> the "double decode".
>
> If you are *certain* that the file contains nothing but ASCII (which
> should be the case for fastq, right?), you can get more bang for your
> buck if you attempt to iterate over it as an array of bytes, and convert
> the bytes to char on the fly, bypassing any and all unicode processing.

Depending on what you're doing you can blast through the bytes even if it's Unicode. It will of course not validate the Unicode.
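As a rough sketch of what that byte-level line scanning could look like (the helper name byteLines is made up for illustration, it is not a Phobos function):

    // Hypothetical helper: split a raw byte buffer into line slices by scanning
    // for '\n' only; no UTF validation or decoding is performed.
    const(ubyte)[][] byteLines(const(ubyte)[] data)
    {
        const(ubyte)[][] lines;
        size_t start = 0;
        foreach (i, b; data)
        {
            if (b == '\n')
            {
                lines ~= data[start .. i];   // slice excludes the '\n'
                start = i + 1;
            }
        }
        if (start < data.length)
            lines ~= data[start .. $];       // trailing line without '\n'
        return lines;
    }

    unittest
    {
        auto record = cast(const(ubyte)[]) "@id\nACGT\n+\n!!!!";
        assert(byteLines(record).length == 4);
    }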

-- 
/Jacob Carlborg
February 06, 2013
Instead of calling MmFile's opIndex to read ubyte by ubyte, I tried reading into a buffer array of length PAGESIZE.

code here: http://dpaste.dzfl.pl/25ee34fc

It is not faster: to parse 12 GB I still need 11 minutes. I do not see how I could read the file faster!

As a reminder, fastxtoolkit needs 2 minutes!
February 06, 2013
On Wednesday, 6 February 2013 at 10:43:02 UTC, bioinfornatics wrote:
> Instead of calling MmFile's opIndex to read ubyte by ubyte, I tried reading into a buffer array of length PAGESIZE.
>
> code here: http://dpaste.dzfl.pl/25ee34fc
>
> It is not faster: to parse 12 GB I still need 11 minutes. I do not see how I could read the file faster!
>
> As a reminder, fastxtoolkit needs 2 minutes!

This might be stupid, but I see a "writeln" in your inner loop. You aren't slowed down just by your console by any chance?

If I were you, I'd start benching to try and see who is slowing you down.

I'd reorganize the code to parse a file that is, say, 512 MB. The rationale is that you can place it entirely in memory at once. Then, I'd shift the logic from "fully process each character before moving to the next character" to "make a full processing pass over the entire data structure before moving to the next pass".

The steps I see that need to be measured are:

* Raw read of file
* Iterating on your file to extract it as a raw array of "Data" objects
* Processing the Data objects
* Outputting the data
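One way to time those stages individually (a sketch, not from the thread; std.datetime.stopwatch is the module name in current releases, the 2013-era StopWatch lived in std.datetime):

    import std.stdio;
    import std.file : read;
    import std.datetime.stopwatch : StopWatch, AutoStart;

    void main(string[] args)
    {
        auto sw = StopWatch(AutoStart.yes);

        auto raw = cast(ubyte[]) read(args[1]);    // step 1: raw read of the file
        writeln("raw read: ", sw.peek.total!"msecs", " ms (", raw.length, " bytes)");

        sw.reset();
        // Steps 2-4 (extraction, processing, output) would be timed the same
        // way, calling sw.reset() between the stages.
    }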

Also (of course), you need to make sure you are compiling in release mode (might sound obvious, but you never know). Are you using dmd? I heard the "other" compilers are faster.
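For reference, typical release-mode invocations (the file name is a placeholder; -noboundscheck was the dmd spelling at the time, newer dmd uses -boundscheck=off):

    dmd  -O -release -inline -noboundscheck fastq_parser.d
    ldc2 -O3 -release fastq_parser.d
    gdc  -O3 -frelease fastq_parser.d -o fastq_parser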

I'm going to try and see with some example files if I can't get something running faster.
February 06, 2013
On Wednesday, 6 February 2013 at 11:15:22 UTC, monarch_dodra wrote:
> I'm going to try and see with some example files if I can't get something running faster.

Benchmarking and tweaking, I was able to find 3 things that speed up your program:

1) Make computeLocal a compile-time constant. This will give you a tiny bit of performance. It depends on whether you plan to make it a run-time argument switch, I guess.

2) Makes things about 10%-20% faster:
Your "nucleic" and "amino" hash tables map a character to an index. However, given the range of the characters ('A' to 'Z'), you are better off using a flat array where each index represents a character, e.g. A is index 0, B is index 1. This way, a lookup is a simple array indexing rather than a hash table lookup (see the sketch after point 3 below).

You may even get a bigger bang for your buck by simply giving your "_stats" structure a plain "A is index 0, B is index 1" layout, and only "re-ordering" the data at the end, when you want to read it. (I haven't done this, though.)

3) Makes things about 100% faster (it ran in half the time on my machine): I don't know how MmFile works, but a simple File + "rawRead" seems to get the job done fast. Also, instead of keeping track of one (or several) indexes, I merely keep a single slice. The only thing I care about is whether my slice is empty, in which case I refill it.
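A minimal sketch of points 2 and 3 together (this is not the dpaste code, just an illustration of the flat lookup table and the single-slice rawRead loop; the buffer size is arbitrary):

    import std.stdio;

    void main(string[] args)
    {
        // Point 2: flat lookup table indexed directly by the character,
        // instead of a hash table mapping char -> index.
        size_t[256] counts;                       // one slot per possible byte value

        // Point 3: plain File + rawRead, keeping a single slice that is
        // refilled whenever it runs empty.
        auto file   = File(args[1], "rb");
        auto buffer = new ubyte[4 * 1024 * 1024];
        ubyte[] slice;                            // current unconsumed data

        while (true)
        {
            if (slice.length == 0)
            {
                slice = file.rawRead(buffer);     // refill when empty
                if (slice.length == 0)
                    break;                        // end of file
            }
            counts[slice[0]]++;                   // "parse" one byte (placeholder logic)
            slice = slice[1 .. $];                // consume it
        }

        foreach (c; 'A' .. 'Z' + 1)
            if (counts[c])
                writeln(cast(char) c, ": ", counts[c]);
    }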

The modified code is here. I'm apparently getting the same output you are, but that doesn't mean there might not be bugs in it. For example, I noticed that you don't strip leading whitespace, if any, before the first read.
http://dpaste.dzfl.pl/9b9353b8

----
I'd be tempted to re-write the parser using a "byLine" approach, since my quick reading about FASTQ seems to imply it is a line-based format. Or just plain try to write a parser from scratch, putting my own logic and thought into it (all I did was modify your code, without caring about the actual algorithm).
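For illustration, a minimal byLine-based FASTQ reader (a sketch assuming the usual four-line record layout of @identifier, sequence, "+" separator, quality; it is not the rewrite described above):

    import std.stdio;

    void main(string[] args)
    {
        auto file = File(args[1], "r");
        ulong records;
        ulong bases;

        auto lines = file.byLine();                 // line terminator stripped by default
        while (!lines.empty)
        {
            lines.popFront();                       // skip the "@identifier" line
            if (lines.empty) break;                 // truncated record, stop
            auto seq = lines.front.idup;            // copy: byLine reuses its internal buffer
            lines.popFront();
            if (lines.empty) break;
            lines.popFront();                       // skip the "+" separator line
            if (lines.empty) break;
            lines.popFront();                       // skip the quality line

            ++records;
            bases += seq.length;
        }

        writeln(records, " records, ", bases, " bases");
    }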