February 13, 2013
On 2013-02-13 18:39, monarch_dodra wrote:
> In any case, I am now parsing the 6 Gig packed into 1.5 Gig in about 53 seconds
> (down from 61). I also tried a dual-threaded approach (1 thread to unzip,
> 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast
> that it changes *nothing* in the final result.

Great. Performance aside, we didn't talk much about how this data will actually be used - should it only be read sequentially forward, or in both directions? Would there be a need to place markers or slice the sequence, etc.? Our small test case was only about counting nucleotides, so reading order and the possibility of further processing were irrelevant.

Mr. Bio, what use cases would you be interested in, other than those counters?

February 14, 2013
> Mr. Bio, what use cases would you be interested in, other than those counters?

Some ideas, such as letter counting:
renaming identifiers
trimming sequences at a quality-value cutoff (see the sketch below)
converting to a binary format
converting to FASTA + SFF
merging close sequences into one consensus
creating a de Bruijn graph
More ideas later.
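For the quality-trimming idea above, a minimal sketch of what it could look like in D (assuming Phred+33 qualities and a simple 3'-end cutoff; the record layout and names here are only illustrative, not the actual parser API discussed in this thread):

import std.stdio;

/// Illustrative FASTQ record; field names are made up for this sketch.
struct Record
{
    string id;
    char[] sequence;
    char[] quality; // Phred+33 encoded, same length as sequence
}

/// Trim the 3' end once the quality drops below `cutoff`:
/// trailing bases with quality < cutoff are removed.
void trimByQuality(ref Record r, int cutoff)
{
    size_t end = r.quality.length;
    while (end > 0 && (r.quality[end - 1] - 33) < cutoff)
        --end;
    r.sequence = r.sequence[0 .. end];
    r.quality  = r.quality[0 .. end];
}

void main()
{
    auto rec = Record("read1", "ACGTACGT".dup, "IIIIII##".dup);
    trimByQuality(rec, 20);
    writeln(rec.sequence); // prints "ACGTAC"
}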
February 19, 2013
On Thursday, 14 February 2013 at 18:31:35 UTC, bioinfornatics wrote:
>
>> Mr. Bio, what use cases would you be interested in, other than those counters?
>
> Some ideas, such as letter counting:
> renaming identifiers
> trimming sequences at a quality-value cutoff
> converting to a binary format
> converting to FASTA + SFF
> merging close sequences into one consensus
> creating a de Bruijn graph
> More ideas later.

OK. I posted the parser here:
http://dpaste.dzfl.pl/37b893ed

This runs on 2.061. I'll have to make a few changes if you need it to run on 2.060, to get around some 2.060-specific bugs.

This contains strictly only the parser. If you want, I'll post the async file reading stuff I wrote to interface with it.

The example sections should give you a quick idea of how to use it.
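For a quick idea of the input side without following the link: judging from the discussion later in this thread, the parser consumes an input range of ubyte[], and the simplest way to produce one is File.byChunk. A minimal sketch of just that feeding side (the file name is made up; the parser call itself lives in the paste):

import std.stdio;

// Produce the kind of range the parser consumes: an input range of ubyte[].
// This only demonstrates the feeding side; plug the result into the parser
// from the paste above.
void main()
{
    auto chunks = File("reads.fastq").byChunk(64 * 1024);
    size_t total;
    foreach (ubyte[] chunk; chunks)
        total += chunk.length;
    writeln(total, " bytes read chunk by chunk");
}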

Tell me what you think about it.
February 22, 2013
Arf, I am still on dmdfe 2.060.
February 22, 2013
On Friday, 22 February 2013 at 08:53:35 UTC, bioinfornatics wrote:
> Arf, I am still on dmdfe 2.060.

AFAIK, the problems are mostly the "nothrows", and maybe 1 or 2 "new style" alias declarations.
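For reference, the alias issue is just a syntax difference; the new declaration form only exists since DMD 2.061 (the identifiers below are made up):

// New-style alias declaration, accepted by DMD 2.061 and later:
alias Buffer = ubyte[];

// The 2.060-compatible spelling of the same declaration:
alias ubyte[] OldBuffer;

// The nothrow annotations are a separate matter: on 2.060 some of them
// may have to be dropped to work around compiler/Phobos limitations.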

That said, what's stopping you from upgrading? We are at 2.062 right now. Does upgrading break anything for you?
December 18, 2013
On Friday, 8 February 2013 at 06:22:18 UTC, Denis Shelomovskij wrote:
> On 06.02.2013 19:40, bioinfornatics wrote:
>> On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote:
>> I agree the format's spec is really bad, but it is heavily used in biology,
>> so I would like a fast parser in order to develop some D applications instead of
>> using C++.
>
> Yes, let's also create 1 GiB XML files and ask for fast encoding/decoding!
>
> The situation can be improved only if:
> 1. We find and kill every text format creator;
> 2. We create a really good binary format for each such task and support it in every application we create. Then, after some time, text formats will simply die out through evolution, as everything will support the better formats.
>
> (the second proposal is a real recommendation)

There is a binary resource format for EMF models, which normally use XML files, with some timing improvements stated at the link below. It might be worth a look if you are thinking about writing your own binary format.
http://www.slideshare.net/kenn.hussey/performance-and-extensibility-with-emf

There is also a fast compression library named Blosc that is used in some Python utilities; the measurements presented at the link below show that it can be faster than a memcpy if you have multiple cores.
http://blosc.pytables.org/trac

On the sequential accesses... I found that Windows writes blocks of data all over the place, but the best way to get it to write to more contiguous locations is to modify the file output routines to specify write-through. Still, the sequential accesses didn't improve read times on an SSD.
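A sketch of what "specify write-through" can mean in practice, assuming the output code talks to the Win32 API directly (bindings from druntime's core.sys.windows.windows; error handling omitted):

version (Windows)
{
    import core.sys.windows.windows;

    // Open a file for writing with FILE_FLAG_WRITE_THROUGH, so writes go
    // to the device instead of lingering as scattered dirty pages; per the
    // observation above this tends to give a more contiguous on-disk layout.
    HANDLE openWriteThrough(const(wchar)* path)
    {
        return CreateFileW(path,
                           GENERIC_WRITE,
                           0,                       // no sharing
                           null,                    // default security
                           CREATE_ALWAYS,
                           FILE_FLAG_WRITE_THROUGH, // the relevant flag
                           null);
    }
}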

Most decent SSDs can read big files at 300 MB/sec or more now, and you can RAID 0 a few of them and read at 800 MB/sec.

December 18, 2013
On Wednesday, 13 February 2013 at 17:39:11 UTC, monarch_dodra wrote:
> On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:
>> On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:
>>>
>>> Sometimes FASTQ files are compressed to gz, bz2, or xz, as they are often
>>> huge files.
>>> Maybe we need to keep this in mind early in development and use
>>> std.zlib
>>
>> While working on making the parser multi-thread compatible, I was able to separate the part that feeds data from the part that parses data.
>>
>> Long story short, the parser operates on an input range of ubyte[]: it is no longer responsible for acquiring the data.
>>
>> The range can be a simple (wrapped) File, a byChunk, an asynchronous file reader, a zip decompressor, or just stdin, I guess. The range can be transient.
>>
>> However, now that you mention it, I'll make sure it is correctly supported.
>>
>> I'll *try* to show you what I have so far tomorrow (in about 18h).
>
> Yeah... I played around too much, and the file is dirtier than ever.
>
> The good news is that I was able to test out what I was telling you about: accepting any range is ok:
>
> I used your ZFile range to plug it into my parser: I can now parse zipped files directly.
>
> The good news is that now I'm not bottlenecked by IO anymore! The bad news is that I'm now bottlenecked by the CPU decompressing. But since I'm using dmd, you may get better results with LDC or GDC.
>
> In any case, I am now parsing the 6 Gig packed into 1.5 Gig in about 53 seconds (down from 61). I also tried a dual-threaded approach (1 thread to unzip, 1 thread to parse), but again, the actual *parse* phase is so ridiculously fast that it changes *nothing* in the final result.
>
> Long story short: 99% of the time is spent acquiring data. The last 1% is just copying it into local buffers.
>
> The last bit of good news, though, is that a CPU bottleneck is always better than an IO bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads) and process several files at once, multiplying your throughput.

I modified the library unzip to make a parallel unzip a while back (at the link below). The execution time scaled very well with the number of CPUs for the test case I was using, which was a 2 GB unzipped distribution containing many small files and subdirectories. The parallel operations were per file. I think the execution-time gains on SSD drives came from having multiple cores scheduling the writes to separate files in parallel.
https://github.com/jnorwood/file_parallel/blob/master/unzip_parallel.d
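In the same spirit as the per-file parallelism above, std.parallelism makes a "one worker per file" loop very short. A minimal sketch that only byte-counts each file; the real per-file work (unzipping, parsing, ...) would go inside the loop:

import std.parallelism;
import std.stdio;

void main(string[] args)
{
    auto files = args[1 .. $]; // e.g. several .fastq or .zip files

    // Each file is handled as its own work unit in the task pool, so with
    // N cores and N files the per-file work proceeds in parallel.
    foreach (name; parallel(files, 1))
    {
        size_t bytes;
        foreach (chunk; File(name).byChunk(1 << 20))
            bytes += chunk.length;
        writefln("%s: %s bytes", name, bytes); // output from tasks may interleave
    }
}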

