February 09, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

Some ideas, such as:
- letter counting
- renaming identifiers
- trimming sequences using a quality-value cutoff
- converting to a binary format
More ideas later.
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to FG

Instead of using memcpy, I tried slicing at around line 136:

_hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];

I get the same performance.
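For readers following along, here is a tiny, self-contained illustration of that buffer-compaction idiom (the values are made up; the names mirror the post):

//----
import std.stdio;

void main()
{
    // Names mirror the post: a read buffer where only the tail is unconsumed.
    auto _hardBuffer = new ubyte[](16);
    foreach (i, ref b; _hardBuffer)
        b = cast(ubyte) i;

    size_t _bufPosition = 12;                          // 4 unconsumed bytes at the end
    size_t moveSize = _hardBuffer.length - _bufPosition;

    // Slice copy: same effect as the memcpy it replaces, as long as the
    // source and destination regions don't overlap (D raises an error on
    // overlapping array copies).
    _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];

    writeln(_hardBuffer[0 .. moveSize]);               // [12, 13, 14, 15]
}
//----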
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics wrote:
> Instead of using memcpy, I tried slicing at around line 136:
> _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];
>
> I get the same performance.
I think I figured out why I'm getting different results than you guys are, on my Windows machine.
AFAIK, file reads on Windows are natively asynchronous.
I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file while the main thread parses the read buffers.
I'm getting EXACTLY 0% performance improvement. Not better, not worse, just 0%.
I'd have to try again on my SSD. Right now, I'm parsing the 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the file takes the EXACT same amount of time as parsing it...
This takes 60 seconds.
//----
import std.stdio;

auto input = File(args[1], "rb");
ubyte[] buffer = new ubyte[](BufferSize); // BufferSize: chunk size defined elsewhere
do {
    // rawRead returns the slice of `buffer` that was actually filled
    buffer = input.rawRead(buffer);
} while (buffer.length);
//----
This takes 60 seconds too.
//----
Parser parser = new Parser(args[1]);
foreach (q; parser)
{
    foreach (char c; q.sequence)
        globalNucleic.collect(c); // count each nucleotide of the record's sequence
}
//----
So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing.
I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now.
I'm also implementing correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file and placed inside the exception. This means that (see the sketch after this list):
a) you can inspect the erroneous data;
b) you can skip the erroneous data and parse the rest of the file.
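A minimal, hypothetical sketch of what that catch-and-continue usage could look like; FastqParseException, erroneousData, and the range interface are assumed names here, not the actual API:

//----
import std.stdio;

// Hypothetical exception type: carries the bad record's bytes so the caller
// can inspect them and keep going.
class FastqParseException : Exception
{
    const(ubyte)[] erroneousData;

    this(string msg, const(ubyte)[] data,
         string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
        erroneousData = data;
    }
}

// Hypothetical consumer: the parser is assumed to have already consumed the
// bad record before throwing, so parsing simply continues afterwards.
void consumeAll(Parser)(Parser parser)
{
    while (!parser.empty)
    {
        try
        {
            auto q = parser.front;
            // ... use the record ...
            parser.popFront();
        }
        catch (FastqParseException e)
        {
            stderr.writefln("skipped bad record: %s",
                            cast(const(char)[]) e.erroneousData);
        }
    }
}
//----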
Once I deliver the code with multi-threading activated, you should get better performance on Linux.
When "1.0" is ready, I'll create a GitHub project for it, so work can be done on it in parallel.
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra wrote:
> I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file while the main thread parses the read buffers.
> [...]
> Once I deliver the code with multi-threading activated, you should get better performance on Linux.
About the threaded version: it should be possible to get the file size and split the file across several threads.
Use fseek to jump near each split point, then read to the end of the current record and use that position as the split boundary.
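Roughly, a sketch of that idea in D (illustrative only, not the actual parser code): compute one byte range per worker from the file size, then move each boundary forward to the start of the next line beginning with '@'. Note that '@' can also start a quality line in FASTQ, so a real splitter would need an extra sanity check on the following lines.

//----
import std.stdio;

struct Chunk { ulong begin, end; }

// Split the file at `path` into nWorkers byte ranges whose boundaries fall at
// the start of a line beginning with '@' (a candidate record header).
Chunk[] splitFile(string path, size_t nWorkers)
{
    auto f = File(path, "rb");
    immutable fileSize = f.size;
    auto chunks = new Chunk[](nWorkers);

    ulong begin = 0;
    foreach (i; 0 .. nWorkers)
    {
        ulong end = fileSize * (i + 1) / nWorkers;
        if (i + 1 < nWorkers)
        {
            f.seek(cast(long) end);
            f.readln();                    // discard the (possibly partial) current line
            ulong boundary = fileSize;     // fallback: no further header found
            char[] line;
            while (f.readln(line) > 0)
            {
                if (line.length && line[0] == '@')
                {
                    boundary = f.tell() - line.length;  // start of the header line
                    break;
                }
            }
            end = boundary;
        }
        chunks[i] = Chunk(begin, end);
        begin = end;
    }
    return chunks;
}

void main(string[] args)
{
    foreach (c; splitFile(args[1], 4))
        writeln(c.begin, " .. ", c.end);
}
//----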
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On Tuesday, 12 February 2013 at 16:28:09 UTC, bioinfornatics wrote:
> About the threaded version: it should be possible to get the file size and split the file across several threads.
> Use fseek to jump near each split point, then read to the end of the current record and use that position as the split boundary.
You'd want to have 2 threads reading the same file at once? I don't think there is much more to be gained, since the I/O is the bottleneck anyway.
A better approach would be to have one file reader that passes data to two simultaneous parsers. This, however, would make things scarily complicated, and I doubt we'd get much better results: I was not able to measure any actual time spent working compared to the time spent reading the file.
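For what it's worth, a minimal sketch of that dispatch idea using std.concurrency (illustrative only: it hands raw chunks round-robin to two parser threads and skips the real complication, which is cutting the chunks on record boundaries):

//----
import std.concurrency;
import std.stdio;

// Parser worker: receives immutable chunks until it is told to stop.
void parserWorker()
{
    bool done = false;
    while (!done)
    {
        receive(
            (immutable(ubyte)[] chunk) { /* parse the chunk here */ },
            (bool stop) { done = stop; }
        );
    }
}

void main(string[] args)
{
    auto workers = [spawn(&parserWorker), spawn(&parserWorker)];

    // Single reader: dispatch chunks round-robin to the two parsers.
    size_t next = 0;
    foreach (chunk; File(args[1], "rb").byChunk(64 * 1024))
    {
        workers[next].send(chunk.idup);    // idup: byChunk reuses its buffer
        next = (next + 1) % workers.length;
    }

    foreach (w; workers)
        w.send(true);                      // tell both parsers to stop
}
//----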
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On 2013-02-12 17:28, bioinfornatics wrote:
> About the threaded version: it should be possible to get the file size and split the file across several threads.
> Use fseek to jump near each split point, then read to the end of the current record and use that position as the split boundary.
Yes, but as already mentioned before, it only works well for SSDs.
For normal hard drives you'd want the data stored and accessed in sequence, without jumping between cylinders whenever you switch threads.
Do you store your data on an SSD?
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

On 2013-02-12 17:45, monarch_dodra wrote:
> A better approach would be to have one file reader that passes data to two simultaneous parsers. This, however, would make things scarily complicated, and I doubt we'd get much better results: I was not able to measure any actual time spent working compared to the time spent reading the file.
Best to keep things simple when the potential benefits aren't certain. :)
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to FG

Sometimes FASTQ files are compressed to gz, bz2, or xz, as they are often huge.
Maybe we need to keep this in mind early in development and use std.zlib.
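For the gz case, a rough sketch (illustrative only) of exposing a .gz file as a range of decompressed chunks with std.zlib; a complete version would also call flush() on the UnCompress object once the input is exhausted:

//----
import std.algorithm : map;
import std.stdio : File, writeln;
import std.zlib : HeaderFormat, UnCompress;

// Wrap a .gz file as an input range of decompressed chunks.
auto gzipChunks(string path, size_t chunkSize = 64 * 1024)
{
    auto inflator = new UnCompress(HeaderFormat.gzip);
    return File(path, "rb")
        .byChunk(chunkSize)
        .map!(chunk => cast(const(ubyte)[]) inflator.uncompress(chunk.dup)); // dup: byChunk reuses its buffer
}

void main(string[] args)
{
    size_t total;
    foreach (chunk; gzipChunks(args[1]))
        total += chunk.length;
    writeln(total, " decompressed bytes");
}
//----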
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:
> Sometimes FASTQ files are compressed to gz, bz2, or xz, as they are often huge.
> Maybe we need to keep this in mind early in development and use std.zlib.
While working on making the parser multi-thread compatible, I was able to separate the part that feeds data from the part that parses data.
Long story short, the parser operates on an input range of ubyte[]: it is no longer responsible for acquiring data.
The range can be a simple (wrapped) File, a byChunk, an asynchronous file reader, a zip decompressor, or just stdin, I guess. The range can be transient.
However, now that you mention it, I'll make sure it is correctly supported.
I'll *try* to show you what I have so far tomorrow (in about 18 hours).
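To make that design concrete, a hypothetical sketch (FastqParser and fastqParser are made-up names, not the actual code): the parser is templated on any input range of ubyte[] chunks, so the data source is interchangeable.

//----
import std.range : ElementType, isInputRange;
import std.stdio : File, stdin;

// The parser only sees chunks of bytes; where they come from is not its problem.
struct FastqParser(Source)
    if (isInputRange!Source && is(ElementType!Source : const(ubyte)[]))
{
    Source chunks;
    // ... parsing state and the usual range primitives would go here ...
}

auto fastqParser(Source)(Source chunks)
{
    return FastqParser!Source(chunks);
}

void main(string[] args)
{
    // A plain file read in 64 KiB chunks...
    auto fromFile = fastqParser(File(args[1], "rb").byChunk(64 * 1024));

    // ...or standard input, or a decompressor range, without touching the parser.
    auto fromStdin = fastqParser(stdin.byChunk(64 * 1024));
}
//----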
February 13, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:
> [...]
> I'll *try* to show you what I have so far tomorrow (in about 18 hours).
Yeah... I played around too much, and the file is dirtier than ever.
The good news is that I was able to test out what I was telling you about: accepting any range is OK.
I used your ZFile range to plug into my parser: I can now parse zipped files directly.
The other good news is that I'm no longer bottlenecked by I/O! The bad news is that I'm now bottlenecked by CPU decompression. But since I'm using dmd, you may get better results with LDC or GDC.
In any case, I am now parsing the 6 Gig file packed into 1.5 Gig in about 53 seconds (down from 61). I also tried a dual-threaded approach (one thread to unzip, one thread to parse), but again, the actual *parse* phase is so ridiculously fast that it changes *nothing* in the final result.
Long story short: 99% of the time is spent acquiring data. The last 1% is just copying it into local buffers.
The last piece of good news, though, is that a CPU bottleneck is always better than an I/O bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads) and process several files at once, multiplying your throughput.