February 09, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

Some ideas, such as:
- letter counting
- renaming identifiers
- trimming sequences using a quality-value cutoff
- converting to a binary format
More ideas later.
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to FG

Instead of using memcpy, I tried slicing at around line 136:

_hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];

I get the same performance.
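For readers following along, here is a tiny, self-contained illustration of that buffer-compaction idiom (the values are made up; the names mirror the post):

//----
import std.stdio;

void main()
{
    // Names mirror the post: a read buffer where only the tail is unconsumed.
    auto _hardBuffer = new ubyte[](16);
    foreach (i, ref b; _hardBuffer)
        b = cast(ubyte) i;

    size_t _bufPosition = 12;                          // 4 unconsumed bytes at the end
    size_t moveSize = _hardBuffer.length - _bufPosition;

    // Slice copy: same effect as the memcpy it replaces, as long as the
    // source and destination regions don't overlap (D raises an error on
    // overlapping array copies).
    _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];

    writeln(_hardBuffer[0 .. moveSize]);               // [12, 13, 14, 15]
}
//----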
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics wrote:
> Instead of using memcpy, I tried slicing at around line 136:
> _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];
>
> I get the same performance.
I think I figured out why I'm getting different results than you guys are, on my Windows machine.
AFAIK, file reads on Windows are natively asynchronous.
I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file while the main thread parses the read buffers.
I'm getting EXACTLY 0% performance improvement. Not better, not worse, just 0%.
I'd have to try again on my SSD. Right now, I'm parsing the 6 Gig file in 60 seconds, which is the limit of my HDD. As a matter of fact, just *reading* the file takes the EXACT same amount of time as parsing it...
This takes 60 seconds.
//----
import std.stdio;

auto input = File(args[1], "rb");
ubyte[] buffer = new ubyte[](BufferSize); // BufferSize: chunk size defined elsewhere
do {
    // rawRead returns the slice of `buffer` that was actually filled
    buffer = input.rawRead(buffer);
} while (buffer.length);
//----
This takes 60 seconds too.
//----
Parser parser = new Parser(args[1]);
foreach (q; parser)
{
    foreach (char c; q.sequence)
        globalNucleic.collect(c); // count each nucleotide of the record's sequence
}
//----
So at this point, I'd need to test on my Linux box, or publish the code so you can tell me how I'm doing.
I'm still tweaking the code to publish something readable, as there is a lot of sketchy code right now.
I'm also implementing correct exception handling, so that if there is an erroneous entry, an exception is thrown. However, all the erroneous data is parsed out of the file and placed inside the exception. This means that (see the sketch after this list):
a) you can inspect the erroneous data;
b) you can skip the erroneous data and parse the rest of the file.
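A minimal, hypothetical sketch of what that catch-and-continue usage could look like; FastqParseException, erroneousData, and the range interface are assumed names here, not the actual API:

//----
import std.stdio;

// Hypothetical exception type: carries the bad record's bytes so the caller
// can inspect them and keep going.
class FastqParseException : Exception
{
    const(ubyte)[] erroneousData;

    this(string msg, const(ubyte)[] data,
         string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
        erroneousData = data;
    }
}

// Hypothetical consumer: the parser is assumed to have already consumed the
// bad record before throwing, so parsing simply continues afterwards.
void consumeAll(Parser)(Parser parser)
{
    while (!parser.empty)
    {
        try
        {
            auto q = parser.front;
            // ... use the record ...
            parser.popFront();
        }
        catch (FastqParseException e)
        {
            stderr.writefln("skipped bad record: %s",
                            cast(const(char)[]) e.erroneousData);
        }
    }
}
//----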
Once I deliver the code with multi-threading activated, you should get better performance on Linux.
When "1.0" is ready, I'll create a GitHub project for it, so work can be done on it in parallel.
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra wrote:
> I wrote a multi-threaded version of the parser, with a thread dedicated to reading the file while the main thread parses the read buffers.
> [...]
> Once I deliver the code with multi-threading activated, you should get better performance on Linux.
About the threaded version: it should be possible to get the file size and split the file across several threads.
Use fseek to jump near each split point, then read to the end of the current record and use that position as the split boundary.
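Roughly, a sketch of that idea in D (illustrative only, not the actual parser code): compute one byte range per worker from the file size, then move each boundary forward to the start of the next line beginning with '@'. Note that '@' can also start a quality line in FASTQ, so a real splitter would need an extra sanity check on the following lines.

//----
import std.stdio;

struct Chunk { ulong begin, end; }

// Split the file at `path` into nWorkers byte ranges whose boundaries fall at
// the start of a line beginning with '@' (a candidate record header).
Chunk[] splitFile(string path, size_t nWorkers)
{
    auto f = File(path, "rb");
    immutable fileSize = f.size;
    auto chunks = new Chunk[](nWorkers);

    ulong begin = 0;
    foreach (i; 0 .. nWorkers)
    {
        ulong end = fileSize * (i + 1) / nWorkers;
        if (i + 1 < nWorkers)
        {
            f.seek(cast(long) end);
            f.readln();                    // discard the (possibly partial) current line
            ulong boundary = fileSize;     // fallback: no further header found
            char[] line;
            while (f.readln(line) > 0)
            {
                if (line.length && line[0] == '@')
                {
                    boundary = f.tell() - line.length;  // start of the header line
                    break;
                }
            }
            end = boundary;
        }
        chunks[i] = Chunk(begin, end);
        begin = end;
    }
    return chunks;
}

void main(string[] args)
{
    foreach (c; splitFile(args[1], 4))
        writeln(c.begin, " .. ", c.end);
}
//----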
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On Tuesday, 12 February 2013 at 16:28:09 UTC, bioinfornatics wrote:
> About the threaded version: it should be possible to get the file size and split the file across several threads.
> Use fseek to jump near each split point, then read to the end of the current record and use that position as the split boundary.
You'd want to have 2 threads reading the same file at once? I don't think there is much more to be gained, since the I/O is the bottleneck anyway.
A better approach would be to have one file reader that passes data to two simultaneous parsers. This, however, would make things scarily complicated, and I doubt we'd get much better results: I was not able to measure any actual time spent working compared to the time spent reading the file.
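For what it's worth, a minimal sketch of that dispatch idea using std.concurrency (illustrative only: it hands raw chunks round-robin to two parser threads and skips the real complication, which is cutting the chunks on record boundaries):

//----
import std.concurrency;
import std.stdio;

// Parser worker: receives immutable chunks until it is told to stop.
void parserWorker()
{
    bool done = false;
    while (!done)
    {
        receive(
            (immutable(ubyte)[] chunk) { /* parse the chunk here */ },
            (bool stop) { done = stop; }
        );
    }
}

void main(string[] args)
{
    auto workers = [spawn(&parserWorker), spawn(&parserWorker)];

    // Single reader: dispatch chunks round-robin to the two parsers.
    size_t next = 0;
    foreach (chunk; File(args[1], "rb").byChunk(64 * 1024))
    {
        workers[next].send(chunk.idup);    // idup: byChunk reuses its buffer
        next = (next + 1) % workers.length;
    }

    foreach (w; workers)
        w.send(true);                      // tell both parsers to stop
}
//----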
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On 2013-02-12 17:28, bioinfornatics wrote:
> About the threaded version: it should be possible to get the file size and split the file across several threads.
> Use fseek to jump near each split point, then read to the end of the current record and use that position as the split boundary.
Yes, but as already mentioned before, it only works well for SSDs.
For normal hard drives you'd want the data stored and accessed in sequence, without jumping between cylinders whenever you switch threads.
Do you store your data on an SSD?
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

On 2013-02-12 17:45, monarch_dodra wrote:
> A better approach would be to have one file reader that passes data to two simultaneous parsers. This, however, would make things scarily complicated, and I doubt we'd get much better results: I was not able to measure any actual time spent working compared to the time spent reading the file.
Best to keep things simple when the potential benefits aren't certain. :)
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to FG

Sometimes FASTQ files are compressed to gz, bz2, or xz, as they are often huge.
Maybe we need to keep this in mind early in development and use std.zlib.
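For the gz case, a rough sketch (illustrative only) of exposing a .gz file as a range of decompressed chunks with std.zlib; a complete version would also call flush() on the UnCompress object once the input is exhausted:

//----
import std.algorithm : map;
import std.stdio : File, writeln;
import std.zlib : HeaderFormat, UnCompress;

// Wrap a .gz file as an input range of decompressed chunks.
auto gzipChunks(string path, size_t chunkSize = 64 * 1024)
{
    auto inflator = new UnCompress(HeaderFormat.gzip);
    return File(path, "rb")
        .byChunk(chunkSize)
        .map!(chunk => cast(const(ubyte)[]) inflator.uncompress(chunk.dup)); // dup: byChunk reuses its buffer
}

void main(string[] args)
{
    size_t total;
    foreach (chunk; gzipChunks(args[1]))
        total += chunk.length;
    writeln(total, " decompressed bytes");
}
//----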
February 12, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:
> Sometimes FASTQ files are compressed to gz, bz2, or xz, as they are often huge.
> Maybe we need to keep this in mind early in development and use std.zlib.
While working on making the parser multi-thread compatible, I was able to separate the part that feeds data from the part that parses data.
Long story short, the parser operates on an input range of ubyte[]: it is no longer responsible for acquiring data.
The range can be a simple (wrapped) File, a byChunk, an asynchronous file reader, a zip decompressor, or just stdin, I guess. The range can be transient.
However, now that you mention it, I'll make sure it is correctly supported.
I'll *try* to show you what I have so far tomorrow (in about 18 hours).
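To make that design concrete, a hypothetical sketch (FastqParser and fastqParser are made-up names, not the actual code): the parser is templated on any input range of ubyte[] chunks, so the data source is interchangeable.

//----
import std.range : ElementType, isInputRange;
import std.stdio : File, stdin;

// The parser only sees chunks of bytes; where they come from is not its problem.
struct FastqParser(Source)
    if (isInputRange!Source && is(ElementType!Source : const(ubyte)[]))
{
    Source chunks;
    // ... parsing state and the usual range primitives would go here ...
}

auto fastqParser(Source)(Source chunks)
{
    return FastqParser!Source(chunks);
}

void main(string[] args)
{
    // A plain file read in 64 KiB chunks...
    auto fromFile = fastqParser(File(args[1], "rb").byChunk(64 * 1024));

    // ...or standard input, or a decompressor range, without touching the parser.
    auto fromStdin = fastqParser(stdin.byChunk(64 * 1024));
}
//----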
February 13, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:
> [...]
> I'll *try* to show you what I have so far tomorrow (in about 18 hours).
Yeah... I played around too much, and the file is dirtier than ever.
The good news is that I was able to test out what I was telling you about: accepting any range is OK.
I used your ZFile range to plug into my parser: I can now parse zipped files directly.
The other good news is that I'm no longer bottlenecked by I/O! The bad news is that I'm now bottlenecked by CPU decompression. But since I'm using dmd, you may get better results with LDC or GDC.
In any case, I am now parsing the 6 Gig file packed into 1.5 Gig in about 53 seconds (down from 61). I also tried a dual-threaded approach (one thread to unzip, one thread to parse), but again, the actual *parse* phase is so ridiculously fast that it changes *nothing* in the final result.
Long story short: 99% of the time is spent acquiring data. The last 1% is just copying it into local buffers.
The last piece of good news, though, is that a CPU bottleneck is always better than an I/O bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads) and process several files at once, multiplying your throughput.