February 06, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On 06/02/13 22:21, bioinfornatics wrote:
> Thanks monarch and FG,
> i will read your code to see where i failing :-)

I wasn't going to mention this as I thought the CPU usage might be trivial, but if both CPU and IO are factors, then it would probably be beneficial to have a separate IO thread/task. I guess you'd need a big task: the task would need to load and return n chunks or n lines, rather than just one line at a time, for example, and the processing/parsing thread (main thread or otherwise) could then churn through that while more IO was done.

It would also depend on the size of the file: no point firing up a thread just to read a tiny file that the filesystem can return in a millisecond. If you're talking about 1+ minutes of loading though, a thread should definitely help.

Also, if you don't strictly need to parse the file in order, then you could divide and conquer it by breaking it into more sections/tasks. For example, if you're parsing records, you could split the file in half, find the remaining parts of the record in the second half, move it to the first, and then process the two halves in two threads. If you've a nice function to do that split cleanly, and n cpus, then just call it some more.

-- Lee
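The separate-IO-thread idea described above can be sketched in D with std.concurrency. This is a hedged illustration only: `readerThread`, `consumeAll`, and the 1 MiB chunk size are made up for the example, not taken from any code posted in this thread.

```d
import std.concurrency : receive, send, spawn, thisTid, Tid;
import std.stdio : File;

// Reader thread: streams fixed-size chunks to the owner's mailbox,
// then sends a bool as an end-of-file marker.
void readerThread(string path, Tid owner)
{
    auto f = File(path, "rb");
    foreach (chunk; f.byChunk(1 << 20))   // 1 MiB per message
        send(owner, chunk.idup);          // immutable copy crosses threads
    send(owner, true);
}

// Consumer: here it only totals bytes, but a real parser would churn
// through each chunk while the reader fetches the next one.
ulong consumeAll(string path)
{
    spawn(&readerThread, path, thisTid);
    ulong total;
    for (bool done = false; !done;)
        receive(
            (immutable(ubyte)[] chunk) { total += chunk.length; },
            (bool b) { done = true; }
        );
    return total;
}
```

The payoff only shows up when parsing a chunk takes comparable time to reading the next one; for pure IO-bound work the extra thread just adds copying.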
February 07, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to Lee Braiden

On 2013-02-07 00:41, Lee Braiden wrote:
> I wasn't going to mention this as I thought the CPU usage might be trivial, but
> if both CPU and IO are factors, then it would probably be beneficial to have a
> separate IO thread/task.

This wasn't an issue in my version of the program. It took 1m55s to process the file, but then again it takes 1m44s just to read it (as shown previously).

> Also, if you don't strictly need to parse the file in order, then you could
> divide and conquer it by breaking it into more sections/tasks. For example, if
> you're parsing records, you could split the file in half, find the remaining
> parts of the record in the second half, move it to the first, and then process
> the two halves in two threads. If you've a nice function to do that split
> cleanly, and n cpus, then just call it some more.

Now, this could make a big difference, provided out-of-order parsing is acceptable in this case.
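The split-in-half approach discussed above could start from something like this sketch: seek to the midpoint, then scan forward for a "\n@" pair as a candidate record start. This is illustrative only (`splitPoint` is a made-up name), and finding a FASTQ boundary robustly is harder than shown, since '@' can also open a quality line.

```d
import std.file : getSize;
import std.stdio : File;

// Returns a byte offset near the middle of the file that falls on a
// candidate record start ('@' right after a newline). Assumes the
// plain four-line FASTQ layout; a quality line starting with '@'
// would fool this check.
ulong splitPoint(string path)
{
    auto f = File(path, "rb");
    immutable mid = getSize(path) / 2;
    f.seek(mid);
    ubyte[64 * 1024] buf;
    auto got = f.rawRead(buf[]);
    foreach (i; 1 .. got.length)
        if (got[i - 1] == '\n' && got[i] == '@')
            return mid + i;   // candidate record boundary
    return mid;               // fallback: no boundary in the window
}
```

Each thread then parses its own [start, end) slice; only the boundary record needs special handling.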
February 07, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to FG

On Wednesday, 6 February 2013 at 22:55:14 UTC, FG wrote:
> On 2013-02-06 21:43, monarch_dodra wrote:
>> On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
>>> I have processed the file SRR077487_1.filt.fastq from
>>> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
>>> and expect this syntax (no multiline sequences or whitespace).
>>> File takes up almost 6 GB processing took 1m45s - twice as fast as the
>>> fastest D solution so far
>>
>> Do you mean my solution above? I tried your solution with dmd, with -release -O
>> -inline, and both gave about the same result (69s yours, 67s mine).
>
> Yes. Maybe CPU is the bottleneck on my end.
> With DMD32 2.060 on win7-64 compiled with same flags I got:
> MD: 4m30 / FG: 1m55s - both using 100% of one core.
> Quite similar results with GDC64.
>
> You have timed the same file SRR077487_1.filt.fastq at 67s?
Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO-bound than you are?
My attempt was mostly to try and see how fast we could go while doing it only with high-level stuff (e.g., no fSomething calls).
Probably, going lower level and parsing the text manually, waiting for magic characters, could yield better results (like what you did).
I'm also going to try playing around with threads: just last week I wrote a program that did exactly this (asynchronous file reads).
That said, I'll be making this priority n°2. I'd like to make the parser work perfectly first, and in a way that is easily upgradeable/usable. Mr. bio made it perfectly clear that he needed support for whitespace and line feeds ;)
February 07, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

On 2013-02-07 08:26, monarch_dodra wrote:
>> You have timed the same file SRR077487_1.filt.fastq at 67s?
>
> Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO
> bound than you are?
Ah, now that you mention SSD, I moved the file onto one and it's even more
clear that I am CPU-bound here on the Intel E6600 system. Compare:
7200rpm: MS 4m30s / FG 1m55s
SSD: MS 4m14s / FG 1m44s
Almost the same, but running the utility "wc -l" on the file renders:
7200rpm: 1m45s
SSD: 0m33s
In my case threads would be beneficial but only when using the SSD.
Reading the file by chunk in D takes 33s on SSD and 1m44s on HDD.
Slicing the file in half and reading from both threads would also
be fine only on the SSD, because on a HDD I'd lose sequential disk
reads jumping between threads (expecting lower performance).
Therefore - threads: yes, but gotta use an SSD. :)
Also, threads: yes, if there's gonna be more processing than just
counting letters.
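The chunked read timed above pairs naturally with the letter counting this thread is about. A minimal sketch (`countBytes` is an illustrative name; it tallies raw bytes with no FASTQ record handling at all):

```d
import std.stdio : File;

// Tally every byte value in the file, 64 KiB at a time. Newlines,
// '+' separators, and quality characters all land in the table too;
// the caller just reads out the letters it cares about.
ulong[256] countBytes(string path)
{
    ulong[256] counts;
    auto f = File(path, "rb");
    foreach (ubyte[] chunk; f.byChunk(64 * 1024))
        foreach (b; chunk)
            ++counts[b];
    return counts;
}
```

A flat 256-entry table keeps the inner loop branch-free, which matters when the loop runs six billion times.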
February 07, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to FG

Little feedback: I named FG's script f and monarch's script monarch.

gdmd -O -w -release f.d
~ $ time ./f bigFastq.fastq
['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772]

real 2m14.966s
user 0m47.168s
sys 0m15.379s

~ $ gdmd -O -w -release monarch.d
monarch.d:117: no identifier for declarator Lines
monarch.d:117: alias cannot have initializer
monarch.d:130: identifier or integer expected, not assert

I haven't taken the time to look further, but in any case it seems the memory-mapped file is really slow, whereas it is said to be the fastest way to read a file. Creating an index where just reading the file needs 12 min is useless, as reading and computing only need 2 min.
February 07, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics wrote:
> Little feed back
> i named f the f's script and monarch the monarch's script
>
> gdmd -O -w -release f.d
> ~ $ time ./f bigFastq.fastq
> ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772]
>
> real 2m14.966s
> user 0m47.168s
> sys 0m15.379s
> ~ $ gdmd -O -w -release monarch.d
> monarch.d:117: no identifier for declarator Lines
> monarch.d:117: alias cannot have initializer
> monarch.d:130: identifier or integer expected, not assert
>
>
> i haven't take the time to look more
>
> but in any case it seem memory mapped file is really slowly whereas it is said that is the faster way to read file. Create an index where reading the file need 12 min that is useless as for read and compute you need 2 min
You must be using dmd 2.060. I'm using some 2.061 features, namely the "new style alias" syntax.
Just change line 117:
alias Lines = typeof(File.init.byLine());
to
alias typeof(File.init.byLine()) Lines;
As for line 130, it's a "version(assert)", i.e. code that does not get executed in -release builds. Just remove the "version(assert)"; if the code inside gets executed, it is not a big deal.
In any case, I think the code is mostly "proof", I wouldn't use it as is.
------------
BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integrals (ubytes)?
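For context on the quality question: Sanger-style FASTQ encodes Phred scores as ASCII characters with an offset of 33, so serving ubytes is a single subtraction away from the raw chars. A hedged sketch (`toPhred` is a made-up name, and the default offset of 33 assumes the Sanger variant; older Illumina files use 64):

```d
import std.algorithm : map;
import std.array : array;

// Decode an ASCII quality string into numeric Phred scores by
// subtracting the encoding offset from each character.
ubyte[] toPhred(const(char)[] quality, ubyte offset = 33)
{
    return quality.map!(c => cast(ubyte)(c - offset)).array;
}
```

Serving ubytes lets users compare against a numeric cutoff directly, while keeping the char view trivially recoverable.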
February 07, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

On Thursday, 7 February 2013 at 14:42:57 UTC, monarch_dodra wrote:
> You must be using dmd 2.060. I'm using some 2.061 features: Namelly "new style alias".
>
> Just change line 117:
> alias Lines = typeof(File.init.byLine());
> to
> alias typeof(File.init.byLine()) Lines;
>
> As for 130, it's a "version(assert)" eg, code that does not get executed in release. Just remove the "version(assert)", if it gets executed, it is not a big deal.
>
> In any case, I think the code is mostly "proof", I wouldn't use it as is.
>
> ------------
>
> BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integrals (ubytes)?
ubyte: as it is a number, it is maybe easier to understand and to cut off at some value.
February 08, 2013 [OT] Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On 06.02.2013 19:40, bioinfornatics wrote:
> On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote:
> I agree the spec format is really bad but it is heavily used in biology
> so i would like a fast parser to develop some D application instead to
> use C++.

Yes, let's also create 1 GiB XML files and ask for fast encoding/decoding! The situation can be improved only if:
1. We find and kill every text format creator;
2. We create a really good binary format for each such task and support it in every application we create.

So after some time, text formats will just die out through evolution, as everything will support better formats. (The second proposal is a real recommendation.)

-- Денис В. Шеломовский
Denis V. Shelomovskij
February 08, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to monarch_dodra

And use size_t instead of int as the return type for the getChar/getInt methods.

gdmd -w -O -release monarch.d
~ $ time ./monarch /env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
globalStats:
A: 1007129068. C: 1350576504. G: 1353023772. M: 0. D: 0. S: 0. H: 0. N: 39413. V: 0. U: 0. W: 0. R: 0. B: 0. Y: 0. K: 0. T: 999786820.
time: 176585

real 2m56.635s
user 2m31.376s
sys 0m23.077s

This program is a little less fast than f's program.

About the parser: I would like to create a set of biology parsers and put them into a lib, along with a set of common computations such as a letter counter. For example, you could run a letter-counting computation through a fasta or fastq file, or rename identifiers through a fasta or fastq file.
February 08, 2013 Re: How to read fastly files ( I/O operation)
Posted in reply to bioinfornatics

On Friday, 8 February 2013 at 09:08:48 UTC, bioinfornatics wrote:
> And use size_t instead to int for getChar/getInt method as type returned
> [...]
> this program is little less fast than f's program

I've re-tried running both mine and FG's on a HDD-based machine, with dmd, -O -release, plus optional -inline.

I also wrote a new parser, which does as FG suggested, and just parses straight up (byLine is indeed more expensive). This one handles whitespace and line breaks correctly. It also accepts lines of any size (the internal buffer is auto-grow).

My results are different from yours though:

           w/o inline   w/ inline
    FG     105s         77s
    MD     72s          64s
    newMD  61s          59s

I have no idea why you guys are getting better results with FG's and I'm getting better results with mine. Is this a win/linux or dmd/gdc issue? My new parser is based on raw reads, so that should be much faster on your machines.

> about parser I would like create a set a biology parser and put into a lib with a set of common compute as letter counter.
> By example you could run a letter counter compute throw a fata or fastq file.
> rename identifier thwow a fata or fastq file.

I don't really understand what all that means. In any case, I've been able to implement some cool features so far. My parser is a "true" range you can pass around, and you won't have any problems with it. It returns "shallow" objects that reference a mutable string; however, the user can call "dup" or "idup" to have a new object. Said objects can be printed directly, so there is no need for a specialized "writer".

As a matter of fact, this little program will allow you to "clean" a file (strip spaces), and potentially line-wrap at 80 chars:

//----
import std.stdio;
import fastq.parser;
import fastq.q;

void main(string[] args)
{
    Parser parser = new Parser(args[1]);
    File output = File(args[2], "wb");
    foreach (entry; parser)
        output.writefln("%80s", entry);
}
//----

I'll submit it for your review, once it is perfectly implemented.
Copyright © 1999-2021 by the D Language Foundation