February 06, 2013
On 06/02/13 22:21, bioinfornatics wrote:
> Thanks monarch and FG,
> i will read your code to see where i failing :-)

I wasn't going to mention this as I thought the CPU usage might be trivial, but if both CPU and IO are factors, then it would probably be beneficial to have a separate IO thread/task.

I guess you'd need a big task: the task would need to load and return n chunks or n lines, rather than just one line at a time, for example, and the processing/parsing thread (main thread or otherwise) could then churn through that while more IO was done.

It would also depend on the size of the file: no point firing up a thread just to read a tiny file that the filesystem can return in a millisecond.  If you're talking about 1+ minutes of loading though, a thread should definitely help.
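Something along those lines, perhaps; a minimal sketch with std.concurrency, where the chunk size and the bounded mailbox are my own choices, not anything from the programs in this thread:

//----
import std.concurrency;
import std.stdio;

// IO thread: read the file in big chunks and hand them to the owner.
void reader(Tid owner, string path)
{
    enum chunkSize = 64 * 1024 * 1024; // assumption: tune to taste
    foreach (chunk; File(path, "rb").byChunk(chunkSize))
        send(owner, chunk.idup);       // byChunk reuses its buffer, so copy
    send(owner, true);                 // end-of-file marker
}

void main(string[] args)
{
    // bound the mailbox so the reader can't race ahead and eat all memory
    setMaxMailboxSize(thisTid, 4, OnCrowding.block);
    spawn(&reader, thisTid, args[1]);

    size_t total;
    for (bool done; !done;)
        receive(
            (immutable(ubyte)[] chunk) { total += chunk.length; /* parse here */ },
            (bool eof) { done = true; }
        );
    writeln(total, " bytes processed");
}
//----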

Also, if you don't strictly need to parse the file in order, then you could divide and conquer it by breaking it into more sections/tasks. For example, if you're parsing records, you could split the file in half, find the remaining parts of the record in the second half, move them to the first, and then process the two halves in two threads, as in the sketch below.  If you have a nice function to do that split cleanly, and n CPUs, then just call it some more.
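A minimal sketch of that idea, splitting at the next newline rather than doing the full record-boundary surgery (countLetters is just a stand-in worker):

//----
import std.algorithm.searching : countUntil;
import std.file : read;
import std.parallelism : task;
import std.stdio;

// stand-in worker: count the ACGT letters in one chunk
size_t countLetters(const(ubyte)[] data)
{
    size_t n;
    foreach (b; data)
        if (b == 'A' || b == 'C' || b == 'G' || b == 'T')
            ++n;
    return n;
}

void main(string[] args)
{
    auto data = cast(const(ubyte)[]) read(args[1]);
    // split near the middle, then move forward to the next newline
    // (a real FASTQ split must find a record boundary, not just a line)
    auto mid = data.length / 2;
    mid += data[mid .. $].countUntil('\n') + 1;

    auto first = task!countLetters(data[0 .. mid]); // first half in a worker
    first.executeInNewThread();
    auto second = countLetters(data[mid .. $]);     // second half here
    writeln(first.yieldForce + second);
}
//----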



-- 
Lee

February 07, 2013
On 2013-02-07 00:41, Lee Braiden wrote:
> I wasn't going to mention this as I thought the CPU usage might be trivial, but
> if both CPU and IO are factors, then it would probably be beneficial to have a
> separate IO thread/task.

This wasn't an issue in my version of the program. It took 1m55s to process the
file, but then again it takes 1m44s just to read it (as shown previously).

> Also, if you don't strictly need to parse the file in order, then you could
> divide and conquer it by breaking it into more sections/tasks. For example, if
> you're parsing records, you could split the file in half, find the remaining
> parts of the record in the second half, move them to the first, and then process
> the two halves in two threads.  If you have a nice function to do that split
> cleanly, and n CPUs, then just call it some more.

Now, this could make a big difference!
Provided that parsing out of order is acceptable in this case.

February 07, 2013
On Wednesday, 6 February 2013 at 22:55:14 UTC, FG wrote:
> On 2013-02-06 21:43, monarch_dodra wrote:
>> On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
>>> I have processed the file SRR077487_1.filt.fastq from
>>> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
>>> and expect this syntax (no multi-line sequences or whitespace).
>>> The file takes up almost 6 GB; processing took 1m45s - twice as fast as
>>> the fastest D solution so far.
>>
>> Do you mean my solution above? I tried your solution with dmd, with -release -O
>> -inline, and both gave about the same result (69s yours, 67s mine).
>
> Yes. Maybe CPU is the bottleneck on my end.
> With DMD32 2.060 on win7-64 compiled with same flags I got:
> MD: 4m30s / FG: 1m55s - both using 100% of one core.
> Quite similar results with GDC64.
>
> You have timed the same file SRR077487_1.filt.fastq at 67s?

Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO bound than you are?

My attempt was mostly to try and see how fast we could go while doing it only with high-level stuff (e.g., no fSomething calls).

Going lower level and parsing the text manually, waiting for magic characters, could probably yield better results (like what you did).
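Something like this, maybe (a minimal sketch that scans for the newline bytes itself, assuming the strict 4-lines-per-record layout and a trailing newline):

//----
import std.file : read;
import std.stdio;

void main(string[] args)
{
    auto data = cast(const(ubyte)[]) read(args[1]);
    size_t[char] counts;
    size_t lineNo, start;
    foreach (i, b; data)
    {
        if (b != '\n') continue;
        if (lineNo % 4 == 1)                  // the sequence line of a record
            foreach (c; data[start .. i])
                ++counts[cast(char) c];
        start = i + 1;
        ++lineNo;
    }
    writeln(counts);
}
//----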

I'm also going to try playing around with threads: just last week I wrote a program that did exactly this (asynchronous file reads).

That said, I'll be making this priority n°2. I'd like to make the parser work perfectly first, and in a way that is easily upgradeable/usable. Mr. bio made it perfectly clear that he needed support for whitespace and line feeds ;)
February 07, 2013
On 2013-02-07 08:26, monarch_dodra wrote:
>> You have timed the same file SRR077487_1.filt.fastq at 67s?
>
> Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO
> bound than you are?

Ah, now that you mention SSD, I moved the file onto one and it's even more
clear that I am CPU-bound here on the Intel E6600 system. Compare:

    7200rpm: MD 4m30s / FG 1m55s
    SSD:     MD 4m14s / FG 1m44s

Almost the same, but running the utility "wc -l" on the file yields:

    7200rpm: 1m45s
    SSD:     0m33s

In my case threads would be beneficial, but only when using the SSD.
Reading the file by chunk in D takes 33s on SSD and 1m44s on HDD.
Slicing the file in half and reading it from two threads would also
only be fine on the SSD, because on an HDD I'd lose sequential disk
reads by jumping between threads (so I'd expect lower performance).

Therefore - threads: yes, but gotta use an SSD. :)
Also, threads: yes, if there's gonna be more processing than just
counting letters.
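For comparison, the chunked read timed above is essentially "wc -l" written in D (a minimal sketch; the 64 MiB chunk size is my own pick):

//----
import std.algorithm : count;
import std.stdio;

void main(string[] args)
{
    size_t lines;
    foreach (chunk; File(args[1], "rb").byChunk(64 * 1024 * 1024))
        lines += chunk.count('\n');
    writeln(lines);
}
//----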
February 07, 2013
A little feedback:
I named FG's script f and monarch's script monarch.

~ $ gdmd -O -w -release f.d
~ $ time ./f bigFastq.fastq
['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772]

real	2m14.966s
user	0m47.168s
sys	0m15.379s
~ $ gdmd -O -w -release monarch.d
monarch.d:117: no identifier for declarator Lines
monarch.d:117: alias cannot have initializer
monarch.d:130: identifier or integer expected, not assert


I haven't taken the time to look into it more.

But in any case, it seems memory-mapped files are really slow, whereas they are said to be the fastest way to read a file. Creating an index, where just reading the file takes 12 min, is useless when reading and computing only takes 2 min.
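For reference, a memory-mapped read with std.mmfile looks roughly like this (a minimal sketch; whether it beats a plain chunked read depends heavily on the OS page cache and the access pattern):

//----
import std.mmfile;
import std.stdio;

void main(string[] args)
{
    auto mmf  = new MmFile(args[1]);          // maps the file read-only
    auto data = cast(const(ubyte)[]) mmf[];
    size_t lines;
    foreach (b; data)
        if (b == '\n') ++lines;
    writeln(lines);
}
//----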
February 07, 2013
On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics wrote:
> A little feedback:
> I named FG's script f and monarch's script monarch.
>
> ~ $ gdmd -O -w -release f.d
> ~ $ time ./f bigFastq.fastq
> ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772]
>
> real	2m14.966s
> user	0m47.168s
> sys	0m15.379s
> ~ $ gdmd -O -w -release monarch.d
> monarch.d:117: no identifier for declarator Lines
> monarch.d:117: alias cannot have initializer
> monarch.d:130: identifier or integer expected, not assert
>
>
> I haven't taken the time to look into it more.
>
> But in any case, it seems memory-mapped files are really slow, whereas they are said to be the fastest way to read a file. Creating an index, where just reading the file takes 12 min, is useless when reading and computing only takes 2 min.

You must be using dmd 2.060. I'm using some 2.061 features, namely the "new style alias".

Just change line 117:
alias Lines = typeof(File.init.byLine());
to
alias typeof(File.init.byLine()) Lines;

As for line 130, it's a "version(assert)", i.e. code that does not get executed in release builds. Just remove the "version(assert)"; if the code gets executed anyway, it is not a big deal.
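For reference, this is all version(assert) does:

//----
import std.stdio;

void main()
{
    version(assert)
        writeln("assert code compiled in");   // dropped under -release
    else
        writeln("release build");
}
//----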

In any case, I think the code is mostly "proof", I wouldn't use it as is.

------------

BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integral values (ubytes)?
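For what it's worth, the two representations differ only by an offset: in Sanger-style FASTQ the Phred score is the ASCII code minus 33 (the offset varies between FASTQ dialects, so treat that as an assumption):

//----
import std.algorithm : map;
import std.array : array;
import std.stdio;

void main()
{
    string quality = "II9IG94IH";  // an example quality line
    auto scores = quality.map!(c => cast(ubyte)(c - 33)).array;
    writeln(scores);               // [40, 40, 24, 40, 38, 24, 19, 40, 39]
}
//----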
February 07, 2013
On Thursday, 7 February 2013 at 14:42:57 UTC, monarch_dodra wrote:
> On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics wrote:
>> A little feedback:
>> I named FG's script f and monarch's script monarch.
>>
>> ~ $ gdmd -O -w -release f.d
>> ~ $ time ./f bigFastq.fastq
>> ['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 'G':1353023772]
>>
>> real	2m14.966s
>> user	0m47.168s
>> sys	0m15.379s
>> ~ $ gdmd -O -w -release monarch.d
>> monarch.d:117: no identifier for declarator Lines
>> monarch.d:117: alias cannot have initializer
>> monarch.d:130: identifier or integer expected, not assert
>>
>>
>> I haven't taken the time to look into it more.
>>
>> But in any case, it seems memory-mapped files are really slow, whereas they are said to be the fastest way to read a file. Creating an index, where just reading the file takes 12 min, is useless when reading and computing only takes 2 min.
>
> You must be using dmd 2.060. I'm using some 2.061 features, namely the "new style alias".
>
> Just change line 117:
> alias Lines = typeof(File.init.byLine());
> to
> alias typeof(File.init.byLine()) Lines;
>
> As for line 130, it's a "version(assert)", i.e. code that does not get executed in release builds. Just remove the "version(assert)"; if the code gets executed anyway, it is not a big deal.
>
> In any case, I think the code is mostly "proof", I wouldn't use it as is.
>
> ------------
>
> BTW, I've started working on my library. How would users expect the "quality" format served? As an array of characters, or as an array of integral values (ubytes)?

A ubyte: as a number it is maybe easier to understand and to apply a cutoff at some value.
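A sketch of what such a cutoff could look like with ubyte scores (the threshold of 20 is arbitrary):

//----
import std.algorithm : all;
import std.stdio;

// keep a read only if every quality score clears the cutoff
bool passes(const(ubyte)[] scores, ubyte cutoff)
{
    return scores.all!(q => q >= cutoff);
}

void main()
{
    writeln(passes([40, 38, 24], 20)); // true
    writeln(passes([40, 12, 24], 20)); // false
}
//----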
February 08, 2013
On 06.02.2013 19:40, bioinfornatics wrote:
> On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote:
> I agree the spec format is really bad, but it is heavily used in biology,
> so I would like a fast parser in order to develop some D applications
> instead of using C++.

Yes, let's also create 1 GiB XML files and ask for fast encoding/decoding!

The situation can be improved only if:
1. We will find and kill every text format creator;
2. We will create a really good binary format for each such task and support it in every application we create. Then, after some time, text formats will simply die out by evolution, as everything will support the better formats.

(the second proposal is a real recommendation)

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
February 08, 2013
Also, use size_t instead of int as the return type of the getChar/getInt methods.

gdmd -w -O -release monarch.d
~ $ time ./monarch /env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
globalStats:
A: 1007129068. C: 1350576504. G: 1353023772. M:   0. D:   0. S:   0. H:   0. N: 39413. V:   0. U:   0. W:   0. R:   0. B:   0. Y:   0. K:   0. T: 999786820.
time: 176585

real	2m56.635s
user	2m31.376s
sys	0m23.077s


This program is a little slower than f's program.

About the parser: I would like to create a set of biology parsers and put them into a library, together with a set of common computations such as a letter counter.
For example, you could run a letter-counting computation through a FASTA or FASTQ file,
or rename identifiers throughout a FASTA or FASTQ file. See the sketch below.
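Maybe something in this direction; a minimal sketch of such a library computation, where Record and the other names are hypothetical, not anyone's actual API:

//----
import std.stdio;

struct Record { string id; string sequence; }   // hypothetical record type

// a common computation that works on any range of records, FASTA or FASTQ
size_t[char] letterCount(Range)(Range records)
{
    size_t[char] counts;
    foreach (rec; records)
        foreach (char c; rec.sequence)
            ++counts[c];
    return counts;
}

void main()
{
    auto records = [Record("read1", "ACGT"), Record("read2", "GGNA")];
    writeln(letterCount(records)); // A:2, C:1, G:3, N:1, T:1 (key order unspecified)
}
//----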
February 08, 2013
On Friday, 8 February 2013 at 09:08:48 UTC, bioinfornatics wrote:
> Also, use size_t instead of int as the return type of the getChar/getInt methods.
>
> gdmd -w -O -release monarch.d
> ~ $ time ./monarch /env/cns/proj/projet_AZH/A/RunsSolexa/121114_FLUOR_C16L5ACXX/AZH_AOSC_8_1_C16L5ACXX.IND1_clean.fastq
> globalStats:
> A: 1007129068. C: 1350576504. G: 1353023772. M:   0. D:   0. S:   0. H:   0. N: 39413. V:   0. U:   0. W:   0. R:   0. B:   0. Y:   0. K:   0. T: 999786820.
> time: 176585
>
> real	2m56.635s
> user	2m31.376s
> sys	0m23.077s
>
>
> This program is a little slower than f's program.

I've re-tried running both mine and FG's on an HDD-based machine, with dmd and -O -release, and also with the optional -inline.

I also wrote a new parser, which does as FG suggested and just parses straight up (byLine is indeed more expensive). This one handles whitespace and line breaks correctly. It also accepts lines of any size (the internal buffer auto-grows).
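The auto-grow buffer can be as simple as std.array's Appender (a sketch of the idea, not the parser's actual internals):

//----
import std.array : appender;
import std.stdio;

void main()
{
    auto buf = appender!(char[])();
    // accumulate a sequence split over several lines into one buffer
    foreach (fragment; ["ACGT", "GGTA", "NNAC"])
        buf.put(fragment);
    writeln(buf.data);  // ACGTGGTANNAC
    buf.clear();        // keeps the allocation for the next record
}
//----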

My results are different from yours though:

        w/o inline  w/ inline
FG      105s        77s
MD       72s        64s
newMD    61s        59s

I have no idea why you guys are getting better results with FG's, and I'm getting better results with mine. Is this a win/linux or a dmd/gdc issue? My new parser is based on raw reads, so it should be much faster on your machines.

> About the parser: I would like to create a set of biology parsers and put them into a library, together with a set of common computations such as a letter counter.
> For example, you could run a letter-counting computation through a FASTA or FASTQ file,
> or rename identifiers throughout a FASTA or FASTQ file.

I don't really understand what all that means.

In any case, I've been able to implement some cool features so far. My parser is a "true" range you can pass around, and you won't have any problems with it.

It returns "shallow" objects that reference a mutable string, however, the user can call "dup" or "idup" to have a new object.

Said objects can be printed directly, so there is no need for a specialized "writer". As a matter of fact, this little program will allow you to "clean" a file (strip spaces) and, potentially, line-wrap at 80 chars:

//----
import std.stdio;

import fastq.parser;
import fastq.q;

void main(string[] args)
{
    Parser parser = new Parser(args[1]);
    File   output = File(args[2], "wb");
    foreach(entry; parser)
        output.writefln("%80s", entry); // write the cleaned entry to the output file
}
//----

I'll submit it for your review once it is perfectly implemented.