June 11, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to duck_tape

On Thu, Jun 11, 2020 at 10:41:12PM +0000, duck_tape via Digitalmars-d-learn wrote:
> On Thursday, 11 June 2020 at 22:19:27 UTC, H. S. Teoh wrote:
> > To encourage inlining, you could make it an alias parameter instead of a delegate, something like this:
> >
> > 	void overlap(alias cb)(SType start, SType stop) { ... }
> > 	...
> > 	bed[chr].overlap!callback(st0, en0);
>
> I don't think ldc can handle that yet. I get an error saying:
>
> ```
> source/app.d(72,7): Error: function app.main.overlap!(callback).overlap
> requires a dual-context, which is not yet supported by LDC
> ```
>
> And I see an open ticket for it on the ldc project.

Oh right. :-( But in any case, I'm a little skeptical whether this is the performance bottleneck anyway. One simple thing to try, though, is to add 'scope' to the callback parameter, which could potentially save you a GC allocation. I'm not 100% certain this will make a difference, but since it's such an easy change it's worth a shot.

T

--
Philosophy: how to make a career out of daydreaming.
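As an illustration, here is a minimal sketch of the suggested 'scope' change. The `SType` and `Interval` names are stand-ins mirroring the thread's code, not the actual app.d:

```
import std.stdio : writeln;

alias SType = int;
struct Interval { SType start, stop; }

// Marking the delegate 'scope' promises it never escapes overlap(),
// which lets the compiler allocate the closure on the stack instead
// of the GC heap.
void overlap(SType start, SType stop, scope void delegate(Interval) cb)
{
    // ... interval-tree query would go here; call cb() for each hit ...
    cb(Interval(start, stop)); // placeholder so the sketch runs
}

void main()
{
    int hits;
    overlap(5, 10, (iv) { ++hits; }); // closure over 'hits' can stay on the stack
    writeln(hits);
}
```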
June 11, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to tastyminerals

On Thursday, 11 June 2020 at 22:53:52 UTC, tastyminerals wrote:
> Mir is fine-tuned for LLVM, pointer magic and SIMD optimizations.
I'll have to give that a shot for the biofast version of this. There are other ways of doing this same thing that could very well benefit from Mir.
June 11, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to H. S. Teoh

On Thursday, 11 June 2020 at 22:57:55 UTC, H. S. Teoh wrote:
> But one simple thing to try is to add 'scope' to the callback parameter, which could potentially save you a GC allocation. I'm not 100% certain this will make a difference, but since it's such an easy change it's worth a shot.

I will give that a shot! Also of interest, the profiler results on a full runthrough do show file writing and int parsing as the 2nd and 3rd most time consuming activities:

```
     Num      Tree    Func    Per
   Calls      Time    Time   Call

 8942869     46473   44660      0  void app.IITree!(int, bool).IITree.overlap(int, int, void delegate(app.IITree!(int, bool).IITree.Interval))
 8942869     33065    9656      0  @safe void std.stdio.File.write!(char[], immutable(char)[], char[], immutable(char)[], char[], immutable(char)[], int, immutable(char)[], int, char).write(char[], immutable(char)[], char[], immutable(char)[], char[], immutable(char)[], int, immutable(char)[], int, char)
20273052     10024    9569      0  pure @safe int std.conv.parse!(int, char[]).parse(ref char[])
       1    128571    8894   8894  _Dmain
80485821      6539    6539      0  nothrow @nogc @trusted ulong std.stdio.trustedFwrite!(char).trustedFwrite(shared(core.stdc.stdio.__sFILE)*, const(char[]))
17885738      8606    3808      0  @safe void std.conv.toTextRange!(int, std.stdio.File.LockingTextWriter).toTextRange(int, std.stdio.File.LockingTextWriter)
30409578      3751    3751      0  pure nothrow @nogc @trusted char[] std.algorithm.searching.find!("a == b", char[], char).find(char[], char).trustedMemchr(ref char[], ref char)
10136528      3300    3274      0  ulong std.stdio.File.readln!(char).readln(ref char[], dchar)
30409578     13151    3047      0  pure @safe char[] app.next!(std.algorithm.iteration.splitter!("a == b", char[], char).splitter(char[], char).Result).next(ref std.algorithm.iteration.splitter!("a == b", char[], char).splitter(char[], char).Result)
30409578      8964    2605      0  pure @property @safe char[] std.algorithm.iteration.splitter!("a == b", char[], char).splitter(char[], char).Result.front()
30409578      6289    2471      0  pure @safe char[] std.algorithm.searching.find!("a == b", char[], char).find(char[], char)
```
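Since std.conv.parse is the third-hottest entry in the profile above, one low-effort experiment is a hand-rolled parser for the non-negative, digits-only integers a BED file is expected to contain. This is a sketch, not the benchmark's code:

```
// Assumes s holds only ASCII digits (no sign, no whitespace, no
// overflow check) -- much less general than std.conv.parse.
int parseIntFast(scope const(char)[] s)
{
    int v = 0;
    foreach (c; s)
        v = v * 10 + (c - '0');
    return v;
}

unittest
{
    assert(parseIntFast("20273052") == 20273052);
}
```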
June 11, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to duck_tape

On Thu, Jun 11, 2020 at 11:02:21PM +0000, duck_tape via Digitalmars-d-learn wrote:
[...]
> I will give that a shot! Also of interest, the profiler results on a full runthrough do show file writing and int parsing as the 2nd and 3rd most time consuming activities:
>
> ```
>      Num      Tree    Func    Per
>    Calls      Time    Time   Call
>
>  8942869     46473   44660      0  void app.IITree!(int, bool).IITree.overlap(int, int, void delegate(app.IITree!(int, bool).IITree.Interval))
>  8942869     33065    9656      0  @safe void std.stdio.File.write!(char[], immutable(char)[], char[], immutable(char)[], char[], immutable(char)[], int, immutable(char)[], int, char).write(char[], immutable(char)[], char[], immutable(char)[], char[], immutable(char)[], int, immutable(char)[], int, char)
> 20273052     10024    9569      0  pure @safe int std.conv.parse!(int, char[]).parse(ref char[])
> ```

Hmm, looks like it's not so much input that's slow, but *output*. In fact, it looks pretty bad, taking almost as much time as overlap() does in total!

This makes me think that writing your own output buffer could be worthwhile. Here's a quick-n-dirty way of doing that:

	import std.array : appender;
	auto filebuf = appender!(char[]);
	...
	// Replace every call to writeln with this:
	put(filebuf, text(... /* arguments go here */ ..., "\n"));

	...

	// At the end of the main loop:
	enum bufLimit = 0x1000; // whatever the limit you want
	if (filebuf.data.length > bufLimit) {
		write(filebuf.data); // flush output data
		stdout.flush;
		filebuf.clear;
	}

This is just a rough sketch for an initial test, of course. For a truly optimized output buffer I'd write a container struct with methods for managing the appending and flushing of output. But this is just to get an idea of whether it actually improves performance before investing more effort into going in this direction.

T

--
He who sacrifices functionality for ease of use, loses both and deserves neither. -- Slashdotter
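One hedged way to grow the quick-n-dirty version above into the container struct described (the `BufferedWriter` name and 64 KiB threshold are invented here for illustration):

```
import std.array : Appender, appender;
import std.stdio : File, stdout;

// Accumulates output and flushes to the underlying File once the
// buffer passes a threshold, and again on destruction.
struct BufferedWriter
{
    Appender!(char[]) buf;
    File sink;
    size_t limit;

    this(File sink, size_t limit = 64 * 1024)
    {
        this.sink = sink;
        this.limit = limit;
    }

    void put(scope const(char)[] s)
    {
        buf.put(s);
        if (buf.data.length >= limit)
            flush();
    }

    void flush()
    {
        if (buf.data.length)
        {
            sink.rawWrite(buf.data);
            buf.clear();
        }
    }

    ~this() { flush(); }
}
```

Usage would be along the lines of `auto w = BufferedWriter(stdout); w.put(text(chrom, '\t', st, '\n'));`, with `text` from std.conv.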
June 12, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to H. S. Teoh

On Thursday, 11 June 2020 at 23:45:31 UTC, H. S. Teoh wrote:
>
> Hmm, looks like it's not so much input that's slow, but *output*. In fact, it looks pretty bad, taking almost as much time as overlap() does in total!
>
> This makes me think that writing your own output buffer could be worthwhile. Here's a quick-n-dirty way of doing that:
>
> import std.array : appender;
> auto filebuf = appender!(char[]);
> ...
> // Replace every call to writeln with this:
> put(filebuf, text(... /* arguments go here */ ..., "\n"));
>
> ...
>
> // At the end of the main loop:
> enum bufLimit = 0x1000; // whatever the limit you want
> if (filebuf.data.length > bufLimit) {
> write(filebuf.data); // flush output data
> stdout.flush;
> filebuf.clear;
> }
>
> This is just a rough sketch for an initial test, of course. For a truly optimized output buffer I'd write a container struct with methods for managing the appending and flushing of output. But this is just to get an idea of whether it actually improves performance before investing more effort into going in this direction.
>
>
> T
I'll play with that a bit tomorrow! I saw a nice implementation in eBay's tsv-utils that I may take a closer look at.
Someone else suggested that stdout flushes per line by default. I dug around the stdlib but couldn't confirm that. I also played around with setvbuf but it didn't seem to change anything.
Have you run into that before / know if stdout is flushing every newline? I'm not above opening '/dev/stdout' as a file if that writes faster.
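For the flushing question, one thing to try is D's File.setvbuf, called before the first write. This is a sketch; whether it helps depends on the C runtime, since glibc line-buffers stdout only when it is attached to a terminal:

```
import core.stdc.stdio : _IOFBF;
import std.stdio : stdout;

void main()
{
    // Request a fully buffered, 1 MiB stdout up front. If the C runtime
    // was line-buffering (typical when stdout is a terminal), this stops
    // the implicit flush at every newline.
    stdout.setvbuf(1024 * 1024, _IOFBF);

    foreach (i; 0 .. 1_000_000)
        stdout.writeln(i);
}
```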
June 12, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to duck_tape

On Friday, 12 June 2020 at 00:58:34 UTC, duck_tape wrote:
> On Thursday, 11 June 2020 at 23:45:31 UTC, H. S. Teoh wrote:
>> Hmm, looks like it's not so much input that's slow, but *output*. In fact, it looks pretty bad, taking almost as much time as overlap() does in total!
>>
>> [snip...]
>
> I'll play with that a bit tomorrow! I saw a nice implementation in eBay's tsv-utils that I may take a closer look at.
>
> Someone else suggested that stdout flushes per line by default. I dug around the stdlib but couldn't confirm that. I also played around with setvbuf but it didn't seem to change anything.
>
> Have you run into that before / know if stdout is flushing every newline? I'm not above opening '/dev/stdout' as a file if that writes faster.

I put some comparative benchmarks in https://github.com/jondegenhardt/dcat-perf. It compares input and output using standard Phobos facilities (File.byLine, File.write), iopipe (https://github.com/schveiguy/iopipe), and the tsv-utils buffered input and buffered output facilities.

I haven't spent much time on results presentation; I know the results aren't that easy to read and interpret. Brief summary: on files with short lines, buffering results in dramatic throughput improvements over the standard Phobos facilities. This is true for both input and output, though likely for different reasons. For input, iopipe is the fastest available. The tsv-utils buffered facilities are materially faster than Phobos for both input and output, but not as fast as iopipe for input. Combining iopipe for input with the tsv-utils BufferedOutputRange for output works pretty well.

For files with long lines, both iopipe and the tsv-utils bufferedByLine are materially faster than Phobos File.byLine when reading. For writing there wasn't much difference from Phobos File.write.

A note on File.byLine: I've had many opportunities to compare Phobos File.byLine to facilities in other programming languages, and it is not bad at all. But it is beatable.

About memory-mapped files: the benchmarks don't include a comparison against mmfile. They would certainly make sense as a comparison point.

--Jon
June 11, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to Jon Degenhardt

On Fri, Jun 12, 2020 at 03:32:48AM +0000, Jon Degenhardt via Digitalmars-d-learn wrote:
[...]
> I haven't spent much time on results presentation; I know the results aren't that easy to read and interpret. Brief summary: on files with short lines, buffering results in dramatic throughput improvements over the standard Phobos facilities. This is true for both input and output, though likely for different reasons. For input, iopipe is the fastest available. The tsv-utils buffered facilities are materially faster than Phobos for both input and output, but not as fast as iopipe for input. Combining iopipe for input with the tsv-utils BufferedOutputRange for output works pretty well.
>
> For files with long lines, both iopipe and the tsv-utils bufferedByLine are materially faster than Phobos File.byLine when reading. For writing there wasn't much difference from Phobos File.write.

Interesting. Based on the OP's posted profile data, I got the impression that input wasn't a big deal, but output was. I wonder why.

> A note on File.byLine: I've had many opportunities to compare Phobos File.byLine to facilities in other programming languages, and it is not bad at all. But it is beatable.

I glanced over the implementation of byLine. It appears to be the unhappy compromise of trying to be 100% correct, cover all possible UTF encodings, and cover all possible types of input streams (on-disk file vs. interactive console). It does UTF decoding and resizing of arrays, and a lot of other frilly little squirrelly things. In fact I'm dismayed at how hairy it is, considering the conceptual simplicity of the task!

Given this, it will definitely be much faster to load large chunks of the file at a time into a buffer, and scan in-memory for linebreaks. I wouldn't bother with decoding at all; I'd just precompute the byte sequence of the linebreaks for whatever encoding the file is expected to be in, scan for that byte pattern, and return slices to the data.

> About memory-mapped files: the benchmarks don't include a comparison against mmfile. They would certainly make sense as a comparison point.
[...]

I'd seriously consider using std.mmfile if I/O is determined to be a significant bottleneck. Letting the OS page in the file on demand, instead of copying buffers across the C file API boundary, is definitely going to be a lot faster. Plus it will greatly simplify the code -- you could arbitrarily scan and slice over the file data without needing to manage buffers yourself, so your code will be much simpler and more conducive for the compiler to squeeze the last bit of speed juice out of. I'd definitely avoid stdio.byLine if input were determined to be a bottleneck: decoding characters from file data just to find linebreaks seems to me a slow way of doing things.

Having said all that, though: usually in non-trivial programs reading input is the least of your worries, so this kind of micro-optimization is probably unwarranted except for very niche cases, micro-benchmarks, and other toy programs where the cost of I/O constitutes a significant chunk of running time. But knowing what byLine does under the hood is definitely interesting information to keep in mind the next time I write an input-heavy program.

(I'm reminded of the time when, as a little diversion, I decided to see if I could beat GNU wc at counting lines in a file. It was not easy to beat, since wc is optimized to next year and back, but eventually a combination of std.mmfile and std.parallelism to scan large chunks of the file simultaneously managed to beat wc by a good margin. In the meantime, though, I also discovered that a file of very short lines triggers poor performance out of wc, whereas a file of very long lines triggers its best performance: glibc's memchr appears to be optimized for scanning arrays with only rare occurrences of the sought character, while typical text files have much more frequent matches (shorter lines). When fed a file of very short lines, the per-call overhead of the hyper-optimized code added up significantly; when lines were sufficiently long, the gains far outweighed that overhead. Optimization is a tricky beast: always measure and optimize for your actual use case rather than making your code look good on some artificial micro-benchmark, else your code may look good on the benchmark but perform poorly on real-world data.)

T

--
May you live all the days of your life. -- Jonathan Swift
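A rough sketch of that wc experiment's shape, assuming std.mmfile plus std.parallelism as described (the chunk size and other details are invented here):

```
import std.algorithm.iteration : sum;
import std.algorithm.searching : count;
import std.mmfile : MmFile;
import std.parallelism : taskPool;
import std.range : chunks;
import std.stdio : writeln;

void main(string[] args)
{
    // Map the whole file, split it into large chunks, and count '\n'
    // bytes across worker threads -- no decoding, no copying.
    auto mmf = new MmFile(args[1]);
    auto data = cast(const(ubyte)[]) mmf[];

    auto perChunk = taskPool.amap!((const(ubyte)[] c) => c.count('\n'))
                                  (data.chunks(16 * 1024 * 1024));
    writeln(perChunk.sum);
}
```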
June 12, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to H. S. Teoh

On Friday, 12 June 2020 at 06:20:59 UTC, H. S. Teoh wrote:
> I glanced over the implementation of byLine. It appears to be the unhappy compromise of trying to be 100% correct, cover all possible UTF encodings, and cover all possible types of input streams (on-disk file vs. interactive console). It does UTF decoding and resizing of arrays, and a lot of other frilly little squirrelly things. In fact I'm dismayed at how hairy it is, considering the conceptual simplicity of the task!
>
> Given this, it will definitely be much faster to load large chunks of the file at a time into a buffer, and scan in-memory for linebreaks. I wouldn't bother with decoding at all; I'd just precompute the byte sequence of the linebreaks for whatever encoding the file is expected to be in, scan for that byte pattern, and return slices to the data.

This is basically what bufferedByLine in tsv-utils does. See: https://github.com/eBay/tsv-utils/blob/master/common/src/tsv_utils/common/utils.d#L793

tsv-utils has the advantage of only needing to support utf-8 files with Unix newlines, so the code is simpler. (Windows newlines are detected, but this occurs separately from bufferedByLine.) As you describe, though, support for a wider variety of input cases could be done without sacrificing basic performance. iopipe provides much more generic support, and it is quite fast.

> Having said all that, though: usually in non-trivial programs reading input is the least of your worries, so this kind of micro-optimization is probably unwarranted except for very niche cases, micro-benchmarks, and other toy programs where the cost of I/O constitutes a significant chunk of running time. But knowing what byLine does under the hood is definitely interesting information to keep in mind the next time I write an input-heavy program.

tsv-utils tools saw performance gains of 10-40% by moving from File.byLine to bufferedByLine, depending on the tool and the type of file (narrow or wide). Gains of 5-20% were obtained by switching from File.write to BufferedOutputRange, with some special cases improving by 50%.

tsv-utils tools aren't micro-benchmarks, but they aren't typical apps either. Most of the tools go into a tight loop of some kind, running a transformation on the input and writing to the output. Performance is a real benefit to these tools, as they get run on reasonably large data sets.
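A minimal sketch of that chunk-and-slice idea follows. This is illustrative, not tsv-utils' actual bufferedByLine, and it assumes UTF-8 with Unix newlines; something like `File("in.bed").eachLineBuffered((line) { ... });` would drive it:

```
import std.stdio : File;

// Read in large blocks with rawRead and hand out slices up to each
// '\n'; only a line straddling a block boundary is ever copied.
// Note: slices passed to sink are only valid until the next block read.
void eachLineBuffered(File f, scope void delegate(const(char)[]) sink,
                      size_t blockSize = 1 << 20)
{
    import std.algorithm.searching : countUntil;

    auto block = new char[](blockSize);
    char[] leftover;

    for (;;)
    {
        auto chunk = f.rawRead(block);
        if (chunk.length == 0)
        {
            if (leftover.length) sink(leftover); // final unterminated line
            return;
        }
        while (chunk.length)
        {
            immutable nl = chunk.countUntil('\n');
            if (nl < 0)
            {
                leftover ~= chunk; // line continues in the next block
                break;
            }
            if (leftover.length)
            {
                leftover ~= chunk[0 .. nl];
                sink(leftover);
                leftover = null;
            }
            else
                sink(chunk[0 .. nl]);
            chunk = chunk[nl + 1 .. $];
        }
    }
}
```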
June 12, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to Jon Degenhardt

On Friday, 12 June 2020 at 07:25:09 UTC, Jon Degenhardt wrote:
> tsv-utils has the advantage of only needing to support utf-8 files with Unix newlines, so the code is simpler. (Windows newlines are detected, but this occurs separately from bufferedByLine.) As you describe, though, support for a wider variety of input cases could be done without sacrificing basic performance. iopipe provides much more generic support, and it is quite fast.

I will have to look into iopipe for sure. All this info is great.

For this particular benchmark the goal is just to show off some 'high-level' languages and how close to C they can get. If I can avoid going way into the weeds writing my own output methods, that's more in the spirit of things. However, I do intend to use D for bioinformatics, which is incredibly IO-intensive, so much of this will be put to good use.

For speedups without getting my hands dirty:

- Does writef and company flush on every line? I still haven't found the source of this.
- It looks like I could use {f}printf if I really wanted to: https://forum.dlang.org/post/hzcjbanvkxgohkbvjnkv@forum.dlang.org

It's particularly interesting that short lines are said to do worse, because these lines are pretty short, usually less than 20 characters.
June 12, 2020 Re: Looking for a Code Review of a Bioinformatics POC
Posted in reply to duck_tape

On Friday, 12 June 2020 at 12:02:19 UTC, duck_tape wrote:
> For speedups without getting my hands dirty:
> - Does writef and company flush on every line? I still haven't found the source of this.
> - It looks like I could use {f}printf if I really wanted to: https://forum.dlang.org/post/hzcjbanvkxgohkbvjnkv@forum.dlang.org
Switching to using `core.stdc.stdio.printf` shaved off nearly two seconds (11->9)!
Once I wrap this up for submission to biofast I will play with memory mapping / iopipe / tsv-utils buffered writers. Sambamba is also doing some non-standard tweaks to its output.
I'm still convinced that stdout is flushing by line.
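For reference, the usual trick for pairing printf with D slices, which are not NUL-terminated. The chrom/st/en names are stand-ins for the benchmark's BED columns, not the actual code:

```
import core.stdc.stdio : printf;

void main()
{
    const(char)[] chrom = "chr1";
    int st = 100, en = 200;

    // %.*s prints exactly chrom.length bytes, so no NUL terminator
    // (and no toStringz allocation) is needed.
    printf("%.*s\t%d\t%d\n", cast(int) chrom.length, chrom.ptr, st, en);
}
```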