Search
Code improvement for DNA reverse complement?
May 25
aberba
May 19
Biotronic
May 19
Biotronic
May 19
ag0aep6g
May 19
crimaniak
May 22
Biotronic
May 22
Biotronic
May 22
ag0aep6g
May 25
ag0aep6g
I am solving this problem http://rosalind.info/problems/revc/ as an exercise to learn D. This is my solution:

https://dpaste.dzfl.pl/8aa667f962b7

Is there some D tricks I can use to make the `reverseComplement` function more concise and speedy? Any other comments for improvement of the whole solution are also much appreciated.

On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
> I am solving this problem http://rosalind.info/problems/revc/ as an exercise to learn D. This is my solution:
>
> https://dpaste.dzfl.pl/8aa667f962b7
>
> Is there some D tricks I can use to make the `reverseComplement` function more concise and speedy? Any other comments for improvement of the whole solution are also much appreciated.

I think doing a switch or even a if-else chain would be faster then using an AA.

On Friday, 19 May 2017 at 07:46:13 UTC, Stefan Koch wrote:
> On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
>> I am solving this problem http://rosalind.info/problems/revc/ as an exercise to learn D. This is my solution:
>>
>> https://dpaste.dzfl.pl/8aa667f962b7
>>
>> Is there some D tricks I can use to make the `reverseComplement` function more concise and speedy? Any other comments for improvement of the whole solution are also much appreciated.
>
> I think doing a switch or even a if-else chain would be faster then using an AA.

Thank you Stefan. I will try that and benchmark the two implementations. I used AA approach because it looks more readable to me.
On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
> I am solving this problem http://rosalind.info/problems/revc/ as an exercise to learn D. This is my solution:
>
> https://dpaste.dzfl.pl/8aa667f962b7
>
> Is there some D tricks I can use to make the `reverseComplement` function more concise and speedy? Any other comments for improvement of the whole solution are also much appreciated.

Question about your implementation: you assume the input may contain newlines, but don't handle any other non-ACGT characters. The problem definition states 'DNA string' and the sample dataset contains no non-ACGT chars. Is this an oversight my part or yours, or did you just decide to support more than the problem requires?

Here's a few variations I've come up with, and their timings on my machine:

import std.exception;
import std.stdio;
import std.conv;
import std.range;
import std.algorithm;
import std.datetime;
import std.meta;

string randomDna(int length) {
import std.random;
auto rnd = Random(unpredictableSeed);
enum chars = ['A','C','G','T'];
return iota(length).map!(a=>chars[uniform(0,4, rnd)]).array;
}

unittest {
auto input = randomDna(2000);

string previous = null;
foreach (fn; AliasSeq!(revComp0, revComp1, revComp2, revComp3)) {
auto timing = benchmark!({fn(input);})(10_000);
writeln((&fn).stringof[2..\$], ": ", timing[0].to!Duration);
auto current = fn(input);
if (previous != null) {
if (current != previous) {
writeln((&fn).stringof[2..\$], " did not give correct results.");
} else {
previous = current;
}
}
}
}

// 216 ms, 3 us, and 8 hnsecs
string revComp0(string bps) {
const N = bps.length;
char[] result = new char[N];
for (int i = 0; i < N; ++i) {
result[i] = {switch(bps[N-i-1]){
case 'A': return 'T';
case 'C': return 'G';
case 'G': return 'C';
case 'T': return 'A';
default: return '\0';
}}();
}
return result.assumeUnique;
}

// 2 secs, 752 ms, and 969 us
string revComp1(string bps) {
return bps.retro.map!((a){switch(a){
case 'A': return 'T';
case 'C': return 'G';
case 'G': return 'C';
case 'T': return 'A';
default: assert(false);
}}).array;
}

// 10 secs, 419 ms, 335 us, and 6 hnsecs
string revComp2(string bps) {
enum chars = ['A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'];
auto result = appender!string;
foreach_reverse (c; bps) {
result.put(chars[c]);
}
return result.data;
}

// 1 sec, 972 ms, 915 us, and 9 hnsecs
string revComp3(string bps) {
const N = bps.length;
enum chars = [Repeat!('A'-'\0', '\0'), 'T',
Repeat!('C'-'A'-1, '\0'), 'G',
Repeat!('G'-'C'-1, '\0'), 'C',
Repeat!('T'-'G'-1, '\0'), 'A'];

char[] result = new char[N];
for (int i = 0; i < N; ++i) {
result[i] = chars[bps[N-i-1]];
}
return result.assumeUnique;
}
On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
> On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
>> [...]
>
> Question about your implementation: you assume the input may contain newlines, but don't handle any other non-ACGT characters. The problem definition states 'DNA string' and the sample dataset contains no non-ACGT chars. Is this an oversight my part or yours, or did you just decide to support more than the problem requires?
>
> [...]

Firstly, thank you for showing me various solutions, and even cool benchmark code. To answer you questions: Yes I assume the input file would realistically contain newlines, even though the problem does not care about them. I also thought about non-CATG bases, but haven't taken care of those cases. In reality we should deal with at least ambiguous bases (N).

I ran your code and also see that switch is faster than AA (i.e. revComp0 is the fastest). And Stefan is right about this.

1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?

2. What is going on with the trick of making chars enum like that in 'revComp3'?

On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
>
> 1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?

D strings are immutable, so if I'd created the result array as a string, I couldn't change the individual characters. Instead, I create a mutable array, change the elements in it, then cast it to immutable when I'm done. assumeUnique does that casting while keeping other type information and arguably providing better documentation through its name. Behind the scenes, it's basically doing cast(string)result;

> 2. What is going on with the trick of making chars enum like that in 'revComp3'?

By marking a symbol enum, we tell the compiler that its value should be calculated at compile-time. It's a bit of an optimization (but probably doesn't matter at all, and should be done by the compiler anyway), and a way to say it's really, really const. :p

Mostly, it's a habit I try to build, of declaring symbols as const as possible, to make maintenance easier.

Bonus! Three more variations, all faster than revComp0:

string revComp4(string bps) {
const N = bps.length;
char[] result = new char[N];
for (int i = 0; i < N; ++i) {
switch(bps[N-i-1]) {
case 'A': result[i] = 'T'; break;
case 'C': result[i] = 'G'; break;
case 'G': result[i] = 'C'; break;
case 'T': result[i] = 'A'; break;
default: assert(false);
}
}
return result.assumeUnique;
}

string revComp5(string bps) {
const N = bps.length;
char[] result = new char[N];
foreach (i, ref e; result) {
switch(bps[N-i-1]) {
case 'A': e = 'T'; break;
case 'C': e = 'G'; break;
case 'G': e = 'C'; break;
case 'T': e = 'A'; break;
default: assert(false);
}
}
return result.assumeUnique;
}

string revComp6(string bps) {
char[] result = new char[bps.length];
auto p1 = result.ptr;
auto p2 = &bps[\$-1];

while (p2 > bps.ptr) {
switch(*p2) {
case 'A': *p1 = 'T'; break;
case 'C': *p1 = 'G'; break;
case 'G': *p1 = 'C'; break;
case 'T': *p1 = 'A'; break;
default: assert(false);
}
p1++; p2--;
}
return result.assumeUnique;
}

revComp6 seems to be the fastest, but it's probably also the least readable (a common trade-off).
On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
> On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
>> On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
>>> [...]
>>
>> Question about your implementation: you assume the input may contain newlines, but don't handle any other non-ACGT characters. The problem definition states 'DNA string' and the sample dataset contains no non-ACGT chars. Is this an oversight my part or yours, or did you just decide to support more than the problem requires?
>>
>> [...]
>
> Firstly, thank you for showing me various solutions, and even cool benchmark code. To answer you questions: Yes I assume the input file would realistically contain newlines, even though the problem does not care about them. I also thought about non-CATG bases, but haven't taken care of those cases. In reality we should deal with at least ambiguous bases (N).
>
> I ran your code and also see that switch is faster than AA (i.e. revComp0 is the fastest). And Stefan is right about this.
>
>
> 1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?
>

Because `char[] result = new char[N];` is not a string (a.k.a. immutable(char)[]).
But because it was created from the GC in this function we know that it is safe to assume that is a string.

> 2. What is going on with the trick of making chars enum like that in 'revComp3'?

What revComp3 is doing is effectively creating a table for each possible value of char that matches the behaviour of the switch.
it could also be rewritten as
```
char[256] chars; // implicitly memset to '\0'
chars['A'] = 'T';
chars['C'] = 'G';
chars['G'] = 'C';
chars['T'] = 'A';
```

If you haven't already checkout [BioD](https://github.com/biod/BioD), for most (all?) your bioinformatics needs.

If you're trying to be fast you probably don't want to use string for internal calculations as it is very entropy non-optimal (2 bits out of 8 for ACGT, 4 out of 8 for an ambiguous encoding).
I would have at least 2 "Dictionaries": one the standard nucleotides (ACGT) and another for your ambiguous representations (UNRYBDHVMKSW-) and the standard nucleotides, to get a better information density. If you're doing anything with protein sequences then you should use a translation table anyway as the DNA -> amino acid mapping changes between species/organelle (mt|cp|n)DNA.
On 05/19/2017 02:55 PM, Biotronic wrote:
> On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
>>
>> 1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?
>
> D strings are immutable, so if I'd created the result array as a string, I couldn't change the individual characters. Instead, I create a mutable array, change the elements in it, then cast it to immutable when I'm done. assumeUnique does that casting while keeping other type information and arguably providing better documentation through its name. Behind the scenes, it's basically doing cast(string)result;

You could alternatively use `.reserve` on a `string`:

----
string result;
result.reserve(N);
for (...) {
result ~= ...;
}
----

That performs worse, though. Probably still has to check every time if there's reserved space available. On the plus side, it would be `@safe`.

>> 2. What is going on with the trick of making chars enum like that in 'revComp3'?
>
> By marking a symbol enum, we tell the compiler that its value should be calculated at compile-time. It's a bit of an optimization (but probably doesn't matter at all, and should be done by the compiler anyway), and a way to say it's really, really const. :p

You fell into a trap there. The value is calculated at compile time, but it has copy/paste-like behavior. That is, whenever you use `chars`, the code behaves as if you typed out the array literal. That means, the whole array is re-created on every iteration.

Use `static immutable` instead. It still forces compile-time calculation, but it doesn't have copy/paste behavior. Speeds up revComp3 a lot.
On Friday, 19 May 2017 at 12:55:05 UTC, Biotronic wrote:
> revComp6 seems to be the fastest, but it's probably also the least readable (a common trade-off).
Try revComp7 with -release :)

string revComp7(string bps)
{
char[] result = new char[bps.length];
auto p1 = result.ptr;
auto p2 = &bps[\$ - 1];
enum AT = 'A'^'T';
enum CG = 'C'^'G';

while (p2 > bps.ptr)
{
*p1 = *p2 ^ ((*p2 == 'A' || *p2 == 'T') ? AT : CG);
p1++;
p2--;
}
return result.assumeUnique;
}

In fact, when the size of the sequence is growing time difference between procedures is shrinking, so it's much more important to use memory-efficient presentation than to optimize logic.
On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
> I am solving this problem http://rosalind.info/problems/revc/ as an exercise to learn D. This is my solution:
>
> https://dpaste.dzfl.pl/8aa667f962b7
>
> Is there some D tricks I can use to make the `reverseComplement` function more concise and speedy? Any other comments for improvement of the whole solution are also much appreciated.

You might try using ldc in release mode.  With dmd, cost of ranges and so on might be higher than doing it in a more direct low-level way.  With ldc that difference often seems to shrink to insignificance as its optimisation is more powerful.  If you care about performance then use ldc or gdc.  (I have less experience with gdc, but probably similar).  But then it's best to compare results under ldc because the ratios of different options will be different vs dmd.

« First   ‹ Prev
1 2