Announcement and Request: Typesafe Coordinate Systems for High-Throughput Sequencing Applications
September 01, 2021
In another post, I've just announced our D-based high-throughput sequencing library, dhtslib.

One feature that is, AFAIK, novel in the field is leveraging the compiler's type system to enforce correctness across the different genome/reference sequence coordinate systems. Encoding domain-specific knowledge in a language's type system is clearly nothing new, but it is surprising that this has not been done before in bioinformatics, and IMO the idea is long overdue given the trainwreck of different coordinate systems in our field.
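A minimal sketch of the idea, assuming illustrative names (`Basis`, `End`, `Interval`, `convertTo`, `firstBaseOneBased`) that are not dhtslib's actual API: the coordinate system becomes a template parameter of the interval type, so mixing systems is a compile-time error and every conversion has to be written out explicitly.

```d
import std.stdio : writeln;

// Illustrative names only; dhtslib's real types and members may differ.
enum Basis { zero, one }
enum End { halfOpen, closed }

/// An interval that carries its coordinate system in its type.
struct Interval(Basis b, End e)
{
    long start;
    long end;

    /// Number of positions covered, whatever the convention.
    long length() const
    {
        static if (e == End.halfOpen)
            return end - start;
        else
            return end - start + 1;
    }

    /// Explicit conversion to another coordinate system.
    Interval!(nb, ne) convertTo(Basis nb, End ne)() const
    {
        long s = start;
        long t = end;
        static if (b == Basis.zero && nb == Basis.one) { s += 1; t += 1; }
        static if (b == Basis.one && nb == Basis.zero) { s -= 1; t -= 1; }
        static if (e == End.halfOpen && ne == End.closed) t -= 1;
        static if (e == End.closed && ne == End.halfOpen) t += 1;
        return Interval!(nb, ne)(s, t);
    }
}

alias ZBHO = Interval!(Basis.zero, End.halfOpen); // e.g. BED
alias OBC  = Interval!(Basis.one, End.closed);    // e.g. GFF

/// A consumer that only accepts one-based closed intervals.
long firstBaseOneBased(OBC region) { return region.start; }

void main()
{
    auto bed = ZBHO(0, 100);                            // first 100 bases
    auto gff = bed.convertTo!(Basis.one, End.closed)(); // same bases, 1-100
    assert(gff.start == 1 && gff.end == 100);
    assert(bed.length == gff.length);
    writeln(firstBaseOneBased(gff));

    // Passing the wrong coordinate system is rejected at compile time.
    static assert(!__traits(compiles, firstBaseOneBased(bed)));
}
```

Because the conversion is resolved with `static if`, the arithmetic for each pair of systems is fixed at compile time and no runtime tag has to be checked.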

You can find dhtslib's develop branch, with Typesafe Coordinates merged and ready to use, here:

https://github.com/blachlylab/dhtslib/


**Now the request:**
We've drafted a manuscript describing Typesafe Coordinates as a sort of low-key endorsement of the D language and our library package `dhtslib`. You can find the manuscript here:

https://github.com/blachlylab/typesafe-coordinates/

We would be very grateful to those of you who would take the time to read the manuscript and post comments (publicly or privately), _especially if we have made any incorrect statements_ or our language regarding type systems is awkward or nonstandard.

We did praise D and gently criticized Rust and OCaml*, as it appeared to me that they lacked the features required to implement Typesafe Coordinate Systems as ergonomically as we could in D. However, as a true novice in both of these other languages, I may have missed something significant, and the Rust and OCaml implementations could perhaps be retooled to match the D implementation. I'd be glad to hear it if that's the case.

I plan to make a few minor cleanups and submit this to a preprint server as well as a scientific journal in the next week or so.

Kind regards

James S Blachly, MD
The Ohio State University


* as a side note, I actually find the OCaml code quite attractive in its terseness: `let j = cl_interval_of_ho (ob_interval_of_zb i)`
September 01, 2021

Hi James and Charles,

I am happy to hear of your latest idea of creating type-safe coordinate systems. It's a great idea!

After reading the code on GitHub, I have only one major remark: IMHO, it would be great to separate the novel coordinate systems from any `htslib` dependencies ([see lines 47-50](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L47-L50)), as only a few auxiliary functions use both the coordinate systems and `htslib`. The greater goal I have in mind is to provide the coordinate systems in a separate DUB sub-package (e.g. `dhtslib:coordinates`) that requires only a D compiler. That makes integration into existing projects that do not need `htslib` much easier.

Also, I have a short list of minor, technical remarks:

  1. The returned type in [line 114](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L114) has a typo: there is an additional 's'.
  2. The array of identifiers `CoordSystemLabels` in [line 203](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L203) is a bit unsafe and not strictly required for two reasons:
    1. It can be generated by the compiler using `enum CoordSystemLabels = __traits(allMembers, CoordSystem);`.
    2. As far as I can tell, its only application is in line 376. The same result can be achieved safely using `cs.stringof.split('.')[$ - 1]` or, without `std.array.split`, using `cs.stringof[CoordSystem.stringof.length + 1 .. $]`.
  3. The function `unionImpl` in [line 326](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L326) actually computes the convex hull of the two intervals, which should be noted in the doc comment for completeness' sake.
  4. I have noted that you use operator overloading for union and intersection of `Interval`s. You may also add overloads for the `offset` function in both `Interval` and `Coordinate` with `auto opBinary(string op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)` and `auto opBinaryRight(string op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)`.
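To illustrate remark 2, here is a small sketch of compiler-generated labels. The members of `CoordSystem` below are assumed for the example, and `labelOf` is a hypothetical helper. It uses `std.conv.to!string` to get the member name, which is a different route from the `stringof` slicing mentioned above but leads to the same label.

```d
import std.conv : to;

// Assumed members for illustration; the real enum may be declared differently.
enum CoordSystem { zbho, zbc, obho, obc }

// Generated by the compiler, so it cannot drift out of sync with the enum.
enum string[] coordSystemLabels = [__traits(allMembers, CoordSystem)];

/// Hypothetical helper: label of a coordinate system known at compile time.
template labelOf(CoordSystem cs)
{
    enum string labelOf = cs.to!string;
}

void main()
{
    static assert(coordSystemLabels == ["zbho", "zbc", "obho", "obc"]);
    static assert(labelOf!(CoordSystem.obc) == "obc");

    // The same conversion also works for values only known at run time.
    auto cs = CoordSystem.zbho;
    assert(cs.to!string == "zbho");
}
```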

I enjoyed reading the manuscript. It highlights the issue clearly and presents the solution without getting lost in details. Ignoring typos at this stage, I have no remarks on it – keep going!

Cheers!

-- Arne

September 01, 2021
On 9/1/21 5:01 AM, Arne Ludwig wrote:
> I am happy to hear of your latest idea of creating type-safe coordinate systems. It's a great idea!
> 
> After reading the code on GitHub, I have only one major remark: IMHO, it would be great to separate the novel coordinate systems from any `htslib` dependencies ([see lines 47-50](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L47-L50)), as only a few auxiliary functions use both the coordinate systems and `htslib`. The greater goal I have in mind is to provide the coordinate systems in a separate DUB sub-package (e.g. `dhtslib:coordinates`) that requires only a D compiler. That makes integration into existing projects that do not need `htslib` much easier.

This is an absolutely **outstanding** idea. Those imports were only to reuse an htslib `chr:X-Y` string parsing function, but we can trivially rewrite this in native D to enable sub-package independence!
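A native replacement could be quite small. The following is only a sketch under stated assumptions, not dhtslib's implementation: `Region` and `parseRegion` are hypothetical names, only the full `contig:start-end` form is handled, and the numbers are returned exactly as written in the string, with no coordinate system attached.

```d
import std.conv : to;
import std.exception : enforce;
import std.string : indexOf, lastIndexOf;

/// Hypothetical result type: contig name plus the two numbers as written.
struct Region
{
    string contig;
    long start;
    long end;
}

/// Parses "contig:start-end". Simplified on purpose: a bare contig name,
/// an open-ended range, or thousands separators are not supported here.
Region parseRegion(string s)
{
    // Split at the last colon so contig names containing ':' still parse.
    immutable colon = s.lastIndexOf(':');
    enforce(colon > 0, "expected contig:start-end");

    auto range = s[colon + 1 .. $];
    immutable dash = range.indexOf('-');
    enforce(dash > 0, "expected start-end after the colon");

    // to!long throws a ConvException on non-numeric input.
    return Region(s[0 .. colon],
                  range[0 .. dash].to!long,
                  range[dash + 1 .. $].to!long);
}

void main()
{
    auto r = parseRegion("chr1:100-200");
    assert(r == Region("chr1", 100, 200));

    // The contig name itself may contain colons and dashes.
    auto h = parseRegion("HLA-A*01:01:5-50");
    assert(h.contig == "HLA-A*01:01" && h.start == 5 && h.end == 50);
}
```

In a typesafe design the parser would presumably return one of the coordinate-tagged interval types rather than raw numbers, so the caller cannot forget which convention the string used.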

> Also, I have a short list of minor, technical remarks:
> 
> 1. The returned type in [line 114](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L114) has a typo, there is an additional 's'.

Ahh, the curse of templates. Without 100% test coverage, errors that would fail to compile in non-template code always seem to sneak in. Thank you so much.

> 2. The array of identifiers `CoordSystemLabels` in [line 203](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L203) is a bit unsafe and not strictly required for two reasons:

A very excellent suggestion. I am still a metaprogramming novice.

> 3. The function `unionImpl` in [line 326](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L326) actually computes the convex hull of the two intervals which should be noted in the doc comment for completeness' sake.

Yes, we had some internal debate about the appropriate result of both union and intersect operations when the intervals do not overlap and the return type is not an array. We will leave it as is and document it as the convex hull in this case.
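The difference only shows up for disjoint intervals. A small sketch, using a hypothetical plain `Interval` (treated as zero-based half-open) and a hypothetical `hull` function rather than the library's `unionImpl`:

```d
import std.algorithm.comparison : max, min;

/// Hypothetical plain interval, treated as zero-based half-open.
struct Interval
{
    long start;
    long end;
}

/// Convex hull: the smallest single interval that contains both inputs.
Interval hull(Interval a, Interval b)
{
    return Interval(min(a.start, b.start), max(a.end, b.end));
}

void main()
{
    // Overlapping inputs: the hull is exactly the set union.
    assert(hull(Interval(0, 10), Interval(5, 20)) == Interval(0, 20));

    // Disjoint inputs: the hull also covers the gap [10, 30),
    // which belongs to neither input, so it is larger than the union.
    assert(hull(Interval(0, 10), Interval(30, 40)) == Interval(0, 40));
}
```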

> 4. I have noted that you use operator overloading for union and intersection of `Interval`s. You may also add overloads for the `offset` function in both `Interval` and `Coordinate` with `auto opBinary(string op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)` and `auto opBinaryRight(string op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)`.

Very nice. I do miss operator overloading in some of the other languages I explored recently.
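For reference, a sketch of how such overloads might look on a simplified, hypothetical `Coordinate` (a single `pos` field and an `offset` method are assumptions, not the library's definition). One deliberate deviation from the suggestion: `opBinaryRight` is restricted to `+` here, since `5 - coordinate` has no obvious meaning as an offset.

```d
import std.traits : isIntegral;

/// Simplified, hypothetical coordinate; the real type also carries its basis.
struct Coordinate
{
    long pos;

    /// Returns a new coordinate shifted by `off` positions.
    Coordinate offset(long off) const
    {
        return Coordinate(pos + off);
    }

    /// coordinate + n and coordinate - n
    auto opBinary(string op, T)(T off) const
        if ((op == "+" || op == "-") && isIntegral!T)
    {
        static if (op == "+")
            return offset(off);
        else
            return offset(-cast(long) off);
    }

    /// n + coordinate (subtraction from an integer is left undefined)
    auto opBinaryRight(string op, T)(T off) const
        if (op == "+" && isIntegral!T)
    {
        return offset(off);
    }
}

void main()
{
    auto c = Coordinate(10);
    assert((c + 5).pos == 15);
    assert((c - 3).pos == 7);
    assert((2 + c).pos == 12);

    // Adding two coordinates is meaningless and does not compile.
    static assert(!__traits(compiles, c + c));
}
```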

> I enjoyed reading the manuscript. It highlights the issue clearly and presents the solution without getting lost in details. Ignoring typos at this stage, I have no remarks on it – keep going!

Thanks again for this critical review. As you know we are really pleased with how D has accelerated our science and wish to share it with the world.

James