Thread overview
[OT] Application case study comparing Java, Go, and C++
Feb 28, 2019
Jon Degenhardt
Feb 28, 2019
Seb
Feb 28, 2019
Jon Degenhardt
Mar 01, 2019
Pjotr Prins
Mar 01, 2019
Jon Degenhardt
February 28, 2019
This paper may be of interest to people here:

"A comparison of three programming languages for a full-fledged next-generation sequencing tool", P. Costanza, C. Herzeel, W. Verachtert
https://doi.org/10.1101/558056

The paper compares implementations of a tool operating on SAM/BAM files (bioinformatics) from a performance perspective. The focus is on comparing the garbage collection (GC) schemes used in Go and Java with reference counting in C++. The GC-based implementations were materially faster.

I'm not familiar with the authors or the implementations, so I cannot say how well the implementations were done. However, it appears to be a useful case study, and the authors do provide a fair bit of analysis in the paper.

There's also a Reddit thread: https://www.reddit.com/r/programming/comments/avsfc6/performance_comparison_of_go_c_and_java_for/
February 28, 2019
On Thursday, 28 February 2019 at 20:48:01 UTC, Jon Degenhardt wrote:
> [...]

I wouldn't give much weight to this paper. It hasn't been peer reviewed, and I doubt it would pass peer review. A quick example:

"It [their tool] can be used as a drop-in replacement for many operations implemented by SAMtools [...]". Yet no performance comparison was done against samtools (nor against any other tool except their own implementations). I find this pretty shocking, because the entire purpose of their paper is performance...

For reference, samtools is the de facto standard for a reason (yes, it's old and written in C).

Though, to be fair, sambamba (written in D) is faster than the C "standard" implementation:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765878
February 28, 2019
On Thursday, 28 February 2019 at 22:58:54 UTC, Seb wrote:
> [...]

They do have benchmark comparisons against GATK 4 in another paper:
"elPrep 4: A multithreaded framework for sequence analysis"
https://doi.org/10.1371/journal.pone.0209523

I'm not so familiar with these tool sets. How does GATK 4 stack up against other tools?

From the paper it looks like many of the performance gains over GATK 4 resulted from architecture and algorithm changes, so it may not be valid from the perspective of comparing C++/Go/Java and GC vs reference counting.
March 01, 2019
On Thursday, 28 February 2019 at 23:50:44 UTC, Jon Degenhardt wrote:
> [...]

As the co-author of sambamba, and having a pretty good understanding of samtools, I call BS on the Go/C++/Java comparison paper mentioned above. It is all about the implementation, i.e., the programmer. Saying that Go is faster than C++ makes no sense to me (go figure). Maybe the C++ implementation should have used a ring buffer like sambamba does in D (Artem did the smart thing).

One reason I like chess is that it is an honest comparison of skill. Have two people play and you can tell quickly who is superior. In computing we don't have such an easy framework. You can compare tools, i.e., implementations, but turning that into a language comparison is bound to be flawed. The problem with that comparison paper is the way they wrote it up.

March 01, 2019
On Friday, 1 March 2019 at 12:46:12 UTC, Pjotr Prins wrote:
> [...]

Thanks for the feedback (both Seb and Pjotr).

It's too bad the paper doesn't provide more meaningful value, as application-level comparisons of alternate programming environments are quite rare. Application-level benchmarks are useful in conjunction with the micro-benchmarks that are more the norm. More important, in my view. But if the work isn't well founded, or at least can't be shown to be well founded, then it's not useful. If there were a number of similar results, it might be seen as contributing evidence. As a single work it would always need to be viewed skeptically, but if people who have expertise in the application area don't find it worthy, well...

--Jon