March 30, 2015
On Monday, 30 March 2015 at 18:04:58 UTC, george wrote:
>
>> .NET actually already has a foothold in bioinformatics, especially in user-facing software and in steering reading equipment and robots.
>>
>> So D needs a story against C# and F# (alongside WPF for data visualization) for these use cases.
>>
>> --
>> Paulo
>
> Though when it comes to open source bioinformatics projects, Perl and Python have a large foothold
> among most bioinformaticians. Most utilities that require speed are written in C and C++ (BLAST, HMMER, SAMTOOLS, etc.).
>
> I think D stands a good chance as a language of choice for bioinformatics projects.
>
> George

Yes, on the server side and in UNIX-based research.

However, I have learned in recent years that Windows-based systems are also used a lot, especially in controlling robots and doing the first processing steps and visualization.

At least in commercial research.

--
Paulo
March 30, 2015
> I did some image processing work with D and didn't find the lack of specific D tools for visualization a big issue.
>
> There is some advantage to being able to perform visualization tasks in the same language as you do the data processing work, but I wouldn't think this would be a major obstacle.

I personally prefer the model where I create a tool that takes some input and produces output in a suitable format that I can load into a proper statistical environment (R or Julia) for visualisation and manipulation. I would rather write a tool that performs a single task optimally and pipes its output to a different tool for the next task. This way I can mix and match tools and build flexible pipelines.

rawdata -> clean -> QC -> to format Y -> to format X -> tool A -> tool B -> visualize
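As a minimal illustration of this style, here is a sketch of such a pipeline using standard Unix tools as stand-ins for the real stages (the file names, fields, and threshold are invented for the example; each stage could later be replaced by a dedicated compiled tool):

```shell
# toy raw data: one "sample,measurement" record per line, plus a bad row
printf 'sample1,42\nsample2,17\nbadline\nsample3,99\n' > rawdata.csv

# clean: drop malformed rows (here: rows without exactly two fields)
awk -F, 'NF == 2' rawdata.csv |
# QC: keep only measurements above an arbitrary threshold
awk -F, '$2 > 20' |
# "to format Y": convert to TSV so R or Julia can read it directly
tr ',' '\t' > for_stats.tsv

cat for_stats.tsv
```

The point is that every arrow in the diagram is just a pipe, so any single stage can be rewritten in D without touching the rest.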

George
March 30, 2015
On Monday, 30 March 2015 at 20:25:33 UTC, CraigDillabaugh wrote:
> On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc wrote:
>>
> clip
>>
>> You're right about the lack of visualization being a shame. I have been thinking about porting Bokeh bindings to D.  There isn't much to it on the server side - all you need to do is build up the object model and translate it to JSON - but I have no time right now to do it all myself.
>>
> clip
>
> A comment on the visualization thing. Is this really a big issue?
[snip]

Yes, of course - why do you think Python + SciPy/NumPy has such a foothold in the scientific community? Visualisation is an important part of the data processing pipeline.

It's also why Matlab is so useful for those lucky enough to work for a company that can afford it.

bye,
lobo
March 31, 2015
On Monday, 30 March 2015 at 22:55:37 UTC, lobo wrote:
> On Monday, 30 March 2015 at 20:25:33 UTC, CraigDillabaugh wrote:
>> On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc wrote:
>>>
>> clip
>>>
>>> You're right about the lack of visualization being a shame. I have been thinking about porting Bokeh bindings to D.  There isn't much to it on the server side - all you need to do is build up the object model and translate it to JSON - but I have no time right now to do it all myself.
>>>
>> clip
>>
>> A comment on the visualization thing. Is this really a big issue?
> [snip]
>
> Yes, of course - why do you think Python + SciPy/NumPy has such a foothold in the scientific community? Visualisation is an important part of the data processing pipeline.
>
> It's also why Matlab is so useful for those lucky enough to work for a company that can afford it.
>
> bye,
> lobo

My point wasn't that visualization isn't important; it is that in most scientific computing it is very easy (and sensible) to separate the processing and visualization aspects. So a lack of D visualization tools should not hinder its value as a data processing tool.

For example, Hadoop is immensely popular for data processing, but it includes no visualization tools. I understand that is a slightly different domain, but there are similarities.

So in short, nice D visualization tools would certainly be helpful, but I don't think their absence should be a showstopper.
March 31, 2015
On Tuesday, 31 March 2015 at 02:31:58 UTC, Craig Dillabaugh wrote:
> On Monday, 30 March 2015 at 22:55:37 UTC, lobo wrote:
>> On Monday, 30 March 2015 at 20:25:33 UTC, CraigDillabaugh wrote:
>>> On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc wrote:
>>>>
>>> clip
>>>>
>>>> You're right about the lack of visualization being a shame. I have been thinking about porting Bokeh bindings to D.  There isn't much to it on the server side - all you need to do is build up the object model and translate it to JSON - but I have no time right now to do it all myself.
>>>>
>>> clip
>>>
>>> A comment on the visualization thing. Is this really a big issue?
>> [snip]
>>
>> Yes, of course - why do you think Python + SciPy/NumPy has such a foothold in the scientific community? Visualisation is an important part of the data processing pipeline.
>>
>> It's also why Matlab is so useful for those lucky enough to work for a company that can afford it.
>>
>> bye,
>> lobo
>
> My point wasn't that visualization isn't important; it is that in most scientific computing it is very easy (and sensible) to separate the processing and visualization aspects. So a lack of D visualization tools should not hinder its value as a data processing tool.
>
> For example, Hadoop is immensely popular for data processing, but it includes no visualization tools. I understand that is a slightly different domain, but there are similarities.
>
> So in short, nice D visualization tools would certainly be helpful, but I don't think their absence should be a showstopper.

Yes, I tried to pick my words carefully.  It is not a disaster, as someone seemed to imply, but it would be nice to have visualization, particularly for interactive exploration of data.  One is back to Walter's quote about a two-language combination being an indicator that something is lacking.
March 31, 2015
Visualisation is certainly not behind Python's success in bioinformatics, which predates IPython. If you look through journals, very few of the figures are done in Python (and none at all in Julia). It succeeded because it allows you to hack your way through massive text files, and it's not Perl.

One problem with using D instead of C or C++ for projects like this is that these projects are a few people developing software for many users, who often work on very old clusters where they don't have admin rights. Getting an executable to work for them is not trivial. Programs like samtools solve this by expecting people to compile them themselves, knowing they can rely on gcc being installed. But none of these clusters have a D compiler handy.

At my university, out-of-the-box executables for ldc don't run, gdc executables don't link with libc, and dmd sometimes complains that it can't find dmd.conf. And this is a fairly up-to-date and well-administered cluster; I know quite a few institutions are still on CentOS 5. Now, I can work to fix these problems for myself, but I can't expect a user to spend 3 hours compiling llvm, then ldc and various libraries to use my software, rather than just look for the C/C++ equivalent.

Yesterday I was asked if I'd rewrite my code in C++ to solve this problem - not really an option, as I don't know C++. I guess this is a fairly niche issue; D Learn kindly pointed me in the direction of VMs, which I think will solve most of my problems. The sambamba authors seem to be sharing Docker images (congrats on the paper, by the way!). But I think it is a factor to be considered when using D: disseminating software is trickier than with C/C++.

On Tuesday, 31 March 2015 at 03:30:09 UTC, Laeeth Isharc wrote:
> On Tuesday, 31 March 2015 at 02:31:58 UTC, Craig Dillabaugh wrote:
>> On Monday, 30 March 2015 at 22:55:37 UTC, lobo wrote:
>>> On Monday, 30 March 2015 at 20:25:33 UTC, CraigDillabaugh wrote:
>>>> On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc wrote:
>>>>>
>>>> clip
>>>>>
>>>>> You're right about the lack of visualization being a shame. I have been thinking about porting Bokeh bindings to D.  There isn't much to it on the server side - all you need to do is build up the object model and translate it to JSON - but I have no time right now to do it all myself.
>>>>>
>>>> clip
>>>>
>>>> A comment on the visualization thing. Is this really a big issue?
>>> [snip]
>>>
>>> Yes, of course - why do you think Python + SciPy/NumPy has such a foothold in the scientific community? Visualisation is an important part of the data processing pipeline.
>>>
>>> It's also why Matlab is so useful for those lucky enough to work for a company that can afford it.
>>>
>>> bye,
>>> lobo
>>
>> My point wasn't that visualization isn't important; it is that in most scientific computing it is very easy (and sensible) to separate the processing and visualization aspects. So a lack of D visualization tools should not hinder its value as a data processing tool.
>>
>> For example, Hadoop is immensely popular for data processing, but it includes no visualization tools. I understand that is a slightly different domain, but there are similarities.
>>
>> So in short, nice D visualization tools would certainly be helpful, but I don't think their absence should be a showstopper.
>
> Yes, I tried to pick my words carefully.  It is not a disaster, as someone seemed to imply, but it would be nice to have visualization, particularly for interactive exploration of data.  One is back to Walter's quote about a two-language combination being an indicator that something is lacking.

March 31, 2015
On Monday, 30 March 2015 at 20:28:11 UTC, Paulo Pinto wrote:
> On Monday, 30 March 2015 at 18:04:58 UTC, george wrote:
>>
>>> .NET actually already has a foothold in bioinformatics, especially in user-facing software and in steering reading equipment and robots.
>>>
>>> So D needs a story against C# and F# (alongside WPF for data visualization) for these use cases.
>>>
>>> --
>>> Paulo
>>
>> Though when it comes to open source bioinformatics projects, Perl and Python have a large foothold
>> among most bioinformaticians. Most utilities that require speed are written in C and C++ (BLAST, HMMER, SAMTOOLS, etc.).
>>
>> I think D stands a good chance as a language of choice for bioinformatics projects.
>>
>> George
>
> Yes on the server side and UNIX based research.
>
> However, I have learned in recent years that Windows-based systems are also used a lot, especially in controlling robots and doing the first processing steps and visualization.
>
> At least in commercial research.
>
> --
> Paulo

Yes, to the benefit of literally no one. To be fair, it's not a problem of the operating system, just that special-purpose GUI programmes for scientific work always seem to be utterly dreadful.
"Hey, we need to record some time series and show a spectrum on the fly." "OK, great, let's commission a closed-source Windows GUI application with its own proprietary file format. Sure, it'll crash once a day and have scientifically important parameters hard-coded and undocumented, but at least you can point and click!"

It seems to be true across the board in government research facilities, pharmaceutical companies, most of academia, and so on: enormous piles of proprietary vomit being propped up by an endless stream of uninterested and semi-incompetent programmers, steadily digging their way to job security.
March 31, 2015
On Tuesday, 31 March 2015 at 08:09:00 UTC, Andrew Brown wrote:
> Visualisation is certainly not behind Python's success in bioinformatics, which predates IPython. If you look through journals, very few of the figures are done in Python (and none at all in Julia). It succeeded because it allows you to hack your way through massive text files, and it's not Perl.
>
> One problem with using D instead of C or C++ for projects like this is that these projects are a few people developing software for many users, who often work on very old clusters where they don't have admin rights. Getting an executable to work for them is not trivial. Programs like samtools solve this by expecting people to compile them themselves, knowing they can rely on gcc being installed. But none of these clusters have a D compiler handy.
>
> At my university, out-of-the-box executables for ldc don't run, gdc executables don't link with libc, and dmd sometimes complains that it can't find dmd.conf. And this is a fairly up-to-date and well-administered cluster; I know quite a few institutions are still on CentOS 5. Now, I can work to fix these problems for myself, but I can't expect a user to spend 3 hours compiling llvm, then ldc and various libraries to use my software, rather than just look for the C/C++ equivalent.
>
> Yesterday I was asked if I'd rewrite my code in C++ to solve this problem - not really an option, as I don't know C++. I guess this is a fairly niche issue; D Learn kindly pointed me in the direction of VMs, which I think will solve most of my problems. The sambamba authors seem to be sharing Docker images (congrats on the paper, by the way!). But I think it is a factor to be considered when using D: disseminating software is trickier than with C/C++.

Building LDC and its dependencies isn't that difficult, but it was still a pain to have to do that just to compile my code for the cluster.

There needs to be some sort of bootstrap script, downloads included, that can go from a bare-bones C++ toolchain to a working D compiler. Or even just some executables online, compiled against an ancient glibc.
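For what it's worth, here is a sketch of the first step such a script would need: detecting whatever D compiler is already on the machine before deciding what to download. The function name is made up, and the actual download-and-build fallback is only indicated by a comment, not implemented:

```shell
#!/bin/sh
# Look for a D compiler already on PATH, preferring ldc2, then gdc, then dmd.
find_dcompiler() {
    for dc in ldc2 gdc dmd; do
        if command -v "$dc" >/dev/null 2>&1; then
            printf '%s\n' "$dc"
            return 0
        fi
    done
    # A full bootstrap script would fall back here to fetching a
    # prebuilt compiler release, or building one from source.
    echo "no D compiler found" >&2
    return 1
}

if DC=$(find_dcompiler); then
    echo "using $DC"
fi
```

The hard part, of course, is the fallback branch - that is where glibc versions and missing admin rights bite.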
March 31, 2015
On Monday, 30 March 2015 at 18:23:31 UTC, Russel Winder wrote:
> On Mon, 2015-03-30 at 18:04 +0000, george via Digitalmars-d wrote:
>> > .NET actually already has a foothold in bioinformatics, especially in user-facing software and in steering reading equipment and robots.
>> > 
>> > So D needs a story against C# and F# (alongside WPF for data visualization) for these use cases.
>> > 
>> > --
>> > Paulo
>
> Paulo,
>
> Can you send me some pointers to this stuff?
>
>> 
>> Though when it comes to open source bioinformatics projects, Perl and Python have a large foothold
>> among most bioinformaticians. Most utilities that require speed are written in C and C++ (BLAST, HMMER, SAMTOOLS, etc.).
>> 
>> I think D stands a good chance as a language of choice for bioinformatics projects.
>> 
>> George
>
> My "prejudice", based on training people in Python and C++ over the
> last few years, is that Python and C++ have a very strong position in
> the bioinformatics community, with the use of IPython (now becoming
> Jupyter) increasing and solidifying the Python position.
>
> D's position is quite weak here because one of the important things is
> visualising data, something SciPy/Matplotlib are very good at. D has
> no real play in this arena and so there is no way (currently) of
> creating a foothold. Sad, but…

As Andrew Brown pointed out, visualization is not behind Python's success. Its success lies in the fact that it's a language you can hack away in easily. Almost everybody who has to do some data processing (as most researchers do these days) and has limited or no experience with programming will opt for Python: it's easy (at first!), well-documented, and everyone else uses it. However, the initial euphoria of being able to automatically rename files and extract value X from file Y soon gives way to frustration when it comes to performance.

The paper shows well that in a world where data processing is of the utmost importance, and we're talking about huge sets of data, languages like Python don't cut it anymore. Two things are happening at the moment: on the one hand, people still use Python for various reasons (see above and hundreds of posts on this forum); on the other, there's growing discontent among researchers, scientists and engineers as regards performance, simply because the data sets are becoming bigger every day and the algorithms more and more refined. Sooner or later people will have to find new ways, out of sheer necessity.

Don't forget that "the state of the art" can change very quickly in IT, and the name of the game is anticipating new developments rather than taking snapshots of the current state of the art and framing them. D really has a lot to offer for data processing, and I wouldn't rule out that more and more programmers will turn to it for this task.
March 31, 2015
> As Andrew Brown pointed out, visualization is not behind Python's success. Its success lies in the fact that it's a language you can hack away in easily.

Sounds right.  I am not in the camp that says it is a killer for D.  It would just be nice to have at least a passable solution for visualization, and some way of making it interactive (the REPL might be one route).  The problem with separating the processes completely and just piping the output from D code that does the heavy lifting to a Python or Julia front end is that it may make it more painful to play with and explore the data.  My interests are finance more than science, so that may lead to a different set of needs.  Finishing mathgl and writing D bindings for bokeh (take a look - it is pretty cool, particularly being able to use the browser as the client, acknowledging that it is a tradeoff) is not so much work.  But some help on bokeh particularly would be nice, as I fear picking one way of implementing the object structure and later finding it is a mistake.

> the initial euphoria of being able to automatically rename files and extract value X from file Y soon gives way to frustration when it comes to performance.

Yep.

> The paper shows well that in a world where data processing is of utmost importance, and we're talking about huge sets of data, languages like Python don't cut it anymore.

I could not agree more, and I do think the intersection of two trends creates a tremendous opportunity for D.  It's also commonsensical to look at notable successes - and I hope it is not just my biases that lead me to think many of these are in just this kind of application.  Data sets keep getting larger (but not necessarily more information-rich in dollar terms), and Moore's Law and memory speed/latency are not keeping pace.  This is exactly the kind of change that creeps up on you, because not much changes in a few months (which is the kind of horizon many of us tend to think in).

People say "what is D's edge", but my personal perception is "where is the competition for D" in this area.  It has to be native code/JIT, and I refuse to learn Java; it also should be plastic and lend itself to rapid iteration.

> at the same time there's growing discontent among researchers, scientists and engineers as regards performance, simply because the data sets are becoming bigger and bigger every day and the algorithms are getting more and more refined. Sooner or later people will have to find new ways, out of sheer necessity.

Upvote.  I would love to see any references you have on this - not because it isn't already obvious to me, but because they are helpful when talking to other people.

> Don't forget that "the state of the art" can change very quickly in IT, and the name of the game is anticipating new developments rather than taking snapshots of the current state of the art and framing them. D really has a lot to offer for data processing, and I wouldn't rule out that more and more programmers will turn to it for this task.

I fully agree.  If we started a section on use cases, would you be able to write a page or two on D's advantages in data processing?