They wrote the fastest parallelized BAM parser in D (page 3)

March 31, 2015

Re: They wrote the fastest parallelized BAM parser in D

Posted by Chris
in reply to Chris

On Tuesday, 31 March 2015 at 13:31:33 UTC, Chris wrote:
> On Tuesday, 31 March 2015 at 11:04:50 UTC, Laeeth Isharc wrote:
>>
>>> As Andrew Brown pointed out, visualization is not behind Python's success. Its success lies in the fact that it's a language you can hack away in easily.
>>
>> Sounds right.  I am not in the camp that says it is a killer for D.  It would just be nice to have both at least a passable solution for visualization, and some way of making it interactive.  (The REPL might be one route).  The problem with separating the processes completely and just piping the output from D code that does the heavy lifting to a Python or Julia front end is it may make it more painful to play with and explore the data.  My interests are finance more than science, so that may lead to a different set of needs.  Finishing mathgl and writing D bindings for bokeh (take a look - it is pretty cool, particularly to be able to use the browser as client, acknowledging that it is a tradeoff) is not so much work.  But some help on bokeh particularly would be nice, as I fear picking one way of implementing the object structure and later finding it is a mistake.
>>
>>> the initial euphoria of being able to automatically rename files and extract value X from file Y soon gives way to frustration when it comes to performance.
>>
>> Yep.
>>
>>> The paper shows well that in a world where data processing is of utmost importance, and we're talking about huge sets of data, languages like Python don't cut it anymore.
>>
>> I could not agree more, and I do think the intersection of two trends creates tremendous opportunity for D.  It's also commonsensical to look at notable successes - and I hope it is not just my biases that lead me to think many of these are in just this kind of application.  Data sets keep getting larger (but not necessarily more information rich in dollar terms), and Moore's Law/memory speed+latency is not keeping pace.  This is exactly the kind of change that creeps up on you because not much changes in a few months (which is the kind of horizon many of us tend to think in).
>>
>> People say "what is D's edge", but my personal perception is "where is the competition for D" in this area.  It has to be native code/JIT, and I refuse to learn Java; it also should be plastic and lend itself to rapid iteration.
>>
>>> at the same time there's growing discontent among researchers, scientists and engineers as regards performance, simply because the data sets are becoming bigger and bigger every day and the algorithms are getting more and more refined. Sooner or later people will have to find new ways, out of sheer necessity.
>>
>> upvote.  I would love to see any references you have on this - not because it's not rather obvious to me, but because it is helpful when talking to other people.
>
> The article that gave rise to this thread is a good reference.
>
> I came from a slightly different angle: I looked for alternatives to Python, because I needed:
>
> 1. fast native execution (real time)
> 2. easy interfacing to C
> 3. cross-platform development
>
> (Modern convenience, templates, ranges etc. were bonuses I discovered bit by bit)
>
> As regards algorithms and data processing, most people in research use Matlab (proprietary) and Python. However, in my field they're useless when it comes to building data-driven systems (fast analysis, retraining of machine based on (slight) modifications), and putting computationally heavy algorithms into real world applications. Proof of concept is all it amounts to, usually.
>
> So D has a real chance here, because of
>
> 1. native code
> 2. modern convenience
> 3. templates, structs, mixins, ranges, std.algorithm etc.
> 4. interfacing to C libs
>
>>> Don't forget that "the state of the art" can change very quickly in IT and the name of the game is anticipating new developments rather than taking snapshots of the current state of the art and framing them. D really has a lot to offer for data processing and I wouldn't rule it out that more and more programmers will turn to it for this task.
>>
>> I fully agree.  If we started a section on use cases, would you be able to write a page or two on D's advantages in data processing?
>
> I think that Dicebot et al would have good examples.

It'd be nice if we had a dedicated data-analysis section and/or library. I'm almost sure that people working with massive amounts of data would find it by googling "efficient data analysis" or something like that.

Facebook probably has a wealth of data analysis examples / techniques, too.
