December 26, 2014 Re: D language manipulation of dataframe type structures
Posted in reply to Jay Norwood

On Wednesday, 25 September 2013 at 03:41:36 UTC, Jay Norwood wrote:
> I've been playing with the Python pandas app, which enables interactive manipulation of tables of data in their dataframe structure, which they say is similar to the structures used in R.
>
> It appears pandas has laid claim to being a faster version of R, though what it can deliver is basically limited to the operations it can push down into the underlying Cython code.
>
> Has anyone written an example app in D that manipulates dataframe type structures?
Pandas has numpy as "backend" which does a lot of heavy lifting, so first things first -- imo D needs a fast and flexible blas/lapack-compatible multi-dimensional rectangular array library that could later serve as backend for pandas-like libraries.
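For concreteness, here is a minimal sketch of the kind of strided 2-D view such an array library would be built around: one flat buffer plus shape and a row stride, so that slicing and transposing can be O(1) views, as in numpy. All names are hypothetical; this is not an existing D library.

```d
import std.stdio;

// Hypothetical minimal strided 2-D view over a flat buffer.
struct Matrix
{
    double[] data;    // flat row-major storage
    size_t rows, cols;
    size_t rowStride; // elements to skip to move down one row

    ref double opIndex(size_t i, size_t j)
    {
        return data[i * rowStride + j];
    }

    // A view of every other row: same buffer, doubled stride, no copy.
    Matrix everyOtherRow()
    {
        return Matrix(data, (rows + 1) / 2, cols, rowStride * 2);
    }
}

void main()
{
    auto m = Matrix(new double[6], 2, 3, 3);
    m[1, 2] = 42.0;
    writeln(m[1, 2]); // 42
}
```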
December 27, 2014 Re: Data frames in D?
Posted in reply to Russel Winder

> REPLs are over-hyped and have become a fashion touchstone that few dare
> argue against for fear of being denounced as un-hip. REPLs have their
> place, but in the main are nowhere near as useful as people claim.
> IPython Notebooks on the other hand are a balance between
> editor/execution environment and REPL that really has a lot going for
> it.

Fair argument against an earlier poster but from my perspective, all I meant is that the absence of a shell is not a good reason to write off D for exploring data. There is already a shell that could be developed, and one can call D from Python/Julia in a notebook.

> Stats folks using R, love R and hate Python. Stats folk using Python,
> love Python and hate R. In the end it's all about what you know and can
> use to get the job done. To be frank (as in open rather than Jill), D
> hasn't got the infrastructure to compete with either R or Python and so
> is a non-starter in the data science arena.

About the future you may or may not be right. (Whether it is commercially interesting to run workshops in D for stats people is certainly an interesting question, although given the way technology unfolds it may be less relevant to the question I am most interested in answering today.)

I want to do things in D myself, and I would find a data frame helpful. I understand you don't program much in D these days, and that's a reasonable decision, but for those who want to use it to do quantish things with dataframes, perhaps we could think about how to approach the problem. And having weighed your warnings, if you have any suggestions on how best to implement this, I would be open to those also.

Laeeth.
December 27, 2014 Data Frames in D - let's not wait for linear algebra; useful today in finance and Internet of Things
Posted in reply to aldanor

On Friday, 26 December 2014 at 21:31:00 UTC, aldanor wrote:
> On Wednesday, 25 September 2013 at 03:41:36 UTC, Jay Norwood wrote:
>> I've been playing with the Python pandas app, which enables
>> interactive manipulation of tables of data in their dataframe
>> structure […]
>
> Pandas has numpy as "backend" which does a lot of heavy lifting, so
> first things first -- imo D needs a fast and flexible
> blas/lapack-compatible multi-dimensional rectangular array library
> that could later serve as backend for pandas-like libraries.

I don't believe I agree that we need a perfect multi-dimensional rectangular array library to serve as a backend before thinking and doing much on data frames (although it will certainly be very useful when ready).

First, it seems we do have matrices, even if lacking complete functionality for linear algebra and the like. There is a chicken-and-egg aspect to the development of tools: it is rarely the case that one kind of tool totally precedes another, and there are often complementarities and dynamic effects between the different stages. If one waits until everything one needs has been done for one, one won't get much done.

Secondly, much of what Pandas is useful for is not exactly rocket science from a quantitative perspective; it is just the kind of thing that is very useful when working with data sets of a decent size. The concepts seem to me to fit very well with std.algorithm and std.range, and can be thought of as just a way to bring out the power of the tools we already have when working with data in the world as it is. See here for an example of just how simple - remember Excel pivot tables? (A minimal D sketch of this kind of grouping follows at the end of this post.)

http://pandas.pydata.org/pandas-docs/stable/groupby.html

Thirdly, one of the reasons Pandas is popular is that it is written in C/Cython and is very fast - significantly faster than Julia. One might hit roadblocks down the line with the Global Interpreter Lock and the difficulty of processing larger sets quickly in Python, but at least this stage is fast and easy. So people do care about speed, but they also care about the frictions being taken away, so that they can spend their energies on the problem at hand. I.e. a dataframe will be helpful, in my view.

Processing of log data is a growing domain - partly from the internet, but also from the internet of things. See below for one company using D to process logs:

http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-ad-data-processed-daily-says-size-matters/
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

A poster on this forum is already using D as a library to call from R (from Reddit), which brings home the point that D does not need to be able to do every part of the process in order to take over some of the heavy work (see the small R-interop sketch after this post):

"[–]bachmeier 6 points 1 month ago

I call D shared libraries from R. I've put together a library that offers similar functionality to Rcpp. I've got a presentation showing its use on Linux. Both the presentation and library code should be made available within the next couple of days.

My library makes available the R API and anything in Gretl. You can allocate and manipulate R objects in D, add R assert statements in your D code, and so on. What I'm working on now is calling into GSL for optimization.

These are all mature libraries - my code is just an interface. It's generally easy to call any C library from D, and modern Fortran, which provides C interoperability, is not too much harder."

See here for just one use case in the internet of things. They don't use D, but maybe they should have. It shows an example where at least the log processing could easily be handled by what we have, plus a few small additional data structures - even if people use outside libraries for the machine learning part.

http://www.forbes.com/sites/danwoods/2014/11/04/how-splunk-caught-wall-streets-eye-by-taming-the-messy-world-of-iot-data/3/

"By using Splunk software, Hrebek said that his division's leader product is able to offer customers a real-time view of operations on a train and to use machine learning to suggest optimal strategies for driving trains along various routes. Just shaving a small percentage off of fuel costs can mean huge savings for a railroad.

Why Doesn't BI Work for the IoT? In both of the use cases just mentioned, for years, existing business intelligence technology had been applied to the problem of making sense of the data with little success. The problem is not that it is impossible to use traditional ETL technology and an RDBMS or, more commonly, spreadsheets to get something working so that some of the data becomes useful. It is just that the effort involved is great and the technical effort involved in maintaining such systems is massive. Hrebek compared using spreadsheets for IoT data to living in the ninth circle of hell in Dante's Inferno, because the process is so tedious and error prone.

Machine data is different from the flat files that are the paradigm for BI technology, which works in rows and columns. Also, machine data can be naturally organized into a time series, but this is not the default way that a spreadsheet or an RDBMS works.

Why Does Splunk Work for the IoT? IoT data essentially looks exactly the same as the machine data from servers in a data center that Splunk Enterprise was initially created to handle. The software allows you to:

- Automatically parse fields
- Identify several different types of records as a related group
- Organize and store records by timestamp
- Create dashboards and analytics that are updated in real time

With each successive release, Splunk is making the process of parsing machine data as automatic and machine-assisted as possible. Its software handles variations of IoT data by allowing a simple mapping of a field into a standard name. For example, the GPS coordinates of a train car might be recorded in six or seven different ways in various forms of machine data, but can be unified via Splunk Enterprise. Splunk software allows these mappings to be implemented and maintained with a minimum of effort.

The bottom line is that there is no way to avoid the imperfections that naturally occur in the real world. We are always going to have lots of trees and to have to deal with them both as individuals and as a forest, in a normalized aggregate form. The reason Splunk is making such inroads in IoT applications is that it can handle both the trees and the forest and turn the information from the real world into a clear view of what is happening that allows useful models of reality to be created. If you are building an IoT application, you must find a way to handle the messy nature of the real world."

Many more similar opportunities for D here:

https://www.google.de/search?q=internet+of+things+massive+log+processing+growth&btnG=Search&oe=utf-8&gws_rd=cr

Laeeth.
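A minimal sketch of the pivot-table/groupby shape of operation in plain D, using only an associative array and std.algorithm; the record layout and sample rows are made up for illustration:

```d
import std.algorithm : sort;
import std.stdio;

struct Trade { string symbol; double price; int qty; }

void main()
{
    // Made-up sample rows standing in for a parsed log or CSV.
    auto trades = [
        Trade("IBM",  161.2, 100),
        Trade("AAPL", 112.0, 200),
        Trade("IBM",  160.9, 300),
    ];

    // Group by symbol and sum a column: the pivot-table /
    // df.groupby("symbol")["qty"].sum() shape of operation.
    int[string] totalQty;
    foreach (t; trades)
        totalQty[t.symbol] += t.qty;  // missing keys start at int.init

    foreach (symbol; totalQty.keys.sort())
        writefln("%s: %s", symbol, totalQty[symbol]);
}
```

And a sketch of the kind of R-to-D bridge bachmeier describes - not his library, just the bare mechanism of exposing a D function to R through the C interface (the function name and build commands are illustrative):

```d
// dsum.d -- build as a shared library, e.g. (illustrative flags):
//   dmd -shared -fPIC -of=libdsum.so dsum.d
// then from R:
//   dyn.load("libdsum.so")
//   .C("dsum", as.double(x), as.integer(length(x)), result = double(1))$result

// Plain C-style code with no GC allocation, so the D runtime does not
// need to be initialised for this simple case. R's .C interface passes
// every argument as a pointer.
extern (C) void dsum(const(double)* x, const(int)* n, double* result)
{
    double total = 0;
    foreach (i; 0 .. *n)
        total += x[i];
    *result = total;
}
```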
December 27, 2014 Re: Data frames in D?
Posted in reply to Laeeth Isharc

On Sat, 2014-12-27 at 01:33 +0000, Laeeth Isharc via Digitalmars-d-learn wrote:
[…]
> Fair argument against an earlier poster but from my perspective, all I
> meant is that the absence of a shell is not a good reason to write off
> D for exploring data. […]

I think we are agreeing. A very lightweight editor and executor of code fragments is as good as, if not better than, a one-line REPL.

[…]
> About the future you may or may not be right. […]

Part of the problem here is tribalism. Most data science people want to use the same tools that other data science people use, even though the issue is to differentiate themselves. Currently R and Python are the tools of the moment. Julia hasn't made deep penetration, but is totally focused on trying to replace R and Python for data analysis.

> I want to do things in D myself, and I would find a data frame
> helpful. […]

A BLAS library is certainly a precursor, as are very good data visualization tools: graphs, diagrams, etc. It isn't the language per se that makes R, Python and, increasingly, Julia, but the fact that the results of the analysis can be rendered graphically. I know much less about R, but the whole Python/NumPy thing works only because it is faster and easier than Python alone. NumPy performance is actually quite poor: I am finding I can write Python + Numba code that hugely outperforms the same algorithm using NumPy.

Go is making great play of the fact that it can attract Python people who use Python for system-style programming. Go has Gtk and Qt for graphics; D has Gtk, but no real Qt. But in the end D isn't getting the traction as the C/Python replacement that Go has. Go has masses of people putting a lot of effort into the Web. It's not the ideas, it's the number of people getting on board and doing things.

To get some traction in any of these areas - finance data analysis and model building, or systems activity - it is all about people doing it, publicizing it and making things available for others to use. Taking the R array types and Pandas' DataFrames and TimeSeries, and building and using D versions, is going to be needed for D to get traction (a tiny structural sketch follows this post). But it needs to be better than Julia in some way that makes others sit up and take notice. There has to be the ability to create some hype.

-- 
Russel.
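To make "building D versions" a little more concrete, here is a deliberately tiny, numeric-only sketch of the column-oriented layout a D DataFrame could start from - named columns over a shared row index, as in Pandas. All names and sample values are made up; missing-data handling and heterogeneous column types are omitted:

```d
import std.stdio;

// Hypothetical minimal frame: numeric columns only, just the core layout.
struct DataFrame
{
    string[] index;            // row labels (e.g. dates)
    double[][string] columns;  // column name -> column values

    double[] opIndex(string name) { return columns[name]; }
}

void main()
{
    DataFrame df;
    df.index = ["2014-12-24", "2014-12-26"];
    df.columns["close"]  = [100.5, 101.2];   // made-up sample values
    df.columns["volume"] = [1.2e6, 0.9e6];
    writeln(df["close"]);                    // [100.5, 101.2]
}
```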
December 27, 2014 Re: Data Frames in D - let's not wait for linear algebra; useful today in finance and Internet of Things
Posted in reply to Laeeth Isharc

On Sat, 2014-12-27 at 06:21 +0000, Laeeth Isharc via Digitalmars-d-learn wrote:
[…]
> I don't believe I agree that we need a perfect multi-dimensional
> rectangular array library to serve as a backend before thinking and
> doing much on data frames (although it will certainly be very useful
> when ready).

Also, if there is a ready-made C or C++ library that can be made use of, do it.

> First, it seems we do have matrices, even if lacking complete
> functionality for linear algebra and the like. […]

In the end there is no point in a language/compiler/editor if there is not the perceived support for the things that large numbers of people want to do. C, C++, C#, F#, Java, Scala, Groovy, Python, R, Julia and Go all find themselves with a vocal audience doing things. The language evolves with the libraries and "end user" applications. In the end it is all about people doing things with a language and hyping it up.

> Secondly, much of what Pandas is useful for is not exactly rocket
> science from a quantitative perspective […]
>
> http://pandas.pydata.org/pandas-docs/stable/groupby.html

I recently discovered that a number of hedge funds work solely on moving-average-based algorithmic trading. NumPy, SciPy and Pandas all have variations on this basic algorithm (a D sketch of it follows this post). And isn't "group by" standard in all languages? Certainly Python, Groovy, Scala, Haskell, …

> Thirdly, one of the reasons Pandas is popular is that it is written in
> C/Cython and is very fast. […]

Perceived to be fast. In fact it isn't anything like as fast as it should be. NumPy (which underpins Pandas and provides all the data structures and basic algorithms) is actually quite slow. I have ranted many times about the GIL in Python, and on two occasions spent 2 or 3 hours trying to convince Guido of the lunacy of a GIL-based interpreter in 2014. Armin Rigo has an STM-based version in PyPy and CPython and has shown it can work just fine. Guido, though, is I/O bound rather than CPU bound in his work, and doesn't see a need for anything other than multiprocessing for accessing parallelism in Python. Sadly, it can be shown that multiprocessing is slow and inefficient at what it does, and it needs replacing. NumPy's approach to parallelism is nice as an abstraction, but doesn't really "cut it" unless you do not know any better.

In principle this is fertile territory for a new language to take the stage - hence Julia. I fear D has missed the boat of this opportunity now. On the other hand, if some real data science people begin to do data science with D and show that more can be done with less, and without loss of functionality, then there is an opportunity for marketing and possible traction in the market.

> Processing of log data is a growing domain - partly from the internet,
> but also from the internet of things. See below for one company using
> D to process logs:
>
> http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-ad-data-processed-daily-says-size-matters/
> http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

This is worth hyping up; it should be front and centre on the dlang pages, along with Facebook funding bug fixes. Having the tweets list is great but too ephemeral - the "D is for Data Science" tweet will fade too quickly.

> A poster on this forum is already using D as a library to call from R
> (from Reddit) […]

Funny, isn't it, how every language must do everything - so for every new language you have to have a new build system and a new event loop. The problem, though, is that C is the language of extension for R, Python, … even though it is a language that should now only be used for working "right on the metal", if at all.

> These are all mature libraries - my code is just an interface. It's
> generally easy to call any C library from D, and modern Fortran, which
> provides C interoperability, is not too much harder.

But if all the libraries are C, C++ and Fortran, is there any value-add role for D? Lots of C++ systems embed Python or Lua for dynamic scripting capability; lots of Python and R systems call out to C. This seems a well-established milieu. Is there a good way for D to establish a permanent foothold in an evolutionary way? Certainly it cannot be a revolutionary one.

[…]

The Splunk stuff is just an example of using dataflow networks for processing data rather than using SQL. The "Big Data using JVM" community are already on this road, cf. various proprietary frameworks running over Hadoop and Spark. Dataflow frameworks are likely to be the next big thing. Java and Groovy have established offerings; no other language really does, other than Go. If D could get a really good dataflow framework before C++, Rust, etc., then that might be a route to traction (a tiny std.concurrency sketch of the idea also follows this post).

-- 
Russel.
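Since moving averages came up: a minimal sketch of a simple moving average as a lazy std.range pipeline (window size and prices are illustrative):

```d
import std.algorithm : map, sum;
import std.range : slide;
import std.stdio;

void main()
{
    auto prices = [10.0, 11.0, 12.0, 11.5, 13.0, 12.5];
    enum window = 3;

    // Each length-3 window of prices is averaged lazily; no
    // intermediate arrays are allocated.
    auto sma = prices.slide(window).map!(w => w.sum / window);
    writeln(sma); // [11, 11.5, 12.1667, 12.3333]
}
```

And the dataflow idea, sketched with nothing but std.concurrency message passing - two stages connected by a mailbox. A real dataflow framework would add graph wiring, operators and backpressure; this only shows the core shape:

```d
import std.concurrency;
import std.stdio;

// Downstream stage: sums everything sent to it, prints on "done".
void summer()
{
    double total = 0;
    for (bool running = true; running;)
        receive(
            (double v) { total += v; },
            (string s) { if (s == "done") running = false; }
        );
    writeln("total: ", total); // total: 7
}

void main()
{
    auto sink = spawn(&summer);

    // Upstream stage: push values through the channel, then signal EOF.
    foreach (v; [1.5, 2.5, 3.0])
        sink.send(v);
    sink.send("done");
}
```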
December 27, 2014 Re: Data frames in D?
Posted in reply to Russel Winder

On Saturday, 27 December 2014 at 10:54:01 UTC, Russel Winder via Digitalmars-d-learn wrote:
> I know much less about R, but the whole Python/NumPy thing works
> only because it is faster and easier than Python alone. NumPy
> performance is actually quite poor. I am finding I can write Python +
> Numba code that hugely outperforms that same algorithm using NumPy.
There will surely be some algorithms where numba/cython would do better (especially if they cannot be easily vectorized), but that's not the point. The thing about numpy is that it provides a unified, accepted interface (plus a reasonable set of reasonably fast tools and algorithms) for arrays and buffers for a multitude of scientific libraries (scipy, pytables, h5py, pandas, scikit-*, just to name a few), which then makes it much easier to use them together and write your own.
December 27, 2014 Re: Data Frames in D - let's not wait for linear algebra; useful today in finance and Internet of Things
Posted in reply to Russel Winder

On Saturday, 27 December 2014 at 13:39:59 UTC, Russel Winder via Digitalmars-d-learn wrote:
> I have ranted many times about the GIL in Python, and on two occasions
> spent 2 or 3 hours trying to convince Guido of the lunacy of a
> GIL-based interpreter in 2014. Armin Rigo has an STM-based version in
> PyPy and CPython and has shown it can work just fine.

I wonder how TSX would work with the GIL. I suppose most GIL locks are short-lived enough to be covered by TSX before it fails and takes a lock.

> In principle this is fertile territory for a new language to take the
> stage - hence Julia. I fear D has missed the boat of this opportunity
> now. […]

To be fair, you also have to compete against commercial solutions such as SPSS, SAS and others. Then you have OpenMP for C++ and Fortran, which it will be difficult for D to compete with in terms of performance vs effort.
December 27, 2014 Re: Data frames in D?
Posted in reply to aldanor

On Sat, 2014-12-27 at 13:46 +0000, aldanor via Digitalmars-d-learn wrote:
> There will surely be some algorithms where numba/cython would do
> better (especially if they cannot be easily vectorized), but that's
> not the point. The thing about numpy is that it provides a unified,
> accepted interface (plus a reasonable set of reasonably fast tools and
> algorithms) for arrays and buffers for a multitude of scientific
> libraries (scipy, pytables, h5py, pandas, scikit-*, just to name a
> few), which then makes it much easier to use them together and write
> your own.

Agreed: it is not NumPy that is the win, it is PyTables, Pandas, SciKit-Learn etc. These are the standard tools because they are domain specific and aimed at the audience. The audience neither knows nor cares that NumPy is actually not very good, because they have the tools they need and nothing to compare them against - unless Julia gets real traction, or a language like D can use its one or two entries in the field to create a usable set of libraries.

As with the Vibe.d and Dub experience: pick a field, write and use something that does the job better than anything else in that field, then market the experience.

-- 
Russel.
December 27, 2014 Re: Data Frames in D - let's not wait for linear algebra; useful today in finance and Internet of Things
Posted in reply to Ola Fosheim Grøstad

On Sat, 2014-12-27 at 13:53 +0000, via Digitalmars-d-learn wrote:
[…]
> I wonder how TSX would work with the GIL. I suppose most GIL locks are
> short-lived enough to be covered by TSX before it fails and takes a
> lock.

For Intel chips this is good stuff (stolen from Sun's Rock processor). Hardware-supported transactional memory easily beats software transactional memory, but the latter is portable.

[…]
> To be fair, you also have to compete against commercial solutions such
> as SPSS, SAS and others.

It is relatively easy to compete against these generally. Small organizations (which actually make up the bulk of users) prefer not to pay the extortionate fees. Anecdotal evidence clearly shows a mass move from Matlab to Python+NumPy+… - the anecdotes being my Python workshops last year, where 40%+ of attendees were in this position.

> Then you have OpenMP for C++ and Fortran, which it will be difficult
> for D to compete with in terms of performance vs effort.

If you had said MPI, then yes: it is the de facto standard native-code clustering system (on the JVM there are Netty and a few other systems). OpenMP is really just a way of hacking sequential code to create parallel code on a multicore single address space, and a very good hack it is too. But it remains a hack and not a good way of transitioning from fundamentally sequential code to fundamentally parallel code. OpenMP exists exactly because Fortran, C and C++ codes had to be made data parallel without being rewritten. D should not be in this boat (a small std.parallelism sketch follows this post).

-- 
Russel.
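For reference, a minimal sketch of the std.parallelism counterpart to an OpenMP parallel-for plus reduction; the array size and the work done per element are illustrative:

```d
import std.parallelism : parallel, taskPool;
import std.stdio;

void main()
{
    auto data = new double[1_000_000];
    foreach (i, ref x; data)
        x = i;

    // The rough analogue of "#pragma omp parallel for": iterations are
    // split across a default thread pool.
    foreach (ref x; parallel(data))
        x = x * x + 1.0;

    // The analogue of an OpenMP reduction clause.
    auto total = taskPool.reduce!"a + b"(data);
    writeln(total);
}
```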
December 27, 2014 Re: Data Frames in D - let's not wait for linear algebra; useful today in finance and Internet of Things
Posted in reply to Russel Winder

On Saturday, 27 December 2014 at 14:07:51 UTC, Russel Winder via Digitalmars-d-learn wrote:
> sequential code to fundamentally parallel code. OpenMP exists exactly
> because Fortran, C and C++ codes had to be made data parallel without
> being rewritten. D should not be in this boat.
I don't disagree in principle, but if an OpenMP-supporting compiler can generate code for GPGPU, then D will be miles behind for many homogeneous workloads.