September 25, 2013 D language manipulation of dataframe type structures | ||||
---|---|---|---|---|
| ||||
I've been playing with the Python pandas library, which enables interactive manipulation of tables of data in its dataframe structure; the authors say it is similar to the structures used in R. It appears pandas has laid claim to being a faster version of R, but that speed seems basically limited to what they can exploit by moving operations back and forth to the underlying Cython code. Has anyone written an example app in D that manipulates dataframe-type structures? |
September 25, 2013 Re: D language manipulation of dataframe type structures | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jay Norwood | I thought about it once but quickly abandoned the idea. The primary reason was that D doesn't have a REPL and is thus not suitable for interactive data exploration. |
September 25, 2013 Re: D language manipulation of dataframe type structures | ||||
---|---|---|---|---|
| ||||
Posted in reply to lomereiter | lomereiter:
> I thought about it once but quickly abandoned the idea. The primary reason was that D doesn't have REPL and is thus not suitable for interactive data exploration.
The quick compile times could allow interactive data exploration in D, though perhaps a little less comfortably than in Python.
People have created a D REPL two or more times, but Walter and Andrei do not seem interested in it.
Bye,
bearophile
|
September 25, 2013 Re: D language manipulation of dataframe type structures | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | While the interactive exploratory aspects of pandas are attractive, in my case the interaction has just been a crutch to discover how to use the API correctly. Once through that API learning curve, I'd mainly be interested in repeating the operations that worked correctly. Execution speed would be more important to me at that point. The recent pandas documentation describes some speed improvements available from eval(expression_string) calls that are executed by numexpr. Their testing shows it only improves execution time when table sizes go beyond about 10k rows, which seems to put the improvements beyond the reach of my particular app. OK, thanks. I'll have to dig into it some more. |
September 25, 2013 Re: D language manipulation of dataframe type structures | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jay Norwood | I agree with other posters that a D REPL and
interactive/visualization data environment would be very cool,
but unfortunately doesn't exist. Batch computing is more
practical, but REPLs really hook new users. I see statistical
computing as a huge opportunity for D adoption. (R is just
super-ugly and slow, leaving Python + its various native-code
cyborg appendages as the hot new stats environment).
There are tons of ways of accomplishing the same thing in D, but
as far as I know there isn't a "standard" at this point. A
statically typed dataframe is, at minimum, just a range of
structs -- even more minimally, a bare *array* of structs, or
alternatively just a 2-D array in a thin wrapper that provides
access via column labels rather than indexes. You can manipulate
these ranges with functions from std.range and std.algorithm.
Missing or N/A data is a common issue, and can be represented in
a variety of ways, with integers being the most annoying since
there is no built-in NaN value for ints (check out the Nullable
template from std.typecons).
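As a rough sketch of the range-of-structs idea described above: the `Row` type, its fields, and the `meanAge` helper below are made up for illustration and are not from any existing library. Missing integers are wrapped in `Nullable!int`, while a floating-point column could simply use `double.nan`.

```d
import std.algorithm : filter, map, sum;
import std.array : array;
import std.typecons : Nullable;

// Hypothetical row type: one struct instance per dataframe row.
struct Row
{
    string name;
    double height;      // a missing value here could be double.nan
    Nullable!int age;   // ints have no NaN, so wrap them in Nullable
}

// Mean of the non-missing ages; double.nan if every age is missing.
double meanAge(Row[] rows)
{
    auto present = rows.filter!(r => !r.age.isNull)
                       .map!(r => r.age.get)
                       .array;
    return present.length ? cast(double) present.sum / present.length
                          : double.nan;
}

void main()
{
    import std.stdio : writeln;

    auto df = [
        Row("Ann", 1.70, Nullable!int(34)),
        Row("Bob", 1.82),                    // age left missing
        Row("Cy", 1.65, Nullable!int(28)),
    ];

    // Filter rows and project one column, as one would with a dataframe.
    auto tall = df.filter!(r => r.height > 1.68).map!(r => r.name).array;
    writeln(tall);
    writeln(meanAge(df));
}
```

Because every column has a static type, a misspelled field name or a type mismatch is caught at compile time rather than at run time.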
Supporting features like having *both* rows and columns
accessible via labels rather than indexes requires a little bit
more wrapping. We have a NamedMatrix class at my workplace for
that purpose. It's easy to overload the index operator [] for
access, * for matrix multiplication, etc.
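A minimal sketch of such a label-indexed wrapper might look like the following. This is not the NamedMatrix class mentioned above (which is not public); `LabeledMatrix` and its members are hypothetical names, and only `[]` is overloaded here to keep it short.

```d
// Hypothetical thin wrapper around a 2-D array with row and column labels.
struct LabeledMatrix
{
    double[][] data;          // data[row][col]
    size_t[string] rowIndex;  // row label -> position
    size_t[string] colIndex;  // column label -> position

    this(string[] rowLabels, string[] colLabels, double[][] values)
    {
        assert(values.length == rowLabels.length);
        data = values;
        foreach (i, label; rowLabels) rowIndex[label] = i;
        foreach (j, label; colLabels) colIndex[label] = j;
    }

    // m["row", "col"] reads a cell by label.
    double opIndex(string row, string col) const
    {
        return data[rowIndex[row]][colIndex[col]];
    }

    // m["row", "col"] = x writes a cell by label.
    void opIndexAssign(double value, string row, string col)
    {
        data[rowIndex[row]][colIndex[col]] = value;
    }
}

void main()
{
    auto m = LabeledMatrix(["r1", "r2"], ["x", "y"],
                           [[1.0, 2.0], [3.0, 4.0]]);
    m["r2", "x"] = 30.0;
    assert(m["r2", "x"] == 30.0);
    assert(m["r1", "y"] == 2.0);
}
```

Looking up an unknown label in the associative arrays throws a RangeError, which is probably acceptable for a sketch but would want a clearer error in real code.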
CSV loads can be done with std.csv; unfortunately there's no
corresponding support in that module for *writing* CSV (I've
rolled my own). At my workplace we also have a MysqlConnection
class that provides one-liner loading from a SQL query into
minimalist, range-of-structs dataframes.
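For illustration, loading CSV into an array of structs with `csvReader` and writing it back with a hand-rolled function could be sketched as below. The `Trade` struct and the `loadTrades`, `csvField`, and `toCsv` helpers are assumptions invented for this example; the quoting rule is a simple one (quote only fields containing a comma, quote, or newline) and is not the writer referred to above.

```d
import std.algorithm : canFind;
import std.array : array, join, replace;
import std.conv : to;
import std.csv : csvReader;

// Hypothetical record layout; field order matches the header list below.
struct Trade
{
    string symbol;
    int quantity;
    double price;
}

// Parse CSV text whose first line is a header containing these column names.
Trade[] loadTrades(string text)
{
    return csvReader!Trade(text, ["symbol", "quantity", "price"]).array;
}

// Quote a field only when it contains a comma, a quote, or a newline.
string csvField(string s)
{
    if (s.canFind(',') || s.canFind('"') || s.canFind('\n'))
        return `"` ~ s.replace(`"`, `""`) ~ `"`;
    return s;
}

// Hand-rolled writer: header line followed by one line per record.
string toCsv(Trade[] rows)
{
    string result = "symbol,quantity,price\n";
    foreach (r; rows)
        result ~= [csvField(r.symbol), r.quantity.to!string,
                   r.price.to!string].join(",") ~ "\n";
    return result;
}

void main()
{
    import std.stdio : write;

    auto trades = loadTrades("symbol,quantity,price\nABC,10,1.5\nXYZ,3,20.25");
    write(toCsv(trades));
}
```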
Beyond that, it really depends on how you want to manipulate the
dataframes. What specific things do you want to do? If you've got
an idea, I could work up some sample code.
So yes, there are people doing it in The Real World.
Unfortunately my colleagues don't have a nice, tidy,
self-contained DataFrame module to share (yet). But having one
would be a great thing for D. The bigger problem though is
matching the huge 3rd-party stats libraries (like CRAN for R).
|
September 25, 2013 Re: D language manipulation of dataframe type structures | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jared Miller | On Wednesday, 25 September 2013 at 18:37:48 UTC, Jared Miller wrote:
> [...]
> So yes, there are people doing it in The Real World.
> Unfortunately my colleagues don't have a nice, tidy,
> self-contained DataFrame module to share (yet). But having one
> would be a great thing for D. The bigger problem though is
> matching the huge 3rd-party stats libraries (like CRAN for R).
I had considered one day making a semi-port of pandas, at the very least stealing Wes' basic algorithms (no point reinventing the hard stuff). The interface could be better in D than in Python, I reckon, although of course the lack of a REPL is a bit of a show-stopper.
|
September 25, 2013 Re: D language manipulation of dataframe type structures | ||||
---|---|---|---|---|
| ||||
Posted in reply to John Colvin | John Colvin:
> although of course the lack of a repl is a bit of a show-stopper.
There are (or were) two different repls for D. The second is for D2.
Bye,
bearophile
|
December 26, 2014 Data frames in D? | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | "
> I thought about it once but quickly abandoned the idea. The primary reason was that D doesn't have REPL and is thus not suitable for interactive data exploration.
The quick compile times could allow interactive data exploration
[...]
"
----
Since we do have an interactive shell (the pastebin), and now bindings and wrappers for hdf5 (key for large data sets) and basic seeds for a matrix library, should we start to think about what would be needed for a dataframe, and the best way to approach it, starting very simply?
One doesn't need to have a comparable library to R for it to start being useful in particular use cases.
Pandas and Julia would be obvious potential sources of inspiration (and it may be that one still uses them to call out to D in some cases), but rather than trying to just port pandas to D, it seems to make sense to ask how one should do it from scratch to better suit D.
Laeeth.
|
December 26, 2014 Re: Data frames in D? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Laeeth Isharc | On Fri, 2014-12-26 at 20:44 +0000, Laeeth Isharc via Digitalmars-d-learn wrote:
[…]
> I agree with other posters that a D REPL and interactive/visualization data environment would be very cool, but unfortunately doesn't exist. Batch computing is more practical, but REPLs really hook new users. I see statistical computing as a huge opportunity for D adoption. (R is just super-ugly and slow, leaving Python + its various native-code cyborg appendages as the hot new stats environment).
REPLs are over-hyped and have become a fashion touchstone that few dare argue against for fear of being denounced as un-hip. REPLs have their place, but in the main are nowhere near as useful as people claim. IPython Notebooks, on the other hand, are a balance between editor/execution environment and REPL that really has a lot going for it.
Stats folk using R love R and hate Python. Stats folk using Python love Python and hate R. In the end it's all about what you know and can use to get the job done. To be frank (as in open rather than Jill), D hasn't got the infrastructure to compete with either R or Python and so is a non-starter in the data science arena.
> There are tons of ways of accomplishing the same thing in D, but as far as I know there isn't a "standard" at this point. A statically typed dataframe is, at minimum, just a range of structs -- even more minimally, a bare *array* of structs, or alternatively just a 2-D array in a thin wrapper that provides access via column labels rather than indexes. You can manipulate these ranges with functions from std.range and std.algorithm. Missing or N/A data is a common issue, and can be represented in a variety of ways, with integers being the most annoying since there is no built-in NaN value for ints (check out the Nullable template from std.typecons).
>
> Supporting features like having *both* rows and columns are accessible via labels rather than indexes requires a little bit more wrapping. We have a NamedMatrix class at my workplace for that purpose. It's easy to overload the index operator [] for access, * for matrix multiplication, etc.
>
> CSV loads can be done with std.csv; unfortunately there's no corresponding support in that module for *writing* CSV (I've rolled my own). At my workplace we also have a MysqlConnection class that provides one-liner loading from a SQL query into minimalist, range-of-structs dataframes.
>
> Beyond that, it really depends on how you want to manipulate the dataframes. What specific things do you want to do? If you've got an idea, I could work up some sample code.
>
> So yes, there are people doing it in The Real World. Unfortunately my colleagues don't have a nice, tidy, self-contained DataFrame module to share (yet). But having one would be a great thing for D. The bigger problem though is matching the huge 3rd-party stats libraries (like CRAN for R). "
Nor the whole Python/SciPy/Matplotlib thing.
> ----
>
> Since we do have an interactive shell (the pastebin), and now bindings and wrappers for hdf5 (key for large data sets) and basic seeds for a matrix library, should we start to think about what would be needed for a dataframe, and the best way to approach it, starting very simply?
>
> One doesn't need to have a comparable library to R for it to start being useful in particular use cases.
Whilst I can do workshops for data science folk using Python and have an argument why Python beats R for almost all cases so far brought up, there is no way I can even start to mention D.
> Pandas and Julia would be obvious potential sources of inspiration (and it may be that one still uses them to call out to D in some cases), but rather than trying to just port pandas to D, it seems to make sense to ask how one should do it from scratch to better suit D.
Pandas is just one of the "native code cyborg appendages" you were railing about earlier. It happens to be "a big thing" in data science and one of the reasons Python is running away with the market, reducing the R market penetration and only being a little bit dented in some places by Julia.
It's not about the language, it's about the total milieu. Whether or not Python is a good language vs D is irrelevant: Python/SciPy/Matplotlib/Pandas/IPython is there and ready, D has no play in the game.
--
Russel.
|
December 26, 2014 Re: D language manipulation of dataframe type structures | ||||
---|---|---|---|---|
| ||||
Posted in reply to lomereiter | On Wednesday, 25 September 2013 at 04:35:57 UTC, lomereiter wrote:
> I thought about it once but quickly abandoned the idea. The primary reason was that D doesn't have REPL and is thus not suitable for interactive data exploration.
https://github.com/MartinNowak/drepl
https://drepl.dawg.eu/
|