July 23, 2019
On Tuesday, 23 July 2019 at 18:21:30 UTC, Prateek Nayak wrote:
> [snip]
>
> Right now there is a CSV reader in Magpie but it isn't perfect enough to go into Phobos yet. I'll improve the parser and when I'm happy with the read speed, I'll send a PR (^_^)

mir was originally intended to be included in Phobos, but got split off into its own library. If anything, Magpie would have a better home in mir than in Phobos. However, I think there is probably value in splitting the CSV reader off into a separate project and putting it up on dub when it is ready for broader use.
July 25, 2019
Could you do any benchmarks against Python's Pandas?
July 25, 2019
On Thursday, 25 July 2019 at 11:55:48 UTC, Suliman wrote:
> Could you do any benchmarks against Python's Pandas?

As soon as aggregate is done, I'll get on to this.
It's best to benchmark with real-world examples IMHO, and aggregate brings most of the analytics functionality.
I'll keep the thread updated (^_^)
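On the pandas side, the kind of real-world timing I have in mind looks roughly like the sketch below (the file and column names are placeholders, not a benchmark anyone has actually run):

    import time
    import pandas as pd

    start = time.perf_counter()
    df = pd.read_csv("data.csv")                       # placeholder input file
    read_time = time.perf_counter() - start

    start = time.perf_counter()
    result = df.groupby("key").agg({"value": "mean"})  # placeholder columns
    agg_time = time.perf_counter() - start

    print(f"read_csv: {read_time:.3f}s, groupby/agg: {agg_time:.3f}s")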
August 08, 2019
-----------
Update Time
-----------

Pardon me for the delay - my university term just started and it has been a busy first week. However, I have some good news:

* Aggregate implementation is under review - the preliminary implementation restricted the set of operations aggregate could perform, but Mr. Wilson suggested there should be a way to expand its usability, so we worked on a revamp that takes the function you desire as input and applies it to the rows/columns of the DataFrame (a rough pandas analogue is sketched after this list)
* There is a new way to set the index using the index operation
* to_csv supports setting the precision of floating-point numbers - this was a problem I knew existed but hadn't addressed until now. Better late than never.
* Homogeneous DataFrames don't use TypeTuple anymore
* An at overload is coming soon
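
For a concrete picture of the aggregate design, it is similar in spirit to pandas' agg/apply, where the caller passes in the function and the axis. A rough pandas sketch of the idea (this is the pandas API, not Magpie's):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

    # Caller-supplied operations applied per column...
    per_column = df.agg(["sum", "mean"])
    # ...or a custom function applied per row (axis=1).
    per_row = df.apply(lambda row: row.max() - row.min(), axis=1)

    print(per_column)
    print(per_row)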


--------------------
What is to come next
--------------------

* The first few responses from the community were mostly about adding binary file I/O support, since binary formats are lean and fast to read and write. I will explore this further.
* Time series are gaining importance with the rise of machine learning. I would like to implement something along the lines of the time-series functionality Pandas has.
* Something you would like to see - I am open to suggestions (^_^)

--------------
Problems faced
--------------

One small implementation detail remains - a dispatch function. Since non-homogeneous cases still require traversal to a column, a function that applies an alias statically or non-statically depending on the DataFrame is under discussion.
This will reduce code redundancy; however, my preliminary attempts to tackle it have ended in failure. I will try to finish it by the weekend. If I cannot solve it by then, I will seek your help in the Learn section (^_^)
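To make the problem concrete, here is a rough language-neutral sketch in Python (this is not Magpie's code; all names are illustrative). One helper resolves the column for either storage layout and then applies the caller's operation; in D, the homogeneous branch would be resolved statically via templates:

    # Two storage layouts a DataFrame might use (illustrative only).
    class HomogeneousFrame:
        def __init__(self, rows):       # one uniform 2-D block, stored as rows
            self.rows = rows

        def column(self, j):
            return [row[j] for row in self.rows]

    class MixedFrame:
        def __init__(self, columns):    # one array per column, mixed types
            self.columns = columns

        def column(self, j):
            return self.columns[j]

    def dispatch(frame, op, j):
        # Single entry point: resolve the column, then apply the operation,
        # so each operation is written once instead of once per layout.
        return op(frame.column(j))

    hf = HomogeneousFrame([[1, 2], [3, 4]])
    mf = MixedFrame([[1, 3], ["x", "y"]])
    print(dispatch(hf, sum, 0))   # 4
    print(dispatch(mf, len, 1))   # 2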
Thank you
August 09, 2019
On Thursday, 8 August 2019 at 16:49:09 UTC, Prateek Nayak wrote:
> [snip]

Dear D community,

Thanks, Prateek Nayak, for your work.
I am currently working with pandas (Python, DataFrames, ...). There is an extra feature set that I appreciate a lot: the I/O tools, in particular:

* SQL
Methods: read_sql and to_sql
Description: these allow reading from and saving to a database. Combined with SQLAlchemy, they are awesome.

* Parquet
Methods: read_parquet and to_parquet
Description: Parquet is a file format often used in big-data environments. (A short pandas sketch of both follows this list.)
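
A short sketch of these pandas methods (the database URL, table, and file names are placeholders; to_sql/read_sql need SQLAlchemy, and to_parquet/read_parquet need a Parquet engine such as pyarrow):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///example.db")   # placeholder database

    df = pd.DataFrame({"id": [1, 2], "value": [3.5, 4.5]})
    df.to_sql("measurements", engine, if_exists="replace", index=False)
    back = pd.read_sql("SELECT * FROM measurements", engine)

    df.to_parquet("measurements.parquet")
    again = pd.read_parquet("measurements.parquet")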

These abilities make pandas and its DataFrame API a core library to have. Used like this, it standardizes the data structures used in our applications and at the same time offers a rich statistics API.

Indeed, this is important for code maintainability. And from the FAIR data point of view, an application is a set of input data + the program's features = result, so putting the data structures first when thinking about how to develop an application matters.
The application becomes more robust and flexible when it can handle multiple input file formats.

I hope to see such features in D.


Best regards

Sources:
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet
August 09, 2019
On 8/8/19 12:49 PM, Prateek Nayak wrote:
> [snip]
>
> * Something you would like to see - I am open to suggestions (^_^)

Again, thank you so much for working on this!
We will be excited to put Magpie through its paces in our lab, but it is missing* a few key (really, basic IMO) features we make heavy use of in pandas.

* I have read the README and glanced at code but not used Magpie yet, so if I am wrong about below please correct me!


Since you are soliciting ideas:
1. Selecting/indexing into data with boolean vectors (see the pandas sketch after this list), e.g.:

df[df.A > 30 && df.B != "ignore"]

1a. This really means returning a boolean vector for df.COL <op> <operand>

1b. ...and being able to subset data by a bool vector


2. We make heavy use of "pivot" functionality.
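
In pandas, with made-up columns, request 1 looks like this (pandas spells the conjunction & rather than &&):

    import pandas as pd

    df = pd.DataFrame({"A": [10, 40, 50], "B": ["keep", "ignore", "keep"]})

    # 1a. An elementwise comparison yields a boolean vector...
    mask = (df.A > 30) & (df.B != "ignore")   # [False, False, True]

    # 1b. ...which then subsets the rows of the frame.
    subset = df[mask]                         # the single row A=50, B="keep"
    print(subset)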


Kind regards
August 10, 2019
On Friday, 9 August 2019 at 08:08:39 UTC, bioinfornatics wrote:
> [snip]

I was looking into Parquet, and it even came up in the reddit post I had linked to earlier on - the smaller file size and better I/O make it really good for industrial use.
A quick search on DUB didn't turn up a parser, so I'll probably work on a library for working with Parquet files.
I looked into Cap'n Proto too - it looks promising, but it's missing from Pandas' I/O section, which was disappointing.
Thanks for mentioning SQL. I will start working on these features soon.

> [snip]
>
> 1. Selecting/indexing into data with boolean vectors
> 2. We make heavy use of "pivot" functionality.

I was thinking of the same feature as 1 - a filter-like function for DataFrame and Group - and I am exploring possible ways to implement it.
I'm really embarrassed to admit I never even thought about pivot. It looks like a beautiful feature to have and will definitely be added to Magpie soon (possibly over the next couple of weeks - I'm a bit tied down right now with the start of university classes, but it will definitely come).
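For concreteness, the pandas behaviour I'd be aiming to mirror looks roughly like this (illustrative data only):

    import pandas as pd

    sales = pd.DataFrame({
        "month":   ["Jan", "Jan", "Feb", "Feb"],
        "product": ["pen", "book", "pen", "book"],
        "units":   [10, 3, 7, 5],
    })

    # Reshape long records into a month x product table.
    table = sales.pivot(index="month", columns="product", values="units")
    print(table)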
August 10, 2019
On Saturday, 10 August 2019 at 04:10:31 UTC, Prateek Nayak wrote:
> [snip]

It's clearly important that your project supports the same data exchange formats as pandas, but it doesn't seem inherently a problem to support other formats as well, assuming you have the time and inclination to do so.
August 10, 2019
On Saturday, 10 August 2019 at 12:38:19 UTC, Joseph Rushton Wakeling wrote:
> It's clearly important that your project supports the same data exchange formats as pandas, but it doesn't seem inherently a problem to support other formats as well, assuming you have the time and inclination to do so.

It is never an inherent problem to support a new file format, but the initial comments from the community were mainly about easy interop with Python; that is why I was thinking of Parquet support first.
Cap'n Proto is great, and I'd love to implement Cap'n Proto I/O sooner or later, but Parquet seems to have a heavier presence due to the popularity of Pandas, so I decided to look into it first.