[GSoC] Dataframes for D
May 29, 2019
Hello everyone,

I have begun work on my Google Summer of Code 2019 project, DataFrame for D.

-----------------
About the Project
-----------------

DataFrames have become a standard for handling and manipulating data. They give a neat representation of the data, easy access to it, and the power to reshape it the way the user wants.
This project aims to bring a native DataFrame to D, one which comes with:

* A user-friendly API
* Multi-indexing
* Column binary operations in the form: df["Index1"] = df["Index2"] + df["Index3"]; (see the sketch after this list)
* Writing to CSV and parsing from CSV
* groupBy on an arbitrary number of columns
* Data aggregation

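To make the intended usage concrete, here is a rough sketch of how working with the DataFrame could look once the above lands. The type parameters, the from_csv arguments, and the groupBy call are illustrative assumptions about the eventual API, not settled Magpie interfaces:

import magpie.dataframe : DataFrame;    // hypothetical module path

void main()
{
    // A frame of doubles with one index column and one header row (assumed signature)
    DataFrame!(double, 2) df;
    df.from_csv("sales.csv", 1, 1);

    // Column binary operation as listed above
    df["Index1"] = df["Index2"] + df["Index3"];

    // Group by any number of columns, then aggregate (names are assumptions)
    auto grouped = df.groupBy(["Region", "Month"]);

    df.display();                       // terminal output
}
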
Disclaimer: The overall structure was inspired by Pandas, the most popular DataFrame library in Python, so most of the usage will look very similar to Pandas.

The main focus of this project is the user-friendliness of the API, while also maintaining a fair amount of speed and power.
The preliminary road map can be viewed here -> https://docs.google.com/document/d/1Zrf_tFYLauAd_NM4-UMBGt_z-fORhFMrGvW633x8rZs/edit?usp=sharing

The core developments can be seen here -> https://github.com/Kriyszig/magpie


-----------------------------
Brief idea of what is to come
-----------------------------

This month
----------
* Finish up the structure of the DataFrame
* Finish terminal output (what good is data that cannot be seen?)
* Finish writing to CSV
* Parsing a DataFrame from CSV (both single- and multi-indexed)
* Accessing elements
* Accessing rows and columns
* Assignment of an element, an entire row, or a column
* Binary operations on rows and columns

Next Month
----------
* groupBy
* join
* Begin writing ops for aggregation


-----------
Speed Bumps
-----------

I am relatively new to D and hail from a C background. Sometimes (most of the time) my code can start to look more like C than D.
However, I am adapting, thanks to my mentors Nicholas Wilson and Ilya Yaroshenko. They have helped me a ton; whether it is debugging errors or me falling back into my old C habits, they have always come to my rescue, and I am grateful for their support.


-------------------------------------
Addressing Suggestions from Community
-------------------------------------

This suggestion comes from Laeeth Isharc
Source: https://github.com/dlang/projects/issues/15#issuecomment-495831750

Though this is not on my current road map, I would love to pursue this idea. Adding an easy way to interoperate with other libraries would be very beneficial.
Although I haven't formally addressed it in the road map, I would love to implement msgpack-based I/O as I continue to develop the library. JSON I/O is also something I have in mind to implement after the data aggregation part. (I had prioritised JSON as I believed far more datasets are available as JSON than in any other format.)
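
As a taste of what the msgpack route involves, the msgpack-d package already handles the low-level encoding; a round trip for a row-like struct looks roughly like this (Row is a stand-in for illustration, not a Magpie type):

import msgpack : pack, unpack;

// A stand-in for one row of a frame; msgpack-d packs plain structs directly
struct Row
{
    string region;
    double sales;
}

void main()
{
    Row input = Row("north", 1.5);

    ubyte[] bytes = pack(input);    // compact binary encoding
    Row output;
    unpack(bytes, output);          // decode back into the struct

    assert(input == output);
}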
May 29, 2019
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
> [snip]

The outlook looks really great. CSV as a starting point is good. I am not sure whether adding a second text format like JSON brings real value.
Text formats have two disadvantages. First, converting strings to numbers and vice versa slows down loading and saving: if I remember correctly, saving a file as CSV took around 50 seconds, while saving the same data as a binary (Parquet) file took 3 seconds.
The second issue is file size: a 580 MB CSV comes out at around 280 MB when saved as e.g. a Parquet file.

The file size isn't an issue on your local file system, but it is a big issue when storing these files in the cloud, e.g. Amazon S3, where the larger size means longer transfer times.

Adding a packed binary format would be great, if possible.

Kind regards
Andre
May 29, 2019
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
> [snip]

Glad to see progress being made on this!

Somewhat tangentially to the interoperability point, I have also made quite a bit of use of R's data.frames. One difference between those and what I have seen of this implementation is that R's data.frames allow different columns to have different types. This makes certain kinds of analysis of groups very easy. For instance, right now I'm working with a dataset whose columns are doubles, dates, integers, bools, and strings. I can do the equivalent of groupby on the strings as "factors" in R, and it's pretty straightforward to get everything working nicely.
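
For comparison, the "group on a string column, then aggregate" pattern can already be spelled with plain D ranges; a toy version over a made-up Row type (not the DataFrame API) looks like:

import std.algorithm : chunkBy, map, sort, sum;
import std.stdio : writeln;

struct Row
{
    string region;   // the "factor" column
    double sales;
}

void main()
{
    auto rows = [Row("north", 1.0), Row("south", 2.0), Row("north", 3.0)];

    // Sort on the key so equal keys are adjacent, then chunkBy yields the groups
    auto groups = rows.sort!((a, b) => a.region < b.region)
                      .chunkBy!((a, b) => a.region == b.region);

    foreach (g; groups)
        writeln(g.front.region, ": ", g.map!(r => r.sales).sum);
    // prints: north: 4, south: 2
}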
May 29, 2019
Is it worth considering some binary format that is standard-ish & whose toolset can write out text when viewing? I'm thinking https://capnproto.org .
May 29, 2019
On Wednesday, 29 May 2019 at 19:33:38 UTC, Yatheendra wrote:
> Is it worth considering some binary format that is standard-ish & whose toolset can write out text when viewing? I'm thinking https://capnproto.org .

I have heard of that, but I don't know too much about it.

I think there are some hierarchical data formats that have some popularity, like hdf5. I think there's already a D wrapper for the C library. I don't have much experience with this kind of stuff though.
May 29, 2019
On Wednesday, 29 May 2019 at 18:26:46 UTC, Andre Pany wrote:
> [snip]

Interoperability with pandas would be important in our use case, and I think probably for quite a few others.  So yes, I agree that it's not ideal to use JSON, but lots of things are not ideal.  And I think people use JSON, msgpack and hdf5 for interop with pandas.  CSV is more complicated in practice than one might initially think.  And finally, of course, there's Excel interop.

It's not my cup of tea but gzipped JSON is quite compact...
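
For scale, deflating a JSON payload is a one-liner with std.zlib (a zlib stream rather than gzip proper, but the compression is comparable):

import std.stdio : writeln;
import std.zlib : compress, uncompress;

void main()
{
    // Column-oriented JSON dumps tend to be highly repetitive, so they pack well
    string json = `{"region":["north","south","north","south"],"sales":[1,2,3,4]}`;

    ubyte[] packed = compress(json);
    writeln(json.length, " -> ", packed.length, " bytes");

    string back = cast(string) uncompress(packed, json.length);
    assert(back == json);
}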

I have a little streaming msgpack deserialiser for our own Variable type (it can store a primitive or a variant).  It's not long and I could share it.

I don't think the initial dataframe necessarily needs to have all this stuff in it from day one.

It's worth reusing an existing JSON implementation.  Since you work with Ilya: asdf isn't bad and is quite fast, though the error messages leave something to be desired.
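
For the curious, basic asdf usage is only a few lines; this round-trips an array of structs through JSON (S is just a demo type, not anything from the DataFrame):

import asdf : deserialize, serializeToJson;

struct S
{
    string name;
    double value;
}

void main()
{
    S[] data = [S("a", 1.5), S("b", 2.0)];

    string json = data.serializeToJson;   // [{"name":"a","value":1.5},...]
    auto back = json.deserialize!(S[]);   // typed deserialisation

    assert(back == data);
}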

You might look at John Colvin's repo from his talk on @nogc map, filter, fold, etc.  He hasn't done chunkBy yet.
May 29, 2019
On Wednesday, 29 May 2019 at 19:33:38 UTC, Yatheendra wrote:
> Is it worth considering some binary format that is standard-ish & whose toolset can write out text when viewing? I'm thinking https://capnproto.org .

Ultimately you need to be able to talk to the world, because this kind of thing is social and you may not have a choice about formats.  However, there's no point trying to do it all in one summer...

May 29, 2019
On 5/29/19 2:00 PM, Prateek Nayak wrote:
> (snip)

Outstanding, and greatly needed. Congratulations to you and your mentors.

Our lab has transitioned to D for new software but still relies on python+pandas for some analytics pipelines.

I second the notions about the importance of interoperability. An interesting in-memory interop framework I haven't seen mentioned here yet is Apache Arrow. In this 2017 blog post, Wes McKinney, the author of Pandas, discusses it in the context of mistakes made in designing pandas; recommended reading if you haven't seen it:

https://wesmckinney.com/blog/apache-arrow-pandas-internals/

May 30, 2019
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
> [snip]

The DataFrame currently uses Mir's ndslice at its core, which means the data stored within it is homogeneous.
Right now, we are restricting operable data to homogeneous types to keep the API simpler.
I'm not sure how something like Variant would play out in this scenario. It may allow the data to be flexible, but getting it back into concrete types will probably require asserting the runtime type everywhere.
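
For what it's worth, a middle ground between fully homogeneous storage and per-cell Variant is to make each column homogeneous while letting columns differ, e.g. with std.variant.Algebraic. This is just a sketch of the idea, not how Magpie is structured, and it shows exactly the type-assertion cost mentioned above:

import std.stdio : writeln;
import std.variant : Algebraic;

// Each column is homogeneous, but different columns may hold different types
alias Column = Algebraic!(double[], long[], bool[], string[]);

void main()
{
    Column[string] df;   // toy frame: column name -> typed data
    df["price"]  = Column([1.5, 2.5]);
    df["region"] = Column(["north", "south"]);

    // Reading a column back requires asserting its runtime type,
    // which is the overhead a Variant-based design has to pay
    auto regions = df["region"].get!(string[]);
    writeln(regions);    // ["north", "south"]
}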
May 30, 2019
On Wednesday, 29 May 2019 at 22:31:57 UTC, Laeeth Isharc wrote:
> [snip]

Interop is important, and I see many binary formats being suggested as the way to achieve it.
Pandas' I/O tools cover some of the popular formats: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Apache Arrow is something new that I hadn't heard of before. I'll look into it, though I couldn't find a way to integrate it with D right away.

I would love to hear which binary format the community would like added for interop in the future.