June 11, 2019
On Tuesday, 4 June 2019 at 03:13:03 UTC, Prateek Nayak wrote:
> On Thursday, 30 May 2019 at 09:41:59 UTC, jmh530 wrote:
>> On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
>>> [snip]
>>>
>>> Im not familiar with data frames but this sounds like array of structs
>>
>> I think I was thinking of it more like a struct of arrays, but I think an array of structs may also work (see my responses to Prateek)...
>
> Due to the popularity of heterogeneous DataFrames, we decided to take care of it in the early stages of development, before it's too late.
>
> The heterogeneous DataFrame is now live at: https://github.com/Kriyszig/magpie/tree/experimental
>
> Some parts are still under development but the goals in the road maps will be reached on time.
>
> ---------------------------------
> Summing up the first week of GSoC
> ---------------------------------
>
> * Base and file I/O ops were built for the homogeneous DataFrame.
> * Based on the type of data the community has worked with, it seemed evident that homogeneous DataFrames weren't going to cut it, so a rebuild was initiated over the weekend to allow for heterogeneous data.
> * The API was overhauled to allow for heterogeneous DataFrames.
> * A new parser that can parse selective columns.
>
> The code will land in master once it's cleaned up and is deemed stable.
>
> ----------------------------------------
> Things that will be dealt with this week
> ----------------------------------------
>
> This week will be for:
>
> * Improving the parser
> * Overhauling the code structure (in Experimental)
> * Adding setters for data and index in DataFrame
> * Adding functions to create a multi-indexed DataFrame, the same way one can in Python
> * Adding documentation and examples
> * Index operations
> * Retrieving rows and columns
>
> The last one will set in motion the implementation of Column Binary ops of the form:
> df["Index1"] = df["Index2"] + df["Index3"];
>
> Meanwhile, if you have any more suggestions please feel free to contact me - you can use this thread, open an issue on GitHub, reach out to me on Slack (Prateek Nayak) or email me directly (lelouch.cpp@gmail.com)

I have decided to post weekly updates so the community can follow the progress of the project.

-------------------------------
Summing up last week's progress
-------------------------------

* Brought heterogeneous DataFrames to the same point as the previous homogeneous DataFrame development.
* Assignment ops - both direct and indexed.
* Retrieving an entire column using an index operation.
* Retrieving an entire row using an index operation.
* Small redesigns here and there to reduce code size.

The index op for rows and columns returns the value as an Axis structure.
Binary ops on the Axis structure translate to column and row binary operations on the DataFrame.
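As a rough illustration of the idea, here is a minimal standalone sketch of how binary operators on a returned Axis-like wrapper can translate to element-wise column arithmetic in D. The struct and field names are hypothetical, not Magpie's actual implementation:

```d
import std.exception : enforce;

// Hypothetical sketch - not Magpie's actual Axis - showing how a
// wrapper struct with opBinary gives element-wise column arithmetic.
struct Axis(T)
{
    T[] data;

    // Element-wise binary operation: a + b, a * b, ...
    Axis!T opBinary(string op)(Axis!T rhs)
    {
        enforce(data.length == rhs.data.length, "axis length mismatch");
        auto result = new T[](data.length);
        foreach (i; 0 .. data.length)
            result[i] = mixin("data[i] " ~ op ~ " rhs.data[i]");
        return Axis!T(result);
    }
}

void main()
{
    auto a = Axis!int([1, 2, 3]);
    auto b = Axis!int([10, 20, 30]);
    auto c = a + b; // element-wise, like df["Index2"] + df["Index3"]
    assert(c.data == [11, 22, 33]);
}
```

With such a wrapper, `df["Index1"] = df["Index2"] + df["Index3"];` can resolve to an index op returning an Axis, one opBinary call, and an assignment back into the frame.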

---------------------------------------
Tasks that will be dealt with this week
---------------------------------------

* Column and row binary operation
* There are a few places where O(n) operations can be converted to O(log n) operations - these few optimisations will be done.
* Updating the Documentation with the developments of the week.
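The O(n) to O(log n) conversion mentioned above is typically a matter of keeping an index sorted and binary-searching it instead of scanning. A minimal sketch of that idea in plain D (not Magpie's actual index layout) using Phobos' `assumeSorted`:

```d
import std.range : assumeSorted;

void main()
{
    // A sorted index lets a linear O(n) membership scan become an
    // O(log n) binary search.
    string[] index = ["alpha", "beta", "delta", "gamma"]; // kept sorted
    auto sorted = index.assumeSorted;

    assert(sorted.contains("delta"));   // O(log n) lookup
    assert(!sorted.contains("epsilon"));

    // lowerBound gives the would-be insertion position, also O(log n).
    assert(sorted.lowerBound("delta").length == 2);
}
```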


So far no major roadblocks have been encountered.

P.S.
Found this interesting Reddit post regarding file formats on r/datasets: https://www.reddit.com/r/datasets/comments/bymru3/are_there_any_data_formats_for_storing_text_worth/
June 18, 2019
On Tuesday, 11 June 2019 at 04:35:22 UTC, Prateek Nayak wrote:
> [snip]

-------------
Weekly Update
-------------

This week, development was a bit slower compared to the last couple of weeks - I had to attend college for a couple of days and it took more time than I would have liked.
That said, every goal from last week was achieved.

-----------------------
What happened last week
-----------------------

* Redesigned Axis - the structure that returns the values during column binary operations
* Added binary operations for Axis [this is the equivalent of binary operations on the DataFrame]
* Tested that binary operations work correctly on DataFrames
* Fixed a couple of tiny bugs here and there.
* Added more ways to build an index.
* Added a HashMap-like implementation to check for duplicates in the index.
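The duplicate check can be sketched with D's built-in associative array used as a hash set. This is just the general idea, with an illustrative function name; Magpie's internal layout may differ:

```d
// Sketch of duplicate detection for index labels using D's built-in
// associative array as a hash set (the HashMap-like idea mentioned
// above). Runs in expected O(n) over the index.
bool hasDuplicates(string[] index)
{
    bool[string] seen;
    foreach (label; index)
    {
        if (label in seen)
            return true;
        seen[label] = true;
    }
    return false;
}

void main()
{
    assert(!hasDuplicates(["a", "b", "c"]));
    assert(hasDuplicates(["a", "b", "a"]));
}
```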

-------------------
Goals for this week
-------------------
* Work on apply - which applies a function to values in a row/column or the entire DataFrame
* Add some inbuilt operations [operations like mean and median - operations that are essential]
* Optimize parts of DataFrame.
* Add helpers that will eventually trigger the development of groupBy
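At its core, an apply-style operation just runs a function over every element of one column, in place. A minimal standalone sketch (the function name is illustrative, not Magpie's API):

```d
// Minimal sketch of an apply-style operation: run a function over
// every element of one column, in place.
void applyToColumn(alias fn, T)(T[] column)
{
    foreach (ref value; column)
        value = fn(value);
}

void main()
{
    double[] prices = [1.0, 2.5, 4.0];
    applyToColumn!(x => x * 2)(prices);
    assert(prices == [2.0, 5.0, 8.0]);
}
```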

----------
Roadblocks
----------
There were a few moments where I ran into trouble returning the Axis structure because of its variable return type. My mentors helped me a lot when I got stuck with the implementation. There were also a couple of bugs which were simple but still took a while to solve.
Other than that, things went really smoothly.

---------------------
Community Suggestions
---------------------

Petar Kirov [ZombineDev] suggested using static arrays as a way to declare a DataFrame.
Soon it will be possible to declare a DataFrame as:

DataFrame!(int[5], double[10]) df;

[Implemented in BinaryOps branch - Will land in master with BinaryOps implementation soon]
Thank you for the suggestion.
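For readers unfamiliar with this pattern, here is a sketch of how static-array template arguments can drive a struct-of-arrays layout in D. `Frame` and its `data` field are illustrative names, not Magpie's internals:

```d
import std.traits : isStaticArray;

// Sketch mirroring the DataFrame!(int[5], double[10]) idea: each
// template argument is a fixed-size column stored inline.
struct Frame(Columns...)
{
    static foreach (C; Columns)
        static assert(isStaticArray!C, C.stringof ~ " is not a static array");

    Columns data; // one fixed-size column per template argument
}

void main()
{
    Frame!(int[5], double[10]) df;
    assert(df.data[0].length == 5);   // first column: int[5]
    assert(df.data[1].length == 10);  // second column: double[10]
    df.data[0][2] = 42;
    assert(df.data[0][2] == 42);
}
```

The static assert rejects non-static-array arguments at compile time, so a declaration like `Frame!(int, double)` fails to compile with a clear message.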
June 25, 2019
>>> [snip]

-------------
Week 4 Update
-------------

This marks the completion of Stage I of Google Summer of Code 2019. It seems like it was only yesterday when I started working on this project and it has already been a month.

--------------------------
So what happened last week
--------------------------
* apply - to apply a function on a row/column
* function to convert a column of data to level of Index
* drop - to drop a row/column

Going back to the original proposal, I had allocated some time for optimisations in case there was time:
I was testing the old parser with large files and it failed miserably, so I redesigned the from_csv function and added it to the library as fastCSV.
fastCSV gives a 40x speed improvement over from_csv and will eventually replace it.

* fastCSV was added to the library.
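The general idea behind a selective-column CSV reader can be sketched in plain D: split each line lazily and convert only the column you need, instead of materialising every field. This is a toy illustration, not Magpie's fastCSV:

```d
import std.algorithm : splitter;
import std.conv : to;
import std.string : lineSplitter;

// Toy sketch of selective-column CSV reading: lazily split each line
// and convert only the wanted column, skipping the rest.
double[] readColumn(string csv, size_t wanted)
{
    double[] result;
    foreach (line; csv.lineSplitter)
    {
        size_t i = 0;
        foreach (field; line.splitter(','))
        {
            if (i++ == wanted)
            {
                result ~= field.to!double;
                break; // ignore the remaining fields on this line
            }
        }
    }
    return result;
}

void main()
{
    auto csv = "1,2.5,3\n4,5.5,6\n";
    assert(readColumn(csv, 1) == [2.5, 5.5]);
}
```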


-------------------
Plans for this week
-------------------

Plans for this entire stage aren't strictly on a week-by-week timeline, but the following things will be dealt with sequentially throughout this stage:

This stage is reserved for the implementation of groupBy. To begin with, the internal structure and grouping will be decided. Later, things like display and combining into a DataFrame struct will be dealt with.

These tasks were scheduled for Stage III but will again fall under sequential implementation. If the above tasks are done, the following will be dealt with:
* Aggregate [with complete set of popular operations]
* Join will be implemented to merge two DataFrame.

Aggregate was reserved for later stages so that it could be implemented for both the normal DataFrame and groupBy at once.


----------
Roadblocks
----------
This week there haven't been any roadblocks. I needed the help of my mentors to solve a couple of errors here and there, but other than that things were smooth.
As for future roadblocks, I cannot see any apparent ones - but then again, they show up when you are least expecting them :(
June 25, 2019
On Tuesday, 25 June 2019 at 17:25:34 UTC, Prateek Nayak wrote:
>>>> [snip]
>
> -------------
> Week 4 Update
> -------------
>
> This marks the completion of Stage I of Google Summer of Code 2019. It seems like it was only yesterday when I started working on this project and it has already been a month.
>
> --------------------------
> So what happened last week
> --------------------------
> * apply - to apply a function on a row/column
> * function to convert a column of data to level of Index
> * drop - to drop a row/column
>
[snip]

Glad to see you're still making great progress.

I had worked on the byDim function in mir.ndslice.topology is byDim because I had wanted the same sort of functionality as R's apply. It works a little differently than R's, but I find it very good for a lot of things. Your version of apply (I'm looking at the apply branch of magpie) looks like it operates a bit like a byDim chained with an each, so byDim!N.each!f. However, it also has this index variable allowing it to skip rows or something (I'm not really sure if this feature pulls its weight...).

So I have two questions: 1) does byDim also work with dataframes?, 2) can you add an overload that is apply(f, axis) without the index parameter?

One of my takeaways from looking at the apply function (again just looking at that apply branch) is that you might benefit from using more of what Ilya has already put in mir.ndslice where available. For instance, the overload of apply that is just apply!f is basically the same as mir's each, but each has more features.
June 25, 2019
On Tuesday, 25 June 2019 at 17:54:36 UTC, jmh530 wrote:
> [snip]

Stupid typos.

I had worked on the byDim function in mir.ndslice.topology because ...
June 25, 2019
On Tuesday, 25 June 2019 at 17:54:36 UTC, jmh530 wrote:
> [snip]
>
> So I have two questions: 1) does byDim also work with dataframes?, 2) can you add an overload that is apply(f, axis) without the index parameter?

1) Currently, byDim doesn't work on DataFrames.
2) Sure, the overload can be made, but what are you specifically looking for?
apply(f, axis)(indexes) ?

You are right, apply works like byDim!axis.each on particular columns/rows.
I'll look into Mir's implementation - thanks for the advice. I do believe apply can be strengthened to account for different use cases.
When heterogeneous DataFrame support came, mir-algorithm was dropped from the dependencies and a Struct of Arrays implementation was taken up using TypeTuples. Once the basic workings are solid, I'll port useful features from Mir to Magpie.
June 25, 2019
On Tuesday, 25 June 2019 at 20:44:43 UTC, Prateek Nayak wrote:
> [snip]
> 2) Sure, the overload can be made but what are you specifically looking for?
> apply(f, axis)(indexes) ?
> [snip]

I see
void apply(alias Fn, int axis, T)(T index)
and
void apply(alias Fn)()
in the current implementation.

I think you interpreted what I am asking as something like
void apply(alias Fn, int axis, T[])(T[] indices)
which also might make sense.

But I guess I was suggesting something a little simpler, like
void apply(alias Fn, int axis)()
so that it applies to all the rows or columns.

This is particularly relevant in the homogeneous data case. My motivation reflects a common use case of the apply function in R to calculate summary statistics of an array/matrix by column or row. For instance, I might want to calculate the standard deviation of every column.
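That per-column use case can be sketched in standalone D. The names here (`applyPerColumn`, `stdDev`) are illustrative; the point is the shape of the suggested `apply(alias Fn, int axis)()` overload, which reduces every column with the same function:

```d
import std.algorithm : map, sum;
import std.array : array;
import std.math : isClose, sqrt;

// Apply a reducing function to every column of a homogeneous table,
// returning one result per column.
double[] applyPerColumn(alias Fn)(double[][] rows)
{
    auto nCols = rows[0].length;
    double[] results;
    foreach (c; 0 .. nCols)
        results ~= Fn(rows.map!(r => r[c]).array);
    return results;
}

// Sample standard deviation (n - 1 denominator).
double stdDev(double[] xs)
{
    auto mean = xs.sum / xs.length;
    auto var = xs.map!(x => (x - mean) ^^ 2).sum / (xs.length - 1);
    return sqrt(var);
}

void main()
{
    double[][] table = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]];
    auto sds = applyPerColumn!stdDev(table);
    assert(isClose(sds[0], 1.0));
    assert(isClose(sds[1], 10.0));
}
```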
June 26, 2019
On Tuesday, 25 June 2019 at 21:07:35 UTC, jmh530 wrote:
> [snip]

The apply right now works exactly as
void apply(alias Fn, int axis, T)(T indices)
where indices can be an array of integers or a 2D array of string indexes:
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L1200

The overload you need also exists: apply(Fn)
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L1246

Unittest for apply -
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L2821

I agree, things like mean and standard deviation calculations are of the utmost importance in data science. Aggregate will bring such features as inbuilt functions: Count, Min, Max, Mean, SD, Variance, etc.
This will be added soon (by soon I mean somewhere between the final week of this stage [possibly sooner] and the first week of the next - as soon as groupBy is stable, I will get onto aggregate).
Sorry for the inconvenience.
June 26, 2019
On Wednesday, 26 June 2019 at 05:41:48 UTC, Prateek Nayak wrote:
> [snip]

By no means do you need to apologize for any inconvenience.

I suppose what I am thinking is more about leveraging work that is already done as much as possible. For instance, I know that count/sum/min/max are part of mir-algorithm already and I had helped add sd and variance to numir.

Do you mind if I send you an email?
June 29, 2019
On Wednesday, 26 June 2019 at 12:50:23 UTC, jmh530 wrote:
> [snip]
>
> Do you mind if I send you an email?

I'm sorry I couldn't reply sooner - I was sick for the past couple of days.
I don't mind emails one bit. The email id is: lelouch.cpp@gmail.com [I know it's weird :)]
I'll reply as soon as I see the mail [at worst it will take 12 hours to get a reply from me, when my phone doesn't notify me of a new mail; at best I'll reply immediately].

Again, sorry for the delayed response. Hope to hear from you soon :)