July 03, 2019
On Saturday, 29 June 2019 at 04:39:39 UTC, Prateek Nayak wrote:
> [snip]

-------------
Weekly Update
-------------

I caught the flu this week, which was really unfortunate. However, I'm getting better and the work is going forward :)

-----------------------
What happened last week
-----------------------

I mostly dealt with the internal structure of `Group` - the structure that is returned by the groupBy operation.
At first I thought an array of DataFrames might be a good idea, but I soon dropped it: some parts, like the column index, stay the same across groups yet would have to be copied into every DataFrame in the array, which is just a waste of space.
The implementation now looks somewhat similar to the DataFrame structure itself - there is an `Index` and the `data`. The index is sorted based on the groups formed as a result of groupBy.
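To make that shape a little more concrete, here is a rough sketch of what such a structure could look like - the names and layout below are only illustrative, not the actual code in Magpie:

import std.meta : staticMap;

// Placeholder index type, standing in for Magpie's `Index`.
struct Index
{
    string[][] labels;
}

alias ToArray(T) = T[];

// Illustrative only: one shared Index plus one array per column,
// mirroring the DataFrame layout instead of copying it per group.
struct Group(RowTypes...)
{
    Index groupIndex;                    // sorted by the grouping keys
    staticMap!(ToArray, RowTypes) data;  // e.g. (int[], double[], string[])
}

The point is that the index and the column arrays are stored once per Group rather than once per DataFrame in an array.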

There are a few places where optimizations can be made (mostly with respect to the space used), and I'll work on them this week.

Some of the functionality added to `Group` so far:
* display - the user can choose to display a single group or multiple groups
* combine - returns a DataFrame combining the groups the user selects

At this point there was a need for a function in DataFrame that could convert a level of indexing back into a column of operable data when required. This is because combine on a groupBy result doesn't remember the position the data was extracted from: if a column is used for groupBy, it is automatically converted into a level of the index in the result of combine. Hence `indexToData` was added to revert this if the user so desires.

There were a few minor updates here and there, nothing major. One is a new argument for `extends` in `Index`, which can now insert the index at a position of the user's choice. The other is stripping the trailing white space that appeared in display.


--------------------------
What will happen this week
--------------------------
This week will deal with optimizations of `Group` and add binary operations to `Group` which may be helpful. I will document the changes once stability is reached and start work on aggregate/join.

----------
Roadblocks
----------
I can't spot any major roadblocks up ahead. Work should go smoothly this week :)



-> Thank you, jmh530, for sharing your work. This should help improve the functionality of DataFrames further.
July 18, 2019
On Wednesday, 3 July 2019 at 05:04:20 UTC, Prateek Nayak wrote:
> On Saturday, 29 June 2019 at 04:39:39 UTC, Prateek Nayak wrote:
>> [snip]
> [snip]

---------------
Progress Update
---------------

The past couple of weeks went as expected, without any roadblocks.

* groupBy can group a DataFrame based on an arbitrary number of columns
* groupBy returns a Group structure which supports binary operations
* Retrieve a single group or multiple groups as a DataFrame
* Merge two or more Groups into a single DataFrame
* Index operations on Group: an entire column/row is returned as an Axis, the same way index operations on DataFrame are implemented
* Display a Group in the terminal

Work on DataFrame:
* Added shorthand data operations which I missed before! \(°^°)/
* Added a function to convert an index into an operable data column and vice versa

---------------------
What is due this week
---------------------
This week was mostly reserved for refactoring. Mr. Wilson introduced me to the beautiful lockstep in std.range, and I worked it into the codebase wherever it was necessary.
This week I am also adding ways to retrieve data as a Slice and assign a Slice to a DataFrame. This, IMHO, is important because ndslice is widely used, and it opens a lot of doors for data computations. A way to easily retrieve data as a Slice, operate on it, and assign the data back to the DataFrame seems valuable. I hope to get the initial PR ready by the beginning of next week.
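As a rough illustration of the round trip I have in mind (only the mir part below is real; the DataFrame calls in the comments are hypothetical placeholders):

import mir.ndslice;
import std.stdio : writeln;

void main()
{
    // auto sl = df.asSlice!double;               // hypothetical: pull data out of the DataFrame
    auto sl = [1.0, 2, 3, 4, 5, 6].sliced(2, 3);  // stand-in 2x3 Slice

    sl[] *= 2.0;                                  // operate on the Slice in place
    writeln(sl);                                  // [[2, 4, 6], [8, 10, 12]]

    // df[0 .. 2, 0 .. 3] = sl;                   // hypothetical: assign the result back
}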

After this will come aggregate: on a whole DataFrame/Group, on a selected column/row of a DataFrame or Group, and selective operations on selected columns/rows.

-----------------
Future Roadblocks
-----------------
I can't see any obvious roadblocks but then you never do see them coming ¯\_(ツ)_/¯


July 18, 2019
On Thursday, 18 July 2019 at 05:03:38 UTC, Prateeek Nayak wrote:
> [snip]

Thanks for the update. I'm glad you're still making good progress.

I'm just looking over the readme.md. I noticed the "at" function has a signature like at!(row, column)(). Because it uses a template, doesn't that imply that the row and column parameters must be known at compile time? What if we want run-time access using a function style instead of something like df[0, 0]? mir's ndslice also has a set of select functions that are useful for access.

There's also a typo in the GroupBy text:
"Group DataFrame based on na arbitrary number of columns."

I noticed that you make a lot of use of static foreach over RowType in dataframe.d. Does that mean there isn't any extra cost if you use a homogeneous dataframe with RowType.length == 1? If you can advertise that it doesn't have any additional overhead for working with homogeneous data, then that's probably a win. You might also add a trait for isHomogeneous that checks if RowType.length == 1.
July 18, 2019
On Thursday, 18 July 2019 at 10:55:55 UTC, jmh530 wrote:
> [snip]

* "at" was for a fast access to element. It's only necessary to know one of the two argument at compile time to be honest but df[i1, i2] has to be written as at!(i2)(i1) which reverses the two position hence I thought at!(i1, i2) could reduce some mishap that position reversal can cause.
I agree a method to access the element at runtime. I will overload at for that.

* Sorry about the typo, will fix it soon (^_^)

* The data in a DataFrame is stored as a TypeTuple, which requires the column index to be known statically. When trying to do a runtime operation on the data, I was forced to traverse the tuple statically to find the particular index. A homogeneous DataFrame defined as DataFrame!(int, 5) will give a RowType of (int, int, int, int, int).
For now that overhead still exists, but I think an isHomogeneous template can open some new doors for optimization. I will definitely look into this over the next week. Thanks for bringing it to my notice.
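For anyone curious, the traversal pattern looks roughly like this - a self-contained toy example, not the actual Magpie code:

import std.typecons : tuple;
import std.variant : Variant;
import std.stdio : writeln;

void main()
{
    // toy stand-in for the internal column tuple of a DataFrame!(int, double)
    auto columns = tuple([1, 2, 3], [1.5, 2.5, 3.5]);

    size_t row = 1, col = 1;    // run-time indices
    Variant element;

    // the tuple has to be walked statically; the run-time column index is
    // matched against each compile-time position
    static foreach (i; 0 .. typeof(columns).Types.length)
    {
        if (col == i)
            element = columns[i][row];
    }

    writeln(element);           // 2.5
}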

July 18, 2019
Looking at the readme, I see the following example for accessing elements by name:

df[["1"], ["0"]];

Why can't that instead be

df["1", "0"];

Something that gets in the way of adoption is verbose notation, and I'm not seeing any advantage to the array notation.

Also, for this example:

Index indx;
indx.setIndex([1, 2, 3, 4], ["Row Index"], [1, 2, 3], ["Column Index"]);

That's pretty verbose/hard to parse compared to

rownames(x) = [1, 2, 3, 4];
colnames(x) = [1, 2, 3];
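Just to show that the terser forms are possible in D, here is a sketch with a dummy type (string labels only, and no claim about how Magpie is structured internally):

// Dummy stand-in for DataFrame, only to show that the overloads can coexist.
struct Frame
{
    double[string][string] cells;   // cells[row][column], purely illustrative
    string[] rowLabels, colLabels;

    // existing array form: df[["1"], ["0"]]
    double opIndex(string[] rows, string[] cols)
    {
        return cells[rows[0]][cols[0]];
    }

    // terser scalar form: df["1", "0"], forwarding to the array form
    double opIndex(string row, string col)
    {
        return this[[row], [col]];
    }
}

// Ref-returning accessors so that rownames(x) = [...] style assignment works.
ref string[] rownames(return ref Frame x) { return x.rowLabels; }
ref string[] colnames(return ref Frame x) { return x.colLabels; }

With that, both df["1", "0"] and rownames(x) = [...] compile while the existing array form keeps working.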


July 18, 2019
On Thursday, 18 July 2019 at 15:34:32 UTC, Prateeek Nayak wrote:
> [snip]
>
> * The data in a DataFrame is stored as a TypeTuple, which requires the column index to be known statically. When trying to do a runtime operation on the data, I was forced to traverse the tuple statically to find the particular index. A homogeneous DataFrame defined as DataFrame!(int, 5) will give a RowType of (int, int, int, int, int).
> For now that overhead still exists, but I think an isHomogeneous template can open some new doors for optimization. I will definitely look into this over the next week. Thanks for bringing it to my notice.

Ah, so what you would want to check is that all the RowTypes are the same instead.
July 18, 2019
On Thursday, 18 July 2019 at 16:23:20 UTC, jmh530 wrote:
> On Thursday, 18 July 2019 at 15:34:32 UTC, Prateeek Nayak wrote:
>> [snip]
> Ah, so what you would want to check is that all the RowTypes are the same instead.

Yes. It will require a small redesign of the internal structure and some optimizations here and there, but it can seriously cut down the overhead.
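The check itself can be tiny, something along these lines (just a sketch, the final name and form may differ):

import std.meta : NoDuplicates;

// True when every column type in the row is the same.
enum isHomogeneous(RowTypes...) =
    RowTypes.length > 0 && NoDuplicates!RowTypes.length == 1;

static assert( isHomogeneous!(int, int, int));
static assert(!isHomogeneous!(int, double, string));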


July 19, 2019
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
> Hello everyone,
>
> I have begun work on my Google Summer of Code 2019 project DataFrame for D.

Really glad to see someone working on that. I hope you will have time to implement a good CSV/TSV reader/writer based on the fantastic iopipe project (which should IMHO go into Phobos one way or another)...
July 23, 2019
On Thursday, 18 July 2019 at 16:02:40 UTC, bachmeier wrote:
> [snip]

I'm really sorry I overlooked this (-‸ლ)
I'll fix the first case in the PR where I'll make optimizations for homogeneous DataFrames.
I'll address the second problem, the verbosity, soon, but not in the immediate PR.

Thanks for the feedback ٩(^‿^)۶

July 23, 2019
On Friday, 19 July 2019 at 14:50:35 UTC, Dejan Lekic wrote:
> [snip]

Right now there is a CSV reader in Magpie, but it isn't polished enough to go into Phobos yet. I'll improve the parser, and when I'm happy with the read speed, I'll send a PR (^_^)