May 30, 2019
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
> On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
>> [snip]
>
> Glad to see progress being made on this!
>
> Somewhat tangentially on the interoperability, I have also made quite a bit of use of R's data.frames. One difference between that and what I have seen of the implementation is that R's data.frames allow different columns to have different types. This makes certain kinds of analysis of groups very easy. For instance, right now I'm working with a dataset whose columns are doubles, dates, integers, bools, and strings. I can do the equivalent of groupby on the strings as "factors" in R and it's pretty straightforward to get everything working nicely.

On second thought, my mentor Nicholas Wilson led me to an interesting GitHub Gist
-> https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe

A similar structure can be used to represent non-homogeneous data. The DataFrame structure can be overloaded for such an integration. However, a homogeneous DataFrame still remains the main objective for now. This integration will definitely happen once the homogeneous DataFrame comes close to looking and working like an actual DataFrame.

I'll keep you updated here in case I find anything better for non-homogeneous data and when the whole thing starts to take shape.
May 30, 2019
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
> R's data.frames allow for different columns to be different types.

I'm not familiar with data frames, but this sounds like an array of structs.


May 30, 2019
On Thursday, 30 May 2019 at 03:38:50 UTC, Prateek Nayak wrote:
> [snip]
>
> The DataFrame currently uses Mir's ndslice at the core of it which allows for homogeneous data to be stored within it.
> Right now, we are considering operable data to be homogeneous keeping the API simpler.
> I'm not sure how something like Variant will play out in this scenario. It may allow for data to be flexible but parsing will probably require an assertion library.

It's probably smart to focus on getting the homogeneous case working first.

I don't think of it as the entire thing being Variant, so much as a tuple containing 1-dimensional mir slices that are all the same length. The idea is that each column should have its own type. I had done a simple implementation of this a year or so ago and had shown Ilya.
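
A minimal sketch of that tuple-of-columns idea, assuming plain built-in arrays in place of mir slices. `Frame` and `makeFrame` are hypothetical names used only for illustration; they are not part of magpie or mir.

```d
import std.typecons : Tuple, tuple;

// Hypothetical container: one built-in array per column,
// each column keeping its own element type.
struct Frame(Columns...)
{
    Tuple!Columns columns;

    // Number of rows, taken from the first column.
    size_t rows()
    {
        return columns[0].length;
    }
}

// Hypothetical helper: checks that all columns share one length,
// then bundles them into a Frame.
auto makeFrame(Columns...)(Columns cols)
{
    static foreach (i; 1 .. Columns.length)
        assert(cols[i].length == cols[0].length, "columns must have equal length");
    return Frame!Columns(tuple(cols));
}

void main()
{
    auto df = makeFrame([1.5, 2.5, 3.5], ["a", "b", "a"], [true, false, true]);
    assert(df.rows == 3);

    // The element type of every column is known at compile time.
    static assert(is(typeof(df.columns[0]) == double[]));
    static assert(is(typeof(df.columns[1]) == string[]));
}
```

Because the column types are template arguments, a mismatch such as adding a string column to a double column would be rejected at compile time rather than at run time.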
May 30, 2019
On Thursday, 30 May 2019 at 04:38:09 UTC, Prateek Nayak wrote:
> 
>
> On second thought, my mentor Nicholas Wilson led me to an interesting GitHub Gist
> -> https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe
>
> A similar structure can be used to represent non-homogeneous data. The DataFrame structure can be overloaded for such an integration. However, a homogeneous DataFrame still remains the main objective for now. This integration will definitely happen once the homogeneous DataFrame comes close to looking and working like an actual DataFrame.
>
> I'll keep you updated here in case I find anything better for non-homogeneous data and when the whole thing starts to take shape.

Hmm, my point above was for a tuple of mir slices, which seems to correspond more to a struct of arrays. I wonder if your implementation would be able to take an array of structs approach using built-in mir. I know mir can handle complex numbers, which requires a struct. I don't know off the top of my head how it handles, more generally, a slice whose element type is a struct.
May 30, 2019
On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
> [snip]
>
> I'm not familiar with data frames, but this sounds like an array of structs.

I think I was thinking of it more like a struct of arrays, but I think an array of structs may also work (see my responses to Prateek)...
May 30, 2019
On Thursday, 30 May 2019 at 02:16:16 UTC, James Blachly wrote:
> [snip]
>
> https://wesmckinney.com/blog/apache-arrow-pandas-internals/

This was a good read.

The columnar data structures he describes sound more like struct of arrays than array of structs.
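
For anyone following along, here is a rough sketch of the two layouts using only built-in D arrays. The `Row` and `Columns` types and the `toColumns` helper are hypothetical, made up for illustration; this is not how Arrow, pandas, or magpie store data.

```d
import std.algorithm : sum;

// Array of structs: one record per row, fields interleaved in memory.
struct Row
{
    double price;
    int quantity;
    string label;
}

// Struct of arrays: one contiguous array per column.
struct Columns
{
    double[] price;
    int[] quantity;
    string[] label;
}

// Convert row-oriented data into the columnar layout.
Columns toColumns(Row[] rows)
{
    Columns c;
    foreach (r; rows)
    {
        c.price ~= r.price;
        c.quantity ~= r.quantity;
        c.label ~= r.label;
    }
    return c;
}

void main()
{
    auto rows = [Row(1.0, 2, "a"), Row(2.5, 4, "b")];
    auto cols = toColumns(rows);

    // A column-wise reduction reads a single contiguous array.
    assert(cols.price.sum == 3.5);
    assert(cols.quantity == [2, 4]);
}
```

In the struct-of-arrays form, an operation over one column never has to step over the other fields, which is the property the columnar description relies on.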
June 04, 2019
On Thursday, 30 May 2019 at 09:41:59 UTC, jmh530 wrote:
> On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
>> [snip]
>>
>> I'm not familiar with data frames, but this sounds like an array of structs.
>
> I think I was thinking of it more like a struct of arrays, but I think an array of structs may also work (see my responses to Prateek)...

Due to the popularity of heterogeneous DataFrames, we decided to take care of it in the early stages of development before it's too late.

The heterogeneous DataFrame is now live at: https://github.com/Kriyszig/magpie/tree/experimental

Some parts are still under development, but the goals in the roadmap will be reached on time.

---------------------------------
Summing up the first week of GSoC
---------------------------------

* Base and file I/O ops were built for the homogeneous DataFrame.
* Based on the type of data the community has worked with, it seemed evident homogeneous DataFrames weren't gonna cut it. So a rebuild was initiated over the weekend to allow for heterogeneous data.
* The API was overhauled to allow for heterogeneous DataFrames.
* A new parser was added that can parse selected columns.

The code will land in master once it's cleaned up and is deemed stable.

----------------------------------------
Things that will be dealt with this week
----------------------------------------

This week will be for:

* Improving the parser
* Overhauling the code structure (in experimental)
* Adding setters for data and index in DataFrame
* Adding functions to create a multi-indexed DataFrame, the same way one can do in Python
* Adding documentation and examples
* Index operations
* Retrieving rows and columns

The last one will set in motion the implementation of Column Binary ops of the form:
df["Index1"] = df["Index2"] + df["Index3"];
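
A rough sketch of what such a column operation amounts to, assuming each column is a plain `double[]` stored in an associative array keyed by name. This is only an illustration: `addColumns` is a hypothetical helper, and neither it nor the storage shown here is magpie's actual API.

```d
// Hypothetical helper: element-wise sum of two equal-length columns,
// returned as a newly allocated column.
double[] addColumns(const double[] a, const double[] b)
{
    assert(a.length == b.length, "columns must have equal length");
    auto result = new double[a.length];
    result[] = a[] + b[]; // built-in D array operation
    return result;
}

void main()
{
    // Stand-in for a DataFrame: columns keyed by name (an assumption).
    double[][string] df = [
        "Index2": [1.0, 2.0, 3.0],
        "Index3": [10.0, 20.0, 30.0]
    ];

    auto total = addColumns(df["Index2"], df["Index3"]);
    df["Index1"] = total;

    assert(df["Index1"] == [11.0, 22.0, 33.0]);
}
```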

Meanwhile, if you guys have any more suggestions, please feel free to contact me - you can use this thread, open an issue on GitHub, reach out to me on Slack (Prateek Nayak) or you can email me directly (lelouch.cpp@gmail.com)
June 04, 2019
On Tuesday, 4 June 2019 at 03:13:03 UTC, Prateek Nayak wrote:
> [snip]
>
> Due to the popularity of heterogeneous DataFrames, we decided to take care of it in the early stages of development before it's too late.
>

Excellent!
June 05, 2019
On 6/3/19 11:13 PM, Prateek Nayak wrote:
> Due to the popularity of heterogeneous DataFrames, we decided to take care of it in the early stages of development before it's too late.
> 
> The heterogeneous DataFrame is now live at: https://github.com/Kriyszig/magpie/tree/experimental

Amazing, thanks!

The experimental branch README code snippet has a typo; the variable name should be `heterogeneous`:

```
// Creating a heterogeneous DataFrame of 10 integer columns and 10 double columns
DataFrame!(int, 10, double, 10) homogeneous;
```

Again I cannot thank you enough!
June 06, 2019
On Thursday, 6 June 2019 at 00:24:17 UTC, James Blachly wrote:
> On 6/3/19 11:13 PM, Prateek Nayak wrote:
>> Due to the popularity of heterogeneous DataFrames, we decided to take care of it in the early stages of development before it's too late.
>> 
>> The heterogeneous DataFrame is now live at: https://github.com/Kriyszig/magpie/tree/experimental
>
> Amazing, thanks!
>
> The experimental branch README code snippet has a typo; the variable name should be `heterogeneous`:
>
> ```
> // Creating a heterogeneous DataFrame of 10 integer columns and 10 double columns
> DataFrame!(int, 10, double, 10) homogeneous;
> ```
>
> Again I cannot thank you enough!

Oops! Thanks for spotting that. I'll update it today with a complete usage example and a snippet for each function added since it was last modified.