October 30, 2020
On Thursday, 29 October 2020 at 22:52:59 UTC, Russel Winder wrote:
> [snip]
>
> I only quickly skimmed the blog page, so this is a first reaction. I shall read the material more carefully tomorrow and send an update.
>
> 1. People have been trying to make Python execute faster for 30 years. In the end everyone ends up just using CPython with any and all optimisations it can get in.
>
> 2. Python is slow, and fundamentally single threaded. Attempts to make Python multi-threaded seem to fall by the wayside. The micro-benchmarks seem to indicate Pyston is just a slightly faster Python and thus nothing really to write home about – yes even a headline figure of 20% is nothing to write home about!
>
> [snip]

I think the point about multi-threaded Python came across as a big complaint there. There were lots of mentions of the GIL and of people being CPU-bound. Pandas was mentioned in this context as well.
October 30, 2020
On Fri, 2020-10-30 at 10:12 +0000, jmh530 via Digitalmars-d wrote:
> 
[…]
> I think the point about multi-threaded Python came across as a big complaint there. There were lots of mentions of the GIL and of people being CPU-bound. Pandas was mentioned in this context as well.

<< I haven't properly read the blog entry as yet. Sorry. >>

Guido saw the GIL as absolutely fine for CPython in perpetuity (he and I had a long "discussion" about this at EuroPython 2010; there were many witnesses), and felt that if PyPy came up with a GIL-free VM then that would be fine. His mindset was (and I suspect may still be) that Python code was/is not about being CPU-bound; it was/is about sequential and concurrent code, not parallel code for performance. As long as there are NumPy and other PVM extensions, or message passing between processes, that allow for GIL-free, parallel, CPU-bound processing, it is hard to say Guido was/is wrong. (And in 2010 it was even harder :-) )

Having thought about it on and off for a decade, I am happy with the status quo around Python. Python code is (or should be) highly maintainable code designed for execution on a single-threaded VM, easily understood and amended. Anyone trying to write CPU-bound code in Python is "doing it wrong". Whether D is the right alternative, or a language such as Chapel is better, is a moot point.

Pandas is built on NumPy and so has the same parallelism properties as any other NumPy-based package.

-- 
Russel.
===========================================
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk



October 30, 2020
On Thursday, 29 October 2020 at 22:52:59 UTC, Russel Winder wrote:
> 1. People have been trying to make Python execute faster for 30 years. In the end everyone ends up just using CPython with any and all optimisations it can get in.

I think where such efforts go wrong is that they try to optimize Python itself instead of looking at the usage pattern of most programmers. Most Python users never use many of the esoteric features (including concurrency, beyond generators) that Python offers.

So you could fairly easily create a simpler language, with libraries implemented at a low level, that exhibits behaviour close enough to Python for current Python users to feel at home with it.

October 30, 2020
On Friday, 30 October 2020 at 12:15:58 UTC, Russel Winder wrote:
> On Fri, 2020-10-30 at 10:12 +0000, jmh530 via Digitalmars-d wrote:
>> 
> […]
>> I think the point about multi-threaded Python came across as a big complaint there. There were lots of mentions of the GIL and of people being CPU-bound. Pandas was mentioned in this context as well.
>
> << I haven't properly read the blog entry as yet. Sorry. >>
>
> Guido saw the GIL as absolutely fine for CPython in perpetuity (he and I had a long "discussion" about this at EuroPython 2010; there were many witnesses), and felt that if PyPy came up with a GIL-free VM then that would be fine. His mindset was (and I suspect may still be) that Python code was/is not about being CPU-bound; it was/is about sequential and concurrent code, not parallel code for performance. As long as there are NumPy and other PVM extensions, or message passing between processes, that allow for GIL-free, parallel, CPU-bound processing, it is hard to say Guido was/is wrong. (And in 2010 it was even harder :-) )
>
> Pandas is built on NumPy and so has the same parallelism properties as any other NumPy-based package.

I've spent much of the last 5 years writing code for trade studies and other optimisations on top of Python, NumPy and multiprocessing. Lately I have been working a lot with Pandas for multi-dimensional optimisation and machine learning.

The slow performance of Python in the glue layer between NumPy, multiprocessing etc. is a non-issue. I can easily keep all 8 cores very busy running efficient C++ CFD codes, machine learning codes etc. using the above combination.

The migration from P2 to P3 was also pretty tame. For people doing real work, it's not a big deal. Sure, it was a distraction, but it has its benefits; I'm glad they did it. It's a boring opinion, and it doesn't generate ad income from blog hits, but there you go.

I would like to see D have a NumPy equivalent, but realistically you won't duplicate the NumPy ecosystem here; it's too much work. And why do it? Just wrap up the NumPy ecosystem from D and use it like that; a sketch of the idea follows below.
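
For example, here is a minimal sketch of that wrapping approach. It assumes the PyD package ("pyd" on code.dlang.org) as a dub dependency and a Python installation with NumPy available; the variable names and the computation are purely illustrative:

```d
// Sketch: driving NumPy from D through PyD's embedded interpreter.
// Assumption: dub dependency "pyd" and a Python with numpy installed.
import pyd.embedded;

void main() {
    py_init();  // start the embedded CPython interpreter

    auto ctx = new InterpContext();
    ctx.py_stmts("import numpy as np");

    // Hand a D array to Python, let NumPy do the work, pull the result back.
    ctx.xs = [1.0, 2.0, 3.0, 4.0];
    ctx.py_stmts("total = float(np.asarray(xs).sum())");
    double total = ctx.total.to_d!double();
    assert(total == 10.0);
}
```

Conversion costs apply at the boundary, so the usual rule holds: keep the inner loops on the NumPy side.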

Core Pandas on its own, BTW, isn't hard to implement IMO. It turns out it's very expressive and very useful, but it's not a hard thing to copy.
October 30, 2020
On Friday, 30 October 2020 at 18:23:38 UTC, Abdulhaq wrote:

> I would like to see D have a NumPy equivalent, but realistically you won't duplicate the NumPy ecosystem here; it's too much work. And why do it? Just wrap up the NumPy ecosystem from D and use it like that.

I would love to see this: a project to use the functionality of Python, R, and Julia from inside a D program with little effort. William Stein did something like that with SageMath, but from a different angle. I can say the R part is simple; a sketch follows below. (Not only the parts written in R, but any underlying C, C++, or Fortran code with R bindings as well.) I wouldn't expect it to be much harder for the other languages, but since I don't work with them, I can't say. The advantage of D would be the new functionality you write in D on top of the existing functionality in those languages.
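
As a rough illustration of why the R side is tractable, here is a heavily hedged sketch of embedding R in a D program through R's documented C embedding API (Rembedded.h). The extern(C) declarations are hand-translated, and you would link against libR; everything past initialisation is left as a placeholder:

```d
// Sketch: starting and stopping an embedded R interpreter from D.
// The declarations mirror R's Rembedded.h; link with -lR.
extern (C) {
    int Rf_initEmbeddedR(int argc, char** argv);
    void Rf_endEmbeddedR(int fatal);
}

void main() {
    // R expects argv-style startup options.
    char*[3] args = [cast(char*) "R".ptr,
                     cast(char*) "--silent".ptr,
                     cast(char*) "--no-save".ptr];
    Rf_initEmbeddedR(cast(int) args.length, args.ptr);

    // ... exchange SEXPs here via R's C API (Rinternals.h) ...

    Rf_endEmbeddedR(0);
}
```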
November 03, 2020
On Friday, 30 October 2020 at 20:32:32 UTC, bachmeier wrote:
> On Friday, 30 October 2020 at 18:23:38 UTC, Abdulhaq wrote:
>
>> I would like to see D have a NumPy equivalent, but realistically you won't duplicate the NumPy ecosystem here; it's too much work. And why do it? Just wrap up the NumPy ecosystem from D and use it like that.
>
> I would love to see this: a project to use the functionality of Python, R, and Julia from inside a D program with little effort. William Stein did something like that with SageMath, but from a different angle. I can say the R part is simple. (Not only the parts written in R, but any underlying C, C++, or Fortran code with R bindings as well.) I wouldn't expect it to be much harder for the other languages, but since I don't work with them, I can't say. The advantage of D would be the new functionality you write in D on top of the existing functionality in those languages.

We can call C++ libraries from our little language written in D, and you can even write C++ inline, compile it at runtime, and call it, thanks to Cling.

We can call Python too, although that's not yet in master. Initially it was via PyD, but people have their own particular Python versions, installs, and setups, so instead we are moving to RPC over named pipes using nanomsg; a sketch of that shape is below. That should generalise to any other languages we would want to call too. Serialisation and deserialisation isn't dirt cheap, but the idea isn't to write inner loops in Python.
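
To make the shape of that concrete, here is a minimal sketch of the requesting (D) side of such an RPC bridge over nanomsg's IPC transport. The constants and declarations mirror nanomsg's public C API (nn.h and reqrep.h); the pipe path and the string wire format are purely illustrative, and a Python worker would bind the REP end:

```d
// Sketch: D side of a REQ/REP RPC bridge to a Python worker over nanomsg IPC.
// Declarations mirror libnanomsg's C API; link with -lnanomsg.
import std.stdio : writeln;
import std.string : toStringz;

extern (C) {
    int nn_socket(int domain, int protocol);
    int nn_connect(int s, const(char)* addr);
    int nn_send(int s, const(void)* buf, size_t len, int flags);
    int nn_recv(int s, void* buf, size_t len, int flags);
    int nn_close(int s);
}

enum AF_SP = 1;    // from nn.h
enum NN_REQ = 48;  // NN_PROTO_REQREP * 16 + 0, from reqrep.h

void main() {
    int sock = nn_socket(AF_SP, NN_REQ);
    assert(sock >= 0);
    // Hypothetical pipe path; the Python worker binds the REP end.
    nn_connect(sock, "ipc:///tmp/py_worker.ipc".toStringz);

    // Illustrative wire format: send an expression, get its result back.
    string request = `{"eval": "np.arange(10).sum()"}`;
    nn_send(sock, request.ptr, request.length, 0);

    char[4096] reply;
    int n = nn_recv(sock, reply.ptr, reply.length, 0);
    if (n >= 0)
        writeln("worker replied: ", reply[0 .. n]);
    nn_close(sock);
}
```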

There's a lot more overhead doing it this way; it's not free. But it is valuable for internal use for the problems we currently have.

I have a little plugin that uses your R wrapper but it's not used by anyone yet.

Time taken to get to a first version matters for us, and the first version doesn't usually need to be fast for user code. This approach should allow us to access libraries without having to couple that access to a language choice.

In time I figure we could use Cling to generate declarations and light wrappers for C++ too.

Robert Schadek made a beginning on Julia integration work but we haven't had time to do more than that.

November 05, 2020
On Tuesday, 3 November 2020 at 22:51:14 UTC, Laeeth Isharc wrote:
> Robert Schadek made a beginning on Julia integration work but we haven't had time to do more than that.

If you're just passing arrays and pointers between Julia and D, this is pretty simple, no? Julia's ccall makes that relatively easy. You can even compile D code and call it from Julia; that should be pretty straightforward (see the sketch below). Calling Julia from D just needs the Julia C API, which again is pretty straightforward: you'll need to translate what you need from the julia.h header file.
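
For instance, the Julia-calling-D direction can look like the following sketch. The function and library names are made up for illustration, and the Julia side is shown in a comment:

```d
// Sketch: a D function with C linkage that Julia's ccall can reach.
// Build as a shared library, e.g.:
//   dmd -shared -fPIC -betterC sumarray.d -of=libsumarray.so
// (-betterC sidesteps druntime initialisation inside the library.)
extern (C) double sum_array(const(double)* data, size_t len) {
    double total = 0;
    foreach (i; 0 .. len)
        total += data[i];
    return total;
}

// From the Julia side, something like:
//   v = rand(1_000)
//   s = ccall((:sum_array, "./libsumarray.so"), Cdouble,
//             (Ptr{Cdouble}, Csize_t), v, length(v))
```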
November 05, 2020
On Thursday, 5 November 2020 at 13:11:17 UTC, data pulverizer wrote:
> On Tuesday, 3 November 2020 at 22:51:14 UTC, Laeeth Isharc wrote:
>> Robert Schadek made a beginning on Julia integration work but we haven't had time to do more than that.
>
> If you're just passing arrays and pointers between Julia and D, this is pretty simple, no? Julia's ccall makes that relatively easy. You can even compile D code and call it from Julia; that should be pretty straightforward. Calling Julia from D just needs the Julia C API, which again is pretty straightforward: you'll need to translate what you need from the julia.h header file.

The question for me is if you can work with the same data structures in D, R, Python, and Julia. Can your main program be written in D, but calling out to all three for loading, transforming, and analyzing the data? I'm guessing not, but would be awesome if you could do it.
November 05, 2020
On Thursday, 5 November 2020 at 19:18:11 UTC, bachmeier wrote:
> [snip]
>
> The question for me is if you can work with the same data structures in D, R, Python, and Julia. Can your main program be written in D, but calling out to all three for loading, transforming, and analyzing the data? I'm guessing not, but would be awesome if you could do it.

Yeah, that would be pretty nice. However, I would emphasize what aberba has been saying across several different threads, which is the importance of documentation and tutorials. It's nice to have the ability to do it, but if you don't make it clear for the typical user of R/Python/Julia to figure it out, then the reach will be limited.
November 05, 2020
On Thursday, 5 November 2020 at 19:18:11 UTC, bachmeier wrote:
>
> The question for me is if you can work with the same data structures in D, R, Python, and Julia. Can your main program be written in D, but calling out to all three for loading, transforming, and analyzing the data? I'm guessing not, but would be awesome if you could do it.

It's actually a problem I've been thinking about on and off for a while, but I haven't got round to actually trying to implement it.

1. If I had to do this, I would first decide on a collection of common data structures to share, starting with *compositions* of R/Python/Julia-style multi-dimensional arrays: contiguous arrays of basic element types, with the dimension information held in another array. So a 2x3 double matrix is a double array of length 6 plus a long array containing [2, 3]. R has externalptr, Julia can interface with pointers, as can Python. (A sketch of this layout, together with point 2, follows after the list.)

2. Next I would use memory-mapped I/O for storage. Usually a memory-mapped file is only accessible to the process that opened it for security reasons, but I believe this can be changed; you could use cryptographic keys to control access to the files between processes, so that memory written in one language can be accessed by another.

3. Binary file I/O for those structures is pretty simple, but it is necessary so that results can be stored and then read back by any of the programs afterwards.

4. All the languages have C APIs, so you would write interfaces in D using those to call from D into the languages. All the languages can also call D extern(C) functions in shared libraries directly, using their versions of ccall.
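
As a concrete (and heavily simplified) sketch of points 1 and 2: the struct below pins down the flat-buffer-plus-dims layout, and std.mmfile puts the buffer in a memory-mapped file that any of the languages could open by path. The file name, the column-major choice, and the helper names are all assumptions for illustration:

```d
// Sketch of the shared layout: a contiguous buffer plus a dims array,
// backed by a memory-mapped file so other processes/languages can open it.
import std.mmfile : MmFile;

struct Tensor2D {
    double[] data;  // contiguous elements, column-major like R and Julia
    long[2] dims;   // e.g. [2, 3] for a 2x3 matrix

    // Column-major element access: data[i + rows * j].
    double opIndex(long i, long j) const {
        return data[i + dims[0] * j];
    }
}

void main() {
    enum rows = 2, cols = 3;

    // Create (or open) a shared file big enough for the elements.
    auto mm = new MmFile("tensor.bin", MmFile.Mode.readWrite,
                         rows * cols * double.sizeof, null);
    auto buf = cast(double[]) mm[];

    // Fill column by column, then view through the Tensor2D helper.
    foreach (k, ref x; buf) x = k + 1;        // 1, 2, ..., 6
    auto m = Tensor2D(buf, [rows, cols]);
    assert(m[1, 2] == 6);  // row 1, column 2 in column-major order

    // In practice the dims would live in a small header or a sidecar
    // file so that R/Python/Julia can reconstruct the shape.
}
```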

Another alternative to mmap is network serialisation, which would be more cross-platform and fungible, but it seems to me that it could be slow.