October 15, 2015
On Thursday, 15 October 2015 at 07:57:51 UTC, Russel Winder wrote:
> On Thu, 2015-10-15 at 06:48 +0000, data pulverizer via Digitalmars-d-learn wrote:
>> 
> […]
>> A journey of a thousand miles ...
>
> Exactly.
>
>> I tried to start creating a data table type object by
>> investigating variantArray:
>> http://forum.dlang.org/thread/hhzavwrkbrkjzfohczyq@forum.dlang.org
>>  but hit the snag that D is a static programming language and may not
>> allow the kind of behaviour you need for creating the same kind of
>> behaviour you need in data table - like objects.
>> 
>> I envisage such an object as being composed of arrays of vectors where each vector represents a column in a table as in R - easier for model matrix creation. Some people believe that you should work with arrays of tuple rows - which may be more big data friendly. I am not overly wedded to either approach.
>> 
>> Anyway it seems I have hit an inherent limitation in the language. Correct me if I am wrong. The data frame needs to have dynamic behaviour bind rows and columns and return parts of itself as a data table etc and since D is a static language we cannot do this.
>
> Just because D doesn't have this now doesn't mean it cannot. C doesn't have such capability but R and Python do even though R and CPython are just C codes.
>
> Pandas data structures rely on the NumPy n-dimensional array implementation, it is not beyond the bounds of possibility that that data structure could be realized as a D module.
>
> Is R's data.table written in R or in C? In either case, it is not beyond the bounds of possibility that that data structure could be realized as a D module.
>
> The core issue is to have a seriously efficient n-dimensional array that is amenable to data parallelism and is extensible. As far as I am aware currently (I will investigate more) the NumPy array is a good native code array, but has some issues with data parallelism and Pandas has to do quite a lot of work to get the extensibility. I wonder how the R data.table works.
>
> I have this nagging feeling that like NumPy, data.table seems a lot better than it could be. From small experiments D is (and also Chapel is even more) hugely faster than Python/NumPy at things Python people think NumPy is brilliant for. Expectations of Python programmers are set by the scale of Python performance, so NumPy seems brilliant. Compared to the scale set by D and Chapel, NumPy is very disappointing. I bet the same is true of R (I have never really used R).
>
> This is therefore an opportunity for D to step in. However it is a journey of a thousand miles to get something production worthy. Python/NumPy/Pandas have had a very large number of programmer hours expended on them.  Doing this poorly as a D modules is likely worse than not doing it at all.

I think it's much better to start, which means solving your own problems in a way that is acceptable to you rather than letting perfection be the enemy of the good.  It's always easier to do something a second time too, as you learn from successes and mistakes and you have a better idea about what you want.  Of course it's better to put some thought into design early on, but that shouldn't end up in analysis paralysis.  John Colvin and others are putting quite a lot of thought into dlang science, it seems to me, but he is also getting stuff done.  Running D in a Jupyter notebook is something very useful.  It doesn't matter that it's cosmetically imperfect at this stage, and it won't stay that way.  And that's just a small step towards the bigger goal.

October 15, 2015
On Wednesday, 14 October 2015 at 15:25:22 UTC, David DeWitt wrote:
> On Wednesday, 14 October 2015 at 14:48:22 UTC, John Colvin wrote:
>> On Wednesday, 14 October 2015 at 14:32:00 UTC, jmh530 wrote:
>>> On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
>>>> https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
>>>> Andrei suggested posting more widely.
>>>
>>> I was just writing some R code yesterday after playing around with D for a couple weeks. I accomplished more in an afternoon of R coding than I think I had in like a month's worth of playing around with D. The same is true for python.
>>
>> As someone who uses both D and Python every day, I find that - once you are proficient in both - initial productivity is higher in Python and then D starts to overtake as a project gets larger and/or has stricter requirements. I hope never to have to write anything longer than a thousand lines in Python ever again.
>
> That's true until you need to connect to other systems.  There are countless clients built for other systems that are used in real world applications.  With web development the Python code really just becomes glue and APIs nowadays.  I understand D is faster until you have to build the clients for systems to connect.  We have an application that uses Postgres, ElasticSearch, Kafka, Redis, etc. This is plenty fast, and the productivity of Python is higher than D's because the clients for Elasticsearch, Postgres and various other systems are unavailable or incomplete.  Sure, D is faster, but when you have other real world systems to connect to and time constraints on projects, how can D be more productive or faster?  Our Python code essentially becomes the API and usage of clients to other systems which handle a majority of the hardcore processing.  Once D gets established with those clients and they are battle tested, then I will agree.  To me productivity is more than the language itself; it is also building real world applications in a reasonable time-frame.  D will get there but is nowhere near where Python is.

Few thoughts:

1. It's easy to embed Python in your D applications.  I do this for things like web scraping and when I want to write something quick to read simple XML (I just convert to JSON).
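
For what it's worth, a minimal sketch of what that embedding looks like, assuming the third-party PyD library (dub package "pyd") and its pyd.embedded module; the exact function names are from PyD's embedded API as I remember it, so treat this as illustrative rather than definitive:

```d
// Sketch: embedding Python in a D program via PyD (third-party).
// Requires linking against libpython; names below (py_init, py_eval,
// InterpContext) are PyD's embedded API as I recall it - verify
// against the PyD docs before relying on them.
import pyd.embedded;
import std.stdio : writeln;

shared static this()
{
    py_init(); // start the embedded Python interpreter once
}

void main()
{
    // Evaluate a Python expression and convert the result to a D type.
    auto n = py_eval!long("2 + 3");
    writeln(n);

    // Run Python statements (imports, assignments) in a context.
    auto ctx = new InterpContext();
    ctx.py_stmts("import json\nparsed = json.loads('{\"a\": 1}')");
}
```

This is roughly the shape of the "quick XML to JSON" trick: let Python do the parsing, pull the result back into D.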

2. Of course there is a Redis client.  Elasticsearch is an amazing product, but hardly requires much work to have a complete API.  I made a start on this, and if I use Elasticsearch more then I'll have one done and will release it.  I don't know the finer aspects of Postgres to know what is involved.

3. That raises a broader point, which is that it depends on the ultimate aim of your project and what the right tradeoff between different things is.  It will ultimately be much more productive for me to do things in D for the reasons John alludes to.  A little work to get started is neither here nor there in the grand scheme of things.  Adam Ruppe made the same point - it's not all that much work to put a foundation that suits you in place.  You do it once (and maybe add things when something like Elasticsearch comes out), and that's it, apart from minor updates.  The dollar expenditure on building these things is not enormous given the stakes involved for me.  But that doesn't mean you should come to the same answer, as it depends.

4. I am not sure that all web development is just glue, or will be going forward given what might be on the horizon, but time will tell.


Laeeth.


October 17, 2015
On Wednesday, 14 October 2015 at 18:17:29 UTC, Russel Winder wrote:
> On Wed, 2015-10-14 at 14:48 +0000, John Colvin via Digitalmars-d-learn wrote:
>> On Wednesday, 14 October 2015 at 14:32:00 UTC, jmh530 wrote:
>> > On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
>> > > https://www.quora.com/Why-is-Python-so-popular-despite-being-so-s
>> > > low
>> > > Andrei suggested posting more widely.
>> > 
>> > I was just writing some R code yesterday after playing around with D for a couple weeks. I accomplished more in an afternoon of R coding than I think I had in like a month's worth of playing around with D. The same is true for python.
>> 
>> As someone who uses both D and Python every day, I find that - once you are proficient in both - initial productivity is higher in Python and then D starts to overtake as a project gets larger and/or has stricter requirements. I hope never to have to write anything longer than a thousand lines in Python ever again.
>
> The thing about Python is NumPy, SciPy, Pandas, Matplotlib, IPython, Jupyter, GNU Radio. The data science, bioinformatics, quant, signal processing, etc. people do not give a sh!t which language they use; what they want is to get their results as fast as possible. Most of them do not write programs that are to last, they are effectively throw away programs. This leads them to Python (or R) and they are not really interested in learning anything else.
>
> The fact that NumPy sort of sucks in terms of performance, isn't
> noticed by them
> as they get their results "fast enough" and a lot faster than
> sequential Python. The fact that if they used Chapel or even D for
> their compute intensive code they would rapidly discover that NumPy
> sort of sucks never really occurs to these people as they are focussed
> on the results not the means of achieving them.
>
> Polyglot Python/D or Python/Chapel with Matplotlib is the way to go. But that really requires a D replacement for Pandas.


Russel, thanks for your thoughts - I appreciate it.

What would a Pandas replacement look like in D?

October 18, 2015
On Thursday, 15 October 2015 at 21:16:18 UTC, Laeeth Isharc wrote:
> On Wednesday, 14 October 2015 at 22:11:56 UTC, data pulverizer wrote:
>> On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
>>> https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
>>> Andrei suggested posting more widely.
>>
>> I am coming at D by way of R, C++, Python etc. so I speak as a statistician who is interested in data science applications.
>
> Welcome...  Looks like we have similar interests.

That's good to know

>> To sit on the deployment side, D needs to grow its big data/noSQL infrastructure for a start, then hook into a whole ecosystem of analytic tools in an easy and straightforward manner. This will take a lot of work!
>
> Indeed.  The dlangscience project managed by John Colvin is very interesting.  It is not a pure stats project, but there will be many shared areas of need.  He has some v interesting ideas, and being able to mix Python and D in a Jupyter notebook is rather nice (you can do this already).

Thanks for bringing my attention to this, this looks interesting.

> Sounds interesting.  Take a look at Colvin's dlang science draft white paper, and see what you would add.  It's a chance to shape things whilst they are still fluid.

Good suggestion.

>> 3. Solid interface to a big data database, that allows a D data table <-> database easily
>
> Which ones do you have in mind for stats?  The different choices seem to serve quite different needs.  And when you say big data, how big do you typically mean ?

What I mean is to start by tapping into current big data technologies. HDFS and Cassandra have C APIs which we can wrap for D.
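
To make the wrapping concrete: binding a C client library to D is mostly a matter of declaring the functions with extern(C) and linking against the library. A sketch below uses a couple of calls from libhdfs's hdfs.h as I remember them; check the actual header before use, since the signatures here are from memory:

```d
// Sketch: hand-written D binding for a C client library (libhdfs
// here). The declarations mirror the C header; no wrapper code
// generator needed for a first pass. Link with -lhdfs.
extern (C)
{
    alias hdfsFS = void*; // opaque handle in the C API

    // Signatures as I recall them from hdfs.h - verify before use.
    hdfsFS hdfsConnect(const(char)* nameNode, ushort port);
    int hdfsDisconnect(hdfsFS fs);
}

void example()
{
    import std.string : toStringz;

    // "default" picks up the configured namenode in libhdfs.
    auto fs = hdfsConnect(toStringz("default"), 0);
    if (fs !is null)
        hdfsDisconnect(fs);
}
```

Cassandra's C/C++ driver could be wrapped the same way; the work is in covering the API surface and testing it, not in any language barrier.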

>> 4. Functional programming: especially around data table and array structures. R's apply(), lapply(), tapply(), plyr and now data.table(,, by = list()) provides powerful tools for data manipulation.
>
> Any thoughts on what the design should look like?

Yes, I think this is easy to implement but still important. The real devil is my point #1, the dynamic data table object.
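
As a sketch of why the functional side seems easy: an R-style lapply over table columns maps naturally onto D's std.algorithm. Here each "column" is just a double[] and a summary function is applied to every column (the mean helper is my own illustration, not a library function):

```d
// lapply-style column operations with std.algorithm: apply a summary
// function to each column of a simple column-oriented table.
import std.algorithm : map, sum;
import std.array : array;

double mean(double[] col)
{
    return col.sum / col.length;
}

void main()
{
    // Two columns of three observations each.
    double[][] columns = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]];

    // lapply analogue: map the summary over the columns.
    auto means = columns.map!mean.array;
    assert(means == [2.0, 20.0]);
}
```

The by-group operations of data.table would need more machinery (hashing rows into groups first), but the per-column application itself is one line.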

>
> To an extent there is a balance between wanting to explore data iteratively (when you don't know where you will end up), and wanting to build a robust process for production.  I have been wondering myself about using LuaJIT to strap together D building blocks for the exploration (and calling it based on a custom console built around Adam Ruppe's terminal).

Sounds interesting

>> 6. Nullable types makes talking about missing data more straightforward and gives you the opportunity to code them into a set value in your analysis. D is streaks ahead of Python here, but this is built into R at a basic level.
>
> So matrices with nullable types within?  Is nan enough for you ?  If not then could be quite expensive if back end is C.

I am not suggesting that we pass nullable matrices to C algorithms - yes, nan is how this is done in practice - but you wouldn't have nans in your matrix at the point of modeling: they'd just propagate and trash your answer. Nullable types are useful in data acquisition and exploration - the more practical side of data handling. I was quite shocked to see them in D, when they are essentially absent from "high level" programming languages like Python. Real data is messy, and having nullable types is useful in processing, storing and summarizing raw data. I put this at #6 because I think it is possible to do practical statistics without them by using notional hacks. Nullables are something that C# and R have, and that Python's pandas has struggled with. The great news is that they are available in D, so we can use them.
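
For readers coming from R: the D facility in question is std.typecons.Nullable, which cleanly distinguishes "missing" from any real value, exactly what you want for raw, messy data before the modeling stage:

```d
// std.typecons.Nullable: a value type that carries an explicit
// "missing" state, D's analogue of R's NA for scalar values.
import std.typecons : Nullable;

void main()
{
    Nullable!double obs;   // starts out null, i.e. missing
    assert(obs.isNull);

    obs = 3.14;            // record an observed value
    assert(!obs.isNull);
    assert(obs.get == 3.14);

    obs.nullify();         // mark it missing again
    assert(obs.isNull);
}
```

A column of Nullable!double would work for acquisition and cleaning; you'd strip or impute the nulls before handing a dense matrix to numeric code.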

>>
>> If D can get points 1, 2, 3 many people would be all over D because it is a fantastic programming language and is wicked fast.
> What do you like best about it ?  And in your own domain, what have the biggest payoffs been in practice?

I am playing with D at the moment. To become useful to me the data table structure is a must. I previously said points 1, 2, and 3 would get data scientists sucked into D. But the data table structure is the seed. A dynamic structure like that in D would catalyze the rest. Everything else is either wrappers, routine and maybe a lot of work but straightforward to implement. The data table structure for me is the real enigma.

The way that R's data types are structured around SEXPs is the key to all of this. I am currently reading through R's internal documentation to get my head around this.

https://cran.r-project.org/doc/manuals/r-release/R-ints.html
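
The relevant observation from that document is that a SEXP is essentially a tagged union: every R object carries a runtime type tag. A rough D analogue for a column store could use std.variant.Variant to hold a vector of any element type - the struct and field names below are purely illustrative, a sketch of the idea rather than a proposed design:

```d
// Sketch of a SEXP-like tagged column: each column stores its data
// behind a runtime-typed Variant, so a table can mix column types
// while staying a single static D type.
import std.variant : Variant;

struct Column
{
    string name;
    Variant data; // holds e.g. int[], double[], string[]
}

void main()
{
    auto age   = Column("age", Variant([21, 35, 42]));
    auto score = Column("score", Variant([0.5, 0.7, 0.9]));

    // Retrieve with the concrete type; a real data table would
    // dispatch on the stored type tag instead of hard-coding it.
    auto ages = age.data.get!(int[]);
    assert(ages[1] == 35);
}
```

The hard part, as with SEXPs, is everything layered on top: dispatch, binding rows/columns, and views that return parts of the table as a table.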

October 18, 2015
On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
> https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
> Andrei suggested posting more widely.

Maybe also interesting: https://docs.google.com/presentation/d/1LO_WI3N-3p2Wp9PDWyv5B6EGFZ8XTOTNJ7Hd40WOUHo/mobilepresent?pli=1&slide=id.g70b0035b2_1_168

October 18, 2015
On Sunday, 18 October 2015 at 12:50:43 UTC, Namespace wrote:
> On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
>> https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
>> Andrei suggested posting more widely.
>
> Maybe also interesting: https://docs.google.com/presentation/d/1LO_WI3N-3p2Wp9PDWyv5B6EGFZ8XTOTNJ7Hd40WOUHo/mobilepresent?pli=1&slide=id.g70b0035b2_1_168

What I got out of that is that someone at Mozilla was writing a push service (stateful connections, which are more demanding than regular HTTP) and found that jitted Python was more suitable than Go for productivity reasons. They then speculate that their own Rust will be better suited than Go for such services in the future, though apparently not yet.


To the poster further up in the thread: turns out that reddit.com is implemented in Python and a little bit of C: https://github.com/reddit/reddit

So there we have it. Python gives higher productivity at the cost of efficiency, but that does not have a significant impact on effectiveness for regular web services that are built to scale.

October 18, 2015
On Sunday, 18 October 2015 at 13:29:50 UTC, Ola Fosheim Grøstad wrote:
> On Sunday, 18 October 2015 at 12:50:43 UTC, Namespace wrote:
>> On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
>>> https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
>>> Andrei suggested posting more widely.
>>
>> Maybe also interesting: https://docs.google.com/presentation/d/1LO_WI3N-3p2Wp9PDWyv5B6EGFZ8XTOTNJ7Hd40WOUHo/mobilepresent?pli=1&slide=id.g70b0035b2_1_168
>
> What I got out of that is that someone at Mozilla were writing a push service (stateful connections, which more demanding than regular http) and found that jitted Python was more suitable than Go for productivity reasons. Then they speculate that their own Rust will be better suited than Go for such services in the future, apparently not yet.

I liked the fact that Python with PyPy is more performant than Go (in contrast to the title "Python is slow") and that goroutines leak.

>
> To the poster further up in the thread: turns out that reddit.com is implemented in Python and a little bit of C: https://github.com/reddit/reddit
>
> So there we have it. Python gives higher productive at the cost of efficiency, but does not have a significant impact on effectiveness, for regular web services that are built to scale.


October 18, 2015
On Sunday, 18 October 2015 at 13:57:40 UTC, Namespace wrote:
> I liked the fact that Python with PyPy is more performant than Go (in contrast to the title "Python is slow") and that Go-Routines leak.

Yes, Python apparently used less memory, which is rather important when you write a service with persistent websocket connections, like a webchat or a game. Old-school stackless coroutines would probably be better than the fibers D and Go use.

An alternative to writing such code in the application is to get persistent connections from "ready-made" server infrastructure (which is probably more reliable anyway). On AppEngine you have something called channels, which basically allows you to send messages to a connected client push-style:

https://cloud.google.com/appengine/docs/python/channel/

As far as I can tell that means the application server can die without losing the connection.

October 18, 2015
On Sunday, 18 October 2015 at 13:29:50 UTC, Ola Fosheim Grøstad wrote:
> On Sunday, 18 October 2015 at 12:50:43 UTC, Namespace wrote:
>> On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
>>> https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
>>> Andrei suggested posting more widely.
>>
>> Maybe also interesting: https://docs.google.com/presentation/d/1LO_WI3N-3p2Wp9PDWyv5B6EGFZ8XTOTNJ7Hd40WOUHo/mobilepresent?pli=1&slide=id.g70b0035b2_1_168
>
> What I got out of that is that someone at Mozilla were writing a push service (stateful connections, which more demanding than regular http) and found that jitted Python was more suitable than Go for productivity reasons. Then they speculate that their own Rust will be better suited than Go for such services in the future, apparently not yet.
>
>
> To the poster further up in the thread: turns out that reddit.com is implemented in Python and a little bit of C: https://github.com/reddit/reddit
>
> So there we have it. Python gives higher productive at the cost of efficiency, but does not have a significant impact on effectiveness, for regular web services that are built to scale.

That's the Pylons guy; he also has many Python libraries for web development. Reddit is built with Pylons, btw, and Pylons is now Pyramid.

I've seen the presentation and I can't stop thinking how it would be if they had chosen D instead of Go.

October 18, 2015
On Sunday, 18 October 2015 at 20:44:44 UTC, Mengu wrote:
> i've seen the presentation and i can't stop thinking how it'd be if they had chosen D instead of Go.

Not much better, probably worse, given that Go has stack protection for fibers and D doesn't. So in Go you can get away with 2K growable stacks; in D you would need a lot more to stay on the safe side.

IIRC he claims that CPython would be fast enough for their application, and that the application was memory-limited rather than computation-limited.

