October 15, 2015
On Wednesday, 14 October 2015 at 18:17:29 UTC, Russel Winder wrote:

>
> The thing about Python is NumPy, SciPy, Pandas, Matplotlib, IPython, Jupyter, GNU Radio. The data science, bioinformatics, quant, signal processing, etc. people do not give a sh!t which language they use; what they want is to get their results as fast as possible. Most of them do not write programs that are meant to last, they are effectively throw-away programs. This leads them to Python (or R) and they are not really interested in learning anything else.
>

Scary, but I agree with you again. In science this is exactly what usually happens: throw-away programs, a list here, a loop there, clumsy, inefficient code. And that's fine; in a way that's what scripting is for. The problems start to kick in when the same guys get the idea to go public and write a program that everyone can use. Then you have a mess of slow, undocumented code in a slow language. This is why I always say "Use C, C++ or D from the very beginning", or at least document your code in a way that it can easily be rewritten in D or C. But well, you know: results, papers, conferences... This is why many innovations live in an eternal Matlab or Python limbo.
October 15, 2015
On Thursday, 15 October 2015 at 09:24:52 UTC, Chris wrote:
> Yep. This occurred to me too. Sorry Ola, but I think you don't know how sausages are made.

I most certainly do. I do backend programming, and we also have a farm... :-)

> Do you really think that all the websites out there are performance tuned by network programming specialists? You'd be surprised!

If they are to scale, they have to pick algorithms and architectures that scale. This is commodity knowledge nowadays. You want requests to stay as close to O(1) as possible; that is how you build scalable systems. There is no point in having a 1 ms response time under low load and a 10000 ms response time when the incoming link is saturated.

You'd rather have a 100 ms response time under low load and a 120 ms response time when saturated, plus 99.9999% availability/uptime.

Robustness and scaling cost latency, but you want acceptable and stable QoS, not brilliant QoS under low load and horrible QoS under high load.

Scalable websites aren't designed like sports cars; they are designed like trains.

October 15, 2015
On Thursday, 15 October 2015 at 09:47:56 UTC, Ola Fosheim Grøstad wrote:
> On Thursday, 15 October 2015 at 09:24:52 UTC, Chris wrote:
>> Yep. This occurred to me too. Sorry Ola, but I think you don't know how sausages are made.
>
> I most certainly do. I do backend programming, and we also have a farm... :-)

Well, you know how gourmet sausages are made (100% meat), because you make them yourself apparently. But I was talking about the sausages you get out there ;) A lot of websites are not "planned". They are quickly put together to promote an idea. The code/architecture is not important at that stage. The idea is important. The website has to have dynamic content that can be edited by non-programmers (Not even PHP! HTML at most!). If you designed a website from a programming point of view first, you'd never get the idea out in time.
October 15, 2015
On Thu, 2015-10-15 at 10:00 +0000, Chris via Digitalmars-d-learn wrote:
> 
[…]
> Well, you know how gourmet sausages are made (100% meat), because you make them yourself apparently. But I was talking about the sausages you get out there ;) A lot of websites are not "planned". They are quickly put together to promote an idea. The code/architecture is not important at that stage. The idea is important. The website has to have dynamic content that can be edited by non-programmers (Not even PHP! HTML at most!). If you designed a website from a programming point of view first, you'd never get the idea out in time.

And most commercial websites selling things are truly appalling: slow performance, atrocious usability/UX. Who cares if the site is brilliantly tuned if it is unusable?


-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder



October 15, 2015
On Thu, 2015-10-15 at 09:35 +0000, Ola Fosheim Grøstad via Digitalmars-d-learn wrote:
> On Thursday, 15 October 2015 at 07:57:51 UTC, Russel Winder wrote:
> > lot better than it could be. From small experiments D is (and also Chapel is even more) hugely faster than Python/NumPy at things Python people think NumPy is brilliant for. Expectations
> 
> Have you had a chance to look at PyOpenCL and PYCUDA?

Yes.

CUDA is of course doomed in the long run as Intel put GPGPU on the processor chip. OpenCL will eventually be replaced with Vulkan (assuming they can get the chips made).

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder



October 15, 2015
On Thursday, 15 October 2015 at 07:57:51 UTC, Russel Winder wrote:
> On Thu, 2015-10-15 at 06:48 +0000, data pulverizer via Digitalmars-d-learn wrote:
> Just because D doesn't have this now doesn't mean it cannot. C doesn't have such capability, but R and Python do, even though R and CPython are themselves just C code.

I think the way R does this is that its dynamic runtime environment is used to bind together native C arrays of basic types. I wonder if we could simulate dynamic behaviour by leveraging D's short compilation times to dynamically write/update data table source file(s) containing the structure of new/modified data tables.

> Pandas data structures rely on the NumPy n-dimensional array implementation, it is not beyond the bounds of possibility that that data structure could be realized as a D module.

Julia's DArray object is an interesting take on this: https://github.com/JuliaParallel/DistributedArrays.jl

I believe that parallelism over arrays and over data tables are different challenges. Data tables are easier since we can parallelise by row, hence the preference for row-based tuples. A rough sketch follows below.
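As an illustration only (the column names and the computation are invented, not a proposal for an actual API), std.parallelism already makes this kind of per-row parallelism easy once the rows are independent:

import std.parallelism : parallel;
import std.typecons : Tuple;
import std.stdio : writeln;

alias Row = Tuple!(double, "price", double, "qty");

void main()
{
    auto rows = new Row[1_000];
    foreach (i, ref r; rows) { r.price = i * 0.5; r.qty = 2.0; }

    auto totals = new double[rows.length];

    // Rows are independent, so the work can be split across cores.
    foreach (i, row; parallel(rows))
        totals[i] = row.price * row.qty;

    writeln(totals[0], " ", totals[$ - 1]);
}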

> The core issue is to have a seriously efficient n-dimensional array that is amenable to data parallelism and is extensible. As far as I am aware currently (I will investigate more) the NumPy array is a good native code array, but has some issues with data parallelism and Pandas has to do quite a lot of work to get the extensibility. I wonder how the R data.table works.

R's data.table is not currently parallelised.

> I have this nagging feeling that like NumPy, data.table seems a lot better than it could be. From small experiments D is (and also Chapel is even more) hugely faster than Python/NumPy at things Python people think NumPy is brilliant for. Expectations of Python programmers are set by the scale of Python performance, so NumPy seems brilliant. Compared to the scale set by D and Chapel, NumPy is very disappointing. I bet the same is true of R (I have never really used R).

Thanks for notifying me about Chapel - something else interesting to investigate. When it comes to speed, R is very strange. Basic math operations (e.g. *, +, /) on an R array can be fast, but for-loops will kill speed by hundreds of times - most things are slow in R unless they are directly baked into its base operations. You can, however, write code in C or C++ and call it very easily from R using its Rcpp interface.


> This is therefore an opportunity for D to step in. However it is a journey of a thousand miles to get something production worthy. Python/NumPy/Pandas have had a very large number of programmer hours expended on them.  Doing this poorly as a D module is likely worse than not doing it at all.

I think D has a lot to offer the world of data science.
October 15, 2015
On Thursday, 15 October 2015 at 10:00:21 UTC, Chris wrote:
> about the sausages you get out there ;) A lot of websites are not "planned". They are quickly put together to promote an idea.

They are WordPress sites... :-(

> If you designed a website from a programming point of view first, you'd never get the idea out in time.

It's not that bad, but modelling data for NoSQL databases is a bigger challenge than getting decent performance out of the code.

There is another issue with using languages like Rust/C++/D, and that is: if the process crashes, you lose all the concurrent requests, perhaps even without a reasonable log trace. What I'd want for handling requests is something less fragile, where only the single request that went bad crashes out. Pure Python and Java provide this property.
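For what it's worth, the recoverable half of that isolation can be sketched in plain D (the handler names here are invented; a real framework such as vibe.d supplies its own request/response types). The fragility described above is exactly the part this cannot contain: Errors, failed asserts in release builds and memory corruption still take the whole process down.

import std.stdio : stderr, writeln;

// Hypothetical request handler type; a real framework would
// supply its own request/response objects.
alias Handler = string function(string request);

string handleIsolated(Handler h, string request)
{
    try
    {
        return h(request);
    }
    catch (Exception e)
    {
        // Only this request fails; the process keeps serving others.
        stderr.writefln("request failed: %s", e.msg);
        return "500 Internal Server Error";
    }
}

string handle(string req)
{
    if (req == "bad")
        throw new Exception("boom");
    return "200 OK: " ~ req;
}

void main()
{
    writeln(handleIsolated(&handle, "good")); // 200 OK: good
    writeln(handleIsolated(&handle, "bad"));  // 500 Internal Server Error
}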

October 15, 2015
On Thursday, 15 October 2015 at 10:33:54 UTC, Russel Winder wrote:
>
> CUDA is of course doomed in the long run as Intel put GPGPU on the processor chip. OpenCL will eventually be replaced with Vulkan (assuming they can get the chips made).

I thought Vulkan was meant to replace OpenGL.


October 15, 2015
On Thu, 2015-10-15 at 17:00 +0000, jmh530 via Digitalmars-d-learn wrote:
> On Thursday, 15 October 2015 at 10:33:54 UTC, Russel Winder wrote:
> > 
> > CUDA is of course doomed in the long run as Intel put GPGPU on the processor chip. OpenCL will eventually be replaced with Vulkan (assuming they can get the chips made).
> 
> I thought Vulkan was meant to replace OpenGL.

True, but there is an intent to have Vulkan eventually replace both OpenGL and OpenCL.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder



October 15, 2015
On Wednesday, 14 October 2015 at 22:11:56 UTC, data pulverizer wrote:
> On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
>> https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
>> Andrei suggested posting more widely.
>
> I am coming at D by way of R, C++, Python etc. so I speak as a statistician who is interested in data science applications.

Welcome...  Looks like we have similar interests.

> To sit on the deployment side, D needs to grow its big data/NoSQL infrastructure for a start, then hook into a whole ecosystem of analytic tools in an easy and straightforward manner. This will take a lot of work!

Indeed.  The dlangscience project managed by John Colvin is very interesting.  It is not a pure stats project, but there will be many shared areas of need.  He has some very interesting ideas, and being able to mix Python and D in a Jupyter notebook is rather nice (you can do this already).
>
> I believe it is easier and more effective to start on the research side. D will need:
>
> 1. A data table structure like R's data.frame or data.table. This is a dynamic data structure that represents a table that can have lots of operations applied to it. It is the data structure that separates R from most programming languages. It is what pandas tries to emulate. This includes text file and database i/o from mySQL and ODBC for a start.

I fully agree, and have made a very simple start on this.  See github. It's usable for my needs as they stand, although far from production ready or elegant.  You can read and write to/from CSV and HDF5.  I guess MySQL and ODBC wouldn't be hard to add, but I don't need them myself for now and won't have time to add them.  If I have space I may channel some resources in that direction some time next year.
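A deliberately naive sketch of the general shape (not the code referred to above) could be a set of named columns stored as Variant arrays; a real implementation needs typed columns, views, joins and efficient grouping:

import std.variant : Variant;
import std.algorithm : map;
import std.array : array;
import std.stdio : writeln;

// Toy column-oriented table: one Variant array per column.
struct DataTable
{
    string[] names;
    Variant[][] columns;

    void addColumn(T)(string name, T[] values)
    {
        names ~= name;
        columns ~= values.map!(v => Variant(v)).array;
    }

    size_t nrow() const
    {
        return columns.length ? columns[0].length : 0;
    }
}

void main()
{
    DataTable dt;
    dt.addColumn("id", [1, 2, 3]);
    dt.addColumn("score", [0.5, 0.9, 0.1]);
    writeln(dt.names, ", rows: ", dt.nrow);  // ["id", "score"], rows: 3
}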

> 2. Formula class : the ability to talk about statistical models using formulas e.g. y ~ x1 + x2 + x3 etc and then use these formulas to generate model matrices for input into statistical algorithms.

Sounds interesting.  Take a look at Colvin's dlangscience draft white paper, and see what you would add.  It's a chance to shape things whilst they are still fluid.
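To make the idea concrete, here is a toy sketch that only splits an R-style formula string into a response and a list of terms; interactions, transformations and the actual model-matrix construction are all ignored:

import std.algorithm : splitter, map;
import std.array : array, split;
import std.string : strip;
import std.stdio : writeln;

struct Formula
{
    string response;
    string[] terms;
}

Formula parseFormula(string s)
{
    auto parts = s.split("~");
    assert(parts.length == 2, "expected something like y ~ x1 + x2");
    return Formula(parts[0].strip,
                   parts[1].splitter("+").map!(t => t.strip).array);
}

void main()
{
    auto f = parseFormula("y ~ x1 + x2 + x3");
    writeln(f.response, " ", f.terms);  // y ["x1", "x2", "x3"]
}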

> 3. Solid interface to a big data database, that allows a D data table <-> database easily

Which ones do you have in mind for stats?  The different choices seem to serve quite different needs.  And when you say big data, how big do you typically mean ?

> 4. Functional programming: especially around data table and array structures. R's apply(), lapply(), tapply(), plyr and now data.table(,, by = list()) provide powerful tools for data manipulation.

Any thoughts on what the design should look like?

To an extent there is a balance between wanting to explore data iteratively (when you don't know where you will end up) and wanting to build a robust process for production.  I have been wondering myself about using LuaJIT to strap together D building blocks for the exploration (driving it from a custom console built around Adam Ruppe's terminal).
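For the grouped-aggregation part, a rough analogue of R's tapply(values, groups, sum) over plain arrays (no real data table type yet; the names are invented) might look like this with std.algorithm:

import std.algorithm : map, sort, chunkBy, sum;
import std.array : array;
import std.range : iota;
import std.typecons : tuple;
import std.stdio : writeln;

void main()
{
    auto groups = ["a", "b", "a", "b", "a"];
    auto values = [1.0, 2.0, 3.0, 4.0, 5.0];

    // Pair each value with its group key and sort by key,
    // so chunkBy can group adjacent equal keys.
    auto rows = iota(groups.length)
        .map!(i => tuple(groups[i], values[i]))
        .array;
    rows.sort!((x, y) => x[0] < y[0]);

    // Aggregate each group, like tapply(values, groups, sum) in R.
    auto totals = rows
        .chunkBy!((x, y) => x[0] == y[0])
        .map!(c => tuple(c.front[0], c.map!(r => r[1]).sum));

    foreach (t; totals)
        writeln(t[0], ": ", t[1]);  // a: 9, b: 6
}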
>
> 5. A factor data type: for categorical variables. This is easy to implement! This ties into the creation of model matrices.
>
> 6. Nullable types makes talking about missing data more straightforward and gives you the opportunity to code them into a set value in your analysis. D is streaks ahead of Python here, but this is built into R at a basic level.

So matrices with nullable types within?  Is NaN enough for you?  If not, it could be quite expensive if the back end is C.
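A minimal sketch of the Nullable route, as opposed to overloading NaN (the column here is invented):

import std.typecons : Nullable, nullable;
import std.algorithm : filter, map, sum;
import std.array : array;
import std.stdio : writeln;

void main()
{
    // One observation is missing, rather than being coded as NaN.
    Nullable!double[] height =
        [nullable(1.7), Nullable!double.init, nullable(1.8)];

    auto present = height.filter!(h => !h.isNull)
                         .map!(h => h.get)
                         .array;

    writeln(present.sum / present.length);  // mean of non-missing: 1.75
}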
>
> If D can get points 1, 2, 3 many people would be all over D because it is a fantastic programming language and is wicked fast.
What do you like best about it?  And in your own domain, what have the biggest payoffs been in practice?