Statistics library

October 23, 2008

Statistics library

Posted by dsimcha

Since there's really no good comprehensive statistics library for D (Tango has a little, and a few projects on dsource have made a start, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to make it production quality or to merge it with other people's efforts in this area. The following functionality is currently available:

Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation uses a very efficient O(N log N) algorithm.

Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.

Shannon entropy, mutual information.

Kolmogorov-Smirnov tests.

Binomial, hypergeometric, normal, Poisson, and Kolmogorov CDFs; hypergeometric, Poisson, and binomial PDFs.

Inverse normal distribution, and normally distributed random number generation.

A struct to generate all possible permutations of a sequence.
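
To illustrate the O(N log N) Kendall tau mentioned in the list above, here is a minimal Python sketch. It is not the library's D code, which does not appear in this thread. It assumes there are no tied values, sorts the pairs by x, and counts discordant pairs as inversions in y using a merge sort. The function names `count_inversions` and `kendall_tau` are hypothetical.

```python
def count_inversions(values):
    """Return (sorted_values, inversion_count) via merge sort in O(N log N)."""
    n = len(values)
    if n <= 1:
        return list(values), 0
    mid = n // 2
    left, inv_left = count_inversions(values[:mid])
    right, inv_right = count_inversions(values[mid:])
    merged = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every element still waiting in left,
            # so it forms one inversion with each of them.
            inversions += len(left) - i
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def kendall_tau(x, y):
    """Kendall tau-a for paired samples, assuming no ties in x or y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two pairs")
    # After sorting by x, each inversion in y is one discordant pair.
    y_by_x = [yi for _, yi in sorted(zip(x, y))]
    _, discordant = count_inversions(y_by_x)
    total_pairs = n * (n - 1) // 2
    # tau = (concordant - discordant) / total = 1 - 2 * discordant / total
    return 1.0 - 2.0 * discordant / total_pairs
```

A naive implementation compares every pair of observations and takes O(N^2) time; handling ties correctly needs extra bookkeeping that this sketch leaves out.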

On the other hand, I'm a scientist, not a full-time programmer, and although I can write working code, I have no clue what it takes to get code up to the gold standard of "production." Also, this library is very D2-dependent, and I have no interest in back-porting it. Of course if by some chance someone else wanted to back-port it, they'd be more than welcome.

Most of the code is covered somehow or another by unit tests, although I cheated a lot by having some unit tests depend on multiple functions.
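
One common way a unit test ends up depending on multiple functions is a round-trip check, for example confirming that an inverse normal CDF undoes the normal CDF. The Python sketch below shows the idea, using the standard library's `statistics.NormalDist` as a stand-in for the library's own functions, which are not shown in this thread. The helper names are hypothetical.

```python
from statistics import NormalDist


def round_trip_error(x, mu=0.0, sigma=1.0):
    """Absolute error after applying the CDF and then the inverse CDF."""
    dist = NormalDist(mu, sigma)
    return abs(dist.inv_cdf(dist.cdf(x)) - x)


def test_normal_round_trip():
    # One test exercises two functions at once: cdf and inv_cdf.
    for x in (-2.5, -1.0, 0.0, 0.5, 2.0):
        assert round_trip_error(x) < 1e-9
```

The trade-off is that a failure here does not say which of the two functions is wrong, and two errors that cancel each other would go unnoticed.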

Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library? Other comments?

October 23, 2008

Re: Statistics library

Posted by bearophile
in reply to dsimcha

dsimcha, I think the struct to generate permutations is out of place there, and would fit better in a module like my comb (combinatorics) module.

Besides that detail, I like the idea of having a standard module with basic statistics, so I am interested :-)

Bye,
bearophile

October 23, 2008

Re: Statistics library

Posted by BCS
in reply to dsimcha

Reply to dsimcha,

> Since there's really no good comprehensive statistics library for D
> (Tango has a little bit, the beginnings of a few are on dsource, but
> nothing much), I've been rolling my own statistics functions as
> necessary.  Almost by accident, it seems like I've built up the
> beginnings of a decent statistics library.  I'm debating whether it
> might be interesting enough to people to be worth releasing, and
> whether enough community help would be available to really make it
> production quality, or to merge it with other people's efforts in this
> area.

Well, for starters, just ask and I'll get you access to put it on scrapple. That is, if you don't want to go to the trouble of having your own project (it's not much trouble, BTW).

October 23, 2008

Re: Statistics library

Posted by Andrei Alexandrescu
in reply to dsimcha

dsimcha wrote:
> Since there's really no good comprehensive statistics library for D (Tango has
> a little bit, the beginnings of a few are on dsource, but nothing much), I've
> been rolling my own statistics functions as necessary.  Almost by accident, it
> seems like I've built up the beginnings of a decent statistics library.  I'm
> debating whether it might be interesting enough to people to be worth
> releasing, and whether enough community help would be available to really make
> it production quality, or to merge it with other people's efforts in this
> area.  The following functionality is currently available:
> 
> Correlation (Pearson, Spearman rho, Kendall tau).  Note that the Kendall
> tau correlation is a very efficient O(N log N) version.
> 
> Mean, standard deviation, variance, kurtosis, percent variance for arrays of
> numeric values.
> 
> Shannon entropy, mutual information.
> 
> Kolmogorov-Smirnov tests
> 
> Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric,
> Poisson, binomial PDFs.
> 
> Inverse normal distribution, and normally distributed random number generation.
> 
> A struct to generate all possible permutations of a sequence.
> 
> 
> On the other hand, I'm a scientist, not a full-time programmer, and although I
> can write working code, I have no clue what it takes to get code up to the
> gold standard of "production."  Also, this library is very D2-dependent, and I
> have no interest in back-porting it.  Of course if by some chance someone else
> wanted to back-port it, they'd be more than welcome.
> 
> Most of the code is covered somehow or another by unit tests, although I
> cheated a lot by having some unit tests depend on multiple functions.
> 
> Is there any interest in this from others in the D community?  Do other people
> think that D would benefit from having a decent statistics library?  Other
> comments?

If the community is interested, I'd be glad to take over your code and put it in Phobos.

Andrei

October 24, 2008

Re: Statistics library

Posted by Bill Baxter
in reply to dsimcha

On Fri, Oct 24, 2008 at 7:43 AM, dsimcha <dsimcha@yahoo.com> wrote:
> Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary.  Almost by accident, it seems like I've built up the beginnings of a decent statistics library.  I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.  The following functionality is currently available:
>
> Correlation (Pearson, Spearman rho, Kendall tau).  Note that the Kendall
> tau correlation is a very efficient O(N log N) version.
>
> Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
>
> Shannon entropy, mutual information.
>
> Kolmogorov-Smirnov tests
>
> Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric, Poisson, binomial PDFs.
>
> Inverse normal distribution, and normally distributed random number generation.
>
> A struct to generate all possible permutations of a sequence.


I don't know what a lot of those things are, but statistics to me means you will probably have (or eventually want) things like covariance which are best represented as matrices.  Does your package also have a matrix library?
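
As a concrete picture of why covariance is naturally a matrix, here is a minimal Python sketch that builds the sample covariance matrix for a set of multi-dimensional observations. It is an illustration only, not part of the library under discussion, and the function name `covariance_matrix` is hypothetical.

```python
def covariance_matrix(rows):
    """Sample covariance matrix for observations given as equal-length rows."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two observations")
    dim = len(rows[0])
    # Mean of each variable (column).
    means = [sum(row[k] for row in rows) / n for k in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for row in rows:
        centered = [row[k] - means[k] for k in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += centered[i] * centered[j]
    # Divide by n - 1 for the unbiased sample estimate.
    return [[total / (n - 1) for total in cov_row] for cov_row in cov]
```

Entry (i, j) holds the covariance between variables i and j, so the diagonal holds the variances and the matrix is symmetric.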

--bb

October 24, 2008

Re: Statistics library

Posted by BCS
in reply to Andrei Alexandrescu

Reply to Andrei,

> dsimcha wrote:
> 
>> Since there's really no good comprehensive statistics library for D
>> (Tango has a little bit, the beginnings of a few are on dsource, but
>> nothing much), I've been rolling my own statistics functions as
>> necessary.  Almost by accident, it seems like I've built up the
>> beginnings of a decent statistics library.  I'm debating whether it
>> might be interesting enough to people to be worth releasing, and
>> whether enough community help would be available to really make it
>> production quality, or to merge it with other people's efforts in
>> this area.  The following functionality is currently available:
>> 
>> Correlation (Pearson, Spearman rho, Kendall tau).  Note that the
>> Kendall tau correlation is a very efficient O(N log N) version.
>> 
>> Mean, standard deviation, variance, kurtosis, percent variance for
>> arrays of numeric values.
>> 
>> Shannon entropy, mutual information.
>> 
>> Kolmogorov-Smirnov tests
>> 
>> Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs,
>> hypergeometric, Poisson, binomial PDFs.
>> 
>> Inverse normal distribution, and normally distributed random number
>> generation.
>> 
>> A struct to generate all possible permutations of a sequence.
>> 
>> On the other hand, I'm a scientist, not a full-time programmer, and
>> although I can write working code, I have no clue what it takes to
>> get code up to the gold standard of "production."  Also, this library
>> is very D2-dependent, and I have no interest in back-porting it.  Of
>> course if by some chance someone else wanted to back-port it, they'd
>> be more than welcome.
>> 
>> Most of the code is covered somehow or another by unit tests,
>> although I cheated a lot by having some unit tests depend on multiple
>> functions.
>> 
>> Is there any interest in this from others in the D community?  Do
>> other people think that D would benefit from having a decent
>> statistics library?  Other comments?
>> 
> If the community is interested, I'd be glad to take over your code and
> put it in Phobos.
> 
> Andrei
> 

Even better would be getting it into both Phobos and Tango. It shouldn't be hard, as I can't imagine it depends on much.

October 24, 2008

Re: Statistics library

Posted by dsimcha
in reply to Bill Baxter

== Quote from Bill Baxter (wbaxter@gmail.com)'s article
> I don't know what a lot of those things are, but statistics to me
> means you will probably have (or eventually want) things like
> covariance which are best represented as matrices.  Does your package
> also have a matrix library?
> --bb

No, it doesn't have a matrix library right now. I make no claim that it is in any way complete, but I do think it has some pretty useful stuff that's not likely to be available anywhere else for D.

October 24, 2008

Re: Statistics library

Posted by Bill Baxter
in reply to dsimcha

On Fri, Oct 24, 2008 at 9:39 AM, dsimcha <dsimcha@yahoo.com> wrote:
> == Quote from Bill Baxter (wbaxter@gmail.com)'s article
>> I don't know what a lot of those things are, but statistics to me
>> means you will probably have (or eventually want) things like
>> covariance which are best represented as matrices.  Does your package
>> also have a matrix library?
>> --bb
>
> No, it doesn't have a matrix library right now.  I make no claim that it is in any way complete right now, but I do think it has some pretty useful stuff that's not likely to be anywhere else for D.

Ok, so it's mainly for 1d statistics then?

--bb

October 24, 2008

Re: Statistics library

Posted by dsimcha
in reply to BCS

== Quote from BCS (ao@pathlink.com)'s article
> Even better would be getting it in both Phobos and Tango. Shouldn't be hard as I can't think it should depend on much.

First, Tango needs to be ported to D2 (I realize that this is happening) or my code needs to be ported to D1. Anyhow, here are the dependencies:

Non-trivial, i.e. used in several places: std.math, std.traits, std.functional, and some custom sorting functions I wrote, which could just be included.

Trivial, i.e. used in only one or two small places, and I'm pretty sure Tango has a drop-in replacement: std.bigint (for factorial, although all functions that actually use a factorial are calculated in log space, and therefore don't depend on this), std.algorithm (for swap, isSorted), std.random.
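
For readers unfamiliar with computing factorial-based quantities in log space, here is a minimal Python sketch of the general technique, shown for the binomial distribution. It is an assumption about the approach, not the library's actual D code, and the function names are hypothetical. It uses the log-gamma function, since log(n!) = lgamma(n + 1), so no big-integer factorial is needed.

```python
import math


def log_factorial(n):
    """log(n!) via the log-gamma function, avoiding a huge integer factorial."""
    return math.lgamma(n + 1)


def log_binomial_coefficient(n, k):
    """log of n choose k, computed entirely in log space."""
    return log_factorial(n) - log_factorial(k) - log_factorial(n - k)


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space."""
    if not 0 <= k <= n:
        return 0.0
    # Handle the edge probabilities separately to avoid log(0).
    if p == 0.0:
        return 1.0 if k == 0 else 0.0
    if p == 1.0:
        return 1.0 if k == n else 0.0
    log_prob = (log_binomial_coefficient(n, k)
                + k * math.log(p)
                + (n - k) * math.log1p(-p))
    return math.exp(log_prob)
```

Working with logarithms keeps every intermediate value within floating-point range even when n is in the thousands, where n! itself is far too large for a double.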

October 24, 2008

Re: Statistics library

Posted by dsimcha
in reply to Bill Baxter

== Quote from Bill Baxter (wbaxter@gmail.com)'s article
> Ok, so it's mainly for 1d statistics then?
> --bb

Right.
