dstats updated for new Phobos

April 24, 2009
Posted by dsimcha
My statistics library for D, dstats (http://dsource.org/projects/dstats), has been updated to take advantage of the concepts introduced in the new Phobos. It has also received several smaller miscellaneous updates:

1.  Instead of being specific to arrays, all functions now accept the most general range type feasible.

2.  StackHash and StackSet now "officially" exist in dstats.alloc and are documented.  These are a hash table and a hash set that use TempAlloc (a stack-based allocator), leading to some impressive speedups when a hash table or set is used within a function as part of an algorithm.  Note that TempAlloc, etc. are included with dstats because they are used heavily internally and have been co-evolving with the needs of dstats.  For a while, I tried to keep them as nominally separate libraries, but given how much TempAlloc evolves according to the needs of dstats, this might not be the best idea.

3.  Mean, standard deviation, variance, skewness, and kurtosis can now be calculated via an output range interface (in addition to the obvious input range interface):

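import std.stdio : writeln;

// Take the output range by ref.  Assuming OnlineSummary is a struct (it is
// declared without new), passing it by value would fill a copy and leave
// the caller's s empty.
void outputFloats(O)(ref O someOutputRange) {
    // Output a bunch of floats to an output range.
}

OnlineSummary s;
outputFloats(s);
writeln(s.kurtosis);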

4.  The information theory module has been reworked.  Rather than making the functions directly variadic, joint distributions are handled via the Joint struct.  This eliminates ambiguities and allows things like conditional entropies involving joint distributions to be calculated.

5.  All modules that were written exclusively by me have been relicensed from BSD-only to a dual Phobos/BSD license, to satisfy both the Tango and Phobos people.  Since the only code I have borrowed comes from MathExtra/Don Clugston's Tango modules, with his permission, I will probably be able to put these modules under the dual license also.

6.  Miscellaneous bug fixes.
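As a rough illustration of points 1 and 3 in the list above, here is a minimal sketch of a summary accumulator that can be driven either through an input range or through an output range style put call. It uses only Phobos and is not the dstats implementation: RunningStats and summarize are hypothetical names made up for this example, and the update rule is the standard Welford online formula rather than whatever dstats does internally.

```d
import std.algorithm : map;
import std.math : isClose;
import std.range : ElementType, iota, isInputRange;

// Hypothetical accumulator (NOT the dstats type): Welford's online update.
struct RunningStats
{
    private size_t n;       // number of observations seen so far
    private double m = 0;   // running mean
    private double s = 0;   // running sum of squared deviations from the mean

    // Output range primitive: consume one observation.
    void put(double x)
    {
        ++n;
        immutable delta = x - m;
        m += delta / n;
        s += delta * (x - m);
    }

    double mean() const { return n ? m : double.nan; }

    // Unbiased sample variance; undefined for fewer than two observations.
    double variance() const { return n > 1 ? s / (n - 1) : double.nan; }
}

// Input range interface: accepts any input range whose elements
// implicitly convert to double, not just arrays.
RunningStats summarize(R)(R data)
if (isInputRange!R && is(ElementType!R : double))
{
    RunningStats acc;
    foreach (x; data)
        acc.put(x);
    return acc;
}

void main()
{
    // Works on an array...
    auto a = summarize([1.0, 2.0, 3.0, 4.0]);
    // ...and on a lazy range with no backing array.
    auto b = summarize(iota(1, 5).map!(i => cast(double) i));
    assert(isClose(a.mean, 2.5) && isClose(b.mean, 2.5));
    assert(isClose(a.variance, 5.0 / 3.0));

    // Output range style: push values with put, then query.
    RunningStats c;
    foreach (x; [1.0, 2.0, 3.0, 4.0])
        c.put(x);
    assert(isClose(c.mean, 2.5));
}
```

The template constraint is what makes the function range-generic: anything that satisfies isInputRange is accepted, so callers never have to materialize an array first.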

Now that D2 is getting relatively close to complete, I'm open to suggestions about setting up some kind of user-friendly installation system for dstats. The current system is "figure it out yourself".  Assume I know nothing about this topic.

Also, since scientific computing seems to be a big potential killer app for D, here are some things that I need help with for dstats, if you want to contribute:

1.  Make the random number generators faster.  They're currently "better than nothing grade", i.e. they are correct, but that's about it.  I included them because I haven't gotten around to reading up on how to do a better job and there is currently no other way AFAIK in D2 to generate random samples from these distributions at all.

2.  Many of the p-value calculations for non-parametric tests use asymptotic approximations.  Implementing reasonably efficient exact calculations requires very good knowledge of dynamic programming.  Currently, I need exact calculations for Kendall's Tau, Spearman's Rho, the Kolmogorov-Smirnov test, and the runs test.

3.  Bug reports would be appreciated.  Also, feedback on the API before it hardens too much would be nice.

4.  Long term, once 1-d statistical stuff is stabilized, it would probably be good to add support for higher dimensional statistics.
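To give a flavor of the dynamic programming mentioned in item 2 above, here is a minimal sketch for the simplest case, Kendall's Tau with no ties. Under the null hypothesis the number of discordant pairs D is distributed like the inversion count of a uniformly random permutation, and tau = 1 - 4D / (n(n-1)), so an exact tail probability follows from counting permutations by inversions. The function names inversionCounts and exactLowerTail are hypothetical, ties are not handled, and this is not code from dstats.

```d
import std.math : isClose;

// counts[k] = number of permutations of n items with exactly k inversions.
// Stored as double to sidestep integer overflow; the values are exactly
// representable while n! < 2^53 (n <= 18) and approximate beyond that.
double[] inversionCounts(size_t n)
{
    double[] counts = [1.0];   // n = 0 or 1: one permutation, zero inversions
    foreach (m; 2 .. n + 1)
    {
        // Inserting the m-th item adds between 0 and m-1 new inversions.
        auto next = new double[counts.length + m - 1];
        next[] = 0;
        foreach (k, c; counts)
            foreach (j; 0 .. m)
                next[k + j] += c;
        counts = next;
    }
    return counts;
}

// Exact lower-tail probability P(D <= discordant) for n untied observations.
double exactLowerTail(size_t n, size_t discordant)
{
    auto counts = inversionCounts(n);
    double total = 0, tail = 0;
    foreach (k, c; counts)
    {
        total += c;
        if (k <= discordant)
            tail += c;
    }
    return tail / total;
}

void main()
{
    // n = 3: the 6 permutations split as 1, 2, 2, 1 by inversion count.
    assert(inversionCounts(3) == [1.0, 2.0, 2.0, 1.0]);
    // Only the identity permutation of 4 items has zero inversions.
    assert(isClose(exactLowerTail(4, 0), 1.0 / 24.0));
    assert(isClose(exactLowerTail(3, 1), 0.5));
}
```

As written, the table build is cubic in n; replacing the inner loop with a sliding-window sum would bring it down to quadratic. Handling ties, and the other tests listed above, needs more elaborate recurrences than this sketch covers.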