August 09, 2012
Hello D Users,

The Software Editor for the Journal of Applied Econometrics has agreed to let me write a review of the D programming language for econometricians (econometrics is where economic theory and statistical analysis meet).  I will have only about 6 pages.  I have an idea of what I am going to write about, but I thought I would ask here what features are most relevant (in your minds) to numerical programmers writing codes for statistical inference.

I look forward to your suggestions.

Thanks,

TJB
August 09, 2012
Ok, so IIUC the audience is academic BUT consists of people interested in using D as a means to an end, not computer scientists?  I use D for bioinformatics, which IIUC has similar requirements to econometrics.  From my point of view:

I'd emphasize the following:

Native efficiency.  (Important for large datasets and Monte Carlo simulations.)

Garbage collection.  (Important because it makes it much easier to write non-trivial data structures that don't leak memory, and statistical analyses are a lot easier if the data is structured well.)

Ranges/std.range/builtin arrays and associative arrays.  (Again, these make data handling a pleasure.)

Templates.  (Makes it easier to write algorithms that aren't overly specialized to the data structure they operate on.  This can also be done with OO containers but requires more boilerplate and compromises on efficiency.)

Disclaimer:  The next two are things I'm the primary designer and implementer of.  I intentionally put them last so this doesn't look like a shameless plug.

std.parallelism  (Important because you can easily parallelize your simulation, etc.)

dstats  (https://github.com/dsimcha/dstats  Important because a lot of statistical analysis code is already implemented for you.  It's admittedly very basic compared to e.g. R or Matlab, but it's also in many cases better integrated and more efficient.  I'd say that it has the 15% of the functionality that covers ~70% of use cases.  I welcome contributors to add more stuff to it.  I imagine economists would be interested in time series, which is currently a big area of missing functionality.)
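To give a flavor of how ranges and std.parallelism combine in practice, here is a minimal sketch of a parallel Monte Carlo estimate (the function name, sample count, and per-draw seeding scheme are my own invention, not dstats code):

```d
import std.algorithm : map;
import std.parallelism : taskPool;
import std.random : Random, uniform;
import std.range : iota;

// Monte Carlo estimate of pi: draw points in the unit square and count
// how many land inside the quarter circle.  taskPool.reduce splits the
// index range across all cores automatically.
double estimatePi(size_t n)
{
    auto hit = (size_t i)
    {
        auto rng = Random(cast(uint) i + 1);    // per-draw seed, reproducible
        immutable x = uniform(0.0, 1.0, rng);
        immutable y = uniform(0.0, 1.0, rng);
        return (x * x + y * y <= 1.0) ? 1.0 : 0.0;
    };
    return 4.0 * taskPool.reduce!"a + b"(iota(n).map!hit) / n;
}
```

Swapping taskPool.reduce for std.algorithm.reduce gives you the serial version with no other changes, which is the whole point.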

August 09, 2012
On Thu, 09 Aug 2012 17:57:27 +0200, TJB wrote:

> Hello D Users,
> 
> The Software Editor for the Journal of Applied Econometrics has agreed to let me write a review of the D programming language for econometricians (econometrics is where economic theory and statistical analysis meet).  I will have only about 6 pages.  I have an idea of what I am going to write about, but I thought I would ask here what features are most relevant (in your minds) to numerical programmers writing codes for statistical inference.
> 
> I look forward to your suggestions.
> 
> Thanks,
> 
> TJB

Lazy ranges are a lifesaver when dealing with big data.  E.g. read a large csv file, use filter and map to clean and transform the data, collect stats as you go, then output to a destination file.  The lazy nature of most of the ranges in Phobos means that you don't need to have the data in memory, but you can write simple imperative code just as if it was.
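For instance, a streaming pipeline along those lines might look like this (the file names and the "drop non-positive observations" rule are hypothetical):

```d
import std.algorithm : filter, map;
import std.conv : to;
import std.stdio : File, writeln;
import std.string : strip;

void main()
{
    auto input  = File("prices.csv");       // hypothetical input file
    auto output = File("clean.csv", "w");

    double sum = 0;
    size_t n = 0;

    // byLine is lazy: only one line is ever held in memory,
    // no matter how large the file is.
    foreach (price; input.byLine
                         .map!(line => line.strip.to!double)
                         .filter!(p => p > 0))   // drop bad observations
    {
        sum += price;
        ++n;
        output.writeln(price);
    }
    writeln("mean = ", sum / n);
}
```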
August 09, 2012
On Thursday, 9 August 2012 at 18:20:08 UTC, Justin Whear wrote:
> On Thu, 09 Aug 2012 17:57:27 +0200, TJB wrote:
>
>> Hello D Users,
>> 
>> The Software Editor for the Journal of Applied Econometrics has agreed
>> to let me write a review of the D programming language for
>> econometricians (econometrics is where economic theory and statistical
>> analysis meet).  I will have only about 6 pages.  I have an idea of what
>> I am going to write about, but I thought I would ask here what features
>> are most relevant (in your minds) to numerical programmers writing codes
>> for statistical inference.
>> 
>> I look forward to your suggestions.
>> 
>> Thanks,
>> 
>> TJB
>
> Lazy ranges are a lifesaver when dealing with big data.  E.g. read a
> large csv file, use filter and map to clean and transform the data,
> collect stats as you go, then output to a destination file.  The lazy
> nature of most of the ranges in Phobos means that you don't need to have
> the data in memory, but you can write simple imperative code just as if
> it was.

Ah, the beauty of functional programming and streams.
August 09, 2012
On 8/9/2012 10:40 AM, dsimcha wrote:
> I'd emphasize the following:

I'd like to add to that:

1. Proper support for 80 bit floating point types. Many compilers' libraries have inaccurate 80 bit math functions, or don't implement 80 bit floats at all. 80 bit floats reduce the incidence of creeping roundoff error.

2. Support for SIMD vectors as native types.

3. Floating point values are default initialized to NaN.

4. Correct support for NaN and infinity values.

5. Correct support for unordered operations.

6. Array types do not degenerate into pointer types whenever passed to a function. In other words, array types know their dimension.

7. Array loop operations, i.e.:

    for (size_t i = 0; i < a.length; i++)
        a[i] = b[i] + c;

can be written as:

    a[] = b[] + c;

8. Global data is thread local by default, lessening the risk of unintentional unsynchronized sharing between threads.
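A few of these (items 1, 3, and 7) fit in a short, self-contained snippet:

```d
import std.math : isNaN;

void main()
{
    // Item 1: real is the widest float the hardware supports,
    // the 80-bit x87 extended type on x86.
    static assert(real.mant_dig >= double.mant_dig);

    // Item 3: floating point variables default to NaN, so a forgotten
    // initialization announces itself instead of silently reading as 0.
    double x;
    assert(x.isNaN);

    // Item 7: array loop operations replace the explicit index loop.
    double[] b = [1.0, 2.0, 3.0];
    auto a = new double[b.length];
    a[] = b[] + 10.0;
    assert(a == [11.0, 12.0, 13.0]);
}
```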
August 10, 2012
Walter Bright wrote:
> 3. Floating point values are default initialized to NaN.

This isn't a good feature, IMO. C# handles this much more conveniently with just as much optimization/debugging benefit (arguably more so, because it catches NaN issues at compile-time). In C#:

    class Foo
    {
        float x; // defaults to 0.0f

        void bar()
        {
            float y; // doesn't default
            y++; // ERROR: use of unassigned local variable

            float z = 0.0f;
            z++; // OKAY
        }
    }

This is the same behavior for any local variable, so where in D you need to explicitly initialize variables with '= void' to avoid the assignment cost, C# benefits automatically and catches your NaN mistakes before runtime.
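For reference, the D escape hatch mentioned here looks like this (a minimal sketch):

```d
void main()
{
    double x;           // default-initialized to NaN
    double y = void;    // explicitly uninitialized: the NaN store is skipped
    y = 3.14;           // the programmer takes responsibility for assigning first
}
```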

Sorry, I'm not trying to derail this thread. I just think D has other, much better advertising points than this one.
August 10, 2012
1) I think compile-time function evaluation (CTFE) is a very big plus for people doing calculations.

For example:

ulong fibonacci(ulong n) {
    ulong a = 0, b = 1;
    foreach (_; 0 .. n) { immutable t = a + b; a = b; b = t; }
    return a;
}

enum x = fibonacci(50); // calculated at compile time! runtime cost = 0 !!!

2) It has support for a BigInt structure in its standard library (which is really fast!)
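For example (the factorial demo is my own, not from the standard library docs):

```d
import std.bigint : BigInt, toDecimalString;

void main()
{
    // 50! overflows ulong long before i reaches 50; BigInt just keeps growing.
    auto f = BigInt(1);
    foreach (i; 2 .. 51)
        f *= i;
    assert(toDecimalString(f).length == 65);   // 50! has 65 decimal digits
}
```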
August 10, 2012
On Thursday, 9 August 2012 at 18:35:22 UTC, Walter Bright wrote:
> On 8/9/2012 10:40 AM, dsimcha wrote:
>> I'd emphasize the following:
>
> I'd like to add to that:
>
> 1. Proper support for 80 bit floating point types. Many compilers' libraries have inaccurate 80 bit math functions, or don't implement 80 bit floats at all. 80 bit floats reduce the incidence of creeping roundoff error.

How unique to D is this feature?  Does this imply that things like BLAS and LAPACK, random number generators, statistical distribution functions, and other numerical software should be rewritten in pure D rather than calling out to external C or Fortran codes?

TJB
August 10, 2012
On 8/10/2012 1:38 AM, F i L wrote:
> Walter Bright wrote:
>> 3. Floating point values are default initialized to NaN.
>
> This isn't a good feature, IMO. C# handles this much more conveniently with just
> as much optimization/debugging benefit (arguably more so, because it catches NaN
> issues at compile-time). In C#:
>
>      class Foo
>      {
>          float x; // defaults to 0.0f
>
>          void bar()
>          {
>              float y; // doesn't default
>              y ++; // ERROR: use of unassigned local
>
>              float z = 0.0f;
>              z ++; // OKAY
>          }
>      }
>
> This is the same behavior for any local variable,

It catches only a subset of these at compile time. I can craft any number of ways of getting it to miss diagnosing it. Consider this one:

    float z;
    if (condition1)
         z = 5;
    ... lotsa code ...
    if (condition2)
         z++;

To diagnose this correctly, the static analyzer would have to determine that condition1 produces the same result as condition2, or not. This is impossible to prove. So the static analyzer either gives up and lets it pass, or issues an incorrect diagnostic. So our intrepid programmer is forced to write:

    float z = 0;
    if (condition1)
         z = 5;
    ... lotsa code ...
    if (condition2)
         z++;

Now, as it may turn out, for your algorithm the value "0" is an out-of-range, incorrect value. Not a problem as it is a dead assignment, right?

But then the maintenance programmer comes along and changes condition1 so it is not always the same as condition2, and now the z++ sees the invalid "0" value sometimes, and a silent bug is introduced.

This bug will not remain undetected with the default NaN initialization.


> so where in D you need to
> explicitly set variables to 'void' to avoid assignment costs,

This is incorrect, as the optimizer is perfectly capable of removing dead assignments like:

   f = nan;
   f = 0.0f;

The first assignment is optimized away.

> I just think D has other, much better advertising points than this one.

Whether you agree with it being a good feature or not, it is a feature unique to D and merits discussion when talking about D's suitability for numerical programming.



August 10, 2012
On 8/10/2012 8:31 AM, TJB wrote:
> On Thursday, 9 August 2012 at 18:35:22 UTC, Walter Bright wrote:
>> On 8/9/2012 10:40 AM, dsimcha wrote:
>>> I'd emphasize the following:
>>
>> I'd like to add to that:
>>
>> 1. Proper support for 80 bit floating point types. Many compilers' libraries
>> have inaccurate 80 bit math functions, or don't implement 80 bit floats at
>> all. 80 bit floats reduce the incidence of creeping roundoff error.
>
> How unique to D is this feature?  Does this imply that things like BLAS and
> LAPACK, random number generators, statistical distribution functions, and other
> numerical software should be rewritten in pure D rather than calling out to
> external C or Fortran codes?

I attended a talk given by a physicist a few months ago where he was using C transcendental functions. I pointed out to him that those functions were unreliable, producing wrong bits in a manner that suggested to me that they were internally truncating to double precision.

He expressed astonishment and told me I must be mistaken.

What can I say? I run across this repeatedly, and that's exactly why Phobos (with Don's help) has its own implementations, rather than simply calling the corresponding C ones.

I encourage you to run your own tests, and draw your own conclusions.
