August 11, 2012
Walter Bright wrote:
> I'd rather have 100 easy-to-find bugs than 1 unnoticed one that went out in the field.

That's just the thing, bugs are arguably easier to hunt down when things default to a consistent, usable value. When variables default to zero, I have a guarantee that any propagated NaN bug is _not_ coming from them (directly). With NaN defaults, I only have a guarantee that the value _might_ be coming from said variable.

Then I also have more to be aware of when searching through code, because my ints behave differently than my floats. Arguably, you always have to be aware of this, but at least with explicit assignments to NaN, I know the potential culprits earlier (because they'll have a distinct assignment).

With static analysis warning against local-scope NaN issues, there's really only one situation where defaulting to NaN catches bugs, and that's when you want to guarantee that a member variable is specifically assigned a value (of some kind) during construction. This is a corner case because:

1. It makes no guarantees about what value is actually assigned to the variable, only that it's set to something, which means it's either forgotten in favor of an 'if' statement, or used in combination with one.

2. Because of its singular debugging potential, NaN safeguards are, most often, intentionally put in place (or, in D's case, left in place).

This is why I think such situations should require an explicit assignment to NaN. The "100 easy bugs" you mentioned weren't actually "bugs"; they were times I forgot floats default _differently_. The 10 times where NaN caught legitimate bugs, I would have had to hunt down the mistake either way, and it was trivial to do regardless of the NaN. Even if it wasn't trivial, I could have very easily assigned NaN to questionable variables explicitly.
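
To put that in a minimal sketch (D's actual defaults, plus the explicit opt-in I'm arguing for):

int   i;              // implicitly 0
float f;              // currently: implicitly float.nan
float g = 0.0f;       // how I'd prefer an implicit default to behave
float h = float.nan;  // explicit NaN, used only where the "must be set later" guarantee is wanted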
August 11, 2012
On 8/11/2012 3:01 PM, F i L wrote:
> Walter Bright wrote:
>> I'd rather have 100 easy-to-find bugs than 1 unnoticed one that went out in
>> the field.
>
> That's just the thing, bugs are arguably easier to hunt down when things default
> to a consistent, usable value.

Many, many programming bugs trace back to assumptions that floating point numbers act like ints. There's just no way to avoid knowing and understanding the differences.


> When variables default to zero, I have a
> guarantee that any propagated NaN bug is _not_ coming from them (directly). With
> NaN defaults, I only have a guarantee that the value _might_ be coming from said
> variable.

I don't see why this is a bad thing. The fact is, with NaN you know there is a bug. With 0, you may never realize there is a problem. Andrei wrote me that a program he is working on produces billions of result values; he noticed a few were NaNs, which he traced back to a bug. If the bug had set the float value to 0, there's no way he would have ever noticed the issue.

It's all about daubing bugs with day-glo orange paint so you know there's a problem. Painting them with camo is not the right solution.
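
A minimal sketch of the difference (the computation is only hinted at; the point is that the NaN survives to the output):

import std.math : isNaN;

void main()
{
    auto results = new double[1_000_000]; // every element starts as double.nan

    // ... compute and store results; suppose a bug leaves one element untouched ...

    double sum = 0;
    foreach (r; results)
        sum += r;

    // With NaN defaults, the missed element poisons the sum and the bug is visible:
    assert(isNaN(sum));
    // With 0 defaults, the sum would merely be slightly wrong -- and silent.
}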


August 11, 2012
On 8/11/2012 2:41 PM, bearophile wrote:
> 2) Where the compiler is certain a variable is read before any possible
> initialization, it generates a compile-time error;

This has been suggested repeatedly, but it is in utter conflict with the whole notion of default initialization, which nobody complains about for user-defined types.
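
For instance, a small illustration of what default initialization already means for a user-defined type:

struct Sample
{
    int   id;    // default-initialized to 0
    float value; // default-initialized to float.nan
}

void main()
{
    Sample s;                   // no explicit initialization, and no error
    assert(s.id == 0);
    assert(s.value != s.value); // NaN is the only value not equal to itself
}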
August 11, 2012
On 08/10/2012 06:01 PM, Walter Bright wrote:
> On 8/10/2012 1:38 AM, F i L wrote:
>> Walter Bright wrote:
>>> 3. Floating point values are default initialized to NaN.
>>
>> This isn't a good feature, IMO. C# handles this much more conveniently
>> with just
>> as much optimization/debugging benefit (arguably more so, because it
>> catches NaN
>> issues at compile-time). In C#:
>>
>> class Foo
>> {
>>     float x; // defaults to 0.0f
>>
>>     void bar()
>>     {
>>         float y; // doesn't default
>>         y++;     // ERROR: use of unassigned local
>>
>>         float z = 0.0f;
>>         z++;     // OKAY
>>     }
>> }
>>
>> This is the same behavior for any local variable,
>
> It catches only a subset of these at compile time. I can craft any
> number of ways of getting it to miss diagnosing it. Consider this one:
>
> float z;
> if (condition1)
>     z = 5;
> ... lotsa code ...
> if (condition2)
>     z++;
>
> To diagnose this correctly, the static analyzer would have to determine
> that condition1 produces the same result as condition2, or not. This is
> impossible to prove. So the static analyzer either gives up and lets it
> pass, or issues an incorrect diagnostic. So our intrepid programmer is
> forced to write:
>
> float z = 0;
> if (condition1)
>     z = 5;
> ... lotsa code ...
> if (condition2)
>     z++;
>
> Now, as it may turn out, for your algorithm the value "0" is an
> out-of-range, incorrect value. Not a problem as it is a dead assignment,
> right?
>
> But then the maintenance programmer comes along and changes condition1
> so it is not always the same as condition2, and now the z++ sees the
> invalid "0" value sometimes, and a silent bug is introduced.
>
> This bug will not remain undetected with the default NaN initialization.
>

To address the concern of static analysis being too hard: I wish we could have it, but limit the amount of static analysis that's done. Something like this: the compiler will test branches of if-else statements and switch-case statements, but it will not drop into function calls with ref parameters, nor will it accept initialization in looping constructs (foreach, for, while, etc.). A compiler implementation would be incorrect if it did /too much/ static analysis.
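
For instance, roughly how that limited rule would classify two simple cases (just a sketch; the function names are made up):

float compute(int i) { return i * 0.5f; }

void acceptedCase(bool condition)
{
    float a;
    if (condition) a = 1;
    else           a = 2;
    a++;                 // fine: every branch of the if-else assigns 'a'
}

void rejectedCase(int n)
{
    float b;
    foreach (i; 0 .. n)
        b = compute(i);  // assignment only happens inside a looping construct
    b++;                 // flagged: the analysis doesn't look into loop bodies,
                         // so 'b' may never have been assigned
}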

The example code you give can be implemented with such limited static analysis:

void lotsaCode() {
	... lotsa code ...
}

float z;
if ( condition1 )
{
	z = 5;
	lotsaCode();
	z++;
}
else
{
	lotsaCode();
}

I will, in advance, concede that this does not prevent people from just writing "float z = 0;".  In my dream-world the compiler recognizes a set of common mistake-inducing patterns like the one you mentioned and then prints helpful error messages suggesting alternative design patterns. That way, bugs are prevented and users become better programmers.
August 12, 2012
On Saturday, 11 August 2012 at 23:49:18 UTC, Chad J wrote:
> On 08/10/2012 06:01 PM, Walter Bright wrote:
>> It catches only a subset of these at compile time. I can craft any number of ways of getting it to miss diagnosing it. Consider this one:
>>
>> float z;
>> if (condition1)
>>     z = 5;
>> ... lotsa code ...
>> if (condition2)
>>     z++;
>>
>> To diagnose this correctly, the static analyzer would have to determine that condition1 produces the same result as condition2, or not. This is impossible to prove. So the static analyzer either gives up and lets it pass, or issues an incorrect diagnostic. So our intrepid programmer is forced to write:
>>
>> float z = 0;
>> if (condition1)
>>     z = 5;
>> ... lotsa code ...
>> if (condition2)
>>     z++;
>>
>> Now, as it may turn out, for your algorithm the value "0" is an out-of-range, incorrect value. Not a problem as it is a dead assignment, right?
>>
>> But then the maintenance programmer comes along and changes condition1 so it is not always the same as condition2, and now the z++ sees the invalid "0" value sometimes, and a silent bug is introduced.
>>
>> This bug will not remain undetected with the default NaN initialization.

Let's keep in mind every one of these truths:

1) Programmers are lazy; if you can get away with not initializing something, then you'll avoid it. In C I've failed to initialize variables many times, and the resulting bugs can be difficult to find, where a NaN or equivalent would have quickly exposed them before running with any real data.

2) There are a lot of inexperienced programmers. I worked for a short period of time at a company that did minimal training in a language like Java, where I ended up being seen as an utter genius (compared even to the teachers).

3) Bugs in large environments and/or scenarios are far more difficult, if not impossible, to debug. I've made a program that handles merging of various dialogs (using doubly-linked-list-like structures); I can debug them if there are 100 or fewer to work with, but beyond 100 (and often it's tens of thousands) it becomes such a pain, given the indirection and how the original structure was built, that I refuse based on difficulty vs. end results (plus sanity).

We also need to laugh at our mistakes sometimes, and learn from others. I recommend everyone read a bit from rinkworks if you have the time, and refresh yourselves.

 http://www.rinkworks.com/stupid/cs_programming.shtml
August 12, 2012
Walter Bright wrote:
>> That's just the thing, bugs are arguably easier to hunt down when things default
>> to a consistent, usable value.
>
> Many, many programming bugs trace back to assumptions that floating point numbers act like ints. There's just no way to avoid knowing and understanding the differences.

My point was that the majority of the time there wasn't a bug introduced; the code was written and functioned as expected once I initialized the value to 0. I was only expecting the value to act similarly (in initial value) to its 'int' relative, but received a NaN in the output because I forgot to be explicit.


> I don't see why this is a bad thing. The fact is, with NaN you know there is a bug. With 0, you may never realize there is a problem. Andrei wrote me about the output of a program he is working on having billions of result values, and he noticed a few were NaNs, which he traced back to a bug. If the bug had set the float value to 0, there's no way he would have ever noticed the issue.
>
> It's all about daubing bugs with day-glo orange paint so you know there's a problem. Painting them with camo is not the right solution.

Yes, and this is an excellent argument for using NaN as a debugging practice in general, but I don't see anything in favor of defaulting to NaN. If you don't do some kind of check against code, especially with such large data sets, bugs of various kinds are going to go unchecked regardless.

A bug where an initial data value was accidentally initialized to 0 (by a third party later on, for instance) could be just as easy to miss, or easier if you're expecting a NaN to appear. In fact, an explicit assignment to NaN might discourage a third party from reassigning it without first questioning the original intention. In this situation I imagine best practice would be to write:

float dataValue = float.nan; // MUST BE NaN, DO NOT CHANGE!
                             // set to NaN to ensure is-set.
August 12, 2012
On 8/11/12 7:33 PM, Walter Bright wrote:
[snip]

Allow me to insert an opinion here. This post illustrates quite well how opinionated our community is (for better or worse).

The OP has asked a topical question on a matter that is interesting and may also influence the impact of the language on the larger community. Before long the thread evolved into the familiar pattern of a debate over a minor issue on which reasonable people may disagree and that's unlikely to change. We should instead do our best to give a balanced, high-level view of what D offers for econometrics.

To the OP - here are a few aspects that may deserve interest:

* Modeling power - from what I understand econometrics is modeling-heavy, which is more difficult to address in languages such as Fortran, C, C++, Java, Python, or the likes of Matlab.

* Efficiency - D generates native code for floating point operations and has control over data layout and allocation. Speed of generated code is dependent on the compiler, and the reference compiler (dmd) does a poorer job at it than the gnu-based compiler (gdc).

* Convenience - D is designed to "do what you mean" wherever possible and simplify common programming tasks, numeric or not. That makes the language comfortable to use even by a non-specialist, in particular in conjunction with appropriate libraries.

A few minuses I can think of:

- Maturity and availability of numeric and econometrics libraries is an obvious issue. There are some libraries (e.g. https://github.com/kyllingstad/scid/wiki) maintained and extended through volunteer effort.

- The language's superior modeling power and level of control comes at an increase in complexity compared to languages such as e.g. Python. So the statistician would need a larger upfront investment in order to reap the associated benefits.


Andrei

August 12, 2012
Andrei Alexandrescu:

> - The language's superior modeling power and level of control comes at an increase in complexity compared to languages such as e.g. Python. So the statistician would need a larger upfront investment in order to reap the associated benefits.

Statisticians often use the R language (http://en.wikipedia.org/wiki/R_language ).
Python contains much more "computer science" and CS complexity than R. Not just advanced stuff like coroutines, metaclasses, decorators, Abstract Base Classes, operator overloading, and so on, but even simpler things, like generators and standard library collections such as heaps and deques.
For some statisticians I've seen, even several parts of Python are too hard to use or understand. I have rewritten several of their Python scripts.

Bye,
bearophile
August 12, 2012
On Sunday, 12 August 2012 at 02:28:44 UTC, Andrei Alexandrescu wrote:
> On 8/11/12 7:33 PM, Walter Bright wrote:
> [snip]
>
> Allow me to insert an opinion here. This post illustrates quite well how opinionated our community is (for better or worse).
>
> The OP has asked a topical question on a matter that is interesting and may also influence the impact of the language on the larger community. Before long the thread evolved into the familiar pattern of a debate over a minor issue on which reasonable people may disagree and that's unlikely to change. We should instead do our best to give a balanced, high-level view of what D offers for econometrics.
>
> To the OP - here are a few aspects that may deserve interest:
>
> * Modeling power - from what I understand econometrics is modeling-heavy, which is more difficult to address in languages such as Fortran, C, C++, Java, Python, or the likes of Matlab.
>
> * Efficiency - D generates native code for floating point operations and has control over data layout and allocation. Speed of generated code is dependent on the compiler, and the reference compiler (dmd) does a poorer job at it than the gnu-based compiler (gdc).
>
> * Convenience - D is designed to "do what you mean" wherever possible and simplify common programming tasks, numeric or not. That makes the language comfortable to use even by a non-specialist, in particular in conjunction with appropriate libraries.
>
> A few minuses I can think of:
>
> - Maturity and availability of numeric and econometrics libraries is an obvious issue. There are some libraries (e.g. https://github.com/kyllingstad/scid/wiki) maintained and extended through volunteer effort.
>
> - The language's superior modeling power and level of control comes at an increase in complexity compared to languages such as e.g. Python. So the statistician would need a larger upfront investment in order to reap the associated benefits.
>
>
> Andrei

Andrei,

Thanks for bringing this back to the original topic and for your thoughts.

Indeed, a lot of econometricians are using MATLAB, R, Gauss, Ox and the like. But there are a number of econometricians who need the raw power of a natively compiled language (especially financial econometricians whose data are huge) who typically program in either Fortran or C/C++.  It is really this group that I am trying to reach.  I think D has a lot to offer this group in terms of programmer productivity and reliability of code.  I think this applies to statisticians as well, as I see a lot of them in this latter group too.

I also want to reach the MATLABers because I think they can get a lot more modeling power (I like how you put that) without too much more difficulty (see Ox - nearly as complicated as C++ but without the power).  Many MATLAB and R programmers end up recoding a good part of their algorithms in C++ and calling that code from the interpreted language.  I have always found this kind of mixed language programming to be messy, time consuming, and error prone.  Special tools are cropping up to handle this (see Rcpp).  This just proves to me the usefulness of a productive AND powerful language like D for econometricians!

I am sensitive to the drawbacks you mention (especially the lack of numeric libraries).  I am so sick of wasting my time in C++, though, that I have almost decided to just start writing my own econometric library in D.  Earlier in this thread there was a discussion of extended precision in D and I mentioned the need to recode things like BLAS and LAPACK in D.  Templates in D seem perfect for this problem.  As an expert in template meta-programming, what are your thoughts?  How is this different from what is being done in SciD?  It seems they are mostly concerned with wrapping the old CBLAS and CLAPACK libraries.
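
To make that concrete, here is the sort of thing I have in mind -- a toy, precision-generic, BLAS-style routine (not SciD's actual API, just an illustration of writing the code once for float, double, and real):

import std.traits : isFloatingPoint;

// A level-1 BLAS style dot product, written once and instantiated per precision.
T dot(T)(const T[] x, const T[] y) if (isFloatingPoint!T)
{
    assert(x.length == y.length);
    T sum = 0;
    foreach (i; 0 .. x.length)
        sum += x[i] * y[i];
    return sum;
}

unittest
{
    double[] a = [1.0, 2.0, 3.0];
    double[] b = [4.0, 5.0, 6.0];
    assert(dot(a, b) == 32.0); // the same source also instantiates for float and real
}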

Again, thanks for your thoughts and your TDPL book. Probably the best programming book I've ever read!

TJB
August 12, 2012
On 12.08.2012 02:43, F i L wrote:
> Yes, and this is an excellent argument for using NaN as a
> debugging practice in general, but I don't see anything in favor
> of defaulting to NaN. If you don't do some kind of check against
> code, especially with such large data sets, bugs of various kinds
> are going to go unchecked regardless.
>

It makes absolutely no sense to have different initialization styles in debug and release - and regarding Andrei's example: there are many situations where slow debug code isn't capable of reproducing the error in a human timespan - especially when working with million- or billion-element datasets (like I also do...)