August 11, 2012
Re: Which D features to emphasize for academic review article
Walter Bright wrote:
> I'd rather have a 100 easy to find bugs than 1 unnoticed one 
> that went out in the field.

That's just the thing: bugs are arguably easier to hunt down when 
things default to a consistent, usable value. When variables 
default to zero, I have a guarantee that any propagated NaN bug 
is _not_ coming from them (directly). With NaN defaults, I only 
have a guarantee that the value _might_ be coming from said 
variable.

Then I also have more to be aware of when searching through 
code, because my ints behave differently than my floats. 
Arguably, you always have to be aware of this, but at least with 
explicit assignments to NaN, I know the potential culprits 
earlier (because they'll have distinct assignments).
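
To make the asymmetry concrete, here's a minimal D sketch of the 
defaults I'm talking about (nothing hypothetical, these are just 
the language's default-initialization rules):

    void main()
    {
        int i;    // defaults to 0
        float f;  // defaults to float.nan
        assert(i == 0);
        assert(f != f);  // NaN is the only value unequal to itself
    }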

With static analysis warning against local-scope NaN issues, 
there's really only one situation where setting to NaN catches 
bugs, and that's when you want to guarantee that a member 
variable is specifically assigned a value (of some kind) during 
construction. This is a corner-case situation because:

1. It makes no guarantees about what value is actually assigned 
to the variable, only that it's set to something. Which means 
it's either forgotten in favor of an 'if' statement, or used in 
combination with one.

2. Because of its singular debugging potential, NaN safeguards 
are, most often, intentionally put in place (or, in D's case, 
left in place).

This is why I think such situations should require an explicit 
assignment to NaN. The "100 easy bugs" you mentioned weren't 
actually "bugs"; they were times I forgot floats defaulted 
_differently_. The 10 times where NaN caught legitimate bugs, I 
would have had to hunt down the mistake either way, and it was 
trivial to do regardless of the NaN. Even if it wasn't trivial, 
I could have very easily assigned NaN to questionable variables 
explicitly.
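
To make the "explicit assignment" suggestion concrete, here's a 
rough D sketch of opting in to the safeguard only where it buys 
you something (the set-during-construction case above):

    import std.math : isNaN;

    struct Reading
    {
        double value = double.nan; // explicit: must be set in ctor

        this(double v) { value = v; }
    }

    void main()
    {
        auto r = Reading(42.0);
        assert(!isNaN(r.value)); // the is-set check this enables
    }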
August 11, 2012
Re: Which D features to emphasize for academic review article
On 8/11/2012 3:01 PM, F i L wrote:
> Walter Bright wrote:
>> I'd rather have a 100 easy to find bugs than 1 unnoticed one that went out in
>> the field.
>
> That's just the thing: bugs are arguably easier to hunt down when things default
> to a consistent, usable value.

Many, many programming bugs trace back to assumptions that floating point 
numbers act like ints. There's just no way to avoid knowing and understanding 
the differences.


> When variables default to zero, I have a
> guarantee that any propagated NaN bug is _not_ coming from them (directly). With
> NaN defaults, I only have a guarantee that the value _might_ be coming from said
> variable.

I don't see why this is a bad thing. The fact is, with NaN you know there is a 
bug. With 0, you may never realize there is a problem. Andrei wrote me that the 
output of a program he is working on has billions of result values, and he 
noticed a few were NaNs, which he traced back to a bug. If the bug had set the 
float value to 0, there's no way he would have ever noticed the issue.

It's all about daubing bugs with day-glo orange paint so you know there's a 
problem. Painting them with camo is not the right solution.
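
As a sketch of that day-glo effect in D (the forgotten assignment here is 
contrived, but the propagation is exactly what happens):

    double scaled(double x)
    {
        double scale;     // bug: never assigned; defaults to double.nan
        return x * scale; // the NaN propagates through every operation
    }

    void main()
    {
        import std.stdio : writeln;
        writeln(scaled(3.0)); // prints "nan" - visible in the output
    }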
August 11, 2012
Re: Which D features to emphasize for academic review article
On 8/11/2012 2:41 PM, bearophile wrote:
> 2) Where the compiler is certain a variable is read before any possible
> initialization, it generates a compile-time error;

This has been suggested repeatedly, but it is in utter conflict with the whole 
notion of default initialization, which nobody complains about for user-defined 
types.
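
A minimal D illustration of that parallel - a user-defined type is default 
initialized just like a float is, and nobody expects a compile-time error 
for reading it:

    struct Vec { float x, y; } // both members default to float.nan

    void main()
    {
        Vec v;              // no "read before initialization" error here
        assert(v.x != v.x); // v is Vec.init, so x and y are NaN
    }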
August 11, 2012
Re: Which D features to emphasize for academic review article
On 08/10/2012 06:01 PM, Walter Bright wrote:
> On 8/10/2012 1:38 AM, F i L wrote:
>> Walter Bright wrote:
>>> 3. Floating point values are default initialized to NaN.
>>
>> This isn't a good feature, IMO. C# handles this much more conveniently
>> with just
>> as much optimization/debugging benefit (arguably more so, because it
>> catches NaN
>> issues at compile-time). In C#:
>>
>> class Foo
>> {
>>     float x;     // defaults to 0.0f
>>
>>     void bar()
>>     {
>>         float y; // doesn't default
>>         y++;     // ERROR: use of unassigned local
>>
>>         float z = 0.0f;
>>         z++;     // OKAY
>>     }
>> }
>>
>> This is the same behavior for any local variable,
>
> It catches only a subset of these at compile time. I can craft any
> number of ways of getting it to miss diagnosing it. Consider this one:
>
> float z;
> if (condition1)
>     z = 5;
> ... lotsa code ...
> if (condition2)
>     z++;
>
> To diagnose this correctly, the static analyzer would have to determine
> that condition1 produces the same result as condition2, or not. This is
> impossible to prove. So the static analyzer either gives up and lets it
> pass, or issues an incorrect diagnostic. So our intrepid programmer is
> forced to write:
>
> float z = 0;
> if (condition1)
>     z = 5;
> ... lotsa code ...
> if (condition2)
>     z++;
>
> Now, as it may turn out, for your algorithm the value "0" is an
> out-of-range, incorrect value. Not a problem as it is a dead assignment,
> right?
>
> But then the maintenance programmer comes along and changes condition1
> so it is not always the same as condition2, and now the z++ sees the
> invalid "0" value sometimes, and a silent bug is introduced.
>
> This bug will not remain undetected with the default NaN initialization.
>

To address the concern of static analysis being too hard: I wish we 
could have it but limit the amount of static analysis that's done. 
Something like this: the compiler will test branches of if-else 
statements and switch-case statements, but it will not drop into 
function calls with ref parameters nor will it accept initialization in 
looping constructs (foreach, for, while, etc).  A compiler is an 
incorrect implementation if it implements /too much/ static analysis.

The example code you give can be implemented with such limited static 
analysis:

void lotsaCode() {
	... lotsa code ...
}

float z;
if ( condition1 )
{
	z = 5;
	lotsaCode();
	z++;
}
else
{
	lotsaCode();
}
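
For contrast, here's a self-contained sketch of code this limited 
analyzer would still reject; it compiles fine today, but under the 
rules above the loop assignment wouldn't count as definite 
initialization (compute is a hypothetical helper):

float compute(size_t i) { return i * 2.0f; }

void main()
{
	float z;
	foreach (i; 0 .. 10)
		z = compute(i);	// inside a looping construct:
				// not accepted as initialization
	z++;			// error under these rules:
				// 'z' may be used unassigned
}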

I will, in advance, concede that this does not prevent people from just 
writing "float z = 0;".  In my dream-world the compiler recognizes a set 
of common mistake-inducing patterns like the one you mentioned and then 
prints helpful error messages suggesting alternative design patterns. 
That way, bugs are prevented and users become better programmers.
August 12, 2012
Re: Which D features to emphasize for academic review article
On Saturday, 11 August 2012 at 23:49:18 UTC, Chad J wrote:
> On 08/10/2012 06:01 PM, Walter Bright wrote:
>> It catches only a subset of these at compile time. I can craft 
>> any number of ways of getting it to miss diagnosing it. 
>> Consider this one:
>>
>> float z;
>> if (condition1)
>>     z = 5;
>> ... lotsa code ...
>> if (condition2)
>>     z++;
>>
>> To diagnose this correctly, the static analyzer would have to 
>> determine that condition1 produces the same result as 
>> condition2, or not. This is impossible to prove. So the static 
>> analyzer either gives up and lets it pass, or issues an 
>> incorrect diagnostic. So our intrepid programmer is forced to 
>> write:
>>
>> float z = 0;
>> if (condition1)
>>     z = 5;
>> ... lotsa code ...
>> if (condition2)
>>     z++;
>>
>> Now, as it may turn out, for your algorithm the value "0" is 
>> an out-of-range, incorrect value. Not a problem as it is a 
>> dead assignment, right?
>>
>> But then the maintenance programmer comes along and changes 
>> condition1 so it is not always the same as condition2, and now 
>> the z++ sees the invalid "0" value sometimes, and a silent bug 
>> is introduced.
>>
>> This bug will not remain undetected with the default NaN 
>> initialization.

 Let's keep in mind every one of these truths:

1) Programmers are lazy; if you can get away with not 
initializing something, you'll avoid it. In C I've failed to 
initialize variables many times, and the resulting bugs were 
sometimes difficult to find, where a NaN or equivalent would have 
quickly flushed them out before running with any real data.

2) There are a lot of inexperienced programmers. I worked for a 
short period of time at a company that did minimal training in a 
language like Java, where I ended up being seen as an utter 
genius (compared even to the teachers).

3) Bugs in large environments and/or scenarios are far more 
difficult, if not impossible, to debug. I've made a program that 
handles merging of various dialogs (using doubly-linked-like 
lists); I can debug them with 100 or fewer entries to work with, 
but beyond 100 (and often it's tens of thousands) it can become 
such a pain, given the indirection and how the original structure 
was built, that I refuse based on difficulty vs end results (plus 
sanity).

 We also need to sometimes laugh at our mistakes, and learn from 
others. I recommend everyone read from rinkworks a bit if you 
have the time, and refresh yourselves.

 http://www.rinkworks.com/stupid/cs_programming.shtml
August 12, 2012
Re: Which D features to emphasize for academic review article
Walter Bright wrote:
>> That's just the thing: bugs are arguably easier to hunt down 
>> when things default
>> to a consistent, usable value.
>
> Many, many programming bugs trace back to assumptions that 
> floating point numbers act like ints. There's just no way to 
> avoid knowing and understanding the differences.

My point was that the majority of the time there wasn't a bug 
introduced. Meaning the code was written and functioned as 
expected after I initialized the value to 0. I was only expecting 
the value to act similarly (in initial value) to its 'int' 
relative, but received a NaN in the output because I forgot to be 
explicit.


> I don't see why this is a bad thing. The fact is, with NaN you 
> know there is a bug. With 0, you may never realize there is a 
> problem. Andrei wrote me that the output of a program he is 
> working on has billions of result values, and he noticed a 
> few were NaNs, which he traced back to a bug. If the bug had 
> set the float value to 0, there's no way he would have ever 
> noticed the issue.
>
> It's all about daubing bugs with day-glo orange paint so you 
> know there's a problem. Painting them with camo is not the 
> right solution.

Yes, and this is an excellent argument for using NaN as a 
debugging practice in general, but I don't see anything in favor 
of defaulting to NaN. If you don't do some kind of check against 
your code, especially with such large data sets, bugs of various 
kinds are going to go unchecked regardless.

A bug where an initial data value was accidentally initialized to 
0 (by a third party later on, for instance) could be just as 
hard to catch, or harder if you're expecting a NaN to appear. In 
fact, an explicit set to NaN might discourage a third party from 
assigning to it without first questioning the original intention. 
In this situation I imagine best practice would be to write:

float dataValue = float.nan; // MUST BE NaN, DO NOT CHANGE!
                             // set to NaN to ensure is-set.
August 12, 2012
Re: Which D features to emphasize for academic review article
On 8/11/12 7:33 PM, Walter Bright wrote:
[snip]

Allow me to insert an opinion here. This post illustrates quite well how 
opinionated our community is (for better or worse).

The OP has asked a topical question in a matter that is interesting and 
also may influence the impact of the language to the larger community. 
Before long the thread has evolved into the familiar pattern of a debate 
over a minor issue on which reasonable people may disagree and that's 
unlikely to change. We should instead do our best to give a balanced 
high-level view of what D offers for econometrics.

To the OP - here are a few aspects that may deserve interest:

* Modeling power - from what I understand econometrics is 
modeling-heavy, which is more difficult to address in languages such as 
Fortran, C, C++, Java, Python, or the likes of Matlab.

* Efficiency - D generates native code for floating point operations and 
has control over data layout and allocation. Speed of generated code is 
dependent on the compiler, and the reference compiler (dmd) does a 
poorer job at it than the GNU-based compiler (gdc).

* Convenience - D is designed to "do what you mean" wherever possible 
and simplify common programming tasks, numeric or not. That makes the 
language comfortable to use even by a non-specialist, in particular in 
conjunction with appropriate libraries.

A few minuses I can think of:

- Maturity and availability of numeric and econometrics libraries are an 
obvious issue. There are some libraries (e.g. 
https://github.com/kyllingstad/scid/wiki) maintained and extended 
through volunteer effort.

- The language's superior modeling power and level of control comes at 
an increase in complexity compared to languages such as e.g. Python. So 
the statistician would need a larger upfront investment in order to reap 
the associated benefits.


Andrei
August 12, 2012
Re: Which D features to emphasize for academic review article
Andrei Alexandrescu:

> - The language's superior modeling power and level of control 
> comes at an increase in complexity compared to languages such 
> as e.g. Python. So the statistician would need a larger upfront 
> investment in order to reap the associated benefits.

Statisticians often use the R language 
(http://en.wikipedia.org/wiki/R_language ).
Python contains much more "computer science" and CS complexity 
compared to R. Not just advanced stuff like coroutines, 
metaclasses, decorators, Abstract Base Classes, operator 
overloading, and so on, but even simpler things, like generators 
and standard library collections such as heaps and deques.
For some statisticians I've seen, even several parts of Python 
are too hard to use or understand. I have rewritten several of 
their Python scripts.

Bye,
bearophile
August 12, 2012
Re: Which D features to emphasize for academic review article
On Sunday, 12 August 2012 at 02:28:44 UTC, Andrei Alexandrescu 
wrote:
> On 8/11/12 7:33 PM, Walter Bright wrote:
> [snip]
>
> Allow me to insert an opinion here. This post illustrates quite 
> well how opinionated our community is (for better or worse).
>
> The OP has asked a topical question in a matter that is 
> interesting and also may influence the impact of the language 
> to the larger community. Before long the thread has evolved 
> into the familiar pattern of a debate over a minor issue on 
> which reasonable people may disagree and that's unlikely to 
> change. We should instead do our best to give a balanced 
> high-level view of what D offers for econometrics.
>
> To the OP - here are a few aspects that may deserve interest:
>
> * Modeling power - from what I understand econometrics is 
> modeling-heavy, which is more difficult to address in languages 
> such as Fortran, C, C++, Java, Python, or the likes of Matlab.
>
> * Efficiency - D generates native code for floating point 
> operations and has control over data layout and allocation. 
> Speed of generated code is dependent on the compiler, and the 
> reference compiler (dmd) does a poorer job at it than the 
> GNU-based compiler (gdc).
>
> * Convenience - D is designed to "do what you mean" wherever 
> possible and simplify common programming tasks, numeric or not. 
> That makes the language comfortable to use even by a 
> non-specialist, in particular in conjunction with appropriate 
> libraries.
>
> A few minuses I can think of:
>
> - Maturity and availability of numeric and econometrics 
> libraries are an obvious issue. There are some libraries (e.g. 
> https://github.com/kyllingstad/scid/wiki) maintained and 
> extended through volunteer effort.
>
> - The language's superior modeling power and level of control 
> comes at an increase in complexity compared to languages such 
> as e.g. Python. So the statistician would need a larger upfront 
> investment in order to reap the associated benefits.
>
>
> Andrei

Andrei,

Thanks for bringing this back to the original topic and for your 
thoughts.

Indeed, a lot of econometricians are using MATLAB, R, Gauss, Ox 
and the like. But there are a number of econometricians who need 
the raw power of a natively compiled language (especially 
financial econometricians, whose data are huge) who typically 
program in either Fortran or C/C++.  It is really this group that 
I am trying to reach.  I think D has a lot to offer this group in 
terms of programmer productivity and reliability of code.  I 
think this applies to statisticians as well, as I see a lot of 
them in this latter group too.

I also want to reach the MATLABers because I think they can get a 
lot more modeling power (I like how you put that) without too 
much more difficulty (see Ox - nearly as complicated as C++ but 
without the power).  Many MATLAB and R programmers end up 
recoding a good part of their algorithms in C++ and calling that 
code from the interpreted language.  I have always found this 
kind of mixed language programming to be messy, time consuming, 
and error prone.  Special tools are cropping up to handle this 
(see Rcpp).  This just proves to me the usefulness of a 
productive AND powerful language like D for econometricians!

I am sensitive to the drawbacks you mention (especially lack of 
numeric libraries).  I am so sick of wasting my time in C++ 
though that I have almost decided to just start writing my own 
econometric library in D.  Earlier in this thread there was a 
discussion of extended precision in D and I mentioned the need to 
recode things like BLAS and LAPACK in D.  Templates in D seem 
perfect for this problem.  As an expert in template 
meta-programming, what are your thoughts?  How is this different 
from what is being done in SciD?  It seems they are mostly 
concerned with wrapping the old CBLAS and CLAPACK libraries.
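
To give a feel for what I mean, here is a rough, hypothetical 
sketch of how one D template could stand in for a whole family of 
BLAS routines (saxpy/daxpy and friends); the name and layout are 
just illustrative, not SciD's actual API:

    // one generic definition instantiated at float, double, or real
    void axpy(T)(T alpha, const T[] x, T[] y)
    {
        assert(x.length == y.length);
        foreach (i; 0 .. y.length)
            y[i] += alpha * x[i]; // y = alpha*x + y
    }

    unittest
    {
        double[] x = [1.0, 2.0];
        double[] y = [3.0, 4.0];
        axpy(2.0, x, y);
        assert(y == [5.0, 8.0]);
    }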

Again, thanks for your thoughts and your TDPL book. Probably the 
best programming book I've ever read!

TJB
August 12, 2012
Re: Which D features to emphasize for academic review article
On 12.08.2012 02:43, F i L wrote:
> Yes, and this is an excellent argument for using NaN as a
> debugging practice in general, but I don't see anything in favor
> of defaulting to NaN. If you don't do some kind of check against
> code, especially with such large data sets, bugs of various kinds
> are going to go unchecked regardless.
>

It makes absolutely no sense to have different initialization styles in 
debug and release builds - and, as per Andrei's example, there are many 
situations where slow debug code isn't capable of reproducing the error 
in a human timespan - especially when working with million- or 
billion-element datasets (like I also do...)