Thread overview
Treating the abusive unsigned syndrome
Nov 25, 2008
Denis Koroskin
Nov 25, 2008
bearophile
Nov 25, 2008
bearophile
Nov 25, 2008
bearophile
Nov 25, 2008
Nick Sabalausky
Nov 26, 2008
KennyTM~
Nov 26, 2008
Nick Sabalausky
Nov 26, 2008
bearophile
Nov 26, 2008
bearophile
Nov 26, 2008
Christopher Wright
Nov 26, 2008
Kagamin
Nov 25, 2008
Sergey Gromov
Nov 25, 2008
Sergey Gromov
Nov 25, 2008
Russell Lewis
Nov 25, 2008
bearophile
Nov 25, 2008
Nick Sabalausky
Nov 26, 2008
Kagamin
Nov 25, 2008
Daniel de Kok
Nov 26, 2008
Ary Borenszweig
Nov 25, 2008
Sean Kelly
Nov 26, 2008
Michel Fortin
Nov 26, 2008
Michel Fortin
Nov 26, 2008
Don
Nov 26, 2008
Michel Fortin
Nov 26, 2008
Denis Koroskin
Nov 26, 2008
Denis Koroskin
Nov 27, 2008
Don
Nov 27, 2008
Don
Nov 27, 2008
KennyTM~
Nov 27, 2008
KennyTM~
Nov 28, 2008
Michel Fortin
Nov 28, 2008
Don
Nov 28, 2008
Don
Nov 28, 2008
Don
Dec 01, 2008
Fawzi Mohamed
Nov 28, 2008
Derek Parnell
Nov 29, 2008
Frits van Bommel
Nov 29, 2008
Derek Parnell
Nov 28, 2008
Sean Kelly
Nov 25, 2008
Sean Kelly
Nov 26, 2008
Don
Nov 26, 2008
Sean Kelly
Nov 26, 2008
Lars Kyllingstad
Nov 26, 2008
Lars Kyllingstad
Nov 27, 2008
Kagamin
Nov 26, 2008
Sergey Gromov
Nov 26, 2008
Sergey Gromov
Nov 27, 2008
bearophile
Nov 27, 2008
Kagamin
Nov 26, 2008
Walter Bright
Nov 26, 2008
Sean Kelly
Nov 27, 2008
Sean Kelly
Nov 27, 2008
Denis Koroskin
Nov 27, 2008
Sean Kelly
Nov 26, 2008
Michel Fortin
Nov 26, 2008
Nick Sabalausky
Nov 27, 2008
Christopher Wright
Nov 27, 2008
Simen Kjaeraas
Nov 27, 2008
Derek Parnell
Nov 27, 2008
Derek Parnell
Nov 28, 2008
bearophile
November 25, 2008
D pursues compatibility with C and C++ in the following manner: if a code snippet compiles in both C and D or C++ and D, then it should have the same semantics.

A classic problem with C and C++ integer arithmetic is that any operation involving at least one unsigned integral automatically receives an unsigned type, regardless of how silly that actually is, semantically. About the only advantage of this rule is that it's simple. IMHO it has only disadvantages from then on.

The following operations suffer from the "abusive unsigned syndrome" (u is an unsigned integral, i is a signed integral):

(1) u + i, i + u
(2) u - i, i - u
(3) u - u
(4) u * i, i * u, u / i, i / u, u % i, i % u (compatibility with C requires that these all return unsigned, ouch)
(5) u < i, i < u, u <= i etc. (all ordering comparisons)
(6) -u

Logic operations &, |, and ^ also yield unsigned, but such cases are less abusive because at least the operation wasn't arithmetic in the first place. Comparing for equality is also quite a conundrum - should minus two billion compare equal to 2_294_967_296? I'll ignore these for now and focus on (1) - (6).

So far we haven't found a solid solution to this problem that at the same time allows "good" code to pass through, weeds out "bad" code, and is compatible with C and C++. The closest I got was to have the compiler define the following internal types:

__intuint
__longulong

I've called them "dual-signed integers" in the past, but let's try the shorter "undecided sign". Each of these is a subtype of both the signed and the unsigned integral in its name, e.g. __intuint is a subtype of both int and uint. (Originally I thought of defining __byteubyte and __shortushort as well but dropped them in the interest of simplicity.)

The sign-ambiguous operations (1) - (6) yield __intuint if no operand size was larger than 32 bits, and __longulong otherwise. Undecided sign types define their own operations. Let x and y be values of undecided sign. Then x + y, x - y, and -x also return a sign-ambiguous integral (the size is that of the largest operand). However, the other operators do not work on sign-ambiguous integrals, e.g. x / y would not compile because you must decide what sign x and y should have prior to invoking the operation. (Rationale: multiplication/division work differently depending on the signedness of their operands).

User code cannot define a symbol of sign-ambiguous type, e.g.

auto a = u + i;

would not compile. However, given that __intuint is a subtype of both int and uint, it can be freely converted to either whenever there's no ambiguity:

int a = u + i; // fine
uint b = u + i; // fine

The advantage of this scheme is that it weeds out many (most? all?) surprises and oddities caused by the abusive unsigned rule of C and C++. The disadvantage is that it is more complex and may surprise the novice in its own way by refusing to compile code that looks legit.

At the moment, we're in limbo regarding the decision to go forward with this. Walter, like many good long-time C programmers, knows the abusive unsigned rule so well he's not hurt by it and consequently has little incentive to see it as a problem. I have had to teach C and C++ to young students coming from Java introductory courses and have a more up-to-date perspective on the dangers. My strong belief is that we need to address this mess somehow, and type inference will only make it more painful (in the hands of beginners, auto can be quite a dangerous tool for wrong belief propagation). I also know seasoned programmers who had no idea that -u compiles, let alone that it oddly returns an unsigned type.

Your opinions, comments, and suggestions for improvements would as always be welcome.


Andrei
November 25, 2008
On Tue, 25 Nov 2008 18:59:01 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> D pursues compatibility with C and C++ in the following manner: if a code snippet compiles in both C and D or C++ and D, then it should have the same semantics.
>
> [...]

I think it's fine. That's the way LLVM stores integral values internally, IIRC.

But what is the type of -u? If it is undecided, then the following should compile:

uint u = 100;
uint s = -u; // undecided implicitly convertible to unsigned
November 25, 2008
Denis Koroskin wrote:
> On Tue, 25 Nov 2008 18:59:01 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> [...]
> 
> I think it's fine. That's the way the LLVM stores the integral values internally, IIRC.
> 
> But what is the type of -u? If it is undecided, then the following should compile:
> 
> uint u = 100;
> uint s = -u; // undecided implicitly convertible to unsigned

Yah, but at least you actively asked for an unsigned. Compare and contrast with surprises such as:

uint a = 5;
writeln(-a); // this won't print -5

Such code would be disallowed in the undecided-sign regime.


Andrei
November 25, 2008
A few general comments.

Andrei Alexandrescu:

> D pursues compatibility with C and C++ in the following manner: if a code snippet compiles in both C and D or C++ and D, then it should have the same semantics.

I didn't know of such "support" for C++ syntax too; isn't such "support" for C syntax only? D has very little in common with C++.

This rule is good because you can take a piece of C code and convert it to D with less work and fewer surprises. I have already translated large pieces of C code to D, so I appreciate this.

But in several respects C syntax and semantics are too error-prone or "wrong", so sometimes this rule can also become a significant disadvantage for a language like D that tries to be much less error-prone than C.

One solution is to "disable" some of the more error-prone syntax allowed in C, turning it into a compilation error. For example, I have seen newbies write bugs caused by using & where a && was necessary. In such cases, just adopting "and" and making "&&" a syntax error solves the problem and doesn't lead to bugs when you convert C code to D (you just use a search-and-replace, replacing && with and in the code).

In other situations it may be less easy to find this kind of solution (that is, to invent an alternative syntax/semantics and make the C one a syntax error); in such cases I think it's better to discuss each situation independently. In some situations we can even break the standard way D pursues compatibility, for the sake of avoiding bugs and making the semantics better.


> The disadvantage is that it is more complex

It's not really more complex; it just makes visible some hidden complexity that is already present and inherent in the signed/unsigned nature of the numbers.
It also follows the Python Zen rule: "In the face of ambiguity, refuse the temptation to guess."


> and may surprise the novice in its own way by refusing to compile code that looks legit.

A compile error is better than a potential runtime bug.


> Walter, as many good long-time C programmers, knows the abusive unsigned rule so well he's not hurt by it and consequently has little incentive to see it as a problem.

I'm not a programming newbie, but in the last year I have put two bugs related to this into my code, so I suggest finding ways to avoid this silly situation. I think the first bug was something like:
if (arr.lenght > x) ...
where x was a signed int with value -5 (this specific bug can also be solved by making array length a signed value. What's the point of making it unsigned in the first place? I have found that in D it's safer to use signed values everywhere you don't strictly need an unsigned value, and length doesn't need to be unsigned).

Besides the unsigned/signed problems discussed here, it may be useful to list some of the other situations where C syntax/semantics may lead to bugs. For example, does D fix the C semantics of the % (modulo) operation?
Another example: both Pascal and Python 3 have two different operators for division, one for floating-point division and one for integer division (in Pascal they are / and div; in Python 3 they are / and //). So could it be useful for D too to define two different operators for this purpose?

Bye,
bearophile
November 25, 2008
bearophile:
> if (arr.lenght > x) ...

Oh, yes :-) and writing "lenght" instead of "lenght" is a common mistake of mine, usually the code editor allows me to avoid this error because the right one becomes colored. That's why in the past I have suggested something simpler and shorter like "len" (others have suggested "size" instead, it too is acceptable to me).

Bye,
bearophile
November 25, 2008
bearophile wrote:
>> Walter, as many good long-time C programmers, knows the abusive unsigned rule so well he's not hurt by it and consequently has
>> little incentive to see it as a problem.
> 
> I'm not a newbie of programming, but in the last year I have put in
> my code two bugs related to this, so I suggest to find ways to avoid
> this silly situation. I think the first bug was something like: if
> (arr.lenght > x) ...

> where x was a signed int with value -5 (this specific bug can also be
> solved making array length a signed value. What's the point of making
> it unsigned in the first place? I have seen that in D it's safer to
> use signed values everywhere you don't strictly need an unsigned
> value. And that length doesn't need to be unsigned).

It's worthwhile keeping length an unsigned type if we can convincingly sell unsigned types as models of natural numbers. With the current rules, we can't make a convincing argument. But if we do manage to improve the rules, then we'll all be better off.

Andrei
November 25, 2008
I remembered a couple more details. The names bits8, bits16, bits32, and bits64 were a possible choice for undecided-sign integrals. Walter and I liked that quite a bit. Walter also suggested that we make those actual full types accessible to programmers. We were both concerned that they'd add to the already large panoply of integral types in D. Dropping bits8 and bits16 would reduce bloat at the cost of consistency.

So we're contemplating:

(a) Add bits8, bits16, bits32, bits64 public types.
(b) Add bits32, bits64 public types.
(c) Add bits8, bits16, bits32, bits64 compiler-internal types.
(d) Add bits32, bits64 compiler-internal types.

Make your pick or add more choices!


Andrei
November 25, 2008
"bearophile" wrote
> bearophile:
>> if (arr.lenght > x) ...
>
> Oh, yes :-) and writing "lenght" instead of "lenght" is a common mistake of mine

lol!!!


November 25, 2008
Steven Schveighoffer:
> lol!!!

I know, I know... :-) But when people make an error so often, the fault lies elsewhere: in the original choice of that word to denote how many items an iterable has.

In my libs I have defined len() like this, that I use now and then (where running speed isn't essential):

long len(TyItems)(TyItems items) {
    static if (HasLength!(TyItems))
        return items.length;
    else {
        long len;
        // this generates: foreach (p1, p2, p3; items) len++;  with a variable number of p1, p2...
        mixin("foreach (" ~ SeriesGen1!("p", ", ", OpApplyCount!(TyItems), 1) ~ "; items) len++;");
        return len;
    }
} // End of len(items)

/// ditto
long len(TyItems, TyFun)(TyItems items, TyFun pred) {
    static assert(IsCallable!(TyFun), "len(): predicate must be a callable");
    long len;

    static if (IsAA!(TyItems)) {
        foreach (key, val; items)
            if (pred(key, val))
                len++;
    } else static if (is(typeof(TyItems.opApply))) {
        mixin("foreach (" ~ SeriesGen1!("p", ", ", OpApplyCount!(TyItems), 1) ~ "; items)
            if (pred(" ~ SeriesGen1!("p", ", ", OpApplyCount!(TyItems), 1) ~ "))
                len++;");
    } else {
        foreach (el; items)
            if (pred(el))
                len++;
    }

    return len;
} // End of len(items, pred)

alias len!(string) strLen; /// ditto
alias len!(int[]) intLen; /// ditto
alias len!(float[]) floatLen; /// ditto

Having a global callable like len() instead of an attribute is (sometimes) better, because you can use it for example like this (this is working syntax of my dlibs):

children.sort(&len!(string));
That sorts the array of strings "children" according to the given callable key, that is, the len of each string.

Bye,
bearophile
November 25, 2008
"Andrei Alexandrescu" wrote
>I remembered a couple more details. The names bits8, bits16, bits32, and bits64 were a possible choice for undecided-sign integrals. Walter and I liked that quite some. Walter also suggested that we make those actually full types accessible to programmers. We both were concerned that they'd add to the already large panoply of integral types in D. Dropping bits8 and bits16 would reduce bloating at the cost of consistency.
>
> So we're contemplating:
>
> (a) Add bits8, bits16, bits32, bits64 public types.
> (b) Add bits32, bits64 public types.
> (c) Add bits8, bits16, bits32, bits64 compiler-internal types.
> (d) Add bits32, bits64 compiler-internal types.
>
> Make your pick or add more choices!

One other thing to contemplate:

What happens if you add a bits32 to a bits64, long, or ulong value?  This needs to be illegal, since you don't know whether to sign-extend or not.  Or you could reinterpret the expression to promote the original types to 64 bits first?

This makes the version with 8 and 16 bit types less attractive.

Another alternative is to select the bits type based on the entire expression. Of course, you'd have to disallow them as public types, and you'd want some special optimizations. You could represent it conceptually as computing the expression for all the bits types until the decided one is used, and then the compiler can optimize out the unused ones, which would at least keep it context-free.

-Steve

