November 27, 2008
Andrei Alexandrescu wrote:
> Sean Kelly wrote:
>> Don wrote:
>>>
>>> Although it would be nice to have a type which was range-limited, 'uint' doesn't do it. Instead, it guarantees the number is between 0 and int.max*2+1 inclusive. Allowing mixed operations encourages programmers to focus on the benefit ('the lower bound is zero!') while forgetting that there is an enormous downside ('I'm saying that this could be larger than int.max!')
>>
>> This inspired me to think about where I use uint and I realized that I don't.  I use size_t for size/length representations (largely because sizes can theoretically be >2GB on a 32-bit system), and ubyte for bit-level stuff, but that's it.
> 
> For the record, I use unsigned types wherever there's a non-negative number involved (e.g. a count). So I'd be helped by better unsigned operations.

To be fair, I generally use unsigned numbers for values that are logically always positive.  These just tend to be sizes and counts in my code.

> I wonder how often these super-large arrays do occur on 32-bit systems. I do have programs that try to allocate as large a contiguous matrix as possible, but never sat down and tested whether a >2GB chunk was allocated on the Linux cluster I work on. I'm quite annoyed by this >2GB issue because it's a very practical and very rare issue in a weird contrast with a very principled issue (modeling natural numbers).

Yeah, I have no idea how common they are, though my guess would be that they are rather uncommon.  As a library programmer, I simply must assume that they are in use, which is why I use size_t as a matter of course.


Sean
November 27, 2008
The more I read about this, the more I am convinced that removing the following
- implicit int <-> uint conversion
- uint - uint (not 100% sure about this)
- mixed int / uint arithmetic
As well as changing array.length to int, would remove most problems.

If you desperately need a > 2^31 element array, having to roll your own is not the main problem.

The fact that the type of uint - uint could be int or uint depending on what the programmer wants, tells me that the programmer should be tasked with informing the compiler what he really wants - i.e. cast.
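A small D sketch of the ambiguity being described: today `uint - uint` yields `uint`, so the "wrong" interpretation is silently chosen, and only an explicit cast records what the programmer actually meant.

```d
import std.stdio;

void main()
{
    uint a = 2;
    uint b = 4;

    // Today, uint - uint yields uint: 2 - 4 wraps around
    // to uint.max - 1 instead of producing -2.
    auto wrapped = a - b;
    writeln(wrapped); // 4294967294

    // With an explicit cast, the programmer informs the
    // compiler which interpretation is intended:
    int diff = cast(int)(a - b);
    writeln(diff); // -2
}
```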

-- 
Simen
November 27, 2008
On 27.11.08 at 03:46, Sean Kelly wrote:

> Andrei Alexandrescu wrote:
>> Sean Kelly wrote:
>>> Don wrote:
>>>>
>>>> Although it would be nice to have a type which was range-limited, 'uint' doesn't do it. Instead, it guarantees the number is between 0 and int.max*2+1 inclusive. Allowing mixed operations encourages programmers to focus on the benefit ('the lower bound is zero!') while forgetting that there is an enormous downside ('I'm saying that this could be larger than int.max!')
>>>
>>> This inspired me to think about where I use uint and I realized that I don't.  I use size_t for size/length representations (largely because sizes can theoretically be >2GB on a 32-bit system), and ubyte for bit-level stuff, but that's it.
>>  For the record, I use unsigned types wherever there's a non-negative number involved (e.g. a count). So I'd be helped by better unsigned operations.
>
> To be fair, I generally use unsigned numbers for values that are logically always positive.  These just tend to be sizes and counts in my code.
>
>> I wonder how often these super-large arrays do occur on 32-bit systems. I do have programs that try to allocate as large a contiguous matrix as possible, but never sat down and tested whether a >2GB chunk was allocated on the Linux cluster I work on. I'm quite annoyed by this >2GB issue because it's a very practical and very rare issue in a weird contrast with a very principled issue (modeling natural numbers).
>
> Yeah, I have no idea how common they are, though my guess would be that they are rather uncommon.  As a library programmer, I simply must assume that they are in use, which is why I use size_t as a matter of course.
>
>
> Sean

If they can be more than 2GB, why can't they be more than 4GB? It is dangerous to assume that they won't be, and that's why uint is dangerous. You trade safety for one additional bit of information, which is a bad bargain.

Soon enough we won't use uints the same way we don't use ushorts (I should have first asked whether anyone uses ushort these days, but there is so little gain in using ushort as opposed to short or int that I consider it impractical). The 64-bit era will give us 64-bit pointers and 64-bit counters. Do you think you will prefer ulong over long for an additional bit? You really shouldn't.

My proposal

Short summary:
- Disallow bitwise operations on both signed types and unsigned types, allow arithmetic operations
- Discourage usage of unsigned types. Introduce bits8, bits16, bits32 and bits64 as a replacement
- Disallow arithmetic operations on bits* types, allow bitwise operations on them
- Disallow mixed-type operations (compare, add, sub, mul and div)
- Disallow implicit casts between all types
- Use int and long (or ranged types) for lengths and indices, with runtime checks (a.length-- is always dangerous no matter what compile-time checks you make).
- Add type constructors for int/uint/etc: "auto x = int(int.max + 1);" throws at run-time
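The danger of decrementing an unsigned length, mentioned in the list above, can be shown with today's `size_t`-typed `.length`:

```d
import std.stdio;

void main()
{
    int[] a = [];

    // a.length is size_t (unsigned); subtracting 1 from an empty
    // array's length wraps around to size_t.max:
    writeln(a.length - 1); // a huge positive number, not -1

    // So this innocent-looking loop condition is silently wrong:
    bool loopWouldRun = 0 < a.length - 1;
    writeln(loopWouldRun); // true -- the "empty" loop would iterate

    // With a signed length (as proposed), a.length - 1 would be -1
    // and the condition would be false, as intended.
}
```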


The two most common uses of uints are:

0) Bitfields or masks, packed values and hexadecimal constants (more on bitfields below)
1) Numbers that can't be negative (counters, sizes/lengths etc)


Bitfields

Bitfields are handy, and using an unsigned type over a signed one is surely preferable. The most common operations on bitfields are bitwise AND, OR, (R/L)SHIFT and XOR. You shouldn't subtract from or add to them; it is an error in most cases. This is what the new bits8, bits16, bits32 and bits64 types should be used for:

bits32 argbColor;
int alphaShift = 24; // could be any integral type, actually

// shift
bits32 alphaMask = (0xFF << alphaShift); // 0xFF is of type bits8

auto value2 = value1 & mask; // all three are of bits* types (value1 and mask declared elsewhere)

// you can only shift bits; the result is in bits, too, i.e. the following is incorrect:
int i = -42;
int x = (i << 8); // an error:
// 1) can't shift a value of type int
// 2) can't assign a value of type bits32 to a variable of type int

// ubyte is still handy sometimes (a channel value belongs to the [0..255] range);
// the masked result is of type bits32, so use an explicit cast to convert it
// to the target data type:
ubyte alpha = cast(ubyte)((argbColor & alphaMask) >> alphaShift);

// Alternatively:
ubyte alpha2 = ubyte((argbColor & alphaMask) >> alphaShift);

The type constructor throws if the source value (which is of type bits32 in this example) can't be stored in a ubyte. This might serve as a replacement for signed/unsigned helper functions.

int i = 0xFFFFFFFF; // an error, can't convert value of type bits32 to variable of type int
int i = int.max + 1; // ok
int i =  int(int.max + 1); // an exception is raised at runtime

int i = 0xABCD - 0xDCBA; // not allowed. Add explicit casts

auto u = cast(uint)0xABCD - cast(uint)0xDCBA; // result type is uint, no checks for overflow
auto i = cast(int)0xABCD - cast(int)0xDCBA; // result type is int, no checks for overflow

auto e = cast(uint)0xABCD - cast(int)0xDCBA; // an error, can't subtract int from uint

// type ctors in action:
auto i = int(cast(int)0xABCD - cast(int)0xDCBA); // result type is int, an exception on overflow
auto u = uint(cast(uint)0xABCD - cast(uint)0xDCBA); // same here for uint


Non-negative values

Just use int/long. Or some ranged type ([0..short.max], [0..int.max], [0..long.max]) could be used as well — a library type, perhaps. Let's call them nshort/nint/nlong. They should have the same set of operations as short/int/long but perform additional checks, throwing on underflow and overflow.

int x = 42;
nint nx = x; // ok
nx = -x; // throws

nx = int.max; // ok
++nx; // throws

nx = 0;
--nx; // throws

nx = 0;
nint ny = 42;

nx = ny; // no checking is done

int y = ny; // no checking is done, either
short s = ny; // error, cast needed

short s = cast(short)ny; // never throws
short s = short(ny); // might throw
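The checked type described above could be sketched as a library struct in today's D. The name `nint` and the operator set shown here are illustrative only; a real implementation would cover the full arithmetic interface:

```d
import std.exception : enforce;

struct nint
{
    private int value;

    this(long v)
    {
        // Reject anything outside [0 .. int.max] at construction.
        enforce(v >= 0 && v <= int.max, "value out of [0 .. int.max] range");
        value = cast(int) v;
    }

    // Checked increment: throws instead of overflowing.
    ref nint opUnary(string op : "++")()
    {
        enforce(value < int.max, "overflow");
        ++value;
        return this;
    }

    // Checked decrement: throws instead of wrapping below zero.
    ref nint opUnary(string op : "--")()
    {
        enforce(value > 0, "underflow");
        --value;
        return this;
    }

    // Conversion to int is always safe, so it can be implicit.
    int get() const { return value; }
    alias get this;
}

void main()
{
    auto nx = nint(42);
    --nx;           // fine
    int y = nx;     // safe implicit conversion via alias this
    // auto bad = nint(-1); // would throw at construction
}
```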
November 27, 2008
Tomas Lindquist Olsen wrote:
> I'm not really sure what I think about all this. I try to always insert assertions before operations like this, which makes me think the nicest solution would be if the compiler errors out if it detects a problematic expression that is unchecked...
> 
> uint diff(uint begin, uint end)
> {
>     return end - begin; // error
> }
> 
> 
> uint diff(uint begin, uint end)
> {
>     assert(begin <= end);
>     return end - begin; // ok because of the assert
> }
> 
> 
> I'm not going to get into how this would be implemented in the compiler,  but it sure would be sweet :)

On the other hand, the CPU can report on integer overflow, so you could turn that into an exception if the expression doesn't include a cast.
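In the absence of such compiler support, the checked pattern from the quoted example can be written out explicitly today; the assert documents the precondition that makes the unsigned subtraction safe:

```d
uint diff(uint begin, uint end)
{
    // Precondition that makes end - begin meaningful as a uint.
    assert(begin <= end, "diff: begin must not exceed end");
    return end - begin;
}

void main()
{
    assert(diff(3, 10) == 7);
    // diff(10, 3) would trip the assert in a debug build; with
    // assertions compiled out, it silently wraps to 4294967289.
}
```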
November 27, 2008
Andrei Alexandrescu wrote:
> Denis Koroskin wrote:
>> On Wed, 26 Nov 2008 18:24:17 +0300, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>>
>>> Also consider:
>>>
>>> auto delta = a1.length - a2.length;
>>>
>>> What should the type of delta be? Well, it depends. In my scheme that wouldn't even compile, which I think is a good thing; you must decide whether prior information makes it an unsigned or a signed integral.
>>>
>>
>> Sure, it shouldn't compile. But explicit casting to either type won't help. Let's say you expect that a1.length > a2.length and thus expect a strictly positive result. Putting an explicit cast will not detect (but suppress) an error and give you an erroneous result silently.
> 
> But "silently" and "putting a cast" don't go together. It's the cast that makes the erroneous result non-silent.
> 
> Besides, you don't need to cast. You can always use a function that does the requisite checks. std.conv will have some of those, should any change in the rules make it necessary.

I doubt that would be used in practice.

> By this I'm essentially replying Don's message in the bugs newsgroup: nobody puts a gun to your head to cast.
> 
>> Putting an assert(a1.length > a2.length) might help, but the check will be unavailable unless code is compiled with asserts enabled.
> 
> Put an enforce(a1.length > a2.length) then.
> 
>> A better solution would be to write code as follows:
>>
>> auto delta = unsigned(a1.length - a2.length); // returns an unsigned value, throws on overflow (i.e., "2 - 4")
>> auto delta = signed(a1.length - a2.length); // returns result as a signed value. Throws on overflow (i.e., "int.min - 1")
>> auto delta = a1.length - a2.length; // won't compile
> 
> Amazingly this solution was discussed with these exact names! The signed and unsigned functions can be implemented as libraries, but unfortunately (or fortunately I guess) that means the bits32 and bits64 are available to all code.
> 
> One fear of mine is the reaction of throwing of hands in the air "how many integral types are enough???". However, if we're to judge by the addition of long long and a slew of typedefs to C99 and C++0x, the answer is "plenty". I'd be interested in gauging how people feel about adding two (bits64, bits32) or even four (bits64, bits32, bits16, and bits8) types as basic types. They'd be bitbags with undecided sign ready to be converted to their counterparts of decided sign.

Here I think we have a fundamental disagreement: what is an 'unsigned int'? There are two disparate ideas:

(A) You think that it is an approximation to a natural number, ie, a 'positive int'.
(B) I think that it is a 'number with NO sign'; that is, the sign depends on context. It may, for example, be part of a larger number. Thus, I largely agree with the C behaviour -- once you have an unsigned in a calculation, it's up to the programmer to provide an interpretation.

Unfortunately, the two concepts are mashed together in C-family languages. (B) is the concept supported by the language typing rules, but usage of (A) is widespread in practice.

If we were going to introduce a slew of new types, I'd want them to be for 'positive int'/'natural int', 'positive byte', etc.

Natural int can always be implicitly converted to either int or uint, with perfect safety. No other conversions are possible without a cast.
Non-negative literals and manifest constants are naturals.

The rules are:
1. Anything involving unsigned is unsigned, (same as C).
2. Else if it contains an integer, it is an integer.
3. (Now we know all quantities are natural):
If it contains a subtraction, it is an integer [Probably allow subtraction of compile-time quantities to remain natural, if the values stay in range; flag an error if an overflow occurs].
4. Else it is a natural.
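The four rules above could be illustrated on some hypothetical expressions; note that 'natural' is the proposed type, not actual D, so this is a non-executable sketch:

```d
// Hypothetical: assume 'natural' is the proposed non-negative type
// and all values below stay in range.
uint u;
int i;
natural n1, n2;

auto a = u + n1;  // rule 1: involves unsigned     -> uint
auto b = i + n1;  // rule 2: else involves int     -> int
auto c = n1 - n2; // rule 3: all-natural, but has
                  //         a subtraction         -> int
auto d = n1 + n2; // rule 4: all natural           -> natural
auto e = n1 + 1;  // literal 1 is natural          -> natural (rule 4)
```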


The reason I think literals and manifest constants are so important is that they are a significant fraction of the natural numbers in a program.

[Just before posting I've discovered that other people have posted some similar ideas].

November 27, 2008
Don wrote:
> Andrei Alexandrescu wrote:
>> One fear of mine is the reaction of throwing of hands in the air "how many integral types are enough???". However, if we're to judge by the addition of long long and a slew of typedefs to C99 and C++0x, the answer is "plenty". I'd be interested in gauging how people feel about adding two (bits64, bits32) or even four (bits64, bits32, bits16, and bits8) types as basic types. They'd be bitbags with undecided sign ready to be converted to their counterparts of decided sign.
> 
> Here I think we have a fundamental disagreement: what is an 'unsigned int'? There are two disparate ideas:
> 
> (A) You think that it is an approximation to a natural number, ie, a 'positive int'.
> (B) I think that it is a 'number with NO sign'; that is, the sign depends on context. It may, for example, be part of a larger number. Thus, I largely agree with the C behaviour -- once you have an unsigned in a calculation, it's up to the programmer to provide an interpretation.
> 
> Unfortunately, the two concepts are mashed together in C-family languages. (B) is the concept supported by the language typing rules, but usage of (A) is widespread in practice.

In fact we are in agreement. C tries to make it usable as both, and partially succeeds by having very lax conversions in all directions. This leads to the occasional puzzling behaviors. I do *want* uint to be an approximation of a natural number, while acknowledging that today it isn't much of that.

> If we were going to introduce a slew of new types, I'd want them to be for 'positive int'/'natural int', 'positive byte', etc.
> 
> Natural int can always be implicitly converted to either int or uint, with perfect safety. No other conversions are possible without a cast.
> Non-negative literals and manifest constants are naturals.
> 
> The rules are:
> 1. Anything involving unsigned is unsigned, (same as C).
> 2. Else if it contains an integer, it is an integer.
> 3. (Now we know all quantities are natural):
> If it contains a subtraction, it is an integer [Probably allow subtraction of compile-time quantities to remain natural, if the values stay in range; flag an error if an overflow occurs].
> 4. Else it is a natural.
> 
> 
> The reason I think literals and manifest constants are so important is that they are a significant fraction of the natural numbers in a program.
> 
> [Just before posting I've discovered that other people have posted some similar ideas].

That sounds encouraging. One problem is that your approach leaves the unsigned mess as it is, so although natural types are a nice addition, they don't bring a complete solution to the table.


Andrei
November 27, 2008
Andrei Alexandrescu Wrote:

> There are several schools of thought (for the lack of a better phrase):
> 
> 1. The Purist Mathematician: We want unsigned to approximate natural numbers, natural numbers aren't closed for subtraction, therefore u1 - u2 should be disallowed.

I thought mathematics doesn't distinguish between, say, the natural 5, the integer 5, and the real 5. N, Z and R are sets, not types of numbers. There is even the notion of an equivalence class to deem numbers with different representations the same (not just equal).
November 27, 2008
Kagamin wrote:
> Andrei Alexandrescu Wrote:
> 
>> There are several schools of thought (for the lack of a better
>> phrase):
>> 
>> 1. The Purist Mathematician: We want unsigned to approximate
>> natural numbers, natural numbers aren't closed for subtraction,
>> therefore u1 - u2 should be disallowed.
> 
> I thought, mathematics doesn't distinguish between, say, natural 5,
> integral 5 and real 5. N, Z and R are sets, not types of numbers.
> There is even notion of equivalence class to deem numbers with
> different representation as the same (not just equal).

Right, but the notion of set closedness for an operation comes from math.

Andrei
November 27, 2008
bearophile Wrote:

> In my programs I use unsigned integers and unsigned longs as:
> - bitfields, in a single size_t, for example to represent a small set of items.
> - bitarrays, in an array of size_t, to represent a larger set, to have arrays of bit flags, etc.
> - to pack small variables into a uint, size_t, etc., for example using the first 5 bits to represent a, the following 2 bits to represent b, etc. In such situations I have never packed such variables into a signed int.

I think, signed ints can hold bits as gracefully as unsigned ones.

> - when I need very large integer values, but this has to be done with care, because they can't be converted back to ints.

I don't think that large integers know or respect computer-specific integer limits. They just get larger and larger.

> - Probably there are a few other situations; for example, I think I've used a ushort once, but not many of them.

Legacy technologies tend to use unsigned types intensively, and people got used to unsigned chars (for comparisons and character maps).
November 27, 2008
Kagamin wrote:
> bearophile Wrote:
> 
>> In my programs I use unsigned integers and unsigned longs as: -
>> bitfields, a single size_t, for example to represent a small set of
>> items. - bitarrays, in an array of size_t, to represent a larger
>> set, to have array of bit flags, etc. - to pack small variables
>> into a uint, size_t, etc, for example use the first 5 bits to
>> represent a, the following 2 bits to represent b, etc. In such
>> situation I have never pack such variables into a signed int.
> 
> I think, signed ints can hold bits as gracefully as unsigned ones.

Problem is there is an odd jump whenever the sign bit gets into play. An expert programmer can easily deal with that, but it's rather tricky.
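A small D example of that odd jump: the same bit pattern shifts differently depending on signedness, because signed right shift is arithmetic and drags the sign bit along (D's `>>>` operator forces a logical shift either way):

```d
import std.stdio;

void main()
{
    uint u = 0x80000000; // top bit set
    int  s = int.min;    // the same bit pattern, as a signed int

    // Unsigned right shift fills with zeros:
    writeln(u >> 4);  // 134217728 (0x08000000)

    // Signed right shift replicates the sign bit:
    writeln(s >> 4);  // -134217728 (0xF8000000 as bits)

    // >>> performs a logical shift even on a signed operand:
    writeln(s >>> 4); // 134217728
}
```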

>> - when I need very large integer values, but this has to be done
>> with care, because they can't be converted back to ints.
> 
> I don't think that large integers know or respect computers-specific
> integers limits. They just get larger and larger.

Often large integers hold counts or sizes of objects fitting in computer memory. There is a sense of completeness in a systems-level language being able to use a native type to express any offset in memory. That's why it would be somewhat of a bummer if we defined size_t as int on 32-bit systems: I, at least, would feel like I was giving something up.


Andrei