Thread overview
Types and sizes
Aug 17, 2001
Cem Karan
Aug 17, 2001
Kent Sandvik
Aug 17, 2001
Walter
Aug 17, 2001
smilechaser
Aug 17, 2001
Cem Karan
Aug 17, 2001
Cem Karan
Aug 26, 2001
Walter
August 17, 2001
I've been going over the discussions on what kind of character support D should have (Unicode, ASCII, etc.), and it struck me that this train of thought exposes a series of fundamental problems.  I'll address my thoughts on types first, and then argue that you should ditch the char type completely.

First off, we have too many types in C that are not orthogonal: byte, short, int, and long (and char, although that is a hack in my mind), along with float and double.  When declaring a variable, you should specify the kind of value it holds, the amount of storage it needs, and whether it is signed.  E.g.:

   unsigned 8 int foo;
   12 float bar;

where the leading numerals are the number of bytes of storage.  If you want to be really specific, make it the number of bits; that way, you never run into the problem that the concept of byte means different things on different machines.

With this scheme, there are only two types: integers and floats.  Bytes are (unsigned 1 int), and if you really want to write 'byte' instead, typedef it.
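
As a rough illustration of how this maps onto what you can already say with explicit-width aliases in today's C/C++ (the names uint8bytes, int4bytes and byte_t are made up for this sketch, and there is no standard 12-byte float to match the second example):

   #include <cstdint>

   // The storage size becomes part of the declaration rather than a platform accident.
   using uint8bytes = std::uint64_t;   // "unsigned 8 int" -> 8 bytes of unsigned integer
   using int4bytes  = std::int32_t;    // "4 int"          -> 4 bytes of signed integer
   using byte_t     = std::uint8_t;    // "unsigned 1 int", typedef'd back to 'byte'

   int main() {
       uint8bytes foo = 0;
       byte_t     b   = 0xFF;
       (void)foo; (void)b;
       return 0;
   }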

As for the char, there are a LOT of problems with it, and arguing over the idea of Unicode or something else won't help much.

1) Not all character encodings are a uniform size.  UTF-8 can need anywhere from 1 to 6 bytes to encode a single character.

2) Character encodings that do have a uniform size are often incomplete.  ISO 10646 (whose 32-bit form is known as UCS-4) has a large number of character planes that are deliberately left unencoded.  Unicode also has ranges that are undefined.  This means that even though you created a valid Unicode file, it may not appear the same to two different sets of users on two different machines.

3) Character sorting is a MAJOR headache.  Read the specs for UCS-4 at http://anubis.dkuug.dk/JTC1/SC22/WG20/docs/projects#14651 for a better idea of what I'm talking about.  Here is the problem in a nutshell: in certain scripts, characters are combined when they are displayed, and there is also a single character that, when rendered, looks exactly the same as that combination.  Because of their locations in the code tables, you can't just cast them to ints and compare them for sorting; to a human they have the same meaning and should therefore sort next to each other.  In addition, how do you compare two strings that are from different scripts (Japanese and English, for example)?
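
A tiny C++ illustration of the combining-character problem (the code points are real Unicode: U+00E9 is the precomposed 'é', U+0301 is the combining acute accent; the comparison itself is only a sketch):

   #include <string>
   #include <iostream>

   int main() {
       // "é" as one precomposed code point: U+00E9
       std::u32string precomposed = U"\u00E9";
       // The visually identical "é" as 'e' (U+0065) + combining acute accent (U+0301)
       std::u32string combined    = U"e\u0301";

       // A naive code-point comparison (the "cast them to ints" approach) says the
       // strings differ, even though a reader sees the same character on screen and
       // would expect the two to sort together.
       std::cout << (precomposed == combined ? "equal" : "not equal") << "\n";  // "not equal"
       return 0;
   }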

So this brings me to my point.  Ditch the idea of the char completely. Make it a compiler supplied class instead.  This will allow you to support multiple types of encodings.  I know that this might sound like a dumb idea, but try looking at the complete specifications for some of these encodings, and you'll see why I'm saying this.  Also, you need to think about the number of encodings that have been invented so far. IBM has a package called International Components for Unicode which just does transcoding from one encoding to another.  Currently it handles 150 different encoding types.

The problem is that people are used to the idea of a char.  The trick is to make it possible to use a shorthand notation for the zero-argument and single-argument constructors (a rough sketch of such a class follows the examples below):

   Unicode mychar;
         would be equivalent to:
   Unicode mychar = new Unicode();

   Unicode yourchar = 'c';
         would be equivalent to:
   Unicode yourchar = new Unicode('c');

   Unicode theirchar = "\U0036";
         would be equivalent to:
   Unicode theirchar = new Unicode("\U0036");
         which would allow you to use characters that your own system can't handle.
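
A minimal C++ sketch of what such a compiler-supplied class could look like (everything here is hypothetical; a real one would also need to parse string escapes like "\U0036", so a raw code-point constructor stands in for that case):

   class Unicode {
   public:
       Unicode() : codepoint_(0) {}                       // Unicode mychar;
       Unicode(char c)                                    // Unicode yourchar = 'c';
           : codepoint_(static_cast<unsigned char>(c)) {}
       Unicode(unsigned long cp) : codepoint_(cp) {}      // Unicode theirchar = 0x0036;
                                                          // (a code point the local
                                                          // character set may not contain)
   private:
       unsigned long codepoint_;                          // one UCS-4 code point
   };

   int main() {
       Unicode a;             // shorthand for the zero-argument constructor
       Unicode b = 'c';       // shorthand for the single-argument constructor
       Unicode c = 0x0036UL;  // a code point given numerically
       (void)a; (void)b; (void)c;
       return 0;
   }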

References:

http://www.unicode.org

The link below goes to UCS-4, which is a superset of Unicode.  It is 32 bits in size.
http://anubis.dkuug.dk/JTC1/SC2/WG2/

The link below goes to the ISO working group on internationalization of programming languages, environments, and subsystems. http://anubis.dkuug.dk/jtc1/sc22/

READ THIS.  They have spent quite a bit of time on how Strings and such should be handled, and this is what they've come up with.  Even if you don't support it directly, it is necessary. http://anubis.dkuug.dk/JTC1/SC22/WG20/docs/projects#15435

-- 
Cem Karan

"Why would I want to conquer the world?!  Then I'd be expected to solve its problems!"
August 17, 2001
"Cem Karan" <cfkaran2@eos.ncsu.edu> wrote in message news:160820012037366028%cfkaran2@eos.ncsu.edu...

> So this brings me to my point.  Ditch the idea of the char completely. Make it a compiler supplied class instead.  This will allow you to support multiple types of encodings.  I know that this might sound like a dumb idea, but try looking at the complete specifications for some of these encodings, and you'll see why I'm saying this.  Also, you need to think about the number of encodings that have been invented so far. IBM has a package called International Components for Unicode which just does transcoding from one encoding to another.  Currently it handles 150 different encoding types.

Yes, this is a good point.  In some markets, Japan for example, you need to worry about multiple character encodings.  In a project I was involved in, we converted everything from whatever input format into Java (UTF-8), stored the data as XML (UTF-8), and when the text was rendered back to the browser we converted it into Shift-JIS using the Java libraries.  This way it's much easier: you use one single canonical string representation in the actual runtime environment, do the actual operations, and then let the various translation libraries handle the output.

Unicode UTF-8 would be one natural internal format.
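
For what it's worth, the same "one canonical encoding inside, transcode only at the edges" pattern looks roughly like this in C/C++ with POSIX iconv() (a sketch only; encoding names such as "SHIFT_JIS" depend on the local iconv installation):

   #include <iconv.h>
   #include <cstdio>
   #include <cstring>

   int main() {
       const char* utf8_text = "hello";               // the canonical internal form
       char out[64];

       iconv_t cd = iconv_open("SHIFT_JIS", "UTF-8"); // to-encoding, from-encoding
       if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

       char*  in_ptr   = const_cast<char*>(utf8_text);
       size_t in_left  = std::strlen(utf8_text);
       char*  out_ptr  = out;
       size_t out_left = sizeof(out);

       // Transcode only at the output boundary; everything before this point
       // worked on the single internal representation.
       if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1)
           std::perror("iconv");
       iconv_close(cd);

       std::fwrite(out, 1, sizeof(out) - out_left, stdout);
       return 0;
   }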

--Kent



August 17, 2001
I understand it is a complicated subject, and you've explained why very well.  But I'm not willing to give up on ASCII chars yet; they're too useful! -Walter

"Cem Karan" <cfkaran2@eos.ncsu.edu> wrote in message news:160820012037366028%cfkaran2@eos.ncsu.edu...
-- snip --


August 17, 2001
"Cem Karan" <cfkaran2@eos.ncsu.edu> wrote in message news:160820012037366028%cfkaran2@eos.ncsu.edu...
> I've been going over the discussions on what kind of character support D should have: should it be Unicode, ASCII, etc.  and it just struck me that there are a series of fundamental problems that are exposed by this train of thought.  I'll address my thoughts for types first, and then argue that you should ditch the char type completely.
>
> First off, we have too many types that are not orthogonal in C: the byte, short, int, and long (and char, although that is a hack in my mind) along with float and double.  When specifying a variable, you should specify the kind that it is, the amount of storage that it needs, and if it is signed or not.  E.g.:
>
>    unsigned 8 int foo;
>    12 float bar;
>

-- snip --

I agree with your treatment of types (int and float). In my experience the fact that different platforms define "int" and "long" differently leads to confusion. You also end up with absurd size qualifiers grafted onto the base types as the platform "gets bigger" (i.e. 8088 -> 80x86 -> Intel's 64-bit architecture), so that you end up with a "long long long int" or some other such nonsense.

One solution is to do something like Java did and support only the lowest common denominator - the byte. If a larger type is required it is "built" from the basic building block (the byte). The advantage here is that you can write code that seamlessly uses variables bigger than what the host machine supports.

Now, the major problem with this is that it doesn't make good use of the host hardware. If I have a machine that can process 64-bit words at a time the previous solution is still going to be banging around byte values to emulate a 64-bit number.

My solution to this is:

    1) Declare two base types (float and int) that can have a variable size.

    2) This size would always be expressed in bytes, and would have a maximum value (say 16).

    3) The compiler would treat each datatype as a "plugin". If the data type was supported natively (i.e. a 32-bit int on a 32-bit machine) the plugin used would be the "native int32" plugin. If it wasn't, the "emulated int32" plugin would be used.

This ensures that programs will compile/run on just about any platform. Also, previously written code could take advantage of hardware advances just by recompiling it (assuming, of course, that the compiler author creates a native plugin for the new data types).

Now the use of this technology for only floats and ints may seem like overkill, but think of extending it to other data types such as the matrix or vector...
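
A rough C++ sketch of the selection half of that idea (the emulated arithmetic itself is omitted, and Int<N>, EmulatedInt and the widths chosen are all illustrative):

   #include <cstddef>
   #include <cstdint>
   #include <type_traits>
   #include <array>

   // Software fallback for widths the hardware doesn't provide; the arithmetic
   // operators would be implemented here.
   template <std::size_t Bytes>
   struct EmulatedInt {
       std::array<std::uint8_t, Bytes> limbs{};
   };

   // Int<Bytes> picks the "native plugin" when a machine type of that width
   // exists, and the "emulated plugin" otherwise.
   template <std::size_t Bytes>
   using Int =
       std::conditional_t<Bytes == 1, std::int8_t,
       std::conditional_t<Bytes == 2, std::int16_t,
       std::conditional_t<Bytes == 4, std::int32_t,
       std::conditional_t<Bytes == 8, std::int64_t,
                          EmulatedInt<Bytes>>>>>;

   int main() {
       Int<4>  a = 42;   // resolves to the native 32-bit integer
       Int<16> b;        // resolves to the emulated 16-byte integer
       (void)a; (void)b;
       return 0;
   }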




August 17, 2001
In article <9li2tp$jfr$1@digitaldaemon.com>, smilechaser <smilechaser@SPAMGUARDyahoo.com> wrote:

>>SNIP<<
>     1) Declare two base types (float and int) that can have a variable size.
> 
>     2) This size would always be expressed in bytes, and would have a
> maximum value (say 16).

No maximums please.  I do a fair amount of high precision scientific computing, and there isn't anything quite as irritating as limitations on precision.

>     3) The compiler would treat each datatype as a "plugin". If the data
> type was supported natively (i.e. a 32-bit int on a 32-bit machine) the
> plugin used would be the "native int32" plugin. If it wasn't the "emulated
> int32" plugin would be used.
> 
> This ensures that programs will compile/run on just about any platform. Also previously written code could take advantage of hardware advances just by recompiling them (assuming of course that the compiler author would create a native plugin for the new data types).
> 
> Now the use of this technology for only floats and ints may seem like overkill, but think of extending it to other data types such as the matrix or vector...

I LIKE this idea!  It actually allows you to have arbitrary-precision integers and floats on any hardware.  If you decide that you really, really need 256-byte integers, you can do it, and the compiler transparently supports it.  There is only one possible problem that this might cause: the definition of what a 'byte' is, is not always the same on all machines (most modern machines consider a byte to be 8 bits, but there have been cases in the past where this wasn't so).  I would like to suggest that you define a byte to be 8 bits, no matter what the underlying hardware says it is.  That way the developer isn't bitten by a 'bug' when doing a simple recompile...
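
One way to nail that assumption down in code today (a one-line sketch, nothing more):

   #include <climits>

   // Refuse to build on hardware whose native byte is not 8 bits, instead of
   // letting the meaning of 'byte' silently change underneath a recompile.
   static_assert(CHAR_BIT == 8, "this code assumes an 8-bit byte");

   int main() { return 0; }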
August 17, 2001
Here are a few more thoughts to extend what I was talking about.  Include the concept of a range for your variables.  This means that when the compiler is running, it can perform more logic-error checking, and it can perform some fairly unique optimizations.  First, the error checking.

Have any of you guys had the problem of adding '1' one too many times to a variable?  Like in a loop where you had something like this:

   for (short i = 0 ; i < q; i++)
      // do stuff

What if q is greater than 32767, the largest value a 16-bit short can hold?  This loop won't end, and because the compiler doesn't have a concept of a range for q, it won't flag the possible error.

Or what about divide by 0 errors?

   for (int i = 0; i < 300; i++)
      for (int j = a; j < b; j++)
         k = i/j;

As long as a * b > 0, this will work; but if that ever isn't true, j can reach zero and you divide by zero, which is not what we wanted.  So how to solve this?  It requires two parts.  First, you need to identify all of the use-def chains in your program, trace back to those variables that are not defined within the program (user input, for example), and define ranges for those variables.  The compiler will have to identify which variables are unspecified, and then the programmer will have to specify them.  Once this is done, the compiler can check for logic errors that are possible.  (This won't guarantee that an error will occur; it only states that it is possible for an error to occur.  Also, this technique won't catch all logic errors; it just makes the program more robust.)
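
As a sketch of what range-annotated variables could look like at the library level (the bounds check is a runtime assert standing in for what the proposed compiler would verify statically; Ranged and its bounds are made up for this example):

   #include <cassert>

   // A value that carries its legal range with it. In the proposal the compiler
   // would know these bounds and check the arithmetic at compile time.
   template <long Lo, long Hi>
   class Ranged {
   public:
       Ranged(long v = Lo) : value_(v) { assert(v >= Lo && v <= Hi); }
       long get() const { return value_; }
   private:
       long value_;
   };

   int main() {
       Ranged<1, 300> divisor = 7;    // the bounds exclude zero, so the division
                                      // below can be proven safe
       int k = 42 / static_cast<int>(divisor.get());
       (void)k;
       return 0;
   }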

This also allows an optimization that isn't currently possible: you can tell the compiler to reduce the size of variables to the minimum necessary.  If the ranges are known, then you can do another trick: you can pack multiple variables into vectors.  (Anti-flame alert: I know that this optimization is not always useful, and may even be virtually impossible to prove correct in certain cases.  My example only shows part of the checking that would be necessary for this case to work.  The point is that this is an optimization that cannot be done currently, but could be done if you knew the ranges.)

For example, let's say that you have a long string of 7-bit ASCII characters.  You know that they are all lowercase letters and you want to uppercase them.  The code that you use is:

   for (char* temp = charArrayPointer; temp < charArrayPointer + charArraySize; temp++)
         *temp -= 32;

If the compiler knows nothing about the range of values of each of the chars in the array, then it must treat each one as having the possibility of becoming negative in value, and that requires all 8 bits of a char to hold.  That means that it cannot treat the chars as a vector of bytes and operate on all of them at the same time.  On a 32-bit machine, that means that each byte gets promoted to an integer, worked on, then demoted to a byte.  (Promoting to a short and operating in pairs won't work; there's too much overhead in splitting the shorts up and then recombining them.)

On the other hand, if the compiler determines that all values in the array are bounded to the range [97, 122] decimal, then it knows that subtracting 32 can never borrow out of any byte.  That means that it can concatenate 4 bytes together into an integer, concatenate 4 bytes that each contain the number 32 together, and then subtract the latter from the former.  It will have done 4 operations in a single clock (or 8 if we're talking about a 64-bit machine).  It can then do the next 4 operations and so on until it runs out of data.  And since the integer is 'packed', you don't have to do any demotion back to bytes; just write those 4 bytes back and keep on rolling.  (And you don't have to point out to me that this only works on integer-aligned arrays whose size is a multiple of 4; again, range checking would be needed to know for sure whether it is possible.)
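
In today's C++ that packed subtraction looks like this (a sketch of the byte-packing trick itself, not of the compiler analysis that would justify it):

   #include <cstdint>
   #include <cstring>
   #include <cstdio>

   int main() {
       char text[] = "abcd";              // 4 lowercase bytes, known range [97, 122]

       std::uint32_t packed;
       std::memcpy(&packed, text, 4);     // concatenate 4 bytes into one integer
       packed -= 0x20202020u;             // subtract 32 from each byte in one operation;
                                          // no borrow can cross a byte boundary because
                                          // every byte is at least 97
       std::memcpy(text, &packed, 4);     // write the 4 bytes back, no demotion step

       std::printf("%s\n", text);         // prints "ABCD"
       return 0;
   }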

I know that this idea won't hold in a number of cases, especially those where the ranges truly aren't known.  But it might help in other cases.
August 26, 2001
The idea of ranges as an extension is a great one, for exactly the reasons you mention. -Walter

Cem Karan wrote in message <170820011658229482%cfkaran2@eos.ncsu.edu>...
-- snip --