Thread overview
A char is also not an int
May 27, 2004
Arcane Jill
May 27, 2004
Matthew
May 27, 2004
Benji Smith
May 27, 2004
Kevin Bealer
May 27, 2004
Stewart Gordon
May 27, 2004
Walter
May 28, 2004
James McComb
May 28, 2004
Roberto Mariottini
May 29, 2004
Phill
May 31, 2004
Roberto Mariottini
Jun 04, 2004
Matthew
May 28, 2004
Derek Parnell
May 27, 2004
While we're on the subject of disunifying one type from another, may I point out that a char is also not an int.

Back in the old days of C, there was no 8-bit wide type other than char, so if you wanted an 8-bit wide numeric type, you used a char.

Similarly, in Java, there is no UNSIGNED 16-bit wide type other than char, so if that's what you need, you use char.

D has no such problems, so maybe it's about time to make the distinction clear. Logically, it makes no sense to try to do addition and subtraction with the at-sign or the square-right-bracket symbol. We all KNOW that the zero glyph is *NOT* the same thing as the number 48.

This was true even back in the days of ASCII, but it's even more true in Unicode. A char in D stores, not a character, but a fragment of UTF-8, an encoding of a Unicode character - and even a Unicode character is /itself/ an encoding. There is no longer a one-to-one correspondence between character and glyph. (There IS such a one-to-one correspondence in the old ASCII range of \u0020 to \u007E, of course, since Unicode is a superset of ASCII.)
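To make that concrete, here's a rough sketch in D:

    // U+00E9 ("é") encodes to two UTF-8 code units; neither char *is* the character.
    char[] s = "é".dup;
    assert(s.length == 2);      // 0xC3, 0xA9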

Perhaps it's time to change this one too?

>       int a = 'X';            // wrong
>       char a = 'X';           // right
>       int a = cast(int) 'X'   // right

Arcane Jill


May 27, 2004
> While we're on the subject of disunifying one type from another, may I point out
> that a char is also not an int.

Can one implicitly convert char to int?

Man, that sucks!

Pardon my indignation, and credit my claim that I've never actually tried it: I have a long-standing aversion to such things from C/C++.

If it's true it needs to be made untrue ASAP.

(Was that strong enough? I hope so ...)


May 27, 2004
On Thu, 27 May 2004 07:16:19 +0000 (UTC), Arcane Jill
<Arcane_member@pathlink.com> wrote:

>Perhaps it's time to change this one too?
>
>>       int a = 'X';            // wrong
>>       char a = 'X';           // right
>>       int a = cast(int) 'X'   // right

I don't even like the notion of being able to explicitly cast from a char to an int. Especially in the case of Unicode characters, the semantics of a cast (even an explicit cast) are not very well defined.

Getting the int value of a character should, in my opinion, be the province of a static method on a specific string class.
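Something like this sketch, perhaps (the struct and method names are made up, not an existing library API):

    // Hypothetical wrapper: the one audited place where the conversion happens.
    struct Char
    {
        static uint valueOf(dchar c)
        {
            return cast(uint) c;
        }
    }

    // usage:  uint v = Char.valueOf('X');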

--Benji
May 27, 2004
Arcane Jill wrote:

<snip>
> D has no such problems, so maybe it's about time to make the
> distinction clear. Logically, it makes no sense to try to do addition
> and subtraction with the at-sign or the square-right-bracket symbol.

Not even in cryptography and the like?

> We all KNOW that the zero glyph is *NOT* the same thing as the number
> 48.
> 
> This was true even back in the days of ASCII, but it's even more true
> in Unicode. A char in D stores, not a character, but a fragment of
> UTF-8, an encoding of Unicode character - and even a Unicode
> character is /itself/ an encoding. There is no longer a one-to-one
> correspondance between character and glyph.
<snip>

By 'character' do you mean 'character' or 'char value'?

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the
unfortunate victim of intensive mail-bombing at the moment.  Please keep
replies on the 'group where everyone may benefit.
May 27, 2004
In article <fd8cb0dfge0cm85o781a2rjpp9ait6fskq@4ax.com>, Benji Smith says...
>
>On Thu, 27 May 2004 07:16:19 +0000 (UTC), Arcane Jill
><Arcane_member@pathlink.com> wrote:
>
>>Perhaps it's time to change this one too?
>>
>>>       int a = 'X';            // wrong
>>>       char a = 'X';           // right
>>>       int a = cast(int) 'X'   // right
>
>I don't even like the notion of being able to explicitly cast from a char to an int. Especially in the case of Unicode characters, the semantics of a cast (even an explicit cast) are not very well defined.
>
>Getting the int value of a character should, in my opinion, be the province of a static method on a specific string class.
>
>--Benji

I think the opposite is true; with Unicode, the semantics CAN be solid.  In a normal C program, this is not the case.  Consider:

int chA = 'A';
int chZ = 'Z';

if ((chZ - chA) == 25) {
    // Is this true for EBCDIC?  I dunno.
}

In C, the encoding is assumed to be the system's default character encoding, which is not necessarily Unicode or ASCII. But if the language DEFINES Unicode as the operative representation, then 'A' will always map to the same integer value.

In any case, sometimes you need the integer value.
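In D, where chars are defined in terms of Unicode, the same check is guaranteed rather than a platform accident (a quick sketch):

    int chA = 'A';              // always 65 (U+0041)
    int chZ = 'Z';              // always 90 (U+005A)
    assert(chZ - chA == 25);    // holds everywhere, unlike the C version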

Kevin



May 27, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:c944k3$1o53$1@digitaldaemon.com...
> While we're on the subject of disunifying one type from another, may I point out
> that a char is also not an int.
>
> Back in the old days of C, there was no 8-bit wide type other than char, so if
> you wanted an 8-bit wide numeric type, you used a char.
>
> Similarly, in Java, there is no UNSIGNED 16-bit wide type other than char, so if
> that's what you need, you use char.
>
> D has no such problems, so maybe it's about time to make the distinction clear.
> Logically, it makes no sense to try to do addition and subtraction with the
> at-sign or the square-right-bracket symbol. We all KNOW that the zero glyph is
> *NOT* the same thing as the number 48.
>
> This was true even back in the days of ASCII, but it's even more true in
> Unicode. A char in D stores, not a character, but a fragment of UTF-8, an
> encoding of a Unicode character - and even a Unicode character is /itself/ an
> encoding. There is no longer a one-to-one correspondence between character and
> glyph. (There IS such a one-to-one correspondence in the old ASCII range of
> \u0020 to \u007E, of course, since Unicode is a superset of ASCII.)
>
> Perhaps it's time to change this one too?
>
> >       int a = 'X';            // wrong
> >       char a = 'X';           // right
> >       int a = cast(int) 'X'   // right

I understand where you're coming from, and this is a compelling idea, but this idea has been tried out before in Pascal. And I can say from personal experience it is one reason I hate Pascal <g>. Chars do want to be integral data types, and requiring a cast for it leads to execrably ugly expressions filled with casts. In moving to C, one of the breaths of fresh air was to not need all those %^&*^^% casts any more. Let me enumerate a few ways that chars are used as integral types:

1) converting case
2) using char as index into a translation table
3) encoding/decoding UTF strings
4) encryption/decryption software
5) compression code
6) hashing
7) regex internal implementation
8) char value as input to a state machine like a lexer
9) encoding/decoding strings to/from integers

in other words, routine system programming tasks. The improvement D has, however, is to have chars be a separate type from byte, which makes for better self-documenting code, and one can have different overloads for them.
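For example, item 2 needs nothing more than the char's integral value (a rough sketch, not library code):

    // A 256-entry table indexed directly by the char value - no cast required.
    char translate(char c, ref char[256] table)
    {
        return table[c];
    }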


May 28, 2004
On Thu, 27 May 2004 07:16:19 +0000 (UTC), Arcane Jill wrote:

> While we're on the subject of disunifying one type from another, may I point out that a char is also not an int.
> 
> Back in the old days of C, there was no 8-bit wide type other than char, so if you wanted an 8-bit wide numeric type, you used a char.
> 
> Similarly, in Java, there is no UNSIGNED 16-bit wide type other than char, so if that's what you need, you use char.
> 
> D has no such problems, so maybe it's about time to make the distinction clear. Logically, it makes no sense to try to do addition and subtraction with the at-sign or the square-right-bracket symbol. We all KNOW that the zero glyph is *NOT* the same thing as the number 48.
> 
> This was true even back in the days of ASCII, but it's even more true in Unicode. A char in D stores, not a character, but a fragment of UTF-8, an encoding of a Unicode character - and even a Unicode character is /itself/ an encoding. There is no longer a one-to-one correspondence between character and glyph. (There IS such a one-to-one correspondence in the old ASCII range of \u0020 to \u007E, of course, since Unicode is a superset of ASCII.)
> 
> Perhaps it's time to change this one too?
> 
>>       int a = 'X';            // wrong
>>       char a = 'X';           // right
>>       int a = cast(int) 'X'   // right
> 

Maybe... Another way of looking at it is that a character has (at least) two properties: a Glyph and an Identifier. Within an encoding set (eg. Unicode, ASCII, EBCDIC, ...), no two characters have the same identifier even though they may have the same glyph (eg. Space and Non-Breaking Space). One may then argue that an efficient datatype for the identifier is an unsigned integer value. This makes it simple to use as an index into a glyph table. In fact, an encoding set is likely to have multiple glyph tables for various font representations, but that is another issue altogether.

So, an implicit cast from char to int would be just getting the character's identifier value, which is not such a bad thing.

What is a bad thing is making assumptions about the relationships between character identifiers. There is no necessary correlation between a character set's collation sequence and the characters' identifiers.

I frequently work with encryption algorithms, and integer character identifiers are a *very* handy thing indeed.
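A toy illustration of the sort of thing I mean (not a real cipher, and toyXor is just a name I made up):

    // XOR each UTF-8 code unit with a key byte - pure integer work on char values.
    void toyXor(char[] buf, ubyte key)
    {
        foreach (ref c; buf)
            c = cast(char)(c ^ key);
    }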

-- 
Derek
28/May/04 10:50:16 AM
May 28, 2004
Walter wrote:

> I understand where you're coming from, and this is a compelling idea, but
> this idea has been tried out before in Pascal. And I can say from personal
> experience it is one reason I hate Pascal <g>. Chars do want to be integral
> data types, and requiring a cast for it leads to execrably ugly expressions
> filled with casts.

I agree with you about chars Walter, but this is because I think chars are different from bools.

The way I see it, bools can be either TRUE or FALSE, and these values are not numeric. TRUE + 32 is not defined. (Of course, bools will be *implemented* as numeric values, but I'm talking about syntax.)

But character standards, such as ASCII and Unicode, *define* characters as numeric quantities. ASCII *defines* A to be 65. So characters really are numeric.

'A' + 32 equals 'a'. This behaviour is well-defined.
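In D this is directly checkable, at least within the ASCII range:

    char upper = 'A';
    assert(upper + 32 == 'a');    // 65 + 32 == 97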

So I'd like to have a proper bool type, but I'd prefer D chars to remain as they are.

James McComb
May 28, 2004
In article <c95al5$19mr$1@digitaldaemon.com>, Walter says...
>
>I understand where you're coming from, and this is a compelling idea, but this idea has been tried out before in Pascal. And I can say from personal experience it is one reason I hate Pascal <g>.

That's strange, because this is one of the reasons that makes me *like* Pascal :-)

>Chars do want to be integral
>data types, and requiring a cast for it leads to execrably ugly expressions
>filled with casts. In moving to C, one of the breaths of fresh air was to
>not need all those %^&*^^% casts any more.

In my experience, only poor programming practice leads to many int <-> char casts.

>Let me enumerate a few ways that
>chars are used as integral types:
>
>1) converting case

This is true only for English. Real natural languages are more complex than this, needing collating tables. I don't know about non-Latin alphabets.

>2) using char as index into a translation table

var
  a: array['a'..'z'] of 'A'..'Z';
  b: array[char] of char;

>3) encoding/decoding UTF strings
>4) encryption/decryption software
>5) compression code
>6) hashing
>7) regex internal implementation

This is something you just won't do frequently, once they are in a library. Simply converting all input to integers and reconverting the final output to chars should work.

>8) char value as input to a state machine like a lexer
>9) encoding/decoding strings to/from integers

I don't see the point here.

>in other words, routine system programming tasks. The improvement D has, however, is to have chars be a separate type from byte, which makes for better self-documenting code, and one can have different overloads for them.

This is better than nothing :-)

Ciao


May 29, 2004
Roberto:

Can you explain what you mean by "Real natural languages"?


