July 29, 2006
Andrew Fedoniouk wrote:
> To Walter:
> 
> Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
> 
> "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the
> Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character.
> This codepoint will remain forever unassigned, precisely so that it may be used
> for purposes such as this."
> 
> is just wrong.
> 
> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
> R-zone: {U+FFF0..U+FFFF} - region assigned already.

Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it forms the subrange of the "Noncharacters" (see http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are "intended for process internal uses, but are not permitted for interchange". 0xFFFF specifically is marked "<not a character> - the value FFFF is guaranteed not to be a Unicode character at all".
So yes, it's assigned - for exactly such a purpose as D is using it for :).

> 2) For char[] selection of 0xFF is wrong and even worse.
> For example character with code 0xFF in Latin-I encoding is
> "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point.
> For example in KOI-8 encoding 0xFF is officially assigned value.

First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 codepoint (I think that's the correct term).
It's not a Unicode character (though some Unicode characters are encoded as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC).
0xFF is indeed a valid Unicode character, but that doesn't mean that character is encoded as a byte with value 0xFF in UTF-8 (which char[]s represent). 0xFF is in fact one of the byte values that *cannot* occur in a valid UTF-8 text.
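
For what it's worth, the last point can be checked with a minimal sketch. It assumes a current D2 compiler with Phobos's std.utf (not the 2006 toolchain), and the helper name isValidUtf8 is made up for illustration:

```d
import std.utf : encode, validate, UTFException;

// Returns true if `s` is well-formed UTF-8 (hypothetical helper name).
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // U+00FF (y with diaeresis) is encoded as two UTF-8 bytes, C3 BF.
    char[4] buf;
    immutable n = encode(buf, cast(dchar) 0xFF);
    assert(n == 2 && buf[0] == 0xC3 && buf[1] == 0xBF);
    assert(isValidUtf8(buf[0 .. n]));

    // A lone FF byte is rejected as invalid UTF-8.
    char[] lone = [cast(char) 0xFF];
    assert(!isValidUtf8(lone));
}
```

So the Unicode character U+00FF and the byte value FF are two different things once the text is UTF-8.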
July 29, 2006
Andrew Fedoniouk wrote:
> Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
> 
> "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the
> Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character.
> This codepoint will remain forever unassigned, precisely so that it may be used
> for purposes such as this."
> 
> is just wrong.
> 
> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
> R-zone: {U+FFF0..U+FFFF} - region assigned already.

"the value FFFF is guaranteed not to be a Unicode character at all"
http://www.unicode.org/charts/PDF/UFFF0.pdf


> 2) For char[] selection of 0xFF is wrong and even worse.
> For example character with code 0xFF in Latin-I encoding is
> "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point.
> For example in KOI-8 encoding 0xFF is officially assigned value.

char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF.

"The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txt


> What is the point of current initializaton?

The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.

> If you are doing intialization already
> and this intialization is a part of specification so why not to use
> official "Nul" values in this case?

Because 0 is a valid UTF-8 character.


> You are doing the same for floats - you are using NaNs there
>  (Null value for floats). Why not to use the same for chars?

The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
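
A minimal sketch of the defaults being discussed, assuming a current D2 compiler (std.math.isNaN is the only library call used):

```d
import std.math : isNaN;

void main()
{
    // None of these has an explicit initializer.
    char c;
    wchar w;
    dchar d;
    float f;
    int i;

    assert(c == 0xFF);    // not a valid UTF-8 byte
    assert(w == 0xFFFF);  // a Unicode noncharacter
    assert(d == 0xFFFF);
    assert(isNaN(f));     // floats start out as NaN
    assert(i == 0);       // integers have no invalid value to use

    // A forgotten float initialization propagates instead of hiding.
    float sum = f + 1.0f;
    assert(isNaN(sum));
}
```

The int case is the odd one out: 0 is indistinguishable from legitimate data, which is exactly the masquerading problem described above.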
July 29, 2006
"Carlos Santander" <csantander619@gmail.com> wrote in message news:eagiip$1lad$3@digitaldaemon.com...
> Andrew Fedoniouk escribió:
>> 2) For char[] selection of 0xFF is wrong and even worse.
>> For example character with code 0xFF in Latin-I encoding is
>> "y diaeresis". In many European languages and Far East encodings 0xFF is
>> a valid code point.
>> For example in KOI-8 encoding 0xFF is officially assigned value.
>>
>
> But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.
>

UTF-8 is a multibyte transport encoding of the full 21-bit Unicode code point. Strictly speaking, a single byte in a UTF-8 sequence cannot be called a char[acter].

The type name char implies that a value of that type contains a complete code point (assuming that the code page is stored somewhere or is known at the point of use).

I mean that a "UTF-8 character" (if that makes any sense at all) as a type is always char[] and not a single char.

0xFF as the char initialization value implies that D's char is not supposed to handle single-byte character encodings at all. Was this the original intention?

Andrew Fedoniouk.
http://terrainformatica.com
July 29, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:eagk1o$1mph$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
>>
>> "codepoint U+FFFF is not a legitimate Unicode character, and,
>> furthermore, it is guaranteed by the
>> Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode
>> character.
>> This codepoint will remain forever unassigned, precisely so that it may
>> be used
>> for purposes such as this."
>>
>> is just wrong.
>>
>> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - region assigned already.
>
> "the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf
>
>
>> 2) For char[] selection of 0xFF is wrong and even worse.
>> For example character with code 0xFF in Latin-I encoding is
>> "y diaeresis". In many European languages and Far East encodings 0xFF is
>> a valid code point.
>> For example in KOI-8 encoding 0xFF is officially assigned value.
>
> char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF.
>
> "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txt
>
>
>> What is the point of current initializaton?
>
> The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
>
>> If you are doing intialization already
>> and this intialization is a part of specification so why not to use
>> official "Nul" values in this case?
>
> Because 0 is a valid UTF-8 character.

1) What does "UTF-8 character" mean exactly?
2) In ASCII, char(0) is officially NUL. Why not initialize strings
with null?

>
>
>> You are doing the same for floats - you are using NaNs there
>>  (Null value for floats). Why not to use the same for chars?
>
> The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.

I don't get it, sorry. In the KOI-8R (Russian) encoding 0xFF is the letter 'Ъ'. Are you saying that I cannot use char[] to represent Russian text in D?

Andrew Fedoniouk.
http://terrainformatica.com


July 29, 2006
Carlos Santander wrote:
> Hasan Aljudy escribió:
>>
>>
>> Still missing my point.
>> in C/C++ that's a problem because un-initialized variables carry garbage.
>> in D, it's not; if you init them to a reasonable valid default, this problem won't exist anymore.
>>
>> If un-initializing is bad just for its own sake .. then the compiler should detect it and issue an error/warning, otherwise it should default to a reasonable valid value; in this case, zero for chars and floats.
> 
> The issue here is, a "reasonable valid default" will change from one app to the other, one function to the next, one variable to another, so the intention here is force the developer to be explicit about his/her intentions.
> 
> Walter has said in the past that if there was a NAN for int/long/etc, he'd use that instead of 0.
> 

That's right. Also, given:

	int x;

	foo(x);

it is impossible for the maintenance programmer to distinguish between:

1) x is meant to be 0
2) the original programmer forgot to initialize x to 3, and there's a bug in the program

Ok, fine, so why doesn't the compiler just squawk about referencing uninitialized variables? Consider:

	int x;
	...
	if (...)
	{	x = 3;
		...
	}
	...
	if (...)
	{	...
		foo(x);
	}

There is no way for the compiler to determine that x in foo(x) is always initialized. So it must assume otherwise, and squawk about it. So how does our harried programmer fix it?

	int x = some-random-value;
	...
	if (...)
	{	x = 3;
		...
	}
	...
	if (...)
	{	...
		foo(x);
	}

The compiler is now happy, but pity the poor maintenance programmer. He notices the some-random-value, and wonders what that value means. He analyzes the code, and discovers that that value is never used. Was it intended to be used? Did some previous maintenance programmer break the code? What's going on here?

My take on programming languages is that the semantics should have the obvious meaning - i.e. if the programmer initializes a variable to a value, that value should have meaning. He should not have to initialize a variable because of some subtle *side effect* such initialization has.

Programmers should not be required to add dead assignments, unreachable code, etc., just to keep the compiler happy.
July 29, 2006
Andrew Fedoniouk wrote:
>>> What is the point of current initializaton?
>> The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
>>
>>> If you are doing intialization already
>>> and this intialization is a part of specification so why not to use
>>> official "Nul" values in this case?
>> Because 0 is a valid UTF-8 character.
> 
> 1) What "UTF-8 character" means exactly?

For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt
There isn't much to it.

> 2) In ASCII char(0) is officially NUL. Why not to initialize strings
> by null?

Because 0 characters are valid UTF-8 values. By using an invalid UTF-8 value, we can flush out bugs from uninitialized data.

> I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?'
> Are you saying that I cannot use char[] to represen russian text in D?

char[] is for UTF-8 encoded text only. For other encoding systems, use ubyte[]. But rest assured that Russian (and every other language) has a defined encoding in UTF-8, which is why it was selected for D.
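
As a rough sketch of that split, assuming a current D2 compiler: the legacy bytes live in a ubyte[] and are converted to UTF-8 before they go into a string. Latin-1 is used here only because its byte values map one-to-one onto code points U+0000 to U+00FF; KOI8-R would need a lookup table instead. The helper name latin1ToUtf8 is made up for illustration:

```d
// Converts ISO-8859-1 bytes to UTF-8 (hypothetical helper name).
string latin1ToUtf8(const(ubyte)[] raw)
{
    string result;
    foreach (b; raw)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}

void main()
{
    ubyte[] raw = [0x66, 0xFF];      // "f" plus y-diaeresis in Latin-1
    string text = latin1ToUtf8(raw);

    assert(text == "f\u00FF");
    assert(text.length == 3);        // 1 byte for 'f', 2 for U+00FF
}
```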
July 29, 2006
"Frits van Bommel" <fvbommel@REMwOVExCAPSs.nl> wrote in message news:eagjcd$1m1t$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> To Walter:
>>
>> Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
>>
>> "codepoint U+FFFF is not a legitimate Unicode character, and,
>> furthermore, it is guaranteed by the
>> Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode
>> character.
>> This codepoint will remain forever unassigned, precisely so that it may
>> be used
>> for purposes such as this."
>>
>> is just wrong.
>>
>> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - region assigned already.
>
> Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it
> forms the subrange of the "Noncharacters" (see
> http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are
> "intended for process internal uses, but are not permitted for
> interchange". 0xFFFF specifically is marked "<not a character> - the value
> FFFF if guaranteed not to be a Unicode character at all".
> So yes, it's assigned - for exactly such a purpose as D is using it for
> :).
>
>> 2) For char[] selection of 0xFF is wrong and even worse.
>> For example character with code 0xFF in Latin-I encoding is
>> "y diaeresis". In many European languages and Far East encodings 0xFF is
>> a valid code point.
>> For example in KOI-8 encoding 0xFF is officially assigned value.
>
> First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 codepoint (I think that's the correct term).

Sorry, but this is wrong. "UTF-8 codepoint" is nonsense.

In common practice a code point is: (1) a numerical index (or position)
in an encoding table used for encoding characters;
(2) a synonym for Unicode scalar value.

As a rule, one code point is represented by a single glyph when presented to a human.


> It's not a Unicode character (though some Unicode characters are encoded as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC). 0xFF is indeed a valid Unicode character, but that doesn't mean that character is encoded as a byte with value 0xFF in UTF-8 (which char[]s represent). 0xFF is in fact one of the byte values that *cannot* occur in a valid UTF-8 text.

Sorry, but an element of a UTF-8 encoded sequence is a byte (octet) and
not a char. char as a type historically means a type for storing
character code points. 0xFF is an assigned and legal value in many encodings.

Either use a different name for this "D char", let's say utf8byte, or use char in the sense of "code point value" and thus initialize it with the NUL value common to all known encodings.

Andrew Fedoniouk.
http://terrainformatica.com


July 29, 2006
Andrew,

I think it will make a lot more sense if you keep these things in mind... (I'm sure you already know all of them, I'm just listing them out since they're crucial and must be thought of together):

1. char, wchar, and dchar are separate types.

2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, or any other encoding.  It must contain UTF-8.

3. wchar contains UTF-16.  It is similar to char in every other way (may not contain any other encoding than UTF-16, not even UCS-2.)

4. dchar contains UTF-32 code points.  It may not contain any other sort of encoding, again.

5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use ubyte/byte or some other method.  It is not valid to use char.

6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 string.  Since char can only contain UTF-8 strings, it represents invalid data if it contains such an 8-bit octet.

7. Code points are the characters in Unicode; they are "compressed", so to speak, in encodings such as UTF-8 and UTF-16.  UCS-2 and UCS-4 (UTF-32) contain full code points.

8. If you were to examine the bytes in a wchar string, it may be possible that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char cannot be used for UTF-16, this doesn't matter.

9. For the above reason, wchar (UTF-16) uses FFFF.  This character is similar to FF for UTF-8.
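
The difference between the three types can be sketched as follows, assuming a current D2 compiler and a UTF-8 source file; countCodePoints is a made-up helper name:

```d
// Counts code points by letting foreach decode the UTF-8 on the fly.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string  s8  = "привет";   // UTF-8:  2 code units per Cyrillic letter
    wstring s16 = "привет"w;  // UTF-16: 1 code unit per letter here
    dstring s32 = "привет"d;  // UTF-32: 1 code unit per code point

    assert(s8.length == 12);
    assert(s16.length == 6);
    assert(s32.length == 6);
    assert(countCodePoints(s8) == 6);
}
```

The .length of each array counts code units of that encoding, not characters; iterating with a dchar loop variable is one way to get at the code points regardless of which of the three types holds the text.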

Given the above, I think I might answer your questions:

1. UTF-8 character here could mean an 8-bit octet of code point.  In this case, they are both the same and represent a perfectly valid character in a string.

2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 0 to 127 correspond to the same code points in Unicode, and the same characters in UTF-8.

3. It does not matter; KOI-8R encoded strings should not be placed in char arrays.  You should use UTF-8 or another encoding for your Russian text.

4. If you wish to use KOI-8R (or any other encoding not based on Unicode) you should not be using char arrays, which are meant for Unicode-related encodings only.

Obviously this is by far different from C, but that's the good thing about D in many ways ;).

Thanks,
-[Unknown]



> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eagk1o$1mph$1@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
>>>
>>> "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the
>>> Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character.
>>> This codepoint will remain forever unassigned, precisely so that it may be used
>>> for purposes such as this."
>>>
>>> is just wrong.
>>>
>>> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
>>> R-zone: {U+FFF0..U+FFFF} - region assigned already.
>> "the value FFFF is guaranteed not to be a Unicode character at all"
>> http://www.unicode.org/charts/PDF/UFFF0.pdf
>>
>>
>>> 2) For char[] selection of 0xFF is wrong and even worse.
>>> For example character with code 0xFF in Latin-I encoding is
>>> "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point.
>>> For example in KOI-8 encoding 0xFF is officially assigned value.
>> char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF.
>>
>> "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txt
>>
>>
>>> What is the point of current initializaton?
>> The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
>>
>>> If you are doing intialization already
>>> and this intialization is a part of specification so why not to use
>>> official "Nul" values in this case?
>> Because 0 is a valid UTF-8 character.
> 
> 1) What "UTF-8 character" means exactly?
> 2) In ASCII char(0) is officially NUL. Why not to initialize strings
> by null?
> 
>>
>>> You are doing the same for floats - you are using NaNs there
>>>  (Null value for floats). Why not to use the same for chars?
>> The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
> 
> I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?'
> Are you saying that I cannot use char[] to represen russian text in D?
> 
> Andrew Fedoniouk.
> http://terrainformatica.com
> 
> 
July 29, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:eagmrk$1pn9$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>>>> What is the point of current initializaton?
>>> The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
>>>
>>>> If you are doing intialization already
>>>> and this intialization is a part of specification so why not to use
>>>> official "Nul" values in this case?
>>> Because 0 is a valid UTF-8 character.
>>
>> 1) What "UTF-8 character" means exactly?
>
> For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt There isn't much to it.

Sorry, but I understand what a UCS character means.
What exactly is the "UTF-8 character" you are referring to?

Is it 1) a single octet in a UTF-8 sequence, or
2) a sequence of octets representing one Unicode character (a 21-bit value)?


>
>> 2) In ASCII char(0) is officially NUL. Why not to initialize strings
>> by null?
>
> Because 0 characters are valid UTF-8 values. By using an invalid UTF-8 value, we can flush out bugs from uninitialized data.

Oh....

0 as the value of a UTF-8 octet can represent only one character: the one with code point 0x00000000.

In plain English: UTF-8 encoded strings cannot contain zeros in the middle.


>
>> I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?' Are you saying that I cannot use char[] to represen russian text in D?
>
> char[] is for UTF-8 encoded text only. For other encoding systems, use ubyte[]. But rest assured that Russian (and every other language) has a defined encoding in UTF-8, which is why it was selected for D.

Sorry, but char[acter] in plain English means character: the index of some
human-readable glyph in some table like ASCII, KOI-8,
MAC-ASCII, whatever.

An element of a UTF-8 sequence is an octet.  I think you should rename the 'char' type to 'octet' if D/Phobos is intended to support only UTF-8.

Andrew.
July 29, 2006
"Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:eagn4d$1q1t$1@digitaldaemon.com...
> Andrew,
>
> I think it will make a lot more sense if you keep these things in mind... (I'm sure you already know all of them, I'm just listing them out since they're crucial and must be thought of together):
>
> 1. char, wchar, and dchar are separate types.

No objections with this.

>
> 2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, or any other encoding.  It must contain UTF-8.

Sorry, but the plural form "char contains UTF-8 bytes" is wrong.

What do you think char means:
1) char is an octet (byte), a member of a UTF-8 sequence -or-
2) char is the code point of some character in some character table.

?

Probably I am treating English too literally, but
a char(acter) is not a UTF-8 byte.  And never was.

char is an index of some glyph in some encoding table.
This is the common definition used everywhere.

>
> 3. wchar contains UTF-16.  It is similar to char in every other way (may not contain any other encoding than UTF-16, not even UCS-2.)

The same problem as in #2.

What is wchar (uint16) for you:
1) wchar is an index of a Unicode scalar value in the Basic Multilingual
Plane (BMP)
-or-
2) a uint16 value, a member of a UTF-16 sequence.

?

>
> 4. dchar contains UTF-32 code points.  It may not contain any other sort of encoding, again.

Oh.....

UTF-32 (like any other UTF) is a transformation format, the group name of two different encodings, UTF-32BE and UTF-32LE.

"UTF-32 code point" is nonsense.

UTF-32 defines how to encode a Unicode code point as, again, a sequence of four bytes (octets).

I would define this thing as

dchar (a better name is uchar) is a type for representing
the full set of Unicode code points (21-bit values).

Please note: a "transformation format" (UTF) is not by
any means a "manipulation format".

The in-memory representation of text suitable for
manipulation (e.g. text processing) is, as a rule, different.

You cannot use UTF-8 encoded Russian text for
analysis. No way.

>
> 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use ubyte/byte or some other method.  It is not valid to use char.

Vice versa. For UTF-8 encoded strings you should use byte[],
and for strings using single-byte encodings you should use char.

>
> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 string.  Since char can only contain UTF-8 strings, it represents invalid data if it contains such an 8-bit octet.

No objections to that: for UTF-8 octet sequences, 0xFF is an invalid
value for an octet in the sequence. But please note: in a sequence of octets.

>
> 7. Code points are the characters in Unicode; they are "compressed", so to speak, in encodings such as UTF-8 and UTF-16.  USC-2 and USC-4 (UTF-32) contain full code points.

Sorry, but UCS-4 *is not* UTF-32 http://www.unicode.org/reports/tr19/tr19-9.html

I will ask again:

What does
char c = 'a';
mean to you?

And the following in C/C++:

#pragma(encoding,"KOI-8R")

char c = '?';

?


>
> 8. If you were to examine the bytes in a wchar string, it may be possible that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char cannot be used for UTF-16, this doesn't matter.

It is not clear what you mean here. Could you clarify? Especially the last statement.

>
> 9. For the above reason, wchar (UTF-16) uses FFFF.  This character is similar to FF for UTF-8.
>
> Given the above, I think I might answer your questions:
>
> 1. UTF-8 character here could mean an 8-bit octet of code point.  In this case, they are both the same and represent a perfectly valid character in a string.

Sorry I am not buying following:
"UTF-8 character" and "8-bit octet of code point"

>
> 2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 0 to 127 correspond to the same code points in Unicode, and the same characters in UTF-8.

"ASCII does not matter"... for whom?

>
> 3. It does not matter; KOI-8R encoded strings should not be placed in char arrays.  You should use UTF-8 or another encoding for your Russian text.

"You should use UTF-8 or another encoding for your Russian text."

Thanks.

Advice from my side:
Let me know when you will visit Russia.
I will ask representatives of russian developer community and web authors
to meet you.

Advice per se: You should wear a helmet.

>
> 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) you should not be using char arrays, which are meant for Unicode-related encodings only.

The same advice as above.

>
> Obviously this is by far different from C, but that's the good thing about D in many ways ;).

In Israel they have an old saying:
"Not a human for Saturday but Saturday for human".

I do have practical experience in writing text-processing software in encodings other than "US-ASCII", and I have heard your advice about UTF-8 usage with interest.

Please don't take any of this personally; no intention to harm anybody. Honestly and with a smile :)

Andrew.