July 30, 2006
Andrew Fedoniouk wrote:
> "Carlos Santander" <csantander619@gmail.com> wrote in message news:eagiip$1lad$3@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> 2) For char[] selection of 0xFF is wrong and even worse.
>>> For example character with code 0xFF in Latin-I encoding is
>>> "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point.
>>> For example in KOI-8 encoding 0xFF is officially assigned value.
>>>
>> But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.
>>
> 
> UTF-8 is a multibyte transport encoding of full 21-bit UNICODE codepoint.
> Strictly speaking single byte in UTF-8 sequence cannot be named as char[acter]
> 
> char as typename implies that value of its type contains some complete
> codepoint (assumed that information about codepage is stored somewhere
> or is known at the point of use)
> 
> I mean that "UTF-8 character" (if it makes any sense at all) as type
> is always char[] and not a single char.
> 
> 0xFF as a char initialization value implies that D char is not supposed
> to handle single byte character encodings at all. Is this an original intention?
> 
> Andrew Fedoniouk.
> http://terrainformatica.com
> 

My bad, then. I should've said char[] instead of char. Frits and Walter wrote better responses, anyway, so I'll leave this as is.

-- 
Carlos Santander Bernal
July 30, 2006
Andrew Fedoniouk wrote:
> Element of UTF-8 sequence is an octet.  I think you should rename
> 'char' type to 'octet' if D/Phobos intended to support only UTF-8.

This was all hashed out years ago. It's too late to start renaming basic types.
July 30, 2006
Andrew Fedoniouk wrote:
> I will ask again:
> 
> What:
> char c = 'a';
> means for you?
> And following in C/C++:
> 
> #pragma(encoding,"KOI-8R")
> 
> char c = '?';
> 
> ?

Pragmas are implementation defined behavior in C and C++, meaning they are unportable and rather useless. Not only that, char's themselves are implementation defined, and so it is very difficult to write portable code that deals with anything other than a-zA-Z0-9 and a few other characters.

In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.
July 30, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:eagut9$2l96$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> I will ask again:
>>
>> What:
>> char c = 'a';
>> means for you?
>> And following in C/C++:
>>
>> #pragma(encoding,"KOI-8R")
>>
>> char c = '?';
>>
>> ?
>
> Pragmas are implementation defined behavior in C and C++, meaning they are unportable and rather useless. Not only that, char's themselves are implementation defined, and so it is very difficult to write portable code that deals with anything other than a-zA-Z0-9 and a few other characters.
>
> In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.

What does it mean "UTF-8 ... supports ...every human language" ?

It allows to encode - yes.

But in runtime support means quite different thing
and I am pretty sure you know what I mean here.

In Java as we know UTF-8 is used for representing
string literals inside .class files but being loaded they
became vectors of Java chars - unicode BMP codepoints
(ushort). And this serves almost all character cases.
Exceptions like: it is not trivial to do effectively
processing of single byte encoded things there - you need
to rewrite the whole set of functions to handle this.

Please don't think that UTF-8 is a panacea.

For example in China they use GB2312 encoding
to represent almost 7000 Chinese characters in active use now.
This is strictly 2 bytes encoding and
don't even try to ask them to switch to UTF-8
(3 bytes as a rule). This will increase their internet
traffic by 1/3.

Same apply to Europe. E.g. in Russia
there are 32 characters in alphabet and it is
just enough to have one byte encoding for
English/Russian text. It makes no sense
to send over the wire two bytes (russian in utf-8)
instead of one for the sites like lib.ru.

Sorry but guys are paying there for each byte
downloaded from Internet. This apply
to almost all countries except of US and Canada.
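
The size comparison above can be checked directly. The following is a minimal sketch, assuming a current D2 compiler with Phobos and a source file saved as UTF-8; the helper name `codePoints` is hypothetical and only illustrates the point.

```d
import std.range : walkLength;
import std.stdio : writeln;

// Hypothetical helper: number of Unicode code points in a UTF-8 string.
size_t codePoints(string text)
{
    // walkLength decodes the string, so it counts code points, not bytes.
    return text.walkLength;
}

void main()
{
    string russian = "привет"; // 6 Cyrillic letters
    string chinese = "中文";    // 2 CJK ideographs

    // .length on a string counts UTF-8 code units (bytes).
    writeln(russian.length);      // 12: two bytes per Cyrillic letter
    writeln(codePoints(russian)); // 6
    writeln(chinese.length);      // 6: three bytes per ideograph
    writeln(codePoints(chinese)); // 2
}
```

A single-byte encoding such as KOI8-R would store the Russian word in 6 bytes, and GB2312 would store the Chinese text in 4 bytes, which is the overhead being described.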

Andrew Fedoniouk.
http://terrainformatica.com


July 30, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:eagufo$2knt$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> Element of UTF-8 sequence is an octet.  I think you should rename 'char' type to 'octet' if D/Phobos intended to support only UTF-8.
>
> This was all hashed out years ago. It's too late to start renaming basic types.

I am not asking to rename anything.

Could you please just remove this weird 0xFF initialization for char arrays? ( as it was prior to .162 build )

This is the whole point. If you will do this
then current char type can be used for
representation of single byte encodings as it stands -
character.

Andrew Fedoniouk.
http://terrainformatica.com


July 30, 2006
2. Sorry, an array of char (a single char is one single 8 bit octet) contains UTF-8 bytes which are 8-bit octets.

A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. Thus, one char MAY NOT hold every single Unicode code point.  You may need an array of multiple chars (bytes) to hold a single code point.

This is not what it means to me; this is what it means.  A char is a single 8-bit octet in a UTF-8 sequence.  They ARE NOT by any means code points.

I'm sorry that I did not specify "array", but I fear you are being pedantic here; I'm sure you knew what I meant.

A char is a single byte in a UTF-8 sequence.  I'm afraid I think calling it an index to a glyph is dangerous, because it could be mistaken. Again, a single char CANNOT represent code points above and including 128 because it is only ONE byte.

A single char therefore may not represent a glyph all of the time, but rather will represent a byte in the sequence of UTF-8 which may be used to decode (along with other necessary bytes) the entirety of the code point.

I hope I'm not being overly pedantic here, but I think your definition is either lax or wrong.  But, that is only by its reading in English.
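
To make the distinction between a char (one code unit) and a code point concrete, here is a minimal sketch, assuming a current D2 compiler with Phobos and a UTF-8 source file. The helper `firstCodePoint` is a hypothetical name; `std.utf.decode` is the Phobos function it wraps.

```d
import std.stdio : writefln;
import std.utf : decode;

// Hypothetical helper: decode the first code point of a UTF-8 sequence
// and report how many chars (code units) it occupied.
dchar firstCodePoint(const(char)[] text, out size_t unitsUsed)
{
    size_t index = 0;
    dchar cp = decode(text, index); // advances index past one code point
    unitsUsed = index;
    return cp;
}

void main()
{
    // A lone char cannot hold this character; the assignment does not compile.
    static assert(!__traits(compiles, { char c = '蝿'; }));

    string s = "蝿";
    // Iterating as char visits each UTF-8 byte of the sequence.
    foreach (char unit; s)
        writefln("0x%02X", cast(ubyte) unit);

    size_t used;
    dchar cp = firstCodePoint(s, used);
    assert(used == 3);  // three chars were needed for one code point
    assert(cp > 0x7F);  // and the code point lies outside the ASCII range
}
```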

3. It is #2, as above.  wchars are not UCS-2.  They cannot always represent full code points alone.  Arrays of wchars must be used for some code points.  As I read your question, #1 is UCS-2 (fixed length 16-bit encoding) and #2 is UTF-16 (dynamic length, 16-bit baseline encoding.)

4. I was ignoring endianness issues for simplicity.  My point here is that a UTF-32 character directly represents a code point.  Sorry again for the non-pedantic laxness in my wording.
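
Points 3 and 4 can be illustrated with a code point outside the Basic Multilingual Plane. This is a minimal sketch assuming a current D2 compiler with Phobos; `UnitCounts` and `countUnits` are hypothetical names introduced for the example.

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical record of how many code units one text needs per encoding.
struct UnitCounts
{
    size_t utf8;  // number of char
    size_t utf16; // number of wchar
    size_t utf32; // number of dchar
}

UnitCounts countUnits(dstring text)
{
    // std.conv.to transcodes between D's three string types.
    return UnitCounts(text.to!string.length,
                      text.to!wstring.length,
                      text.length);
}

void main()
{
    // U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in 16 bits.
    dstring clef = "\U0001D11E"d;
    wstring w = clef.to!wstring;

    // One dchar, but a surrogate pair of two wchars.
    assert(w.length == 2);
    assert(w[0] == 0xD834 && w[1] == 0xDD1E);

    writeln(countUnits(clef));
    writeln(countUnits("a"d));
}
```

So a wchar[] behaves like UCS-2 only while the text stays inside the BMP, whereas each dchar always holds a whole code point.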

5. Wrong.  There is no vice versa.  You may use byte or ubyte arrays for your UTF-8 encoded strings and so forth.

In case you didn't realize I was trying to say this:

*char is not for single byte encodings.  char is ONLY for UTF-8.  char may not be used for any other encoding unless you wish to have problems.  char is not the same as in other languages, e.g. C.*

If you wish for a 8-bit octet value (such as a character in any encoding; single byte or otherwise) you should not be using a char. That is not a correct usage for them, that is what byte and ubyte are for.

It is expected that chars in an array will follow a specific sequence; that is, that they will be encoded in UTF-8.  It is not possible to guarantee this if you use other encodings, which is why writefln() will fail in such cases.

6.  Correct.  And a single char (8-bit octet in a sequence of UTF-8 octets encoded such) may never be FF because no single 8-bit octet anywhere in a valid UTF-8 sequence may be FF.  Remember, char is not a code point.  It is a single 8-bit octet in a sequence.
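
A minimal sketch of the claim in point 6, assuming a current D2 compiler with Phobos: `std.utf.validate` throws a `UTFException` on malformed input, and the wrapper `isValidUtf8` is a hypothetical helper built on it.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

// Hypothetical helper: true if the char sequence is well-formed UTF-8.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    char[] good = ['a', 'b', 'c'];
    char[] bad = ['a', '\xFF', 'b']; // 0xFF never occurs in valid UTF-8

    writeln(isValidUtf8(good)); // true
    writeln(isValidUtf8(bad));  // false
}
```

Because 0xFF can never be part of a valid sequence, an uninitialized char that reaches a UTF-8 consumer is detected rather than silently treated as text.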

7. My mistake.  I always consider them roughly the same (and for some reason I thought that they had been made the same; but I assume your link is current.)

Your first code sample defines a single UTF-8 character, 'a'.  It is lucky you did not try:

char c = '蝿';

(hopefully this character gets sent through to you properly; I will be sending this message UTF-8 if my client allows it.)

Because that would have failed.  A char cannot hold such a character, which has a code point outside the range 0 - 127.  You would either need to use an array of chars, or etc.

Your second example means nothing to me.  I don't really care for such pragmas or putting untranslated text directly in source code, and have never dealt with it.

8. You may not use a single char or an array of chars to represent UTF-16.  It may only represent UTF-8.  If you wish to use UTF-16, you must use wchars.

1 (the second #1): but for the code point 0, as encoded in UTF-8, they are the same - do you not agree?  A 0 is a zero is a zero.  It doesn't matter what he means.

2 (the second): rules about ASCII do not apply to char.  Just as rules in Portugal do not dissuade me here in Los Angeles.

3 (the second): I have led the development of multilingual software which was used by quite a large number of people.  I also helped coordinate, and later interface with the assigned coordinator of translation.  This software was translated into Thai, Chinese (simplified and traditional), Russian, Italian, Spanish, Japanese, Catalan, and several other languages.  More than twenty anyway.

At first I was suggesting that everyone use their own encoding and handling that (sometimes painfully) in the code.  I would sometimes get comments about using Unicode instead (from the translators who would have preferred this.)  This software now uses UTF-8 and remains translated in these languages.

So, while I have not been to Russia (although I have worked with numerous Russian developers, consumers, and translators) I would tend to disagree with your assertion.  Also I do not like helmets.

Obviously, I mean nothing to be taken personally as well; we are only talking about UTF-8, Unicode, its usage in D, and being pedantic ;). And helmets, we touched that subject too.  But not about each other, really.

Thanks,
-[Unknown]


> "Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:eagn4d$1q1t$1@digitaldaemon.com...
>> Andrew,
>>
>> I think it will make a lot more sense if you keep these things in mind... (I'm sure you already know all of them, I'm just listing them out since they're crucial and must be thought of together):
>>
>> 1. char, wchar, and dchar are separate types.
> 
> No objections with this.
> 
>> 2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, or any other encoding.  It must contain UTF-8.
> 
> Sorry but plural form "char contains UTF-8 bytes" is wrong.
> 
> What you think char means:
> 1) char is an octet (byte) - member of utf-8 sequence -or-
> 2) char is code point of some character in some character table.
> 
> ?
> 
> Probably I am treating English too literally but
> char(acter) is not an UTF-8 byte.  And never was.
> 
> char is an index of some glyph in some encoding table.
> This is common definition used everywhere.
> 
>> 3. wchar contains UTF-16.  It is similar to char in every other way (may not contain any other encoding than UTF-16, not even UCS-2.)
> 
> The same problem as in #2.
> 
> What is wchar (uint16) for you:
> 1) wchar as is an index of a Unicode scalar value in Basic Multilingual Plane (BMP)
> -or-
> 2) is a uint16 value - member of UTF-16 sequence.
> 
> ?
> 
>> 4. dchar contains UTF-32 code points.  It may not contain any other sort of encoding, again.
> 
> Oh.....
> 
> UTF-32 (as any other utfs) is a transformation format -
> group name of two different encodings UTF-32BE and UTF-32LE.
> 
> UTF-32 code point is a non-sense.
> 
> UTF-32 defines of how to encode Unicode code point  in
> again sequence of four bytes - octets.
> 
> I would define this thing as
> 
> dchar ( better name is uchar ) is type for representing
> full set of Unicode Code Points (21bit value).
> 
> Please note: "transformation format" (UTF) is not by
> any means a "manipulation format".
> 
> Representation of text in memory suitable for
> manipulation (e.g. text processing) is different as rule.
> 
> You cannot use utf-8 encoded russian text for
> analysis. No way.
> 
>> 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use ubyte/byte or some other method.  It is not valid to use char.
> 
> Vice versa. For utf-8 encoded strings you should use byte[]
> and for strings using single byte encodings you should use char.
> 
>> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 string.  Since char can only contain UTF-8 strings, it represents invalid data if it contains such an 8-bit octet.
> 
> No objections with that, for UTF-8 octet sequences 0xFF is invalid
> value of octet in the sequence. But please note: in the sequence of octets.
> 
>> 7. Code points are the characters in Unicode; they are "compressed", so to speak, in encodings such as UTF-8 and UTF-16.  UCS-2 and UCS-4 (UTF-32) contain full code points.
> 
> Sorry, but UCS-4 *is not* UTF-32
> http://www.unicode.org/reports/tr19/tr19-9.html
> 
> I will ask again:
> 
> What:
> char c = 'a';
> means for you?
> 
> And following in C/C++:
> 
> #pragma(encoding,"KOI-8R")
> 
> char c = '?';
> 
> ?
> 
> 
>> 8. If you were to examine the bytes in a wchar string, it may be possible that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char cannot be used for UTF-16, this doesn't matter.
> 
> Not clear what you mean here. Could you clarify? Especially last statement.
> 
>> 9. For the above reason, wchar (UTF-16) uses FFFF.  This character is similar to FF for UTF-8.
>>
>> Given the above, I think I might answer your questions:
>>
>> 1. UTF-8 character here could mean an 8-bit octet of code point.  In this case, they are both the same and represent a perfectly valid character in a string.
> 
> Sorry I am not buying following:
> "UTF-8 character" and "8-bit octet of code point"
> 
>> 2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 0 to 127 correspond to the same code points in Unicode, and the same characters in UTF-8.
> 
> "ASCII does not matter"... for whom?
> 
>> 3. It does not matter; KOI-8R encoded strings should not be placed in char arrays.  You should use UTF-8 or another encoding for your Russian text.
> 
> "You should use UTF-8 or another encoding for your Russian text."
> 
> Thanks.
> 
> Advice from my side:
> Let me know when you will visit Russia.
> I will ask representatives of russian developer community and web authors
> to meet you.
> 
> Advice per se: You should wear a helmet.
> 
>> 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) you should not be using char arrays, which are meant for Unicode-related encodings only.
> 
> The same advice as above.
> 
>> Obviously this is by far different from C, but that's the good thing about D in many ways ;).
> 
> In Israel they have an old saying:
> "Not a human for Saturday but Saturday for human".
> 
> I do have practical experience in writing text processing software in
> encodings other than "US-ASCII" and have heard your advices about
> UTF-8 usage with interest.
> 
> Please don't take all of this personal - no intention to harm anybody.
> Honestly and with smile :)
> 
> Andrew.
> 
> 
July 30, 2006
Andrew Fedoniouk wrote:
>> In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.
> 
> What does it mean "UTF-8 ... supports ...every human language" ?
> 
> It allows to encode - yes.

We both know what UTF-8 is and does.

> But in runtime support means quite different thing
> and I am pretty sure you know what I mean here.

I'm sure there are bugs in the library UTF-8 support. But they are bugs, are fixable, and not fundamental problems. As you find any, please post them to bugzilla.


> In Java as we know UTF-8 is used for representing
> string literals inside .class files but being loaded they
> became vectors of Java chars - unicode BMP codepoints
> (ushort). And this serves almost all character cases.
> Exceptions like: it is not trivial to do effectively
> processing of single byte encoded things there - you need
> to rewrite the whole set of functions to handle this.
> 
> Please don't think that UTF-8 is a panacea.

I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.


> For example in China they use GB2312 encoding
> to represent almost 7000 Chinese characters in active use now.
> This is strictly 2 bytes encoding and
> don't even try to ask them to switch to UTF-8
> (3 bytes as a rule). This will increase their internet
> traffic by 1/3.
> 
> Same apply to Europe. E.g. in Russia
> there are 32 characters in alphabet and it is
> just enough to have one byte encoding for
> English/Russian text. It makes no sense
> to send over the wire two bytes (russian in utf-8)
> instead of one for the sites like lib.ru.
> 
> Sorry but guys are paying there for each byte
> downloaded from Internet. This apply
> to almost all countries except of US and Canada.

If one needs to use a custom encoding, use ubyte[] or ushort[]. If one needs to be universal, use char[], wchar[], or dchar[]. And for what it's worth, D isn't a web transmission protocol. I don't see any problem with a D program converting its input from Format X to UTF for internal processing, and then converting its output back to X or Y or Z.
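
The convert-on-input approach can be sketched as follows, assuming a current D2 compiler with Phobos. Latin-1 is used because its byte values coincide with the first 256 Unicode code points; an encoding such as KOI8-R would need a lookup table instead. The function name `latin1ToUtf8` is hypothetical.

```d
import std.stdio : writeln;
import std.utf : encode;

// Hypothetical helper: transcode Latin-1 bytes (held in ubyte[]) to UTF-8.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        char[4] buf;
        // Each Latin-1 byte value is also its Unicode code point.
        size_t n = encode(buf, cast(dchar) b);
        result ~= buf[0 .. n];
    }
    return result.idup;
}

void main()
{
    // 0xFF is "y with diaeresis" in Latin-1, a perfectly valid character there.
    ubyte[] raw = [0x41, 0xFF];
    string text = latin1ToUtf8(raw);

    writeln(text.length); // 3: one byte for 'A', two for U+00FF
    assert(text == "A\u00FF");
}
```

The raw bytes stay in ubyte[] until they are transcoded, so 0xFF never has to be stored in a char.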
July 30, 2006
But even prior, this:

char c;
writefln(cast(size_t) c);

Would have given you 255, not 0.  This has been true for quite some time.  The fact that it did not happen for arrays in the same way was, as far as I know, a bug.  Actually, I didn't even realize that got fixed.
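
For reference, a minimal sketch of the default initializers as they behave in a current D2 compiler (an assumption; the snippet above is in the D1 style of the time, and in D2 `writefln` takes a format string first):

```d
import std.stdio : writefln;

void main()
{
    char c;   // default-initialized, no explicit value
    wchar w;
    dchar d;

    // Each character type starts out as a value that is invalid in its encoding.
    assert(c == 0xFF);
    assert(w == 0xFFFF);
    assert(d == 0xFFFF);

    // Array elements follow the same rule.
    auto arr = new char[4];
    foreach (unit; arr)
        assert(unit == char.init);

    writefln("%d", cast(size_t) c); // prints 255
}
```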

-[Unknown]


> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eagufo$2knt$1@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> Element of UTF-8 sequence is an octet.  I think you should rename
>>> 'char' type to 'octet' if D/Phobos intended to support only UTF-8.
>> This was all hashed out years ago. It's too late to start renaming basic types.
> 
> I am not asking to rename anything.
> 
> Could you please just remove this weird 0xFF initialization
> for char arrays? ( as it was prior to .162 build )
> 
> This is the whole point. If you will do this
> then current char type can be used for
> representation of single byte encodings as it stands -
> character.
> 
> Andrew Fedoniouk.
> http://terrainformatica.com
> 
> 
July 30, 2006
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eagufo$2knt$1@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> Element of UTF-8 sequence is an octet.  I think you should rename
>>> 'char' type to 'octet' if D/Phobos intended to support only UTF-8.
>> This was all hashed out years ago. It's too late to start renaming basic types.
> I am not asking to rename anything.

Ok, but you did say "I think you should rename..." <g>

> Could you please just remove this weird 0xFF initialization
> for char arrays? ( as it was prior to .162 build )

char's have been initialized to 0xFF for years now, it was a bug that some array initializations didn't do it.

> This is the whole point. If you will do this
> then current char type can be used for
> representation of single byte encodings as it stands -
> character.

? I don't understand what's standing in the way of that now. And values from 0..7F are single byte UTF-8 encodings and can be stored in a char.

BTW, you can do this:

typedef char mychar = 0;

mychar[] a = new mychar[100];	// mychar[] will be initialized to 0
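
The `typedef` keyword shown above has since been removed from the language in D2. As a rough sketch of the same idea for a current compiler (an assumption, not what was available at the time), one can either wrap the char in a struct with its own initializer or fill the array explicitly. The name `MyChar` is hypothetical.

```d
// Hypothetical wrapper: a char-like type whose default value is 0, not 0xFF.
struct MyChar
{
    char value = 0;
    alias value this; // lets MyChar be used where a char is expected
}

void main()
{
    // Every element starts as MyChar.init, i.e. value == 0.
    auto a = new MyChar[100];
    assert(a[0].value == 0 && a[99].value == 0);

    // Alternative: keep plain char[] and overwrite the 0xFF default.
    auto b = new char[100];
    b[] = '\0';
    assert(b[0] == 0 && b[99] == 0);
}
```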
July 30, 2006
"Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:eah49h$2pi8$1@digitaldaemon.com...
> 2. Sorry, an array of char (a single char is one single 8 bit octet) contains UTF-8 bytes which are 8-bit octets.
>
> A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. Thus, one char MAY NOT hold every single Unicode code point.  You may need an array of multiple chars (bytes) to hold a single code point.
>
> This is not what it means to me; this is what it means.  A char is a single 8-bit octet in a UTF-8 sequence.  They ARE NOT by any means code points.
>
> I'm sorry that I did not specify "array", but I fear you are being pedantic here; I'm sure you knew what I meant.
>
> A char is a single byte in a UTF-8 sequence.  I'm afraid I think calling it an index to a glyph is dangerous, because it could be mistaken. Again, a single char CANNOT represent code points above and including 128 because it is only ONE byte.
>
> A single char therefore may not represent a glyph all of the time, but rather will represent a byte in the sequence of UTF-8 which may be used to decode (along with other necessary bytes) the entirety of the code point.
>
> I hope I'm not being overly pedantic here, but I think your definition is either lax or wrong.  But, that is only by its reading in English.

"your definition is either lax or wrong"

Which one?

>
> 3. It is #2, as above.  wchars are not UCS-2.  They cannot always represent full code points alone.  Arrays of wchars must be used for some code points.  As I read your question, #1 is UCS-2 (fixed length 16-bit encoding) and #2 is UTF-16 (dynamic length, 16-bit baseline encoding.)
>
> 4. I was ignoring endianness issues for simplicity.  My point here is that a UTF-32 character directly represents a code point.  Sorry again for the non-pedantic laxness in my wording.

>
> 5. Wrong.  There is no vice versa.  You may use byte or ubyte arrays for your UTF-8 encoded strings and so forth.
>
> In case you didn't realize I was trying to say this:
>
> *char is not for single byte encodings.  char is ONLY for UTF-8.  char may not be used for any other encoding unless you wish to have problems. char is not the same as in other languages, e.g. C.*
>
> If you wish for a 8-bit octet value (such as a character in any encoding; single byte or otherwise) you should not be using a char. That is not a correct usage for them, that is what byte and ubyte are for.
>
> It is expected that chars in an array will follow a specific sequence; that is, that they will be encoded in UTF-8.  It is not possible to guarantee this if you use other encodings, which is why writefln() will fail in such cases.
>
> 6.  Correct.  And a single char (8-bit octet in a sequence of UTF-8 octets encoded such) may never be FF because no single 8-bit octet anywhere in a valid UTF-8 sequence may be FF.  Remember, char is not a code point.  It is a single 8-bit octet in a sequence.
>
> 7. My mistake.  I always consider them roughly the same (and for some reason I thought that they had been made the same; but I assume your link is current.)
>
> Your first code sample defines a single UTF-8 character, 'a'.  It is lucky you did not try:
>
> char c = '?';
>
> (hopefully this character gets sent through to you properly; I will be sending this message UTF-8 if my client allows it.)
>
> Because that would have failed.  A char cannot hold such a character, which has a code point outside the range 0 - 127.  You would either need to use an array of chars, or etc.
>
> Your second example means nothing to me.  I don't really care for such pragmas or putting untranslated text directly in source code, and have never dealt with it.
>
> 8. You may not use a single char or an array of chars to represent UTF-16. It may only represent UTF-8.  If you wish to use UTF-16, you must use wchars.
>
> 1 (the second #1): but for the code point 0, as encoded in UTF-8, they are the same - do you not agree?  A 0 is a zero is a zero.  It doesn't matter what he means.
>
> 2 (the second): rules about ASCII do not apply to char.  Just as rules in Portugal do not dissuade me here in Los Angeles.
>
> 3 (the second): I have led the development of multilingual software which was used by quite a large number of people.  I also helped coordinate, and later interface with the assigned coordinator of translation.  This software was translated into Thai, Chinese (simplified and traditional), Russian, Italian, Spanish, Japanese, Catalan, and several other languages. More than twenty anyway.
>
> At first I was suggesting that everyone use their own encoding and handling that (sometimes painfully) in the code.  I would sometimes get comments about using Unicode instead (from the translators who would have preferred this.)  This software now uses UTF-8 and remains translated in these languages.
>
> So, while I have not been to Russia (although I have worked with numerous Russian developers, consumers, and translators) I would tend to disagree with your assertion.  Also I do not like helmets.
>
> Obviously, I mean nothing to be taken personally as well; we are only talking about UTF-8, Unicode, its usage in D, and being pedantic ;). And helmets, we touched that subject too.  But not about each other, really.
>
> Thanks,
> -[Unknown]
>

Ok. Let's make second round

Some definitions:

Unicode Code Point is an integer value (21bit used) - index in
global Unicode table.
Such global encoding table maintained by international Unicode Consortium.
With some exceptions each code point there has correspondent
glyph in "global super font".

There are two types of encodings used for Unicode Code Points:
1) transport encodings - example UTF. Main purpose - transport/transfer.
2) manipulation encodings - mapping of ranges of Unicode Code Points
to the ranges 0..0xFF, 0..0xFFFF and 0..0xFFFFFFFF.

Transport encodings are used for transfer and long term storage of character data - texts.

Manipulation encoding are used in programming for effective implementation
of text processing functions.
As a rule manipulation encoding maps some fragment (or two) of
Unicode Code Point set to the range 0..0xFF and 0..0xFFFF.
Main characteristic of such mapping: each value of character vector (string)
there is in 1:1 relationship with the correspondent codepoint in
Unicode set.
Main idea of such encoding - character at some index in string (vector)
represents one code point in full.

I think that motivation of having manipulation encodings is simple
and everyone understands it.
Think about how you will implement caret positioning in editbox
for example.

So statement: "char[] in D supposed to hold only UTF-8 encoded text" immediately leads us to "D is not designed for effective text processing".

Is this logic clear?

Again - let char be a char in D as it is now. Just don't initialize it
by 0xFF please. And let us be a bit careful with our utf-8 expectations -
yes, it is almost ideal transport encoding, but it is completely useless
for text manipulation purposes - too expensive.
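
For the caret example, the two approaches under discussion can be sketched side by side, assuming a current D2 compiler with Phobos and a UTF-8 source file. `caretRight` and `caretLeft` are hypothetical helpers; `std.utf.stride` and `std.utf.strideBack` are the Phobos functions that report how many code units one code point occupies. The sketch moves by code point only and does not try to settle the cost argument.

```d
import std.conv : to;
import std.utf : stride, strideBack;

// Hypothetical helper: move a byte offset forward by one code point.
size_t caretRight(const(char)[] text, size_t pos)
{
    return pos < text.length ? pos + stride(text, pos) : pos;
}

// Hypothetical helper: move a byte offset back by one code point.
size_t caretLeft(const(char)[] text, size_t pos)
{
    return pos > 0 ? pos - strideBack(text, pos) : pos;
}

void main()
{
    string s = "aя蝿"; // 1 + 2 + 3 UTF-8 bytes

    // Approach 1: keep UTF-8 and step over whole sequences.
    size_t p = 0;
    p = caretRight(s, p); assert(p == 1);
    p = caretRight(s, p); assert(p == 3);
    p = caretRight(s, p); assert(p == 6);
    p = caretLeft(s, p);  assert(p == 3);

    // Approach 2: convert once to dchar[] so that one index is one code point.
    dstring d = s.to!dstring;
    assert(d.length == 3);
    assert(d[1] == 'я');
}
```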

(last message on the subject)

Andrew Fedoniouk.
http://terrainformatica.com