July 30, 2006
>> Please don't think that UTF-8 is a panacea.
>
> I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.

Sorry, but this is a bit optimistic.

D/samples/wc.exe out of the box will fail on Russian texts.
It will fail on almost all Eastern texts, even when they
are in UTF-8 encoding: the meaning of 'word'
is different there.

The statement "string literals in D are only
UTF-8 encoded" is not conceptually better than
"string literals in C are encoded using the codepage defined
by pragma(codepage,...)".

The same, by the way, applies to most Java compilers:
they accept texts in various single-byte encodings.
(Why am *I* telling this to *you*? :-)

Andrew.

July 30, 2006
Andrew Fedoniouk wrote:
>>> Please don't think that UTF-8 is a panacea.
>> I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
> 
> Sorry, but this is a bit optimistic.
> 
> D/samples/wc.exe out of the box will fail on Russian texts.
> It will fail on almost all Eastern texts, even when they
> are in UTF-8 encoding: the meaning of 'word'
> is different there.

No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
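
Roughly, a sketch (using Phobos' decode and the isUniAlpha classifier; the exact names and signatures here are per my Phobos, so treat them as assumptions):

import std.uni;   // character classification tables (isUniAlpha)
import std.utf;   // UTF-8 decoding (decode)

// Sketch: classify the code point starting at s[i], advancing i past it.
// Decoding is the only UTF-8-specific step; the tables are shared.
bool isWordChar(char[] s, ref size_t i)
{
    dchar c = decode(s, i);
    return isUniAlpha(c);
}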


> The statement "string literals in D are only
> UTF-8 encoded" is not conceptually better than
> "string literals in C are encoded using the codepage defined
> by pragma(codepage,...)".

It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover Asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.)

Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of German, French, and Latin. How's that going to work with code pages?

Code pages are obsolete, yesterday's technology, and I'm not sorry to see them go.

> The same, by the way, applies to most Java compilers:
> they accept texts in various single-byte encodings.
> (Why am *I* telling this to *you*? :-)

The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
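
For example, in D terms (a small sketch):

unittest
{
    // One code point outside the BMP occupies two UTF-16 code units -
    // two Java "char"s, or two D wchars.
    auto clef = "\U0001D11E"w;   // MUSICAL SYMBOL G CLEF, U+1D11E
    assert(clef.length == 2);    // a surrogate pair: one character, two wchars
}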
July 30, 2006
It really sounds to me like you're looking for UCS-2, then (e.g. as used in JavaScript, etc.)  For that, length calculation (which is what I presume you mean) is inexpensive.

As to your below assertion, I disagree.  What I think you meant was:

"char[] is not designed for effective multi-byte text processing."

I will agree that wchar[] would be much better in that case, and even that limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably make things significantly easier to work with.

Nonetheless, I was only commenting on how D is currently designed and implemented.  Perhaps there was some misunderstanding here.

Even so, I don't see how initializing it to FF makes any problem.  I think everyone understands that char[] is meant to hold UTF-8, and if you don't like that or don't want to use it, there are other methods available to you (heh, you can even use UTF-32!)

I don't see that the initialization of these variables will cause anyone any problems.  The only time I want such a variable initialized to 0 is when I use a numeric type, not a character type (and then, I try to use = 0 anyway.)
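
Concretely, the defaults look like this (per the D spec):

unittest
{
    char  c;              // no explicit initializer
    wchar w;
    assert(c == 0xFF);    // char.init: deliberately invalid in UTF-8
    assert(w == 0xFFFF);  // wchar.init: deliberately invalid in UTF-16
    int i;
    assert(i == 0);       // numeric types default to zero, as noted above
}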

It seems like what you may want to do is simply this:

typedef ushort ucs2_t = 0;

And use that type.  Mission accomplished.  Or, use various different encodings - in which case I humbly suggest:

typedef ubyte latin1_t = 0;
typedef ushort ucs2_t = 0;
typedef ubyte koi8r_t = 0;
typedef ubyte big5_t = 0;

And so on, so on, so on...
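
A side benefit, if I'm not mistaken: each typedef is a distinct type, so the compiler catches accidental mixing of encodings. A sketch:

typedef ubyte koi8r_t = 0;

void takesUTF8(char[] s) { }

void sketch()
{
    koi8r_t[] russian;                // elements default to 0, per the typedef
    // takesUTF8(russian);            // error: koi8r_t[] is not char[]
    takesUTF8(cast(char[]) russian);  // mixing now needs an explicit cast
}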

-[Unknown]


> So the statement "char[] in D is supposed to hold only UTF-8 encoded text"
> immediately leads us to "D is not designed for effective text processing".
> 
> Is this logic clear?
July 30, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:eah9st$2v1o$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>>>> Please don't think that UTF-8 is a panacea.
>>> I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
>>
>> Sorry, but this is a bit optimistic.
>>
>> D/samples/wc.exe out of the box will fail on Russian texts.
>> It will fail on almost all Eastern texts, even when they
>> are in UTF-8 encoding: the meaning of 'word'
>> is different there.
>
> No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
>

Sorry, did you try to write such a function (isword)?

(You need the whole set of character classification tables to accomplish this - UTF-8 will not help you)

>
>> The statement "string literals in D are only
>> UTF-8 encoded" is not conceptually better than
>> "string literals in C are encoded using the codepage defined
>> by pragma(codepage,...)".
>
> It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover Asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.)
>
> Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of German, French, and Latin. How's that going to work with code pages?

I am not saying that you should avoid using UTF-8 encoding.
If you have a mix of, say, English, Russian, and Chinese on some page,
the only way to deliver it to the user is to use some (universal)
Unicode transport encoding.
But rendering this thing on the screen is a completely different
story.

Consider this: attribute names in HTML (SGML) are represented by
ASCII codes only - you don't need UTF-8 processing to deal with them at all.
You also cannot use UTF-8 for storing attribute values, generally speaking.
Attribute values participate in CSS selector analysis, and some selectors
require char-by-char (char as a code point, not a D char) access.

There are only a few academic cases where you can use UTF-8 literally
(as a sequence of UTF-8 bytes) *at runtime*. D source code compilation
is one of them - you can store the content of string literals in UTF-8 form,
since you don't need to analyze their content.
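
To make the char-by-char point concrete, here is a sketch of what code point access over UTF-8 costs in D - every step is a decode:

unittest
{
    auto value = "значение";   // attribute value, stored as UTF-8 bytes

    foreach (dchar c; value)   // D decodes UTF-8 on the fly here
    {
        // selector analysis would inspect the code point c
    }

    // With dchar[] (UTF-32) the same access is plain O(1) indexing:
    auto fixed = "значение"d;
    assert(fixed[2] == 'а');   // the third letter (Cyrillic а), no decoding needed
}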

>
> Code pages are obsolete, yesterday's technology, and I'm not sorry to see them go.

Sorry, but the US is the first country that will ask "what the ...?" when demanded to always send four bytes instead of one.

UTF-8 encoding is "traffic friendly" only for 1/10 of the population
of the Earth (English-speaking people).
Others just don't want to pay that price.

Sorry or not, it is irrelevant to the existence of code pages. They will be around forever, until all of us speak Esperanto.

( Currently I am doing right-to-left support in the engine - Arabic and
Hebrew -
trust me - probably I have more things to say "sorry" about )

>
>> The same, by the way, applies to most Java compilers:
>> they accept texts in various single-byte encodings.
>> (Why am *I* telling this to *you*? :-)
>
> The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)

Walter, where did you get that magic UTF-16?

Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
mentions that the input of the Java compiler is a sequence of Unicode code points.
How this input sequence is encoded - UTF-8, UTF-16, KOI8-R - does not
matter at all, and the spec is silent about it: a human is within his/her
rights to choose whatever encoding his/her terminal/keyboard supports.

Andrew Fedoniouk.
http://terrainformatica.com


July 30, 2006
Is there a doctor in the house?



Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eah9st$2v1o$1@digitaldaemon.com...
> 
>>Andrew Fedoniouk wrote:
>>
>>>>>Please don't think that UTF-8 is a panacea.
>>>>
>>>>I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
>>>
>>>Sorry but this is a bit optimistic.
>>>
>>>D/samples/wc.exe from the box will fail on russian texts.
>>>It will fail on almost all Eastern texts. Even they
>>>will be in UTF-8 encoding. Meaning of 'word'
>>>is different there.
>>
>>No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
>>
> 
> 
> Sorry, did you try to write such a function (isword)?
> 
> (You need the whole set of character classification tables
> to accomplish this - utf-8 will not help you)
> 
> 
>>>Having statement "string literals in D are only
>>>UTF-8 encoded" is not conceptually better than
>>>"string literals in C are encoded by using codepage defined
>>>by pragma(codepage,...)".
>>
>>It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.)
>>
>>Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of german, french, and latin. How's that going to work with code pages?
> 
> 
> I am not saying that you shall avoid use of UTF-8 encoding.
> If you have mix of say english, russian and chinese on some page
> the only way to deliver this to the user is to use some (universal)
> unicode transport encoding.
> But to render this thing on the screen is completely different
> story.
> 
> Consider this: attribute names in html (sgml) represented by
> ascii codes only - you don't need utf-8 processing to deal with them at all.
> You also cannot use utf-8 for storing attribute values generally speaking.
> Attribute values participate in CSS selector analysis and some selectors
> require char by char (char as a code point and not a D char) access.
> 
> There are only few academic cases where you can use utf-8 literally
> (as a sequence of utf-8 bytes) *in runtime*. D source code compilation
> is one of such things - you can store content of string literals in utf-8 form -
> you don't need to analyze their content.
> 
> 
>>Code pages are obsolete yesterday's technology, and I'm not sorry to see them go.
> 
> 
> Sorry but US is the first country which will ask "what a ...?" on demand
> to send always four bytes instead of one.
> 
> UTF-8 encoding is "traffic friendly" only for 1/10 of population
> on the Earth (English speaking people).
> Others just don't want to pay that price.
> 
> Sorry you or not sorry it is irrelevant for code pages existence.
> They will be forever untill all of us will not speak on Esperanto.
> 
> ( Currently I am doing right-to-left support in the engine - Arabic and Hebrew -
> trust me - probably I have more things to say "sorry" about )
> 
> 
>>>Same by the way applied to most of Java compilers
>>>they accepts texts in various singlebyte encodings.
>>>(Why *I* am telling this to *you*? :-)
>>
>>The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
> 
> 
> Walter, where did you get that magic UTF-16 ?
> 
> Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
> mentions that input of Java compiler is sequence of Unicode (Code Points).
> And how this input sequence is encoded, utf-8, utf-16, koi8r - it does not
> matter at all and spec is silent about this - human is in its rights to choose
> encoding his/her terminal/keyboard supports.
> 
> Andrew Fedoniouk.
> http://terrainformatica.com
> 
> 
July 30, 2006
"Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:eahcqu$4d$1@digitaldaemon.com...
> It really sounds to me like you're looking for UCS-2, then (e.g. as used in JavaScript, etc.)  For that, length calculation (which is what I presume you mean) is inexpensive.
>

Well, let's speak in terms of JavaScript if it is easier:

String.substr(start, end)...

What do these start and end mean to you?
I don't think that you will be interested in the indexes
of bytes in a UTF-8 sequence.
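
For instance, in D as it is now, indexes are byte indexes (std.utf.count here is an assumption of mine for the code point count):

import std.utf;

unittest
{
    auto s = "Привет";              // 6 Cyrillic letters
    assert(s.length == 12);         // .length and slicing work in UTF-8 bytes
    assert(std.utf.count(s) == 6);  // code points must be counted separately
    // s[0 .. 1] would not even be a complete character
}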

> As to your below assertion, I disagree.  What I think you meant was:
>
> "char[] is not designed for effective multi-byte text processing."

What is "multi-byte text processing"?
processing of text - sequence of codepoints of the alphabet?
What is 'multi-byte' there doing? Multi-byte I beleive you mean is
a method of encoding of codepoints for transmission. Is this correct?

You need real codepoints to do something meaningfull with them...
How these codepoints are stored in memory: as byte, word or dword
depends on your task, amount of memory you have and alphabet
you are using.
E.g. if you are counting frequency of russian words used in internet
you'd better do not do this in Java - twice as expensive as in C
without any need.

So phrase "multi-byte text processing" is fuzzy on this end.

(Seems like I am not clear enough with my subset of English.)

>
> I will agree that wchar[] would be much better in that case, and even that limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably make things significantly easier to work with.
>
> Nonetheless, I was only commenting on how D is currently designed and implemented.  Perhaps there was some misunderstanding here.
>
> Even so, I don't see how initializing it to FF makes any problem.  I think everyone understands that char[] is meant to hold UTF-8, and if you don't like that or don't want to use it, there are other methods available to you (heh, you can even use UTF-32!)
>
> I don't see that the initialization of these variables will cause anyone any problems.  The only time I want such a variable initialized to 0 is when I use a numeric type, not a character type (and then, I try to use = 0 anyway.)
>
> It seems like what you may want to do is simply this:
>
> typedef ushort ucs2_t = 0;
>
> And use that type.  Mission accomplished.  Or, use various different encodings - in which case I humbly suggest:
>
> typedef ubyte latin1_t = 0;
> typedef ushort ucs2_t = 0;
> typedef ubyte koi8r_t = 0;
> typedef ubyte big5_t = 0;
>
> And so on, so on, so on...
>
> -[Unknown]

I like the last statement: "..., so on, so on..."
Sounds promising enough.

Just for information:
strlen(const char* str) works with *all*
single-byte encodings in C.
For multi-byte encodings (e.g. UTF-8) it returns
the length of the sequence in octets.
But strictly speaking these are not chars
in terms of C but bytes -
unsigned chars.
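
E.g. - and since D string literals are NUL-terminated, the same thing is visible from D (the module name is an assumption; it has lived in both std.c.string and core.stdc.string):

import core.stdc.string;   // strlen

unittest
{
    auto p = "привет".ptr;    // pointer to the NUL-terminated literal
    assert(strlen(p) == 12);  // octets, not characters: 6 letters, 12 bytes
}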


>
>
>> So the statement "char[] in D is supposed to hold only UTF-8 encoded text" immediately leads us to "D is not designed for effective text processing".
>>
>> Is this logic clear?


July 30, 2006

Andrew Fedoniouk wrote:
> ( Currently I am doing right-to-left support in the engine - Arabic and Hebrew -
> trust me - probably I have more things to say "sorry" about )
> 

That's great! I'd be glad to help with anything you need with regard to Arabic (I'm a native Arabic speaker).

> 
> Andrew Fedoniouk.
> http://terrainformatica.com
> 
> 
July 30, 2006
On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:


> ... but this is far from the concept of a null codepoint in character encodings.

Andrew and others,
I've read through these posts a few times now, trying to understand the
various points of view being presented. I keep getting the feeling that
some people are deliberately trying *not* to understand what other people
are saying. This is a sad situation.

Andrew seems to be stating ...
(a) char[] arrays should be allowed to hold encodings other than UTF-8, and
thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain
redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it
is meant to be used to contain only UTF-8 octets.

I, and many others including Walter, would probably agree with (b), (c), and
(d). However, considering (b) and (c), UTF has benefits that outweigh these
issues, and there are ways to compensate for them too. Point (d) is a
casualty of history, and to change the language now to rename 'char' to
anything else would be counterproductive. But feel free to implement
your own flavour of D.<g>

Back to point (a)... The fact is, char[] is designed to hold UTF-8 encodings, so don't try to force anything else into such arrays. If you wish to use some other encodings, then use a more appropriate data structure for it. For example, to hold 'KOI-8' encodings of Russian text, I would recommend using ubyte[] instead. To transform char[] to any other encoding you will have to provide the functions to do that, as I don't think it is Walter's or D's responsibility to do it. The point of initializing UTF-8 strings with illegal values is to help detect coding or logical mistakes. And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode codepoint *is* illegal. If you must store an octet of hex-FF then use ubyte[] arrays to do it.
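
A sketch of that rule (the KOI-8 byte values here are illustrative only):

unittest
{
    char[] utf8Text = "при".dup;            // char[] carries the UTF-8 contract
    assert(utf8Text.length == 6);           // 3 Cyrillic letters, 6 UTF-8 octets

    ubyte[] koi8Text = [0xD0, 0xD2, 0xC9];  // "при" in KOI8-R (illustrative)
    ubyte[] raw = [0xFF, 0x00];             // and hex-FF is perfectly legal here
}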

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
July 30, 2006
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eah9st$2v1o$1@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>>>> Please don't think that UTF-8 is a panacea.
>>>> I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
>>> Sorry, but this is a bit optimistic.
>>>
>>> D/samples/wc.exe out of the box will fail on Russian texts.
>>> It will fail on almost all Eastern texts, even when they
>>> are in UTF-8 encoding: the meaning of 'word'
>>> is different there.
>> No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
> Sorry, did you try to write such a function (isword)?

I have written isUniAlpha, which is the same thing.

> (You need the whole set of character classification tables
> to accomplish this - UTF-8 will not help you)

With code pages, it isn't so straightforward (especially if you've got things like Shift-JIS too) - a program can't even accept a text file unless you tell it what page the text is in.

> I am not saying that you should avoid using UTF-8 encoding.
> If you have a mix of, say, English, Russian, and Chinese on some page,
> the only way to deliver it to the user is to use some (universal)
> Unicode transport encoding.
> But rendering this thing on the screen is a completely different
> story.

Fortunately, rendering is the job of the operating system - and I don't see how rendering with code pages would be any easier.

> Consider this: attribute names in HTML (SGML) are represented by
> ASCII codes only - you don't need UTF-8 processing to deal with them at all.
> You also cannot use UTF-8 for storing attribute values, generally speaking.
> Attribute values participate in CSS selector analysis, and some selectors
> require char-by-char (char as a code point, not a D char) access.

I'd be surprised at that, since UTF-8 is a documented, supported HTML page encoding method. But if UTF-8 doesn't work for you, you can use wchar (UTF-16) or dchar (UTF-32), or ubyte (for anything else).

> There are only a few academic cases where you can use UTF-8 literally
> (as a sequence of UTF-8 bytes) *at runtime*. D source code compilation
> is one of them - you can store the content of string literals in UTF-8 form,
> since you don't need to analyze their content.

D identifiers can be Unicode alphas, which means the UTF-8 must be decoded.
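
For instance:

// The lexer has to decode the UTF-8 just to recognize this declaration:
int число = 3;   // Cyrillic identifier - legal D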

The DMC++ compiler supports various code page source file possibilities, including some of the asian language multibyte encodings. I find that UTF-8 is a lot easier to work with, as the UTF-8 designers learned from the mistakes of the earlier multibyte encodings.

>> Code pages are obsolete, yesterday's technology, and I'm not sorry to see them go.
> Sorry, but the US is the first country that will ask "what the ...?" when demanded
> to always send four bytes instead of one.
> UTF-8 encoding is "traffic friendly" only for 1/10 of the population
> of the Earth (English-speaking people).
> Others just don't want to pay that price.

I'll make a prediction that the huge benefits of UTF will outweigh the downside, and that code pages will increasingly fall into disuse. Note that JavaScript, Java, C#, Ruby, etc., are all Unicode languages (Ruby also supports EUC or SJIS, but not other code pages). Windows is (internally) completely Unicode (the code page face it shows is done by a translation layer on I/O).

In an increasingly multicultural and global economy, applications that cannot simultaneously handle multiple languages are going to be at a severe disadvantage.

Another problem with code pages is when you're presented with a text file, what code page is it in? There's no way for a program to tell, unless there's some other transmission of associated metadata. With UTF, that's no problem.
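
UTF-8 can even be checked mechanically. A sketch (using std.utf's validate, which throws on malformed input; I catch the base Exception to stay neutral about the exact exception name across Phobos versions):

import std.utf;   // validate() throws on malformed UTF-8

// No metadata needed: either the bytes decode as UTF-8 or they don't.
// A code-page file offers no such test - nearly any byte soup is "valid".
bool looksLikeUTF8(char[] data)
{
    try { validate(data); return true; }
    catch (Exception e) { return false; }
}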

> Sorry or not, it is irrelevant to the existence of code pages.
> They will be around forever, until all of us speak Esperanto.
> 
> ( Currently I am doing right-to-left support in the engine - Arabic and Hebrew -
> trust me - probably I have more things to say "sorry" about )

No problem, I believe you <g>.

>>> The same, by the way, applies to most Java compilers:
>>> they accept texts in various single-byte encodings.
>>> (Why am *I* telling this to *you*? :-)
>> The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
> 
> Walter, where did you get that magic UTF-16?
> 
> Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
> mentions that the input of the Java compiler is a sequence of Unicode code points.
> How this input sequence is encoded - UTF-8, UTF-16, KOI8-R - does not
> matter at all, and the spec is silent about it: a human is within his/her rights to choose whatever encoding his/her terminal/keyboard supports.

Java Language Specification Third Edition Chapter 3.2: "The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding."

It is, of course, entirely reasonable for a Java compiler to have extensions to recognize other encodings and automatically convert them internally to UTF-16 before lexical analysis.

"One Encoding to rule them all, One Encoding to replace them,
One Encoding to handle them all and in the darkness bind them"
-- UTF Tolkien
July 30, 2006
Derek wrote:
> Andrew seems to be stating ...
> (a) char[] arrays should be allowed to hold encodings other than UTF-8, and
> thus initializing them with hex-FF byte values is not useful.
> (b) UTF-8 encoding is not an efficient encoding for text analysis.
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to be used to contain only UTF-8 octets.
> 
> I, and many others including Walter, would probably agree with (b), (c), and
> (d). However, considering (b) and (c), UTF has benefits that outweigh these
> issues, and there are ways to compensate for them too. Point (d) is a
> casualty of history, and to change the language now to rename 'char' to
> anything else would be counterproductive. But feel free to implement
> your own flavour of D.<g>
> 
> Back to point (a)... The fact is, char[] is designed to hold UTF-8
> encodings so don't try to force anything else into such arrays. If you wish
> to use some other encodings, then use a more appropriate data structure for
> it. For example, to hold 'KOI-8' encodings of Russian text, I would
> recommend using ubyte[] instead. To transform char[] to any other encoding
> you will have to provide the functions to do that, as I don't think it is
> Walter's or D's responsibilty to do it. The point of initializing UTF-8
> strings with illegal values is to help detect coding or logical mistakes.
> And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
> codepoint *is* illegal. If you must store an octet of hex-FF then use
> ubyte[] arrays to do it.

Thank you for the insightful summary of the situation.

I suspect, though, that (c) might be moot, since it is my understanding that most actual data transmission equipment automatically compresses the data stream, so the redundancy of UTF-8 is minimized. Text itself tends to be highly compressible on top of that.

Furthermore, because of the rate of expansion and declining costs of bandwidth, the cost of extra bytes is declining at the same time that the cost of the inflexibility of code pages is increasing.