July 30, 2006
LOL!!!

---
Paolo

Walter Bright wrote:

> "One Encoding to rule them all, One Encoding to replace them,
> One Encoding to handle them all and in the darkness bind them"
> -- UTF Tolkien
July 30, 2006
Derek wrote:
> On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
> 
> 
>> ... but this is far from concept of null codepoint in character encodings.
> 
> Andrew and others,
> I've read through these posts a few times now, trying to understand the
> various points of view being presented. I keep getting the feeling that
> some people are deliberately trying *not* to understand what other people
> are saying. This is a sad situation.
> 
> Andrew seems to be stating ...
> (a) char[] arrays should be allowed to hold encodings other than UTF-8, and
> thus initializing them with hex-FF byte values is not useful.
> (b) UTF-8 encoding is not an efficient encoding for text analysis.
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to be used to contain only UTF-8 octets.
> 
> I, and many others including Walter, would probably agree to (b), (c) and
> (d). However, considering (b) and (c), UTF has benefits that outweigh these
> issues and there are ways to compensate for these too. Point (d) is a
> casualty of history and to change the language now to rename 'char' to
> anything else would be counterproductive. But feel free to implement
> your own flavour of D.<g>
> 
> Back to point (a)... The fact is, char[] is designed to hold UTF-8
> encodings so don't try to force anything else into such arrays. If you wish
> to use some other encodings, then use a more appropriate data structure for
> it. For example, to hold 'KOI-8' encodings of Russian text, I would
> recommend using ubyte[] instead. To transform char[] to any other encoding
> you will have to provide the functions to do that, as I don't think it is
> Walter's or D's responsibility to do it. The point of initializing UTF-8
> strings with illegal values is to help detect coding or logical mistakes.
> And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
> codepoint *is* illegal. If you must store an octet of hex-FF then use
> ubyte[] arrays to do it.

Thank you for the clear summary.

Apart from the obvious (d), I think there are two reasons this char confusion comes up now and then.

1. The documentation may not be clear enough that char is really only meant to represent a UTF-8 code unit (or an ASCII character) and that char[] is a UTF-8 encoded string. This needs to be stressed more. People coming from C will automatically assume that D's char is an equivalent of C's char. It should be mentioned that dchar is the only type that can represent any Unicode character on its own, while char is a complete character only in the ASCII range.

The C to D type conversion table doesn't help either:
http://www.digitalmars.com/d/ctod.html
It should say something like:
char => char (UTF-8 and ASCII strings), ubyte (other byte-based encodings)

2. All string functions in Phobos work only on char[] (and in some cases wchar[] and dchar[]), making the tools for working with other string encodings extremely limited. This is easily remedied by a templated string library, such as the one I proposed earlier.
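To make this concrete, here is a minimal sketch of what one routine in such a templated library could look like. The name `count` and its signature are hypothetical, not an existing Phobos API:

```d
// Sketch: one generic routine that instantiates for char[], wchar[]
// and dchar[] alike. Hypothetical name, not an actual Phobos function.
size_t count(Char)(Char[] haystack, Char needle)
{
    size_t n = 0;
    foreach (Char c; haystack)
        if (c == needle)
            n++;
    return n;
}
```

For example, count("hello", 'l') gives 2. Note that for char[] this compares code units rather than code points, so a real library would still have to document which semantics each instantiation provides.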

/Oskar
July 30, 2006
It's true that in HTML, attribute names were limited to a subset of the characters available for use in the document.  Namely, as mentioned, alpha-type names matching /[A-Za-z][A-Za-z0-9\.\-]*/.  You couldn't even use accented chars.

However (in the case of HTML), you were required to use specific (English) attribute names anyway for HTML to validate; it's really not a significant limitation.  Few people used SGML for anything else.

XML allows for Unicode attribute and element names... PIs, CDATA, PCDATA, etc.  And, of course, allows you to reference any Unicode code point (e.g. &#1234;.)

We could also talk about the limitations of horse-drawn carriages, and how they can only go a certain speed... nonetheless, we have cars now, so I'm not terribly worried about HTML's technical limitations anymore.

-[Unknown]


>> Consider this: attribute names in html (sgml) represented by
>> ascii codes only - you don't need utf-8 processing to deal with them at all.
>> You also cannot use utf-8 for storing attribute values generally speaking.
>> Attribute values participate in CSS selector analysis and some selectors
>> require char by char (char as a code point and not a D char) access.
> 
> I'd be surprised at that, since UTF-8 is a documented, supported HTML page encoding method. But if UTF-8 doesn't work for you, you can use wchar (UTF-16) or dchar (UTF-32), or ubyte (for anything else).
> 
July 30, 2006
Unknown W. Brackets wrote:
> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 string.  Since char can only contain UTF-8 strings, it represents invalid data if it contains such an 8-bit octet.
> 
You mentioned "8-bit octet" repeatedly in various posts. That's redundant: an "octet" is by definition an 8-bit value. There are no "16-bit octets" and no "8-bit hextets" or anything like that :P . I hope you knew that and were just distracted, but you kept saying it :) .

> 1. UTF-8 character here could mean an 8-bit octet of code point.  In this case, they are both the same and represent a perfectly valid character in a string.
> 

A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly, a "UTF-16 hextet" is called a UTF-16 'code unit'. A single UTF-8 code unit holds an entire Unicode code point only if that code point is below 128. Otherwise, multiple UTF-8 code units are needed to encode the code point.

The confusion between 'code unit' and 'code point' is a long-standing one. A "UTF-8 character" is a slightly ambiguous term. Does it mean a UTF-8 code unit, or does it mean a Unicode character/code point encoded as a UTF-8 sequence?
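To make the distinction concrete in D terms (a minimal sketch; the literals below are just the standard encodings of U+00E9):

```d
void main()
{
    // One code point, U+00E9 ('é'), stored in the three UTF encodings.
    char[]  u8  = "\u00E9";   // 2 UTF-8 code units: 0xC3, 0xA9
    wchar[] u16 = "\u00E9"w;  // 1 UTF-16 code unit
    dchar[] u32 = "\u00E9"d;  // 1 UTF-32 code unit, i.e. 1 code point

    // .length counts code units of the array's encoding, not characters.
    assert(u8.length == 2);
    assert(u16.length == 1);
    assert(u32.length == 1);
}
```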

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
July 30, 2006
Yes, you're right, most of the time I wouldn't (although a significant portion of the time, I would.)  Even so, this is why I would use UCS-2, and not UTF-8.  Why are you hung up on char[]?

My point is that char[] is only trouble when you're dealing with text that is not ISO-8859-1.  I'm a great fan of localization and internationalization, but in all honesty the largest part of my text processing/analysis is with such strings.

Generally, user input I don't analyze.  Caret placement I leave to be handled by the libraries I use.  That is, when I use char[].

So again, I will agree that, in D, char[] is not a good choice for strings you are expecting to contain possibly-internationalized data.

I'm perfectly aware of what strlen (and str.length in D) do... it's similar to what they do in practically all other languages (unless you know the encoding is UCS-2, etc.)  For example, I work with PHP a lot and it doesn't even have (with the versions I support) built-in support for Unicode.  This makes text processing fun!
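For what it's worth, the behaviour in question is easy to show in D, using the kanji that appeared earlier in this thread (it lives in the BMP and takes three octets in UTF-8):

```d
void main()
{
    // .length reports code units of the array's encoding, not characters.
    char[]  s8  = "蝿";   // 3 UTF-8 code units
    wchar[] s16 = "蝿"w;  // 1 UTF-16 code unit (BMP character)
    dchar[] s32 = "蝿"d;  // 1 code point
    assert(s8.length == 3);
    assert(s16.length == 1);
    assert(s32.length == 1);
}
```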

-[Unknown]


> "Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:eahcqu$4d$1@digitaldaemon.com...
>> It really sounds to me like you're looking for UCS-2, then (e.g. as used in JavaScript, etc.)  For that, length calculation (which is what I presume you mean) is inexpensive.
>>
> 
> Well, let's speak in terms of javascript if it is easier:
> 
> String.substr(start, end)...
> 
> What these start, end means for you?
> I don't think that you will be interested in indexes
> of bytes in utf-8 sequence.
> 
>> As to your below assertion, I disagree.  What I think you meant was:
>>
>> "char[] is not designed for effective multi-byte text processing."
> 
> What is "multi-byte text processing"?
> processing of text - sequence of codepoints of the alphabet?
> What is 'multi-byte' there doing? Multi-byte I believe you mean is
> a method of encoding of codepoints for transmission. Is this correct?
> 
> You need real codepoints to do something meaningful with them...
> How these codepoints are stored in memory: as byte, word or dword
> depends on your task, amount of memory you have and alphabet
> you are using.
> E.g. if you are counting frequency of russian words used in internet
> you'd better do not do this in Java - twice as expensive as in C
> without any need.
> 
> So phrase "multi-byte text processing" is fuzzy on this end.
> 
> (Seems like I am not clear enough with my subset of English.)
> 
>> I will agree that wchar[] would be much better in that case, and even that limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably make things significantly easier to work with.
>>
>> Nonetheless, I was only commenting on how D is currently designed and implemented.  Perhaps there was some misunderstanding here.
>>
>> Even so, I don't see how initializing it to FF makes any problem.  I think everyone understands that char[] is meant to hold UTF-8, and if you don't like that or don't want to use it, there are other methods available to you (heh, you can even use UTF-32!)
>>
>> I don't see that the initialization of these variables will cause anyone any problems.  The only time I want such a variable initialized to 0 is when I use a numeric type, not a character type (and then, I try to use = 0 anyway.)
>>
>> It seems like what you may want to do is simply this:
>>
>> typedef ushort ucs2_t = 0;
>>
>> And use that type.  Mission accomplished.  Or, use various different encodings - in which case I humbly suggest:
>>
>> typedef ubyte latin1_t = 0;
>> typedef ushort ucs2_t = 0;
>> typedef ubyte koi8r_t = 0;
>> typedef ubyte big5_t = 0;
>>
>> And so on, so on, so on...
>>
>> -[Unknown]
> 
> I like the last statement "..., so on, so on..."
> Sounds promising enough.
> 
> Just for information:
> strlen(const char* str)  works with *all*
> single byte encodings in C.
> For multi-bytes (e.g. utf-8 )  it returns
> length of the sequence in octets.
> But these are not chars in terms of C
> strictly speaking but bytes -
> unsigned chars.
> 
> 
>>
>>> So statement: "char[] in D supposed to hold only UTF-8 encoded text"
>>> immediately leads us to "D is not designed for effective text processing".
>>>
>>> Is this logic clear? 
> 
> 
July 30, 2006
Derek wrote:
> On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
> 
> 
>> ... but this is far from concept of null codepoint in character encodings.
> 
> Andrew and others,
> I've read through these posts a few times now, trying to understand the
> various points of view being presented. I keep getting the feeling that
> some people are deliberately trying *not* to understand what other people
> are saying. This is a sad situation.
> 
> Andrew seems to be stating ...
> (a) char[] arrays should be allowed to hold encodings other than UTF-8, and
> thus initializing them with hex-FF byte values is not useful.
> (b) UTF-8 encoding is not an efficient encoding for text analysis.
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to be used to contain only UTF-8 octets.
> 
> I, and many others including Walter, would probably agree to (b), (c) and
> (d). However, considering (b) and (c), UTF has benefits that outweigh these
> issues and there are ways to compensate for these too. Point (d) is a
> casualty of history and to change the language now to rename 'char' to
> anything else would be counterproductive. But feel free to implement
> your own flavour of D.<g>
> 
> Back to point (a)... The fact is, char[] is designed to hold UTF-8
> encodings so don't try to force anything else into such arrays. If you wish
> to use some other encodings, then use a more appropriate data structure for
> it. For example, to hold 'KOI-8' encodings of Russian text, I would
> recommend using ubyte[] instead. To transform char[] to any other encoding
> you will have to provide the functions to do that, as I don't think it is
> Walter's or D's responsibility to do it. The point of initializing UTF-8
> strings with illegal values is to help detect coding or logical mistakes.
> And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
> codepoint *is* illegal. If you must store an octet of hex-FF then use
> ubyte[] arrays to do it.
> 

Good summary. Additionally I'd like to say that, to hold 'KOI-8' encodings, you could create a typedef instead of just using ubyte:

  typedef ubyte koi8char;

Thus you can express in the code what the encoding of such a ubyte is, since it is part of the type information. The program can then work with it:

  koi8char toUpper(koi8char ch) { ...
  int wordCount(koi8char[] str) { ...
  dchar[] toUTF32(koi8char[] str) { ...


-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
July 30, 2006
Unknown W. Brackets wrote:
> 
> char c = '蝿';
> 
> 
> Because that would have failed.  A char cannot hold such a character, which has a code point outside the range 0 - 127.  You would need to use an array of chars, or some other type.

Speaking of which, shouldn't that be a compile-time error? The
compiler allows all kinds of *char mingling:

  dchar dc = '蝿';
  char sc = dc;     // :-(


-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
July 30, 2006
Indeed; this is the same situation as with XML transmission over the web.  It contains a huge amount of redundancy, and compresses so well that I've seen it do better than binary-based formats.

Although I'm afraid that most of the time this compression isn't automatic, and too often is simply not done.

-[Unknown]


> I suspect, though, that (c) might be moot since it is my understanding that most actual data transmission equipment automatically compresses the data stream, and so the redundancy of the UTF-8 is minimized. Text itself tends to be highly compressible on top of that.
> 
> Furthermore, because of the rate of expansion and declining costs of bandwidth, the cost of extra bytes is declining at the same time that the cost of the inflexibility of code pages is increasing.
July 30, 2006
Eek!  Yes, I would say (in my humble opinion) that this should be a compile-time error.

Obviously down-casting is more complicated.  I think the case of chars is much more obvious/clear than the case of ints, but then it's also a special case.
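A minimal sketch of why this particular narrowing is dangerous, assuming (as reported in this thread) that the compiler accepts the assignment silently:

```d
void main()
{
    dchar dc = '\u00E9';  // one full code point, U+00E9
    char  sc = dc;        // compiles, but keeps only the low 8 bits
    // sc now holds 0xE9, which on its own is not valid UTF-8:
    // 0xE9 is the lead byte of a three-byte sequence, so a lone
    // occurrence of it makes the surrounding char[] data invalid.
    assert(sc == 0xE9);
}
```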

-[Unknown]


> Unknown W. Brackets wrote:
>>
>> char c = '蝿';
>>
>>
>> Because that would have failed.  A char cannot hold such a character, which has a code point outside the range 0 - 127.  You would need to use an array of chars, or some other type.
> 
> Which, speaking of which, shouldn't that be a compile time error? The
> compiler allows all kinds of *char mingling:
> 
>   dchar dc = '蝿';
>   char sc = dc;     // :-(
> 
> 
July 30, 2006
I use that terminology because I've read too many RFCs (consider the FTP RFC) - they all say "8-bit octet".  Anyway, I'm trying to be completely clear.

Code unit.  Yeah, I knew it was code something but it slipped my mind. I was sure that he'd either correct me or 8-bit octet/etc. would remain clear.  I hate it when I forget such obvious terms.

Anyway, my point in what you're quoting is very context-dependent. Walter mentioned that "0 is a valid UTF-8 character."  Andrew asked what this meant, so I explained that in this case (as you also clarified) it doesn't make any difference.  Regardless, it's a valid [whatever it is] and that meaning is not unclear.

-[Unknown]


> Unknown W. Brackets wrote:
>> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 string.  Since char can only contain UTF-8 strings, it represents invalid data if it contains such an 8-bit octet.
>>
> You mentioned "8-bit octet" repeatedly in various posts. That's redundant: An "octet" is an 8-bit value. There are no "16-bit octets" and no "8-bit hextets" or stuff like that :P . I hope you knew that and were just distracted, but you kept saying that :) .
> 
>> 1. UTF-8 character here could mean an 8-bit octet of code point.  In this case, they are both the same and represent a perfectly valid character in a string.
>>
> 
> A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly, a "UTF-16 hextet" is called a UTF-16 'code unit'. A single UTF-8 code unit holds an entire Unicode code point only if that code point is below 128. Otherwise, multiple UTF-8 code units are needed to encode the code point.
> 
> The confusion between 'code unit' and 'code point' is a long-standing one. A "UTF-8 character" is a slightly ambiguous term. Does it mean a UTF-8 code unit, or does it mean a Unicode character/code point encoded as a UTF-8 sequence?
>