July 21, 2004
On Wed, 21 Jul 2004 07:24:42 +0000 (UTC), Arcane Jill wrote:

> In article <cdkm15$25m7$1@digitaldaemon.com>, Sean Kelly says...
>>
>>Walter wrote:
>>> 
>>> Input is chars, wchars, or dchars.
>>
>>Right, because all char types can be implicitly cast to dchar, correct?
> 
> They can be implicitly cast, but they cannot be /correctly/ cast. I have mentioned this before (and suggested that it be considered a bug) but Walter was adamant that the runtime overhead involved in checking would be undesirable.
> 
> The problem can be demonstrated by example. Suppose you cast a char containing the UTF-8 fragment 0xC0 (which would ordinarily be the first byte of a two-byte UTF-8 sequence) to a dchar: it will be erroneously converted to U+00C0, instead of (as I would prefer) throwing a UTF conversion exception.
> 
> In general, char values >0x7F should not be cast to wchars or dchars, because these values are /not characters/.
> 
> Arcane Jill

(Jill, I'm not criticizing, disputing, arguing, being ornery, etc... I'm just trying to understand UTF better, and I think you'd be one of the best sources at the moment.)

If D char variables are supposed to be UTF-8 characters, then why does D allow a char to contain non-UTF-8 bit patterns (e.g. a UTF-8 fragment)? I can see that a byte could, but a char? Or is D a bit simple here?

-- 
Derek
Melbourne, Australia
21/Jul/04 5:42:59 PM
July 21, 2004
In article <cdl74j$2ea7$1@digitaldaemon.com>, Derek Parnell says...

>If D char variables are supposed to be UTF-8 characters, then why does D allow a char to contain non-UTF-8 bit patterns (e.g. a UTF-8 fragment)? I can see that a byte could, but a char? Or is D a bit simple here?

Oh that's easy to answer.

Okay, first off, there is no such thing as a "UTF-8 character". UTF-8 is an encoding of Unicode, so there is only a "Unicode character", which may be encoded in UTF-8 as a multi-byte sequence. So a char, in fact, can /only/ contain a UTF-8 fragment.
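
To make that concrete, here's a tiny sketch (untested; the \u00E9 escape is just an arbitrary non-ASCII character):

    char[] s = "\u00E9";    // U+00E9 (e-acute) encodes in UTF-8 as two bytes
    assert(s.length == 2);  // so the string occupies two chars...
    assert(s[0] == 0xC3);   // ...each of which is only a fragment,
    assert(s[1] == 0xA9);   // not a character in its own right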

Fortunately, there are some UTF-8 sequences which are, in fact, exactly one byte long. The Unicode characters represented by such one-byte sequences are the characters U+0000 to U+007F inclusive - in other words, ASCII. UTF-8 was designed that way on purpose, to maintain compatibility with ASCII.

Thus, if a char contains a value in the range 0x00 to 0x7F inclusive then it may be interpreted either as an ASCII character or as a "one-byte UTF-8 fragment which happens to represent a complete Unicode character". Both interpretations are equally valid and interchangeable.
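
In code, that means something like this is perfectly safe (minimal sketch):

    char a = 'A';         // 0x41 - a complete one-byte UTF-8 sequence
    dchar d = a;          // the implicit cast happens to be correct here
    assert(d == 0x0041);  // same character either way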

On the other hand, if a char contains a value in the range 0x80 to 0xF8 then it can /only/ be a UTF-8 fragment, since these bytes form part of multibyte sequences which are /at least/ two bytes long, and so cannot be equated with a single character.
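
This is exactly the situation I mentioned before, where the implicit cast to dchar gives the wrong answer (untested sketch):

    char c = cast(char) 0xC0;  // a lead byte of a two-byte sequence - a fragment
    dchar d = c;               // compiles; no UTF check is performed
    assert(d == 0x00C0);       // silently becomes U+00C0, which is not what the
                               // original two-byte sequence meant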

(Values in the range 0xF9 to 0xFF are completely meaningless. That's one reason why char.init is 0xFF).
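
You can see that choice in the language itself:

    char c;                     // default-initialized, never assigned
    assert(c == char.init);
    assert(char.init == 0xFF);  // deliberately not a valid UTF-8 byte

(wchar.init and dchar.init are chosen on the same principle, if I recall correctly.)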

To answer your question: "Why does D allow a char to contain non-UTF-8 bit patterns?" - for the same reason that it allows a dchar to contain non-UTF-32 bit patterns: it's simply a platform-native integer type. Constraining char to contain values in the range 0x00 to 0xF8 (or constraining dchar to values in the range 0x00000000 to 0x0010FFFF, excluding 0x0000D800 to 0x0000DFFF) would add run-time overhead that is simply not necessary.
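
So nothing stops you from doing this (minimal sketch; it compiles and runs without complaint):

    char  c = cast(char)  0xFE;        // never valid anywhere in UTF-8
    dchar d = cast(dchar) 0x00110000;  // one past the last code point, U+10FFFF
    // No exception, no assertion - the checks simply aren't there.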

If I have misunderstood your question, and you were actually intending to ask "Why does D allow a char to contain UTF-8 fragments which cannot be interpreted in isolation?" then the answer has to be that char exists so that char[] can exist. Only in a /string/ does UTF-8 make any real sense. A string needs an array, and an array has to be an array of /something/. char is that something.
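
For example, to get at the actual characters you decode the string as a whole, e.g. with std.utf's toUTF32 (untested sketch, assuming the Phobos std.utf routines):

    import std.utf;

    char[]  s = "\u00E9bc";   // four chars: two for U+00E9, one each for 'b', 'c'
    dchar[] d = toUTF32(s);   // decode the whole string
    assert(d.length == 3);    // three actual characters
    assert(d[0] == 0x00E9);   // the multibyte sequence was decoded correctly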

Any help?

Arcane Jill



July 21, 2004
Arcane Jill wrote:

> In article <cdkm15$25m7$1@digitaldaemon.com>, Sean Kelly says...
> 
>>Walter wrote:
>>
>>>Input is chars, wchars, or dchars.
>>
>>Right, because all char types can be implicitly cast to dchar, correct?
> 
> 
> They can be implicitly cast, but they cannot be /correctly/ cast. I have
> mentioned this before (and suggested that it be considered a bug) but Walter was
> adamant that the runtime overhead involved in checking would be undesirable.
> 
> The problem can be demonstrated by example. Suppose you cast a char containing
> the UTF-8 fragment 0xC0 (which would ordinarily be the first byte of a two-byte
> UTF-8 sequence) to a dchar: it will be erroneously converted to U+00C0, instead
> of (as I would prefer) throwing a UTF conversion exception.
> 
> In general, char values >0x7F should not be cast to wchars or dchars, because
> these values are /not characters/.

Oops, right.  What unFormat does is read everything into dchars, then convert to UTF-8 before writing to a char array, or to UTF-16 before writing to a wchar array.  The missing piece is converting from UTF-8 or UTF-16 when reading, which should be done in a day or two--I decided to rewrite the utf routines to allow a put/get delegate.
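
Roughly the idea for the "put" half (not the actual code, just a sketch; the names are made up, and it skips validity checks like rejecting surrogates):

    // Encode a dchar as UTF-8 through a caller-supplied put delegate,
    // so the same routine can feed a char[], a stream, whatever.
    void encodeUTF8(dchar c, void delegate(char) put)
    {
        if (c <= 0x7F)
            put(cast(char) c);
        else if (c <= 0x7FF)
        {
            put(cast(char)(0xC0 | (c >> 6)));
            put(cast(char)(0x80 | (c & 0x3F)));
        }
        else if (c <= 0xFFFF)
        {
            put(cast(char)(0xE0 | (c >> 12)));
            put(cast(char)(0x80 | ((c >> 6) & 0x3F)));
            put(cast(char)(0x80 | (c & 0x3F)));
        }
        else
        {
            put(cast(char)(0xF0 | (c >> 18)));
            put(cast(char)(0x80 | ((c >> 12) & 0x3F)));
            put(cast(char)(0x80 | ((c >> 6) & 0x3F)));
            put(cast(char)(0x80 | (c & 0x3F)));
        }
    }

    // Usage: accumulate into a char array via a delegate.
    char[] buf;
    encodeUTF8(cast(dchar) 0x00E9, delegate(char ch) { buf ~= ch; });
    assert(buf == "\u00E9");  // two bytes: 0xC3 0xA9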


Sean