Unicode and UTF similarities/differences?
October 01, 2004
I know Unicode and UTF are talked about a lot here, and with good reason: it sounds like a feature that makes D attractive in the portability realm of I18N, being able to write source code once and compile it anywhere.  But as a newbie to D and to Unicode and UTF, could anyone please explain the differences (or similarities) between the two?

Thanks,
Kramer


October 01, 2004
In article <cjk0hf$oeu$1@digitaldaemon.com>, Kramer says...
>
>I know Unicode and UTF is talked about a lot here and with good reason, because it sounds like a feature that makes D attractive in the portability realm of I18N being able to write source code once and compile anywhere.  But as a newbie to D and to Unicode and UTF, could anyone please explain the differences (or similarities) on the two.

Well, they are different kinds of objects. Unicode is a character set; UTF-16 is an encoding. Bear with me - I'll try to make that clearer.

A character set is a set of characters in which each character has a /number/ associated with it, called its "codepoint". For example, in the ASCII character set, the character 'A' has a codepoint of 65 (more usually written in hex, as 0x41). In the Unicode character set, 'A' also has a codepoint of 65, and the character '€' (not present in ASCII) has a codepoint of 8,364 (more normally written in hex as 0x20AC).

Unicode characters are often written as U+ followed by their codepoint in hexadecimal. That is, U+20AC means the same thing as '€'.
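To make the codepoint idea concrete, here is the same mapping poked at from Python (used purely for illustration here; the numbers themselves are language-independent and would be identical in D):

```python
# A character set assigns a number (codepoint) to each character.
print(ord("A"))        # 65 (0x41) in ASCII and in Unicode alike
print(hex(ord("€")))   # 0x20ac, i.e. the character written U+20AC
print(chr(0x20AC))     # €
```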

Once upon a time, Unicode was going to be a sixteen-bit wide character set. That is, there were going to be (at most) 65,536 characters in it. Thus, every Unicode string would fit comfortably into an array of 16-bit-wide words.

Then things changed. Unicode grew too big. Suddenly, 65,536 characters wasn't going to be enough. But too many important real-life applications had come to rely on characters being 16 bits wide (for example: Java and Windows, to name a couple of biggies). Something had to be done. That something was UTF-16.

UTF-16 is a sneaky way of squeezing >65,535 characters into an array originally designed for 16-bit words. Unicode characters with codepoints <0x10000 still occupy only one word; Unicode characters with codepoints >=0x10000 now occupy two words. (A special range of otherwise unused codepoints, the surrogates, makes this possible).

In general, an "encoding" is a bidirectional mapping which maps each codepoint to an array of fixed-width objects called "code units". How wide is a code unit? Well, it depends on the encoding. UTF-8 code units are 8 bits wide; UTF-16 code units are 16 bits wide; and UTF-32 code units are 32 bits wide. So UTF-16 is a mapping from Unicode codepoints to arrays of 16-bit wide units. For example, the codepoint 0x10000 maps (in UTF-16) to the array [ 0xD800, 0xDC00 ].
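To see those code units concretely, here is a quick Python sketch (Python rather than D, only because it makes the byte-level view easy to print; the encodings are the same everywhere):

```python
import struct

c = chr(0x10000)  # the first codepoint beyond the 16-bit range

# UTF-8: four 8-bit code units
print([hex(u) for u in c.encode("utf-8")])            # ['0xf0', '0x90', '0x80', '0x80']

# UTF-16 (big-endian, no BOM): two 16-bit code units -- a surrogate pair
pair = struct.unpack(">2H", c.encode("utf-16-be"))
print([hex(u) for u in pair])                         # ['0xd800', '0xdc00']

# UTF-32: one 32-bit code unit, equal to the codepoint itself
print(struct.unpack(">I", c.encode("utf-32-be"))[0])  # 65536
```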

You can learn all about this in much more detail here: http://www.unicode.org/faq/utf_bom.html

Hope that helps
Arcane Jill



October 04, 2004
Arcane Jill wrote:

<snip>
> In general, an "encoding" is a bidirectional mapping

Bidirectional?  Shouldn't the other way be a "decoding"?

> which maps each codepoint to an array of fixed-width objects called "code units".

What character set are ISO-8859-x et al encodings of, for that matter?

Or do certain web browsers/news clients say "encoding" when half the time they really mean "character set"?

(The two nomers could come together, if each is its own character set, encoded by the identity function.)

Stewart.
October 04, 2004
In article <cjrg2e$2eg5$1@digitaldaemon.com>, Stewart Gordon says...
>
>Bidirectional?  Shouldn't the other way be a "decoding"?

Hehe. Then maybe I should have said "reversible"? Or maybe I should have been defining "encoding scheme" rather than "encoding"? Anyway, I don't think it matters too much, it was just a rough explanation. Anyone who wants the details should head over to the glossary on the Unicode web site.


>What character set are ISO-8859-x et al encodings of, for that matter?

Themselves. All the 8-bit-wide character sets are encoded trivially: each codepoint is just the single byte with the same value.
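Concretely (a Python sketch, just to show how trivial the mapping is):

```python
# In an 8-bit character set such as ISO-8859-1, the encoding is the
# identity mapping: codepoint n <-> the single byte n, for n in 0..255.
for n in range(256):
    assert bytes([n]).decode("iso-8859-1") == chr(n)

print(hex("é".encode("iso-8859-1")[0]))  # 0xe9, which is also ord('é')
```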


>Or do certain web browsers/news clients say "encoding" when half the time they really mean "character set"?

Yeah, it's one of those historical accidents - like when you see in an HTTP header:

#   Content-Type: text/html; charset=UTF-8

it should probably really say encoding, but that's just how it ended up.

>(The two nomers could come together, if each is its own character set, encoded by the identity function.)

Yeah. As you quite rightly point out, with 8-bit character sets, there's basically no difference.

Jill


October 04, 2004
I think you should add this to the D wiki!


October 05, 2004
In article <cjrg2e$2eg5$1@digitaldaemon.com>, Stewart Gordon says...

>What character set are ISO-8859-x et al encodings of, for that matter? Or do certain web browsers/news clients say "encoding" when half the time they really mean "character set"?

Another way of looking at it (which certainly works from the point of view of web browsers and web pages) is that the character set is Unicode, and that the 8-bit character sets may be viewed as encodings of Unicode, so that (for example), in WINDOWS-1252, the Unicode character U+20AC is encoded as 0x80.
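In Python terms (illustrative only, not anything D-specific):

```python
# Viewed as an encoding of (a subset of) Unicode, WINDOWS-1252
# encodes the codepoint U+20AC as the single byte 0x80:
print("€".encode("windows-1252"))   # b'\x80'

# Plain ISO-8859-1, by contrast, has no byte for U+20AC at all:
try:
    "€".encode("iso-8859-1")
except UnicodeEncodeError:
    print("U+20AC is not representable in Latin-1")
```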

Of course, this flies in the face of my earlier statement that encodings are reversible, but I think I was wrong on that point. (An encoding can only be reversible if it can losslessly encode the entirety of its character set. The UTFs are all reversible). Fortunately, HTML (and XML) gives us a workaround, because of course even in Latin-1, you can still use:

#    &#x0410;

to get Cyrillic uppercase A.

Jill

PS. This is probably all too vague to go in a WIKI. Maybe it would be better if the WIKI just directed people to the glossary on the Unicode web site. It's a bit wordier, but a lot more accurate.



October 05, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cjtfkv$2g3b$1@digitaldaemon.com...
> PS. This is probably all too vague to go in a WIKI. Maybe it would be better if the WIKI just directed people to the glossary on the Unicode web site. It's a bit wordier, but a lot more accurate.

I think it should go into the Wiki because:

1) It repeatedly comes up with regards to D because a language that supports UTF so thoroughly is new and people are not familiar with the UTF issues yet.

2) You write good explanations. They'd be easier to find in the Wiki than on the n.g.

3) I think people would be more comfortable to start with the nice overviews you write. The wordier, pedantic explanations can come later.

4) The Wiki isn't like a book where one has to be ruthless in maintaining tight focus. The D Wiki is free to explore related topics to whatever depth is interesting to D programmers.


October 05, 2004
Arcane Jill wrote:
<snip>
> (An encoding can only be reversible if it can losslessly encode the entirety of its character set. The
> UTFs are all reversible).

So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?

> Fortunately, HTML (and XML) gives us a workaround,
> because of course even in Latin-1, you can still use:
> 
> #    &#x0410;
> 
> to get Cyrillic uppercase A.
<snip>

AIUI, &# codes are supposed to be Unicode, regardless of the character set in which the HTML file is actually encoded.  Just like character escapes in D are independent of the source encoding.

Stewart.
October 05, 2004
In article <cjuilc$cju$1@digitaldaemon.com>, Walter says...
>
>
>"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cjtfkv$2g3b$1@digitaldaemon.com...
>> PS. This is probably all too vague to go in a WIKI. Maybe it would be better if the WIKI just directed people to the glossary on the Unicode web site. It's a bit wordier, but a lot more accurate.
>
>I think it should go into the Wiki because:
>
>1) It repeatedly comes up with regards to D because a language that supports UTF so thoroughly is new and people are not familiar with the UTF issues yet.
>
>2) You write good explanations. They'd be easier to find in the Wiki than on the n.g.
>
>3) I think people would be more comfortable to start with the nice overviews you write. The wordier, pedantic explanations can come later.
>
>4) The Wiki isn't like a book where one has to be ruthless in maintaining tight focus. The D Wiki is free to explore related topics to whatever depth is interesting to D programmers.

I agree. I added it to http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues (with a link back to AJ's post).

jcc7
October 06, 2004
In article <cjujiv$dmb$1@digitaldaemon.com>, Stewart Gordon says...
>
>Arcane Jill wrote:
><snip>
>> (An encoding can only be reversible if it can losslessly encode the entirety of its character set. The UTFs are all reversible).
>
>So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?

Well, usually there's some sort of default replacement character - like if you're converting from a Russian character set to a Latin one, you sometimes end up replacing some of the characters with '?'. That would be non-reversible.

But on the other hand, if you instead replaced missing characters with "&#x----;" (with Unicode codepoint inserted) then it suddenly becomes reversible. I'm not sure if that's cheating, but it works well enough for the internet.
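Both behaviours are one line each in Python's codecs (a sketch; nothing D-specific here):

```python
s = "Привет"  # Russian text with no Latin-1 representation

# Default-replacement style: lossy, and therefore non-reversible
print(s.encode("iso-8859-1", errors="replace"))
# b'??????'

# Character-reference style: each missing character becomes &#...;,
# so the original codepoints survive the round trip
print(s.encode("iso-8859-1", errors="xmlcharrefreplace"))
# b'&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;'
```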




>AIUI, &# codes are supposed to be Unicode, regardless of the character set in which the HTML file is actually encoded.

Yes, that's spot on. (Although try telling that to Microsoft, who insist on displaying &#128; as '€' in Internet Explorer! But yes - you're right; Microsoft is wrong).
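Python's html module illustrates the rule (for these codepoints, at least; shown here purely as a convenient way to check it):

```python
import html

# Numeric character references name Unicode codepoints, independent
# of whatever character set the surrounding document is encoded in:
print(html.unescape("&#x0410;"))    # А  (U+0410, Cyrillic capital A)
print(html.unescape("&#x20AC;"))    # €  (U+20AC, the euro sign)
```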

Jill

