October 06, 2004
Arcane Jill wrote:

> In article <cjujiv$dmb$1@digitaldaemon.com>, Stewart Gordon says...
> 
>> Arcane Jill wrote: <snip>
>> 
>>> (An encoding can only be reversible if it can losslessly encode the entirety of its character set. The UTFs are all reversible).
>> 
>> So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?
> 
> Well, usually there's some sort of default replacement character - like if you're converting from a Russian character set to a Latin one, you sometimes end up replacing some of the characters with '?'. That would be non-reversible.

Of course, from an implementation point of view, there's also the option of throwing an exception for untranslatable characters...
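(For what it's worth - in Python rather than D, but its standard codec error handlers illustrate both behaviours neatly: 'replace' substitutes '?', and the default 'strict' handler throws.)

  # Lossy conversion: untranslatable characters become '?' - not reversible.
  text = "Привет €"                                  # no Latin-1 mapping
  print(text.encode("latin-1", errors="replace"))    # b'?????? ?'

  # Or refuse outright: the default 'strict' handler raises an exception.
  try:
      text.encode("latin-1")
  except UnicodeEncodeError as e:
      print("untranslatable character:", e.reason)   # ordinal not in range(256)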

> But on the other hand, if you instead replaced missing characters with "&#x----;" (with Unicode codepoint inserted) then it suddenly becomes reversible. I'm not sure if that's cheating, but it works well enough for the internet.

As long as at least one of the characters '&', '#', 'x', ';' is itself also encoded - otherwise a literal "&#x----;" in the source text would be indistinguishable from an escape sequence.
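(Again just as an illustration in Python rather than anything D-specific: its 'xmlcharrefreplace' handler does exactly Jill's trick, though with decimal rather than hex references, and it leaves a literal '&' alone - which is precisely the ambiguity above.)

  # Replace unencodable characters with numeric character references.
  print("price: 100 €".encode("latin-1", errors="xmlcharrefreplace"))
  # b'price: 100 &#8364;'   (decimal form of U+20AC)

  # The caveat: '&' itself passes through untouched, so a literal
  # "&#8364;" already in the source produces the same bytes as an
  # escaped euro sign - indistinguishable on the way back.
  print("literal &#8364;".encode("latin-1", errors="xmlcharrefreplace"))
  # b'literal &#8364;'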

>> AIUI, &# codes are supposed to be Unicode, regardless of the character set in which the HTML file is actually encoded.
> 
> Yes, that's spot on. (Although try telling that to Microsoft, who insist on displaying &#128; as '€' in Internet Explorer! But yes - you're right; Microsoft is wrong).

Quite a few browsers copy M$ in this respect.  Even on this Mac, IE, Safari and Mozilla (which is meant to be standards compliant!) all do it.
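(Easy to check - again in Python, nothing D-specific: &#128; names U+0080, a C1 control character, while the euro sign is U+20AC, i.e. &#8364;. Browsers that show '€' are quietly reading the number through the Windows-1252 table, where byte 0x80 happens to be the euro.)

  # What &#128; actually names, versus what the browsers display.
  print(repr(chr(128)))            # '\x80' - U+0080 is a C1 control character
  print(hex(ord("€")))             # '0x20ac' - the euro sign is U+20AC (&#8364;)
  print(b"\x80".decode("cp1252"))  # '€' - byte 0x80 in Windows-1252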

Stewart.