UTF and U+FFFE and U+FFFF

July 12, 2004
Posted by Arcane Jill
Since there appears to be some confusion on the status of U+FFFE and U+FFFF, I thought I'd quote from the FAQ on the Unicode website itself, at URL: http://www.unicode.org/faq/utf_bom.html

"Q: What is a UTF?

"A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.

"Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must also map all code points that are not valid Unicode characters to unique byte sequences. These invalid code points are the 66 noncharacters (including FFFE and FFFF), as well as unpaired surrogates."


The phrase "every Unicode code point (except surrogate code points)" implies that surrogate code points (those in the range 0xD800 to 0xDFFF) need not be encodable, although, curiously, the second paragraph says "as well as unpaired surrogates", which seems to contradict this. It's a non sequitur in any case, since a surrogate code point CANNOT be expressed on its own in UTF-16, where those values are reserved for the two halves of a surrogate pair. The phrase "including FFFE and FFFF", however, is quite unambiguous.
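
For anyone who wants to check this for themselves, here is a rough sketch in Python 3 (my choice of language for illustration, nothing to do with the FAQ). The helper names round_trips and noncharacters are made up for this post. It assumes Python's built-in strict codecs, which refuse lone surrogates; other implementations may be more lenient.

```python
def round_trips(code_point, encoding):
    """Return True if the code point survives encode -> decode unchanged."""
    text = chr(code_point)
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # Python's strict codecs raise here for lone surrogates.
        return False


def noncharacters():
    """List the 66 noncharacters: U+FDD0..U+FDEF (32 code points) plus
    the last two code points of each of the 17 planes (34 code points)."""
    result = list(range(0xFDD0, 0xFDF0))
    for plane in range(17):
        base = plane * 0x10000
        result += [base + 0xFFFE, base + 0xFFFF]
    return result


if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        ok = all(round_trips(cp, enc) for cp in noncharacters())
        print(enc, "noncharacters round-trip:", ok,
              "| U+D800 round-trips:", round_trips(0xD800, enc))
```

If the codecs behave as I expect, every noncharacter (U+FFFE and U+FFFF included) round-trips in all three encodings, while a lone U+D800 is rejected.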

Jill