isValidDchar error

July 09, 2004
Posted by Arcane Jill
The source code for the function isValidDchar reads as follows:

#    bit isValidDchar(dchar c)
#    {
#        return c < 0xD800 ||
#         (c > 0xDFFF && c <= 0x10FFFF && c != 0xFFFE && c != 0xFFFF);
#    }

I believe that this implementation is incorrect. You see, the codepoints U+FFFE and U+FFFF are not, in fact, invalid. They are simply permanently unassigned. Their General Category is Cn. They are legal codepoints (though unassigned characters).

Now, MOST unassigned characters are only temporarily unassigned. There is always the possibility that what is unassigned today may be assigned tomorrow. However, there is a small list of codepoints for which this is not the case (see below). For this small list of codepoints, the Unicode Consortium guarantee that they will remain permanently unassigned. U+FFFE and U+FFFF are in this list. (In fact, this was the reason that U+FFFF was chosen as the value for wchar.init and dchar.init).
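
The claim about the default initializers can be checked directly. This is only a minimal sketch, assuming a D compiler in which wchar.init and dchar.init still have the values described above; the function name initsAreNoncharacters is hypothetical and exists just to make the check callable.

```d
// Compile-time check: both default initializers are the noncharacter U+FFFF.
static assert(wchar.init == 0xFFFF);
static assert(dchar.init == 0xFFFF);

/// Hypothetical helper, for illustration only: the same check at run time.
bool initsAreNoncharacters()
{
    return wchar.init == 0xFFFF && dchar.init == 0xFFFF;
}
```

If either assumption were false, the static asserts would stop compilation rather than fail at run time.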

To quote from the definitive Unicode code chart UFFF0.pdf (the chart for the Specials block), available from the Unicode website:

Noncharacters:

/These codes are intended for process internal uses, but are not permitted for interchange/

FFFE - <not a character>
* the value FFFE is guaranteed not to be a Unicode character at all
* may be used to detect byte order by contrast with FEFF which is a character.

FFFF - <not a character>
* the value FFFF is guaranteed not to be a Unicode character at all


Now, the key sentence here is the one which reads: "These codes are intended for process internal uses, but are not permitted for interchange". In other words, they are PERMITTED for application-internal use (such as delimiting a piece of Unicode text, or as an appropriate value for dchar.init) but PROHIBITED for representing characters in interchange.

In addition, the equally definitive PropList.txt file lists properties for these codepoints, as follows:

FDD0..FDEF    ; Noncharacter_Code_Point # Cn  [32]
FFFE..FFFF    ; Noncharacter_Code_Point # Cn   [2]
1FFFE..1FFFF  ; Noncharacter_Code_Point # Cn   [2]
2FFFE..2FFFF  ; Noncharacter_Code_Point # Cn   [2]
3FFFE..3FFFF  ; Noncharacter_Code_Point # Cn   [2]
4FFFE..4FFFF  ; Noncharacter_Code_Point # Cn   [2]
5FFFE..5FFFF  ; Noncharacter_Code_Point # Cn   [2]
6FFFE..6FFFF  ; Noncharacter_Code_Point # Cn   [2]
7FFFE..7FFFF  ; Noncharacter_Code_Point # Cn   [2]
8FFFE..8FFFF  ; Noncharacter_Code_Point # Cn   [2]
9FFFE..9FFFF  ; Noncharacter_Code_Point # Cn   [2]
AFFFE..AFFFF  ; Noncharacter_Code_Point # Cn   [2]
BFFFE..BFFFF  ; Noncharacter_Code_Point # Cn   [2]
CFFFE..CFFFF  ; Noncharacter_Code_Point # Cn   [2]
DFFFE..DFFFF  ; Noncharacter_Code_Point # Cn   [2]
EFFFE..EFFFF  ; Noncharacter_Code_Point # Cn   [2]
FFFFE..FFFFF  ; Noncharacter_Code_Point # Cn   [2]
10FFFE..10FFFF; Noncharacter_Code_Point # Cn   [2]

The etc.unicode function isNoncharacterCodePoint() will return true for every codepoint in this list, and for no others. FFFE and FFFF may be noncharacter codepoints, but so is FDD0, so is FDE0, and so on. That doesn't make them invalid dchars - it just makes them noncharacter codepoints.

(On the other hand, if you're going to treat FFFE and FFFF as invalid, you should treat all 66 codepoints listed above as equally invalid).
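
To make the "all 66" point concrete, here is a rough sketch of what the stricter alternative might look like. Neither function is taken from Phobos or etc.unicode: isNoncharacter and isValidDcharStrict are hypothetical names, and the code assumes a compiler that has `bool` (substitute `bit` otherwise). The bit-mask test relies on the pattern visible in the list above: the last two codepoints of each of the 17 planes end in FFFE or FFFF.

```d
/// Hypothetical helper (not the etc.unicode implementation): true for
/// exactly the noncharacter codepoints listed in PropList.txt.
bool isNoncharacter(dchar c)
{
    return (c >= 0xFDD0 && c <= 0xFDEF)                // the block of 32
        || (c <= 0x10FFFF && (c & 0xFFFE) == 0xFFFE);  // xxFFFE/xxFFFF in each plane
}

/// Hypothetical stricter check: rejects surrogates, out-of-range values,
/// and every noncharacter, not just FFFE and FFFF.
bool isValidDcharStrict(dchar c)
{
    return (c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF))
        && !isNoncharacter(c);
}

unittest
{
    // Walk the whole codespace and count the noncharacters.
    int count = 0;
    foreach (uint u; 0 .. 0x110000)
        if (isNoncharacter(cast(dchar) u))
            ++count;
    assert(count == 66);   // 32 + 17 planes * 2

    assert(!isValidDcharStrict(cast(dchar) 0xFFFE));
    assert(!isValidDcharStrict(cast(dchar) 0xFDD0));
    assert(isValidDcharStrict('A'));
}
```

Whichever policy is chosen, the point is consistency: the check treats FDD0 and 1FFFE exactly as it treats FFFE and FFFF.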

I suggest that you rewrite isValidDchar() as follows:

#    bit isValidDchar(dchar c)
#    {
#        // valid: anything up to 0x10FFFF that is not a surrogate (D800..DFFF)
#        return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
#    }
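
A few boundary checks show what the suggested version accepts and rejects. This is a sketch only: it restates the function with `bool` in place of `bit` (an assumption, for compilers that no longer have `bit`), and the unittest block is my own illustration rather than anything from the library.

```d
bool isValidDchar(dchar c)
{
    // valid: anything up to 0x10FFFF that is not a surrogate (D800..DFFF)
    return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
}

unittest
{
    assert( isValidDchar(cast(dchar) 0xD7FF));    // last codepoint before the surrogates
    assert(!isValidDchar(cast(dchar) 0xD800));    // first surrogate
    assert(!isValidDchar(cast(dchar) 0xDFFF));    // last surrogate
    assert( isValidDchar(cast(dchar) 0xE000));    // first codepoint after the surrogates
    assert( isValidDchar(cast(dchar) 0xFFFE));    // noncharacter, but a valid dchar
    assert( isValidDchar(cast(dchar) 0xFFFF));    // noncharacter, and dchar.init
    assert( isValidDchar(cast(dchar) 0x10FFFF));  // top of the codespace
    assert(!isValidDchar(cast(dchar) 0x110000));  // beyond the codespace
}
```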

Arcane Jill