February 01, 2008
On Feb 1, 2008 12:03 AM, Frits van Bommel <fvbommel@remwovexcapss.nl> wrote:
> Some code points
> expand to 2 or 3 codepoints when uppercased. One common case is U+00DF
> "ß", LATIN SMALL LETTER SHARP S, which expands to "SS" (two characters)
> when uppercased[1]. Another example from the Unicode standard, U+0390,
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS apparently expands to
> three codepoints.

I know. I would have mentioned that, but I didn't want to needlessly complicate the issue.

But Unicode makes a distinction between "simple casing" and "full casing". What you're talking about is full casing. In simple casing, one character always maps to exactly one character, so you could uppercase U+00DF (to itself) using simple casing. With full casing, as you quite rightly point out, one can only case-change strings, not individual characters.
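To make the distinction concrete, here is a minimal Python sketch. Python's `str.upper()` applies full casing; the standard library has no simple-casing call that I know of, so `approx_simple_upper` is a hypothetical helper that only approximates it by refusing any mapping that would change the number of code points. It will differ from true simple casing for the few characters whose simple and full mappings are both defined but different.

```python
def full_upper(s: str) -> str:
    """Full casing: the result may contain more code points than the input."""
    return s.upper()


def approx_simple_upper(s: str) -> str:
    """Approximation of simple casing (hypothetical helper, not a real API).

    Each code point is mapped on its own, and the mapping is kept only
    when the result is still exactly one code point. Otherwise the
    original code point is left unchanged.
    """
    out = []
    for ch in s:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


if __name__ == "__main__":
    word = "stra\u00dfe"                    # "straße"
    print(full_upper(word))                 # length grows by one
    print(approx_simple_upper(word))        # length is preserved
    print(len(full_upper("\u0390")))        # code points after full casing
```

The approximate version always returns a string of the same length as its input, which is what makes a per-character (or in-place) case change possible at all.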

February 01, 2008
On Feb 1, 2008 7:50 AM, Janice Caron <caron800@googlemail.com> wrote:
> So you could uppercase U+00DF (to
> itself) using simple-casing.

I'm obviously telling you stuff you already know; I apologise. I would imagine that normalisation forms probably also complicate full casing.

One interesting thing is that simple casing also works just fine for wchar (that is, UTF-16). That's because every letter of every living language will be found in the Basic Multilingual Plane (the range U+0000 to U+FFFF). Codepoints outside this range are either symbols or letters of dead languages (or combining characters, etc.), so it's probably safe to leave all non-BMP codepoints unchanged when case-changing. Every codepoint in the BMP occupies a single UTF-16 code unit.
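A rough sketch of that idea in Python, working on raw UTF-16 code units: surrogate halves (the two units that encode a non-BMP codepoint) are passed through untouched, and BMP units are uppercased only when the result is again a single BMP unit. The helper names `to_units`, `from_units` and `upper_units` are my own, and the one-unit check is only an approximation of simple casing.

```python
import struct


def to_units(s: str) -> list:
    """Encode a string as a list of 16-bit UTF-16 code units."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def from_units(units: list) -> str:
    """Decode a list of UTF-16 code units back into a string."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")


def upper_units(units: list) -> list:
    """Uppercase unit by unit, leaving surrogate halves unchanged."""
    out = []
    for u in units:
        if 0xD800 <= u <= 0xDFFF:
            out.append(u)  # half of a non-BMP codepoint: leave as-is
            continue
        up = chr(u).upper()
        # Keep the mapping only if it is still one BMP code unit.
        if len(up) == 1 and ord(up) <= 0xFFFF:
            out.append(ord(up))
        else:
            out.append(u)
    return out


if __name__ == "__main__":
    # U+10428 is a non-BMP letter (Deseret), so it takes two code units
    # and is deliberately left unchanged by this scheme.
    text = "stra\u00dfe \U00010428"
    print(len(to_units(text)))
    print(from_units(upper_units(to_units(text))))
```

The number of code units never changes, so the same loop could run in place over a wchar buffer.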

In many languages (e.g. Chinese), a UTF-16 string is likely to be shorter than the corresponding UTF-8 string. This makes me suspect that UTF-16 may well be the ideal choice for string representation in the real world. (It's what Java went with.) Maybe UTF-16, not UTF-8, should be the default kind of string?
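The size difference is easy to check. This small Python sketch compares encoded byte lengths; the sample strings are arbitrary choices of mine, and `encoded_sizes` is just an illustrative helper.

```python
def encoded_sizes(s: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a string.

    "utf-16-le" is used so that no byte-order mark is counted.
    """
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


if __name__ == "__main__":
    for sample in ("\u4e2d\u6587\u5b57\u7b26\u4e32", "hello"):
        utf8, utf16 = encoded_sizes(sample)
        print(repr(sample), "UTF-8:", utf8, "bytes, UTF-16:", utf16, "bytes")
```

Each of the five Chinese characters takes three bytes in UTF-8 but two in UTF-16, while for plain ASCII text the comparison goes the other way (one byte against two).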