February 01, 2008 Re: Why string alias is invariant ?
Posted in reply to Frits van Bommel
On Feb 1, 2008 12:03 AM, Frits van Bommel <fvbommel@remwovexcapss.nl> wrote:
> Some code points
> expand to 2 or 3 codepoints when uppercased. One common case is U+00DF
> "ß", LATIN SMALL LETTER SHARP S, which expands to "SS" (two characters)
> when uppercased[1]. Another example from the Unicode standard, U+0390,
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS apparently expands to
> three codepoints.
I know. I would have mentioned that, but I didn't want to needlessly complicate the issue.
But Unicode makes a distinction between "simple casing" and "full casing". What you're talking about is full casing. In simple casing, one character maps to one character. So you could uppercase U+00DF (to itself) using simple-casing. When using full casing, as you quite rightly point out, one can only case-change strings, not characters.
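To make the distinction concrete, here is a minimal sketch in Python, whose `str.upper()` applies full casing. The helper `simple_upper_approx` is a hypothetical name and only an approximation of simple casing: it assumes that a character whose full uppercase form is more than one code point should be left unchanged, which does not match the Unicode simple mappings for every character.

```python
def simple_upper_approx(ch: str) -> str:
    """Approximate simple (one-to-one) uppercasing of a single character.

    Assumption: if the full uppercase mapping is a single code point, use it;
    otherwise leave the character unchanged. This is NOT the exact Unicode
    simple case mapping for every character.
    """
    up = ch.upper()  # str.upper() applies full case mapping
    return up if len(up) == 1 else ch


sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S
iota = "\u0390"     # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS

full_sharp_s = sharp_s.upper()                 # full casing: two code points
full_iota = iota.upper()                       # full casing: three code points
simple_sharp_s = simple_upper_approx(sharp_s)  # one-to-one: stays U+00DF
```

Because full casing can change the length of the text, it only makes sense as a string-to-string operation, whereas the one-to-one variant can be applied character by character.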
February 01, 2008 Re: Why string alias is invariant ?
On Feb 1, 2008 7:50 AM, Janice Caron <caron800@googlemail.com> wrote:
> So you could uppercase U+00DF (to
> itself) using simple-casing.
I'm obviously telling you stuff you already know - I apologise. I would imagine that normalisation forms probably also complicate full casing.
One interesting thing is that simple casing also works just fine for wchar (that is, UTF-16). That's because every letter of every living language will be found in the Basic Multilingual Plane (the range U+0000 to U+FFFF). Codepoints outside this range are either symbols or letters of dead languages (or combining characters, etc.), so it's probably safe to leave all non-BMP codepoints unchanged when case-changing. Codepoints in the BMP occupy a single UTF-16 code unit.
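A rough sketch of that idea in Python, for illustration only: `utf16_units` and `upper_bmp_only` are hypothetical helper names, and the rule "uppercase only when both the input and the result are single BMP codepoints" is my assumption about what leaving non-BMP codepoints unchanged would look like, not a standard algorithm.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode one codepoint in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def upper_bmp_only(s: str) -> str:
    """Uppercase one codepoint at a time, leaving non-BMP codepoints unchanged.

    A character is replaced only if it lies in the BMP and its uppercase form
    is a single BMP codepoint, so each UTF-16 code unit maps to exactly one
    code unit and the encoded length never changes.
    """
    out = []
    for ch in s:
        up = ch.upper()
        if ord(ch) <= 0xFFFF and len(up) == 1 and ord(up) <= 0xFFFF:
            out.append(up)
        else:
            out.append(ch)  # non-BMP, or a mapping that would expand
    return "".join(out)


bmp_letter = "\u00e9"          # é, inside the BMP
non_bmp_letter = "\U00010428"  # DESERET SMALL LETTER LONG I, outside the BMP

result = upper_bmp_only("abc" + non_bmp_letter)
```

Here `utf16_units(bmp_letter)` is 1 while `utf16_units(non_bmp_letter)` is 2 (a surrogate pair), and `result` keeps the Deseret letter as it was.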
In many languages (e.g. Chinese), a UTF-16 string is likely to be shorter than the corresponding UTF-8 string. This makes me suspect that UTF-16 may well be the ideal choice for string representation in the real world. (It's what Java went with). Maybe UTF-16, not UTF-8, should be the default kind of string?
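The size difference is easy to check. A small Python sketch, with sample strings picked by me purely for illustration (the `utf-16-le` codec is used so that no byte-order mark is counted):

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


chinese = "\u4e2d\u6587\u5b57\u7b26\u4e32"  # five CJK characters, all in the BMP
ascii_text = "hello"                        # five ASCII characters

chinese_sizes = encoded_sizes(chinese)  # 3 bytes each in UTF-8, 2 in UTF-16
ascii_sizes = encoded_sizes(ascii_text)  # 1 byte each in UTF-8, 2 in UTF-16
```

The trade-off runs the other way for ASCII-heavy text, where UTF-8 takes half the space of UTF-16.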
Copyright © 1999-2021 by the D Language Foundation