September 29, 2004
In article <cje7o0$rj6$1@digitaldaemon.com>, Arcane Jill says...

>A legacy 16-bit-Unicode application looking at the same text file (assuming it to have been saved in UTF-16) will see two "characters": U+D874 followed by U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E).

Erratum.

Whoops! UTF-16 for U+1D11E is actually D834 followed by DD1E. (That'll teach me not to try UTF-16 transcoding by hand in future!)

The logic of the post still holds, however.
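
For the record, here's a minimal D sketch of the surrogate-pair arithmetic, so
nobody has to transcode by hand (the names are mine, not library code):

import std.stdio;

void main()
{
    // Split U+1D11E (MUSICAL SYMBOL G CLEF) into its UTF-16 surrogate pair.
    uint c = 0x1D11E;
    uint v = c - 0x10000;                        // 0xD11E, a 20-bit value
    uint hi = 0xD800 + (v >> 10);                // high surrogate: 0xD834
    uint lo = 0xDC00 + (v & 0x3FF);              // low surrogate:  0xDD1E
    writefln("U+%05X -> %04X %04X", c, hi, lo);  // U+1D11E -> D834 DD1E
}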

Jill


September 29, 2004
Arcane Jill wrote:
> In article <cjc38q$2jna$2@digitaldaemon.com>, Benjamin Herr says...
> 
>>Arcane Jill wrote:
>>
>>>*) If you just want basic Unicode support which works in all but exceptional
>>>circumstances, you can make do with UTF-16, and the pretence that characters are
>>>16 bits wide.
>>
>>I guess I really do not get it. I thought I was just told that codepoints might be only 16 bits wide but that I always have to account for multi-codepointy chars?
>>*more clueless*
> 
> 
> Head out to www.unicode.org and check out their various FAQs. They do a much
> better job at explaining things than I.
> 
> 
> For what it's worth, here's my potted summary:

Cool. I added this to a wiki page: http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

> 
> "code unit" = the technical name for a single primitive fragment of either
> UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or
> dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32
> fragment to express this concept.
> 
> "code point" = the technical name for the numerical value associated with a
> character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, a
> codepoint can only be stored in a dchar.
> 
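(To make the unit-versus-point distinction concrete, here's a minimal D sketch.
The c/w/d literal suffixes select the encoding, and .length counts code units,
not characters.)

import std.stdio;

void main()
{
    // The single code point U+1D11E in each of D's three UTF encodings.
    auto utf8  = "\U0001D11E"c;  // 4 UTF-8  code units
    auto utf16 = "\U0001D11E"w;  // 2 UTF-16 code units (a surrogate pair)
    auto utf32 = "\U0001D11E"d;  // 1 UTF-32 code unit == 1 code point
    writefln("%s %s %s", utf8.length, utf16.length, utf32.length);  // 4 2 1
}
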
> "character" = officially, the smallest unit of textual information with semantic
> meaning. Practically speaking, this means either (a) a control code; (b)
> something printable; or (c) a combiner, such as an accent you can place over
> another character. Every character has a unique codepoint. Conversely, almost
> every codepoint in the range 0 to 0x10FFFF corresponds to a unique Unicode
> character (the surrogate codepoints 0xD800 to 0xDFFF, which are reserved for
> UTF-16, are the notable exception).
> Unicode characters are often written in the form U+#### (for example, U+20AC,
> which is the character corresponding to codepoint 0x20AC).
> 
> As an observation, over 99% of all the characters you are likely to use, and
> which are involved in text processing, will occur in the range U+0000 to U+FFFF.
> Therefore an array of sixteen-bit values interpreted as characters will likely
> be sufficient for most purposes. (A UTF-16 string may be interpreted in this
> way). If you want that extra 1%, as some apps will, you'll need to go the whole
> hog and recognise characters all the way up to U+10FFFF.
> 
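(Here's a minimal sketch of how a program might check whether the 16-bit
pretence is safe for a given string; fitsInBMP is a made-up name, not a library
function. Any surrogate fragment in a wchar[] means the text contains a
character beyond U+FFFF.)

import std.stdio;

// True if every character in s occupies a single UTF-16 code unit,
// i.e. the "characters are 16 bits wide" pretence is safe for s.
bool fitsInBMP(const(wchar)[] s)
{
    foreach (wchar w; s)
        if (w >= 0xD800 && w <= 0xDFFF)  // surrogate fragment found
            return false;
    return true;
}

void main()
{
    writeln(fitsInBMP("hello\u20AC"w));  // true:  everything is in the BMP
    writeln(fitsInBMP("\U0001D11E"w));   // false: needs a surrogate pair
}
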
> "grapheme" = a printable base character which may have been modified by zero or
> more combining characters (for example 'a' followed by combining-acute-accent).
> 
> "glyph" = one or more graphemes glued together to form a single printable
> symbol. The Unicode character zero-width-joiner usually acts as the glue.
> 
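(Concretely, in D: two code points, one grapheme. A minimal sketch:)

import std.stdio;

void main()
{
    // 'a' plus COMBINING ACUTE ACCENT (U+0301): two characters (two code
    // points), but a single grapheme, displayed as one accented letter.
    auto g = "a\u0301"d;
    writeln(g.length, " code points, 1 grapheme");  // 2 code points, 1 grapheme
}
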
> For more detailed information, as I suggested above, please feel free to go to
> the Unicode website, and get all the details from the people who organize the
> whole thing.
> 
> Arcane Jill
> 
> 


-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
September 30, 2004
Arcane Jill: Thanks as always for the clear insight! I now have a better
understanding of how 16-bit characters (aka UTF-16 / wchar[]) and Unicode (v3.0
/ v4.0) map onto one another. :))

I hope your ICU conversion work is coming along fine.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"