September 29, 2004 Re: UTF-16 wchar[] consistency | ||||
---|---|---|---|---|
| ||||
Posted in reply to Arcane Jill | In article <cje7o0$rj6$1@digitaldaemon.com>, Arcane Jill says... >A legacy 16-bit-Unicode application looking at the same text file (assuming it to have been saved in UTF-16) will see two "characters": U+D874 followed by U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E). Erratum. Whoops! UTF-16 for 1D11E is actually D834 followed by DD1E. (That'll teach me not to try UTF-16 transcoding by hand in future!) The logic of the post still holds, however. Jill |
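The corrected pair in the erratum can be checked mechanically rather than by hand. Below is a minimal sketch in Python (the helper name `to_surrogates` is made up for illustration, it is not from the thread). It applies the standard UTF-16 rule for code points above U+FFFF: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E
```

This agrees with the erratum: U+1D11E is D834 followed by DD1E, which is what a 16-bit-only application would see as two separate "characters".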
September 29, 2004 Re: UTF-8 char[] consistency | ||||
---|---|---|---|---|
| ||||
Posted in reply to Arcane Jill | Arcane Jill wrote: > In article <cjc38q$2jna$2@digitaldaemon.com>, Benjamin Herr says... > >>Arcane Jill wrote: >> >>>*) If you just want basic Unicode support which works in all but exceptional >>>circumstances, you can make do with UTF-16, and the pretence that characters are >>>16-bits wide. >> >>I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? >>*more clueless* > > > Head out to www.unicode.org and check out their various FAQs. They do a much > better job at explaining things than I. > > > For what it's worth, here's my potted summary: Cool. I added this to a wiki page: http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues > > "code unit" = the technical name for a single primitive fragment of either > UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or > dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32 > fragment to express this concept. > > "code point" = the technical name for the numerical value associated with a > character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, a > codepoint can only be stored in a dchar. > > "character" = officially, the smallest unit of textual information with semantic > meaning. Practically speaking, this means either (a) a control code; (b) > something printable; or (c) a combiner, such as an accent you can place over > another character. Every character has a unique codepoint. Conversely, every > codepoint in the range 0 to 0x10FFFF corresponds to a unique Unicode character. > Unicode characters are often written in the form U+#### (for example, U+20AC, > which is the character corresponding to codepoint 0x20AC). > > As an observation, over 99% of all the characters you are likely to use, and > which are involved in text processing, will occur in the range U+0000 to U+FFFF. 
> Therefore an array of sixteen-bit values interpreted as characters will likely > be sufficient for most purposes. (A UTF-16 string may be interpreted in this > way). If you want that extra 1%, as some apps will, you'll need to go the whole > hog and recognise characters all the way up to U+10FFFF. > > "grapheme" = a printable base character which may have been modified by zero or > more combining characters (for example 'a' followed by combining-acute-accent). > > "glyph" = one or more graphemes glued together to form a single printable > symbol. The Unicode character zero-width-joiner usually acts as the glue. > > For more detailed information, as I suggested above, please feel free to go to > the Unicode website, and get all the details from the people who organize the > whole thing. > > Arcane Jill > > -- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/ |
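The difference between code units and code points in the summary above can be made concrete by counting them. This is only an illustrative sketch, assuming Python 3, where `len()` of a `str` counts code points; the function name `code_unit_counts` and the sample strings are hypothetical. The three encoded counts correspond to what the post describes as the values held in a single char, wchar or dchar.

```python
def code_unit_counts(text):
    """Count code points and the code units needed in each UTF encoding form."""
    return {
        "code points": len(text),
        "UTF-8 code units": len(text.encode("utf-8")),            # 1 byte each
        "UTF-16 code units": len(text.encode("utf-16-le")) // 2,  # 2 bytes each
        "UTF-32 code units": len(text.encode("utf-32-le")) // 4,  # 4 bytes each
    }


samples = {
    "euro sign U+20AC": "\u20ac",
    "musical G clef U+1D11E": "\U0001d11e",
    "'a' + combining acute accent": "a\u0301",
}

for label, text in samples.items():
    print(label, code_unit_counts(text))
```

The euro sign fits in one UTF-16 code unit, so the "characters are 16 bits wide" pretence holds for it. The G clef needs two UTF-16 code units for one code point, and the accented 'a' is one grapheme built from two code points, which is where that pretence breaks down.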
September 30, 2004 Re: UTF-16 wchar[] consistency | ||||
---|---|---|---|---|
| ||||
Posted in reply to Arcane Jill | Arcane Jill: Thxs as always for the clear insight! I now have a better understanding of how 16-bit characters (aka UTF-16 / wchar[]) and Unicode (v3.0 / v4.0) match against one another. :)) I hope your ICU conversion work is coming along fine. ------------------------------------------------------------------- "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!" |