Posted by Andrew Fedoniouk in reply to Anders F Björklund
Hi, Anders,
I am looking at:
"All UTF-16 code units from 0xD800-0xDFFF are similarly just "surrogates" for a real code point, and *must be occur in pairs that can then be combined to form the real Unicode code unit*. The lower byte of the code units 0x0000-0x00FF are exactly the same as the ISO-8859-1 encoding, and 0x00-0x7F is the same as ASCII. They are also called "wide characters", by some operating systems."
The part in *...* (my marking) is, technically speaking, not the case.
UTF-16 corresponds to UCS-2 (Basic Multilingual Plane, BMP) and does not
need to "occur in pairs".
It depends on the use case: program A may support only UCS-2 while program B
supports UCS-4.
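To make the one-unit versus two-unit distinction concrete, here is a minimal sketch of the UTF-16 arithmetic, written in Python for brevity. The function names are my own and purely illustrative, not from any library: a code point inside the BMP maps to a single 16-bit code unit, while a code point above 0xFFFF maps to a high/low surrogate pair.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point.

    Illustrative helper (hypothetical name), not a library function.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point <= 0xFFFF:
        # BMP: the code unit is the code point itself (same as UCS-2).
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


def from_surrogate_pair(high, low):
    """Combine a high/low surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, `to_utf16_units(0x0416)` (Cyrillic Ж, in the BMP) gives the single unit `[0x0416]`, while `to_utf16_units(0x1D11E)` (musical G clef, outside the BMP) gives the pair `[0xD834, 0xDD1E]`.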
(BMP) The first plane defined in Unicode/ISO 10646, designed to include all scripts in active modern use. The BMP currently includes the Latin, Greek, Cyrillic, Devanagari, hiragana, katakana, and Cherokee scripts, among others, and a large body of mathematical, APL-related, and other miscellaneous characters. Most of the Han ideographs in current use are present in the BMP, but due to the large number of ideographs, many were placed in the Supplementary Ideographic Plane.
Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of functions there (AFAIK) can treat widechars as sequences of UTF-16 codes.
All modern browsers have UCS-2 as their internal representation. JavaScript and Java are also UCS-2-only languages (by their specs).
--------------------------------------------------------------------
I find D's way of treating strings as char[], wchar[] and dchar[] pretty
reasonable, as this allows working with text in the most optimal way. The only
thing I am not sure about yet is whether a string, as an entity, should have
its own methods. It is pretty traditional these days to use strings as
objects: s.substr(1,4). But in fact strings are atomic types, so they should
be handled like any other native type, e.g. int.
Personally I think that substr(s,1,4) is more "honest" than s.substr(1,4).
There are some aesthetic concerns, though. But the deeper I look into the
"string problem", the more I think that strings are not objects in the Java/C#
sense. E.g. the inability to work with strings as sequences (arrays) of
characters is a source of many bottlenecks in these languages.
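As a rough illustration of the free-function style, here is a small Python sketch. It assumes that substr(s,1,4) means "4 elements starting at index 1"; that reading is my assumption, and the function is hypothetical. Because it only relies on slicing, the same function works on a string, an array of characters, or an array of anything else.

```python
def substr(s, start, length):
    """Free function: return `length` elements of `s` starting at `start`.

    Works on any sliceable sequence, so the string needs no methods of
    its own. The (start, length) meaning is an assumption.
    """
    return s[start:start + length]


chars = list("Hello, world")      # the text as a plain array of characters
print(substr(chars, 1, 4))        # slice of the character array
print(substr("Hello, world", 1, 4))  # the same call on a string
```

The point of the sketch is only that nothing in `substr` is specific to strings; it treats the text as an ordinary sequence.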
Andrew Fedoniouk.
http://terrainformatica.com
"Anders F Björklund" <afb@algonet.se> wrote in message news:cuifu3$23kd$1@digitaldaemon.com...
> Here's another long documentation essay,
> on the other "missing" D type: strings...
>
> http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
>
>
> I'll add some D sample code on how to convert
> to and from legacy encodings (manually) later.
> http://www.algonet.se/~afb/d/mapping.zip
> (using ftp://ftp.unicode.org/Public/MAPPINGS/)
>
> And some character tables for US-ASCII and Latin-1,
> http://www.algonet.se/~afb/d/latin1/iso-8859-1.html
> Also needed is how to talk to the Windows console,
> http://www.digitalmars.com/techtips/windows_utf.html
>
>
> But that can wait until after I get back from vacation :-) Any comments can only make it better, here or on Wiki4D...
>
> Share and Enjoy,
> --anders