Thread overview
Chars and Strs
Feb 11, 2005
Andrew Fedoniouk
Feb 12, 2005
Roald Ribe
February 11, 2005
Here's another long documentation essay,
on the other "missing" D type: strings...

http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs


I'll add some D sample code on how to convert
to and from legacy encodings (manually) later.
http://www.algonet.se/~afb/d/mapping.zip
(using ftp://ftp.unicode.org/Public/MAPPINGS/)

And some character tables for US-ASCII and Latin-1,
http://www.algonet.se/~afb/d/latin1/iso-8859-1.html
Also needed is how to talk to the Windows console,
http://www.digitalmars.com/techtips/windows_utf.html


But that can wait until after I get back from vacation :-)
Any comments can only make it better, here or on Wiki4D...

Share and Enjoy,
--anders
February 11, 2005
Hi, Anders,

I am looking at this passage:

"All UTF-16 code units from 0xD800-0xDFFF are similarly just "surrogates" for a real code point, and *must be occur in pairs that can then be combined to form the real Unicode code unit*. The lower byte of the code units 0x0000-0x00FF are exactly the same as the ISO-8859-1 encoding, and 0x00-0x7F is the same as ASCII. They are also called "wide characters", by some operating systems."

The part in *...* (my emphasis) is, technically speaking, not the case.

UTF-16 corresponds to UCS-2 (Basic Multilingual Plane - BMP), and its code
units do not need to "occur in pairs".
It depends on the use case: program A supports only UCS-2 and program B
supports UCS-4.
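
To make the two use cases concrete, here is a small sketch in Python (used only for illustration; the helper name utf16_units is made up for this example). It suggests that a BMP character takes one 16-bit code unit, while a character outside the BMP takes two units, both of which fall in the 0xD800-0xDFFF range. A UCS-2-only program would see those two units as two separate "characters".

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    # Every UTF-16 code unit is 2 bytes; unpack them as unsigned 16-bit values.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp = utf16_units("\u00e5")        # LATIN SMALL LETTER A WITH RING ABOVE (in the BMP)
supp = utf16_units("\U0001D11E")   # MUSICAL SYMBOL G CLEF (outside the BMP)

print([hex(u) for u in bmp])    # ['0xe5']
print([hex(u) for u in supp])   # ['0xd834', '0xdd1e']
```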

(BMP) The first plane defined in Unicode/ISO 10646, designed to include all scripts in active modern use. The BMP currently includes the Latin, Greek, Cyrillic, Devanagari, hiragana, katakana, and Cherokee scripts, among others, and a large body of mathematical, APL-related, and other miscellaneous characters. Most of the Han ideographs in current use are present in the BMP, but due to the large number of ideographs, many were placed in the Supplementary Ideographic Plane.

Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of functions there (AFAIK) can treat widechars as sequences of UTF-16 code units.

All modern browsers have UCS-2 as their internal representation. JavaScript and Java are also UCS-2-only languages (by their specs).

--------------------------------------------------------------------
I've found the D way of treating strings as char[], wchar[] and dchar[] pretty
reasonable, as this allows working with text in the most efficient way. The only
thing I am not sure about yet is a string, as an entity, having its own methods.
It is pretty traditional these days to use strings as objects: s.substr(1,4). But
in fact strings are atomic types, so they should be handled like any other native
type, e.g. int.
Personally I think that substr(s,1,4) is more "honest" than s.substr(1,4),
some aesthetic concerns aside. The deeper I look into the "string
problem", the more I think that strings are not objects in the Java/C#
sense. E.g. the inability to work with strings as sequences (arrays) of
characters is a source of many bottlenecks in these languages.

Andrew Fedoniouk.
http://terrainformatica.com





"Anders F Björklund" <afb@algonet.se> wrote in message news:cuifu3$23kd$1@digitaldaemon.com...
> Here's another long documentation essay,
> on the other "missing" D type: strings...
>
> http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
>
>
> I'll add some D sample code on how to convert
> to and from legacy encodings (manually) later.
> http://www.algonet.se/~afb/d/mapping.zip
> (using ftp://ftp.unicode.org/Public/MAPPINGS/)
>
> And some character tables for US-ASCII and Latin-1,
> http://www.algonet.se/~afb/d/latin1/iso-8859-1.html
> Also needed is how to talk to the Windows console,
> http://www.digitalmars.com/techtips/windows_utf.html
>
>
> But that can wait until after I get back from vacation :-) Any comments can only make it better, here or on Wiki4D...
>
> Share and Enjoy,
> --anders


February 11, 2005
Andrew Fedoniouk wrote:

> "All UTF-16 code units from 0xD800-0xDFFF are similarly just "surrogates" for a real code point, and *must be occur in pairs that can then be combined to form the real Unicode code unit*. The lower byte of the code units 0x0000-0x00FF are exactly the same as the ISO-8859-1 encoding, and 0x00-0x7F is the same as ASCII. They are also called "wide characters", by some operating systems."
> 
> The part in *...* (my emphasis) is, technically speaking, not the case.

Hmm, doesn't even seem to be a real sentence :-) "must be occur"

> UTF-16 corresponds to UCS-2 (Basic Multilingual Plane - BMP), and its code units do not need to "occur in pairs".
> It depends on the use case: program A supports only UCS-2 and program B supports UCS-4.

What I meant to say was that *surrogates* need to be in pairs...
(0xD800-0xDFFF) Not all the other individual UTF-16 code units.

Got it from http://www.unicode.org/faq/utf_bom.html#UTF16
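
As a rough illustration of that pairing rule, the following Python sketch combines a high/low surrogate pair into one code point and rejects a surrogate that shows up on its own. The function name decode_utf16_units is invented for this example and the arithmetic follows the usual UTF-16 definition; it is a sketch, not the code from the wiki page.

```python
def decode_utf16_units(units):
    """Turn a list of 16-bit code units into a list of code points.

    Raises ValueError on a surrogate that is not part of a valid pair.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate at index %d" % i)
            low = units[i + 1]
            # 10 bits come from each half, offset by 0x10000.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no high surrogate before it.
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            # Any other code unit stands for itself.
            code_points.append(unit)
            i += 1
    return code_points

print([hex(cp) for cp in decode_utf16_units([0x0041, 0xD834, 0xDD1E])])
# ['0x41', '0x1d11e']
```

Only the units in 0xD800-0xDFFF take the pairing branch; every other unit passes through unchanged, which is the distinction made above.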

> Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of functions there (AFAIK) can treat widechars as sequences of UTF-16 code units.
> 
> All modern browsers have UCS-2 as their internal representation. JavaScript and Java are also UCS-2-only languages (by their specs).

Right, the "wide characters" should be mentioned down by the Z stuff...

--anders
February 12, 2005
Andrew Fedoniouk wrote:

> Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of functions there (AFAIK) can treat widechars as sequences of UTF-16 code units.

Most of the WIN32 API has two entries for each function. The 8-bit character
API functions have A appended to their names, and the 16-bit character funcs
have W appended. This is how each application can choose which API to use.

* On NT-based kernels the 16-bit char is what is used natively, and the A
APIs just convert from the currently selected codepage into Unicode before
calling the W API.
* On 9x/Me kernels, the W APIs come as redistributable DLLs (apps can
include them in their installer). On these systems the W API just converts
the Unicode strings/chars into the current codepage (where possible) and then
calls the native A APIs.

So to conclude: Most (all?) of the WIN32 API is available in both 8-bit and
16-bit versions. The only exception may be Win32s, but I do not think
anyone uses that for new software releases (if they ever did).
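
The two conversion directions can be imitated in a few lines of Python. This is only a sketch of the idea, not the Windows implementation: the names ansi_to_wide and wide_to_ansi are invented, cp1252 is an assumed "currently selected codepage", and replacing unmappable characters with '?' is this sketch's own stand-in for "where possible". On Windows itself such conversions would go through calls like MultiByteToWideChar and WideCharToMultiByte.

```python
CODEPAGE = "cp1252"  # assumption: a Western European Windows codepage

def ansi_to_wide(data, codepage=CODEPAGE):
    """Roughly what an A entry point does on NT: codepage bytes -> Unicode text."""
    return data.decode(codepage)

def wide_to_ansi(text, codepage=CODEPAGE):
    """Roughly what a W entry point does on 9x/Me: Unicode text -> codepage bytes.

    Characters missing from the codepage become '?' in this sketch.
    """
    return text.encode(codepage, errors="replace")

print(ascii(ansi_to_wide(b"Bj\xf6rklund")))   # 'Bj\xf6rklund'
print(wide_to_ansi("Bj\u00f6rklund"))         # b'Bj\xf6rklund'
print(wide_to_ansi("\u03a9 ohm"))             # b'? ohm'
```

The last line shows the lossy case: GREEK CAPITAL LETTER OMEGA has no slot in cp1252, so the round trip through the 8-bit API cannot preserve it.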

Roald