Thread overview
get the facts: string
Nov 26, 2005
Thomas Kuehne
Nov 26, 2005
John Reimer
Nov 26, 2005
Derek Parnell
November 26, 2005
This post tries to sum up some of the facts about unicode, encodings and strings.

0) The concept of a "character" is language dependent.
(-> glyph, glyph cluster, ligature ...)

1) Every unicode code point can be encoded in each of the 5 common UTFs.
(-> UTF-8, UTF-16-BE, UTF-16-LE, UTF-32-BE, UTF-32-LE)

2) Not every character can be represented in the various ANSI and OEM character sets.

3) The character set of a terminal/shell can be changed on the fly.

4) A code point isn't always a complete character.
(Yes, a UTF-32 fragment isn't always a character.)

5) Some characters can be represented by different code point sequences and thus different sequences of code point fragments.

6) Slicing gets difficult if strings are NULL terminated like in C.

7) Slicing gets difficult if strings begin with a BOM.

8) Java's String concept hides a few transcodings and requires either a VM or opAssign.

9) Not every system uses the same fragment size let alone encoding.

10) Data has to be exchanged between different systems.

11) String processing is usually not a performance problem unless
the application is dedicated to text processing or a lot of transcodings
occur.
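
Points 1 and 9 can be made concrete with a small sketch. Python is used here only for illustration (it is not from this thread), and the names CODE_POINT, FRAGMENT_SIZE and fragments are made up for the example. It encodes a single code point outside the BMP in each of the five UTFs and splits the result into fragments (code units) of the size each encoding uses.

```python
# Encode one code point in each of the five UTFs listed in point 1.
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
CODE_POINT = "\U0001D11E"

# Fragment (code unit) size in bytes for each encoding; see point 9.
FRAGMENT_SIZE = {
    "utf-8": 1,
    "utf-16-be": 2,
    "utf-16-le": 2,
    "utf-32-be": 4,
    "utf-32-le": 4,
}


def fragments(text, encoding):
    """Return the encoded bytes of text, split into fragments (code units)."""
    raw = text.encode(encoding)
    size = FRAGMENT_SIZE[encoding]
    return [raw[i:i + size] for i in range(0, len(raw), size)]


for enc in FRAGMENT_SIZE:
    units = fragments(CODE_POINT, enc)
    # Show how many fragments the same code point needs in this encoding.
    print(enc, len(units), [u.hex() for u in units])
    # Every UTF round-trips the same code point without loss.
    assert b"".join(units).decode(enc) == CODE_POINT
```

The same code point takes four fragments in UTF-8, two in UTF-16 and one in UTF-32, so a fragment count says nothing unless the encoding is known.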

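Points 4 and 5 are easy to trip over, so here is a hedged sketch, again in Python and again only as an illustration. It relies on the standard unicodedata module; the helper same_text and the sample strings are assumptions made up for the example. Python strings index by code point, which makes the difference between a code point and a character visible.

```python
import unicodedata

# The same character written two ways (point 5):
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

# The raw code point sequences differ, so a naive comparison fails.
print(composed == decomposed)
print(len(composed), len(decomposed))


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(composed, decomposed))

# A code point is not always a complete character (point 4):
word = "cafe\u0301"   # "café" in decomposed form, five code points
head = word[:4]       # slicing by code point cuts off the combining accent
print(head)
```

Slicing at a code point boundary is therefore not automatically safe; a slice that must keep characters intact has to respect combining sequences as well.
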
further reading: http://www.unicode.org

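Points 2 and 7 can also be sketched in a few lines of Python (an illustration only). The code pages cp1252 and cp437 stand in for "ANSI" and "OEM" character sets as an assumption, and the helper name representable is made up for the example.

```python
import codecs


def representable(text, charset):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Point 2: legacy code pages cover only a small part of Unicode.
# cp1252 is a common Windows "ANSI" code page, cp437 the IBM PC OEM code page.
for sample in ["\u00e9", "\u20ac", "\u65e5"]:   # é, €, 日
    print(repr(sample), representable(sample, "cp1252"), representable(sample, "cp437"))

# Point 7: a leading BOM shifts every slice if it is not stripped first.
data = codecs.BOM_UTF8 + "abc".encode("utf-8")
with_bom = data.decode("utf-8")         # the BOM survives as U+FEFF
without_bom = data.decode("utf-8-sig")  # this codec drops a leading BOM
print(repr(with_bom[:3]), repr(without_bom[:3]))
```
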
Have a look at ICU to see some unicode string processing <g>

Thomas


November 26, 2005
Thomas Kuehne wrote:
> This post tries to sum up some of the facts about unicode, encodings
> and strings.

Thanks, Thomas.  Nice summary. I think I may actually get to understand some of this finally.  :)
November 26, 2005
On Sat, 26 Nov 2005 21:39:03 +0000 (UTC), Thomas Kuehne wrote:

> This post tries to sum up some of the facts about unicode, encodings and strings.

Thanks Thomas, this is very neat.

Now what is Walter going to do about it with respect to D? I suspect nothing. It is up to each coder to decide how to handle Unicode when using D, so there will be a myriad of solutions to the issues, and some will be better than others. The C/C++ world prevails. Such a pity.

-- 
Derek Parnell
Melbourne, Australia
27/11/2005 9:26:14 AM