get the facts: string

Nov 26, 2005

This post tries to sum up some of the facts about Unicode, encodings and strings.

0) The concept of a "character" is language dependent.
(-> glyph, glyph cluster, ligature ...)

1) Every Unicode code point can be encoded in each of the 5 common UTFs.
(-> UTF-8, UTF-16-BE, UTF-16-LE, UTF-32-BE, UTF-32-LE)

2) Not every character can be represented in the different ANSI and OEM character sets.

3) The character set of a terminal/shell can be changed on the fly.

4) A code point isn't always a complete character.
(Yes, a UTF-32 fragment isn't always a character.)

5) Some characters can be represented by different code point sequences and thus different sequences of code point fragments.

6) Slicing gets difficult if strings are NULL terminated like in C.

7) Slicing gets difficult if strings begin with a BOM.

8) Java's String concept hides a few transcodings and requires either a VM or opAssign.

9) Not every system uses the same fragment size, let alone the same encoding.

10) Data has to be exchanged between different systems.

11) String processing is usually not a performance problem unless the application is dedicated to text processing or a lot of transcodings occur.

further reading: http://www.unicode.org

Have a look at ICU to see some Unicode string processing <g>

Thomas
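Several of the points above (1, 2, 4, 5, 7 and 9) can be checked directly in code. What follows is only a minimal Python sketch and not part of the original post. The helper names `fragment_counts` and `fits_charset` are made up for illustration, and cp1252 and cp437 are assumed here as stand-ins for an "ANSI" and an "OEM" character set.

```python
import codecs
import unicodedata

# The five UTFs from point 1, each with the size of one fragment
# (code unit) in bytes.
UTFS = {"utf-8": 1, "utf-16-be": 2, "utf-16-le": 2,
        "utf-32-be": 4, "utf-32-le": 4}


def fragment_counts(text):
    """Fragments (code units) needed to store text in each UTF (points 1, 9)."""
    return {name: len(text.encode(name)) // size
            for name, size in UTFS.items()}


def fits_charset(text, charset):
    """True if every character of text exists in the legacy charset (point 2)."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Points 1 and 9: U+1D11E (musical G clef) is a single code point, but the
# number of fragments depends on the UTF.
clef = "\U0001D11E"
clef_counts = fragment_counts(clef)

# Point 2: the euro sign exists in cp1252 but not in cp437.
euro = "\u20ac"
euro_in_cp1252 = fits_charset(euro, "cp1252")
euro_in_cp437 = fits_charset(euro, "cp437")

# Points 4 and 5: "e with acute" as one code point, or as "e" followed by
# the combining acute accent U+0301. The two strings compare unequal until
# both are normalized to the same form.
composed = "\u00e9"
decomposed = "e\u0301"
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
sliced = decomposed[:1]  # cuts the accent off and leaves a plain "e"

# Point 7: a leading BOM shifts every slice unless it is stripped first.
raw = codecs.BOM_UTF8 + "abc".encode("utf-8")
naive = raw.decode("utf-8")         # keeps U+FEFF as the first code point
stripped = raw.decode("utf-8-sig")  # drops the BOM while decoding

if __name__ == "__main__":
    print(clef_counts)
    print(euro_in_cp1252, euro_in_cp437)
    print(composed == decomposed, same_after_nfc)
    print(repr(naive[:3]), repr(stripped[:3]))
```

Slicing by code point index, as `decomposed[:1]` does, is exactly the operation that points 4 and 5 warn about: the result is valid Unicode, but no longer the character the user saw.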

November 26, 2005

Re: get the facts: string

Posted by John Reimer
in reply to Thomas Kuehne

Thomas Kuehne wrote:
> This post tries to sum up some of the facts about unicode, encodings
> and strings.
> 
> 0) The concept of a "character" is language dependent.
> (-> glyph, glyph cluster, ligature ...)
> 
> 1) Every unicode code point can be encoded in each of the 5 common UTFs.
> (-> UTF-8, UTF-16-BE, UTF-16-LE, UTF-32-BE, UTF-32-LE)
> 
> 2) Not every character can be represented in the different ANSI and OEM
> character sets.
> 
> 3) The character set of a terminal/shell can be changed on the fly.
> 
> 4) A code point isn't always a complete character.
> (Yes, a UTF-32 fragment isn't always a character.)
> 
> 5) Some characters can be represented by different code point sequences
> and thus different sequences of code point fragments.
> 
> 6) Slicing gets difficult if strings are NULL terminated like in C.
> 
> 7) Slicing gets difficult if strings begin with a BOM.
> 
> 8) Java's String concept hides a few transcodings and requires either
> a VM or opAssign.
> 
> 9) Not every system uses the same fragment size let alone encoding.
> 
> 10) Data has to be exchanged between different systems.
> 
> 11) String processing is usually not a performance problem unless
> the application is dedicated to text processing or a lot of transcodings
> occur.
> 
> further reading: http://www.unicode.org
> 
> Have a look at ICU to see some unicode string processing <g>
> 
> Thomas

Thanks, Thomas.  Nice summary. I think I may actually get to understand some of this finally.  :)

On Sat, 26 Nov 2005 21:39:03 +0000 (UTC), Thomas Kuehne wrote:

> This post tries to sum up some of the facts about unicode, encodings and strings.

Thanks Thomas, this is very neat.

Now what is Walter going to do about it with respect to D? I suspect nothing. It is up to each coder to decide how to handle Unicode when using D, so there will be a myriad of solutions to the issues, and some will be better than others. The C/C++ world prevails. Such a pity.

--
Derek Parnell
Melbourne, Australia
27/11/2005 9:26:14 AM
