"There is no character" (str type)

October 11, 2004
Posted by Anders F Björklund
Like most other newcomers to Unicode,
I had some trouble with the UTF types...

Then I found this passage on the ICU home page:
> Often, a user thinks of a "character" as a complete unit in a
> language, like an 'Ä', while it may be represented with multiple
> Unicode code points including a base character and combining marks.
> (See the Unicode standard for details.) This often requires users to
> index and pass strings (UnicodeString or UChar *) with multiple code
> units or code points. It cannot be done with single-integer character
> types. Indexing of such "characters" is done with the BreakIterator
> class (in C: ubrk_ functions). [note: they talk about the ICU types]

Which explained to me that sometimes a single
"character" is not enough *anyway*, and that I
should be thinking in strings and code units...
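
A minimal sketch of what the ICU passage describes, assuming a present-day D 2 compiler with Phobos (std.uni.byGrapheme and std.range.walkLength; neither existed when this was posted). The helper name graphemeCount is made up for illustration. It shows the same "Ä" written two ways: one code point, or a base letter plus a combining mark.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helper: counts user-perceived characters (grapheme clusters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string precomposed = "\u00C4";   // U+00C4, a single code point
    string decomposed  = "A\u0308";  // 'A' + U+0308 COMBINING DIAERESIS

    // .length counts UTF-8 code units (bytes), not characters.
    writeln(precomposed.length, " ", decomposed.length);            // 2 3

    // walkLength decodes the string and counts code points.
    writeln(precomposed.walkLength, " ", decomposed.walkLength);    // 1 2

    // Both spellings are one "character" to the reader.
    writeln(graphemeCount(precomposed), " ", graphemeCount(decomposed)); // 1 1
}
```

So no single-integer type, not even dchar, can hold the decomposed form. It has to be handled as a (sub)string.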

And suddenly, me and all D's new char types are friends again!
It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D.
I just have to get out of the "uniform code unit size"-think.


I guess that of my own Latin-1 text, about 99% is ASCII*...
Which sounds like a good reason to make UTF-8 the default?
(I also read that of Unicode text, 99% is U+0000 to U+FFFF)

Another major advantage of UTF-8 over UTF-16 (besides being half
the size for mostly-ASCII text) is that it is endian-agnostic.
No more BE/LE! (and none of that pesky ASCII-breaking "BOM" crap either)

So if my text is mostly ASCII / ISO Latin-1, I just use char[].
If most of my text is non-Latin Unicode, I use wchar[]. And should I
ever need to access a single Unicode code point, then I have dchar.
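
To make the trade-off concrete, here is a small sketch that stores the same text in all three encodings and prints how many code units and bytes each takes. It assumes a modern D 2 compiler, where string, wstring and dstring are the immutable forms of char[], wchar[] and dchar[]; the sample text is just an illustration.

```d
import std.stdio : writefln;

void main()
{
    // The c/w/d suffix selects the encoding of the literal.
    // \u00F6 is 'ö', written as an escape so the source stays ASCII.
    string  s8  = "Bj\u00F6rklund"c;  // UTF-8,  char  code units
    wstring s16 = "Bj\u00F6rklund"w;  // UTF-16, wchar code units
    dstring s32 = "Bj\u00F6rklund"d;  // UTF-32, dchar code units

    // .length counts code units; multiply by the unit size to get bytes.
    writefln("UTF-8:  %s units, %s bytes", s8.length,  s8.length  * char.sizeof);
    writefln("UTF-16: %s units, %s bytes", s16.length, s16.length * wchar.sizeof);
    writefln("UTF-32: %s units, %s bytes", s32.length, s32.length * dchar.sizeof);
    // UTF-8:  10 units, 10 bytes
    // UTF-16: 9 units, 18 bytes
    // UTF-32: 9 units, 36 bytes
}
```

For mostly-ASCII text like this, UTF-8 comes out close to half the size of UTF-16. The single 'ö' is the only thing that costs an extra byte.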


Now all I want is a string type ALIAS, and all things are spiffy.
Can we have either a "str" or a "string" alias added, for char[] ?
Pretty-please ? (Hey, it worked for the "bool" alias for bit...)
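
A sketch of what such an alias could look like. The name str is only the one suggested here, not a built-in. Assumptions: this uses present-day D 2 syntax, where the language already ships a string alias for immutable(char)[], so the sketch aliases that type rather than plain char[]. In the alias syntax of the time it would presumably have been written as "alias char[] str;".

```d
// Hypothetical alias named after the request in the post; not a built-in.
// Assumption: D 2, where `string` is itself an alias for immutable(char)[].
alias str = immutable(char)[];

// str and string name the same type, so this is an ordinary entry point.
void main(str[] args)
{
    static assert(is(str == string));

    str greeting = "hello";
    assert(greeting.length == 5);  // five UTF-8 code units
}
```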

"void main(str[] args)"
--anders


PS.  Now I just have to remember to dimension my D strings
     as char[max * 2] (for Latin-1) or even char[max * 4]...
     And how to loop over an array of potential "surrogates".
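
Both points in the PS can be sketched in a few lines, assuming a modern D 2 compiler with Phobos (std.utf.codeLength). The helper name countCodePoints is made up for illustration. Asking foreach for a dchar loop variable makes the compiler decode multi-unit sequences, including UTF-16 surrogate pairs, into whole code points.

```d
import std.stdio : writefln;
import std.utf : codeLength;

// Hypothetical helper: counts code points in UTF-16 text by letting
// foreach decode each surrogate pair into a single dchar.
size_t countCodePoints(const(wchar)[] w)
{
    size_t n = 0;
    foreach (dchar c; w)
        ++n;
    return n;
}

void main()
{
    // U+1D11E lies above U+FFFF, so UTF-16 needs a surrogate pair for it.
    wstring w = "a\U0001D11Eb"w;
    assert(w.length == 4);              // 1 + 2 + 1 code units
    assert(countCodePoints(w) == 3);    // but only three code points

    foreach (dchar c; w)
        writefln("U+%04X", cast(uint) c);   // U+0061, U+1D11E, U+0062

    // Worst-case UTF-8 sizes behind char[max * 2] and char[max * 4]:
    assert(codeLength!char('\u00FF') == 2);      // top of Latin-1
    assert(codeLength!char('\U0010FFFF') == 4);  // top of Unicode
}
```

The same foreach (dchar c; ...) form works on char[] data, where it decodes multi-byte UTF-8 sequences instead of surrogate pairs.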

PPS. * 10% of my own Swedish characters are non-ASCII. (ÅÄÖ)
     But as a programmer, I usually write things in English.
     Except for my last name, which accounts for the 1%. :-)