Unicode (page 2)

On Thu, 15 Apr 2004 19:26:47 +1000, "Scott Egan" <scotte@tpg.com.aux> wrote:

>Given the intent of D to maintain some of the low level 'system' capability
>I'd rather just use UTF-32 if it came down to it.
>The fixed size representation it offers has got to sure make dealing with
>strings more efficient and faster (and easier to muck around with).

Efficiency would depend on the application: one that copies strings a lot would slow down significantly, assuming most strings would fit "nicely" in UTF-16 or UTF-8. Walter's experience has been that more programs copy strings than index them as characters.

I've done some more homework and have a few other points:

Walter's experience may be that programmers copy strings, but have you looked at the library lately?

It's full of index work.

BTW none of the string library is Unicode compatible; it just treats the char[] as arrays of single bytes (as my 'split' offering does ;). If char is supposed to be UTF-8 then the system needs to be aware of supplemental chars etc (doesn't it???) for correct word boundary and capitalisation efforts. I would also expect that it would be very easy to produce invalid Unicode streams with some of the functions.

And...

Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2 (fixed 2 byte representation) like C#?

Or given that the Unicode standard is 21 bits, just use the fixed-width UTF-32?

Now I will shut up!

"Ben Hinkle" <bhinkle4@juno.com> wrote in message news:2tus70dgshsjb5seh2hcrfvl3raj2mui20@4ax.com...
> On Thu, 15 Apr 2004 19:26:47 +1000, "Scott Egan" <scotte@tpg.com.aux> wrote:
>
> >Given the intent of D to maintain some of the low level 'system' capability
> >I'd rather just use UTF-32 if it came down to it.
> >The fixed size representation it offers has got to sure make dealing with
> >strings more efficient and faster (and easier to muck around with).
>
> Efficiency would depend on the application: one that copies strings a lot would slow down significantly, assuming most strings would fit "nicely" in UTF-16 or UTF-8. Walter's experience has been that more programs copy strings than index them as characters.
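
On the UCS-2 question above, a minimal sketch may help (written in Python purely for illustration, not in D; the helper name `utf16_units` and the sample characters are arbitrary choices made here). It shows that a character outside the Basic Multilingual Plane does not fit in a single 16-bit unit, so UTF-16 has to encode it as a surrogate pair, and that the highest Unicode code point needs 21 bits:

```python
import struct

def utf16_units(ch):
    """Return the list of 16-bit code units UTF-16 uses for the string ch."""
    raw = ch.encode("utf-16-le")  # little-endian, no byte-order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp_char = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"  # U+1D11E, a supplementary-plane character

bmp_units = utf16_units(bmp_char)        # a single 16-bit unit
astral_units = utf16_units(astral_char)  # two units: a surrogate pair

max_code_point = 0x10FFFF                # highest code point Unicode defines
bits_needed = max_code_point.bit_length()

print([hex(u) for u in bmp_units], [hex(u) for u in astral_units], bits_needed)
```

A strict fixed 2-byte representation therefore cannot hold the second character at all, which is the trade-off behind restricting text to Plane 0.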

April 18, 2004

Re: Unicode

Posted by Sigbjørn Lund Olsen
in reply to Scott Egan


Scott Egan wrote:

> I've done some more homework and have a few other points:
> 
> Walter's experience may be that programmers copy strings, but have you
> looked at the library lately?
> 
> It's full of index work.
> 
> BTW none of the string library is Unicode compatible; it just treats the
> char[] as arrays of single bytes (as my 'split' offering does ;).
> If char is supposed to be UTF-8 then the system needs to be aware of
> supplemental chars etc (doesn't it???) for correct word boundary and
> capitalisation efforts.  I would also expect that it would be very easy to
> produce invalid Unicode streams with some of the functions.

No. As far as I understand UTF-8 and UTF-16, they only expect the storage type to be a container of a certain bit width. That is, 'char' is not expected to represent a whole character; it may hold just one part of a character. A more semantically correct name for 'char' would be 'utf8byte', but some would think that too wordy. Personally, it's one of the first things I alias.
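
To make the code-unit point concrete, here is a minimal sketch. It is written in Python rather than D purely for brevity, and Python's `bytes` stands in for a D `char[]`; that mapping is an assumption made for illustration, as is the helper name `is_valid_utf8`. It shows that one character can span several UTF-8 code units, and that cutting at an arbitrary byte index can leave an invalid sequence:

```python
# Illustrative only: Python bytes stand in for a D char[] of UTF-8 code units.
text = "Sigbj\u00f8rn"         # 8 characters; U+00F8 is the letter o with stroke
units = text.encode("utf-8")   # the UTF-8 code units, one byte each

char_count = len(text)         # number of code points
unit_count = len(units)        # number of UTF-8 code units

def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+00F8 occupies byte indices 5 and 6; slicing at 6 keeps only its first byte.
broken = units[:6]
whole = units[:7]

print(char_count, unit_count, is_valid_utf8(broken), is_valid_utf8(whole))
```

Code that indexes or slices by code unit is fine as long as it only cuts at character boundaries; the `broken` slice shows what happens when it does not.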

> And...
> 
> Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2
> (fixed 2 byte representation) like C#?
> 
> Or given that the Unicode standard is 21 bits, just use the fixed-width
> UTF-32?
> 
> Now I will shut up!

Sometimes space is a consideration. If I had a database of English text, let's say a couple of billion characters, I *know* I pretty much only need ASCII codes except in rare cases. Since I want as much of the database as possible cached in memory at any given time to serve said English text faster, I would rather have the text encoded as UTF-8 than as UTF-32.
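
The space argument can be checked with a short sketch (Python, for illustration only; the sample sentence and the helper name `encoded_size` are hypothetical stand-ins for the real database text):

```python
# Hypothetical sample; any pure-ASCII English text behaves the same way.
sample = "The quick brown fox jumps over the lazy dog. " * 1000

def encoded_size(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

# The "-le" variants are used so that no byte-order mark is counted.
utf8_bytes = encoded_size(sample, "utf-8")
utf16_bytes = encoded_size(sample, "utf-16-le")
utf32_bytes = encoded_size(sample, "utf-32-le")

print(utf8_bytes, utf16_bytes, utf32_bytes)
```

For pure ASCII text the sizes are in the ratio 1 : 2 : 4, so a couple of billion characters would need roughly 2 GB as UTF-8 against roughly 8 GB as UTF-32.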

In many cases you'll find that a particular encoding may be more appropriate than another, even if several encodings are appealing in their design. D gives you choice, and that's good. I like to think that the programmer knows better than a language designer what she wishes to do.

Cheers,
Sigbjørn Lund Olsen
