Thread overview
Unicode problems?
Feb 16, 2009
Trass3r
Feb 16, 2009
Daniel Keep
Feb 16, 2009
Lutger
February 16, 2009
Wikipedia states that D still has some Unicode problems:
"Operations on Unicode strings are unintuitive (compiler accepts Unicode source code, standard library and foreach constructs operate on UTF-8, but string slicing and length property operate on bytes rather than characters)."

Is this information correct?
February 16, 2009

Trass3r wrote:
> Wikipedia states that D still has some Unicode problems:
> "Operations on Unicode strings are unintuitive (compiler accepts Unicode
> source code, standard library and foreach constructs operate on UTF-8,
> but string slicing and length property operate on bytes rather than
> characters)."
> 
> Is this information correct?

They're not bugs, if that's what you mean.  It's just a side-effect of how Unicode works.

http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

Long story short: they operate on bytes because operating on actual code points can't be done efficiently [1].

  -- Daniel

[1] Given that strings are implemented as arrays with a given, non-changing width and that you're not using UTF-32 which no one does because it's too big and that we don't add some fancy caching stuff to char[] arrays specifically, blah blah blah.
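
A minimal sketch of the point above, assuming a current D2 compiler with Phobos. The helper names `codePointCount` and `offsetOfCodePoint` are made up for illustration and are not part of the standard library. It shows that `.length` and slicing work in UTF-8 code units, and that reaching the n-th code point means walking the string from the start.

```d
import std.stdio : writeln;
import std.utf : stride;

// Hypothetical helper: count code points by decoding the whole string, O(n).
size_t codePointCount(string s)
{
    size_t n = 0;
    foreach (dchar c; s)   // foreach with dchar decodes UTF-8 on the fly
        ++n;
    return n;
}

// Hypothetical helper: code-unit offset at which the i-th code point starts.
// There is no shortcut; every earlier sequence has to be stepped over.
size_t offsetOfCodePoint(string s, size_t i)
{
    size_t off = 0;
    while (i > 0)
    {
        off += stride(s, off);  // code units in the sequence starting at off
        --i;
    }
    return off;
}

void main()
{
    string s = "na\u00EFve";      // U+00EF takes two UTF-8 code units
    writeln(s.length);            // 6 (code units)
    writeln(codePointCount(s));   // 5 (code points)
    writeln(s[0 .. 2]);           // na (slice indices are code units)
    writeln(s[offsetOfCodePoint(s, 3) .. $]);  // ve
}
```

Slicing at an offset that falls inside a multi-byte sequence would produce invalid UTF-8, which is why the offset is computed first.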
February 16, 2009
Trass3r wrote:

> Wikipedia states that D still has some Unicode problems:
> "Operations on Unicode strings are unintuitive (compiler accepts Unicode
> source code, standard library and foreach constructs operate on UTF-8, but
> string slicing and length property operate on bytes rather than
> characters)."
> 
> Is this information correct?

I think calling that unintuitive is a matter of point of view, but otherwise I can't find anything incorrect in it, except maybe that "...operate on bytes" should be "...operate on code units". It doesn't mean that D has Unicode problems, though.
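
To illustrate the "code units" wording, here is a small sketch, assuming a D2 compiler. It stores the same single code point as UTF-8, UTF-16 and UTF-32 and prints `.length` for each. The length follows the element type of the array, so it only equals the byte count for `char` strings.

```d
import std.stdio : writeln;

void main()
{
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
    string  s8  = "\U0001D11E";   // UTF-8,  char  code units
    wstring s16 = "\U0001D11E"w;  // UTF-16, wchar code units
    dstring s32 = "\U0001D11E"d;  // UTF-32, dchar code units

    writeln(s8.length);   // 4 (four 1-byte code units)
    writeln(s16.length);  // 2 (a surrogate pair)
    writeln(s32.length);  // 1 (one 4-byte code unit)

    // Byte size is length times the size of the code unit type.
    writeln(s16.length * wchar.sizeof);  // 4
}
```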





February 16, 2009
Daniel Keep wrote:
> 
> Trass3r wrote:
>> Wikipedia states that D still has some Unicode problems:
>> "Operations on Unicode strings are unintuitive (compiler accepts Unicode
>> source code, standard library and foreach constructs operate on UTF-8,
>> but string slicing and length property operate on bytes rather than
>> characters)."
>>
>> Is this information correct?
> 
> They're not bugs, if that's what you mean.  It's just a side-effect of
> how Unicode works.
> 
> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
> 
> Long story short: they operate on bytes because operating on actual code
> points can't be done efficiently [1].
> 
>   -- Daniel
> 
> [1] Given that strings are implemented as arrays with a given,
> non-changing width and that you're not using UTF-32 which no one does
> because it's too big and that we don't add some fancy caching stuff to
> char[] arrays specifically, blah blah blah.

I use UTF-32, at least occasionally.  In cases where I specifically expect or encourage multilingual use, it can simplify matters greatly, because those otherwise inefficient operations become common there.

-- Chris Nicholson-Sauls
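
A short sketch of the UTF-32 approach mentioned above, assuming a D2 compiler with Phobos (`std.conv.to`). After one O(n) conversion to `dstring`, indexing and slicing work per code point in constant time, at the cost of four bytes per code point.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string  s = "gr\u00FC\u00DFe";   // "grüße" as UTF-8
    dstring d = to!dstring(s);       // decode once into UTF-32

    writeln(s.length);     // 7 (UTF-8 code units)
    writeln(d.length);     // 5 (code points)
    writeln(d[2]);         // ü (O(1) index by code point)
    writeln(d[2 .. 4]);    // üß (slice bounds are code points)

    writeln(d.length * dchar.sizeof);  // 20 bytes, versus 7 for UTF-8
}
```

Note that this indexes code points, not user-perceived characters: a base letter followed by a combining mark still occupies two elements of a `dstring`.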