February 16, 2009 Unicode problems?
Wikipedia states that D still has some Unicode problems:
"Operations on Unicode strings are unintuitive (compiler accepts Unicode source code, standard library and foreach constructs operate on UTF-8, but string slicing and length property operate on bytes rather than characters)."

Is this information correct?
February 16, 2009 Re: Unicode problems?
Posted in reply to Trass3r
Trass3r wrote:
> Wikipedia states that D still has some Unicode problems:
> "Operations on Unicode strings are unintuitive (compiler accepts Unicode
> source code, standard library and foreach constructs operate on UTF-8,
> but string slicing and length property operate on bytes rather than
> characters)."
>
> Is this information correct?
They're not bugs, if that's what you mean. It's just a side-effect of how Unicode works.
http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
Long story short: they operate on bytes because operating on actual code points can't be done efficiently [1].
-- Daniel
[1] Given that strings are implemented as arrays with a given, non-changing width and that you're not using UTF-32 which no one does because it's too big and that we don't add some fancy caching stuff to char[] arrays specifically, blah blah blah.
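To make the behaviour under discussion concrete, here is a minimal sketch in D. The helper name `countCodePoints` is made up for illustration, and the example assumes a D2-style `string` (an array of UTF-8 code units). It shows that `.length` and slicing work on code units, while `foreach` with a `dchar` loop variable decodes code points.

```d
import std.stdio;

// Hypothetical helper: counts code points by letting foreach decode the UTF-8.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)   // a dchar loop variable makes foreach decode code points
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E4l";          // U+00E4 occupies two UTF-8 code units

    writeln(s.length);              // 4: .length counts code units
    writeln(countCodePoints(s));    // 3: decoding has to walk the whole string

    writeln(s[0 .. 1]);             // h
    writeln(s[1 .. 3]);             // both code units of U+00E4
    // s[1 .. 2] would cut the two-unit sequence in half and yield invalid UTF-8
}
```

Counting or indexing by code point requires a linear scan, which is the efficiency concern raised in the post above.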
February 16, 2009 Re: Unicode problems?
Posted in reply to Trass3r
Trass3r wrote:
> Wikipedia states that D still has some Unicode problems:
> "Operations on Unicode strings are unintuitive (compiler accepts Unicode
> source code, standard library and foreach constructs operate on UTF-8, but
> string slicing and length property operate on bytes rather than
> characters)."
>
> Is this information correct?
I think calling that unintuitive is a matter of opinion, but otherwise I can't find anything incorrect in it, except perhaps that "operate on bytes" should be "operate on code units". It doesn't mean that D has Unicode problems, though.
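The distinction between bytes and code units only shows up outside UTF-8. As a small sketch (the chosen character, U+1D11E, is just an illustrative code point outside the Basic Multilingual Plane), the same single code point has a different `.length` in each of D's three string types, and for `wstring` the length is not the byte count:

```d
import std.stdio;

void main()
{
    string  s8  = "\U0001D11E";    // UTF-8:  four 1-byte code units
    wstring s16 = "\U0001D11E"w;   // UTF-16: two 2-byte code units (a surrogate pair)
    dstring s32 = "\U0001D11E"d;   // UTF-32: one 4-byte code unit

    writeln(s8.length);                  // 4
    writeln(s16.length);                 // 2 code units...
    writeln(s16.length * wchar.sizeof);  // ...but 4 bytes
    writeln(s32.length);                 // 1
}
```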
February 16, 2009 Re: Unicode problems?
Posted in reply to Daniel Keep
Daniel Keep wrote:
>
> Trass3r wrote:
>> Wikipedia states that D still has some Unicode problems:
>> "Operations on Unicode strings are unintuitive (compiler accepts Unicode
>> source code, standard library and foreach constructs operate on UTF-8,
>> but string slicing and length property operate on bytes rather than
>> characters)."
>>
>> Is this information correct?
>
> They're not bugs, if that's what you mean. It's just a side-effect of
> how Unicode works.
>
> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>
> Long story short: they operate on bytes because operating on actual code
> points can't be done efficiently [1].
>
> -- Daniel
>
> [1] Given that strings are implemented as arrays with a given,
> non-changing width and that you're not using UTF-32 which no one does
> because it's too big and that we don't add some fancy caching stuff to
> char[] arrays specifically, blah blah blah.
I use UTF-32, at least occasionally. In cases where I specifically expect/encourage multilingual support/use, it can simplify matters greatly, where those otherwise inefficient operations become common.
-- Chris Nicholson-Sauls
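As a rough illustration of the UTF-32 approach, the sketch below converts once with `std.conv.to` and then indexes and slices by code point. The sample text is made up, and note that a code point is still not always a user-perceived character (combining marks remain separate code points).

```d
import std.conv : to;
import std.stdio;

void main()
{
    string  utf8  = "h\u00E4llo";
    dstring utf32 = to!dstring(utf8);   // decode once; every element is one code point

    writeln(utf8.length);     // 6 code units
    writeln(utf32.length);    // 5 code points
    writeln(utf32[1]);        // constant-time access to the second code point
    writeln(utf32[1 .. 3]);   // slicing can no longer split a code point
}
```

The trade-off is memory: each element of a `dstring` takes four bytes regardless of the character.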
Copyright © 1999-2021 by the D Language Foundation