Unicode

Apr 14, 2004

Scott Egan

Scott Egan wrote:
> Would it have been better just to stick to Unicode internally, and left any
> conversion to the IO classes?

By Walter's convention, all char[] are UTF-8, and anywhere the standard library doesn't obey that and interprets it as ANSI/ASCII/whatever is to be considered a bug. He has stated this at least 10 times already. And char[] is to be the standard way of exchanging Unicode strings within D programs. There are also dchar and wchar for other Unicode encodings.

-eye

Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with UCS-2 (UTF-16?), i.e. straight 16-bit chars?

"Ilya Minkov" <minkov@cs.tum.edu> wrote in message news:c5jfuf$1ibt$1@digitaldaemon.com...
> Scott Egan wrote:
> > Would it have been better just to stick to Unicode internally, and left any
> > conversion to the IO classes?
>
> By Walter's convention, all char[] are UTF-8, and anywhere the standard library doesn't obey that and interprets it as ANSI/ASCII/whatever is to be considered a bug. He has stated this at least 10 times already. And char[] is to be the standard way of exchanging Unicode strings within D programs. There are also dchar and wchar for other Unicode encodings.
>
> -eye

Scott Egan wrote:
> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
> UCS-2 (UTF-16?), i.e. straight 16-bit chars?

char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

Of the three Unicode transformation formats, personally, I find UTF-8 and UTF-32 to be the most appealing: UTF-8 for its backwards compatibility with the widespread (if lacking) ASCII format, and UTF-32 because it's by far the most straightforward of the formats. UTF-16 is *not* straight 16-bit chars; there are roughly 1 million possible Unicode code points, and any character beyond the first 65,536 needs two UTF-16 shorts. Only UTF-32 can offer a straightforward 'same sized block' character encoding within the Unicode standard, and personally, I can't see what UTF-16 has to offer compared to the two alternatives in the Unicode standard, aside from size savings in some languages. But, of course, UTF-8 (and probably UTF-32) 'suck' out of the box, don't they?

At any rate, all three formats seem to be in use in different spheres of computing, and thus all have their proper place in a generic programming language.

Furthermore, the standard library has (afaik - I haven't used them) functions for converting between the three formats.

Cheers,
Sigbjørn Lund Olsen
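To make the size comparison above concrete, here is a minimal sketch in Python (not D, and not taken from the thread) that counts how many code units one character needs in each encoding form. The helper name code_units is made up for illustration.

```python
def code_units(text):
    """Count the code units needed to store `text` in each encoding form.

    `code_units` is a hypothetical helper written for this illustration.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),            # 1-byte units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 2-byte units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 4-byte units
    }


if __name__ == "__main__":
    # An ASCII letter, a Latin-1 letter, the euro sign, and a musical
    # symbol that lies beyond the first 65,536 code points.
    for ch in ["A", "\u00f8", "\u20ac", "\U0001D11E"]:
        print("U+%04X" % ord(ch), code_units(ch))
```

Only the last character needs two UTF-16 units (a surrogate pair), while UTF-32 always uses exactly one unit per code point.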

"Scott Egan" <scotte@tpg.com.aux> wrote in message news:c5jhsa$1l7v$1@digitaldaemon.com...
> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with UCS-2 (UTF-16?), i.e. straight 16-bit chars?

UTF-8 is a compromise between Unicode support and C's character model. Unicode hasn't flared up on the newsgroup in a while, so you might have to look back a bit to find Walter's arguments for and against the various ideas.

Sigbjørn Lund Olsen wrote:
>> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
>> UCS-2 (UTF-16?), i.e. straight 16-bit chars?
>
> char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

This is not correct. dchar is UTF-32 and wchar is UTF-16.

Hauke

> > char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.
>
> This is not correct. dchar is UTF-32 and wchar is UTF-16.

heh. I can never remember which one is which either. How about changing dchar to wwchar for "weally wide char", which is scalable to any number of bytes - weally weally wide char, etc ;-)

Ben Hinkle wrote:
> "Scott Egan" <scotte@tpg.com.aux> wrote in message
> news:c5jhsa$1l7v$1@digitaldaemon.com...
>
>> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
>> UCS-2 (UTF-16?), i.e. straight 16-bit chars?
>
> UTF-8 is a compromise between Unicode support and C's character model.
> Unicode hasn't flared up on the newsgroup in a while so you might have to
> look back a while to find Walter's arguments for and against the various
> ideas.

Since it has come up before, I've made a list of some of these threads:
http://www.wikiservice.at/d/wiki.cgi?UnicodeIssues

--
Justin
http://jcc_7.tripod.com/d/

April 15, 2004

Re: Unicode

Posted by Scott Egan
in reply to Sigbjørn Lund Olsen

Given the intent of D to maintain some low-level 'system' capability,
I'd rather just use UTF-32 if it came down to it.
The fixed-size representation it offers has surely got to make dealing with
strings more efficient and faster (and easier to muck around with).

The various stream libraries could be left to take care of any necessary conversions.
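
A rough Python sketch (not D, and only an illustration of the argument) of why a fixed-size representation is easier to index: in UTF-32 the n-th code point sits at a known byte offset, while in UTF-8 you have to scan from the start. Both function names are hypothetical, and little-endian UTF-32 is an assumption.

```python
def nth_code_point_utf32(buf, n):
    """Fixed width: code point n occupies bytes 4n .. 4n+3.

    Assumes little-endian UTF-32 without a byte order mark.
    """
    return buf[4 * n:4 * n + 4].decode("utf-32-le")


def nth_code_point_utf8(buf, n):
    """Variable width: find where each code point starts by skipping
    continuation bytes (bit pattern 10xxxxxx), then slice out the n-th."""
    starts = [i for i, b in enumerate(buf) if (b & 0xC0) != 0x80]
    end = starts[n + 1] if n + 1 < len(starts) else len(buf)
    return buf[starts[n]:end].decode("utf-8")


if __name__ == "__main__":
    text = "Sigbj\u00f8rn"  # the o-slash takes two bytes in UTF-8
    print(nth_code_point_utf32(text.encode("utf-32-le"), 6))
    print(nth_code_point_utf8(text.encode("utf-8"), 6))
```

The trade-off is memory: the UTF-32 buffer here is 32 bytes against 9 for UTF-8, which is the kind of conversion cost a stream layer would have to absorb.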

However, that said, I'll drop it.


"Sigbjørn Lund Olsen" <sigbjorn@lundolsen.net> wrote in message news:c5jlp2$1r51$1@digitaldaemon.com...
> Scott Egan wrote:
>
> > Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with
> > UCS-2 (UTF-16?), i.e. straight 16-bit chars?
>
> char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.
>
> Of the three Unicode transformation formats, personally, I find UTF-8 and UTF-32 to be the most appealing: UTF-8 for its backwards compatibility with the widespread (if lacking) ASCII format, and UTF-32 because it's by far the most straightforward of the formats. UTF-16 is *not* straight 16-bit chars; there are roughly 1 million possible Unicode code points, and any character beyond the first 65,536 needs two UTF-16 shorts. Only UTF-32 can offer a straightforward 'same sized block' character encoding within the Unicode standard, and personally, I can't see what UTF-16 has to offer compared to the two alternatives in the Unicode standard, aside from size savings in some languages. But, of course, UTF-8 (and probably UTF-32) 'suck' out of the box, don't they?
>
> At any rate, the three formats seem to all be in use in different spheres of computing, and thus all have their proper place in a generic programming language.
>
> Furthermore, the standard library has (afaik - I haven't used them) functions for converting between the three formats.
>
> Cheers,
> Sigbjørn Lund Olsen
