Thread overview
Unicode
Apr 14, 2004
Scott Egan
Apr 14, 2004
Ilya Minkov
Apr 14, 2004
Scott Egan
Apr 14, 2004
Hauke Duden
Apr 14, 2004
Ben Hinkle
Apr 14, 2004
Walter
Apr 15, 2004
Scott Egan
Apr 15, 2004
Ben Hinkle
Apr 15, 2004
Scott Egan
Apr 17, 2004
Serge K
Apr 14, 2004
Ben Hinkle
Apr 14, 2004
J C Calvarese
April 14, 2004
Would it have been better just to stick to Unicode internally, and leave any conversion to the IO classes?



April 14, 2004
Scott Egan wrote:
> Would it have been better just to stick to Unicode internally, and leave any
> conversion to the IO classes?

By Walter's convention, all char[] are UTF-8, and wherever the standard library doesn't obey this and interprets them as ANSI/ASCII/whatever, that is to be considered a bug. He has stated this at least 10 times already. And char[] is meant to be the standard way of exchanging Unicode strings within D programs. There are also dchar and wchar for other Unicode encodings.
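
To make the "interpreted as ANSI" bug concrete, here is a minimal sketch in Python (not D, and not anything from the D standard library). The sample name and the choice of Latin-1 as the "ANSI" code page are assumptions made purely for illustration.

```python
# Hypothetical illustration: the same bytes read as UTF-8 versus Latin-1.
text = "Sigbj\u00f8rn"               # 8 code points, one of them non-ASCII
raw = text.encode("utf-8")           # the bytes a UTF-8 char[] would hold

as_utf8 = raw.decode("utf-8")        # correct interpretation
as_latin1 = raw.decode("latin-1")    # byte-per-character misreading

print(len(text), len(raw))           # code points versus bytes
print(as_utf8)
print(as_latin1)                     # the non-ASCII letter turns into two junk characters
```

Any routine that treats each byte of a UTF-8 string as one character will mangle text in this way as soon as it leaves the ASCII range.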

-eye
April 14, 2004
Fine, but UTF-8 sucks about as much as ASN.1. Why not just stick with UCS-2 (UTF-16?), i.e. straight 16-bit chars?

"Ilya Minkov" <minkov@cs.tum.edu> wrote in message news:c5jfuf$1ibt$1@digitaldaemon.com...
> Scott Egan wrote:
> > Would it have been better just to stick to Unicode internally, and leave any
> > conversion to the IO classes?
>
> By Walter's convention, all char[] are UTF-8, and wherever the standard library doesn't obey this and interprets them as ANSI/ASCII/whatever, that is to be considered a bug. He has stated this at least 10 times already. And char[] is meant to be the standard way of exchanging Unicode strings within D programs. There are also dchar and wchar for other Unicode encodings.
>
> -eye


April 14, 2004
Scott Egan wrote:

> Fine, but UTF-8 sucks about as much as ASN.1. Why not just stick with
> UCS-2 (UTF-16?), i.e. straight 16-bit chars?

char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

Of the three Unicode transformation formats, I personally find UTF-8 and UTF-32 the most appealing: UTF-8 for its backwards compatibility with the widespread (if lacking) ASCII format, and UTF-32 because it's by far the most straightforward of the formats. UTF-16 is *not* straight 16-bit chars: there are roughly 1.1 million possible Unicode code points, and any code point above U+FFFF needs two UTF-16 shorts (a surrogate pair). Only UTF-32 offers a straightforward 'same sized block' encoding of code points within the Unicode standard, and personally I can't see what UTF-16 has to offer compared to the two alternatives, aside from size savings in some languages. But, of course, UTF-8 (and probably UTF-32) 'suck' out of the box, don't they?
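
The point about UTF-16 needing two shorts can be sketched with the surrogate-pair arithmetic itself. This is a minimal Python illustration, not D code; the function name `utf16_units` and the sample code points are made up for the example.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]                  # fits in a single 16-bit unit
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits go in the low surrogate
    return [high, low]

# Compare the storage cost of a few code points in all three formats.
for cp in (0x41, 0xF8, 0x20AC, 0x1D11E):
    ch = chr(cp)
    utf8_bytes = len(ch.encode("utf-8"))
    utf32_bytes = len(ch.encode("utf-32-le"))
    print(f"U+{cp:04X}", utf8_bytes, [hex(u) for u in utf16_units(cp)], utf32_bytes)
```

Only the last sample lies above U+FFFF, so only it produces two 16-bit units; UTF-32 spends four bytes on every one of them.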

At any rate, the three formats seem to all be in use in different spheres of computing, and thus all have their proper place in a generic programming language.

Furthermore, the standard library has (afaik - I haven't used them) functions for converting between the three formats.
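
As a rough idea of what such conversions amount to, here is a sketch using Python's built-in codecs rather than the D library functions (which are not shown here). The helper name `transcode` and the sample string are assumptions for illustration only.

```python
def transcode(data, source, target):
    """Re-encode bytes from one Unicode encoding to another."""
    return data.decode(source).encode(target)

original = "D \u2665 Unicode".encode("utf-8")        # 11 code points
as_utf16 = transcode(original, "utf-8", "utf-16-le")
as_utf32 = transcode(as_utf16, "utf-16-le", "utf-32-le")
round_trip = transcode(as_utf32, "utf-32-le", "utf-8")

print(len(original), len(as_utf16), len(as_utf32))   # byte counts per format
print(round_trip == original)                        # conversion is lossless
```

All three formats encode the same code points, so converting between them never loses information; only the byte count changes.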

Cheers,
Sigbjørn Lund Olsen
April 14, 2004
"Scott Egan" <scotte@tpg.com.aux> wrote in message news:c5jhsa$1l7v$1@digitaldaemon.com...
> Fine, but UTF-8 sucks about as much as ASN.1. Why not just stick with UCS-2 (UTF-16?), i.e. straight 16-bit chars?

UTF-8 is a compromise between Unicode support and C's character model. Unicode hasn't flared up on the newsgroup in a while so you might have to look back a while to find Walter's arguments for and against the various ideas.
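
One way to see the compromise, sketched in Python with made-up sample strings: ASCII text is byte-for-byte unchanged in UTF-8, and every byte of a non-ASCII character has its high bit set, so code written for C-style strings never sees a stray NUL or ASCII delimiter inside a multi-byte character.

```python
ascii_text = "hello, world"
ascii_identical = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

mixed = "na\u00efve \u20ac5"                 # contains two non-ASCII characters
encoded = mixed.encode("utf-8")

# Collect only the bytes that belong to non-ASCII characters.
non_ascii_bytes = [b for ch in mixed if ord(ch) > 0x7F
                   for b in ch.encode("utf-8")]
all_high = all(b >= 0x80 for b in non_ascii_bytes)
has_nul = 0 in encoded

print(ascii_identical, all_high, has_nul)
```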


April 14, 2004
Sigbjørn Lund Olsen wrote:
>> Fine, but UTF-8 sucks about as much as ASN.1. Why not just stick with
>> UCS-2 (UTF-16?), i.e. straight 16-bit chars?
> 
> 
> char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

This is not correct. dchar is UTF-32 and wchar is UTF-16.
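
For anyone who keeps mixing them up, the corrected mapping can be written down as a small table. The sketch below is Python, not D; the dictionary `D_CHAR_TYPES`, the helper `code_units` and the sample string are hypothetical names used only to count how many elements each array type would need.

```python
# D character type -> (encoding, bytes per code unit)
D_CHAR_TYPES = {
    "char":  ("utf-8", 1),      # 8-bit code units
    "wchar": ("utf-16-le", 2),  # 16-bit code units
    "dchar": ("utf-32-le", 4),  # 32-bit code units
}

def code_units(text, d_type):
    """Number of array elements needed to hold text in the given D type."""
    encoding, unit_size = D_CHAR_TYPES[d_type]
    return len(text.encode(encoding)) // unit_size

sample = "a\u00f8\u20ac\U0001d11e"   # four code points of increasing size
for name in D_CHAR_TYPES:
    print(name, code_units(sample, name))
```

Only the dchar count equals the number of code points; the other two depend on which characters the string contains.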

Hauke
April 14, 2004

> > char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.
>
> This is not correct. dchar is UTF-32 and wchar is UTF-16.

heh. I can never remember which one is which either.
How about changing dchar to wwchar for "weally wide char", which
is scalable to any number of bytes - weally weally wide char, etc ;-)


April 14, 2004
Ben Hinkle wrote:
> "Scott Egan" <scotte@tpg.com.aux> wrote in message
> news:c5jhsa$1l7v$1@digitaldaemon.com...
> 
>>Fine, but UTF-8 sucks about as much as ASN.1. Why not just stick with
>>UCS-2 (UTF-16?), i.e. straight 16-bit chars?
> 
> 
> UTF-8 is a compromise between Unicode support and C's character model.
> Unicode hasn't flared up on the newsgroup in a while so you might have to
> look back a while to find Walter's arguments for and against the various
> ideas.

Since it has come up before, I've made a list of some of these threads:
http://www.wikiservice.at/d/wiki.cgi?UnicodeIssues

-- 
Justin
http://jcc_7.tripod.com/d/
April 14, 2004
"Ben Hinkle" <bhinkle4@juno.com> wrote in message news:c5jun3$28gi$1@digitaldaemon.com...
> How about changing dchar to wwchar for "weally wide char",

LOL! Wish I'd thought of that!


April 15, 2004
Given D's intent to maintain some low-level 'system' capability,
I'd rather just use UTF-32 if it came down to it.
The fixed-size representation it offers has surely got to make dealing with
strings more efficient and faster (and easier to muck around with).
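
The efficiency argument is mostly about indexing, which a minimal Python sketch can show. Both function names are hypothetical, the UTF-8 version assumes well-formed input, and neither is meant as a real string library.

```python
import struct

def nth_code_point_utf32(buf, n):
    """Constant time: the n-th code point sits at byte offset 4 * n."""
    return struct.unpack_from("<I", buf, 4 * n)[0]

def nth_code_point_utf8(buf, n):
    """Linear time: count character starts until the n-th one is reached."""
    index = -1
    for offset, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:              # not a continuation byte
            index += 1
            if index == n:
                end = offset + 1
                while end < len(buf) and buf[end] & 0xC0 == 0x80:
                    end += 1                 # take in the continuation bytes
                return ord(buf[offset:end].decode("utf-8"))
    raise IndexError(n)

text = "a\u00f8\u20ac\U0001d11eb"
utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")
print(hex(nth_code_point_utf32(utf32, 3)), hex(nth_code_point_utf8(utf8, 3)))
```

The price of the constant-time lookup is space: this sample takes 20 bytes in UTF-32 against 11 in UTF-8.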

The various stream libraries could be left to take care of any necessary conversions.
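
Something along these lines is what I have in mind, sketched with Python's io module rather than any D stream class. The helper names `read_text` and `write_text` and the encodings chosen are assumptions for the example: text is decoded once on the way in, kept in one internal form, and encoded once on the way out.

```python
import io

def read_text(stream, encoding):
    """Decode an external byte stream into an internal string."""
    return io.TextIOWrapper(stream, encoding=encoding, newline="").read()

def write_text(text, encoding):
    """Encode an internal string for an external consumer."""
    out = io.BytesIO()
    wrapper = io.TextIOWrapper(out, encoding=encoding, newline="")
    wrapper.write(text)
    wrapper.flush()                 # push buffered text through the encoder
    data = out.getvalue()
    wrapper.detach()                # leave the underlying buffer alone
    return data

incoming = "Sigbj\u00f8rn".encode("utf-16-le")      # pretend this came from a file
internal = read_text(io.BytesIO(incoming), "utf-16-le")
outgoing = write_text(internal, "utf-8")            # pretend this goes to a socket
print(len(incoming), len(internal), len(outgoing))
```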

However, that said, I'll drop it.


"Sigbjørn Lund Olsen" <sigbjorn@lundolsen.net> wrote in message news:c5jlp2$1r51$1@digitaldaemon.com...
> Scott Egan wrote:
>
> > Fine, but UTF-8 sucks about as much as ASN.1. Why not just stick with
> > UCS-2 (UTF-16?), i.e. straight 16-bit chars?
>
> char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.
>
> Of the three Unicode transformation formats, I personally find UTF-8 and UTF-32 the most appealing: UTF-8 for its backwards compatibility with the widespread (if lacking) ASCII format, and UTF-32 because it's by far the most straightforward of the formats. UTF-16 is *not* straight 16-bit chars: there are roughly 1.1 million possible Unicode code points, and any code point above U+FFFF needs two UTF-16 shorts (a surrogate pair). Only UTF-32 offers a straightforward 'same sized block' encoding of code points within the Unicode standard, and personally I can't see what UTF-16 has to offer compared to the two alternatives, aside from size savings in some languages. But, of course, UTF-8 (and probably UTF-32) 'suck' out of the box, don't they?
>
> At any rate, the three formats seem to all be in use in different spheres of computing, and thus all have their proper place in a generic programming language.
>
> Furthermore, the standard library has (afaik - I haven't used them) functions for converting between the three formats.
>
> Cheers,
> Sigbjørn Lund Olsen

