March 09, 2005
To wchar or not to wchar?
Which would be the best string type to use - char[], wchar[] or dchar[]? I 
want to choose one of them and stick with it throughout my code for the sake 
of consistency. My preference would be for wchar[] but using it is not as 
smooth as I'd hoped. For example, Object.toString() returns char[], Phobos 
seems not to have wchar versions for integer-to-string conversions, and 
concatenating sometimes requires casts. It's not too bad, I suppose: I can 
use free functions to encode/decode strings and write my own integer 
conversion routines. But I am puzzled as to why I need to cast when 
concatenating, e.g.:

   wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~
                  cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;
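
A minimal sketch of how the same concatenation might be written without casts, assuming a current D compiler and Phobos (the 2005 release discussed in this thread may behave differently). The helper name describe and the times parameter are hypothetical, added only to show an integer being converted straight to UTF-16 with std.conv.to.

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical helper: builds the sentence from the post as a UTF-16 string.
wstring describe(wstring quickAnimalStr, wstring lazyAnimalStr, int times)
{
    // The "w" suffix types each literal as a wstring, so no cast is needed.
    // to!wstring converts the integer directly to UTF-16 text.
    return "The quick brown "w ~ quickAnimalStr
         ~ " jumped over the lazy "w ~ lazyAnimalStr
         ~ " "w ~ to!wstring(times) ~ " times"w;
}

void main()
{
    writeln(describe("fox"w, "dog"w, 3));
}
```

The literal suffixes c, w and d select char, wchar and dchar literals respectively, which keeps the operand types on both sides of ~ the same.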

Anyway, I'm doing a lot of text processing on Windows XP, which uses UTF16 
natively, so it seemed sane to choose the equivalent string type in D. Plus 
I read here http://www.digitalmars.com/techtips/windows_utf.html that char[] 
is not directly compatible with the ANSI versions of the Windows API (again, 
I'm using this a lot).
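
As a rough illustration of the conversion step involved, assuming a current Phobos: std.utf.toUTF16z turns UTF-8 text into a zero-terminated UTF-16 buffer, which is the shape the wide ("W") Windows functions take. No real API call is made here, and the helper wlen is hypothetical.

```d
import std.utf : toUTF16, toUTF16z, toUTF8;

// Hypothetical helper: length of a zero-terminated UTF-16 buffer, in code units.
size_t wlen(const(wchar)* p)
{
    size_t n = 0;
    while (p[n] != 0)
        ++n;
    return n;
}

void main()
{
    string utf8 = "caf\u00E9 \u20AC";          // UTF-8 in memory
    const(wchar)* forWinApi = toUTF16z(utf8);  // zero-terminated UTF-16
    wstring utf16 = toUTF16(utf8);             // same text as a D wstring

    assert(wlen(forWinApi) == utf16.length);   // terminator sits right after the text
    assert(toUTF8(utf16) == utf8);             // round trip back to UTF-8
}
```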

Given the above considerations, which do you advise I go with?

Cheers,
John.

P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32 version 
identifiers which we could use on the command line to tell the compiler to 
expect that string type as the default (e.g., "version=UTF16"). It would 
then mean that char[] becomes an alias for the specified type. When the type 
is not specified, char[] goes back to being UTF8.
March 10, 2005
Re: To wchar or not to wchar?
> Which would be the best string type to use - char[], wchar[] or dchar[]? I 
> want to choose one of them and stick with it throughout my code for the 
> sake of consistency. My preference would be for wchar[] but using it is 
> not as smooth as I'd hoped. For example, Object.toString() returns char[], 
> Phobos seems not to have wchar versions for integer-to-string conversions, 
> and concatenating sometimes requires casts. It's not too bad, I suppose: I 
> can use free functions to encode/decode strings and write my own integer 
> conversion routines. But I am puzzled as to why I need to cast when 
> concatenating, e.g:
>
>    wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~ 
> cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;
>
> Anyway, I'm doing a lot of text processing on Windows XP, which uses UTF16 
> natively, so it seemed sane to choose the equivalent string type in D. 
> Plus I read here http://www.digitalmars.com/techtips/windows_utf.html that 
> char[] is not directly compatible with the ANSI versions of the Windows 
> API (again, I'm using this a lot).

You've pretty much summed up all the pros and cons.  XP uses wchars 
natively, but Phobos is not too kind to them.

I just use char[] as I'm not planning on translating my programs into 
languages which use non-roman alphabets any time soon ;)

> P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32 
> version identifiers which we could use on the command line to tell the 
> compiler to expect that string type as the default (e.g., 
> "version=UTF16"). It would then mean that char[] becomes an alias for the 
> specified type. When the type is not specified, char[] goes back to being 
> UTF8.

Don't know about it being a language feature, but perhaps something that 
could be added to the runtime.  Something like a conditional alias that 
would define a type like "nchar" to mean "native char".
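
One way such a conditional alias could look, as a sketch only. The names nchar and nstring are hypothetical and not part of Phobos, and the choice of wchar on Windows and char elsewhere is an assumption based on the discussion above.

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical names: nchar and nstring do not exist in the standard library.
version (Windows)
{
    alias nchar = wchar;   // UTF-16 code unit, matching the wide Windows API
}
else
{
    alias nchar = char;    // UTF-8 code unit everywhere else
}

alias nstring = immutable(nchar)[];

void main()
{
    // std.conv.to transcodes between the string types as needed.
    nstring greeting = to!nstring("hello");
    writeln(greeting.length);
}
```

Code written against nstring would then pick up the platform's native encoding without a compiler switch, at the cost of still needing explicit conversions wherever a library insists on char[].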
March 10, 2005
Re: To wchar or not to wchar?
> I just use char[] as I'm not planning on translating my programs into 
> languages which use non-roman alphabets any time soon ;)

You'll be surprised, but even the Latin-1 set does not fit into
a char.

http://www.bbsinc.com/symbol.html

For example, if you don't use wchar, you won't be able to see, e.g., the
Euro sign as one char: it takes two UTF-8 bytes.

:) The name "Anders F Bjorklund" will also be represented with one byte more.

Etc.


"Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote in message 
news:d0odnf$svr$1@digitaldaemon.com...
>> Which would be the best string type to use - char[], wchar[] or dchar[]? 
>> I want to choose one of them and stick with it throughout my code for the 
>> sake of consistency. My preference would be for wchar[] but using it is 
>> not as smooth as I'd hoped. For example, Object.toString() returns 
>> char[], Phobos seems not to have wchar versions for integer-to-string 
>> conversions, and concatenating sometimes requires casts. It's not too 
>> bad, I suppose: I can use free functions to encode/decode strings and 
>> write my own integer conversion routines. But I am puzzled as to why I 
>> need to cast when concatenating, e.g:
>>
>>    wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~ 
>> cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;
>>
>> Anyway, I'm doing a lot of text processing on Windows XP, which uses 
>> UTF16 natively, so it seemed sane to choose the equivalent string type in 
>> D. Plus I read here http://www.digitalmars.com/techtips/windows_utf.html 
>> that char[] is not directly compatible with the ANSI versions of the 
>> Windows API (again, I'm using this a lot).
>
> You've pretty much summed up all the pros and cons.  XP uses wchars 
> natively, but Phobos is not too kind to them.
>
> I just use char[] as I'm not planning on translating my programs into 
> languages which use non-roman alphabets any time soon ;)
>
>> P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32 
>> version identifiers which we could use on the command line to tell the 
>> compiler to expect that string type as the default (e.g., 
>> "version=UTF16"). It would then mean that char[] becomes an alias for the 
>> specified type. When the type is not specified, char[] goes back to being 
>> UTF8.
>
> Don't know about it being a language feature, but perhaps something that 
> could be added to the runtime.  Something like a conditional alias that 
> would define a type like "nchar" to mean "native char".
>
March 10, 2005
Re: To wchar or not to wchar?
Andrew Fedoniouk wrote:

>>I just use char[] as I'm not planning on translating my programs into 
>>languages which use non-roman alphabets any time soon ;)
> 
> You'll be surprised but even Latin-1 set does not fit into
> the char.

He probably meant "non-US"? (A lone char holds only US-ASCII characters.)

> For example if you will not use wchar you will not be able to see e.g.
> Euro sign as one char - two UTF-8 bytes.

Three, actually:

   char[1] euro = "\u20AC";

which the compiler rejects with:

   cannot implicitly convert expression "\u20ac" of type char[3] to char[1]

http://www.fileformat.info/info/unicode/char/20ac/index.htm
Some characters even take 4 bytes.
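
To make the byte counts concrete, here is a small sketch assuming a current Phobos, where std.utf.codeLength reports how many code units a code point needs in a given encoding. The sample characters are chosen only for illustration.

```d
import std.stdio : writefln;
import std.utf : codeLength;

void main()
{
    // 'A' (ASCII), o-umlaut (Latin-1 range), Euro sign, and a character
    // outside the Basic Multilingual Plane (U+1D11E, musical G clef).
    foreach (dchar c; "A\u00F6\u20AC\U0001D11E"d)
    {
        writefln("U+%05X: %s UTF-8 bytes, %s UTF-16 code units",
                 cast(uint) c, codeLength!char(c), codeLength!wchar(c));
    }

    assert("\u20AC".length == 3);       // Euro sign as char[]: three bytes
    assert("\u20AC"w.length == 1);      // as wchar[]: one code unit
    assert("\U0001D11E".length == 4);   // four bytes in UTF-8
    assert("\U0001D11E"w.length == 2);  // a surrogate pair in UTF-16
}
```

Note that wchar[] is not one unit per character either once text leaves the Basic Multilingual Plane; only dchar[] gives that guarantee per code point.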

> :) "Anders F Bjorklund" name will also be represented with one byte more.

It actually messed up GDC, my name was added in Latin-1 in a comment...
(when I added the patch to DMD that actually made it check comments too)

It's even more fun when using .length, as it returns bytes (code units).
I use char[] and dchar, myself. (And not wchar[] and wchar, like Java.)
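
A short sketch of the char[] plus dchar combination, assuming a current D compiler: .length counts UTF-8 code units, while a foreach with a dchar loop variable decodes one code point per iteration. The helper name codePoints is hypothetical; std.utf.count does the same job in Phobos.

```d
import std.utf : count;

// Hypothetical helper: counts code points by letting foreach decode to dchar.
size_t codePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)   // each iteration yields one decoded code point
        ++n;
    return n;
}

void main()
{
    string price = "10 \u20AC";        // "10 " followed by the Euro sign

    assert(price.length == 6);         // 3 ASCII bytes + 3 bytes for the Euro sign
    assert(codePoints(price) == 4);    // four code points
    assert(count(price) == 4);         // std.utf.count agrees
}
```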

--anders
March 10, 2005
Re: To wchar or not to wchar?
> He probably meant "non-US" ? (lone chars holds US-ASCII characters)

That's it.  It seems weird though that such a (relatively) common letter as 
umlaut-o would be represented as 2 bytes in UTF8.  Maybe I'm thinking of 
ASCII (and not the old kind, where chars 128-255 are lines and stuff).
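
The byte-level picture can be sketched as follows, assuming a current Phobos. In Latin-1 the letter o-umlaut is the single byte 0xF6, but UTF-8 reserves every byte from 128 to 255 for multi-byte sequences, so the same letter becomes 0xC3 0xB6. The helper name utf8Bytes is hypothetical.

```d
import std.exception : assertThrown;
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical helper: exposes the raw UTF-8 bytes of a string.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    immutable ubyte[] expected = [0xC3, 0xB6];
    assert(utf8Bytes("\u00F6") == expected);   // o-umlaut: two bytes in UTF-8
    assert(utf8Bytes("o").length == 1);        // plain ASCII stays one byte

    // The Latin-1 byte for o-umlaut, taken alone, is not valid UTF-8.
    char[] latin1 = [cast(char) 0xF6];
    assertThrown!UTFException(validate(latin1));
}
```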