Thread overview
To wchar or not to wchar?
Mar 09, 2005
John C
Mar 10, 2005
Andrew Fedoniouk
March 09, 2005
Which would be the best string type to use - char[], wchar[] or dchar[]? I want to choose one of them and stick with it throughout my code for the sake of consistency. My preference would be for wchar[] but using it is not as smooth as I'd hoped. For example, Object.toString() returns char[], Phobos seems not to have wchar versions for integer-to-string conversions, and concatenating sometimes requires casts. It's not too bad, I suppose: I can use free functions to encode/decode strings and write my own integer conversion routines. But I am puzzled as to why I need to cast when concatenating, e.g:

    wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~
                   cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;

Anyway, I'm doing a lot of text processing on Windows XP, which uses UTF16 natively, so it seemed sane to choose the equivalent string type in D. Plus I read here http://www.digitalmars.com/techtips/windows_utf.html that char[] is not directly compatible with the ANSI versions of the Windows API (again, I'm using this a lot).
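For what it's worth, those casts can be replaced by explicit transcoding. A minimal sketch, assuming Phobos's std.utf.toUTF16 and placeholder animal strings; it uses current D's wstring spelling (plain wchar[] in 2005 terms):

```d
import std.utf : toUTF16;

void main()
{
    // Placeholder values; in 2005-era D these would be plain wchar[].
    wstring quickAnimalStr = "fox";
    wstring lazyAnimalStr  = "dog";

    // toUTF16 transcodes UTF-8 to UTF-16; a cast would merely
    // reinterpret the bytes and mangle any non-ASCII text.
    wstring text = toUTF16("The quick brown ") ~ quickAnimalStr
                 ~ toUTF16(" jumped over the lazy ") ~ lazyAnimalStr;

    assert(text == "The quick brown fox jumped over the lazy dog"w);
}
```

Writing the literals with a `w` suffix ("The quick brown "w) is the other way to get UTF-16 literals directly, with no conversion call at all.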

Given the above considerations, which do you advise I go with?

Cheers,
John.

P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32 version identifiers which we could use on the command line to tell the compiler to expect that string type as the default (e.g., "version=UTF16"). It would then mean that char[] becomes an alias for the specified type. When the type is not specified, char[] goes back to being UTF8.


March 10, 2005
> Which would be the best string type to use - char[], wchar[] or dchar[]? I want to choose one of them and stick with it throughout my code for the sake of consistency. My preference would be for wchar[] but using it is not as smooth as I'd hoped. For example, Object.toString() returns char[], Phobos seems not to have wchar versions for integer-to-string conversions, and concatenating sometimes requires casts. It's not too bad, I suppose: I can use free functions to encode/decode strings and write my own integer conversion routines. But I am puzzled as to why I need to cast when concatenating, e.g:
>
>    wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~
> cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;
>
> Anyway, I'm doing a lot of text processing on Windows XP, which uses UTF16 natively, so it seemed sane to choose the equivalent string type in D. Plus I read here http://www.digitalmars.com/techtips/windows_utf.html that char[] is not directly compatible with the ANSI versions of the Windows API (again, I'm using this a lot).

You've pretty much summed up all the pros and cons.  XP uses wchars natively, but Phobos is not too kind to them.

I just use char[] as I'm not planning on translating my programs into languages which use non-roman alphabets any time soon ;)

> P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32 version identifiers which we could use on the command line to tell the compiler to expect that string type as the default (e.g., "version=UTF16"). It would then mean that char[] becomes an alias for the specified type. When the type is not specified, char[] goes back to being UTF8.

Don't know about it being a language feature, but perhaps something that could be added to the runtime.  Something like a conditional alias that would define a type like "nchar" to mean "native char".
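That conditional alias could look something like this. A sketch only: "nchar" is the hypothetical name from the post, the UTF16/UTF32 identifiers are assumed to be passed on the command line (e.g. dmd -version=UTF16), and `alias nchar = ...` is the modern spelling (2005-era D wrote `alias wchar nchar;`):

```d
version (UTF16)
    alias nchar = wchar;    // native char is UTF-16
else version (UTF32)
    alias nchar = dchar;    // native char is UTF-32
else
    alias nchar = char;     // default: UTF-8

void main()
{
    // Compiled without any -version flag, nchar falls back to char.
    static assert(is(nchar == char) || is(nchar == wchar)
               || is(nchar == dchar));
}
```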


March 10, 2005
> I just use char[] as I'm not planning on translating my programs into languages which use non-roman alphabets any time soon ;)

You'll be surprised, but even the Latin-1 set does not fit into a char.

http://www.bbsinc.com/symbol.html

For example, if you don't use wchar you won't be able to see e.g. the Euro sign as one char - it takes two UTF-8 bytes.

:) The name "Anders F Bjorklund" will also be represented with one byte more, etc.


"Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote in message news:d0odnf$svr$1@digitaldaemon.com...


March 10, 2005
Andrew Fedoniouk wrote:

>>I just use char[] as I'm not planning on translating my programs into languages which use non-roman alphabets any time soon ;)
> 
> You'll be surprised, but even the Latin-1 set does not fit
> into a char.

He probably meant "non-US"? (a lone char holds only US-ASCII characters)

> For example if you will not use wchar you will not be able to see e.g.
> Euro sign as one char - two UTF-8 bytes.

Three, actually:

    char[1] euro = "\u20AC";
    > cannot implicitly convert expression "\u20ac" of type char[3] to char[1]

http://www.fileformat.info/info/unicode/char/20ac/index.htm
Some characters even take 4 bytes.

> :) "Anders F Bjorklund" name will also be represented with one byte more.

It actually messed up GDC; my name had been added in Latin-1 in a comment... (this surfaced when I added the patch to DMD that made it validate comments too)

It's even more fun when using .length, as it returns bytes (code units).
I use char[] and dchar, myself. (and not wchar[] and wchar, like Java)
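The code-unit behaviour of .length is easy to demonstrate. A sketch using current D's string/wstring/dstring aliases (char[], wchar[], dchar[] in 2005 terms):

```d
void main()
{
    string  a = "\u20AC";      // Euro sign, U+20AC, as UTF-8
    wstring b = "\u20AC";      // ... as UTF-16
    dstring c = "\u20AC";      // ... as UTF-32

    // .length counts code units, not characters:
    assert(a.length == 3);     // three UTF-8 bytes
    assert(b.length == 1);     // one UTF-16 code unit
    assert(c.length == 1);     // one code point

    // Characters outside the BMP take four UTF-8 bytes
    // and a surrogate pair in UTF-16:
    string  d = "\U0001D11E";  // musical G clef
    wstring e = "\U0001D11E";
    assert(d.length == 4);
    assert(e.length == 2);
}
```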

--anders
March 10, 2005
> He probably meant "non-US" ? (lone chars holds US-ASCII characters)

That's it. It seems weird, though, that such a (relatively) common letter as umlaut-o would be represented as 2 bytes in UTF-8. Maybe I'm thinking of ASCII (and not the old kind, where chars 128-255 are lines and stuff).
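Everything above U+007F does take at least two bytes in UTF-8, even letters that were a single byte in Latin-1. A quick check, again in current D spelling; the literal is U+00F6:

```d
void main()
{
    string o = "ö";           // U+00F6: one byte in Latin-1...
    assert(o.length == 2);    // ...but two code units in UTF-8

    dstring od = "ö";
    assert(od.length == 1);   // and a single UTF-32 code point
}
```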