Thread overview
[D-runtime] Wide characters in D
Jun 07, 2010
Ruslan Nikolaev
Jun 09, 2010
Sean Kelly
Jun 09, 2010
Ruslan Nikolaev
June 06, 2010
Hi. I am new to D. It looks like D supports 3 types of characters: char, wchar, dchar. This is cool, however, I have some questions about it:

1. When we have 2 methods (one with wchar[] and another with char[]), how will D determine which one to use if I pass a string "hello world"?
2. Many libraries (e.g. tango or phobos) don't provide functions/methods (or have incomplete support) for wchar/dchar
e.g. writefln probably assumes char[] for strings like "Number %d..."
3. Even if they do support them, it is kind of annoying to provide methods for all 3 types of chars, especially if we want to use native mode (e.g. for Windows wchar is better, for Linux char is better). E.g. Windows has _wopen, _wdirent, _wreaddir, _wopendir, _wmain(int argc, wchar_t[] argv) and so on, and they should be native (in the sense that no conversion is necessary when we call, for instance, _wopen). Linux doesn't have them, as UTF-8 is widely used there.

Since the D language targets systems programming, why not use whatever works better on a particular system (e.g. char would be 2 bytes on Windows and 1 byte on Linux; it could be a compiler switch, and all libraries could be compiled accordingly for a particular system)? It's still necessary to have all 3 types of char for interoperating with C, but in those cases byte, short and int will do the job. For this kind of situation, it would be nice to have some built-in functions for transparent conversion from char to byte/short/int and vice versa (especially if the conversion only happens when needed on a particular platform).

In my opinion, separating the notion of a character from a byte would be nice, and it makes sense since a particular platform uses either UTF-8 or UTF-16 natively. Programmers could write universal code (like TCHAR on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably, but why does D have to make this mistake again?

Sorry if my suggestion sounds odd. Anyway, it would be great to hear something from D gurus :-)

Ruslan.



June 09, 2010
On Jun 6, 2010, at 5:00 PM, Ruslan Nikolaev wrote:

> Hi. I am new to D. It looks like D supports 3 types of characters: char, wchar, dchar. This is cool, however, I have some questions about it:
> 
> 1. When we have 2 methods (one with wchar[] and another with char[]), how will D determine which one to use if I pass a string "hello world"?

You'll get an overload error because an unqualified string literal converts to both string and wstring.  You'd have to either cast or use "hello world"c or "hello world"w to call the desired routine.
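
A minimal sketch of the suffix approach described above. The function name `which` is hypothetical and only serves the illustration. The unsuffixed call is left commented out on purpose: the reply says it is rejected as ambiguous, but how a given compiler version treats it may differ, so the example relies only on the explicit `c` and `w` suffixes.

```d
import std.stdio : writeln;

// Two hypothetical overloads that differ only in character type.
string which(string s)  { return "char[] overload"; }
string which(wstring s) { return "wchar[] overload"; }

void main()
{
    // which("hello world");  // unsuffixed literal: see the note above

    // The postfix fixes the literal's type, so each call selects
    // exactly one overload.
    writeln(which("hello world"c)); // UTF-8 literal
    writeln(which("hello world"w)); // UTF-16 literal
}
```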

> 2. Many libraries (e.g. tango or phobos) don't provide functions/methods (or have incomplete support) for wchar/dchar
> e.g. writefln probably assumes char[] for strings like "Number %d..."

I think writefln will actually accept any kind of string.  That's how the code looks anyway, though I've never tried anything but utf-8.
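
As a quick way to check this, the sketch below passes format strings of all three widths. It assumes a Phobos version in which `writefln` and `std.format.format` are templated on the character type of the format string; that is an assumption about the library version in use, not something confirmed in this thread.

```d
import std.format : format;
import std.stdio : writefln;

void main()
{
    // The same format string in UTF-8, UTF-16 and UTF-32.
    writefln("Number %d", 1);
    writefln("Number %d"w, 2);
    writefln("Number %d"d, 3);

    // format() returns a string of the same width as its format string.
    wstring w = format("Number %d"w, 42);
    assert(w == "Number 42"w);
}
```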

> 3. Even if they do support them, it is kind of annoying to provide methods for all 3 types of chars, especially if we want to use native mode (e.g. for Windows wchar is better, for Linux char is better). E.g. Windows has _wopen, _wdirent, _wreaddir, _wopendir, _wmain(int argc, wchar_t[] argv) and so on, and they should be native (in the sense that no conversion is necessary when we call, for instance, _wopen). Linux doesn't have them, as UTF-8 is widely used there.

Templates should largely take care of this for library functions.  It's rare that an algorithm has to know it's working with a string of characters.
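
To illustrate the template point, here is a small sketch of one function body serving char, wchar and dchar. The name `countSpaces` is made up for the example. It works on code units rather than decoded characters, which is safe here because a space occupies a single code unit in UTF-8, UTF-16 and UTF-32.

```d
import std.stdio : writeln;
import std.traits : isSomeChar;

// Hypothetical helper: counts spaces in a string of any character width.
size_t countSpaces(Char)(const(Char)[] text)
    if (isSomeChar!Char)
{
    size_t n = 0;
    foreach (c; text)      // iterates code units, no decoding
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    writeln(countSpaces("a b c"));   // instantiated with char
    writeln(countSpaces("a b c"w));  // instantiated with wchar
    writeln(countSpaces("a b c"d));  // instantiated with dchar
}
```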

> Since the D language targets systems programming, why not use whatever works better on a particular system (e.g. char would be 2 bytes on Windows and 1 byte on Linux; it could be a compiler switch, and all libraries could be compiled accordingly for a particular system)? It's still necessary to have all 3 types of char for interoperating with C, but in those cases byte, short and int will do the job. For this kind of situation, it would be nice to have some built-in functions for transparent conversion from char to byte/short/int and vice versa (especially if the conversion only happens when needed on a particular platform).

Casting?  Or do you mean codepage conversions?  Personally, I'd rather use a specific encoding internally and if necessary convert during IO.  If you want your app to be portable you won't be able to use a single encoding throughout anyway--you'll need utf-8 for IO on Posix, utf-16 for IO on Windows, etc.
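
One way to picture "one encoding internally, convert at the IO boundary" is the sketch below. The names `NativeChar`, `NativeString`, `toNative` and `fromNative` are hypothetical and not part of any library; the conversion itself uses `std.conv.to`, which transcodes between D's string types. On Posix the conversion to native is effectively a no-op, on Windows it produces UTF-16.

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical aliases: the code unit the platform's APIs expect.
version (Windows)
    alias NativeChar = wchar;   // UTF-16 for the wide Windows calls
else
    alias NativeChar = char;    // UTF-8 on Posix

alias NativeString = immutable(NativeChar)[];

// Convert the internal UTF-8 string only at the boundary.
NativeString toNative(string s)
{
    return s.to!NativeString;
}

// Convert text received from the platform back to UTF-8.
string fromNative(const(NativeChar)[] s)
{
    return s.to!string;
}

void main()
{
    string internal = "h\u00E9llo";      // kept as UTF-8 inside the app
    auto native = toNative(internal);    // what would be handed to the OS
    writeln(native.length);              // code units in the native form
    assert(fromNative(native) == internal);
}
```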

> In my opinion, separating the notion of a character from a byte would be nice, and it makes sense since a particular platform uses either UTF-8 or UTF-16 natively. Programmers could write universal code (like TCHAR on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably, but why does D have to make this mistake again?

Working with multibyte characters is computationally expensive and often unnecessary.  Plus, it makes things a bit weird in a systems language.  If I have a char*, seems like dereferencing the pointer would give me a value of 4 bytes back, the last 0-3 being zero?  I really don't know how this would work.
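
For comparison, this is roughly how the existing design behaves: a `char` is one UTF-8 code unit, dereferencing a `char*` yields exactly that one unit, and obtaining a whole character is an explicit decoding step. The sketch uses `std.utf.decode` and `std.utf.count`; the sample string is arbitrary.

```d
import std.stdio : writeln;
import std.utf : count, decode;

void main()
{
    string s = "h\u00E9llo";        // U+00E9 takes two UTF-8 code units

    assert(s.length == 6);          // length is in code units
    assert(count(s) == 5);          // count() walks the code points

    const(char)* p = s.ptr;
    assert(*p == 'h');              // a dereference gives one code unit

    // Decoding is explicit: read the code point that starts at index 1.
    size_t i = 1;
    dchar c = decode(s, i);
    assert(c == '\u00E9');
    assert(i == 3);                 // index advanced past both code units

    writeln("code units: ", s.length, ", code points: ", count(s));
}
```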

For codepage conversions and the like, I've had tremendous success with libicu.  I don't know that a binding for it is appropriate for Phobos, but I'd love to see a well-maintained project for this on dsource.
June 09, 2010
We moved our discussion to the general D-language mailing list. The rationale for having tchar and some problems with templates were discussed there.

Thank you!
Ruslan.
