Thread overview
Re: Wide characters support in D
Jun 07, 2010
Ruslan Nikolaev
Jun 08, 2010
Walter Bright
Jun 08, 2010
Ruslan Nikolaev
June 07, 2010
Just one more addition: it is possible to have a built-in function that converts a multibyte (or multiword) char sequence (even though in my proposal the char can be of different sizes) to a dchar (UTF-32) character. Again, my only point is that it would be nice to have something similar to TCHAR, so that all libraries can use it if they choose not to provide functions for all 3 types.

To Walter:
Yes, programmers do often ignore surrogate pairs in the case of UTF-16. But with an undetermined char size (1 or 2 bytes), they will have to use special built-in conversion functions to dchar unless they want their code to be completely broken.
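
For reference, a conversion of this kind can be sketched with std.utf.decode from Phobos, which reads one code point from a char[] or wchar[] and advances the index past the whole sequence. This is only an illustration of the idea: the helper name toCodePoints is made up for the example and is not part of the proposal.

```d
import std.stdio : writeln;
import std.utf : decode;

// Hypothetical helper: decode any UTF-8/16/32 string into code points.
dchar[] toCodePoints(C)(const(C)[] s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i);   // decode advances i past the whole sequence
    return result;
}

void main()
{
    // 'a', U+00E9 (2 code units in UTF-8), U+1D11E (surrogate pair in UTF-16)
    assert(toCodePoints("a\u00E9\U0001D11E")  == "a\u00E9\U0001D11E"d);
    assert(toCodePoints("a\u00E9\U0001D11E"w) == "a\u00E9\U0001D11E"d);
    writeln("ok");
}
```

The same helper works whether the input is 1-byte or 2-byte code units, which is the property code written against a platform-dependent char type would have to rely on.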

Thanks,
Ruslan.

--- On Tue, 6/8/10, Ruslan Nikolaev <nruslan_devel@yahoo.com> wrote:

> From: Ruslan Nikolaev <nruslan_devel@yahoo.com>
> Subject: Re: Wide characters support in D
> To: "digitalmars.D" <digitalmars-d@puremagic.com>
> Date: Tuesday, June 8, 2010, 3:16 AM
> Ok, ok... that was just a
> suggestion... Thanks for the reply about the "Hello world"
> representation. Were the postfixes "w" and "d" there from the start, or
> were they added just recently? I did not know about them. I thought D does
> automatic conversion for string literals.
> 
> Yes, templates may help. However, that unnecessarily makes the code bigger (since we have to compile it for every char type). The other problem is that it allows the programmer to choose which one to use. He or she may just prefer char[] as UTF-8 (or wchar[] as UTF-16). That will be fine on a platform that supports this encoding natively (e.g. for file system operations, screen output, etc.), whereas it will cause conversion overhead on the others. Not to say that it's a big overhead, but it is an unnecessary one. Having said this, I do agree that there must be some flexibility (e.g. a Java char is always 2 bytes); however, I don't believe that this flexibility should be available to the application programmer.
> 
> I don't think there is any problem with having different sizes of char. In fact, that would make programs better (since application programmers will have to think in terms of characters as opposed to bytes). System programmers (i.e. OS programmers) may choose to think of it as they expect it to be (since a char width option can be added to the compiler). TCHAR in Windows is a good example of this. Whenever you need to determine the size of an element (e.g. for allocation), you can use 'sizeof'. Again, it does not mean that you're deprived of the char/wchar/dchar capability. It can still be supported (e.g. via ubyte/ushort/uint) for the sake of interoperability or some special cases. Special string constants (e.g. ""b, ""w, ""d) can be supported, too. My only point is that it would be good to have a universal char type that depends on the platform. That, in turn, allows a unified char for all libraries on that platform.
> 
> In addition, commonly used constants '\n', '\r', '\t' will be the same regardless of char width.
> 
> Anyway, that was just a suggestion. You may disagree with this if you wish.
> 
> Ruslan.

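To make the template point in the quoted message concrete, here is a minimal sketch of a function written once and instantiated for char, wchar and dchar. The name countLines is hypothetical. It also illustrates the remark about '\n': in UTF-8 and UTF-16 an ASCII code unit never appears inside a longer sequence, so the comparison is safe without decoding.

```d
import std.stdio : writeln;

// One source-level function; the compiler generates a separate
// instance for each character type it is used with.
size_t countLines(C)(const(C)[] text)
    if (is(C == char) || is(C == wchar) || is(C == dchar))
{
    size_t lines = 0;
    foreach (C unit; text)   // iterate over code units, no decoding
        if (unit == '\n')
            ++lines;
    return lines;
}

void main()
{
    assert(countLines("a\u00E9\nb\n")  == 2);   // UTF-8 instance
    assert(countLines("a\u00E9\nb\n"w) == 2);   // UTF-16 instance
    assert(countLines("a\u00E9\nb\n"d) == 2);   // UTF-32 instance
    writeln("ok");
}
```

Each of the three calls produces its own instance in the object code, which is the size cost mentioned above.
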
June 08, 2010
Ruslan Nikolaev wrote:
> Just one more addition: it is possible to have a built-in function that
> converts a multibyte (or multiword) char sequence (even though in my proposal
> the char can be of different sizes) to a dchar (UTF-32) character. Again, my
> only point is that it would be nice to have something similar to TCHAR, so that
> all libraries can use it if they choose not to provide functions for all 3 types.
> 
> 
> To Walter: Yes, programmers do often ignore surrogate pairs in the case of
> UTF-16. But with an undetermined char size (1 or 2 bytes), they will have to
> use special built-in conversion functions to dchar unless they want their code
> to be completely broken.

The nice thing about char[] is that you'll find out real fast if your multibyte code is broken. With surrogate pairs in wchar[], the bug may lurk undetected for a decade.
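
A small sketch of why that is, assuming nothing beyond the language itself (the helper codePoints is made up for the example): any non-ASCII character already takes more than one code unit in char[], while in wchar[] only characters outside the Basic Multilingual Plane do, and those rarely show up in test data.

```d
import std.stdio : writeln;

// Hypothetical helper: count code points by letting foreach decode to dchar.
size_t codePoints(S)(S s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string  u8  = "\u00E9";        // U+00E9: 2 UTF-8 code units
    wstring u16 = "\u00E9"w;       // same character: 1 UTF-16 code unit
    wstring sp  = "\U0001D11E"w;   // U+1D11E: a surrogate pair in UTF-16

    // Code that treats .length as a character count breaks here at once...
    assert(u8.length == 2 && codePoints(u8) == 1);
    // ...but appears to work here...
    assert(u16.length == 1 && codePoints(u16) == 1);
    // ...until a character outside the BMP arrives.
    assert(sp.length == 2 && codePoints(sp) == 1);
    writeln("ok");
}
```
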
June 08, 2010
Yes, to clarify what I suggest, I can put it as follows (2 possibilities):

1. Have special standardized types "tchar" and "tstring". Then system libraries as well as users can use these types unless they want to do something special. There can be a compiler switch to change the tchar width (essentially, to alias tchar to char, wchar or dchar), so that it can be set appropriately for each platform. In addition, tmain(tstring[] args) can be used as the entry point; _topen, _treaddir, _tfopen, etc. can be added to the bindings.
Adv: doesn't break existing code.
Disadv: tchar and tstring may look weird to users.

2. Rename the current char to bchar or schar, or something similar. Then 'char' can be used as the type described above.
Adv: users are likely to use this type.
Disadv: may break existing code, especially in the bindings.
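
Option (1) can be roughly approximated in today's D with aliases. The sketch below is an assumption-laden illustration, not a proposed implementation: tchar and tstring are the hypothetical names from the list above, and a version block stands in for the compiler switch.

```d
import std.stdio : writeln;

// Hypothetical names from the proposal; not part of D or Phobos.
version (Windows)
    alias tchar = wchar;   // the Windows "W" APIs take UTF-16
else
    alias tchar = char;    // assumption: UTF-8 on other platforms

alias tstring = immutable(tchar)[];

void main()
{
    // An unsuffixed literal converts implicitly to the target string type.
    tstring greeting = "Hello world";

    // Sizes in bytes have to go through tchar.sizeof, never a constant.
    size_t bytes = greeting.length * tchar.sizeof;

    version (Windows)
        assert(bytes == 22);
    else
        assert(bytes == 11);
    writeln(bytes);
}
```

The bindings (_topen, _tfopen and so on) would still have to be declared once per platform; the alias only fixes the element type.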

I think having something (at least (1)) would be a nice feature and addition to D, although I do admit that there can be different opinions about it. However, TCHAR in Windows DOES work fine. In the case described above it's even better, since we always work with Unicode (UTF-8/16/32), unlike Windows (which uses ANSI for 1-byte char); thus everything should be more or less transparent. It would be cool to hear something from the D, Phobos and Tango developers.

P.S. For commonly used characters (e.g. '\n') the size of char will never make any difference. Problems should not occur in good code, or should occur really rarely (and the programmer can adjust for those cases).

Thanks,
Ruslan Nikolaev