[Issue 5016] New: to!() can not convert from wide characters to char

http://d.puremagic.com/issues/show_bug.cgi?id=5016

           Summary: to!() can not convert from wide characters to char
           Product: D
           Version: D2
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: aarti@interia.pl


--- Comment #0 from Marcin Kuszczak <aarti@interia.pl> 2010-10-08 01:03:34 PDT ---
Test case:

import std.conv : to;

void main()
{
    // Instantiation error
    dchar from0 = 'A';
    char to0 = to!(char)(from0);

    // Instantiation error
    wchar from1 = 'A';
    char to1 = to!(char)(from1);

    // Ok
    char from2 = 'A';
    char to2 = to!(char)(from2);

    // Ok
    char from3 = 'A';
    wchar to3 = to!(wchar)(from3);

    // Ok
    char from4 = 'A';
    dchar to4 = to!(dchar)(from4);
}

It's an interesting case, as the failing conversions should not always succeed (e.g. when the wchar/dchar can not be encoded in one byte), while in many cases they are perfectly valid.

I am starting to think that assuming that strings/chars are just arrays is quite a big mistake in D's design: it introduces a lot of corner cases.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

http://d.puremagic.com/issues/show_bug.cgi?id=5016


--- Comment #1 from Marcin Kuszczak <aarti@interia.pl> 2011-01-09 13:18:44 PST ---
After rethinking the problem, it seems that the real problem is that char and wchar are not "real" characters. These two types are just artificial things which cause more trouble than necessary. The only "true" character is dchar, and all other character types should be deprecated.

In such a case:
    string  <=> ubyte[]  => dchar[]
    wstring <=> ushort[] => dchar[]

... and maybe also:
    dstring <=> uint[] <=> dchar[]

where "=>" means "can be viewed as".

It would cleanly and properly solve problems with strange and unnecessary conversions like "dchar -> char".

http://d.puremagic.com/issues/show_bug.cgi?id=5016


Jonathan M Davis <jmdavisProg@gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg@gmx.com


--- Comment #2 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-01-09 15:16:32 PST ---
char is explicitly defined to be a UTF-8 code unit. wchar is explicitly defined to be a UTF-16 code unit. dchar is explicitly defined to be a UTF-32 code unit. In UTF-8 and UTF-16, it can take multiple code units to make up a code point, whereas it always takes exactly one UTF-32 code unit to make a code point. A code point is what you would normally think of as a character. This is all standard Unicode stuff and getting rid of it would be foolish. It's used all over the place in computing, not just in D.

Part of the trick to dealing with char and wchar correctly is that if you wish to deal with code points / characters (_not_ code units), then _never_ deal with char and wchar individually. That's why most of std.string deals with entire strings at a time. If you want to deal with an individual character, you either use a dchar or one of the string types - e.g. 'a' as a dchar or "a" as a string type. You shouldn't be converting from dchar to char and vice versa (or between either of those and wchar). It really doesn't make sense. What makes sense is converting between string types.
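As a minimal sketch of "convert between string types, not between character types": the snippet below transcodes a whole string with std.conv.to and pulls out a single code point as a dchar with std.utf.decode. The helper name firstCodePoint is hypothetical and only there for illustration; it assumes a non-empty, valid UTF-8 input.

```d
import std.conv : to;
import std.utf : decode;

/// Returns the first code point of a string.
/// Hypothetical helper; assumes s is non-empty, valid UTF-8.
dchar firstCodePoint(string s)
{
    size_t i = 0;
    return decode(s, i); // decode advances i past the code units it consumed
}

void main()
{
    string s8 = "\u00E9t\u00E9";    // U+00E9 needs two UTF-8 code units

    // Whole-string conversions transcode; they do not just widen each element.
    wstring s16 = to!wstring(s8);
    dstring s32 = to!dstring(s8);
    assert(s8.length == 5 && s16.length == 3 && s32.length == 3);

    // Converting back yields the original text.
    assert(to!string(s32) == s8);

    // A single character is handled as a dchar, never as a lone char.
    assert(firstCodePoint(s8) == '\u00E9');
}
```

Indexing s8[0] directly would instead give the first UTF-8 code unit, which on its own is not a character here.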

On the whole, what D does works fantastically, but you need to understand the basics of Unicode. The best place to look would probably be The D Programming Language by Andrei Alexandrescu, since it applies directly to D, but there are plenty of places online to find info on Unicode, and you can look at the online docs on arrays for more info about them: http://is.gd/krYRH .

What it comes down to really is that you use whatever string type you need based on size - string, wstring, or dstring - or the need to be able to treat an individual array index as a character. If you need to be able to use random access on a string (including using it in algorithms in std.algorithm which require random access ranges), or if you need to be able to alter individual characters in place, then use dstring or dchar[]. Otherwise, save space and use either string or wstring (string would generally be better unless you're using primarily Asian characters, since they tend to take 3 bytes in UTF-8 and 2 in UTF-16).
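To make the size trade-off concrete, here is a small sketch that measures how many bytes the same text occupies in each string type. byteSize is a hypothetical helper, not a Phobos function, and the sample texts are my own choice.

```d
import std.stdio : writefln;

/// Bytes occupied by the elements of an array.
/// Hypothetical helper for illustration only.
size_t byteSize(T)(T[] text)
{
    return text.length * T.sizeof;
}

void main()
{
    // The c / w / d suffixes select string, wstring and dstring literals.
    writefln("ASCII: %s %s %s",
             byteSize("hello"c), byteSize("hello"w), byteSize("hello"d));

    // Three CJK code points: 3 bytes each in UTF-8, 2 in UTF-16, 4 in UTF-32.
    writefln("CJK:   %s %s %s",
             byteSize("\u65E5\u672C\u8A9E"c),
             byteSize("\u65E5\u672C\u8A9E"w),
             byteSize("\u65E5\u672C\u8A9E"d));
}
```

For ASCII text the dstring is four times the size of the string, while for this CJK sample the wstring is the smallest of the three.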

There are functions which specifically take a dchar, so you can give them a character then, but most deal entirely in strings, even if what you really care about is an individual character. So, generally just treat individual characters as strings with one character.

Take a look at the functions in std.utf: http://is.gd/krZLW . e.g. std.utf.count() can be used to tell you how many code points / characters there are in a string, and std.utf.stride() will tell you how many code units a particular character is so that you can index into a string or wstring if you have to.
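A short sketch of those two functions in use, assuming a current Phobos: std.utf.count gives the number of code points, and std.utf.stride is used to walk forward to the code-unit index where a given character starts. indexOfCodePoint is a hypothetical helper written for this example.

```d
import std.utf : count, stride;

/// Returns the code-unit index at which the n-th code point (0-based) starts.
/// Hypothetical helper; assumes s is valid UTF-8 with more than n code points.
size_t indexOfCodePoint(string s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
        i += stride(s, i); // code units used by the code point starting at i
    return i;
}

void main()
{
    string s = "a\u00E9\u65E5z"; // 1 + 2 + 3 + 1 UTF-8 code units

    assert(s.length == 7);       // .length counts code units
    assert(count(s) == 4);       // count() counts code points

    // The fourth character starts at code-unit index 6.
    assert(indexOfCodePoint(s, 3) == 6);
    assert(s[indexOfCodePoint(s, 3)] == 'z');
}
```

The walk is linear in n, which is the cost of character-level indexing into a string or wstring.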

When using foreach, make sure that you give the type as dchar. e.g.

string str = "hello world";

foreach(dchar c; str)
    writeln(c);

will print out each character individually, whereas using char (which is the default if you don't give a type) or wchar would print out the individual code units (which isn't generally very useful). foreach is smart enough to convert the string to the appropriate type on the fly while iterating over it, so if you give it dchar, it'll take one code point at a time instead of one code unit at a time.
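The difference only shows up with non-ASCII text, so here is a sketch that counts loop iterations for both element types. countUnits and countPoints are hypothetical names used only in this example.

```d
/// Number of iterations when foreach walks the string as UTF-8 code units.
size_t countUnits(string s)
{
    size_t n = 0;
    foreach (char c; s)
        n++;
    return n;
}

/// Number of iterations when foreach decodes the string into code points.
size_t countPoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        n++;
    return n;
}

void main()
{
    string ascii = "hello";
    string accented = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    assert(countUnits(ascii) == 5 && countPoints(ascii) == 5);
    assert(countUnits(accented) == 6); // one iteration per code unit
    assert(countPoints(accented) == 5); // one iteration per character
}
```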

I'm sure that there are other things that would be useful to point out, but that's all that comes to mind at the moment. On the whole, the way D handles strings is fantastic. You just have to realize that you're dealing with UTF-8, UTF-16, and UTF-32 code units instead of code points when you have a char, wchar, or dchar respectively. dchar/UTF-32 is the only type where code units and code points are the same size.

There has been some talk of various improvements to how all of this works (like possibly making dchar the default type for foreach with string types), so some incremental improvements may be made to iron out some of the wrinkles, but strings in D are designed the way that they are on purpose, and it's not likely to be drastically changed. For the most part, the problem is not the design but rather understanding what the design is so that you can use it properly.

If you want to avoid the whole issue, then you can just use dstring everywhere, but that _will_ result in using about 4 times the amount of memory as you would need with string if you're dealing primarily with ASCII characters.


http://d.puremagic.com/issues/show_bug.cgi?id=5016


--- Comment #3 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-01-09 15:20:07 PST ---
std.conv.to!() does need to be fixed to better handle the situation though. There are two options. It could outright refuse to convert between the character types, on the theory that there's pretty much no way that that's a good idea and that the programmer can just use cast if they really need to do such a conversion. Or it could throw when the character can't fit in a single code unit of the target type, though that's going to result in code that is rather hit or miss as to whether it succeeds, and it wouldn't likely be a good idea to use in code generally.

http://d.puremagic.com/issues/show_bug.cgi?id=5016


Andrei Alexandrescu <andrei@metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


--- Comment #4 from Andrei Alexandrescu <andrei@metalanguage.com> 2011-01-22 15:11:51 PST ---
std.conv.to for narrowing conversions acts as a checked cast. This bug was fixed in http://www.dsource.org/projects/phobos/changeset/2359 and http://www.dsource.org/projects/phobos/changeset/2363
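A minimal sketch of what "checked cast" looks like from the caller's side, assuming a current Phobos in which a narrowing to!char throws a ConvException (or a subclass of it) when the value does not fit. fitsInChar is a hypothetical helper, not part of std.conv.

```d
import std.conv : to, ConvException;
import std.exception : collectException;

/// True if to!char accepts the given code point without throwing.
/// Hypothetical helper for illustration only.
bool fitsInChar(dchar c)
{
    return collectException!ConvException(to!char(c)) is null;
}

void main()
{
    // The conversions from the original test case now compile and succeed.
    dchar from0 = 'A';
    char to0 = to!char(from0);
    wchar from1 = 'A';
    char to1 = to!char(from1);
    assert(to0 == 'A' && to1 == 'A');

    // A code point that cannot be stored in a char is rejected at run time.
    assert(fitsInChar('A'));
    assert(!fitsInChar('\u20AC'));
}
```

This is the second of the two options discussed in Comment #3, so whether a given call succeeds depends on the value at run time.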
