Posted by Jonathan M Davis in reply to Marcin Kuszczak
http://d.puremagic.com/issues/show_bug.cgi?id=5016
Jonathan M Davis <jmdavisProg@gmx.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jmdavisProg@gmx.com
--- Comment #2 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-01-09 15:16:32 PST ---
char is explicitly defined to be a UTF-8 code unit. wchar is explicitly defined to be a UTF-16 code unit. dchar is explicitly defined to be a UTF-32 code unit. In UTF-8 and UTF-16, it can take multiple code units to make up a code point, whereas it always takes exactly one UTF-32 code unit to make a code point. A code point is what you would normally think of as a character. This is all standard Unicode stuff and getting rid of it would be foolish. It's used all over the place in computing, not just in D.
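As a rough sketch of the difference (the sample text is made up for illustration and is not from the bug report), the same three characters need a different number of code units in each encoding:

```d
import std.stdio : writeln;

void main()
{
    // 'a' (U+0061), 'é' (U+00E9) and a character outside the BMP (U+1F600).
    string  s8  = "a\u00E9\U0001F600";   // UTF-8 code units (char)
    wstring s16 = "a\u00E9\U0001F600"w;  // UTF-16 code units (wchar)
    dstring s32 = "a\u00E9\U0001F600"d;  // UTF-32 code units (dchar)

    // .length counts code units, not code points.
    writeln(s8.length);   // 7 = 1 + 2 + 4
    writeln(s16.length);  // 4 = 1 + 1 + 2
    writeln(s32.length);  // 3 = 1 + 1 + 1
}
```

Only the dstring length matches the number of code points.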
Part of the trick to dealing with char and wchar correctly is that if you wish to deal with code points / characters (_not_ code units), then _never_ deal with char and wchar individually. That's why most of std.string deals with entire strings at a time. If you want to deal with an individual character, you either use a dchar or one of the string types - e.g. 'a' as a dchar or "a" as a string type. You shouldn't be converting from dchar to char and vice versa (or between either of those and wchar). It really doesn't make sense. What makes sense is converting between string types.
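A minimal sketch of converting between string types, assuming std.conv.to is used for the transcoding (std.utf has toUTF8, toUTF16 and toUTF32 as well); the sample text is only an illustration:

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string s = "caf\u00E9";  // "café" as UTF-8

    // Transcode whole strings rather than casting individual code units.
    wstring w = s.to!wstring;
    dstring d = s.to!dstring;
    assert(w.to!string == s && d.to!string == s);  // both round trip

    // A single character: hold it as a dchar or as a one-character string.
    dchar  c   = '\u00E9';
    string one = "\u00E9";
    writeln(one.length);     // 2: one character, two UTF-8 code units
    writeln(d[$ - 1] == c);  // true: each dstring element is a whole code point
}
```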
On the whole, what D does works fantastically, but you need to understand the basics of Unicode. The best place to look would probably be The D Programming Language by Andrei Alexandrescu, since it applies directly to D, but there are plenty of places online to find info on Unicode, and you can look at the online docs on arrays for more info about them: http://is.gd/krYRH .
What it comes down to really is that you use whatever string type you need based on size - string, wstring, or dstring - or the need to be able to treat an individual array index as a character. If you need to be able to use random access on a string (including using it in algorithms in std.algorithm which require random access ranges), or if you need to be able to alter individual characters in place, then use dstring or dchar[] (dchar[] for the in-place case, since the elements of a dstring are immutable). Otherwise, save space and use either string or wstring (string would generally be better unless you're using primarily Asian characters, since they tend to take 3 bytes in UTF-8 and 2 in UTF-16).
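For example, a sketch of random access and in-place editing with dchar[] (the word used here is just an illustration):

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string s = "na\u00EFve";  // "naïve": 5 characters, 6 UTF-8 code units

    // s[2] would be only the first code unit of 'ï', not the character.
    dchar[] buf = s.to!dstring.dup;  // mutable, one element per code point

    assert(buf.length == 5);
    assert(buf[2] == '\u00EF');  // random access by character
    buf[2] = 'i';                // in-place edit
    writeln(buf);                // naive
}
```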
There are functions which specifically take a dchar, so you can pass those a single character, but most deal entirely in strings, even if what you really care about is an individual character. So, generally just treat individual characters as strings with one character.
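A small sketch of both styles, assuming a reasonably recent Phobos (std.uni.isAlpha, std.uni.toUpper and std.string.indexOf are picked here only as examples):

```d
import std.stdio : writeln;
import std.string : indexOf;
import std.uni : isAlpha, toUpper;

void main()
{
    // Functions that take a single dchar.
    dchar c = '\u00E9';              // 'é'
    assert(isAlpha(c));
    assert(toUpper(c) == '\u00C9');  // 'É'

    // Searching for a character given as a dchar or as a one-character string.
    // Both results are code unit offsets into s.
    string s = "h\u00E9llo";
    writeln(indexOf(s, c));         // 1
    writeln(indexOf(s, "\u00E9"));  // 1
    writeln(indexOf(s, 'l'));       // 3, because 'é' occupies offsets 1 and 2
}
```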
Take a look at the functions in std.utf: http://is.gd/krZLW . e.g. std.utf.count() can be used to tell you how many code points / characters there are in a string, and std.utf.stride() will tell you how many code units a particular character occupies so that you can index into a string or wstring if you have to.
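A short sketch of how those two functions can be used together (the sample string is just an illustration):

```d
import std.stdio : writeln;
import std.utf : count, stride;

void main()
{
    string s = "\u00E9t\u00E9";  // "été"

    assert(s.length == 5);  // code units
    assert(count(s) == 3);  // code points

    // Step through the string one code point at a time.
    size_t i = 0;
    while (i < s.length)
    {
        immutable n = stride(s, i);  // code units used by the code point at i
        writeln(s[i .. i + n]);      // slice holding exactly one character
        i += n;
    }
}
```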
When using foreach, make sure that you give the type as dchar. e.g.
    import std.stdio : writeln;  // needed for writeln
    string str = "hello world";
    foreach (dchar c; str)       // dchar: one code point per iteration
        writeln(c);
will print out each character individually, whereas using char (which is the default if you don't give a type) or wchar would print out the individual code units (which isn't generally very useful). foreach is smart enough to convert the string to the appropriate type on the fly while iterating over it, so if you give it dchar, it'll take each code point at a time instead of each code unit.
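The difference only shows up with non-ASCII text, so here is a sketch with a made-up sample string that counts the iterations for each element type:

```d
import std.stdio : writeln;

void main()
{
    string str = "gr\u00FC\u00DF";  // "grüß"

    size_t units, points;
    foreach (char c; str)  ++units;   // one iteration per UTF-8 code unit
    foreach (dchar c; str) ++points;  // one iteration per code point

    writeln(units);   // 6
    writeln(points);  // 4

    // With an index, i is the code unit offset at which c starts.
    foreach (i, dchar c; str)
        writeln(i, ": ", c);
}
```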
I'm sure that there are other things that would be useful to point out, but that's all that comes to mind at the moment. On the whole, the way D handles strings is fantastic. You just have to realize that you're dealing with UTF-8, UTF-16, and UTF-32 code units instead of code points when you have a char, wchar, or dchar respectively. dchar/UTF-32 is the only type where code units and code points are the same size.
There has been some talk of various improvements to how all of this works (like possibly making dchar the default type for foreach with string types), so some incremental improvements may be made to iron out some of the wrinkles, but strings in D are designed the way that they are on purpose, and it's not likely to be drastically changed. For the most part, the problem is not the design but rather understanding what the design is so that you can use it properly.
If you want to avoid the whole issue, then you can just use dstring everywhere, but that _will_ use about 4 times as much memory as string would if you're dealing primarily with ASCII characters.
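To put rough numbers on the space trade-off, here is a sketch that measures the character data of the same text in each string type. The helper name bytes and the sample strings are hypothetical, chosen only for this illustration:

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical helper: bytes used by the character data of any string type.
size_t bytes(S)(S s)
{
    return s.length * typeof(s[0]).sizeof;
}

void main()
{
    string ascii = "hello world";
    writeln(bytes(ascii));             // 11
    writeln(bytes(ascii.to!wstring));  // 22
    writeln(bytes(ascii.to!dstring));  // 44

    string cjk = "\u65E5\u672C\u8A9E";  // three CJK characters
    writeln(bytes(cjk));                // 9
    writeln(bytes(cjk.to!wstring));     // 6
    writeln(bytes(cjk.to!dstring));     // 12
}
```

For ASCII text, dstring costs 4 times as much as string, while for the CJK sample wstring is the smallest of the three.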