April 08, 2012
For most of the string processing I do, I read/write text in UTF-8 and convert it to UTF-32 for processing (with std.utf), so I don't have to worry about encoding. Is this a good or bad paradigm? Is there a better way to do this? What method do all of you use?
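Concretely, the pattern I mean is roughly this (just a minimal sketch; the file names are placeholders, and the slicing stands in for whatever processing I actually do):

import std.file : readText, write;
import std.utf : toUTF32, toUTF8;

void main()
{
    // readText gives the file's contents as a UTF-8 string.
    string utf8 = readText("input.txt");        // placeholder file name

    // Transcode to UTF-32 so each element is one dchar (code point);
    // indexing and slicing then work per character.
    dstring text = toUTF32(utf8);
    if (text.length >= 3)
    {
        dchar first = text[0];                  // O(1) code-point access
        dstring head = text[0 .. 3];            // slice by code points
    }

    // Transcode back to UTF-8 before writing it out.
    write("output.txt", toUTF8(text));          // placeholder file name
}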

Just curious, NMS
April 08, 2012
On Sunday, April 08, 2012 23:36:23 Nathan M. Swan wrote:
> For most of the string processing I do, I read/write text in UTF-8 and convert it to UTF-32 for processing (with std.utf), so I don't have to worry about encoding. Is this a good or bad paradigm? Is there a better way to do this? What method do all of you use?
> 
> Just curious, NMS

It depends on what you're doing. Depending on the functions you use and your memory requirements, either UTF-8 or UTF-32 may be faster. UTF-32 has the advantage of being a random-access range, which lets it work with a number of functions that UTF-8 won't work with. But UTF-32 also takes considerably more memory (up to four times as much if most of your characters are ASCII), which can be a problem.
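For instance (a minimal sketch using the range traits from std.range), Phobos treats a UTF-8 string as a range of dchar that has to be decoded as you go, whereas a dstring really is random-access:

import std.range : isRandomAccessRange;

// string (UTF-8) must be decoded element by element, so std.range does
// not consider it a random-access range; dstring (UTF-32) is one.
static assert(!isRandomAccessRange!string);
static assert( isRandomAccessRange!dstring);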

I think the most common approach is to just operate on UTF-8 unless another encoding is required (e.g. UTF-32 because you need random access). And in plenty of cases, you end up operating on generic ranges of dchar anyway if you use range-based functions on strings and don't call std.array.array on the result.
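For example (a minimal sketch; filter and isAlpha just stand in for whatever processing you're doing), a range-based function on a string gives you a lazy, generic range of dchar, and std.array.array is what turns it back into an array:

import std.algorithm : filter;
import std.array : array;
import std.conv : to;
import std.uni : isAlpha;

void main()
{
    string s = "héllo, wörld!";

    // filter decodes the UTF-8 lazily and yields a generic range of
    // dchar, not another string.
    auto letters = filter!isAlpha(s);

    // std.array.array copies that range eagerly into a dchar[] (UTF-32);
    // to!string transcodes back to UTF-8 if that's what you want.
    dchar[] asUtf32 = array(letters);
    string asUtf8 = to!string(asUtf32);
    assert(asUtf8 == "héllowörld");
}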

You're going to have to profile your code to see whether primarily using UTF-8 or UTF-32 in your string processing is more efficient.

- Jonathan M Davis