January 29, 2012
> char is UTF-8 by definition, and D code is free to assume
> that that's the case.
> A lot of the string processing code in Phobos will throw if
> you give it ill-formed unicode.
> 
> Now, you can put whatever you want in a char, but don't
> expect other D code to handle it correctly.
> 
> The only support in Phobos for dealing with alternate
> encodings is std.encoding. It currently supports "UTF-8,
> UTF-16, UTF-32, ASCII, ISO-8859-1 (also known as LATIN-1),
> and WINDOWS-1252." So, if you can get that to do the
> conversions that you want, then there you go, but otherwise
> you're on your own.
> 
> Regardless, you need to convert your chars to proper UTF-8
> if you want other D code (and especially Phobos) to handle
> them correctly.

Yeah, and I'm finding that, more often than not, what is breaking the Unicode is likely duplicates and errors in the source file (which is at least 10 years old, too).

Since the bad formatting is sparse and rarely gets in the way, I've tried making a custom compare function that uses the Phobos code but catches the exception when the UTF is ill-formed, then converts the text and tries the compare again. The source format doesn't mark everything that is text as such; rather, each field has to be interpreted in context when it is needed, so converting everything to proper Unicode up front would be wasted work 75%-95% of the time.
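For what it's worth, here is a rough sketch of the kind of compare function I mean. It is not my actual code: the names `latin1ToUtf8`, `repair` and `lenientCmp` are made up for illustration, and it assumes the ill-formed strings are really Latin-1 (ISO-8859-1), which may not hold for other source files. It calls `std.utf.validate` explicitly, so the bad input is detected whether or not the comparison itself decodes the strings.

```d
import std.algorithm.comparison : cmp;
import std.utf : UTFException, validate;

// Assumption: ill-formed input is really Latin-1 (ISO-8859-1), where each
// byte maps directly to the code point with the same value.
string latin1ToUtf8(const(char)[] raw)
{
    string result;
    foreach (ubyte b; cast(const(ubyte)[]) raw)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}

// Hypothetical helper: return s unchanged if it is valid UTF-8,
// otherwise reinterpret its bytes as Latin-1.
string repair(string s)
{
    try
    {
        validate(s); // throws UTFException on ill-formed UTF-8
        return s;
    }
    catch (UTFException)
    {
        return latin1ToUtf8(s);
    }
}

// Compare two strings, converting only when validation fails, so the
// common well-formed case pays for no conversion at all.
int lenientCmp(string a, string b)
{
    try
    {
        validate(a);
        validate(b);
        return cmp(a, b);
    }
    catch (UTFException)
    {
        return cmp(repair(a), repair(b));
    }
}
```

With this, a string holding the raw bytes `c a f 0xE9` compares equal to the properly encoded "café", and nothing is converted unless validation actually fails.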