January 29, 2012
On Saturday, January 28, 2012 20:54:30 Era Scarecrow wrote:
>  Is there any support for the extended ASCII characters? (128-255). I
> understand unicode is important, however working with some data and
> programs that don't support those, I am getting a problem that the program
> causes an exception because it isn't valid utf-8. Do I have to handle it
> all as bytes/ubytes? If I do then I lose out on many char specific
> functions. Alternatively I can rely on the C functions, but I want to avoid
> using them if I can.
> 
> Example: note the raw data below, being 39 vs -110
> 
> this._ID = "SPEL_wulfharth's cups"
> rhs._ID  = "SPEL_wulfharth▒s cups"
> 
> this._ID = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104, 39, 115, 32, 99, 117, 112, 115, 0]
> rhs._ID  = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104, -110, 115, 32, 99, 117, 112, 115, 0]
> 
> 
> I have compiled and made a table for the appropriate conversions to proper Unicode, which you can then use in reverse to get it back to its previous state. However, I'm not sure.
> 
> //referenced from http://ascii-table.com/ascii-extended-pc-list.php
> wchar[128] convertAsciiExtended = [
> 	0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
> 	0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
> 	0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
> 	0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
> 	0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
> 	0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
> 	0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
> 	0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
> 	0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
> 	0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
> 	0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
> 	0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
> 	0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
> 	0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
> 	0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
> 	0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0];

char is UTF-8 by definition, and D code is free to assume that that's the case. A lot of the string processing code in Phobos will throw if you give it ill-formed Unicode.
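
To see that failure in isolation, here is a minimal sketch using std.utf.validate, which throws a UTFException on ill-formed input. The helper name isValidUtf8 is made up for illustration and is not part of Phobos. The byte 0x92 is the -110 from the dump above, read as an unsigned value.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper (not in Phobos): reports whether the bytes
/// form well-formed UTF-8 by letting std.utf.validate do the check.
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        // The cast only reinterprets the bytes; validate does the checking.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // 0x92 is a lone continuation byte, so it cannot appear on its own in UTF-8.
    ubyte[] bad = [104, 0x92, 115];
    assert(!isValidUtf8(bad));

    // Plain ASCII is always valid UTF-8.
    assert(isValidUtf8(cast(const(ubyte)[]) "wulfharth's cups"));
}
```

Checking raw input once at the boundary like this tells you whether it needs converting before it reaches any string functions.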

Now, you can put whatever you want in a char, but don't expect other D code to handle it correctly.

The only support in Phobos for dealing with alternate encodings is std.encoding. It currently supports "UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1 (also known as LATIN-1), and WINDOWS-1252." So, if you can get that to do the conversions that you want, then there you go, but otherwise you're on your own.
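
As a rough sketch of what that could look like, assuming the data is actually WINDOWS-1252 (an assumption; the original post doesn't say which code page the file uses), std.encoding.transcode can turn the raw bytes into a normal string. The wrapper name win1252ToUtf8 is hypothetical.

```d
import std.encoding : transcode, Windows1252String;

/// Hypothetical wrapper: treats the bytes as WINDOWS-1252 and returns UTF-8.
/// The code page is an assumption; pick the one your data really uses.
string win1252ToUtf8(immutable(ubyte)[] raw)
{
    string utf8;
    // The cast only relabels the bytes; transcode performs the conversion.
    transcode(cast(Windows1252String) raw, utf8);
    return utf8;
}

void main()
{
    // "h", byte 0x92 (-110 as a signed byte), "s"
    immutable(ubyte)[] raw = [104, 0x92, 115];

    // In WINDOWS-1252, 0x92 is the right single quotation mark, U+2019.
    assert(win1252ToUtf8(raw) == "h\u2019s");
}
```

Going the other way (string to Windows1252String) uses the same transcode call with the arguments swapped, so you can write the data back in its original encoding.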

Regardless, you need to convert your chars to proper UTF-8 if you want other D code (and especially Phobos) to handle them correctly.
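
If std.encoding doesn't cover your code page, a table like the one you posted can do the job by hand. The following is only a sketch: decodeWithTable is a hypothetical name, and the table in main is a stand-in with a single real entry rather than your full 128-entry table. Note that your table follows the PC (code page 437) layout, where 0x92 maps to U+00C6, whereas WINDOWS-1252 maps 0x92 to U+2019, so which table is right depends on where the data came from.

```d
/// Hypothetical helper: bytes below 128 pass through as ASCII, bytes
/// 128-255 are looked up in a caller-supplied 128-entry table.
/// Assumes every table entry is a single UTF-16 code unit (no surrogates).
string decodeWithTable(const(ubyte)[] raw, const(wchar)[] highTable)
{
    assert(highTable.length == 128, "table must cover bytes 128-255");

    string result;
    foreach (b; raw)
    {
        dchar c = b < 0x80 ? cast(dchar) b : cast(dchar) highTable[b - 0x80];
        result ~= c; // appending a dchar to a string encodes it as UTF-8
    }
    return result;
}

void main()
{
    // Stand-in table: every entry is '?' except 0x92, which is set to
    // U+00C6 to match the code page 437 layout.
    wchar[128] table = '?';
    table[0x92 - 0x80] = 0x00C6;

    ubyte[] raw = [104, 0x92, 115];
    assert(decodeWithTable(raw, table[]) == "h\u00C6s");
}
```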

- Jonathan M Davis