November 29, 2005 ASCII to UTF conversion?

Maybe I missed something in the D Docs, but is there a way to convert from ASCII to UTF? Sometimes problems arise when dealing with non-UTF-aware functions (like those in some libraries), when they return ASCII strings that have characters above 0x7F. All it ends me up with is heartache and "4Invalid UTF-8 Sequence" exceptions.

So is there a standard function for doing this, or would I just be better off looping through the string and replacing any above-0x7F characters with underscores or something?

November 29, 2005 Re: ASCII to UTF conversion?

Posted in reply to Jarrett Billingsley

Jarrett Billingsley wrote:
> Maybe I missed something in the D Docs, but is there a way to convert from ASCII to UTF? Sometimes problems arise when dealing with non-UTF-aware functions (like those in some libraries), when they return ASCII strings that have characters above 0x7F. All it ends me up with is heartache and "4Invalid UTF-8 Sequence" exceptions.

You need to find out which encoding your non-UTF functions return. Hint: it's not ASCII, as that is a 7-bit encoding compatible with UTF-8.

> So is there a standard function for doing this, or would I just be better off looping through the string and replacing any above-0x7F characters with underscores or something?

There are no functions in Phobos (as far as I know), but libiconv works. See: http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs ("8 bit enc.")

--anders

November 29, 2005 Re: ASCII to UTF conversion?

Posted in reply to Jarrett Billingsley

Jarrett Billingsley wrote:
> Maybe I missed something in the D Docs, but is there a way to convert from ASCII to UTF? Sometimes problems arise when dealing with non-UTF-aware functions (like those in some libraries), when they return ASCII strings that have characters above 0x7F. All it ends me up with is heartache and "4Invalid UTF-8 Sequence" exceptions.
>
> So is there a standard function for doing this, or would I just be better off looping through the string and replacing any above-0x7F characters with underscores or something?
ASCII to UTF-8 is simple:
# char[] ascii2utf(ubyte[] ascii) { return cast(char[]) ascii; }
But since you mention characters above 0x7F, I assume you mean something other than ASCII...
Here is a simple Latin-1 to UTF-16 converter:
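# wchar[] latin12utf16(ubyte[] latin1) {
#     // Each Latin-1 byte equals its Unicode code point, so widening it to wchar is enough.
#     wchar[] ret;
#     ret.length = latin1.length;
#     foreach (i, b; latin1)  // index type is inferred; an int index is rejected on 64-bit targets
#         ret[i] = cast(wchar) b;
#     return ret;
# }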
(Disclaimer: no code is tested.)
For 8-bit character sets other than Latin-1 (ISO 8859-1), you will need a library to supply the mapping. (Unicode's first 256 code points map 1:1 to Latin-1.)
/Oskar
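
A note on the UTF-8 case, since the original question was about "Invalid UTF-8 Sequence" errors rather than UTF-16: the same 1:1 mapping can be taken one step further by hand. The following is only a minimal sketch, assuming a present-day D2 compiler and Latin-1 input; the name latin1ToUtf8 is hypothetical and is not a Phobos function. Bytes below 0x80 are copied unchanged, and each byte from 0x80 to 0xFF becomes the two-byte UTF-8 sequence for the code point of the same value.

```d
/// Hypothetical helper (not part of Phobos): converts ISO 8859-1 (Latin-1)
/// bytes to a UTF-8 string. Assumes the input really is Latin-1.
string latin1ToUtf8(const(ubyte)[] latin1) pure
{
    char[] result;
    result.reserve(latin1.length);
    foreach (b; latin1)
    {
        if (b < 0x80)
        {
            // ASCII range: identical in Latin-1 and UTF-8.
            result ~= cast(char) b;
        }
        else
        {
            // Code points U+0080..U+00FF need two bytes: 110xxxxx 10xxxxxx.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}
```

For example, the Latin-1 bytes 0x41 0xE9 ("Aé") come out as the UTF-8 bytes 0x41 0xC3 0xA9. For any other 8-bit character set this shortcut does not apply, and a mapping table or a library is still needed.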

November 29, 2005 Re: ASCII to UTF conversion?

Posted in reply to Jarrett Billingsley

"Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote in message news:dmgmc4$hed$1@digitaldaemon.com...
> Maybe I missed something in the D Docs, but is there a way to convert from ASCII to UTF? Sometimes problems arise when dealing with non-UTF-aware functions (like those in some libraries), when they return ASCII strings that have characters above 0x7F. All it ends me up with is heartache and "4Invalid UTF-8 Sequence" exceptions.
>
> So is there a standard function for doing this, or would I just be better off looping through the string and replacing any above-0x7F characters with underscores or something?

You can try the functions in std.charset.

November 29, 2005 Re: ASCII to UTF conversion?

Posted in reply to Jarrett Billingsley

"Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote in message news:dmgmc4$hed$1@digitaldaemon.com...

Thanks for the replies! Walter's suggestion is what I was looking for - totally missed those functions.

And yes, I suppose I meant "Latin 1." I didn't realize that the formal definition of ASCII was still so strict as to mean just the characters between 0x0 and 0x7F; for me, characters between 0x0 and 0xFF have always been "ASCII." I guess that's what happens when you only have five years of programming experience.