Thread overview
ASCII to UTF conversion?
Nov 29, 2005
Oskar Linde
Nov 29, 2005
Walter Bright
November 29, 2005
Maybe I missed something in the D Docs, but is there a way to convert from ASCII to UTF?  Sometimes problems arise when dealing with non-UTF-aware functions (like those in some libraries), when they return ASCII strings that have characters above 0x7F.  All it ends me up with is heartache and "4Invalid UTF-8 Sequence" exceptions.

So is there a standard function for doing this, or would I just be better off looping through the string and replacing any above-0x7F characters with underscores or something?


November 29, 2005
Jarrett Billingsley wrote:

> Maybe I missed something in the D Docs, but is there a way to convert from ASCII to UTF?  Sometimes problems arise when dealing with non-UTF-aware functions (like those in some libraries), when they return ASCII strings that have characters above 0x7F.  All it ends me up with is heartache and "4Invalid UTF-8 Sequence" exceptions.

You need to find out which encoding your non-UTF functions return.
Hint: it's not ASCII, as that is a 7-bit encoding compatible with UTF-8.

> So is there a standard function for doing this, or would I just be better off looping through the string and replacing any above-0x7F characters with underscores or something? 

There are no functions in Phobos (as far as I know), but libiconv works.
See: http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs ("8 bit enc.")

--anders
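
For illustration, the fallback from the original question (checking for bytes above 0x7F and replacing them with underscores) could look roughly like the sketch below. It assumes a current D compiler; the names `isPureAscii` and `replaceNonAscii` are made up for this example and are not part of Phobos. Note that this throws information away, so it is only a last resort when the source encoding is unknown.

```d
/// True if every byte is in the 7-bit ASCII range (0x00 .. 0x7F).
/// Such a buffer is already valid UTF-8 and can simply be cast.
bool isPureAscii(const(ubyte)[] bytes)
{
    foreach (b; bytes)
        if (b > 0x7F)
            return false;
    return true;
}

/// Lossy fallback: copy the bytes, replacing anything above 0x7F
/// with a placeholder character (hypothetical helper).
char[] replaceNonAscii(const(ubyte)[] bytes, char placeholder = '_')
{
    auto ret = new char[bytes.length];
    foreach (i, b; bytes)
        ret[i] = (b > 0x7F) ? placeholder : cast(char) b;
    return ret;
}
```

The result contains only ASCII, so it is valid UTF-8 and will not trigger the invalid-sequence exception.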
November 29, 2005
Jarrett Billingsley wrote:
> Maybe I missed something in the D Docs, but is there a way to convert from ASCII to UTF?  Sometimes problems arise when dealing with non-UTF-aware functions (like those in some libraries), when they return ASCII strings that have characters above 0x7F.  All it ends me up with is heartache and "4Invalid UTF-8 Sequence" exceptions.
> 
> So is there a standard function for doing this, or would I just be better off looping through the string and replacing any above-0x7F characters with underscores or something? 

ASCII to UTF-8 is simple:

# char[] ascii2utf(ubyte[] ascii) { return cast(char[]) ascii; }

But since you mention characters above 0x7F, I assume you mean something other than ASCII...

Here is a simple Latin-1 to UTF-16 converter:

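# // Each Latin-1 byte equals its Unicode code point, so it widens
# // directly to a single UTF-16 code unit.
# wchar[] latin12utf16(ubyte[] latin1) {
# 	wchar[] ret;
# 	ret.length = latin1.length;
# 	foreach(i, ubyte b; latin1)	// index type is inferred
# 		ret[i] = cast(wchar) b;
# 	return ret;
# }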

(Disclaimer: no code is tested.)

For 8-bit character sets other than Latin-1 (ISO 8859-1), you will need a library to supply the mapping. (Unicode's lowest 256 code points map 1:1 to Latin-1, which is why a plain cast is enough for the converter above.)
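
Since the original question was about getting UTF-8 (`char[]`) rather than UTF-16, here is a comparable sketch for Latin-1 to UTF-8. It assumes a current D compiler and that the input really is ISO 8859-1; the function name `latin1ToUtf8` is invented for this example. Bytes below 0x80 are copied unchanged, and every byte from 0x80 to 0xFF becomes a two-byte UTF-8 sequence.

```d
/// Convert ISO 8859-1 (Latin-1) bytes to UTF-8.
/// Hypothetical helper, not a Phobos function.
char[] latin1ToUtf8(const(ubyte)[] latin1)
{
    char[] ret;
    ret.reserve(latin1.length); // at least one output byte per input byte
    foreach (b; latin1)
    {
        if (b < 0x80)
        {
            // ASCII range: identical in Latin-1 and UTF-8.
            ret ~= cast(char) b;
        }
        else
        {
            // Code points U+0080 .. U+00FF encode as two bytes:
            // 110xxxxx 10xxxxxx
            ret ~= cast(char)(0xC0 | (b >> 6));
            ret ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return ret;
}
```

For example, the Latin-1 byte 0xE9 (é) comes out as the two bytes 0xC3 0xA9.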

/Oskar
November 29, 2005
"Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote in message news:dmgmc4$hed$1@digitaldaemon.com...
> Maybe I missed something in the D Docs, but is there a way to convert from ASCII to UTF?  Sometimes problems arise when dealing with non-UTF-aware functions (like those in some libraries), when they return ASCII strings that have characters above 0x7F.  All it ends me up with is heartache and "4Invalid UTF-8 Sequence" exceptions.
>
> So is there a standard function for doing this, or would I just be better off looping through the string and replacing any above-0x7F characters with underscores or something?

You can try the functions in std.charset.


November 29, 2005
"Jarrett Billingsley" <kb3ctd2@yahoo.com> wrote in message news:dmgmc4$hed$1@digitaldaemon.com...

Thanks for the replies!  Walter's suggestion is what I was looking for - totally missed those functions.

And yes, I suppose I meant "Latin 1."  I didn't realize that the formal definition of ASCII was still so strict as to mean just the characters between 0x0 and 0x7F; for me, characters between 0x0 and 0xFF have always been "ASCII."  I guess that's what happens when you only have five years of programming experience.