[Fix] std.utf bad conversions from UTF-16

July 12, 2004
Posted by Stewart Gordon
Using DMD 0.95, Windows 98SE.

I've just been experimenting with std.utf.  Two separate bugs cropped up, both in functions that have no unit tests:

1. toUTF32(wchar[]) goes into an infinite loop when it encounters a non-ASCII character that fits in a single UTF-16 word.  The problem is in decode: a missing else block means that the index never gets incremented for such a character.

2. toUTF8(wchar[]) also tends to fail.  The problem is that each wchar is cast to a dchar, one by one, instead of decoding the UTF-16 string, so a surrogate pair is never combined into a single character.
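
To make both points concrete, here is a minimal sketch of walking a UTF-16 buffer.  decodeUnits is a hypothetical helper written only for this illustration, not part of std.utf; it works on plain uint values, assumes well-formed input, and assumes a compiler that accepts array literals.  It shows that the index has to advance in the single-word branch too, and that a surrogate pair only becomes one character when both words are combined.

```d
// Hypothetical helper for illustration only (not part of std.utf).
// Assumes well-formed UTF-16: every high surrogate is followed by a low one.
uint[] decodeUnits(uint[] s)
{
    uint[] result;

    for (size_t i = 0; i < s.length; )
    {
        uint u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF)
        {
            // High surrogate: combine it with the following low surrogate.
            uint u2 = s[i + 1];
            u = ((u - 0xD7C0) << 10) + (u2 - 0xDC00);
            i += 2;
        }
        else
        {
            // Single-word character: the index must still advance,
            // otherwise this loop would never terminate.
            i++;
        }
        result ~= u;
    }
    return result;
}

void main()
{
    // U+00E9 (one word) followed by U+1D11E (surrogate pair D834 DD1E).
    uint[] s = [0x00E9, 0xD834, 0xDD1E];
    uint[] r = decodeUnits(s);

    assert(r.length == 2);      // three words, but only two characters
    assert(r[0] == 0x00E9);
    assert(r[1] == 0x1D11E);    // a per-word cast would give D834 and DD1E instead
}
```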


The fixed functions are below.

Stewart.

----------
dchar decode(wchar[] s, inout size_t idx)
in
{
	assert(idx >= 0 && idx < s.length);
}
out (result)
{
	assert(isValidDchar(result));
}
body
{
	char[] msg;
	dchar V;
	size_t i = idx;
	uint u = s[i];

	if (u >= 0xD800 && u <= 0xDBFF)
	{
		uint u2;

		if (i + 1 == s.length)
		{
			msg = "surrogate UTF-16 high value past end of string";
			goto Lerr;
		}
		u2 = s[i + 1];
		if (u2 < 0xDC00 || u2 > 0xDFFF)
		{
			msg = "surrogate UTF-16 low value out of range";
			goto Lerr;
		}
		u = ((u - 0xD7C0) << 10) + (u2 - 0xDC00);
		i += 2;
	}
	else if (u >= 0xDC00 && u <= 0xDFFF)
	{
		msg = "unpaired surrogate UTF-16 value";
		goto Lerr;
	}
	else if (u == 0xFFFE || u == 0xFFFF)
	{
		msg = "illegal UTF-16 value";
		goto Lerr;
	}
	//	default: single-word character (0x0000 to 0xD7FF, 0xE000 to 0xFFFD)
	//	SG fixed bug - previous if (u <= 0x7F) becomes redundant
	else
	{
		i++;
	}

	idx = i;
	return cast(dchar)u;

  Lerr:
	throw new UtfError(msg, i);
}


char[] toUTF8(wchar[] s)
{
	char[] r;

	for (size_t i = 0; i < s.length; )
	{
		encode(r, decode(s, i));
	}
	return r;
}
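
Since neither function had a unit test, something along these lines could catch both bugs.  This is only a sketch: it assumes a current D compiler and today's std.utf (wstring, the w and d literal suffixes, and the present toUTF8/toUTF32 signatures), which may differ from the DMD 0.95 library discussed above.

```d
import std.utf;

void main()
{
    // 'a' (ASCII), U+00E9 (non-ASCII, one word), U+1D11E (surrogate pair).
    wstring w = "a\u00E9\U0001D11E"w;
    assert(w.length == 4);

    // Bug 1: the non-ASCII single-word character must not hang the conversion.
    assert(toUTF32(w) == "a\u00E9\U0001D11E"d);

    // Bug 2: the surrogate pair must be decoded before it is re-encoded.
    assert(toUTF8(w) == "a\u00E9\U0001D11E");
}
```

The test string deliberately mixes one character from each branch of decode, so a regression in either the single-word path or the surrogate path shows up as a failed assertion or a hang.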

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment.  Please keep replies on the 'group where everyone may benefit.