July 12, 2004 [Fix] std.utf bad conversions from UTF-16
Using DMD 0.95, Windows 98SE.
I've just been experimenting with std.utf. Two separate bugs cropped up, both in functions that aren't covered by unit tests:
1. toUTF32(wchar[]) runs into an infinite loop when it encounters a non-ASCII single-word character. The problem is in decode - a missing else block means that the counter doesn't get incremented.
2. toUTF8(wchar[]) also tends to fail. The problem is that each wchar is cast to a dchar, one by one, instead of decoding the UTF-16 string.
The fixed functions are below.
Stewart.
----------
dchar decode(wchar[] s, inout size_t idx)
    in
    {
        assert(idx >= 0 && idx < s.length);
    }
    out (result)
    {
        assert(isValidDchar(result));
    }
    body
    {
        char[] msg;
        dchar V;
        size_t i = idx;
        uint u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF)
        {
            uint u2;

            if (i + 1 == s.length)
            {
                msg = "surrogate UTF-16 high value past end of string";
                goto Lerr;
            }
            u2 = s[i + 1];
            if (u2 < 0xDC00 || u2 > 0xDFFF)
            {
                msg = "surrogate UTF-16 low value out of range";
                goto Lerr;
            }
            u = ((u - 0xD7C0) << 10) + (u2 - 0xDC00);
            i += 2;
        }
        else if (u >= 0xDC00 && u <= 0xDFFF)
        {
            msg = "unpaired surrogate UTF-16 value";
            goto Lerr;
        }
        else if (u == 0xFFFE || u == 0xFFFF)
        {
            msg = "illegal UTF-16 value";
            goto Lerr;
        }
        // default: single-word character (0x0000 to 0xD7FF, 0xE000 to 0xFFFD)
        // SG fixed bug - previous if (u <= 0x7F) becomes redundant
        else
        {
            i++;
        }

        idx = i;
        return cast(dchar)u;

      Lerr:
        throw new UtfError(msg, i);
    }

char[] toUTF8(wchar[] s)
{
    char[] r;

    for (size_t i = 0; i < s.length; )
    {
        encode(r, decode(s, i));
    }
    return r;
}
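
The surrogate arithmetic in decode can be checked in isolation. The following is only a sketch: combineSurrogates is a hypothetical helper name, not part of std.utf, and the code is written to compile with a current D compiler rather than DMD 0.95. It also shows why casting each wchar to a dchar one by one, as the old toUTF8(wchar[]) did, cannot work for characters outside the first 64K.

```d
// Hypothetical helper, not part of std.utf: combines a surrogate pair
// using the same arithmetic as the decode function above.
dchar combineSurrogates(wchar hi, wchar lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF); // high surrogate range
    assert(lo >= 0xDC00 && lo <= 0xDFFF); // low surrogate range
    uint u = ((cast(uint) hi - 0xD7C0) << 10) + (cast(uint) lo - 0xDC00);
    return cast(dchar) u;
}

void main()
{
    // U+1D11E is stored in UTF-16 as the pair D834 DD1E.
    wchar hi = cast(wchar) 0xD834;
    wchar lo = cast(wchar) 0xDD1E;
    assert(combineSurrogates(hi, lo) == 0x1D11E);

    // Casting each code unit on its own gives back the two surrogate
    // values, neither of which is a character by itself.
    assert(cast(dchar) hi == 0xD834);
    assert(cast(dchar) lo == 0xDD1E);
}
```

Subtracting 0xD7C0 rather than 0xD800 folds the "+ 0x10000" offset into the shift, since (0xD800 - 0xD7C0) << 10 is exactly 0x10000.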
--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
July 12, 2004 Re: [Fix] std.utf bad conversions from UTF-16
Posted in reply to Stewart Gordon
In article <ccto20$18bq$1@digitaldaemon.com>, Stewart Gordon says...

Cool. Excellent. Like it. Apart from these lines...

> else if (u == 0xFFFE || u == 0xFFFF)
> {
>     msg = "illegal UTF-16 value";
>     goto Lerr;
> }

The values U+FFFE and U+FFFF are not illegal either in UTF-16 or UTF-32. They are permanently unassigned characters, that's all. There are in total 64 Unicode characters which have this property, of which U+FFFE and U+FFFF are but two examples.

Walter has now changed isValidDchar() to return true for U+FFFE and U+FFFF.

Arcane Jill
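
A validity check along the lines described in this reply might look like the sketch below. It is an illustration only: isValidCodePoint is a hypothetical name, this is not the actual std.utf.isValidDchar source, and it is written for a current D compiler.

```d
// Hypothetical check, for illustration only; this is not the actual
// std.utf.isValidDchar implementation.
bool isValidCodePoint(dchar c)
{
    // Surrogate values are reserved for UTF-16 pairs and never stand
    // for a character on their own.
    if (c >= 0xD800 && c <= 0xDFFF)
        return false;
    // Everything else up to U+10FFFF is accepted, including the
    // unassigned values U+FFFE and U+FFFF.
    return c <= 0x10FFFF;
}

void main()
{
    assert(isValidCodePoint(cast(dchar) 0xFFFE));
    assert(isValidCodePoint(cast(dchar) 0xFFFF));
    assert(!isValidCodePoint(cast(dchar) 0xD800));
    assert(!isValidCodePoint(cast(dchar) 0x110000));
}
```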