May 25, 2005
When a UCS index of zero is supplied to the 'toUTFindex' function, and the supplied string does not have a valid UTF-8 sequence at offset zero, the function fails to throw an exception. Instead, it returns zero, implying that the supplied string is valid up to that point.

This bug may exist in other similar functions too.

The following code illustrates the issue.

<code>
import std.utf;
import std.stdio;

void main()
{
   char[] B;

   B = "\xFF\xFF\xFF"; // Not a valid UTF-8 string

   writefln("Index 0=%d", std.utf.toUTFindex(B, 0)); // should fail
   writefln("Index 1=%d", std.utf.toUTFindex(B, 1)); // does fail
}
</code>

Suggested fix:
<code>
size_t toUTFindex(char[] s, size_t n)
{
    size_t i;
    size_t r;

    do
    {
        if (i >= s.length)
            throw new UtfError("3invalid UCS index", i);
        size_t j = std.utf.UTF8stride[s[i]];
        if (j == 0xFF)
            throw new UtfError("3invalid UTF-8 sequence", i);
        r = i;
        i += j;
    } while(n--);

    return r;
}
</code>


Also, I note that the UTF8stride table has entries for 5 and 6 byte sequences. I was under the impression that these are no longer valid UTF-8 sequences.
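
For reference, RFC 3629 restricts UTF-8 to sequences of at most four bytes (code points up to U+10FFFF), so lead bytes 0xF8 through 0xFD, which used to introduce 5- and 6-byte forms, are no longer valid. Below is a minimal sketch of a stride lookup under that rule, written as a function rather than a table. The name `strictStride` is hypothetical and not part of std.utf; the 0xFF "invalid" marker simply mirrors the convention used in the suggested fix above.

```d
import std.stdio;

// Hypothetical helper, not part of std.utf: number of bytes in the UTF-8
// sequence introduced by lead byte b, or 0xFF if b cannot start a sequence.
size_t strictStride(ubyte b)
{
    if (b < 0x80) return 1;                // ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2;  // 0xC0 and 0xC1 would be overlong
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;  // leads above 0xF4 exceed U+10FFFF
    return 0xFF;                           // continuation byte or invalid lead
}

void main()
{
    writefln("0x41 -> %d", strictStride(0x41));
    writefln("0xF8 -> %d", strictStride(0xF8));
}
```

A lead byte alone does not prove the whole sequence is well formed; the continuation bytes would still need checking.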

-- 
Derek
Melbourne, Australia
25/05/2005 11:57:02 AM
May 25, 2005
> Also, I note that the UTF8stride table has entries for 5 and 6 byte
> sequences. I was under the impression that these are no longer valid UTF-8 sequences.

I have already changed that in my modified std.utf module (posted a few days ago). The toUtfX() functions were also changed to reject any invalid encodings. Regrettably, I have not heard anything about it. I don't know whether Walter will include the changed code in Phobos (I don't think so...).

As I said in that posting, I would also rework the other functions in std.utf. But I am not sure what to do about toUCSindex()/toUTFindex(): they are very inefficient if used the wrong way...
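
To illustrate what "the wrong way" presumably means: every index lookup has to scan from the start of the string, so calling it once per character in a loop costs quadratic time overall, whereas one left-to-right pass that advances by the stride is linear. Here is a rough sketch of the single-pass approach. It assumes well-formed input, and `leadStride` and `charOffsets` are hypothetical names standing in for the library's table and for whatever the caller actually needs.

```d
import std.stdio;

// Hypothetical stand-in for the stride table; assumes c is a valid lead byte.
size_t leadStride(char c)
{
    if (c < 0x80) return 1;
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    return 4;
}

// Byte offset of every character, collected in one left-to-right pass
// instead of one scan-from-the-start lookup per character.
size_t[] charOffsets(in char[] s)
{
    size_t[] offsets;
    size_t i = 0;
    while (i < s.length)
    {
        offsets ~= i;
        i += leadStride(s[i]);
    }
    return offsets;
}

void main()
{
    // The e-acute occupies two bytes, so the offsets skip from 1 to 3.
    foreach (off; charOffsets("h\u00E9llo"))
        writefln("%d", off);
}
```

Code that only needs to walk the string can keep the running offset itself and never call the index functions at all.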

Ciao
uwe