UTF-8 to dchar conversion (page 2)

July 29, 2004

Re: UTF-8 to dchar conversion

Posted by Arcane Jill
in reply to Walter

In article <cebj7l$1mro$1@digitaldaemon.com>, Walter says...
>
>Does your version also reject UTF-8 sequences that produce the correct value, but are not the shortest possible sequence?

Theoretically, yes. Two-byte sequences starting with 0xC0 and 0xC1 are caught by the relevant zero entries in the LENGTH table (at offsets 0x40 and 0x41); overlong three- and four-byte sequences are ruled out by the test:

#                if ((firstChar != 0xE0 || (s[1] & 0xE0) != 0x80) &&
#                    (firstChar != 0xF0 || (s[1] & 0xF0) != 0x80))

and overlong sequences of five or more bytes (indeed, /all/ sequences of five
or more bytes) are ruled out, again, by zeroes in the LENGTH table (at offsets
0x78 to 0x7F).
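
For readers without the earlier post to hand, here is a minimal sketch of how those three checks could fit together. It is not the code from the original post: the table contents, the helper names buildLengthTable and checkedLength, and the assumption that LENGTH is indexed by (first byte - 0x80) are all reconstructed from the offsets quoted above. It only covers the overlong and length checks described here; surrogates and values above 0x10FFFF would need separate tests.

```d
// Hypothetical reconstruction for illustration only; names and table
// layout are assumptions inferred from the offsets quoted in the post.

/// Builds a 128-entry table indexed by (firstByte - 0x80).
/// A zero entry means "this byte can never start a sequence".
ubyte[128] buildLengthTable()
{
    ubyte[128] t;                          // every entry starts at 0
    foreach (i; 0x42 .. 0x60) t[i] = 2;    // first bytes 0xC2..0xDF
    foreach (i; 0x60 .. 0x70) t[i] = 3;    // first bytes 0xE0..0xEF
    foreach (i; 0x70 .. 0x78) t[i] = 4;    // first bytes 0xF0..0xF7
    // Left at zero:
    //   0x00..0x3F  (0x80..0xBF, continuation bytes)
    //   0x40..0x41  (0xC0, 0xC1: always-overlong two-byte forms)
    //   0x78..0x7F  (0xF8..0xFF: five or more bytes)
    return t;
}

immutable ubyte[128] LENGTH = buildLengthTable();

/// Returns the length of the sequence at the start of s if it passes the
/// checks sketched here, or 0 if it must be rejected.
size_t checkedLength(const(ubyte)[] s)
{
    if (s.length == 0) return 0;
    immutable ubyte firstChar = s[0];
    if (firstChar < 0x80) return 1;                 // plain ASCII

    immutable size_t len = LENGTH[firstChar - 0x80];
    if (len == 0 || s.length < len) return 0;       // bad first byte or truncated

    // Every trailing byte must have the form 10xxxxxx.
    foreach (b; s[1 .. len])
        if ((b & 0xC0) != 0x80) return 0;

    // The test from the post: reject overlong three- and four-byte forms.
    if ((firstChar != 0xE0 || (s[1] & 0xE0) != 0x80) &&
        (firstChar != 0xF0 || (s[1] & 0xF0) != 0x80))
        return len;
    return 0;
}

unittest
{
    assert(checkedLength([0xC3, 0xA9]) == 2);              // U+00E9
    assert(checkedLength([0xE0, 0xA0, 0x80]) == 3);        // U+0800, shortest form
    assert(checkedLength([0xC0, 0x80]) == 0);              // overlong two-byte
    assert(checkedLength([0xE0, 0x9F, 0xBF]) == 0);        // overlong three-byte
    assert(checkedLength([0xF0, 0x8F, 0xBF, 0xBF]) == 0);  // overlong four-byte
}
```

Something like rdmd -unittest -main on the file should exercise the unittest block. Building the table with a function at compile time, rather than typing out 128 literals, is just a choice made here to keep the zero ranges visible in comments.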

I have to confess, though, I have not tested this. I wrote it and posted it without testing it, which is bad form, I know, but it's the first D I've written since the funeral and I'm just getting back into practice. I figured you wouldn't want to use it as-is anyway, because you'll want all that delegate stuff with get() and put() instead of just assuming everyone wants a string. That said, I can't /see/ any bugs in it, and it's quite short so there are not many places for them to hide. (So, if you use this, or a variant of it, keep the unit tests in.)

If you want UTF conversion to /really/ zip along, you could consider dropping to assembler. Just a thought.

Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cebljb$1nu9$1@digitaldaemon.com...
> I have to confess, though, I have not tested this.

It would be nice to have a comprehensive set of test data for these things. Are there any on the UTF sites you look at?
