July 29, 2004 Re: UTF-8 to dchar conversion
Posted in reply to Walter

In article <cebj7l$1mro$1@digitaldaemon.com>, Walter says...
>
>Does your version also reject UTF-8 sequences that produce the correct value, but are not the shortest possible sequence?

Theoretically, yes. Two-byte sequences starting with 0xC0 and 0xD0 are caught by the relevant zero entries in the LENGTH table (at offsets 0x40 and 0x41); overlong three- and four-byte sequences are ruled out by the test:

#    if ((firstChar != 0xE0 || (s[1] & 0xE0) != 0x80) &&
#        (firstChar != 0xF0 || (s[1] & 0xF0) != 0x80))

and overlong sequences of five or more bytes (indeed, /all/ sequences of five or more bytes) are ruled out, again, by zeroes in the LENGTH table (at offsets 0x78 to 0x7F).

I have to confess, though, I have not tested this. I wrote it and posted it without testing it, which is bad form, I know, but it's the first D I've written since the funeral and I'm just getting back into practice. I figured you wouldn't want to use it as-is anyway, because you'll want all that delegate stuff with get() and put() instead of just assuming everyone wants a string.

That said, I can't /see/ any bugs in it, and it's quite short, so there are not many places for them to hide. (So, if you use this, or a variant of it, keep the unit tests in.)

If you want UTF conversion to /really/ zip along, you could consider dropping to assembler. Just a thought.

Jill
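For readers following along, the shortest-form checks described above can be sketched as a standalone D function. This is not the code from the original post (which relies on a LENGTH table not shown here); the name `isOverlongOrBadLead` and its signature are hypothetical, and the two-byte rule is taken from the UTF-8 definition itself, in which 0xC0 and 0xC1 are the lead bytes that can only begin overlong two-byte forms.

```d
// Hypothetical helper, written for illustration only.
// Returns true if the sequence starting at s[0] is an overlong
// (non-shortest) encoding, or begins a five-or-more-byte sequence.
bool isOverlongOrBadLead(const(ubyte)[] s)
{
    if (s.length == 0)
        return false;
    immutable ubyte first = s[0];

    // 0xC0 and 0xC1 can only encode U+0000..U+007F in two bytes.
    if (first == 0xC0 || first == 0xC1)
        return true;

    // 0xF8..0xFF would begin sequences of five or more bytes.
    if (first >= 0xF8)
        return true;

    if (s.length < 2)
        return false;

    // 0xE0 followed by 0x80..0x9F encodes a value below U+0800.
    if (first == 0xE0 && (s[1] & 0xE0) == 0x80)
        return true;

    // 0xF0 followed by 0x80..0x8F encodes a value below U+10000.
    if (first == 0xF0 && (s[1] & 0xF0) == 0x80)
        return true;

    return false;
}
```

The sketch only looks at the lead byte and the first trail byte; it assumes a full decoder would separately verify sequence length and that every trail byte has the form 10xxxxxx.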
July 29, 2004 Re: UTF-8 to dchar conversion
Posted in reply to Arcane Jill

In article <cebljb$1nu9$1@digitaldaemon.com>, Arcane Jill says...

Textual typo correction:

>(at offsets 0x40 and 0x41);

should read

>(at offsets 0x40 and 0x50);
July 29, 2004 Re: UTF-8 to dchar conversion
Posted in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cebljb$1nu9$1@digitaldaemon.com...
> I have to confess, though, I have not tested this.

It would be nice to have a comprehensive set of test data for these things. Are there any on the UTF sites you look at?