May 18, 2005
> Normal compilation:
>    * safe function: 0.100 ms
>    * unsafe function: 0.088 ms (12% faster)
>
> Compilation -release -O:
>    * safe function: 0.050 ms
>    * unsafe function: 0.046 ms (8% faster)

Maybe I should add that if you convert a text that contains many 2- and 3-byte UTF8 encodings (Asian languages), the unsafe function saves more: about 20% in comparison to the safe function.

uwe
May 18, 2005
"Uwe Salomon" <post@uwesalomon.de> wrote in message news:op.sqy2lzec6yjbe6@sandmann.maerchenwald.net...
>> 1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked
>
> Yes, one of them sounds much better. I did not think long about fast_xxx()... Perhaps also toUtf8Unverified(), though regrettably that is very long.
>
>> I'm not actually sure how often it
>> would be ok to call such a function anyway so maybe it isn't even needed.
>> Getting the wrong answer quickly is not a good trade-off.
>
> You are right, that is an important point, especially for a standard library. Easy test: I converted a German email (mostly ASCII, some special characters) with 5000 characters from UTF8 to UTF16. I provided the buffer, because both functions are equally good at allocating memory.
>
> Normal compilation:
>   * safe function: 0.100 ms
>   * unsafe function: 0.088 ms (12% faster)
>
> Compilation -release -O:
>   * safe function: 0.050 ms
>   * unsafe function: 0.046 ms (8% faster)
>
> I am not sure how all this could benefit from an assembler implementation. Anyway, the speed gain is minimal (actually, I thought it would be a lot more!). Well, no need to search for a good "unsafe" name then. ;)

I could see using the unsafe versions when you check the input once and then convert many slices that are then known to be safe. So it isn't unreasonable to have them in there. I don't know the use cases well enough to offer an opinion.
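A sketch of that pattern in D (`toUtf16Unchecked` and `splitIntoSlices` are hypothetical names for illustration, not functions from the library under discussion; `std.utf.validate` is the real Phobos validator):

```d
import std.utf : validate;

void convertMany(const(char)[] source)
{
    // One full validation pass up front; throws on malformed input,
    // so everything below can assume the source is valid UTF8.
    validate(source);

    // As long as the slice boundaries fall on code-point boundaries,
    // every slice of a validated string is itself valid UTF8, so a
    // hypothetical unchecked converter can skip per-unit validation:
    foreach (slice; splitIntoSlices(source))
    {
        const(wchar)[] utf16 = toUtf16Unchecked(slice);
        // ... work with utf16 ...
    }
}
```

The one-time validation cost is amortized over all the later conversions, which is where the unchecked variant could pay off.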

>> 2) it looks like you reallocate the output buffer inside the loop - can it be moved to outside?
>
> Why? To shorten the loop? I thought the buffer should only be reallocated if the conversion itself shows it is too short. Do you want to move it before the loop (so that a reallocation *cannot* occur inside it), or just outside (with a goto SomeWhereOutsideTheLoop and, after the reallocation, a goto BackIntoTheLoop)?

How about: if it needs to grow the buffer, it does so in one large chunk instead of many small ones. That is, the buffer doesn't have to fit exactly. Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.

>> 3) the formatting of the source code is somewhat unusual. I missed the loop at first.
>
> Changed.
>
> Thanks for the reply,
> uwe


May 18, 2005
> I could see using the unsafe versions when you check the input once and then
> convert many slices that one then knows to be safe. So it isn't unreasonable
> to have it in there. I don't know the use cases well enough to offer up an opinion.

Imagine a program that reads a lot of files from disk, does some fuzzy work on them, and writes others back, for example a doc tool. It reads the source files in UTF8 format and converts them to the internally used UTF16 (using the safe functions). It then does some processing here and there, extracts the comments and formats them. After that it puts out HTML files in UTF8. The comments need to be converted back to UTF8, and that is where the program could use the unsafe functions.

At least those were my thoughts. But if the speed gain is under 30%, I think the fast versions are unnecessary. Imagine the doc tool needs a minute for output. With the current functions this would drop to 50 seconds at most, provided that the output consists only of UTF conversion (which is very unlikely).

>>> 2) it looks like you reallocate the output buffer inside the loop - can
>>> it be moved to outside?
>>
>> Why? To shorten the loop? I thought the buffer should only be reallocated
>> if the conversion itself shows it is too short. Do you want to move it
>> before the loop (so that a reallocation *cannot* occur inside it), or just
>> outside (with a goto SomeWhereOutsideTheLoop and after the reallocation
>> goto BackIntoTheLoop)?
>
> How about if it needs to grow the buffer it does so with a large chunk
> instead of many small chunks. That is, the buffer doesn't have to fit
> exactly. Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.

Hmm, the current source is:

if (pOut >= endOut)
{
  // ...
  buffer.length = buffer.length + (endIn - pIn) + 2;      // Will be enough.
  // ...
}

This will grow the buffer only once: (endIn - pIn) is the number of UTF8 bytes left to be processed, and they cannot expand to more than the same number of UTF16 code units (a 1-byte UTF8 encoding becomes a 1-word UTF16 encoding, and a 4-byte UTF8 encoding becomes a 2-word UTF16 encoding). The same goes for toUtf8().

But you are right, this could still be moved before the loop, especially in toUtf16(). That's because (endIn - pIn) is a very accurate guess for languages with a lot of ASCII in them.
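A minimal sketch of that preallocating variant, using the bound above. This is not the code under discussion (which works with raw pointers); it leans on the modern Phobos `std.utf.decode`/`encode` helpers for brevity:

```d
import std.utf : decode, encode;

wchar[] toUtf16Sketch(const(char)[] s)
{
    // Worst case per the bound above: every input byte yields at most
    // one UTF16 code unit, so a single up-front allocation always
    // suffices and the loop needs no bounds check on the output.
    auto buffer = new wchar[s.length];
    size_t i = 0, o = 0;
    while (i < s.length)
    {
        dchar c = decode(s, i);      // advances i past one code point
        wchar[2] tmp;
        size_t n = encode(tmp, c);   // writes 1 or 2 UTF16 code units
        buffer[o .. o + n] = tmp[0 .. n];
        o += n;
    }
    return buffer[0 .. o];           // trim to the length actually used
}
```

For mostly-ASCII input the trimmed slice wastes almost nothing, which matches the "very accurate guess" observation.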

Ciao
uwe
May 18, 2005
"Uwe Salomon" <post@uwesalomon.de> wrote in message news:op.sqy4o0ud6yjbe6@sandmann.maerchenwald.net...
>> I could see using the unsafe versions when you check the input once and
>> then
>> convert many slices that one then knows to be safe. So it isn't
>> unreasonable
>> to have it in there. I don't know the use cases well enough to offer up
>> an opinion.
>
> Imagine a program that reads a lot of files from disk, does some fuzzy work on them, and writes others back, for example a doc tool. It reads the source files in UTF8 format and converts them to the internally used UTF16 (using the safe functions). It then does some processing here and there, extracts the comments and formats them. After that it puts out HTML files in UTF8. The comments need to be converted back to UTF8, and that is where the program could use the unsafe functions.
>
> At least those were my thoughts. But if the speed gain is under 30%, I think the fast versions are unnecessary. Imagine the doc tool needs a minute for output. With the current functions this would drop to 50 seconds at most, provided that the output consists only of UTF conversion (which is very unlikely).

sounds reasonable

>>>> 2) it looks like you reallocate the output buffer inside the loop - can it be moved to outside?
>>>
>>> Why? To shorten the loop? I thought the buffer should only be
>>> reallocated
>>> if the conversion itself shows it is too short. Do you want to move it
>>> before the loop (so that a reallocation *cannot* occur inside it), or just
>>> outside (with a goto SomeWhereOutsideTheLoop and after the reallocation
>>> goto BackIntoTheLoop)?
>>
>> How about if it needs to grow the buffer it does so with a large chunk instead of many small chunks. That is, the buffer doesn't have to fit exactly. Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.
>
> Hmm, the current source is:
>
> if (pOut >= endOut)
> {
>   // ...
>   buffer.length = buffer.length + (endIn - pIn) + 2;      // Will be
> enough.
>   // ...
> }
>
> This will grow the buffer only once: (endIn - pIn) is the number of UTF8 bytes left to be processed, and they cannot expand to more than the same number of UTF16 code units (a 1-byte UTF8 encoding becomes a 1-word UTF16 encoding, and a 4-byte UTF8 encoding becomes a 2-word UTF16 encoding). The same goes for toUtf8().

ok - I didn't look at the details. I just saw the resizing happening in the loop and guessed it was resizing a little bit each time. What you have seems reasonable.

> But you are right, this could still be moved before the loop, especially in toUtf16(). That's because (endIn - pIn) is a very accurate guess for languages with a lot of ASCII in them.
>
> Ciao
> uwe


May 18, 2005
>> This will grow the buffer only once: (endIn - pIn) is the number of UTF8
>> bytes left to be processed, and they cannot expand to more than the same
>> number of UTF16 code units (a 1-byte UTF8 encoding becomes a 1-word
>> UTF16 encoding, and a 4-byte UTF8 encoding becomes a 2-word UTF16
>> encoding). The same goes for toUtf8().
>
> ok - I didn't look at the details. I just saw the resizing happening in the
> loop and guessed it was resizing a little bit each time. What you have seems
> reasonable.

Still, you are right. I moved it out of the loop in toUtf16(). I will think about it for the other functions; I am not sure what is best in each case (it always depends on the characters in the string).

I am now writing the other four functions (that is much easier now, as those two were the most complex). After finishing and testing them, I'll beep again. :)

By the way... how are the Phobos docs generated? Hand-crafted? I will also update the corresponding sections if you let me...

Ciao
uwe
May 21, 2005
I have now moved the UTF conversion code into the std.utf module. I have made the following changes:

* The tabs are now spaces. Sorry... :)
* Slight change in the UTF8stride array. Unicode 4.0.1 declares some encodings illegal, including the 5- and 6-byte encodings and some at the beginning of the 2-byte range.
* Slight change in stride(wchar), toUTFindex(wchar) and toUCSindex(wchar). I changed the detection of UTF16 surrogate values to a faster variant that does not need a local variable.
* Replacement of all toUTF() functions, except the ones that only validate because the return type has the same encoding as the parameter. toUTF16z() is still there as well, but changed to use my own toUTF16 (it zero-terminates the strings anyway).
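The post does not show the new surrogate test, so the exact change is an assumption; a common variant with exactly these properties replaces the two-sided range comparison with a single mask-and-compare, since the surrogates 0xD800-0xDFFF are precisely the values whose top five bits are 11011:

```d
// Range test: reads the value twice, which typically means a local:
bool isSurrogateRange(wchar c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

// Mask test: one AND and one comparison, no second read needed:
bool isSurrogateMask(wchar c)
{
    return (c & 0xF800) == 0xD800;
}

unittest
{
    // The two tests agree for every possible UTF16 code unit.
    foreach (uint c; 0 .. 0x10000)
        assert(isSurrogateMask(cast(wchar) c) == isSurrogateRange(cast(wchar) c));
}
```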

I have not changed the encode/decode functions, even though they really need some work (especially the UTF8 decode() function). I will happily do that, but I want to know first whether my previous work is OK.

Ciao
uwe
May 21, 2005
And here goes the attachment %)
