September 21, 2011
On 09/21/2011 12:37 PM, Dmitry Olshansky wrote:
> On 21.09.2011 4:04, Timon Gehr wrote:
>> On 09/21/2011 01:57 AM, Christophe wrote:
>>> "Jonathan M Davis" , dans le message (digitalmars.D.learn:29637), a
>>> écrit :
>>>> On Tuesday, September 20, 2011 14:43 Andrej Mitrovic wrote:
>>>>> On 9/20/11, Jonathan M Davis<jmdavisProg@gmx.com> wrote:
>>>>>> Or std.range.walkLength. I don't know why we really have
>>>>>> std.utf.count. I
>>>>>> just
>>>>>> calls walkLength anyway. I suspect that it's a function that predates
>>>>>> walkLength and was made to use walkLength after walkLength was
>>>>>> introduced. But
>>>>>> it's kind of pointless now.
>>>>>>
>>>>>> - Jonathan M Davis
>>>>>
>>>>> I don't think having better-named aliases is a bad thing. Although now
>>>>> I'm seeing it's not just an alias but a function.
>>>>
>>>
>>> std.utf.count has on advantage: someone looking for the function will
>>> find it. The programmer might not look in std.range to find a function
>>> about UFT strings, and even if he did, it is not indicated in walkLength
>>> that it works with (narrow) strings the way it does. To know you can use
>>> walklength, you must know that:
>>> -popFront works differently in string.
>>> -hasLength is not true for strings.
>>> -what is walkLength.
>>>
>>> So yes, you experienced programmer don't need std.utf.count, but newbies
>>> do.
>>>
>>> Last point: WalkLength is not optimized for strings.
>>> std.utf.count should be.
>>>
>>> This short implementation of count was 3 to 8 times faster than
>>> walkLength is a simple benchmark:
>>>
>>> size_t myCount(string text)
>>> {
>>> size_t n = text.length;
>>> for (uint i=0; i<text.length; ++i)
>>> {
>>> auto s = text[i]>>6;
>>> n -= (s>>1) - ((s+1)>>2);
>>> }
>>> return n;
>>> }
>>>
>>> (compiled with gdc on 64 bits, the sample text was the introduction of
>>> french wikipedia UTF-8 article down to the sommaire -
>>> http://fr.wikipedia.org/wiki/UTF-8 ).
>>>
>>> The reason is that the loop can be unrolled by the compiler.
>>
>> Very good point, you might want to file an enhancement request. It would
>> make the functionality different enough to prevent count from being
>> removed: walkLength throws on an invalid UTF sequence.
>
> Actually, I don't buy it. I guess the reason it's faster is that it
> doesn't check if the codepoint is valid. In fact you can easily get
> ridiculous overflowed "negative" lengths.

Most of these could be caught by a final check. I think having the option of a version that is so much faster would be nice. Chances are pretty high that code actually manipulating the string will throw eventually if it is invalid.

> Maybe we can put it here as
> unsafe and fast version though.
> Also check std.utf.stride to see if you can get it better, it's the
> beast behind narrow string popFront.
>

September 21, 2011
> Actually, I don't buy it. I guess the reason it's faster is that it doesn't check if the codepoint is valid.

Why should it ? The documentation of std.utf.count says the string must
be validly encoded, not that it will enforce that it is.
Checking a string is valid everytime you use it would be very expensive.

Actually, std.range.walkLength does not check the sequence is valid. See this test:

void main()
{
  string text = "aléluyah";
  char[] text2 = text.dup;
  text2[3] = 'a';
  writeln(walkLength(text2)); // outputs: 8
  writeln(text2);             // outputs: al\303aluyah
}

There is probably a way to check an utf sequence is valid with an unrollable loop.

> In fact you can easily get ridiculous overflowed "negative" lengths. Maybe we can put it here as unsafe and fast version though.

Unless I am mistaken, the minimum length myCount can return is 0 even if the string is invalid.

> Also check std.utf.stride to see if you can get it better, it's the beast behind narrow string popFront.

stride does not make much checking. It can even return 5 or 6, which is not possible for a valid utf-8 string !

The equivalent of myCount to stride would be:

size_t myStride(char c)
{
    // optional:
    // if ( (((c>>7)+1)>>1) - (((c>>6)+1)>>2) + (((c>>3)+1)>>5))
    //     throw new UtfException("Not the start of the UTF-8 sequence");
    return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
}

That I compared to:

size_t utfLikeStride(char c)
{
  // optional:
  // immutable result = UTF8stride[c];
  // if (result == 0xFF)
  // throw new UtfException("Not the start of the UTF-8 sequence");
  // return result;
  return UTF8stride[c];
}

One table lookup is replaced by byte some arythmetic in myStride.

I also took only one char as input, since stride only looked at the i-th character. Actually, if stride signature is kept to uint "stride(char[] s, int i)", I did not find any change with -O3.

Average times for "a lot" of calls:
(compiled with gcc, tested with -O3 and a homogenous distribution of
"valid" characters from '\x00'..'\x7F' and '\xC2'..'\xF4')

myStride no throws:      1112ms.
utfLikeStride no throws: 1433ms.
utfLikeStride throws:    1868ms. (the current implementation).
myStride throws:         8269ms.

Removing throws from utfLikeStride makes it about 25% faster. Removing throws from myStride makes it about 7 times faster.

With -O0, myStride gets less 10% slower than utfLikeStride (no throws).

In conclusion, the fastest implementation is myStride without throws, and it beats the current implementation by about 40%. Changing std.utf.stride may be desirable. As I said earlier, the throws do not enforce the validity of the string. Really checking the validity of the string would cost much more, which may not be desirable, so why bother checking at all? A more serious benchmark could justify to change std.utf.stride. The improvement could be even better in real situation, because the lookup table of utfLikeStride may not be always at hand - this actually really depends on what the compiler does.

In any case, this may not improve walkLength by more than a few percents.

-- 
Christophe

now I'll go back to my real work...
September 21, 2011
On 21.09.2011 01:57, Christophe wrote:
>
> size_t myCount(string text)
> {
>    size_t n = text.length;
>    for (uint i=0; i<text.length; ++i)
>      {
>        auto s = text[i]>>6;
>        n -= (s>>1) - ((s+1)>>2);
>      }
>    return n;
> }
>

Here is a more readable and a bit faster version on dmd windows:

size_t utfCount(string text)
{
    size_t n = 0;
    for (uint i=0; i<text.length; ++i)
         n += ((text[i]>>6)^0b10)? 1: 0;
    return n;
}
September 21, 2011
On 21.09.2011 18:47, Christophe wrote:
>> Actually, I don't buy it. I guess the reason it's faster is that it
>> doesn't check if the codepoint is valid.
>
> Why should it ? The documentation of std.utf.count says the string must
> be validly encoded, not that it will enforce that it is.
> Checking a string is valid everytime you use it would be very expensive.
>
> Actually, std.range.walkLength does not check the sequence is valid. See
> this test:
>
> void main()
> {
>    string text = "aléluyah";
>    char[] text2 = text.dup;
>    text2[3] = 'a';
>    writeln(walkLength(text2)); // outputs: 8
>    writeln(text2);             // outputs: al\303aluyah
> }

Ouch, the checking is apparently very loosy.

>
> There is probably a way to check an utf sequence is valid with an
> unrollable loop.
>
>> In fact you can easily get ridiculous overflowed "negative" lengths.
>> Maybe we can put it here as unsafe and fast version though.
>
> Unless I am mistaken, the minimum length myCount can return is 0 even
> if the string is invalid.

Yeah, a brain malfunction on my part.

>
>> Also check std.utf.stride to see if you can get it better, it's the
>> beast behind narrow string popFront.
>
> stride does not make much checking. It can even return 5 or 6, which is
> not possible for a valid utf-8 string !
>
> The equivalent of myCount to stride would be:
>
> size_t myStride(char c)
> {
>      // optional:
>      // if ( (((c>>7)+1)>>1) - (((c>>6)+1)>>2) + (((c>>3)+1)>>5))
>      //     throw new UtfException("Not the start of the UTF-8 sequence");
>      return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
> }
>
> That I compared to:
>
> size_t utfLikeStride(char c)
> {
>    // optional:
>    // immutable result = UTF8stride[c];
>    // if (result == 0xFF)
>    // throw new UtfException("Not the start of the UTF-8 sequence");
>    // return result;
>    return UTF8stride[c];
> }
>
> One table lookup is replaced by byte some arythmetic in myStride.
>
> I also took only one char as input, since stride only looked at the i-th
> character. Actually, if stride signature is kept to uint "stride(char[]
> s, int i)", I did not find any change with -O3.
>
> Average times for "a lot" of calls:
> (compiled with gcc, tested with -O3 and a homogenous distribution of
> "valid" characters from '\x00'..'\x7F' and '\xC2'..'\xF4')
>
> myStride no throws:      1112ms.
> utfLikeStride no throws: 1433ms.
> utfLikeStride throws:    1868ms. (the current implementation).
> myStride throws:         8269ms.
>
I wonder what impact may have if any changing 0xff to 0x00 in implementation of utfLikeStride. It should amount to cmp vs test, not sure if it matters much.

> Removing throws from utfLikeStride makes it about 25% faster.
> Removing throws from myStride makes it about 7 times faster.
>
> With -O0, myStride gets less 10% slower than utfLikeStride (no throws).
>
> In conclusion, the fastest implementation is myStride without throws,
> and it beats the current implementation by about 40%. Changing
> std.utf.stride may be desirable. As I said earlier, the throws do
> not enforce the validity of the string. Really checking the validity of
> the string would cost much more, which may not be desirable, so why
> bother checking at all?

The truth is I'd checked this in the past (though I used some bsr black magic) and if I kept check in place the end result was always slower then current. But since the check is not very accurate anyway, maybe it can be replaced. It's problematic if some code happen to depend on it. (given the doc it should not)

> A more serious benchmark could justify to change
> std.utf.stride. The improvement could be even better in real situation,
> because the lookup table of utfLikeStride may not be always at hand -
> this actually really depends on what the compiler does.
>
Yes and no, I think it would be hard to find app that bottlenecks at traversing UTF, on decoding - maybe. Generally if you do a lot calls to stride it's in cache, if not it doesn't matter much(?). Though I'd prefer non-tabulated version

> In any case, this may not improve walkLength by more than a few
> percents.
>

Then specializing walkLength to do your unrollable version seems like good idea.

-- 
Dmitry Olshansky
September 21, 2011
> Here is a more readable and a bit faster version on dmd windows:
> 
> size_t utfCount(string text)
> {
>      size_t n = 0;
>      for (uint i=0; i<text.length; ++i)
>           n += ((text[i]>>6)^0b10)? 1: 0;
>      return n;
> }

Nice. It is better with gdc linux 64bits too. I wanted to avoid conditional expressions like ?: but it's actually slightly faster that way.

And now people can't tell it is dangerous because it could return a fuzzy number.

Even faster, through less readable:

size_t utfLength(string text)
{
  size_t n=0;
  for (size_t i=0; i<text.length; ++i)
    n += (((text[i]>>6)^0b10) != 0);
  return n;
}

Let's see how we can boost std.utf.stride that way...

-- 
Christophe
September 21, 2011
On 21.09.2011 19:12, Christophe Travert wrote:
> Nice. It is better with gdc linux 64bits too. I wanted to avoid
> conditional expressions like ?: but it's actually slightly faster that
> way.

It is not compiled in as conditional jump.
1 2 3
Next ›   Last »