October 12, 2016
On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote:
> On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote:
>> [...]
>
> All three are slower than baseline, for my test-case.
> What did you test it against.

The blns.txt file mentioned upthread.
October 12, 2016
On Wednesday, 12 October 2016 at 10:15:17 UTC, Matthias Bentrup wrote:
> On Wednesday, 12 October 2016 at 09:23:53 UTC, Stefan Koch wrote:
>> On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote:
>>> [...]
>>
>> All three are slower than baseline, for my test-case.
>> What did you test it against.
>
> The blns.txt file mentioned upthread.

And what were your timings ?

BTW. the code you posted would not be a proper replacement for utf8 popFront since our UTF8-popFront does handle depreacted sequences correctly.

making it equivalent to this code :
void pfnew(ref char[] s) @trusted pure nothrow
{
    immutable c = s[0];
    uint char_length = 1;
    if (c < 192)
    {
Lend :
        s = s.ptr[char_length .. s.length];
    } else {
        if (c < 192+32)
        {
            char_length = 2;
        }
        else if (c < 192+32+16)
        {
            char_length = 3;
        }
        else if (c < 192+32+16+8)
        {
            char_length = 4;
        }
        else if (c < 192+32+16+8+4)
        {
            char_length = 5;
        }
        else if (c < 192+32+16+8+4+2)
        {
            char_length = 6;
        }

        char_length = char_length > s.length ? s.length : char_length;
        goto Lend;
    }
}
October 12, 2016
I just confirmed that branching version is faster then table-lookup.

please test it our for yourself
http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj

The table-lookup does produce the smallest code though.

October 12, 2016
On 10/12/2016 04:56 AM, Matthias Bentrup wrote:
>
> void popFront1b(ref char[] s) @trusted pure nothrow {
>   immutable c = cast(byte)s[0];
>   if (c >= -8) {
>     s = s[1 .. $];
>   } else {
>     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
>     import std.algorithm;
>     s = s[min(i, $) .. $];
>   }
> }

This is really small although it does two jumps on the frequent path. Perhaps an improvement over the current implementation. Nice work! -- Andrei
October 12, 2016
On 10/12/2016 05:23 AM, Stefan Koch wrote:
> All three are slower than baseline, for my test-case.
> What did you test it against.

I'd say: (a) test for speed of ASCII-only text; (b) make it small. That's all we need. Nobody worries about 10-20% in multibyte-heavy text. -- Andrei
October 12, 2016
On 10/12/2016 06:56 AM, Stefan Koch wrote:
> I just confirmed that branching version is faster then table-lookup.
>
> please test it our for yourself
> http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj
>
> The table-lookup does produce the smallest code though.

Nice. I like that the table is NOT looked up on the ASCII path, so it stays cold in ASCII text. However, there's a problem with this pattern:

    size_t char_length = 1;
    immutable c = s[0];
    if (c < 192)
    {
        Lend :
        s = s.ptr[char_length .. s.length];
        return ;
    }

as opposed to:

    immutable c = s[0];
    if (c < 192)
    {
        Lend :
        s = s.ptr[1 .. s.length];
        return ;
    }

In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei
October 12, 2016
On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:
> On 10/12/2016 06:56 AM, Stefan Koch wrote:
>> I just confirmed that branching version is faster then table-lookup.
>>
>> please test it our for yourself
>> http://paste.ofcode.org/3CpieAhkrTYEcSncbPKbrj
>>
>> The table-lookup does produce the smallest code though.
>
> Nice. I like that the table is NOT looked up on the ASCII path, so it stays cold in ASCII text. However, there's a problem with this pattern:
>
>     size_t char_length = 1;
>     immutable c = s[0];
>     if (c < 192)
>     {
>         Lend :
>         s = s.ptr[char_length .. s.length];
>         return ;
>     }
>
> as opposed to:
>
>     immutable c = s[0];
>     if (c < 192)
>     {
>         Lend :
>         s = s.ptr[1 .. s.length];
>         return ;
>     }
>
> In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei

btw.
We could get rid of of a sub if we changed slices from holding a pointer and a length to holding two pointers.

October 12, 2016
On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:
> On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:
>>
>> In the second case, the compiler generates an inc for bumping the pointer and a dec for decreasing the length (small instructions). If the variable char_length is used, add/sub must be used (larger). -- Andrei

When I write code so i can keep the
str = str[1 .. str.length]
It leads to lager code overall and decreases performance.
both in table lookup and in the multi-branch code.

update:
for ldc the table-lookup code is faster 27% then the current popFront.
on dmd the table-lookup is 2% faster then the current popFront.

for ldc the branching code is 23% faster then the current popFront
for dmd the branching code is 20% faster then the current popFront
October 12, 2016
On 10/12/2016 09:39 AM, Stefan Koch wrote:
> On Wednesday, 12 October 2016 at 13:32:45 UTC, Stefan Koch wrote:
>> On Wednesday, 12 October 2016 at 12:46:50 UTC, Andrei Alexandrescu wrote:
>>>
>>> In the second case, the compiler generates an inc for bumping the
>>> pointer and a dec for decreasing the length (small instructions). If
>>> the variable char_length is used, add/sub must be used (larger). --
>>> Andrei
>
> When I write code so i can keep the
> str = str[1 .. str.length]
> It leads to lager code overall and decreases performance.
> both in table lookup and in the multi-branch code.
>
> update:
> for ldc the table-lookup code is faster 27% then the current popFront.
> on dmd the table-lookup is 2% faster then the current popFront.
>
> for ldc the branching code is 23% faster then the current popFront
> for dmd the branching code is 20% faster then the current popFront

Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- Andrei
October 12, 2016
On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote:
> On 10/12/2016 09:39 AM, Stefan Koch wrote:

>
> Thanks! I'd say make sure there is exactly 0% loss on performance compared to the popFront in the ASCII case, and if so make a PR with the table version. -- Andrei

I measured again.
The table version has a DECREASES the performance for dmd by 1%.
I think we should keep performance for dmd in mind.
I could add the table version in a version (LDC) block.

I cannot currently measure on gdc.
I'd appreciate more perf data.