October 11, 2016
On 10/11/2016 05:45 AM, Temtaime wrote:
> void popFront7(ref char[] s) @trusted pure nothrow
> {
>   import core.bitop;
>   auto v = 7 - bsr(~s[0] | 1);
>   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
> }
>
> Please check this.

Thanks. This does a lot of work on the frequent path c < 0x80:

pure nothrow @trusted void example.popFront7(ref char[]):
        movq    8(%rdi), %rax
        movzbl  (%rax), %ecx
        xorq    $254, %rcx
        orq     $1, %rcx
        bsrq    %rcx, %rcx
        notl    %ecx
        addl    $8, %ecx
        cmpl    $6, %ecx
        jg      .LBB0_2
        testl   %ecx, %ecx
        je      .LBB0_2
        movslq  %ecx, %rdx
        movq    (%rdi), %rcx
        cmpq    %rcx, %rdx
        cmovaq  %rcx, %rdx
        jmp     .LBB0_3
.LBB0_2:
        movq    (%rdi), %rcx
        movl    $1, %edx
.LBB0_3:
        addq    %rdx, %rax
        subq    %rdx, %rcx
        movq    %rcx, (%rdi)
        movq    %rax, 8(%rdi)
        retq

So I changed it to:

void popFront7(ref char[] s) @trusted pure nothrow
{
  immutable c = s[0];
  if (c < 0x80) {
    s = s.ptr[1 .. s.length];
  } else {
    import core.bitop;
    uint v = 7 - bsr(~c | (c > 0xfd) << 6u);
    s = s.ptr[v > s.length ? s.length : v .. s.length];
  }
}

That's about as large as the baseline.


Andrei

October 11, 2016
On Tuesday, 11 October 2016 at 14:49:28 UTC, Matthias Bentrup wrote:
>
> This is the result I'd like to get, but I can't find a way to write it without inline assembly :(
>
> void popFrontAsmIntel(ref char[] s) @trusted pure nothrow {
>   immutable c = s[0];
>   if (c < 0x80) {
>     s = s[1 .. $];
>   } else {
>     uint l = void;
>     asm pure nothrow @nogc {
>       mov EAX, 1;
>       mov BL, 0xf8-1;
>       sub BL, c;
>       cmp BL, 0xf8-0xc0;
>       adc EAX, 0;
>       cmp BL, 0xf8-0xe0;
>       adc EAX, 0;
>       cmp BL, 0xf8-0xf0;
>       adc EAX, 0;
>       mov l, EAX;
>     }
>     s = s[l <= $ ? l : $ .. $];
>   }
> }

This takes 180us.
Baseline takes 124us.



October 11, 2016
On 10/11/2016 10:49 AM, Matthias Bentrup wrote:
>
> void popFrontAsmIntel(ref char[] s) @trusted pure nothrow {
>   immutable c = s[0];
>   if (c < 0x80) {
>     s = s[1 .. $];
>   } else {
>     uint l = void;
>     asm pure nothrow @nogc {
>       mov EAX, 1;
>       mov BL, 0xf8-1;
>       sub BL, c;
>       cmp BL, 0xf8-0xc0;
>       adc EAX, 0;
>       cmp BL, 0xf8-0xe0;
>       adc EAX, 0;
>       cmp BL, 0xf8-0xf0;
>       adc EAX, 0;
>       mov l, EAX;
>     }
>     s = s[l <= $ ? l : $ .. $];
>   }
> }

Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge. -- Andrei

October 11, 2016
On 10/10/2016 11:00 PM, Stefan Koch wrote:
>
> void popFront3(ref char[] s) @trusted pure nothrow {
>    immutable c = s[0];
>    uint char_length = 1;
>    if (c < 127)
>    {
>    Lend :
>      s = s.ptr[char_length .. s.length];
>    } else {
>      if ((c & b01100_0000) == 0b1000_0000)
>      {
>        //just skip one in case this is not the beginning of a code-point
> char
>        goto Lend;
>      }
>      if (c < 192)
>      {
>        char_length = 2;
>        goto Lend;
>      }
>      if (c < 240)
>      {
>        char_length = 3;
>        goto Lend;
>      }
>      if (c < 248)
>      {
>        char_length = 4;
>        goto Lend;
>      }
>    }
>  }

Looked at this, still seems to generate a jump forward with ldc. Also, why do you leave a fallthrough path? Progress needs to be made on all paths, otherwise we have infinite loops.


Andrei

October 11, 2016
On Tuesday, 11 October 2016 at 15:08:34 UTC, Andrei Alexandrescu wrote:
> Looked at this, still seems to generate a jump forward with ldc.

ldc.intrinsics.llvm_expect might help to influence basic block layout.

 — David
October 11, 2016
On Tuesday, 11 October 2016 at 15:08:34 UTC, Andrei Alexandrescu wrote:
> On 10/10/2016 11:00 PM, Stefan Koch wrote:
>>
>> [...]
>
> Looked at this, still seems to generate a jump forward with ldc. Also, why do you leave a fallthrough path? Progress needs to be made on all paths, otherwise we have infinite loops.

I forgot that the fall-trough did no longer end in Lend;
That forward jump to Lend is a very common and therefore predicted branch.


I will now run this problem through STOKE.
Let's see what it comes up with :)
October 11, 2016
On 10/11/16 11:15 AM, Stefan Koch wrote:
> I will now run this problem through STOKE.
> Let's see what it comes up with :)

http://stoke.stanford.edu you mean? That would be cool. Keep us posted! -- Andrei
October 11, 2016
On Tuesday, 11 October 2016 at 16:13:45 UTC, Andrei Alexandrescu wrote:
> On 10/11/16 11:15 AM, Stefan Koch wrote:
>> I will now run this problem through STOKE.
>> Let's see what it comes up with :)
>
> http://stoke.stanford.edu you mean? That would be cool. Keep us posted! -- Andrei

Yep I mean that one.
It will take a while to work out the right cost-functions.
I'll do a PR as soon as this bears fruit.
October 12, 2016
On Tuesday, 11 October 2016 at 15:01:47 UTC, Andrei Alexandrescu wrote:
> On 10/11/2016 10:49 AM, Matthias Bentrup wrote:
>>
>> void popFrontAsmIntel(ref char[] s) @trusted pure nothrow {
>>   immutable c = s[0];
>>   if (c < 0x80) {
>>     s = s[1 .. $];
>>   } else {
>>     uint l = void;
>>     asm pure nothrow @nogc {
>>       mov EAX, 1;
>>       mov BL, 0xf8-1;
>>       sub BL, c;
>>       cmp BL, 0xf8-0xc0;
>>       adc EAX, 0;
>>       cmp BL, 0xf8-0xe0;
>>       adc EAX, 0;
>>       cmp BL, 0xf8-0xf0;
>>       adc EAX, 0;
>>       mov l, EAX;
>>     }
>>     s = s[l <= $ ? l : $ .. $];
>>   }
>> }
>
> Did you take a look at the codegen on http://ldc.acomirei.ru? It's huge. -- Andrei

Here are three branch-less variants that use the sign instead of the carry bit.

The last one is the fastest on my machine, although it mixes the rare error case and the common 1-byte case into one branch.

void popFront1(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= 0) {
    s = s[1 .. $];
  } else if (c < -8) {
    uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
    import std.algorithm;
    s = s[min(i, $) .. $];
  } else {
    s = s[1 .. $];
  }
}

void popFront1a(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= 0) {Three
    s = s[1 .. $];
  } else {
    uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31)) & (c + 8 >> 31));
    import std.algorithm;
    s = s[min(i, $) .. $];
  }
}

void popFront1b(ref char[] s) @trusted pure nothrow {
  immutable c = cast(byte)s[0];
  if (c >= -8) {
    s = s[1 .. $];
  } else {
    uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
    import std.algorithm;
    s = s[min(i, $) .. $];
  }
}

October 12, 2016
On Wednesday, 12 October 2016 at 08:56:59 UTC, Matthias Bentrup wrote:
>
> Here are three branch-less variants that use the sign instead of the carry bit.
>
> The last one is the fastest on my machine, although it mixes the rare error case and the common 1-byte case into one branch.
>
> void popFront1(ref char[] s) @trusted pure nothrow {
>   immutable c = cast(byte)s[0];
>   if (c >= 0) {
>     s = s[1 .. $];
>   } else if (c < -8) {
>     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
>     import std.algorithm;
>     s = s[min(i, $) .. $];
>   } else {
>     s = s[1 .. $];
>   }
> }
>
> void popFront1a(ref char[] s) @trusted pure nothrow {
>   immutable c = cast(byte)s[0];
>   if (c >= 0) {Three
>     s = s[1 .. $];
>   } else {
>     uint i = 1 + ((3 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31)) & (c + 8 >> 31));
>     import std.algorithm;
>     s = s[min(i, $) .. $];
>   }
> }
>
> void popFront1b(ref char[] s) @trusted pure nothrow {
>   immutable c = cast(byte)s[0];
>   if (c >= -8) {
>     s = s[1 .. $];
>   } else {
>     uint i = 4 + (c + 64 >> 31) + (c + 32 >> 31) + (c + 16 >> 31);
>     import std.algorithm;
>     s = s[min(i, $) .. $];
>   }
> }

All three are slower than baseline, for my test-case.
What did you test it against.