October 11, 2016
On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
>
> void popFront1(ref char[] s) @trusted pure nothrow
> {
>   import core.bitop, std.algorithm;
>   auto v = bsr(~s[0] | 1);
>   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
> }
>
> Seems to be less if i'm not wrong.


Yours runs with 790 us best time.
bsr is a real timetaker :)

October 11, 2016
On Tuesday, 11 October 2016 at 08:57:46 UTC, Stefan Koch wrote:
> On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
>>
>> void popFront1(ref char[] s) @trusted pure nothrow
>> {
>>   import core.bitop, std.algorithm;
>>   auto v = bsr(~s[0] | 1);
>>   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
>> }
>>
>> Seems to be less if i'm not wrong.
>
>
> Yours runs with 790 us best time.
> bsr is a real timetaker :)

CORRECTION this is not bsr's fault.
It's most likely clamp.
I am compiling with dmd and dmd is not as good in optimizing when templates are in the mix.

October 11, 2016
On Tuesday, 11 October 2016 at 09:13:10 UTC, Stefan Koch wrote:
> On Tuesday, 11 October 2016 at 08:57:46 UTC, Stefan Koch wrote:
>> On Tuesday, 11 October 2016 at 08:44:04 UTC, Temtaime wrote:
>>>
>>> void popFront1(ref char[] s) @trusted pure nothrow
>>> {
>>>   import core.bitop, std.algorithm;
>>>   auto v = bsr(~s[0] | 1);
>>>   s = s[clamp(v, 1, v > 6 ? 1 : $)..$];
>>> }
>>>
>>> Seems to be less if i'm not wrong.
>>
>>
>> Yours runs with 790 us best time.
>> bsr is a real timetaker :)
>
> CORRECTION this is not bsr's fault.
> It's most likely clamp.
> I am compiling with dmd and dmd is not as good in optimizing when templates are in the mix.

Sorry this was also a type in the code.

void popFront7(ref char[] s) @trusted pure nothrow
{
  import core.bitop;
  auto v = 7 - bsr(~s[0] | 1);
  s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
}

Please check this.
October 11, 2016
On Tuesday, 11 October 2016 at 09:45:11 UTC, Temtaime wrote:
>
> Sorry this was also a type in the code.
>
> void popFront7(ref char[] s) @trusted pure nothrow
> {
>   import core.bitop;
>   auto v = 7 - bsr(~s[0] | 1);
>   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
> }
>
> Please check this.

162 us
October 11, 2016
On Tuesday, 11 October 2016 at 10:01:41 UTC, Stefan Koch wrote:
> On Tuesday, 11 October 2016 at 09:45:11 UTC, Temtaime wrote:
>>
>> Sorry this was also a type in the code.
>>
>> void popFront7(ref char[] s) @trusted pure nothrow
>> {
>>   import core.bitop;
>>   auto v = 7 - bsr(~s[0] | 1);
>>   s = s[v > 6 ? 1 : (v ? (v > s.length ? s.length : v) : 1)..$];
>> }
>>
>> Please check this.
>
> 162 us

The branching, it hurts my eyes!

Something like the following should give correct (assuming I haven't written bad logic) branchless results with architecture-optimised max calls. Note that the minus/plus 1 operation on the third line will ensure with the sign multiplication that values of 7 will map to 1, whereas for all other values it's an extra operation. But the advantage is that you're not sticking three branches in close proximity to each other, so you will never get a branch predictor fail. (Of note, any performance test for these functions should test with data designed to fail the branching code I quoted, keeping in mind that desktop Intel processors have a four-state branch predictor. I've not performance tested it myself, but this will certainly run faster on the AMD Jaguar processors than a version with branching checks.)

int v = 7 - bsr( ~s[0] | 1 );
int sign = ( (v - 7) >> 31 );
v = ( v - 1 ) * sign + 1;
str = str[ min( v, s.length ) .. $ ];
October 11, 2016
On 10/11/2016 04:57 AM, Stefan Koch wrote:
> Yours runs with 790 us best time.
> bsr is a real timetaker :)

What inputs did you test it on?

Here's what I think would be a good set of requirements:

* The ASCII case should be short and fast: a comparison and a branch, followed by return. This would improve a very common case and address the main issue with autodecoding.

* For the multibyte case, the main requirement is the code must be small. This is because it gets inlined all over the place and seldom used.

* For the multibyte case, the fewer bytes in the encoding the less work. This is because more frequent multi-byte characters have generally lower codes.

Currently front() - the other time spender in autodecoding - issues a function call on the multibyte case. That makes the code of front() itself small, at the cost of more expensive multibyte handling.


Andrei

October 11, 2016
On Tuesday, 11 October 2016 at 14:16:54 UTC, Andrei Alexandrescu wrote:
> On 10/11/2016 04:57 AM, Stefan Koch wrote:
>> Yours runs with 790 us best time.
>> bsr is a real timetaker :)
>
> What inputs did you test it on?

https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt

> Here's what I think would be a good set of requirements:
>
> * The ASCII case should be short and fast: a comparison and a branch, followed by return. This would improve a very common case and address the main issue with autodecoding.
Already done
> * For the multibyte case, the main requirement is the code must be small. This is because it gets inlined all over the place and seldom used.
>
> * For the multibyte case, the fewer bytes in the encoding the less work. This is because more frequent multi-byte characters have generally lower codes.
That is why I had the branches, generally only the first one is taken
> Currently front() - the other time spender in autodecoding - issues a function call on the multibyte case. That makes the code of front() itself small, at the cost of more expensive multibyte handling.
I think at some point we have to cache the length of the last decoded char,
Otherwise we are throwing work away.

However that will only work within a RangeWrapper-Struct

October 11, 2016
On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
> A branch-free version:
>
> void popFront4(ref char[] s) @trusted pure nothrow {
>   immutable c = s[0];
>   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
>   s = s.ptr[char_length .. s.length];
> }
>
> Theoretically the char_length could be computed with three sub and addc
> instructions, but no compiler is smart enough to detect that.

Interesting. 0x80 should be special-cased and you forgot to check the bounds. So:

void popFront4(ref char[] s) @trusted pure nothrow {
  immutable c = s[0];
  if (c < 0x80) {
    s = s.ptr[1 .. s.length];
  } else {
    uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248);
    s = s.ptr[l <= s.length ? l : s.length .. s.length];
  }
}

This generated 27 instructions, i.e. same as the baseline. See:

http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3

But it doesn't seem to check for all errors.


Andrei
October 11, 2016
On Tuesday, 11 October 2016 at 14:24:56 UTC, Andrei Alexandrescu wrote:
> On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
>> A branch-free version:
>>
>> void popFront4(ref char[] s) @trusted pure nothrow {
>>   immutable c = s[0];
>>   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
>>   s = s.ptr[char_length .. s.length];
>> }
>>
>
> void popFront4(ref char[] s) @trusted pure nothrow {
>   immutable c = s[0];
>   if (c < 0x80) {
>     s = s.ptr[1 .. s.length];
>   } else {
>     uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248);
>     s = s.ptr[l <= s.length ? l : s.length .. s.length];
>   }
> }
>
> This generated 27 instructions, i.e. same as the baseline. See:
>
> http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3
>
> But it doesn't seem to check for all errors.
>
>
> Andrei

It's much slower.
Because of the flag-storing instructions.

October 11, 2016
On Tuesday, 11 October 2016 at 14:24:56 UTC, Andrei Alexandrescu wrote:
> On 10/11/2016 03:30 AM, Matthias Bentrup wrote:
>> A branch-free version:
>>
>> void popFront4(ref char[] s) @trusted pure nothrow {
>>   immutable c = s[0];
>>   uint char_length = 1 + (c >= 192) + (c >= 240) + (c >= 248);
>>   s = s.ptr[char_length .. s.length];
>> }
>>
>> Theoretically the char_length could be computed with three sub and addc
>> instructions, but no compiler is smart enough to detect that.
>
> Interesting. 0x80 should be special-cased and you forgot to check the bounds. So:
>
> void popFront4(ref char[] s) @trusted pure nothrow {
>   immutable c = s[0];
>   if (c < 0x80) {
>     s = s.ptr[1 .. s.length];
>   } else {
>     uint l = 1 + (c >= 192) + (c >= 240) + (c >= 248);
>     s = s.ptr[l <= s.length ? l : s.length .. s.length];
>   }
> }
>
> This generated 27 instructions, i.e. same as the baseline. See:
>
> http://ldc.acomirei.ru/#compilers:!((compiler:ldc,options:'-release+-O3+-boundscheck%3Doff',sourcez:G4ewlgJgBADiMDEBOIB2AXAjACiQUwDMoBjACwEMkBtAXSgGcBKKAAXSQFd709oYP8UVCHSkUAdygBvAFBQoYALaKO6cgCMANnhJQAvAyoAGGgG45CotmJQAPFCMAPABxHms%2BfPr6GAOhjsVJhQvr5%2B2qgA5qJmFgC%2BUHia9DoenkpwSOgkIPi%2B6mDo8OaeUBxgGAo%2BAOwcUAC0UOr0SNgAfjYAPlCYHIwl6YqZ2dwQvuSakbmFpIoD8mBWYFAAfFAAbH1VBpjzDD70/oGKFdhgADTheFGizKFXN6Sx8nEyrzKgkLDwyGjoAEy4QgkCjUOjcJDMNicbi8WACHTCUQSaQWJQqNRaHQ2AwQ4zPSxQax2BwuNwWNLyAD0VICSAU3i4cKKUHIn2gHFQXLwxDw9HolAAnk0QJyIN4yDyANYVSK%2BCxedgHdhHajBe4Q3wRaJPAaveRJFKE4l6AxOAgERgUhVQGlNFplVAQQgVOEEXIOG0Q5VIVVBEJhTXamJ6iyGvDW0oZXLZYi5PD5QrwKAALntSD25FUICginozRqDXT7WI/RtiyJ2DzBfs/2Y3Sr%2Be8a3WjCtpUpnhpAElUMAJl8AKoAFQQ9WcNvk1e8Oz2%2BsGwwY6DGEymSBmcy9StxKrpVBOqEbzUuQeuOrugZVwd18TeMiAA)),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3
>
> But it doesn't seem to check for all errors.
>
>
> Andrei

This is the result I'd like to get, but I can't find a way to write it without inline assembly :(

void popFrontAsmIntel(ref char[] s) @trusted pure nothrow {
  immutable c = s[0];
  if (c < 0x80) {
    s = s[1 .. $];
  } else {
    uint l = void;
    asm pure nothrow @nogc {
      mov EAX, 1;
      mov BL, 0xf8-1;
      sub BL, c;
      cmp BL, 0xf8-0xc0;
      adc EAX, 0;
      cmp BL, 0xf8-0xe0;
      adc EAX, 0;
      cmp BL, 0xf8-0xf0;
      adc EAX, 0;
      mov l, EAX;
    }
    s = s[l <= $ ? l : $ .. $];
  }
}