March 24, 2014
On Monday, 24 March 2014 at 08:06:53 UTC, dnspies wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> Here's mine (the loop gets unrolled):
>
> dchar front(const char[] s) {
>   assert(s.length > 0);
>   byte c = s[0];
>   dchar res = cast(ubyte)c;
>   if(c >= 0) {
>     return res;
>   }
>   c <<= 1;
>   assert(c < 0);
>   for(int i = 1; i < 4; i++) {
>     assert(i < s.length);
>     ubyte d = s[i];
>     assert((d >> 6) == 0b10);
>     res = (res << 8) | d;
>     c <<= 1;
>     if(c >= 0)
>       return res;
>   }
>   assert(false);
> }

Sorry, I misunderstood. We only want the x's (the payload bits) in the output. Here's my revised solution:

http://goo.gl/PL729J

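dchar front(const char[] s) {
  assert(s.length > 0);
  byte c = s[0];              // signed, so "high bit set" reads as c < 0
  dchar res = cast(ubyte)c;
  if(c >= 0)                  // 0xxxxxxx: plain ASCII
    return res;
  dchar cover = 0b0100_0000;  // grows into the mask that strips the marker bits
  c <<= 1;
  assert(c < 0);              // a lead byte must start with at least 11
  for(int i = 1; i < 4; i++) {
    assert(i < s.length);
    ubyte d = s[i];
    assert((d >> 6) == 0b10); // continuation bytes look like 10xxxxxx
    cover <<= 5;              // widens the mask to 11, 16, then 21 payload bits
    res = ((res << 6) & (cover - 1)) | (d & 0b0011_1111);
    c <<= 1;                  // next marker bit of the lead byte
    if(c >= 0)                // a 0 marker bit ends the sequence
      return res;
  }
  assert(false);
}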
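
As a sanity check of the masking, here is a hand-unrolled walk-through for the three-byte sequence E2 82 AC (the euro sign, U+20AC). This is only an illustrative sketch; the function name decodeEuroStepByStep is made up and is not part of the solution above.

```d
// Hypothetical walk-through, not part of the solution above:
// replays the loop by hand for the 3-byte sequence E2 82 AC (U+20AC).
dchar decodeEuroStepByStep()
{
    immutable ubyte[3] s = [0xE2, 0x82, 0xAC];

    dchar res = s[0];            // 1110_0010: lead byte, marker bits still present
    dchar cover = 0b0100_0000;

    // First continuation byte: keep 11 bits, which drops the upper marker bits.
    cover <<= 5;                 // 0x800
    res = ((res << 6) & (cover - 1)) | (s[1] & 0b0011_1111);
    assert(res == 0x082);

    // Second continuation byte: keep 16 bits.
    cover <<= 5;                 // 0x1_0000
    res = ((res << 6) & (cover - 1)) | (s[2] & 0b0011_1111);
    assert(res == 0x20AC);

    return res;
}
```

The intermediate value 0x082 shows that the 1110 marker of the lead byte is already gone after the first step, so no separate masking of the lead byte is needed.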
March 24, 2014
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei

Before we roll this out, could we discuss a strategy/guideline regarding detecting and handling invalid UTF sequences?

Having a fast "front" is fine and all, but if it means your program asserting in release (or worse, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.
March 24, 2014
On Monday, 24 March 2014 at 07:53:11 UTC, Chris Williams wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> http://goo.gl/TaZTNB

>  for (int i = 1; i < len; i++) {

size_t, ++i! :>

(optimized away but still, nitnit pickpick)
March 24, 2014
On 24.03.2014 10:02, monarch_dodra wrote:
> Having a fast "front" is fine and all, but if it means your
> program asserting in release (or worse, silently corrupting
> memory) just because the client was trying to read a bad text
> file, I'm unsure this is acceptable.

It would be great to have a "basic" form of this range whose error behavior could be extended via policies/templates. The Phobos version could extend the basic range with the "preferred" error behavior, while the basic range would still be able to read without checking, for example if I know that my input is 100% valid and I need the speed.


March 24, 2014
On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> Before we roll this out, could we discuss a strategy/guideline regarding detecting and handling invalid UTF sequences?
>
> Having a fast "front" is fine and all, but if it means your program asserting in release (or worse, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

As you'll note from the assembly, asserts are compiled out of release builds. If the goal of this little project is to create useful code for inclusion in a library, though, you would want to turn many of the asserts in the examples into "if (x) return 0xfffd;". (Mine needs an assert/if in the loop to check for the proper bit markers.)
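
A minimal sketch of what that switch could look like, assuming a hypothetical function named frontOrReplacement (not code from any of the linked examples). It checks the length and the 10xxxxxx markers, but it does not reject overlong encodings or surrogates, so treat it as a starting point only.

```d
// Sketch only: `frontOrReplacement` is a hypothetical name, not code from this thread.
// Each assert of the unchecked versions becomes a run-time check that yields U+FFFD.
dchar frontOrReplacement(const(char)[] s)
{
    enum dchar replacement = 0xFFFD;

    if (s.length == 0)
        return replacement;

    immutable ubyte lead = s[0];
    if (lead < 0x80)
        return lead;                       // ASCII fast path

    // Number of continuation bytes and payload bits of the lead byte.
    size_t extra;
    dchar res;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; res = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; res = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; res = lead & 0x07; }
    else
        return replacement;                // stray continuation byte or 0xF8..0xFF

    if (s.length <= extra)
        return replacement;                // truncated sequence

    foreach (i; 1 .. extra + 1)
    {
        immutable ubyte d = s[i];
        if ((d >> 6) != 0b10)
            return replacement;            // missing 10xxxxxx marker
        res = (res << 6) | (d & 0x3F);
    }
    return res;
}
```

Unlike the assert-based versions, these checks stay in release builds, which is where the extra instructions come from.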
March 24, 2014
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei

Ok, I managed to make it smaller.  I think this is the smallest so far with only 23 instructions (no loop unrolling in this one):

http://goo.gl/RKF5Vm
March 24, 2014
24-Mar-2014 04:44, Simen Kjærås wrote:
> On 2014-03-24 00:32, Mike wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> This example only considers encodings of up to 4 bytes, but UTF-8 can
>> encode code points in as many as 6 bytes.  Is that not a concern?
>>
>> Mike
>
> RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to
> conform to constraints in UTF-16, removing all 5- and 6-byte sequences.

More importantly, the Unicode standard explicitly fixed the range of code points to what is representable in UTF-16 (U+0000 to U+10FFFF), starting with the 5th version of the standard, if memory serves me right.
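
To make the limits concrete, here is a small sketch. The helper names isScalarValue and sequenceLength are hypothetical, chosen only for this example. Note that lead bytes 0xF5 to 0xF7 still match the 4-byte pattern, so a decoder would need the range check as well.

```d
// Sketch only: helper names are made up for illustration.

// Largest code point reachable through UTF-16 surrogate pairs.
enum dchar lastCodePoint = 0x10FFFF;

// True if `c` is a Unicode scalar value: within range and not a surrogate.
bool isScalarValue(dchar c)
{
    return c <= lastCodePoint && !(c >= 0xD800 && c <= 0xDFFF);
}

// Sequence length implied by a lead byte under RFC 3629, or 0 if the byte
// cannot start a sequence (continuation bytes and anything from 0xF8 up,
// which includes the old 5- and 6-byte lead bytes).
size_t sequenceLength(ubyte lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}
```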


-- 
Dmitry Olshansky
March 24, 2014
On Monday, 24 March 2014 at 11:48:00 UTC, Dmitry Olshansky wrote:
>> RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to
>> conform to constraints in UTF-16, removing all 5- and 6-byte sequences.
>
> More importantly, the Unicode standard explicitly fixed the range of code points to what is representable in UTF-16 (U+0000 to U+10FFFF), starting with the 5th version of the standard, if memory serves me right.

I did some hacks in C at work with _pext_u32. pext is an absolutely wonderful instruction, as is its counterpart pdep (both from BMI2).
http://software.intel.com/sites/landingpage/IntrinsicsGuide/

And it is ridiculously fast according to Agner (latency 3, throughput 1):
http://www.agner.org/optimize/instruction_tables.pdf

I think we should add this as an intrinsic to D as well (if it isn't there already, but I couldn't find it). It could do wonders for UTF decoding.

I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising.
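
To make the idea concrete, here is a rough sketch. Since no D intrinsic is assumed here, softPext below emulates pext in portable code, and decodeWithPext is a made-up helper that assumes valid input. With the real instruction, the loop in softPext would collapse into a single operation.

```d
// Sketch only: `softPext` emulates in portable code what the BMI2 pext
// instruction does; `decodeWithPext` is a hypothetical helper.

// Gathers the bits of `src` selected by `mask` into the low bits of the result.
uint softPext(uint src, uint mask)
{
    uint result = 0;
    uint outBit = 1;
    while (mask != 0)
    {
        immutable uint lowest = mask & (~mask + 1); // lowest set bit of the mask
        if (src & lowest)
            result |= outBit;
        outBit <<= 1;
        mask &= mask - 1;                           // clear that bit
    }
    return result;
}

// Decodes one valid UTF-8 sequence by packing its bytes big-endian into a
// word and extracting the payload bits with a per-length mask.
dchar decodeWithPext(const(char)[] s)
{
    // Payload masks indexed by sequence length (index 0 unused).
    static immutable uint[5] masks = [
        0,
        0x0000_007F,   // 0xxxxxxx
        0x0000_1F3F,   // 110xxxxx 10xxxxxx
        0x000F_3F3F,   // 1110xxxx 10xxxxxx 10xxxxxx
        0x073F_3F3F,   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    ];

    immutable ubyte lead = s[0];
    immutable size_t len = lead < 0x80 ? 1
                         : lead < 0xE0 ? 2
                         : lead < 0xF0 ? 3 : 4;
    assert(s.length >= len);

    uint packed = 0;
    foreach (i; 0 .. len)
        packed = (packed << 8) | cast(ubyte) s[i];

    return cast(dchar) softPext(packed, masks[len]);
}
```

The attraction is that the per-byte shift-and-mask loop turns into one table lookup for the mask plus one extraction.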
March 24, 2014
On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> Before we roll this out, could we discuss a strategy/guideline regarding detecting and handling invalid UTF sequences?
>
> Having a fast "front" is fine and all, but if it means your program asserting in release (or worse, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

I would strongly advise at least offering an option, possibly via a template parameter, for turning error handling on or off, similar to how Python handles decoding. Examples below in Python 3.

b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace") # replacement character used
b"\255".decode("utf-8", errors="ignore") # Empty string, invalid sequence removed.

All three strategies are useful from time to time. I mainly reach for option three when I'm trying to get some text data out of some old broken databases or similar.

We may consider leaving the error checking on in -release for the 'strict' decoding, but throwing an Error instead of an exception so the function can be nothrow. This would prevent memory corruption in release code. assert vs throw Error is up for debate.
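
A rough sketch of how such a template parameter could look. The names OnError, Utf8Error and decodeAll are hypothetical. It also skips a single byte after each error, which is simpler than what Python does and can yield more replacement characters for one broken sequence.

```d
// Sketch only: `OnError`, `Utf8Error` and `decodeAll` are hypothetical names.
enum OnError { strict, replace, ignore }

// An Error rather than an Exception, so throwing it is allowed in nothrow code.
class Utf8Error : Error
{
    this(string msg) pure nothrow @safe { super(msg); }
}

// Decodes a whole buffer; the policy is chosen at compile time.
dstring decodeAll(OnError policy)(const(char)[] s) nothrow
{
    dstring result;
    size_t i = 0;
    while (i < s.length)
    {
        immutable ubyte lead = s[i];
        // Sequence length from the lead byte, 0 if it cannot start a sequence.
        immutable size_t len = lead < 0x80 ? 1
                             : (lead & 0xE0) == 0xC0 ? 2
                             : (lead & 0xF0) == 0xE0 ? 3
                             : (lead & 0xF8) == 0xF0 ? 4 : 0;

        bool ok = len != 0 && i + len <= s.length;
        dchar c = lead;
        if (len > 1)
            c &= 0xFF >> (len + 1);        // strip the marker bits of the lead byte

        for (size_t k = 1; ok && k < len; ++k)
        {
            immutable ubyte d = s[i + k];
            ok = (d >> 6) == 0b10;         // continuation bytes look like 10xxxxxx
            c = (c << 6) | (d & 0x3F);
        }

        if (ok)
        {
            result ~= c;
            i += len;
            continue;
        }

        // Invalid input: apply the compile-time policy.
        static if (policy == OnError.strict)
            throw new Utf8Error("invalid UTF-8 sequence");
        else
        {
            static if (policy == OnError.replace)
                result ~= cast(dchar) 0xFFFD;
            ++i;                           // skip the offending byte
        }
    }
    return result;
}
```

Because the policy is a template argument, the "ignore" and "replace" instantiations contain no throw at all, and the "strict" one stays nothrow by throwing an Error.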
March 24, 2014
On 24.03.2014 13:51, w0rp wrote:
> On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu
>> wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline
>> regarding detecting and handling invalid UTF sequences?
>>
>> Having a fast "front" is fine and all, but if it means your
>> program asserting in release (or worse, silently corrupting
>> memory) just because the client was trying to read a bad text
>> file, I'm unsure this is acceptable.
>
> I would strongly advise at least offering an option, possibly via
> a template parameter, for turning error handling on or off,
> similar to how Python handles decoding. Examples below in Python
> 3.
>
> b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError
> b"\255".decode("utf-8", errors="replace") # replacement character
> used
> b"\255".decode("utf-8", errors="ignore") # Empty string, invalid
> sequence removed.
>
> All three strategies are useful from time to time. I mainly reach
> for option three when I'm trying to get some text data out of
> some old broken databases or similar.
>
> We may consider leaving the error checking on in -release for the
> 'strict' decoding, but throwing an Error instead of an exception
> so the function can be nothrow. This would prevent memory
> corruption in release code. assert vs throw Error is up for
> debate.
>

+1