Challenge: write a really really small front() for UTF8

On Monday, 24 March 2014 at 08:06:53 UTC, dnspies wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> Here's mine (the loop gets unrolled):
>
> dchar front(const char[] s) {
>     assert(s.length > 0);
>     byte c = s[0];
>     dchar res = cast(ubyte)c;
>     if(c >= 0) {
>         return res;
>     }
>     c <<= 1;
>     assert(c < 0);
>     for(int i = 1; i < 4; i++) {
>         assert(i < s.length);
>         ubyte d = s[i];
>         assert((d >> 6) == 0b10);
>         res = (res << 8) | d;
>         c <<= 1;
>         if(c >= 0)
>             return res;
>     }
>     assert(false);
> }

Sorry, I misunderstood. We only want the x's in the output. Here's my revised solution:
http://goo.gl/PL729J

    dchar front(const char[] s) {
        assert(s.length > 0);
        byte c = s[0];
        dchar res = cast(ubyte)c;
        if(c >= 0)
            return res;
        dchar cover = 0b0100_0000;
        c <<= 1;
        assert(c < 0);
        for(int i = 1; i < 4; i++) {
            assert(i < s.length);
            ubyte d = s[i];
            assert((d >> 6) == 0b10);
            cover <<= 5;
            res = ((res << 6) & (cover - 1)) | (d & 0b0011_1111);
            c <<= 1;
            if(c >= 0)
                return res;
        }
        assert(false);
    }
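For anyone checking candidate solutions against "only the x's": the sketch below is a deliberately plain reference decoder that keeps just the payload bits (the x's in 110xxxxx 10xxxxxx and friends) and drops the marker bits. It is not from the thread; `frontReference` is a hypothetical name, it assumes the input is already valid UTF-8, and it makes no attempt to be small or fast.

```d
// Hypothetical reference decoder (not from the thread). Assumes valid UTF-8.
dchar frontReference(const(char)[] s)
{
    assert(s.length > 0);
    immutable ubyte lead = s[0];
    if (lead < 0x80)
        return lead;                  // 0xxxxxxx: all 7 bits are payload

    // The lead byte tells us the sequence length: 2, 3 or 4 bytes.
    size_t len = 2;
    if (lead >= 0xE0) len = 3;
    if (lead >= 0xF0) len = 4;
    assert(s.length >= len);

    // Keep only the x bits of the lead byte: 5, 4 or 3 of them.
    uint res = lead & (0xFF >> (len + 1));
    foreach (i; 1 .. len)
    {
        immutable ubyte d = s[i];
        assert((d & 0xC0) == 0x80);   // continuation bytes look like 10xxxxxx
        res = (res << 6) | (d & 0x3F);
    }
    return cast(dchar) res;
}

unittest
{
    assert(frontReference("A") == 0x41);            // 1 byte
    assert(frontReference("\u00E9") == 0xE9);       // 2 bytes: C3 A9
    assert(frontReference("\u20AC") == 0x20AC);     // 3 bytes: E2 82 AC
    assert(frontReference("\U0001F600") == 0x1F600); // 4 bytes: F0 9F 98 80
}
```

Compiling with `-unittest -main` runs the checks; a size-optimized `front` can be compared against this function on the same inputs.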

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei

Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences?

Having a fast "front" is fine and all, but if it means your program asserting in release (or worse, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

On Monday, 24 March 2014 at 07:53:11 UTC, Chris Williams wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> http://goo.gl/TaZTNB

> for (int i = 1; i < len; i++) {

size_t, ++i! :> (optimized away but still, nitnit pickpick)

On 24.03.2014 10:02, monarch_dodra wrote:
> Having a fast "front" is fine and all, but if it means your
> program asserting in release (or worse, silently corrupting
> memory) just because the client was trying to read a bad text
> file, I'm unsure this is acceptable.

It would be great to have a "basic" form of this range whose error behavior could be extended via policies / templates. The Phobos version could extend the basic range with the "preferred" error behavior, while the basic range would still be able to read without checking, for example if I know that my input is 100% valid and I need the speed.

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei

Ok, I managed to make it smaller. I think this is the smallest so far with only 23 instructions (no loop unrolling in this one):
http://goo.gl/RKF5Vm

24-Mar-2014 04:44, Simen Kjærås wrote:
> On 2014-03-24 00:32, Mike wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> This example only considers encodings of up to 4 bytes, but UTF-8 can
>> encode code points in as many as 6 bytes. Is that not a concern?
>>
>> Mike
>
> RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to
> conform to constraints in UTF-16, removing all 5- and 6-byte sequences.

More importantly, the Unicode standard explicitly fixed the range of code points to what is representable in UTF-16, starting with the 5th version of the standard if memory serves me right.

--
Dmitry Olshansky
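To make the 4-byte limit concrete, here is a small sketch of how a lead byte maps to a sequence length under RFC 3629. The helper name `sequenceLength` is hypothetical and not part of Phobos or of any solution posted in this thread; it only classifies the lead byte and does not validate the bytes that follow.

```d
// Hypothetical helper: length in bytes of the UTF-8 sequence introduced by
// `lead`, or 0 if `lead` cannot start a sequence under RFC 3629.
size_t sequenceLength(ubyte lead) pure nothrow @nogc @safe
{
    if (lead < 0x80) return 1;  // 0xxxxxxx: ASCII
    if (lead < 0xC2) return 0;  // 80..BF are continuation bytes, C0/C1 only form overlong encodings
    if (lead < 0xE0) return 2;  // 110xxxxx
    if (lead < 0xF0) return 3;  // 1110xxxx
    if (lead < 0xF5) return 4;  // 11110xxx, F0..F4 only
    return 0;                   // F5..FF: always beyond U+10FFFF, includes the old 5- and 6-byte leads
}

unittest
{
    assert(sequenceLength(0x41) == 1);
    assert(sequenceLength(0xC3) == 2);
    assert(sequenceLength(0xE2) == 3);
    assert(sequenceLength(0xF0) == 4);
    assert(sequenceLength(0x80) == 0);  // stray continuation byte
    assert(sequenceLength(0xF8) == 0);  // former 5-byte lead
    assert(sequenceLength(0xFC) == 0);  // former 6-byte lead
}
```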

On Monday, 24 March 2014 at 11:48:00 UTC, Dmitry Olshansky wrote:
>> RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to
>> conform to constraints in UTF-16, removing all 5- and 6-byte sequences.
>
> More importantly Unicode standard explicitly fixed the range of code points to that of representable in UTF-16. Starting with the 5th version of the standard if memory serves me right.

I did some hacks using C at work with _pext_u32. pext is an absolutely wonderful instruction, together with the corresponding pdep.
http://software.intel.com/sites/landingpage/IntrinsicsGuide/

And ridiculously fast according to Agner (latency 3, throughput 1):
http://www.agner.org/optimize/instruction_tables.pdf

I think we should add this as an intrinsic to D as well (if it isn't already there, but I couldn't find it)... it could do wonders for UTF decoding.

I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising.
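A rough sketch of how the pext idea could apply to UTF-8 decoding: pack the bytes of one sequence into an integer, then extract the payload bits with a per-length mask. Everything here is an assumption for illustration. `pextSoft` is a portable software stand-in for the hardware instruction (no D intrinsic is assumed to exist), `decodeWithPext` and the mask table are hypothetical names, and the input is assumed to be valid UTF-8.

```d
// Software emulation of pext: gathers the bits of `src` selected by `mask`
// into the low bits of the result, preserving their order.
uint pextSoft(uint src, uint mask) pure nothrow @nogc @safe
{
    uint result = 0;
    uint outBit = 1;
    while (mask != 0)
    {
        immutable uint lowest = mask & (~mask + 1); // lowest set bit of mask
        if (src & lowest)
            result |= outBit;
        outBit <<= 1;
        mask &= mask - 1;                           // clear that bit
    }
    return result;
}

// Hypothetical decoder built on top of it. Assumes valid UTF-8.
dchar decodeWithPext(const(char)[] s)
{
    // Payload-bit masks for sequences of 1..4 bytes packed big-endian.
    static immutable uint[5] payloadMask =
        [0, 0x7F, 0x1F3F, 0x0F3F3F, 0x073F3F3F];

    assert(s.length > 0);
    immutable ubyte lead = s[0];
    size_t len = 1;
    if (lead >= 0xC0) len = 2;
    if (lead >= 0xE0) len = 3;
    if (lead >= 0xF0) len = 4;
    assert(s.length >= len);

    uint packed = 0;
    foreach (i; 0 .. len)
        packed = (packed << 8) | s[i];
    return cast(dchar) pextSoft(packed, payloadMask[len]);
}

unittest
{
    assert(pextSoft(0xC3A9, 0x1F3F) == 0xE9);
    assert(decodeWithPext("A") == 0x41);
    assert(decodeWithPext("\u00E9") == 0xE9);
    assert(decodeWithPext("\u20AC") == 0x20AC);
    assert(decodeWithPext("\U0001F600") == 0x1F600);
}
```

With a real pext instruction the `while` loop in `pextSoft` would collapse into a single operation, which is where the potential speedup comes from; the length lookup and byte packing would still need to be done cheaply for this to pay off.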

On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences?
>
> Having a fast "front" is fine and all, but if it means your program asserting in release (or worse, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable.

I would strongly advise to at least offer an option, possibly via a template parameter, for turning error handling on or off, similar to how Python handles decoding. Examples below in Python 3.

b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace") # replacement character used
b"\255".decode("utf-8", errors="ignore") # Empty string, invalid sequence removed.

All three strategies are useful from time to time. I mainly reach for option three when I'm trying to get some text data out of some old broken databases or similar.

We may consider leaving the error checking on in -release for the 'strict' decoding, but throwing an Error instead of an exception so the function can be nothrow. This would prevent memory corruption in release code. assert vs throw Error is up for debate.
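One way the template-parameter idea could look in D, mirroring the three Python modes. This is only a sketch under stated assumptions: `OnError` and `decodeFront` are hypothetical names, not an existing or proposed Phobos API, and the validation is simplified (it checks the lead byte, truncation and continuation bytes, but not overlong forms or surrogates, and it resynchronizes one byte at a time).

```d
import std.utf : UTFException;

// Hypothetical policy enum, modelled on Python's errors= argument.
enum OnError { strict, replace, ignore }

// Decodes one code point from the front of `s` and advances `s` past it.
// Returns false when no code point could be produced because `s` ran out.
bool decodeFront(OnError policy)(ref const(char)[] s, out dchar result)
{
    while (s.length > 0)
    {
        immutable ubyte lead = s[0];
        size_t len = 0;                       // 0 means "invalid lead byte"
        if (lead < 0x80) len = 1;
        else if (lead >= 0xC2 && lead < 0xE0) len = 2;
        else if (lead >= 0xE0 && lead < 0xF0) len = 3;
        else if (lead >= 0xF0 && lead < 0xF5) len = 4;

        bool ok = len != 0 && len <= s.length;
        uint cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (size_t i = 1; ok && i < len; ++i)
        {
            ok = (s[i] & 0xC0) == 0x80;       // must be 10xxxxxx
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        if (ok)
        {
            result = cast(dchar) cp;
            s = s[len .. $];
            return true;
        }

        static if (policy == OnError.strict)
            throw new UTFException("invalid UTF-8 sequence");
        else
        {
            s = s[1 .. $];                    // skip the offending byte
            static if (policy == OnError.replace)
            {
                result = '\uFFFD';            // replacement character
                return true;
            }
            // OnError.ignore: loop around and try the next byte
        }
    }
    return false;
}

unittest
{
    import std.exception : assertThrown;

    immutable ubyte[] raw = [0xFF, 0x41];     // invalid byte, then 'A'
    dchar c;

    auto s = cast(const(char)[]) raw;
    assertThrown!UTFException(decodeFront!(OnError.strict)(s, c));

    s = cast(const(char)[]) raw;
    assert(decodeFront!(OnError.replace)(s, c) && c == 0xFFFD);
    assert(decodeFront!(OnError.replace)(s, c) && c == 'A');

    s = cast(const(char)[]) raw;
    assert(decodeFront!(OnError.ignore)(s, c) && c == 'A');
    assert(!decodeFront!(OnError.ignore)(s, c));
}
```

Because the policy is a compile-time parameter, the unused branches are compiled out of each instantiation. Swapping the exception for an Error, as suggested above for a nothrow strict mode, would only change the `throw` line.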

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by dennis luehring
in reply to w0rp

On 24.03.2014 13:51, w0rp wrote:
> On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu
>> wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline
>> in regards to detecting and handling invalid UTF sequences?
>>
>> Having a fast "front" is fine and all, but if it means your
>> program asserting in release (or worse, silently corrupting
>> memory) just because the client was trying to read a bad text
>> file, I'm unsure this is acceptable.
>
> I would strongly advise to at least offer an option, possibly via
> a template parameter, for turning error handling on or off,
> similar to how Python handles decoding. Examples below in Python
> 3.
>
> b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError
> b"\255".decode("utf-8", errors="replace") # replacement character
> used
> b"\255".decode("utf-8", errors="ignore") # Empty string, invalid
> sequence removed.
>
> All three strategies are useful from time to time. I mainly reach
> for option three when I'm trying to get some text data out of
> some old broken databases or similar.
>
> We may consider leaving the error checking on in -release for the
> 'strict' decoding, but throwing an Error instead of an exception
> so the function can be nothrow. This would prevent memory
> corruption in release code. assert vs throw Error is up for
> debate.
>

+1
