Challenge: write a really really small front() for UTF8

March 25, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by dennis luehring
in reply to Andrei Alexandrescu

On 24.03.2014 17:44, Andrei Alexandrescu wrote:
> On 3/24/14, 5:51 AM, w0rp wrote:
>> On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
>>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>>
>>>> Andrei
>>>
>>> Before we roll this out, could we discuss a strategy/guideline in
>>> regards to detecting and handling invalid UTF sequences?
>>>
>>> Having a fast "front" is fine and all, but if it means your program
>>> asserting in release (or worse, silently corrupting memory) just
>>> because the client was trying to read a bad text file, I'm unsure this
>>> is acceptable.
>>
>> I would strongly advise to at least offer an option
>
> Options are fine for functions etc. But front would need to find an
> all-around good compromise between speed and correctness.
>
> Andrei
>

b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace") # replacement character used
b"\255".decode("utf-8", errors="ignore") # Empty string, invalid sequence removed.

I think there should be a base range for UTF-8 iteration with policy-based error handling (like in Python), plus some variants that derive from this base UTF-8 range with different error behavior. One of these becomes the Phobos standard, i.e. the default parameter, so it's still switchable.
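A minimal sketch of the policy idea from this post, written in Python because the decode examples quoted above already use it. The names `_step` and `code_points` are hypothetical, not Phobos or Python APIs. One base decoding step is shared, and the `errors` argument picks the behavior on a bad sequence, with `"strict"` as the switchable default. It resynchronizes one byte at a time, so the number of replacement characters it emits may differ from Python's built-in codec.

```python
def _step(data: bytes, i: int):
    """Base step: return (code_point, length) at offset i, or (None, 1) if invalid."""
    lead = data[i]
    if lead < 0x80:
        return lead, 1
    if 0xC2 <= lead <= 0xDF:
        need, cp, lowest = 1, lead & 0x1F, 0x80
    elif 0xE0 <= lead <= 0xEF:
        need, cp, lowest = 2, lead & 0x0F, 0x800
    elif 0xF0 <= lead <= 0xF4:
        need, cp, lowest = 3, lead & 0x07, 0x10000
    else:
        return None, 1  # stray continuation byte or invalid lead byte
    tail = data[i + 1 : i + 1 + need]
    if len(tail) < need or any((b & 0xC0) != 0x80 for b in tail):
        return None, 1  # truncated sequence or bad continuation byte
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return None, 1  # overlong form, out of range, or surrogate
    return cp, need + 1


def code_points(data: bytes, errors: str = "strict"):
    """Yield decoded characters; `errors` is the policy (assumed names)."""
    if errors not in ("strict", "replace", "ignore"):
        raise LookupError(f"unknown error policy: {errors}")
    i = 0
    while i < len(data):
        cp, n = _step(data, i)
        if cp is not None:
            yield chr(cp)
        elif errors == "strict":
            raise ValueError(f"invalid UTF-8 at byte offset {i}")
        elif errors == "replace":
            yield "\ufffd"
        # errors == "ignore": drop the byte silently
        i += n
```

Usage: `"".join(code_points(raw, "replace"))` keeps going on bad input, while the default raises on the first invalid byte.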

On 25 March 2014 00:04, Daniel N <ufo@orbiting.us> wrote:
> On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
>>
>> I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising.
>
> I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction...
>
> 1) I know my solution is not directly comparable to the rest in this thread (for many reasons).
> 2) It's of course trivial to add a fast path for ascii... if desired.
> 3) It throws safety and standards out the window.

4) It's tied to one piece of hardware.

No Thankee.

On 3/25/2014 4:00 AM, Iain Buclaw wrote:
> On 25 March 2014 00:04, Daniel N <ufo@orbiting.us> wrote:
>> On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
>>>
>>> I'm currently too busy to submit a complete solution, but please feel free
>>> to use my idea if you think it sounds promising.
>>
>> I now managed to dig up my old C source... but I'm still blocked by dmd not
>> accepting the 'pext' instruction...
>>
>> 1) I know my solution is not directly comparable to the rest in this
>> thread (for many reasons).
>> 2) It's of course trivial to add a fast path for ascii... if desired.
>> 3) It throws safety and standards out the window.
>
> 4) It's tied to one piece of hardware.
>
> No Thankee.

bool supportCpuFeatureX;

void main() {
    supportCpuFeatureX = detectCpuFeatureX();
    doStuff();
}

void doStuff() {
    if(supportCpuFeatureX)
        doStuff_FeatureX();
    else
        doStuff_Fallback();
}

> dmd -inline blah.d

On 25.03.2014 11:38, Nick Sabalausky wrote:
> On 3/25/2014 4:00 AM, Iain Buclaw wrote:
>> On 25 March 2014 00:04, Daniel N <ufo@orbiting.us> wrote:
>>> On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
>>>>
>>>> I'm currently too busy to submit a complete solution, but please feel free
>>>> to use my idea if you think it sounds promising.
>>>
>>> I now managed to dig up my old C source... but I'm still blocked by dmd not
>>> accepting the 'pext' instruction...
>>>
>>> 1) I know my solution is not directly comparable to the rest in this
>>> thread (for many reasons).
>>> 2) It's of course trivial to add a fast path for ascii... if desired.
>>> 3) It throws safety and standards out the window.
>>
>> 4) It's tied to one piece of hardware.
>>
>> No Thankee.
>
> void doStuff() {
>     if(supportCpuFeatureX)
>         doStuff_FeatureX();
>     else
>         doStuff_Fallback();
> }
>
> > dmd -inline blah.d

the extra branch could kill the performance benefit if doStuff is too small

On Tuesday, 25 March 2014 at 10:42:59 UTC, dennis luehring wrote:
>> void doStuff() {
>>     if(supportCpuFeatureX)
>>         doStuff_FeatureX();
>>     else
>>         doStuff_Fallback();
>> }
>>
>> > dmd -inline blah.d
>
> the extra branch could kill the performance benefit if doStuff is too small

you'd simply have to hoist the condition outside the inner loop. Furthermore the branch prediction would never fail, only unpredictable branches are terrible.
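A rough illustration of what "hoist the condition outside the inner loop" could look like. It is in Python only to stay consistent with the decode examples earlier in the thread, so it shows the shape of the transformation and says nothing about actual cost. The two `do_stuff_*` bodies are placeholders, and `process_branch_inside` / `process_hoisted` are hypothetical names.

```python
def do_stuff_feature_x(x):
    # Placeholder for the hardware-specific path (assumption).
    return ("feature-x", x)


def do_stuff_fallback(x):
    # Placeholder for the portable path (assumption).
    return ("fallback", x)


def process_branch_inside(items, support_cpu_feature_x):
    """The flag is tested once per element, inside the loop."""
    out = []
    for x in items:
        if support_cpu_feature_x:
            out.append(do_stuff_feature_x(x))
        else:
            out.append(do_stuff_fallback(x))
    return out


def process_hoisted(items, support_cpu_feature_x):
    """The flag is tested once, then a branch-free loop runs."""
    if support_cpu_feature_x:
        return [do_stuff_feature_x(x) for x in items]
    return [do_stuff_fallback(x) for x in items]
```

Both versions produce the same results; the hoisted one just pays for the decision once per call instead of once per element.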

dchar front(char[] s) {
    dchar c = s[0];
    if (!(c & 0x80))
        return c;
    byte b = (c >> 4) & 3;
    b += !b;
    c &= 63 >> b;
    char *p = s.ptr;
    do {
        p++;
        c = c << 6 | *p & 63;
    } while(--b);
    return c;
}
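To check the bit arithmetic in the snippet above, here is a line-by-line transliteration into Python (a sketch for verification, not a proposed implementation; `front` here returns the code point as an `int`). For a lead byte, `(c >> 4) & 3` gives 0 or 1 for two-byte sequences, 2 for three-byte and 3 for four-byte ones; bumping 0 to 1 turns that into the number of continuation bytes, and `63 >> b` is the mask for the payload bits of the lead byte.

```python
def front(s: bytes) -> int:
    """Decode the first code point of s, mirroring the D snippet. No validation."""
    c = s[0]
    if not (c & 0x80):
        return c                    # ASCII fast path
    b = (c >> 4) & 3                # 0 or 1 -> 2-byte, 2 -> 3-byte, 3 -> 4-byte
    b += int(b == 0)                # same as `b += !b`: continuation byte count
    c &= 63 >> b                    # keep 5, 4 or 3 payload bits of the lead byte
    p = 0
    while True:                     # do { ... } while (--b);
        p += 1
        c = (c << 6) | (s[p] & 63)  # append 6 payload bits per continuation byte
        b -= 1
        if b == 0:
            break
    return c
```

For example, `front("€".encode("utf-8"))` yields `0x20AC`. Like the original, this accepts stray continuation bytes and overlong forms as lead bytes. On truncated input Python raises `IndexError`, whereas the D version goes through a raw pointer and would presumably read past the end of the slice.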

On 24-Mar-2014 23:53, Dmitry Olshansky wrote:
> On 24-Mar-2014 01:22, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> I had to join the party at some point.
> This seems like 25 instructions:
> http://goo.gl/N7sHtK

Interestingly gdc-4.8 produces better results.
http://goo.gl/1R7GMs

--
Dmitry Olshansky

On 2014-03-25 11:42, dennis luehring wrote:
> On 25.03.2014 11:38, Nick Sabalausky wrote:
>> On 3/25/2014 4:00 AM, Iain Buclaw wrote:
>>> On 25 March 2014 00:04, Daniel N <ufo@orbiting.us> wrote:
>>>> On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
>>>>>
>>>>> I'm currently too busy to submit a complete solution, but please
>>>>> feel free
>>>>> to use my idea if you think it sounds promising.
>>>>
>>>> I now managed to dig up my old C source... but I'm still blocked by
>>>> dmd not
>>>> accepting the 'pext' instruction...
>>>>
>>>> 1) I know my solution is not directly comparable to the rest in this
>>>> thread (for many reasons).
>>>> 2) It's of course trivial to add a fast path for ascii... if desired.
>>>> 3) It throws safety and standards out the window.
>>>
>>> 4) It's tied to one piece of hardware.
>>>
>>> No Thankee.
>>
>> void doStuff() {
>>     if(supportCpuFeatureX)
>>         doStuff_FeatureX();
>>     else
>>         doStuff_Fallback();
>> }
>>
>> > dmd -inline blah.d
>
> the extra branch could kill the performance benefit if doStuff is too small

void function() doStuff;

void main() {
    auto supportCpuFeatureX = detectCpuFeatureX();

    if (supportCpuFeatureX)
        doStuff = &doStuff_FeatureX;
    else
        doStuff = &doStuff_Fallback;
}
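The same idea, binding the implementation once at startup so that callers never test the flag, can be sketched in Python as follows. This mirrors the structure only: `detect_cpu_feature_x` is a hypothetical probe that always reports "unsupported" here, and the two implementations are placeholders.

```python
def detect_cpu_feature_x() -> bool:
    # Hypothetical probe (assumption); a real one would query the CPU.
    return False


def do_stuff_feature_x() -> str:
    return "feature-x"   # placeholder for the hardware-specific path


def do_stuff_fallback() -> str:
    return "fallback"    # placeholder for the portable path


def select_do_stuff(supported: bool):
    """Pick the implementation once; callers then call it directly."""
    return do_stuff_feature_x if supported else do_stuff_fallback


# Bound once at startup, like assigning the function pointer in main().
do_stuff = select_do_stuff(detect_cpu_feature_x())
```

After this, every `do_stuff()` call goes straight to the selected function. The trade-off raised earlier in the thread still applies in compiled code: an indirect call through a pointer generally cannot be inlined the way a direct call can.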
