Challenge: write a really really small front() for UTF8

24-Mar-2014 01:22, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>

Assertions to check encoding?!
I thought it would detect broken encoding and do a substitution at least.

> Andrei

--
Dmitry Olshansky

On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
> 24-Mar-2014 01:22, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
> Assertions to check encoding?!
> I thought it would detect broken encoding and do a substitution at least.

That implementation does zero effort to optimize checks themselves, and indeed puts them in asserts. I think there's value in having such a primitive down below.

Andrei

24-Mar-2014 01:34, Andrei Alexandrescu wrote:
> On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
>> 24-Mar-2014 01:22, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>> Assertions to check encoding?!
>> I thought it would detect broken encoding and do a substitution at least.
>
> That implementation does zero effort to optimize checks themselves, and
> indeed puts them in asserts. I think there's value in having such a
> primitive down below.

Just how much you are willing to assert? You don't even check length.
In short - what are the specs of this primitive and where you see it being used.

>
> Andrei
>

--
Dmitry Olshansky

March 23, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by Andrei Alexandrescu
in reply to Dmitry Olshansky

On 3/23/14, 3:10 PM, Dmitry Olshansky wrote:
> 24-Mar-2014 01:34, Andrei Alexandrescu wrote:
>> On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
>>> 24-Mar-2014 01:22, Andrei Alexandrescu wrote:
>>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>>
>>> Assertions to check encoding?!
>>> I thought it would detect broken encoding and do a substitution at
>>> least.
>>
>> That implementation does zero effort to optimize checks themselves, and
>> indeed puts them in asserts. I think there's value in having such a
>> primitive down below.
>
> Just how much you are willing to assert? You don't even check length.

Array bounds checking takes care of that.

> In short - what are the specs of this primitive and where you see it
> being used.

A replacement for front() in arrays of char and wchar.


Andrei
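As a rough illustration of the bounds-checking point, here is a minimal sketch (not the linked baseline) of what happens when a truncated sequence is indexed without an explicit length check. The names `firstTwoBytes` and `truncatedIsCaught` are hypothetical, and the behavior shown assumes a default build; builds that disable bounds checking (for example with `-boundscheck=off`) would not get this protection.

```d
import core.exception : RangeError;

// Hypothetical helper: reads two bytes with no explicit length check,
// relying on the runtime's array bounds check instead.
uint firstTwoBytes(const(char)[] s)
{
    return (s[0] << 8) | s[1];
}

// Returns true if indexing past the end tripped the bounds check.
// Catching an Error is done here only to demonstrate the check.
bool truncatedIsCaught(const(char)[] s)
{
    try
    {
        auto v = firstTwoBytes(s);
        return false;
    }
    catch (RangeError e)
    {
        return true;
    }
}
```

With bounds checking enabled, a lone lead byte such as "\xC3" is caught at `s[1]`, while the complete sequence "\xC3\xA9" is read without error.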

dchar front(char[] s) {
  uint c = s[0];
  ubyte p = ~s[0];
  if (p>>7)
    return c;
  c = c<<8 | s[1];
  if (p>>5)
    return c;
  c = c<<8 | s[2];
  if (p>>4)
    return c;
  return c<<8 | s[3];
}

On 3/23/14, 4:28 PM, Anonymous wrote:
> dchar front(char[] s) {
>   uint c = s[0];
>   ubyte p = ~s[0];
>   if (p>>7)
>     return c;
>   c = c<<8 | s[1];
>   if (p>>5)
>     return c;
>   c = c<<8 | s[2];
>   if (p>>4)
>     return c;
>   return c<<8 | s[3];
> }

That's smaller but doesn't seem to do the same!

Andrei
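One way to see the difference: the quoted snippet picks the right sequence length from the lead byte, but it concatenates the raw bytes instead of extracting the payload bits. The sketch below contrasts the two approaches. It is not the linked baseline; both function names are hypothetical, explicit casts are added so the arithmetic is unambiguous, and neither function validates its input.

```d
// Mirrors the logic of the quoted snippet: correct length selection,
// but the raw bytes are glued together unmasked. Hypothetical name.
uint rawConcatFront(const(char)[] s)
{
    uint c = s[0];
    ubyte p = cast(ubyte) ~s[0];
    if (p >> 7) return c;          // 0xxxxxxx: 1 byte
    c = c << 8 | s[1];
    if (p >> 5) return c;          // 110xxxxx: 2 bytes
    c = c << 8 | s[2];
    if (p >> 4) return c;          // 1110xxxx: 3 bytes
    return c << 8 | s[3];          // 11110xxx: 4 bytes
}

// Decodes the code point by masking off the marker bits of each byte.
// Assumes well-formed UTF-8; no validation is performed. Hypothetical name.
dchar decodeFront(const(char)[] s)
{
    uint c = s[0];
    if (c < 0x80)
        return cast(dchar) c;
    if (c < 0xE0)
        return cast(dchar) (((c & 0x1F) << 6) | (s[1] & 0x3F));
    if (c < 0xF0)
        return cast(dchar) (((c & 0x0F) << 12) | ((s[1] & 0x3F) << 6)
                            | (s[2] & 0x3F));
    return cast(dchar) (((c & 0x07) << 18) | ((s[1] & 0x3F) << 12)
                        | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F));
}
```

For the two-byte sequence "\xC3\xA9" (U+00E9), `rawConcatFront` yields 0xC3A9 while `decodeFront` yields 0xE9. The two agree only for ASCII input.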

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei

This example only considers encodings of up to 4 bytes, but UTF-8 can encode code points in as many as 6 bytes. Is that not a concern?

Mike

On 3/23/2014 5:32 PM, Mike wrote:
> This example only considers encodings of up to 4 bytes, but UTF-8 can encode
> code points in as many as 6 bytes. Is that not a concern?

It's not anymore. The 5 and 6 byte encodings are now illegal.

On 2014-03-24 00:32, Mike wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> This example only considers encodings of up to 4 bytes, but UTF-8 can
> encode code points in as many as 6 bytes. Is that not a concern?
>
> Mike

RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to conform to constraints in UTF-16, removing all 5- and 6-byte sequences.

--
Simen
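A minimal sketch of what that restriction means for the lead byte, assuming a hypothetical helper named `sequenceLength` that returns 0 for bytes that cannot start a sequence. It inspects only the lead byte; continuation bytes, surrogates, and the exact upper limit within 0xF4 sequences are not checked here.

```d
// Hypothetical helper: length of a UTF-8 sequence given its lead byte,
// or 0 if the byte cannot begin a sequence under RFC 3629.
int sequenceLength(ubyte lead)
{
    if (lead < 0x80) return 1;  // 0xxxxxxx: ASCII
    if (lead < 0xC2) return 0;  // continuation bytes, plus 0xC0/0xC1 (overlong)
    if (lead < 0xE0) return 2;  // 110xxxxx
    if (lead < 0xF0) return 3;  // 1110xxxx
    if (lead < 0xF5) return 4;  // 0xF0..0xF4 covers up to U+10FFFF
    return 0;                   // 0xF5..0xFF, including the old 5- and 6-byte leads
}
```

Under this scheme the former 5-byte lead 0xF8 and 6-byte lead 0xFC are simply rejected, so a decoder never needs to read more than four bytes.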
