March 23, 2014 Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei |
March 23, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | 24-Mar-2014 01:22, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
Assertions to check encoding?!
I thought it would detect broken encoding and do a substitution at least.
> Andrei
--
Dmitry Olshansky
|
March 23, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
> 24-Mar-2014 01:22, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
> Assertions to check encoding?!
> I thought it would detect broken encoding and do a substitution at least.
That implementation does zero effort to optimize checks themselves, and indeed puts them in asserts. I think there's value in having such a primitive down below.
Andrei
|
March 23, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | 24-Mar-2014 01:34, Andrei Alexandrescu wrote:
> On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
>> 24-Mar-2014 01:22, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>> Assertions to check encoding?!
>> I thought it would detect broken encoding and do a substitution at least.
>
> That implementation does zero effort to optimize checks themselves, and
> indeed puts them in asserts. I think there's value in having such a
> primitive down below.
Just how much you are willing to assert? You don't even check length.
In short - what are the specs of this primitive and where you see it being used.
>
> Andrei
>
--
Dmitry Olshansky
|
March 23, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | On 3/23/14, 3:10 PM, Dmitry Olshansky wrote:
> 24-Mar-2014 01:34, Andrei Alexandrescu wrote:
>> On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
>>> 24-Mar-2014 01:22, Andrei Alexandrescu wrote:
>>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>>
>>> Assertions to check encoding?!
>>> I thought it would detect broken encoding and do a substitution at
>>> least.
>>
>> That implementation does zero effort to optimize checks themselves, and
>> indeed puts them in asserts. I think there's value in having such a
>> primitive down below.
>
> Just how much you are willing to assert? You don't even check length.
Array bounds checking takes care of that.
> In short - what are the specs of this primitive and where you see it
> being used.
A replacement for front() in arrays of char and wchar.
Andrei
|
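The baseline behind the link is not reproduced in the thread. As a rough illustration of the kind of primitive being discussed (validity checks written as asserts, length left to array bounds checking), a minimal sketch might look like the following. The name `frontSketch` and every detail of the body are assumptions made for illustration, not the linked code; it also skips the overlong and surrogate checks a full validator would need.

```d
// Hypothetical sketch, not the baseline from the thread.
// Decodes the first code point of a UTF-8 array. Encoding checks are
// asserts, so they disappear in builds compiled with asserts disabled.
dchar frontSketch(const(char)[] s)
{
    // Indexing is bounds-checked (when bounds checks are enabled), so an
    // empty or truncated input fails here rather than reading out of range.
    immutable uint c0 = s[0];

    if (c0 < 0x80)                       // 0xxxxxxx: one byte (ASCII)
        return cast(dchar) c0;

    if (c0 < 0xE0)                       // 110xxxxx: two bytes
    {
        assert(c0 >= 0xC2, "invalid lead byte");
        assert((s[1] & 0xC0) == 0x80, "invalid continuation byte");
        return cast(dchar)(((c0 & 0x1F) << 6) | (s[1] & 0x3F));
    }

    if (c0 < 0xF0)                       // 1110xxxx: three bytes
    {
        assert((s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80,
               "invalid continuation byte");
        return cast(dchar)(((c0 & 0x0F) << 12)
                         | ((s[1] & 0x3F) << 6)
                         |  (s[2] & 0x3F));
    }

    assert(c0 <= 0xF4, "invalid lead byte");   // 11110xxx: four bytes
    assert((s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80
        && (s[3] & 0xC0) == 0x80, "invalid continuation byte");
    return cast(dchar)(((c0 & 0x07) << 18)
                     | ((s[1] & 0x3F) << 12)
                     | ((s[2] & 0x3F) << 6)
                     |  (s[3] & 0x3F));
}

unittest
{
    assert(frontSketch("a") == 'a');
    assert(frontSketch("\u20AC") == 0x20AC);   // three-byte sequence
}
```

With asserts compiled out, malformed input is decoded into an arbitrary value instead of being rejected, which is the trade-off Dmitry is asking about.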
March 23, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu |
dchar front(char[] s) {
    uint c = s[0];
    ubyte p = ~s[0];
    if (p>>7)
        return c;
    c = c<<8 | s[1];
    if (p>>5)
        return c;
    c = c<<8 | s[2];
    if (p>>4)
        return c;
    return c<<8 | s[3];
}
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anonymous | On 3/23/14, 4:28 PM, Anonymous wrote:
> dchar front(char[] s) {
> uint c = s[0];
> ubyte p = ~s[0];
> if (p>>7)
> return c;
> c = c<<8 | s[1];
> if (p>>5)
> return c;
> c = c<<8 | s[2];
> if (p>>4)
> return c;
> return c<<8 | s[3];
> }
That's smaller but doesn't seem to do the same!
Andrei
|
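One way to see the difference Andrei points at: the shorter function picks the right sequence length from the lead byte, but it glues the raw bytes together instead of extracting the payload bits, so it only agrees with a real decoder on ASCII. The sketch below transcribes Anonymous' function under the hypothetical name `frontRaw`, with explicit casts added (an assumption, so that it compiles with current D compilers), and compares it against `std.utf.decode`.

```d
import std.utf : decode;

// Anonymous' function from the thread, renamed and with explicit casts
// added; the logic is otherwise unchanged.
dchar frontRaw(const(char)[] s)
{
    uint c = s[0];
    immutable ubyte p = cast(ubyte)(~cast(uint) s[0]);
    if (p >> 7)                  // lead byte 0xxxxxxx: one byte
        return cast(dchar) c;
    c = c << 8 | s[1];
    if (p >> 5)                  // lead byte 110xxxxx: two bytes
        return cast(dchar) c;
    c = c << 8 | s[2];
    if (p >> 4)                  // lead byte 1110xxxx: three bytes
        return cast(dchar) c;
    return cast(dchar)(c << 8 | s[3]);
}

unittest
{
    string s = "\u00E9";              // 'é', stored as the bytes C3 A9
    size_t i = 0;
    assert(decode(s, i) == 0xE9);     // the decoded code point
    assert(frontRaw(s) == 0xC3A9);    // the two raw bytes, concatenated
    assert(frontRaw("a") == 'a');     // ASCII input gives the same answer
}
```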
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei
This example only considers encodings of up to 4 bytes, but UTF-8 can encode code points in as many as 6 bytes. Is that not a concern?
Mike
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mike | On 3/23/2014 5:32 PM, Mike wrote:
> This example only considers encodings of up to 4 bytes, but UTF-8 can encode
> code points in as many as 6 bytes. Is that not a concern?
It's not anymore. The 5- and 6-byte encodings are now illegal.
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mike | On 2014-03-24 00:32, Mike wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> This example only considers encodings of up to 4 bytes, but UTF-8 can
> encode code points in as many as 6 bytes. Is that not a concern?
>
> Mike
RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to conform to constraints in UTF-16, removing all 5- and 6-byte sequences.
--
Simen
|
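In practice the restriction shows up in which lead bytes a decoder accepts. A minimal sketch, assuming a hypothetical helper named `sequenceLength` that returns 0 for any byte that cannot start a sequence under RFC 3629:

```d
// Hypothetical helper: length announced by a lead byte, or 0 if the byte
// cannot begin a valid UTF-8 sequence under RFC 3629.
size_t sequenceLength(ubyte lead)
{
    if (lead < 0x80)                  return 1;  // 0xxxxxxx
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx, minus overlong C0/C1
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx, up to U+10FFFF
    return 0;  // continuation bytes, C0/C1, and F5..FF
}

unittest
{
    assert(sequenceLength(0x41) == 1);
    assert(sequenceLength(0xF4) == 4);
    assert(sequenceLength(0xF8) == 0);  // once the start of a 5-byte form
    assert(sequenceLength(0xFC) == 0);  // once the start of a 6-byte form
}
```

The upper bound of 0xF4 on four-byte leads follows from the U+10FFFF ceiling that UTF-16 imposes, which is why a four-branch front() covers every legal sequence.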
Copyright © 1999-2021 by the D Language Foundation