Challenge: write a really really small front() for UTF8 (page 3)

Andrei Alexandrescu: > -w > > Probably time to move that from a warning to an error. In the short time we've used D at Facebook (forgot to add "-w" to the implicit flags) we've already have not one, but two bugs related to switch's implicit fallthrough. Lot of people in D.learn don't use warnings. What I'd like is to have (informational) warning active on default, and to disable them on request with a compiler switch. Bye, bearophile

On Monday, 24 March 2014 at 04:37:23 UTC, Michel Fortin wrote: > dchar front(char[] s) > { > if (s[0] < 0b1000000) > return s[0]; // ASCII > auto indicator = (s[0] >> 5) & 0b11; > auto tailLength = indicator ? indicator : 1; > > dchar result = s[0] & (0b00111111 >> tailLength); > foreach (i; 0..tailLength) > result = (result << 6) | (s[1+i] & 0b00111111); > return result; > } > > (Disclaimer: not tested, but I did check that all the expected code paths are present in the assembly this time.) 0b1000000 is missing a zero: 0b1000_0000 Fixing that, I still get a range violation from "s[1+i]". ----- Test program ----- void main() { foreach (ubyte b0; 0..0x80) { char[] s = [b0]; assert(front(s)==front2(s)); } writeln("Single byte done"); foreach (ubyte b0; 0..0x40) foreach (ubyte b1; 0..0x20) { char[] s = [0xC0|b1, 0x80|b0]; assert(front(s)==front2(s)); } writeln("Double byte done"); foreach (ubyte b0; 0..0x40) foreach (ubyte b1; 0..0x40) foreach (ubyte b2; 0..0x10) { char[] s = [0xE0|b2, 0x80|b1, 0x80|b0]; assert(front(s)==front2(s)); } writeln("Triple byte done"); foreach (ubyte b0; 0..0x40) foreach (ubyte b1; 0..0x40) foreach (ubyte b2; 0..0x40) foreach (ubyte b3; 0..0x08) { char[] s = [0xF0|b3, 0x80|b2, 0x80|b1, 0x80|b0]; assert(front(s)==front2(s)); } }

On 03/23/2014 09:44 PM, Michel Fortin wrote: > On 2014-03-24 04:39:08 +0000, "bearophile" <bearophileHUGS@lycos.com> said: > >> Michel Fortin: >> >>> I really wish D had no implicit fallthrough. >> >> Isn't your wish already granted? > > Maybe. I don't have a D compiler installed at the moment to check. I'm > just playing with d.godbolt.org and it accepts implicit fallthrough. > That compiler is very old. :) import std.compiler; pragma(msg, version_major); pragma(msg, version_minor); The output: 2u 55u Ali

On 3/23/14, 9:49 PM, Michel Fortin wrote: > > dchar front(char[] s) > { > if (s[0] < 0b1000000) > return s[0]; // ASCII > > // pattern indicator tailLength > // 0b1100xxxx 0b00 (0) 1 > // 0b1101xxxx 0b01 (1) 1 == indicator > // 0b1110xxxx 0b10 (2) 2 == indicator > // 0b1111xxxx 0b11 (3) 3 == indicator > // note: undefined result for illegal 0b11111xxx case > > auto indicator = (s[0] >> 4) & 0b11; > auto tailLength = indicator ? indicator : 1; > > dchar result = s[0] & (0b00111111 >> tailLength); > foreach (i; 0..tailLength) > result = (result << 6) | (s[1+i] & 0b00111111); > return result; > } http://goo.gl/y9EVFr Andrei

On Sunday, 23 March 2014 at 22:58:53 UTC, Andrei Alexandrescu wrote: > On 3/23/14, 3:10 PM, Dmitry Olshansky wrote: >> 24-Mar-2014 01:34, Andrei Alexandrescu пишет: >>> On 3/23/14, 2:29 PM, Dmitry Olshansky wrote: >>>> 24-Mar-2014 01:22, Andrei Alexandrescu пишет: >>>>> Here's a baseline: http://goo.gl/91vIGc. Destroy! >>>>> >>>> Assertions to check encoding?! >>>> I thought it would detect broken encoding and do a substitution at >>>> least. >>> >>> That implementation does zero effort to optimize checks themselves, and >>> indeed puts them in asserts. I think there's value in having such a >>> primitive down below. >> >> Just how much you are willing to assert? You don't even check length. > > Array bounds checking takes care of that. Since strings are often user input, if there is chance that bad input can cause undefined behavior, then the checks must be done unconditionally.

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by Michel Fortin
in reply to Andrei Alexandrescu

Permalink

Michel Fortin

Posted in reply to Andrei Alexandrescu

Permalink

On 2014-03-24 05:09:15 +0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 3/23/14, 9:49 PM, Michel Fortin wrote:
>> 
> http://goo.gl/y9EVFr
> 
> Andrei

As I said earlier, that front2 version is broken. It's missing a break. Adding the break makes it non-interesting from an instruction count point of view.

Also, there's still an issue in my code above (found by safetyOff): the first "if" for ASCII should use 0b1000_0000, not 0b1000000 (missing one zero). Here's the corrected (again) version of front1:

dchar front1(char[] s)
{
 if (s[0] < 0b1000_0000)
   return s[0]; // ASCII

 // pattern     indicator  tailLength
 // 0b1100xxxx  0b00 (0)   1
 // 0b1101xxxx  0b01 (1)   1 == indicator
 // 0b1110xxxx  0b10 (2)   2 == indicator
 // 0b1111xxxx  0b11 (3)   3 == indicator
 // note: undefined result for illegal 0b11111xxx case

 auto indicator = (s[0] >> 4) & 0b11;
 auto tailLength = indicator ? indicator : 1;

 dchar result = s[0] & (0b0011_1111 >> tailLength);
 foreach (i; 0..tailLength)
     result = (result << 6) | (s[1+i] & 0b0011_1111);
 return result;
}

And I'm going to suggest a variant of the above with one less branch (but three more instructions): look at how tailLenght is computed by or'ing with the negative of bit 2 of indicator. I suspect it'll be faster with non-ASCII input, unless it gets inlined less.

dchar front2(char[] s)
{
 if (s[0] < 0b1000_0000)
   return s[0]; // ASCII

 // pattern     indicator  tailLength
 // 0b1100xxxx  0b00 (0)   1
 // 0b1101xxxx  0b01 (1)   1
 // 0b1110xxxx  0b10 (2)   2
 // 0b1111xxxx  0b11 (3)   3
 // note: undefined result for illegal 0b11111xxx case

 auto indicator = ((s[0] >> 4) & 0b11);
 auto tailLength = indicator | ((~indicator >> 1) & 0b1);

 dchar result = s[0] & (0b0011_1111 >> tailLength);
 foreach (i; 0..tailLength)
     result = (result << 6) | (s[1+i] & 0b0011_1111);
 return result;
}

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

On 2014-03-24 04:58:08 +0000, "safety0ff" <safety0ff.dev@gmail.com> said: > 0b1000000 is missing a zero: 0b1000_0000 Indeed, thanks. > Fixing that, I still get a range violation from "s[1+i]". That "auto indicator = (s[0] >> 5) & 0b11;" line is wrong too. s[0] needs a shift by 4, not by 5. No doubt it crashes your test program. -- Michel Fortin michel.fortin@michelf.ca http://michelf.ca

On Monday, 24 March 2014 at 05:27:43 UTC, Michel Fortin wrote: > Here's the corrected (again) version of front1: > [Snip] > > And I'm going to suggest a variant of the above with one less branch (but three more instructions): > [Snip] Everything seems to work in your corrected versions: http://dpaste.dzfl.pl/3b32535559ba Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization.

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: > Here's a baseline: http://goo.gl/91vIGc. Destroy! > > Andrei Here's mine (the loop gets unrolled): dchar front(const char[] s) { assert(s.length > 0); byte c = s[0]; dchar res = cast(ubyte)c; if(c >= 0) { return res; } c <<= 1; assert(c < 0); for(int i = 1; i < 4; i++) { assert(i < s.length); ubyte d = s[i]; assert((d >> 6) == 0b10); res = (res << 8) | d; c <<= 1; if(c >= 0) return res; } assert(false); }

Forums