March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | Andrei Alexandrescu:
> -w
>
> Probably time to move that from a warning to an error. In the short time we've used D at Facebook (forgot to add "-w" to the implicit flags) we've already have not one, but two bugs related to switch's implicit fallthrough.
Lot of people in D.learn don't use warnings.
What I'd like is to have (informational) warning active on default, and to disable them on request with a compiler switch.
Bye,
bearophile
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Michel Fortin | On Monday, 24 March 2014 at 04:37:23 UTC, Michel Fortin wrote:
> dchar front(char[] s)
> {
> if (s[0] < 0b1000000)
> return s[0]; // ASCII
> auto indicator = (s[0] >> 5) & 0b11;
> auto tailLength = indicator ? indicator : 1;
>
> dchar result = s[0] & (0b00111111 >> tailLength);
> foreach (i; 0..tailLength)
> result = (result << 6) | (s[1+i] & 0b00111111);
> return result;
> }
>
> (Disclaimer: not tested, but I did check that all the expected code paths are present in the assembly this time.)
0b1000000 is missing a zero: 0b1000_0000
Fixing that, I still get a range violation from "s[1+i]".
----- Test program -----
void main()
{
foreach (ubyte b0; 0..0x80)
{
char[] s = [b0];
assert(front(s)==front2(s));
} writeln("Single byte done");
foreach (ubyte b0; 0..0x40)
foreach (ubyte b1; 0..0x20)
{
char[] s = [0xC0|b1, 0x80|b0];
assert(front(s)==front2(s));
} writeln("Double byte done");
foreach (ubyte b0; 0..0x40)
foreach (ubyte b1; 0..0x40)
foreach (ubyte b2; 0..0x10)
{
char[] s = [0xE0|b2, 0x80|b1, 0x80|b0];
assert(front(s)==front2(s));
} writeln("Triple byte done");
foreach (ubyte b0; 0..0x40)
foreach (ubyte b1; 0..0x40)
foreach (ubyte b2; 0..0x40)
foreach (ubyte b3; 0..0x08)
{
char[] s = [0xF0|b3, 0x80|b2, 0x80|b1, 0x80|b0];
assert(front(s)==front2(s));
}
}
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Michel Fortin | On 03/23/2014 09:44 PM, Michel Fortin wrote:
> On 2014-03-24 04:39:08 +0000, "bearophile" <bearophileHUGS@lycos.com> said:
>
>> Michel Fortin:
>>
>>> I really wish D had no implicit fallthrough.
>>
>> Isn't your wish already granted?
>
> Maybe. I don't have a D compiler installed at the moment to check. I'm
> just playing with d.godbolt.org and it accepts implicit fallthrough.
>
That compiler is very old. :)
import std.compiler;
pragma(msg, version_major);
pragma(msg, version_minor);
The output:
2u
55u
Ali
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Michel Fortin | On 3/23/14, 9:49 PM, Michel Fortin wrote: > > dchar front(char[] s) > { > if (s[0] < 0b1000000) > return s[0]; // ASCII > > // pattern indicator tailLength > // 0b1100xxxx 0b00 (0) 1 > // 0b1101xxxx 0b01 (1) 1 == indicator > // 0b1110xxxx 0b10 (2) 2 == indicator > // 0b1111xxxx 0b11 (3) 3 == indicator > // note: undefined result for illegal 0b11111xxx case > > auto indicator = (s[0] >> 4) & 0b11; > auto tailLength = indicator ? indicator : 1; > > dchar result = s[0] & (0b00111111 >> tailLength); > foreach (i; 0..tailLength) > result = (result << 6) | (s[1+i] & 0b00111111); > return result; > } http://goo.gl/y9EVFr Andrei |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Sunday, 23 March 2014 at 22:58:53 UTC, Andrei Alexandrescu wrote:
> On 3/23/14, 3:10 PM, Dmitry Olshansky wrote:
>> 24-Mar-2014 01:34, Andrei Alexandrescu пишет:
>>> On 3/23/14, 2:29 PM, Dmitry Olshansky wrote:
>>>> 24-Mar-2014 01:22, Andrei Alexandrescu пишет:
>>>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>>>
>>>> Assertions to check encoding?!
>>>> I thought it would detect broken encoding and do a substitution at
>>>> least.
>>>
>>> That implementation does zero effort to optimize checks themselves, and
>>> indeed puts them in asserts. I think there's value in having such a
>>> primitive down below.
>>
>> Just how much you are willing to assert? You don't even check length.
>
> Array bounds checking takes care of that.
Since strings are often user input, if there is chance that bad input can cause undefined behavior, then the checks must be done unconditionally.
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On 2014-03-24 05:09:15 +0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said: > On 3/23/14, 9:49 PM, Michel Fortin wrote: >> > http://goo.gl/y9EVFr > > Andrei As I said earlier, that front2 version is broken. It's missing a break. Adding the break makes it non-interesting from an instruction count point of view. Also, there's still an issue in my code above (found by safetyOff): the first "if" for ASCII should use 0b1000_0000, not 0b1000000 (missing one zero). Here's the corrected (again) version of front1: dchar front1(char[] s) { if (s[0] < 0b1000_0000) return s[0]; // ASCII // pattern indicator tailLength // 0b1100xxxx 0b00 (0) 1 // 0b1101xxxx 0b01 (1) 1 == indicator // 0b1110xxxx 0b10 (2) 2 == indicator // 0b1111xxxx 0b11 (3) 3 == indicator // note: undefined result for illegal 0b11111xxx case auto indicator = (s[0] >> 4) & 0b11; auto tailLength = indicator ? indicator : 1; dchar result = s[0] & (0b0011_1111 >> tailLength); foreach (i; 0..tailLength) result = (result << 6) | (s[1+i] & 0b0011_1111); return result; } And I'm going to suggest a variant of the above with one less branch (but three more instructions): look at how tailLenght is computed by or'ing with the negative of bit 2 of indicator. I suspect it'll be faster with non-ASCII input, unless it gets inlined less. dchar front2(char[] s) { if (s[0] < 0b1000_0000) return s[0]; // ASCII // pattern indicator tailLength // 0b1100xxxx 0b00 (0) 1 // 0b1101xxxx 0b01 (1) 1 // 0b1110xxxx 0b10 (2) 2 // 0b1111xxxx 0b11 (3) 3 // note: undefined result for illegal 0b11111xxx case auto indicator = ((s[0] >> 4) & 0b11); auto tailLength = indicator | ((~indicator >> 1) & 0b1); dchar result = s[0] & (0b0011_1111 >> tailLength); foreach (i; 0..tailLength) result = (result << 6) | (s[1+i] & 0b0011_1111); return result; } -- Michel Fortin michel.fortin@michelf.ca http://michelf.ca |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to safety0ff | On 2014-03-24 04:58:08 +0000, "safety0ff" <safety0ff.dev@gmail.com> said: > 0b1000000 is missing a zero: 0b1000_0000 Indeed, thanks. > Fixing that, I still get a range violation from "s[1+i]". That "auto indicator = (s[0] >> 5) & 0b11;" line is wrong too. s[0] needs a shift by 4, not by 5. No doubt it crashes your test program. -- Michel Fortin michel.fortin@michelf.ca http://michelf.ca |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Michel Fortin | On Monday, 24 March 2014 at 05:27:43 UTC, Michel Fortin wrote: > Here's the corrected (again) version of front1: > [Snip] > > And I'm going to suggest a variant of the above with one less branch (but three more instructions): > [Snip] Everything seems to work in your corrected versions: http://dpaste.dzfl.pl/3b32535559ba Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization. |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: > Here's a baseline: http://goo.gl/91vIGc. Destroy! > > Andrei http://goo.gl/TaZTNB |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei
Here's mine (the loop gets unrolled):
dchar front(const char[] s) {
assert(s.length > 0);
byte c = s[0];
dchar res = cast(ubyte)c;
if(c >= 0) {
return res;
}
c <<= 1;
assert(c < 0);
for(int i = 1; i < 4; i++) {
assert(i < s.length);
ubyte d = s[i];
assert((d >> 6) == 0b10);
res = (res << 8) | d;
c <<= 1;
if(c >= 0)
return res;
}
assert(false);
}
|
Copyright © 1999-2021 by the D Language Foundation