March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu wrote:
> On 3/24/14, 2:02 AM, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline in
>> regards to detecting and handling invalid UTF sequences?
>
> I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error.
Ignoring UTF errors is a lossy operation and has the potential problem of irreversible data loss. For example, consider a program which needs to process Windows-1251 files: it would need to read the file from disk, convert to UTF-8, process it, convert back to Windows-1251, and save it back to disk. If a bug causes the UTF-8 conversion step to be accidentally skipped, then all Unicode data in that file is lost.
I think UTF-8 decoding operations should, by default, throw on UTF-8 errors. Ignoring UTF-8 errors should only be done explicitly, with the programmer's consent.
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: > Here's a baseline: http://goo.gl/91vIGc. Destroy! > > Andrei On a bigendian machine with loose alignment requirements (1 byte), you can do this, which is down to 13 instructions on x86 (which is of course meaningless, what with it being the wrong endianess): uint front(char[] s) { if(!(s[0] & 0b1000_0000)) return s[0]; //handle ASCII assert(s[0] & 0b0100_0000); if(s[0] & 0b0010_0000) { if(s[0] & 0b0001_0000) { assert(s.length >=4 && !(s[0] & 0b1000) && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111 && s[3] <= 0b1011_1111); return *(cast(dchar*)(s.ptr)); } assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111); return *(cast(dchar*)(s.ptr)) >> 8; } assert(s.length >= 2 && s[1] <= 0b1011_1111); return *(cast(wchar*)(s.ptr)); } http://goo.gl/Kf6RZJ There may be architectures that can benefit from this. |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to w0rp | On 3/24/14, 5:51 AM, w0rp wrote:
> On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline in
>> regards to detecting and handling invalid UTF sequences?
>>
>> Having a fast "front" is fine and all, but if it means your program
>> asserting in release (or worst, silently corrupting memory) just
>> because the client was trying to read a bad text file, I'm unsure this
>> is acceptable.
>
> I would strongly advise to at least offer an option
Options are fine for functions etc. But front would need to find an all-around good compromise between speed and correctness.
Andrei
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to John Colvin | On Monday, 24 March 2014 at 16:41:02 UTC, John Colvin wrote: > On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: >> Here's a baseline: http://goo.gl/91vIGc. Destroy! >> >> Andrei > > On a bigendian machine with loose alignment requirements (1 byte), you can do this Same again, but for little-endian, 18 instructions: http://goo.gl/jlrweQ uint front(char[] s) { if(!(s[0] & 0b1000_0000)) return s[0]; //handle ASCII assert(s[0] & 0b0100_0000); if(s[0] & 0b0010_0000) { if(s[0] & 0b0001_0000) { assert(s.length >=4 && !(s[0] & 0b1000) && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111 && s[3] <= 0b1011_1111); return swapEndian(*(cast(dchar*)(s.ptr))); } assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111); return swapEndian(*(cast(dchar*)(s.ptr))) >> 8; } assert(s.length >= 2 && s[1] <= 0b1011_1111); return swapEndian(*(cast(wchar*)(s.ptr))); } |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Michel Fortin | It seems like mine wasn't being inlined because I had carelessly replaced char[] with const(char)[] in the function signature (I don't know why that should make a difference, but it does) Going back to my original version (with Andre's trick to avoid letting the loop unroll), I get dnspies_orig 159 - 16 labels = 143 instructions However, MFortin1 can be made to do better by taking successive slices instead of indexing (and commenting out the check, since it no longer helps optimize). MFortin1_slice 137 - 13 labels = 124 instructions http://goo.gl/JAgVK1 On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote: > On 2014-03-24 06:08:30 +0000, "safety0ff" <safety0ff.dev@gmail.com> said: > >> Everything seems to work in your corrected versions: >> http://dpaste.dzfl.pl/3b32535559ba >> >> Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization. > > Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations: > > http://goo.gl/QjnA2k > > implementation line count of main in assembly > andralex 285 - 9 labels = 276 instructions > MFortin1 135 - 13 labels = 122 instructions > MFortin2 150 - 14 labels = 136 instructions > safety0ff -- not inlined (too big?) -- > dnspies 161 - 15 labels = 146 instructions > CWilliams 233 - 25 labels = 208 instructions > > For compactness, my first implementation seems to be the winner, both with and without inlining. > > That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_0000]. So I added this line to my two implementations: > > if (1+tailLength >= s.length) return dchar.init; > > Now lets see what happens with those changes: > > http://goo.gl/XPCGYE > > implementation line count of main in assembly > MFortin1Check 103 - 11 labels = 92 instructions > MFortin2Check 135 - 13 labels = 122 instructions > > Now I'm baffled. Adding a test makes the code shorter? It actually make the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining. > > That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known: > > http://goo.gl/E2Q0Yu > > implementation line count of main in assembly > andralex 384 - 9 labels = 375 instructions > MFortin1 173 - 19 labels = 154 instructions > MFortin2 182 - 18 labels = 164 instructions > safety0ff -- not inlined -- > dnspies -- not inlined -- > CWilliams 229 - 23 labels = 206 instructions > MFortin1Check 211 - 24 labels = 187 instructions > MFortin2Check 218 - 21 labels = 197 instructions > > So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lay at the end of the string, so I'd favor MFortin1Check. |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to dnspies | Whoops, wrong "original" version. That was the one before I understood the game. Here mine is fixed (but up to 180 lines - 16 labels = 164 instructions): http://goo.gl/wjIAVm On Monday, 24 March 2014 at 17:14:11 UTC, dnspies wrote: > It seems like mine wasn't being inlined because I had carelessly replaced char[] with const(char)[] in the function signature (I don't know why that should make a difference, but it does) > > Going back to my original version (with Andre's trick to avoid letting the loop unroll), I get > dnspies_orig 159 - 16 labels = 143 instructions > > However, MFortin1 can be made to do better by taking successive slices instead of indexing (and commenting out the check, since it no longer helps optimize). > MFortin1_slice 137 - 13 labels = 124 instructions > > http://goo.gl/JAgVK1 > > On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote: >> On 2014-03-24 06:08:30 +0000, "safety0ff" <safety0ff.dev@gmail.com> said: >> >>> Everything seems to work in your corrected versions: >>> http://dpaste.dzfl.pl/3b32535559ba >>> >>> Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization. >> >> Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations: >> >> http://goo.gl/QjnA2k >> >> implementation line count of main in assembly >> andralex 285 - 9 labels = 276 instructions >> MFortin1 135 - 13 labels = 122 instructions >> MFortin2 150 - 14 labels = 136 instructions >> safety0ff -- not inlined (too big?) -- >> dnspies 161 - 15 labels = 146 instructions >> CWilliams 233 - 25 labels = 208 instructions >> >> For compactness, my first implementation seems to be the winner, both with and without inlining. >> >> That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_0000]. So I added this line to my two implementations: >> >> if (1+tailLength >= s.length) return dchar.init; >> >> Now lets see what happens with those changes: >> >> http://goo.gl/XPCGYE >> >> implementation line count of main in assembly >> MFortin1Check 103 - 11 labels = 92 instructions >> MFortin2Check 135 - 13 labels = 122 instructions >> >> Now I'm baffled. Adding a test makes the code shorter? It actually make the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining. >> >> That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known: >> >> http://goo.gl/E2Q0Yu >> >> implementation line count of main in assembly >> andralex 384 - 9 labels = 375 instructions >> MFortin1 173 - 19 labels = 154 instructions >> MFortin2 182 - 18 labels = 164 instructions >> safety0ff -- not inlined -- >> dnspies -- not inlined -- >> CWilliams 229 - 23 labels = 206 instructions >> MFortin1Check 211 - 24 labels = 187 instructions >> MFortin2Check 218 - 21 labels = 197 instructions >> >> So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lay at the end of the string, so I'd favor MFortin1Check. |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Monday, 24 March 2014 at 16:17:52 UTC, Andrei Alexandrescu wrote:
> On 3/24/14, 12:53 AM, Chris Williams wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> http://goo.gl/TaZTNB
>
> Nice! Why assert(ret != 0)?
>
> Andrei
It should be -1. I wanted to make sure that bsr() wasn't called on zero, but of course that's not the correct way to test it when I'm about to ~ it.
|
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | 24-Mar-2014 01:22, Andrei Alexandrescu пишет: > Here's a baseline: http://goo.gl/91vIGc. Destroy! > > Andrei I had to join the party at some point. This seems like 25 instructions: http://goo.gl/N7sHtK -- Dmitry Olshansky |
March 24, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu wrote:
> On 3/24/14, 2:02 AM, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline in
>> regards to detecting and handling invalid UTF sequences?
>
> I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error.
>
> Andrei
That sounds good to me (I think). Better than a straight up assert anyways...
...which is what the code you submitted does.
Also, I think relying on bounds checking is wrong: Disabling bounds checking mean you are OK for memory corruption for *wrong code*. I don't think invalid string is wrong code. 'front' should do an actual "if", and "assert(0)" when needed. This should have no impact on performance BTW, if done with pointer extraction ("slice.ptr[1]"), and in a trusted context.
Of course, this is all assuming we are asserting at all.
|
March 25, 2014 Re: Challenge: write a really really small front() for UTF8 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Daniel N | On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote: > I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising. I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread(for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window. #include <x86intrin.h> char32_t front(const char* inp) { static const uint32_t mask[32] = { 0b01111111'00000000'00000000'00000000, 0b00000000'00000000'00000000'00000000, 0b00011111'00111111'00000000'00000000, 0b00001111'00111111'00111111'00000000, 0b00000111'00111111'00111111'00111111, }; uint32_t rev = __builtin_bswap32(reinterpret_cast<const uint32_t*>(inp)[0]); uint32_t len = __lzcnt32(~rev); return _pext_u32(rev, mask[len]); } This is what clang 3.4 generated: ## BB#0: pushq %rbp Ltmp2: .cfi_def_cfa_offset 16 Ltmp3: .cfi_offset %rbp, -16 movq %rsp, %rbp Ltmp4: .cfi_def_cfa_register %rbp movl (%rdi), %eax bswapl %eax movl %eax, %ecx notl %ecx lzcntl %ecx, %ecx leaq __ZZ5frontPKcE4mask(%rip), %rdx pextl (%rdx,%rcx,4), %eax, %eax popq %rbp ret |
Copyright © 1999-2021 by the D Language Foundation