Challenge: write a really really small front() for UTF8 (page 6) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » Challenge: write a really really small front() for UTF8 (page 6)

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by Vladimir Panteleev
in reply to Andrei Alexandrescu

Vladimir Panteleev

Posted in reply to Andrei Alexandrescu

On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu wrote:
> On 3/24/14, 2:02 AM, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline in
>> regards to detecting and handling invalid UTF sequences?
>
> I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error.

Ignoring UTF errors is a lossy operation and has the potential problem of irreversible data loss. For example, consider a program which needs to process Windows-1251 files: it would need to read the file from disk, convert to UTF-8, process it, convert back to Windows-1251, and save it back to disk. If a bug causes the UTF-8 conversion step to be accidentally skipped, then all Unicode data in that file is lost.

I think UTF-8 decoding operations should, by default, throw on UTF-8 errors. Ignoring UTF-8 errors should only be done explicitly, with the programmer's consent.

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by John Colvin
in reply to Andrei Alexandrescu

John Colvin

Posted in reply to Andrei Alexandrescu

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei

On a bigendian machine with loose alignment requirements (1 byte), you can do this, which is down to 13 instructions on x86 (which is of course meaningless, what with it being the wrong endianess):

uint front(char[] s) {
  if(!(s[0] & 0b1000_0000)) return s[0]; //handle ASCII
  assert(s[0] & 0b0100_0000);
	
  if(s[0] & 0b0010_0000)
  {
    if(s[0] & 0b0001_0000)
    {
      assert(s.length >=4 && !(s[0] & 0b1000)
             && s[1] <= 0b1011_1111
             && s[2] <= 0b1011_1111
             && s[3] <= 0b1011_1111);
      return *(cast(dchar*)(s.ptr));
    }
    assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111);
    return *(cast(dchar*)(s.ptr)) >> 8;
  }
	
  assert(s.length >= 2 && s[1] <= 0b1011_1111);
  return *(cast(wchar*)(s.ptr));
}

http://goo.gl/Kf6RZJ


There may be architectures that can benefit from this.

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by Andrei Alexandrescu
in reply to w0rp

Andrei Alexandrescu

Posted in reply to w0rp

On 3/24/14, 5:51 AM, w0rp wrote:
> On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline in
>> regards to detecting and handling invalid UTF sequences?
>>
>> Having a fast "front" is fine and all, but if it means your program
>> asserting in release (or worst, silently corrupting memory) just
>> because the client was trying to read a bad text file, I'm unsure this
>> is acceptable.
>
> I would strongly advise to at least offer an option

Options are fine for functions etc. But front would need to find an all-around good compromise between speed and correctness.

Andrei

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by John Colvin
in reply to John Colvin

John Colvin

Posted in reply to John Colvin

On Monday, 24 March 2014 at 16:41:02 UTC, John Colvin wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> On a bigendian machine with loose alignment requirements (1 byte), you can do this

Same again, but for little-endian, 18 instructions:

http://goo.gl/jlrweQ

uint front(char[] s)
{
  if(!(s[0] & 0b1000_0000)) return s[0]; //handle ASCII
  assert(s[0] & 0b0100_0000);
	
  if(s[0] & 0b0010_0000)
  {
    if(s[0] & 0b0001_0000)
    {
      assert(s.length >=4 && !(s[0] & 0b1000)
             && s[1] <= 0b1011_1111
             && s[2] <= 0b1011_1111
             && s[3] <= 0b1011_1111);
      return swapEndian(*(cast(dchar*)(s.ptr)));
    }
    assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111);
    return swapEndian(*(cast(dchar*)(s.ptr))) >> 8;
  }
	
  assert(s.length >= 2 && s[1] <= 0b1011_1111);
  return swapEndian(*(cast(wchar*)(s.ptr)));
}

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by dnspies
in reply to Michel Fortin

dnspies

Posted in reply to Michel Fortin

It seems like mine wasn't being inlined because I had carelessly replaced char[] with const(char)[] in the function signature (I don't know why that should make a difference, but it does)

Going back to my original version (with Andre's trick to avoid letting the loop unroll), I get
dnspies_orig   159 - 16 labels = 143 instructions

However, MFortin1 can be made to do better by taking successive slices instead of indexing (and commenting out the check, since it no longer helps optimize).
MFortin1_slice 137 - 13 labels = 124 instructions

http://goo.gl/JAgVK1

On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote:
> On 2014-03-24 06:08:30 +0000, "safety0ff" <safety0ff.dev@gmail.com> said:
>
>> Everything seems to work in your corrected versions:
>> http://dpaste.dzfl.pl/3b32535559ba
>> 
>> Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization.
>
> Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations:
>
> http://goo.gl/QjnA2k
>
> implementation   line count of main in assembly
> andralex         285 -  9 labels = 276 instructions
> MFortin1         135 - 13 labels = 122 instructions
> MFortin2         150 - 14 labels = 136 instructions
> safety0ff         -- not inlined (too big?) --
> dnspies          161 - 15 labels = 146 instructions
> CWilliams        233 - 25 labels = 208 instructions
>
> For compactness, my first implementation seems to be the winner, both with and without inlining.
>
> That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_0000]. So I added this line to my two implementations:
>
>    if (1+tailLength >= s.length) return dchar.init;
>
> Now lets see what happens with those changes:
>
> http://goo.gl/XPCGYE
>
> implementation   line count of main in assembly
> MFortin1Check    103 - 11 labels =  92 instructions
> MFortin2Check    135 - 13 labels = 122 instructions
>
> Now I'm baffled. Adding a test makes the code shorter? It actually make the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining.
>
> That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known:
>
> http://goo.gl/E2Q0Yu
>
> implementation   line count of main in assembly
> andralex         384 -  9 labels = 375 instructions
> MFortin1         173 - 19 labels = 154 instructions
> MFortin2         182 - 18 labels = 164 instructions
> safety0ff         -- not inlined --
> dnspies           -- not inlined --
> CWilliams        229 - 23 labels = 206 instructions
> MFortin1Check    211 - 24 labels = 187 instructions
> MFortin2Check    218 - 21 labels = 197 instructions
>
> So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lay at the end of the string, so I'd favor MFortin1Check.

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by dnspies
in reply to dnspies

dnspies

Posted in reply to dnspies

Whoops, wrong "original" version.  That was the one before I understood the game. Here mine is fixed (but up to 180 lines - 16 labels = 164 instructions):

http://goo.gl/wjIAVm

On Monday, 24 March 2014 at 17:14:11 UTC, dnspies wrote:
> It seems like mine wasn't being inlined because I had carelessly replaced char[] with const(char)[] in the function signature (I don't know why that should make a difference, but it does)
>
> Going back to my original version (with Andre's trick to avoid letting the loop unroll), I get
> dnspies_orig   159 - 16 labels = 143 instructions
>
> However, MFortin1 can be made to do better by taking successive slices instead of indexing (and commenting out the check, since it no longer helps optimize).
> MFortin1_slice 137 - 13 labels = 124 instructions
>
> http://goo.gl/JAgVK1
>
> On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote:
>> On 2014-03-24 06:08:30 +0000, "safety0ff" <safety0ff.dev@gmail.com> said:
>>
>>> Everything seems to work in your corrected versions:
>>> http://dpaste.dzfl.pl/3b32535559ba
>>> 
>>> Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization.
>>
>> Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations:
>>
>> http://goo.gl/QjnA2k
>>
>> implementation   line count of main in assembly
>> andralex         285 -  9 labels = 276 instructions
>> MFortin1         135 - 13 labels = 122 instructions
>> MFortin2         150 - 14 labels = 136 instructions
>> safety0ff         -- not inlined (too big?) --
>> dnspies          161 - 15 labels = 146 instructions
>> CWilliams        233 - 25 labels = 208 instructions
>>
>> For compactness, my first implementation seems to be the winner, both with and without inlining.
>>
>> That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_0000]. So I added this line to my two implementations:
>>
>>   if (1+tailLength >= s.length) return dchar.init;
>>
>> Now lets see what happens with those changes:
>>
>> http://goo.gl/XPCGYE
>>
>> implementation   line count of main in assembly
>> MFortin1Check    103 - 11 labels =  92 instructions
>> MFortin2Check    135 - 13 labels = 122 instructions
>>
>> Now I'm baffled. Adding a test makes the code shorter? It actually make the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining.
>>
>> That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known:
>>
>> http://goo.gl/E2Q0Yu
>>
>> implementation   line count of main in assembly
>> andralex         384 -  9 labels = 375 instructions
>> MFortin1         173 - 19 labels = 154 instructions
>> MFortin2         182 - 18 labels = 164 instructions
>> safety0ff         -- not inlined --
>> dnspies           -- not inlined --
>> CWilliams        229 - 23 labels = 206 instructions
>> MFortin1Check    211 - 24 labels = 187 instructions
>> MFortin2Check    218 - 21 labels = 197 instructions
>>
>> So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lay at the end of the string, so I'd favor MFortin1Check.

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by Chris Williams
in reply to Andrei Alexandrescu

Chris Williams

Posted in reply to Andrei Alexandrescu

On Monday, 24 March 2014 at 16:17:52 UTC, Andrei Alexandrescu wrote:
> On 3/24/14, 12:53 AM, Chris Williams wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> http://goo.gl/TaZTNB
>
> Nice! Why assert(ret != 0)?
>
> Andrei

It should be -1. I wanted to make sure that bsr() wasn't called on zero, but of course that's not the correct way to test it when I'm about to ~ it.

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by Dmitry Olshansky
in reply to Andrei Alexandrescu

Dmitry Olshansky

Posted in reply to Andrei Alexandrescu

24-Mar-2014 01:22, Andrei Alexandrescu пишет:
> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>
> Andrei

I had to join the party at some point.
This seems like 25 instructions:
http://goo.gl/N7sHtK

-- 
Dmitry Olshansky

March 24, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by monarch_dodra
in reply to Andrei Alexandrescu

monarch_dodra

Posted in reply to Andrei Alexandrescu

On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu wrote:
> On 3/24/14, 2:02 AM, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline in
>> regards to detecting and handling invalid UTF sequences?
>
> I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error.
>
> Andrei

That sounds good to me (I think). Better than a straight up assert anyways...

...which is what the code you submitted does.

Also, I think relying on bounds checking is wrong: Disabling bounds checking mean you are OK for memory corruption for *wrong code*. I don't think invalid string is wrong code. 'front' should do an actual "if", and "assert(0)" when needed. This should have no impact on performance BTW, if done with pointer extraction ("slice.ptr[1]"), and in a trusted context.

Of course, this is all assuming we are asserting at all.

March 25, 2014

Re: Challenge: write a really really small front() for UTF8

Posted by Daniel N
in reply to Daniel N

Daniel N

Posted in reply to Daniel N

On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
> I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising.

I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction...

1) I know my solution is not directly comparable to the rest in this thread(for many reasons).
2) It's of course trivial to add a fast path for ascii... if desired.
3) It throws safety and standards out the window.

#include <x86intrin.h>

char32_t front(const char* inp)
{
  static const uint32_t mask[32] =
  {
      0b01111111'00000000'00000000'00000000,
      0b00000000'00000000'00000000'00000000,
      0b00011111'00111111'00000000'00000000,
      0b00001111'00111111'00111111'00000000,
      0b00000111'00111111'00111111'00111111,
  };
  uint32_t rev = __builtin_bswap32(reinterpret_cast<const uint32_t*>(inp)[0]);
  uint32_t len = __lzcnt32(~rev);

  return _pext_u32(rev, mask[len]);
}

This is what clang 3.4 generated:
## BB#0:
        pushq   %rbp
Ltmp2:
        .cfi_def_cfa_offset 16
Ltmp3:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
Ltmp4:
        .cfi_def_cfa_register %rbp
        movl    (%rdi), %eax
        bswapl  %eax
        movl    %eax, %ecx
        notl    %ecx
        lzcntl  %ecx, %ecx
        leaq    __ZZ5frontPKcE4mask(%rip), %rdx
        pextl   (%rdx,%rcx,4), %eax, %eax
        popq    %rbp
        ret

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation