October 16, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Uplink_Coder

On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
> This looks quite slow.
> We already have a correct version in utf.decodeImpl.
> The goal here was to find a small and fast alternative.
A: What's 2 + 2?
B: 3
A: That's wrong.
B: But incredibly fast.

October 16, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Uplink_Coder

On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
> On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter wrote:
> This looks quite slow.
> We already have a correct version in utf.decodeImpl.
> The goal here was to find a small and fast alternative.

I know, but it has to be correct before being fast. The code is simple and the checks can easily be removed. Here is the version without the overlong, invalid-sequence and codepoint checks:

dchar myFront3(ref char[] str)
{
    dchar c0 = str.ptr[0];
    if(c0 < 0x80) {
        return c0;
    }
    else if(str.length > 1) {
        dchar c1 = str.ptr[1];
        if(c0 < 0xE0) {
            return ((c0 & 0x1F) << 6) | (c1 & 0x3F);
        }
        else if(str.length > 2) {
            dchar c2 = str.ptr[2];
            if(c0 < 0xF0) {
                return ((c0 & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F);
            }
            else if(str.length > 3) {
                dchar c3 = str.ptr[3];
                if(c0 < 0xF5) {
                    return ((c0 & 0x07) << 16) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F);
                }
            }
        }
    }
Linvalid:
    throw new Exception("yadayada");
}

Next step will be to loop for length 2,3,4, with or without your table.
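A quick sanity check (added here, not part of the thread), assuming the myFront3 above; it exercises only the 2- and 3-byte branches, since the 4-byte branch is corrected in the follow-up post below:

unittest
{
    // assumes myFront3 from the post above is in scope
    // 'é' = U+00E9, encoded as C3 A9: ((0xC3 & 0x1F) << 6) | (0xA9 & 0x3F) = 0xE9
    char[] e = "é".dup;
    assert(myFront3(e) == 0x00E9);

    // '€' = U+20AC, encoded as E2 82 AC:
    // ((0xE2 & 0x0F) << 12) | ((0x82 & 0x3F) << 6) | (0xAC & 0x3F) = 0x20AC
    char[] euro = "€".dup;
    assert(myFront3(euro) == 0x20AC);
}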

October 16, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Patrick Schluter

On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter wrote:
> On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
>> On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter wrote:
>
>> This looks quite slow.
>> We already have a correct version in utf.decodeImpl.
>> The goal here was to find a small and fast alternative.
>
> I know, but it has to be correct before being fast.
> The code is simple and the checks can easily be removed. Here is the version without the overlong, invalid-sequence and codepoint checks:
>
> dchar myFront3(ref char[] str)
> {
>     dchar c0 = str.ptr[0];
>     if(c0 < 0x80) {
>         return c0;
>     }
>     else if(str.length > 1) {
>         dchar c1 = str.ptr[1];
>         if(c0 < 0xE0) {
>             return ((c0 & 0x1F) << 6) | (c1 & 0x3F);
>         }
>         else if(str.length > 2) {
>             dchar c2 = str.ptr[2];
>             if(c0 < 0xF0) {
>                 return ((c0 & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F);
>             }
>             else if(str.length > 3) {
>                 dchar c3 = str.ptr[3];
>                 if(c0 < 0xF5) {
>                     return ((c0 & 0x07) << 16) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F);

Of course, this line is wrong; it should shift by 18, not 16:

return ((c0 & 0x07) << 18) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F);

>                 }
>             }
>         }
>     }
> Linvalid:
>     throw new Exception("yadayada");
> }
>
> Next step will be to loop for length 2,3,4, with or without your table.
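To make the correction concrete (added here, not from the thread): a 4-byte sequence carries 3 + 6 + 6 + 6 = 21 payload bits, so the lead byte's 3 data bits belong above the three 6-bit continuation groups, i.e. shifted left by 18. Plane-1 characters (lead byte F0) hide the bug because their top bits are all zero; a code point whose lead byte actually carries data bits exposes it:

unittest
{
    // U+10FFFF (the highest code point) encodes as F4 8F BF BF
    dchar c0 = 0xF4, c1 = 0x8F, c2 = 0xBF, c3 = 0xBF;
    // the corrected shift of 18 reassembles the code point exactly
    assert((((c0 & 0x07) << 18) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F)) == 0x10FFFF);
    // the original shift of 16 would have produced 0x4FFFF instead
    assert((((c0 & 0x07) << 16) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F)) == 0x4FFFF);
}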

October 16, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Patrick Schluter

On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter wrote:
>
> Next step will be to loop for length 2,3,4, with or without your table.
Ok, now the looped version, which doesn't need the lookup table. This one compiles to 72 lines of assembly (25 of them just for the exception code).
dchar myFront5(ref char[] str)
{
    byte c0 = str[0];                 // read the lead byte as a signed byte (see trick 1 below)
    if(c0 >= 0) {                     // ASCII fast path: sign bit clear means < 0x80
        return c0;
    }
    else {
        // reject the always-invalid lead bytes 0xC0/0xC1 (-64/-63) and anything above 0xF4 (-12)
        if(((c0 != -64) & (c0 != -63)) && c0 <= -12) {
            auto l = str.length;
            int idx = 1;
            dchar mask0 = 0;
            dchar c1 = c0 & 0x3f;
            int lim = -64;
            while(l--) {              // also exits when the string is too short (trick 2)
                if(c0 < lim) {
                    if(c1 > 0x10FFFF) break;
                    return c1;
                }
                lim >>= 1;
                // shift in the next continuation byte; ~mask0 shoots down the spurious
                // prefix bit carried over from the lead byte (trick 3)
                c1 = ((c1 << 6) | (str.ptr[idx++] & 0x3F)) & ~mask0;
                mask0 = 1 << (idx*6 + 7 - idx);
            }
        }
    }
Linvalid:
    throw new Exception("yadayada");
}
Explanations of the tricks used (a short worked trace follows the list):
1. The first character is read as a signed byte, with sign extension; this allows comparing it to the lim variable, which does essentially what the lookup table was doing.
2. There are two loop escapes: when l, the variable holding the length of the string, reaches 0, which means the string is too short, or when lim becomes bigger than the 1st character (see 1.).
3. Using the same 0x3F mask to extract the data bits from all bytes in the loop has the drawback of adding a spurious bit coming from the 1st byte for 3- and 4-byte sequences. The mask0 variable is used to shoot down that bit in the next loop iteration.
4. Two simple tests eliminate a lot of invalid cases (overlongs in 3- and 4-byte sequences are not tested yet, though; I think there's a simple way of doing it, but I'm too tired now to explore it).
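A worked trace (added here, not from the post), assuming the myFront5 above, following the 3-byte sequence for '€' (U+20AC, bytes E2 82 AC) through the loop; it shows the spurious prefix bit landing on bit 17 and being cleared by mask0:

unittest
{
    // assumes myFront5 from the post above is in scope
    // c0 = 0xE2 read as a signed byte is -30; it passes the C0/C1 and <= 0xF4 (-12) checks
    // c1 starts as 0xE2 & 0x3F = 0x22 (bit 5 is the spurious '1' from the 1110 prefix)
    // iteration 1: -30 < -64? no; lim becomes -32; c1 = (0x22 << 6) | (0x82 & 0x3F) = 0x882; mask0 = 1 << 17
    // iteration 2: -30 < -32? no; lim becomes -16;
    //              c1 = ((0x882 << 6) | (0xAC & 0x3F)) & ~(1 << 17) = 0x220AC & ~0x20000 = 0x20AC
    // iteration 3: -30 < -16? yes, and 0x20AC <= 0x10FFFF, so 0x20AC is returned
    char[] s = "€".dup;
    assert(myFront5(s) == 0x20AC);
}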

October 17, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Uplink_Coder

On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:

Have you seen this[1] link? It is almost what you're doing, but with some more nice properties.

[1] http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu

On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
> So we've had a good run with making popFront smaller. In ASCII
> microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
> $]. Smaller functions make sure that the impact on instruction cache in
> larger applications is not high.
[snip]
> Your mission, should you choose to accept it, is to define a combination
> front/popFront that reduces the gap.

I'm sooo late to the party. Still, a neat trick with a UTF lookup table is to cram it into a single word with bit packing. Since we only ever consider the top 5 bits of the leading byte, of which the top one can be safely discarded with a quick < 0x80 check, this leaves us with 4 bits -> a 16-entry table, which fits nicely in a 64-bit word with 4 bits per entry.

// construct in-word lookup table
ulong bitTable()
{
    import std.utf;
    ulong table;
    for(size_t i = 128; i < 256; i += 8)
    {
        char[1] c = [ cast(char)i ];
        try
        {
            ulong s = std.utf.stride(c[0..1], 0);
            table |= s << (4*((i-128)>>3));
        }
        catch(Exception)
        {
            // keep zeros for bytes that cannot start a sequence
        }
    }
    return table;
}

uint myStride()(char c)
{
    enum table = bitTable;
    // entries are 4 bits wide, so the index has to be scaled by 4
    return (table >> (4*((c & 0x7f) >> 3))) & 0xF;
}

void myPopFront()(ref char[] s)
{
    char c = s[0];
    if(c < 0x80)
        s = s[1..$];
    else
    {
        uint step = myStride(c);
        if(!step) throw new Exception("blah");
        s = s[step..$];
    }
}

Accordingly, myFront could be modified to use the same technique. Can't check perf with LDC at the moment, sadly.

> Andrei
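A small usage check (added here, not part of the post), assuming the bitTable/myStride/myPopFront above; it pops one ASCII, one 2-byte and one 3-byte sequence:

unittest
{
    // "aé€" is 1 + 2 + 3 = 6 code units
    char[] s = "aé€".dup;
    myPopFront(s); assert(s.length == 5);   // past 'a'
    myPopFront(s); assert(s.length == 3);   // past 'é' (C3 A9)
    myPopFront(s); assert(s.length == 0);   // past '€' (E2 82 AC)
}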

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Dmitry Olshansky

On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
> On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
>> So we've had a good run with making popFront smaller. In ASCII
>> microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
>> $]. Smaller functions make sure that the impact on instruction cache in
>> larger applications is not high.
> [snip]
>> Your mission, should you choose to accept it, is to define a combination
>> front/popFront that reduces the gap.
>>
>
> I'm sooo late to the party. Still a neat trick with UTF lookup table is
> to cram it into a single word with bit packing. Since we only ever
> consider top 5 bits of leading byte of which the top one can be safely
> discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry
> table, fits nicely in a 64-bit word with 4bits per entry.
Half of the entries in your table are 0 (every char that starts with 10). In addition, you only need 2 bits to store the length (the max byte sequence is 4; subtract 1 from the sequence length to fit it in 2 bits).
So I think you can fit this into a 16-bit word (8 entries of 2 bits each). You just need to pre-check for the invalid sequences (you no longer have to check for a 0 result):
if(c < 0x80) s = s[1 .. $];
else if(c < 0xc0 || c > 0xf7) throw new Exception("blah");
else s = s[1 + myStride(c) .. $];
Should behave better on 32-bit systems.
You could also store 3 bits to avoid the extra addition.
-Steve
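A sketch (added here, not from the post) of the 2-bit packing described above; smallTable, myStride16 and myPopFront16 are hypothetical names, and the table stores the sequence length minus one for the seven groups of lead bytes 0xC0..0xF7:

// 2 bits per entry, entry i covers lead bytes 0xC0 + 8*i .. 0xC0 + 8*i + 7
enum ushort smallTable = cast(ushort)(
      (1 << (2 * 0))   // 0xC0..0xC7: 2-byte sequences, store 2 - 1 = 1
    | (1 << (2 * 1))   // 0xC8..0xCF: 2-byte
    | (1 << (2 * 2))   // 0xD0..0xD7: 2-byte
    | (1 << (2 * 3))   // 0xD8..0xDF: 2-byte
    | (2 << (2 * 4))   // 0xE0..0xE7: 3-byte
    | (2 << (2 * 5))   // 0xE8..0xEF: 3-byte
    | (3 << (2 * 6))); // 0xF0..0xF7: 4-byte

uint myStride16(char c)
{
    // the caller is assumed to have rejected c < 0xC0 and c > 0xF7 already
    return ((smallTable >> (2 * ((c - 0xC0) >> 3))) & 0x3) + 1;
}

void myPopFront16(ref char[] s)
{
    char c = s[0];
    if (c < 0x80)
        s = s[1 .. $];
    else if (c < 0xC0 || c > 0xF7)
        throw new Exception("blah");
    else
        s = s[myStride16(c) .. $];
}

Unlike the version in the post, which stores length - 1 and adds 1 at the call site, this sketch folds the + 1 into myStride16 itself.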

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Steven Schveighoffer

On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer wrote:
> On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
>> On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
>>> So we've had a good run with making popFront smaller. In ASCII
>>> microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
>>> $]. Smaller functions make sure that the impact on instruction cache in
>>> larger applications is not high.
>> [snip]
>>> Your mission, should you choose to accept it, is to define a combination
>>> front/popFront that reduces the gap.
>>>
>>
>> I'm sooo late to the party. Still a neat trick with UTF lookup table is
>> to cram it into a single word with bit packing. Since we only ever
>> consider top 5 bits of leading byte of which the top one can be safely
>> discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry
>> table, fits nicely in a 64-bit word with 4bits per entry.
>
> Half of the entries in your table are 0 (every char that starts with 10). In addition, you only need 2 bits to store the length (the max byte sequence is 4; subtract 1 from the sequence length to fit it in 2 bits).
>
> So I think you can fit this into a 16-bit word (8 entries of 2 bits each). You just need to pre-check for the invalid sequences (you no longer have to check for a 0 result):
>
> if(c < 0x80) s = s[1 .. $];
> else if(c < 0xc0 || c > 0xf7) throw new Exception("blah");
> else s = s[1 + myStride(c) .. $];
>
> Should behave better on 32-bit systems.
>
> You could also store 3 bits to avoid the extra addition.
>
> -Steve
The whole point of a LUT to begin with is to reduce instructions.

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Stefam Koch

On 10/25/16 3:32 PM, Stefam Koch wrote:
> On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer wrote:
>> Should behave better on 32-bit systems.
>>
>> You could also store 3 bits to avoid the extra addition.
> The whole point of a LUT to begin with is to reduce instructions.

Yes, I know. But as I understand it, using 64-bit math on 32-bit systems is expensive also. Maybe not in this case...

-Steve

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu

On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
>
> Now it's time to look at the end-to-end cost of autodecoding.

Some food for thought:

- front necessarily needs to compute the number of bytes to advance.
- We can't change the API to share data between front and popFront; however, we can create a situation where a pure function gets its duplicate calls removed by the compiler.

Since we require that the ASCII test gets inlined into the caller of front/popFront to improve ASCII performance, we have a situation similar to this:

import std.typecons : Tuple;

alias Result = Tuple!(dchar, "codepoint", int, "advance");

auto decode(const char[] str) pure
{
    pragma(inline, true);
    if (str[0] < 0x80)
        return Result(str[0], 1);
    else
        return decodeNonAscii(str);
}

dchar front(const char[] str) pure
{
    pragma(inline, true);
    return str.decode.codepoint;
}

void popFront(ref const(char)[] str)
{
    pragma(inline, true);
    str = str[str.decode.advance .. $];
}

When used in front/popFront pairs, the duplicated decode calls get merged and we don't do any duplicate work (unlike the current situation).

Unfortunately, it's not possible to achieve the best code generation due to missed optimizations by the compilers (I haven't tried GDC). I've reported a highly reduced case to LDC github issue #1851 / the LDC subforum. Once we have this, only decodeNonAscii remains to be optimized, perhaps using the DFA method linked by ketmar.

P.S. I am aware that this pessimises popFront for code which only counts codepoints without inspecting them.
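For illustration (added here, not part of the post): a consumer written with the pure decode-based front/popFront above, where the back-to-back calls feed decode identical arguments so the optimizer is expected to merge them into one decode per iteration; countNonAscii is a hypothetical name:

// assumes the decode/front/popFront sketch from the post above is in scope
size_t countNonAscii(const(char)[] s)
{
    size_t n;
    while (s.length)
    {
        if (s.front >= 0x80)   // front() calls the pure decode once ...
            ++n;
        s.popFront();          // ... and popFront()'s decode call should be CSE'd with it
    }
    return n;
}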