October 16, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Uplink_Coder

On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
> This looks quite slow.
> We already have a correct version in utf.decodeImpl.
> The goal here was to find a small and fast alternative.
A: What's 2 + 2?
B: 3
A: That's wrong.
B: But incredibly fast.

October 16, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Uplink_Coder

On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
> On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter wrote:
> This looks quite slow.
> We already have a correct version in utf.decodeImpl.
> The goal here was to find a small and fast alternative.

I know, but it has to be correct before being fast. The code is simple and the checks can easily be removed. Here is the version without the overlong, invalid-sequence and codepoint checks:

dchar myFront3(ref char[] str)
{
    dchar c0 = str.ptr[0];
    if(c0 < 0x80) {
        return c0;
    }
    else if(str.length > 1) {
        dchar c1 = str.ptr[1];
        if(c0 < 0xE0) {
            return ((c0 & 0x1F) << 6) | (c1 & 0x3F);
        }
        else if(str.length > 2) {
            dchar c2 = str.ptr[2];
            if(c0 < 0xF0) {
                return ((c0 & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F);
            }
            else if(str.length > 3) {
                dchar c3 = str.ptr[3];
                if(c0 < 0xF5) {
                    return ((c0 & 0x07) << 16) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F);
                }
            }
        }
    }
Linvalid:
    throw new Exception("yadayada");
}

Next step will be to loop for length 2,3,4, with or without your table.
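A quick sanity check (added here, not part of the thread), assuming the myFront3 above; it exercises only the 2- and 3-byte branches, since the 4-byte branch is corrected in the follow-up post below:

unittest
{
    // assumes myFront3 from the post above is in scope
    // 'é' = U+00E9, encoded as C3 A9: ((0xC3 & 0x1F) << 6) | (0xA9 & 0x3F) = 0xE9
    char[] e = "é".dup;
    assert(myFront3(e) == 0x00E9);

    // '€' = U+20AC, encoded as E2 82 AC:
    // ((0xE2 & 0x0F) << 12) | ((0x82 & 0x3F) << 6) | (0xAC & 0x3F) = 0x20AC
    char[] euro = "€".dup;
    assert(myFront3(euro) == 0x20AC);
}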

October 16, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Patrick Schluter

On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter wrote:
> On Sunday, 16 October 2016 at 08:43:23 UTC, Uplink_Coder wrote:
>> On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter wrote:
>
>> This looks quite slow.
>> We already have a correct version in utf.decodeImpl.
>> The goal here was to find a small and fast alternative.
>
> I know, but it has to be correct before being fast.
> The code is simple and the checks can easily be removed. Here is the version without the overlong, invalid-sequence and codepoint checks:
>
> dchar myFront3(ref char[] str)
> {
>     dchar c0 = str.ptr[0];
>     if(c0 < 0x80) {
>         return c0;
>     }
>     else if(str.length > 1) {
>         dchar c1 = str.ptr[1];
>         if(c0 < 0xE0) {
>             return ((c0 & 0x1F) << 6) | (c1 & 0x3F);
>         }
>         else if(str.length > 2) {
>             dchar c2 = str.ptr[2];
>             if(c0 < 0xF0) {
>                 return ((c0 & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F);
>             }
>             else if(str.length > 3) {
>                 dchar c3 = str.ptr[3];
>                 if(c0 < 0xF5) {
>                     return ((c0 & 0x07) << 16) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F);

Of course, this line is wrong; it should shift by 18, not 16:

return ((c0 & 0x07) << 18) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F);

>                 }
>             }
>         }
>     }
> Linvalid:
>     throw new Exception("yadayada");
> }
>
> Next step will be to loop for length 2,3,4, with or without your table.
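To make the correction concrete (added here, not from the thread): a 4-byte sequence carries 3 + 6 + 6 + 6 = 21 payload bits, so the lead byte's 3 data bits belong above the three 6-bit continuation groups, i.e. shifted left by 18. Plane-1 characters (lead byte F0) hide the bug because their top bits are all zero; a code point whose lead byte actually carries data bits exposes it:

unittest
{
    // U+10FFFF (the highest code point) encodes as F4 8F BF BF
    dchar c0 = 0xF4, c1 = 0x8F, c2 = 0xBF, c3 = 0xBF;
    // the corrected shift of 18 reassembles the code point exactly
    assert((((c0 & 0x07) << 18) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F)) == 0x10FFFF);
    // the original shift of 16 would have produced 0x4FFFF instead
    assert((((c0 & 0x07) << 16) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F)) == 0x4FFFF);
}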

October 16, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Patrick Schluter

On Sunday, 16 October 2016 at 10:05:37 UTC, Patrick Schluter wrote:
>
> Next step will be to loop for length 2,3,4, with or without your table.
Ok, now the looped version, which doesn't need the lookup table. This one compiles to 72 lines of assembly (25 of them just for the exception code).
dchar myFront5(ref char[] str)
{
    byte c0 = str[0];                 // read the lead byte as a signed byte (see trick 1 below)
    if(c0 >= 0) {                     // ASCII fast path: sign bit clear means < 0x80
        return c0;
    }
    else {
        // reject the always-invalid lead bytes 0xC0/0xC1 (-64/-63) and anything above 0xF4 (-12)
        if(((c0 != -64) & (c0 != -63)) && c0 <= -12) {
            auto l = str.length;
            int idx = 1;
            dchar mask0 = 0;
            dchar c1 = c0 & 0x3f;
            int lim = -64;
            while(l--) {              // also exits when the string is too short (trick 2)
                if(c0 < lim) {
                    if(c1 > 0x10FFFF) break;
                    return c1;
                }
                lim >>= 1;
                // shift in the next continuation byte; ~mask0 shoots down the spurious
                // prefix bit carried over from the lead byte (trick 3)
                c1 = ((c1 << 6) | (str.ptr[idx++] & 0x3F)) & ~mask0;
                mask0 = 1 << (idx*6 + 7 - idx);
            }
        }
    }
Linvalid:
    throw new Exception("yadayada");
}
Explanations of the tricks used (a short worked trace follows the list):
1. The first character is read as a signed byte, with sign extension; this allows comparing it to the lim variable, which does essentially what the lookup table was doing.
2. There are two loop escapes: when l, the variable holding the length of the string, reaches 0, which means the string is too short, or when lim becomes bigger than the 1st character (see 1.).
3. Using the same 0x3F mask to extract the data bits from all bytes in the loop has the drawback of adding a spurious bit coming from the 1st byte for 3- and 4-byte sequences. The mask0 variable is used to shoot down that bit in the next loop iteration.
4. Two simple tests eliminate a lot of invalid cases (overlongs in 3- and 4-byte sequences are not tested yet, though; I think there's a simple way of doing it, but I'm too tired now to explore it).
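A worked trace (added here, not from the post), assuming the myFront5 above, following the 3-byte sequence for '€' (U+20AC, bytes E2 82 AC) through the loop; it shows the spurious prefix bit landing on bit 17 and being cleared by mask0:

unittest
{
    // assumes myFront5 from the post above is in scope
    // c0 = 0xE2 read as a signed byte is -30; it passes the C0/C1 and <= 0xF4 (-12) checks
    // c1 starts as 0xE2 & 0x3F = 0x22 (bit 5 is the spurious '1' from the 1110 prefix)
    // iteration 1: -30 < -64? no; lim becomes -32; c1 = (0x22 << 6) | (0x82 & 0x3F) = 0x882; mask0 = 1 << 17
    // iteration 2: -30 < -32? no; lim becomes -16;
    //              c1 = ((0x882 << 6) | (0xAC & 0x3F)) & ~(1 << 17) = 0x220AC & ~0x20000 = 0x20AC
    // iteration 3: -30 < -16? yes, and 0x20AC <= 0x10FFFF, so 0x20AC is returned
    char[] s = "€".dup;
    assert(myFront5(s) == 0x20AC);
}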

October 17, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Uplink_Coder

On Saturday, 15 October 2016 at 18:40:11 UTC, Uplink_Coder wrote:

Have you seen this[1] link? It is almost what you're doing, but with some more nice properties.

[1] http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu

On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
> So we've had a good run with making popFront smaller. In ASCII
> microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
> $]. Smaller functions make sure that the impact on instruction cache in
> larger applications is not high.
[snip]
> Your mission, should you choose to accept it, is to define a combination
> front/popFront that reduces the gap.

I'm sooo late to the party. Still, a neat trick with a UTF lookup table is to cram it into a single word with bit packing. Since we only ever consider the top 5 bits of the leading byte, of which the top one can be safely discarded with a quick < 0x80 check, this leaves us with 4 bits -> a 16-entry table, which fits nicely in a 64-bit word with 4 bits per entry.

// construct in-word lookup table
ulong bitTable()
{
    import std.utf;
    ulong table;
    for(size_t i = 128; i < 256; i += 8)
    {
        char[1] c = [ cast(char)i ];
        try
        {
            ulong s = std.utf.stride(c[0..1], 0);
            table |= s << (4*((i-128)>>3));
        }
        catch(Exception)
        {
            // keep zeros for bytes that cannot start a sequence
        }
    }
    return table;
}

uint myStride()(char c)
{
    enum table = bitTable;
    // entries are 4 bits wide, so the index has to be scaled by 4
    return (table >> (4*((c & 0x7f) >> 3))) & 0xF;
}

void myPopFront()(ref char[] s)
{
    char c = s[0];
    if(c < 0x80)
        s = s[1..$];
    else
    {
        uint step = myStride(c);
        if(!step) throw new Exception("blah");
        s = s[step..$];
    }
}

Accordingly, myFront could be modified to use the same technique. Can't check perf with LDC at the moment, sadly.

> Andrei
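A small usage check (added here, not part of the post), assuming the bitTable/myStride/myPopFront above; it pops one ASCII, one 2-byte and one 3-byte sequence:

unittest
{
    // "aé€" is 1 + 2 + 3 = 6 code units
    char[] s = "aé€".dup;
    myPopFront(s); assert(s.length == 5);   // past 'a'
    myPopFront(s); assert(s.length == 3);   // past 'é' (C3 A9)
    myPopFront(s); assert(s.length == 0);   // past '€' (E2 82 AC)
}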

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Dmitry Olshansky

On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
> On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
>> So we've had a good run with making popFront smaller. In ASCII
>> microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
>> $]. Smaller functions make sure that the impact on instruction cache in
>> larger applications is not high.
> [snip]
>> Your mission, should you choose to accept it, is to define a combination
>> front/popFront that reduces the gap.
>>
>
> I'm sooo late to the party. Still a neat trick with UTF lookup table is
> to cram it into a single word with bit packing. Since we only ever
> consider top 5 bits of leading byte of which the top one can be safely
> discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry
> table, fits nicely in a 64-bit word with 4bits per entry.
Half of the entries in your table are 0 (every char that starts with 10). In addition, you only need 2 bits to store the length (the max byte sequence is 4; subtract 1 from the sequence length to fit it in 2 bits).
So I think you can fit this into a 16-bit word (8 entries of 2 bits each). You just need to pre-check for the invalid sequences (you no longer have to check for a 0 result):
if(c < 0x80) s = s[1 .. $];
else if(c < 0xc0 || c > 0xf7) throw new Exception("blah");
else s = s[1 + myStride(c) .. $];
Should behave better on 32-bit systems.
You could also store 3 bits to avoid the extra addition.
-Steve
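A sketch (added here, not from the post) of the 2-bit packing described above; smallTable, myStride16 and myPopFront16 are hypothetical names, and the table stores the sequence length minus one for the seven groups of lead bytes 0xC0..0xF7:

// 2 bits per entry, entry i covers lead bytes 0xC0 + 8*i .. 0xC0 + 8*i + 7
enum ushort smallTable = cast(ushort)(
      (1 << (2 * 0))   // 0xC0..0xC7: 2-byte sequences, store 2 - 1 = 1
    | (1 << (2 * 1))   // 0xC8..0xCF: 2-byte
    | (1 << (2 * 2))   // 0xD0..0xD7: 2-byte
    | (1 << (2 * 3))   // 0xD8..0xDF: 2-byte
    | (2 << (2 * 4))   // 0xE0..0xE7: 3-byte
    | (2 << (2 * 5))   // 0xE8..0xEF: 3-byte
    | (3 << (2 * 6))); // 0xF0..0xF7: 4-byte

uint myStride16(char c)
{
    // the caller is assumed to have rejected c < 0xC0 and c > 0xF7 already
    return ((smallTable >> (2 * ((c - 0xC0) >> 3))) & 0x3) + 1;
}

void myPopFront16(ref char[] s)
{
    char c = s[0];
    if (c < 0x80)
        s = s[1 .. $];
    else if (c < 0xC0 || c > 0xF7)
        throw new Exception("blah");
    else
        s = s[myStride16(c) .. $];
}

Unlike the version in the post, which stores length - 1 and adds 1 at the call site, this sketch folds the + 1 into myStride16 itself.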

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Steven Schveighoffer

On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer wrote:
> On 10/25/16 12:57 PM, Dmitry Olshansky wrote:
>> On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
>>> So we've had a good run with making popFront smaller. In ASCII
>>> microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
>>> $]. Smaller functions make sure that the impact on instruction cache in
>>> larger applications is not high.
>> [snip]
>>> Your mission, should you choose to accept it, is to define a combination
>>> front/popFront that reduces the gap.
>>>
>>
>> I'm sooo late to the party. Still a neat trick with UTF lookup table is
>> to cram it into a single word with bit packing. Since we only ever
>> consider top 5 bits of leading byte of which the top one can be safely
>> discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry
>> table, fits nicely in a 64-bit word with 4bits per entry.
>
> Half of the entries in your table are 0 (every char that starts with 10). In addition, you only need 2 bits to store the length (the max byte sequence is 4; subtract 1 from the sequence length to fit it in 2 bits).
>
> So I think you can fit this into a 16-bit word (8 entries of 2 bits each). You just need to pre-check for the invalid sequences (you no longer have to check for a 0 result):
>
> if(c < 0x80) s = s[1 .. $];
> else if(c < 0xc0 || c > 0xf7) throw new Exception("blah");
> else s = s[1 + myStride(c) .. $];
>
> Should behave better on 32-bit systems.
>
> You could also store 3 bits to avoid the extra addition.
>
> -Steve
The whole point of a LUT to begin with is to reduce instructions.

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Stefam Koch

On 10/25/16 3:32 PM, Stefam Koch wrote:
> On Tuesday, 25 October 2016 at 19:16:07 UTC, Steven Schveighoffer wrote:
>> Should behave better on 32-bit systems.
>>
>> You could also store 3 bits to avoid the extra addition.
> The whole point of a LUT to begin with is to reduce instructions.

Yes, I know. But as I understand it, using 64-bit math on 32-bit systems is expensive also. Maybe not in this case...

-Steve

October 25, 2016 Re: Reducing the cost of autodecoding
Posted in reply to Andrei Alexandrescu

On Wednesday, 12 October 2016 at 13:53:03 UTC, Andrei Alexandrescu wrote:
>
> Now it's time to look at the end-to-end cost of autodecoding.

Some food for thought:

- front necessarily needs to compute the number of bytes to advance.
- We can't change the API to share data between front and popFront; however, we can create a situation where a pure function gets its duplicate calls removed by the compiler.

Since we require that the ASCII test gets inlined into the caller of front/popFront to improve ASCII performance, we have a situation similar to this:

import std.typecons : Tuple;

alias Result = Tuple!(dchar, "codepoint", int, "advance");

auto decode(const char[] str) pure
{
    pragma(inline, true);
    if (str[0] < 0x80)
        return Result(str[0], 1);
    else
        return decodeNonAscii(str);
}

dchar front(const char[] str) pure
{
    pragma(inline, true);
    return str.decode.codepoint;
}

void popFront(ref const(char)[] str)
{
    pragma(inline, true);
    str = str[str.decode.advance .. $];
}

When used in front/popFront pairs, the duplicated decode calls get merged and we don't do any duplicate work (unlike the current situation).

Unfortunately, it's not possible to achieve the best code generation due to missed optimizations by the compilers (I haven't tried GDC). I've reported a highly reduced case to LDC github issue #1851 / the LDC subforum. Once we have this, only decodeNonAscii remains to be optimized, perhaps using the DFA method linked by ketmar.

P.S. I am aware that this pessimises popFront for code which only counts codepoints without inspecting them.
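For illustration (added here, not part of the post): a consumer written with the pure decode-based front/popFront above, where the back-to-back calls feed decode identical arguments so the optimizer is expected to merge them into one decode per iteration; countNonAscii is a hypothetical name:

// assumes the decode/front/popFront sketch from the post above is in scope
size_t countNonAscii(const(char)[] s)
{
    size_t n;
    while (s.length)
    {
        if (s.front >= 0x80)   // front() calls the pure decode once ...
            ++n;
        s.popFront();          // ... and popFront()'s decode call should be CSE'd with it
    }
    return n;
}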