April 02, 2014
Dmitry Olshansky:

> _Explicit_ tail call (aka switch-call) would have been cleaner and more modular. It covers all relevant cases of computed-goto such as threaded-code interpreters and also recursion-heavy functional code.

It looks like an additive change. Can you show a possible syntax and semantics for D? If it's so good it could also become a DIP later.

Bye,
bearophile
April 02, 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
>
> bool isIdentifierChar1(ubyte c)
> {
>     return ((c >= '0' || c == '$') &&
>             (c <= '9' || c >= 'A')  &&
>             (c <= 'Z' || c >= 'a' || c == '_') &&
>             (c <= 'z'));
> }

I'd like to point out that this is quite a complicated function to begin with, so it doesn't generalize to all the isXXX-in-ASCII predicates, for which the tests would be much simpler.

In any case (on my win32 machine), I can go from 810 msecs to 500 msecs using this function instead:

bool isIdentifierChar1(ubyte c)
{
    return c <= 'z' && (
            'a' <= c ||
            ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) ||
            c == '$');
}

That said, I'm abusing the fact that 50% of your bench is for chars over 0x80. If I loop only over the actual ASCII you'd find in text (0x20 - 0x80), then those numbers "only" go from 320 msecs to 300 msecs. Only slightly better, but still a win.

*BUT*, if your functions were to accept any arbitrary codepoint, it would absolutely murder.
April 02, 2014
> On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
>> Try this benchmark comparing various classification schemes:
>>
>> bool isIdentifierChar1(ubyte c)
>> {
>>    return ((c >= '0' || c == '$') &&
>>            (c <= '9' || c >= 'A')  &&
>>            (c <= 'Z' || c >= 'a' || c == '_') &&
>>            (c <= 'z'));
>> }

Considering SSE4.2 was already released back in 2008 and was designed to accelerate this very use case...

Maybe it's more beneficial to add a few new code paths... SSE4.2, NEON, etc.
April 02, 2014
02-Apr-2014 15:26, bearophile wrote:
> Dmitry Olshansky:
>
>> _Explicit_ tail call (aka switch-call) would have been cleaner and
>> more modular. It covers all relevant cases of computed-goto such as
>> threaded-code interpreters and also recursion-heavy functional code.
>
> It looks like an additive change. Can you show a possible syntax and
> semantics for D? If it's so good it could also become a DIP later.
>

Easily. In grammar, a new kind of statement:

SwitchCallStatement:
	goto CallExpression;

Semantics:
Same as that of a function call expression except:
a) A switch-call may only target a function with the same return type.
b) Anything below this statement is unreachable, i.e. it is treated like a return.
c) The stack frame of the current function is overwritten with that of the switched-to function. In other words, after a switch-call the stack looks as if the switched-to function had been called in the first place.

The last point is what modern tail-call optimizations do when two functions tail-recurse into each other: they overwrite stack frames on such tail calls.

However, this only happens with certain optimization switches, which IMO makes the whole thing totally unreliable. In fact, I observed that I could implement a threaded-code interpreter by relying on tail-call optimization with LDC, but only when compiling with all optimizations enabled; unoptimized builds simply blow up the stack. This "semantics tied to optimization" situation is something I hate about C/C++ and would hope to address in D.
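
To illustrate with purely hypothetical syntax (nothing like this compiles today), a threaded-code interpreter built on such a switch-call could look like:

alias Handler = void function(Vm* vm);

struct Vm
{
    immutable(ubyte)* ip;   // instruction pointer into the bytecode
    Handler[256] table;     // one handler per opcode
    long[16] regs;
}

void opAdd(Vm* vm)
{
    vm.regs[vm.ip[1]] += vm.regs[vm.ip[2]];
    vm.ip += 3;
    // Hypothetical switch-call: reuse this frame and jump straight to the
    // next opcode's handler instead of calling it and returning.
    goto vm.table[*vm.ip](vm);
}

void opHalt(Vm* vm)
{
    // The only point at which the whole dispatch chain actually returns.
}

Each handler ends in a switch-call to the next one, so the stack stays flat no matter how long the bytecode is.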


-- 
Dmitry Olshansky
April 02, 2014
02-Apr-2014 20:52, Daniel N wrote:
>> On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
>>> Try this benchmark comparing various classification schemes:
>>>
>>> bool isIdentifierChar1(ubyte c)
>>> {
>>>    return ((c >= '0' || c == '$') &&
>>>            (c <= '9' || c >= 'A')  &&
>>>            (c <= 'Z' || c >= 'a' || c == '_') &&
>>>            (c <= 'z'));
>>> }
>
> Considering SSE4.2 was released already back in 2008 and was designed to
> accelerate this very use-case...
>
> Maybe it's more beneficial to add a few new code-paths... SSE4.2, NEON,
> etc.

Ideally we'd need:
a) A collection of meaningful primitives (high-level compared to std.simd) that can be accelerated by SIMD, in Phobos, such as searching, bit counting, etc.
b) A way to automatically use whatever is best and supported by the hardware. GCC has had zero-overhead machinery for that since around 4.6 (?); a hand-rolled approximation is sketched below.
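
A hand-rolled approximation of (b) that works today is to resolve the implementation once at startup (a sketch only; the SSE4.2 body is elided and just falls back to the scalar loop, and the names are made up):

import core.cpuid : sse42;

alias CountFn = size_t function(const(ubyte)[] haystack, ubyte needle);

size_t countScalar(const(ubyte)[] haystack, ubyte needle)
{
    size_t n;
    foreach (b; haystack)
        n += (b == needle);
    return n;
}

size_t countSse42(const(ubyte)[] haystack, ubyte needle)
{
    // A real version would use SSE4.2 here; placeholder for the sketch.
    return countScalar(haystack, needle);
}

// Resolved once, before main() runs, based on what the CPU reports.
__gshared CountFn countByte = &countScalar;

shared static this()
{
    if (sse42)
        countByte = &countSse42;
}

Not zero-overhead (every use goes through a function pointer), but it works with today's compilers.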

-- 
Dmitry Olshansky
April 02, 2014
On Tuesday, 1 April 2014 at 22:32:28 UTC, bearophile wrote:
> Eventually this should work, and avoid the nasty cast:
>
> immutable bool[256] tab3 =
>     ubyte
>     .max
>     .iota!"[]"

I can't decide if this part is brilliant or terrible. Clever, either way.
April 02, 2014
On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
> SwitchCallStatement:
>      goto CallExpression;
>
> Semantics:
> Same as that of a function call expression except:
> a) A switch-call may only target a function with the same return type.
> b) Anything below this statement is unreachable, i.e. it is treated like a return.
> c) The stack frame of the current function is overwritten with that of the
> switched-to function. In other words, after a switch-call the stack looks as
> if the switched-to function had been called in the first place.
>
> The last point is what modern tail-call optimizations do when two functions
> tail-recurse into each other: they overwrite stack frames on such tail calls.

How is this different from simply recognizing "return foo(arg);" as being a goto?


> However this only happens with certain optimization switch which IMO makes the
> thing totally unreliable. In fact I even observed that I could implement
> threaded-code interpreter by relying on tail-call optimization with LDC, but
> only when compiling with all optimizations enabled. Unoptimized builds simply
> blowup stack. This "semantics are tied to optimization" is a situation that I
> hate about C/C++ and would hope to address in D.

Doesn't seem terribly different than the "force inline" feature discussion.
April 02, 2014
On 4/2/2014 4:49 AM, monarch_dodra wrote:
> That said, I'm abusing the fact that 50% of your bench is for chars over 0x80.
> If I loop only over the actual ASCII you'd find in text (0x20 - 0x80), then
> those numbers "only" go from 320 msecs to 300 msecs. Only slightly better, but
> still a win.

Surely a better approach would be to do a statistical analysis of character frequency in a representative corpus, and tune the compares that way. But the table lookup offered such a dramatic improvement that I didn't bother.
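
The table variant is essentially just the following (a sketch; the names and the exact initialization in my benchmark may differ):

immutable bool[256] identTab = ()
{
    bool[256] t;
    foreach (c; 0 .. 256)
        t[c] = (c >= '0' && c <= '9') ||
               (c >= 'A' && c <= 'Z') ||
               (c >= 'a' && c <= 'z') ||
               c == '_' || c == '$';
    return t;
}();

bool isIdentifierCharTable(ubyte c)
{
    return identTab[c];   // one load, no branching on the character value
}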

April 02, 2014
On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
> Ideally we'd need:
> a) A collection of meaningful (high-level compared to std.simd) primitives that
> can be accelerated by SIMD, in Phobos. Such as searching, bit-counting etc.

memchr is already optimized with SIMD, and std.algorithm.find uses memchr. In general, enormous effort has been poured into optimizing the C standard library functions. If an algorithm can be implemented using such functions, we ought to do so.
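
For instance (an illustrative snippet, not from the compiler source):

import std.algorithm.searching : find;
import std.string : representation;

// Finding a single byte in a buffer is the case Phobos can hand off to
// memchr, so it picks up the C library's SIMD work for free.
auto buf = "int x = 42; // comment\nint y;".representation;
auto rest = buf.find(cast(ubyte) '\n');   // slice from the newline onward, or empty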

April 02, 2014
03-Apr-2014 00:39, Walter Bright wrote:
> On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
>> SwitchCallStatement:
>>      goto CallExpression;
>>
>> Semantics:
>> Same as that of a function call expression except:
>> a) A switch-call may only target a function with the same return type.
>> b) Anything below this statement is unreachable, i.e. it is treated like a return.
>> c) The stack frame of the current function is overwritten with that of the
>> switched-to function. In other words, after a switch-call the stack looks as
>> if the switched-to function had been called in the first place.
>>
>> The last point is what modern tail-call optimizations do when two functions
>> tail-recurse into each other: they overwrite stack frames on such tail calls.
>
> How is this different from simply recognizing "return foo(arg);" as
> being a goto?

If we can alter the semantics of
return foo(arg);
to _always_ be a goto, guaranteeing a jump and reuse of the stack, then I'm all for it.
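
That would make code like the following reliable; today it only survives if the optimizer happens to perform TCO:

// Only safe if "return f(x);" is guaranteed to reuse the caller's frame;
// without that guarantee an unoptimized build overflows the stack.
bool isEven(ulong n) { return n == 0 ? true  : isOdd(n - 1); }
bool isOdd (ulong n) { return n == 0 ? false : isEven(n - 1); }

void main()
{
    assert(isEven(10_000_000));
}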

>> However, this only happens with certain optimization switches, which IMO
>> makes the whole thing totally unreliable. In fact, I observed that I could
>> implement a threaded-code interpreter by relying on tail-call optimization
>> with LDC, but only when compiling with all optimizations enabled; unoptimized
>> builds simply blow up the stack. This "semantics tied to optimization"
>> situation is something I hate about C/C++ and would hope to address in D.
>
> Doesn't seem terribly different than the "force inline" feature discussion.

Yes, it shows the same problem: semantics and optimization often get mixed up. In this case, however, a program would segfault if the relevant optimization was not performed; it wouldn't just be many times slower.

-- 
Dmitry Olshansky