April 02, 2014
Dmitry Olshansky:

> _Explicit_ tail call (aka switch-call) would have been cleaner and more modular. It covers all relevant cases of computed-goto such as threaded-code interpreters and also recursion-heavy functional code.

It looks like an additive change. Can you show a possible syntax and semantics for D? If it's so good it could also become a DIP later.

Bye,
bearophile
April 02, 2014
On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
> Try this benchmark comparing various classification schemes:
>
> bool isIdentifierChar1(ubyte c)
> {
>     return ((c >= '0' || c == '$') &&
>             (c <= '9' || c >= 'A')  &&
>             (c <= 'Z' || c >= 'a' || c == '_') &&
>             (c <= 'z'));
> }

I'd like to point out that this is quite a complicated function to begin with, so it doesn't generalize to all the isXXX-in-ASCII predicates, for which the tests would be much simpler.

In any case (on my win32 machine), I can go from 810 msecs to 500 msecs using this function instead:

bool isIdentifierChar1(ubyte c)
{
    return c <= 'z' && (
            'a' <= c ||
            ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) ||
            c == '$');
}

That said, I'm abusing the fact that 50% of your bench is for chars over 0x80. If I loop only over the actual ASCII you'd find in text (0x20 - 0x80), then those numbers "only" go from 320 msecs to 300 msecs. Only slightly better, but still a win.

*BUT*, if your functions were to accept any arbitrary codepoint, it would absolutely murder.
April 02, 2014
> On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
>> Try this benchmark comparing various classification schemes:
>>
>> bool isIdentifierChar1(ubyte c)
>> {
>>    return ((c >= '0' || c == '$') &&
>>            (c <= '9' || c >= 'A')  &&
>>            (c <= 'Z' || c >= 'a' || c == '_') &&
>>            (c <= 'z'));
>> }

Considering SSE4.2 was already released back in 2008 and was designed to accelerate this very use case...

Maybe it's more beneficial to add a few new code paths... SSE4.2, NEON, etc.
April 02, 2014
02-Apr-2014 15:26, bearophile wrote:
> Dmitry Olshansky:
>
>> _Explicit_ tail call (aka switch-call) would have been cleaner and
>> more modular. It covers all relevant cases of computed-goto such as
>> threaded-code interpreters and also recursion-heavy functional code.
>
> It looks like an additive change. Can you show a possible syntax and
> semantics for D? If it's so good it could also become a DIP later.
>

Easily. In grammar, a new kind of statement:

SwitchCallStatement:
	goto CallExpression;

Semantics:
Same as that of a function call expression except:
a) A switch-call may only target a function with the same return type.
b) Anything below this statement is unreachable, i.e. it is treated like a return.
c) The stack frame of the current function is overwritten with that of the switched-to function. In other words, after a switch-call the stack looks as if the switched-to function had been called in the first place.

The last point is what modern tail-call optimizations do when two functions tail-recurse into each other: they overwrite stack frames on such tail calls.

However, this only happens with certain optimization switches, which IMO makes the whole thing totally unreliable. In fact, I observed that I could implement a threaded-code interpreter by relying on tail-call optimization with LDC, but only when compiling with all optimizations enabled; unoptimized builds simply blow up the stack. This "semantics tied to optimization" situation is something I hate about C/C++ and would hope to address in D.
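
To illustrate with purely hypothetical syntax (nothing like this compiles today), a threaded-code interpreter built on such a switch-call could look like:

alias Handler = void function(Vm* vm);

struct Vm
{
    immutable(ubyte)* ip;   // instruction pointer into the bytecode
    Handler[256] table;     // one handler per opcode
    long[16] regs;
}

void opAdd(Vm* vm)
{
    vm.regs[vm.ip[1]] += vm.regs[vm.ip[2]];
    vm.ip += 3;
    // Hypothetical switch-call: reuse this frame and jump straight to the
    // next opcode's handler instead of calling it and returning.
    goto vm.table[*vm.ip](vm);
}

void opHalt(Vm* vm)
{
    // The only point at which the whole dispatch chain actually returns.
}

Each handler ends in a switch-call to the next one, so the stack stays flat no matter how long the bytecode is.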


-- 
Dmitry Olshansky
April 02, 2014
02-Apr-2014 20:52, Daniel N wrote:
>> On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
>>> Try this benchmark comparing various classification schemes:
>>>
>>> bool isIdentifierChar1(ubyte c)
>>> {
>>>    return ((c >= '0' || c == '$') &&
>>>            (c <= '9' || c >= 'A')  &&
>>>            (c <= 'Z' || c >= 'a' || c == '_') &&
>>>            (c <= 'z'));
>>> }
>
> Considering SSE4.2 was released already back in 2008 and was designed to
> accelerate this very use-case...
>
> Maybe it's more beneficial to add a few new code-paths... SSE4.2, NEON,
> etc.

Ideally we'd need:
a) A collection of meaningful primitives (high-level compared to std.simd) that can be accelerated by SIMD, in Phobos, such as searching, bit counting, etc.
b) A way to automatically use whatever is best and supported by the hardware. GCC has had zero-overhead machinery for that since around 4.6 (?); a hand-rolled approximation is sketched below.
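
A hand-rolled approximation of (b) that works today is to resolve the implementation once at startup (a sketch only; the SSE4.2 body is elided and just falls back to the scalar loop, and the names are made up):

import core.cpuid : sse42;

alias CountFn = size_t function(const(ubyte)[] haystack, ubyte needle);

size_t countScalar(const(ubyte)[] haystack, ubyte needle)
{
    size_t n;
    foreach (b; haystack)
        n += (b == needle);
    return n;
}

size_t countSse42(const(ubyte)[] haystack, ubyte needle)
{
    // A real version would use SSE4.2 here; placeholder for the sketch.
    return countScalar(haystack, needle);
}

// Resolved once, before main() runs, based on what the CPU reports.
__gshared CountFn countByte = &countScalar;

shared static this()
{
    if (sse42)
        countByte = &countSse42;
}

Not zero-overhead (every use goes through a function pointer), but it works with today's compilers.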

-- 
Dmitry Olshansky
April 02, 2014
On Tuesday, 1 April 2014 at 22:32:28 UTC, bearophile wrote:
> Eventually this should work, and avoid the nasty cast:
>
> immutable bool[256] tab3 =
>     ubyte
>     .max
>     .iota!"[]"

I can't decide if this part is brilliant or terrible. Clever, either way.
April 02, 2014
On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
> SwitchCallStatement:
>      goto CallExpression;
>
> Semantics:
> Same as that of a function call expression except:
> a) A switch-call may only target a function with the same return type.
> b) Anything below this statement is unreachable, i.e. it is treated like a return.
> c) The stack frame of the current function is overwritten with that of the
> switched-to function. In other words, after a switch-call the stack looks as
> if the switched-to function had been called in the first place.
>
> The last point is what modern tail-call optimizations do when two functions
> tail-recurse into each other: they overwrite stack frames on such tail calls.

How is this different from simply recognizing "return foo(arg);" as being a goto?


> However this only happens with certain optimization switch which IMO makes the
> thing totally unreliable. In fact I even observed that I could implement
> threaded-code interpreter by relying on tail-call optimization with LDC, but
> only when compiling with all optimizations enabled. Unoptimized builds simply
> blowup stack. This "semantics are tied to optimization" is a situation that I
> hate about C/C++ and would hope to address in D.

Doesn't seem terribly different than the "force inline" feature discussion.
April 02, 2014
On 4/2/2014 4:49 AM, monarch_dodra wrote:
> That said, I'm abusing the fact that 50% of your bench is for chars over 0x80.
> If I loop only over the actual ASCII you'd find in text (0x20 - 0x80), then
> those numbers "only" go from 320 msecs to 300 msecs. Only slightly better, but
> still a win.

Surely a better approach would be to do a statistical analysis of character frequency in a representative corpus, and tune the compares that way. But the table lookup offered such a dramatic improvement that I didn't bother.
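
The table variant is essentially just the following (a sketch; the names and the exact initialization in my benchmark may differ):

immutable bool[256] identTab = ()
{
    bool[256] t;
    foreach (c; 0 .. 256)
        t[c] = (c >= '0' && c <= '9') ||
               (c >= 'A' && c <= 'Z') ||
               (c >= 'a' && c <= 'z') ||
               c == '_' || c == '$';
    return t;
}();

bool isIdentifierCharTable(ubyte c)
{
    return identTab[c];   // one load, no branching on the character value
}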

April 02, 2014
On 4/2/2014 12:38 PM, Dmitry Olshansky wrote:
> Ideally we'd need:
> a) A collection of meaningful (high-level compared to std.simd) primitives that
> can be accelerated by SIMD, in Phobos. Such as searching, bit-counting etc.

memchr is already optimized with SIMD, and std.algorithm.find uses memchr. In general, enormous effort has been poured into optimizing the C standard library functions. If an algorithm can be implemented using such functions, we ought to do so.
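
For instance (an illustrative snippet, not from the compiler source):

import std.algorithm.searching : find;
import std.string : representation;

// Finding a single byte in a buffer is the case Phobos can hand off to
// memchr, so it picks up the C library's SIMD work for free.
auto buf = "int x = 42; // comment\nint y;".representation;
auto rest = buf.find(cast(ubyte) '\n');   // slice from the newline onward, or empty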

April 02, 2014
03-Apr-2014 00:39, Walter Bright wrote:
> On 4/2/2014 12:03 PM, Dmitry Olshansky wrote:
>> SwitchCallStatement:
>>      goto CallExpression;
>>
>> Semantics:
>> Same as that of a function call expression except:
>> a) A switch-call may only target a function with the same return type.
>> b) Anything below this statement is unreachable, i.e. it is treated like a return.
>> c) The stack frame of the current function is overwritten with that of the
>> switched-to function. In other words, after a switch-call the stack looks as
>> if the switched-to function had been called in the first place.
>>
>> The last point is what modern tail-call optimizations do when two functions
>> tail-recurse into each other: they overwrite stack frames on such tail calls.
>
> How is this different from simply recognizing "return foo(arg);" as
> being a goto?

If we can alter the semantics of
return foo(arg);
to _always_ be a goto, guaranteeing a jump and reuse of the stack, then I'm all for it.
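
That would make code like the following reliable; today it only survives if the optimizer happens to perform TCO:

// Only safe if "return f(x);" is guaranteed to reuse the caller's frame;
// without that guarantee an unoptimized build overflows the stack.
bool isEven(ulong n) { return n == 0 ? true  : isOdd(n - 1); }
bool isOdd (ulong n) { return n == 0 ? false : isEven(n - 1); }

void main()
{
    assert(isEven(10_000_000));
}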

>> However, this only happens with certain optimization switches, which IMO
>> makes the whole thing totally unreliable. In fact, I observed that I could
>> implement a threaded-code interpreter by relying on tail-call optimization
>> with LDC, but only when compiling with all optimizations enabled; unoptimized
>> builds simply blow up the stack. This "semantics tied to optimization"
>> situation is something I hate about C/C++ and would hope to address in D.
>
> Doesn't seem terribly different than the "force inline" feature discussion.

Yes, it shows the same problem: semantics and optimization often get mixed up. In this case, however, a program would segfault if the relevant optimization was not performed; it wouldn't just be many times slower.

-- 
Dmitry Olshansky