April 07, 2016
On Thu, 07 Apr 2016 10:52:42 +0000, Kai Nacke <kai@redstar.de> wrote:

> On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
> > Then,
> >
> >     void app(int simd)() { ... my fabulous app ... }
> >
> >     int main() {
> >       auto fpu = core.cpuid.getfpu();
> >       switch (fpu) {
> >         case SIMD: app!(SIMD)(); break;
> >         case SIMD4: app!(SIMD4)(); break;
> >         default: error("unsupported FPU"); exit(1);
> >       }
> >     }
> > 
> 
> glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
> 
> Would be awesome to have something similar in druntime/Phobos.
> 
> Regards,
> Kai

Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes

What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?:

http://dpaste.dzfl.pl/0aa81325a26a
April 07, 2016
On Thursday, 7 April 2016 at 10:03:50 UTC, 9il wrote:
>
> This is not true for BLAS based on D.

Perhaps if you provide him a simplified example he might see what you're talking about?
April 07, 2016
On Thursday, 7 April 2016 at 12:35:51 UTC, jmh530 wrote:
> On Thursday, 7 April 2016 at 10:03:50 UTC, 9il wrote:
>>
>> This is not true for BLAS based on D.
>
> Perhaps if you provide him a simplified example he might see what you're talking about?

He knows what I am talking about. This is about architecture/style/concepts. If Walter disagrees with this then nobody can change it.
April 07, 2016
On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
> On Thu, 07 Apr 2016 10:52:42 +0000, Kai Nacke <kai@redstar.de> wrote:
>
>> glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
>> 
>> Would be awesome to have something similar in druntime/Phobos.
>> 
>> Regards,
>> Kai
>
> Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
>
> What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?:
>
> http://dpaste.dzfl.pl/0aa81325a26a

I thought that the ifunc mechanism means an indirect call (i.e. a function pointer is set at the start of the program)? That would be the same as what you are doing, without any performance difference.

https://gcc.gnu.org/wiki/FunctionMultiVersioning
"To keep the cost of dispatching low, the IFUNC mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump ** indirect ** instruction." (emphasis mine)
I looked into this some time ago and did not see a reason to use the ifunc mechanism (which would not be available on Windows). I thought it should be implementable in a library, exactly as you did in your dpaste! :-)  (does `&foo` return `impl`?)


April 07, 2016
On Thu, 07 Apr 2016 13:27:05 +0000, Johan Engelen <j@j.nl> wrote:

> On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
> > On Thu, 07 Apr 2016 10:52:42 +0000, Kai Nacke <kai@redstar.de> wrote:
> > 
> >> glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
> >> 
> >> Would be awesome to have something similar in druntime/Phobos.
> >> 
> >> Regards,
> >> Kai
> >
> > Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
> >
> > What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?:
> >
> > http://dpaste.dzfl.pl/0aa81325a26a
> 
> I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference.
> 
> https://gcc.gnu.org/wiki/FunctionMultiVersioning
> "To keep the cost of dispatching low, the IFUNC mechanism is used
> for dispatching. This makes the call to the dispatcher a one-time
> thing during startup and a call to a function version is a single
> jump ** indirect ** instruction." (emphasis mine)

The simple variant I've posted needs an additional branch on every function call. If we instead initialize the function pointer in a shared static ctor there's indeed no performance difference. The main problem here is that, because of cyclic constructor detection, it will be more difficult to implement a generic template solution.

http://www.airs.com/blog/archives/403
"An alternative to all this linker stuff would be a variable holding a
function pointer. The function could then be written in assembler to do
the indirect jump. The variable would be initialized at program startup
time. The efficiency would be the same. The address of the function
would be the address of the indirect jump, so function pointers would
compare consistently."
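
To make the two variants concrete, here is a minimal sketch. It is not the code from the dpaste link; the function names are hypothetical and both implementations are plain loops standing in for real scalar/SIMD versions. Only `core.cpuid.sse41` is an existing druntime call.

```d
import core.cpuid : sse41;

// Hypothetical implementations; the "SSE4.1" one is only a stand-in.
int sumScalar(const(int)[] a) { int s; foreach (x; a) s += x; return s; }
int sumSse41(const(int)[] a)  { int s; foreach (x; a) s += x; return s; }

// Variant 1: resolve lazily. Every call pays for the null check.
// (The unsynchronized write is a benign race: all threads store the same value.)
__gshared int function(const(int)[]) lazyImpl;
int sumLazy(const(int)[] a)
{
    if (lazyImpl is null)
        lazyImpl = sse41 ? &sumSse41 : &sumScalar;
    return lazyImpl(a);
}

// Variant 2: resolve once in a shared static constructor.
// A call through `sum` is then a single indirect call, no branch.
__gshared int function(const(int)[]) sum;
shared static this()
{
    sum = sse41 ? &sumSse41 : &sumScalar;
}
```

With variant 2, `sum(data)` reads like a normal call, but `&sum` is the address of the variable, not of the selected function.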

> I looked into this some time ago and did not see a reason to use the ifunc mechanism (which would not be available on Windows). I thought it should be implementable in a library, exactly as you did in your dpaste! :-)

> (does `&foo` return `impl`?)

No, &foo will return the address of the wrapper function. I'm not sure if we can solve this. IIRC we can't overload &. Here's the alternative using a constructor which makes the address accessible. The syntax will still be different though:

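void foo1() { }                     // placeholder for the selected implementation

__gshared void function() foo;      // dispatch pointer, called as foo()
shared static this()
{
    foo = &foo1;                    // resolved once at startup
}

void main()
{
    auto varAddr = &foo;            // address of the variable
    auto funcAddr = cast(void*)foo; // the function address
}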
April 07, 2016
On Thursday, 7 April 2016 at 14:46:06 UTC, Johannes Pfau wrote:
> On Thu, 07 Apr 2016 13:27:05 +0000, Johan Engelen <j@j.nl> wrote:
>
>> On Thursday, 7 April 2016 at 11:25:47 UTC, Johannes Pfau wrote:
>> > On Thu, 07 Apr 2016 10:52:42 +0000, Kai Nacke <kai@redstar.de> wrote:
>> > 
>> >> glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
>> >> 
>> >> Would be awesome to have something similar in druntime/Phobos.
>> >> 
>> >> Regards,
>> >> Kai
>> >
>> > Available in GCC as the 'ifunc' attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
>> >
>> > What do you mean by 'something similar in druntime/phobos'? A platform independent (slightly slower) variant?:
>> >
>> > http://dpaste.dzfl.pl/0aa81325a26a
>> 
>> I thought that the ifunc mechanism means an indirect call (i.e. a function ptr is set at the start of the program) ? That would be the same as what you are doing without performance difference.
>> 
>> https://gcc.gnu.org/wiki/FunctionMultiVersioning
>> "To keep the cost of dispatching low, the IFUNC mechanism is used
>> for dispatching. This makes the call to the dispatcher a one-time
>> thing during startup and a call to a function version is a single
>> jump ** indirect ** instruction." (emphasis mine)
>
> The simple variant I've posted needs an additional branch on every function call. If we instead initialize the function pointer in a shared static ctor there's indeed no performance difference.

Yep exactly.
For @target multiversioned functions, I thought one would want to create one static ctor that calls cpuid once and sets all function ptrs of that module.
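
Done by hand in a library, that could look roughly like the sketch below. The `Simd` enum, the `dot`/`scale` functions and `select` are made-up names with placeholder bodies; a compiler-supported `@target` would generate the per-version bodies and this table itself. `core.cpuid.avx2` and `core.cpuid.sse41` are the real druntime queries.

```d
import core.cpuid : avx2, sse41;

enum Simd { sse2, sse41, avx2 }   // hypothetical feature levels

// Hypothetical multiversioned functions; bodies are placeholders.
float dot(Simd level)(const(float)[] a, const(float)[] b)
{
    float s = 0;
    foreach (i; 0 .. a.length) s += a[i] * b[i];
    return s;
}
void scale(Simd level)(float[] a, float k) { foreach (ref x; a) x *= k; }

// One dispatch pointer per multiversioned function of this module.
__gshared float function(const(float)[], const(float)[]) dotImpl;
__gshared void function(float[], float) scaleImpl;

// Point every dispatch pointer at the given version.
void select(Simd level)()
{
    dotImpl = &dot!level;
    scaleImpl = &scale!level;
}

// Query the CPU once and set all pointers together.
shared static this()
{
    if (avx2)       select!(Simd.avx2)();
    else if (sse41) select!(Simd.sse41)();
    else            select!(Simd.sse2)();
}
```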

>> (does `&foo` return `impl`?)
>
> No, &foo will return the address of the wrapper function. I'm not sure if we can solve this. IIRC we can't overload &.

OK. Well, the @target multiversioning would need compiler support anyway, and it is easy to do something slightly different for `&foo` when foo is a multiversioned function.

This should be fairly easy to implement in LDC, with some smarts needed in ordering and selecting the best function version.

April 08, 2016
On 7 April 2016 at 13:27, Walter Bright via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
>>>
>>> 1. This has been characterized as a blocker, it is not, as it does not
>>> impede writing code that takes advantage of various SIMD code generation
>>> at compile time.
>>
>>
>> It's sufficiently blocking that I have not felt like working any
>> further without this feature present. I can't feel like it 'works' or
>> it's 'done', until I can demonstrate this functionality.
>> Perhaps we can call it a psychological blocker, and I am personally
>> highly susceptible to those.
>
>
> I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround:
>
>    gdc -simd=AFX foo.d
>
> becomes:
>
>    gdc -simd=AFX -version=AFX foo.d
>
> It's even simpler if you use a makefile variable:
>
>     FPU=AFX
>
>     gdc -simd=$(FPU) -version=$(FPU)

Sure. I've done this in my own tests. I just never published that anyone else should do it.


> You also mentioned being blocked (i.e. demotivated) for *years* by this, and I assume that may be because we don't care about SIMD support. That would be wrong, as I care a lot about it. But I had no idea you were having a problem with this, as you did not file any bug reports. Suffering in silence is never going to work :-)

There have been threads, but sure, I could have done more to push it along.
Motivation is a complex and not particularly logical emotion; there are
a lot of factors feeding into it.

Not least of which is that I haven't been working in games for a
while, which means I haven't depended on it for my work. Don't take
that to read I have lost interest in the support, just that the
pressure is reduced.
You'll have noticed that C++ interaction is my recent focus, since
that's directly related to my current day-job, and the path that I
need to solve now to get D into my work.
That's consuming almost 100% of my D-time-allocation... if I could
ever manage to just kick that goal, it might free me up >_< .. I keep
on trying.


>>> 2. I'm not sure these global settings are the best approach, especially
>>> if one is writing applications that dynamically adjust based on the CPU
>>> the user is running on.
>>
>>
>> They are necessary to provide a baseline. It is typical when building code that you specify a min-spec. This is what's used by default throughout the application.
>
>
> It is not necessary to do it that way. Call core.cpuid to determine what is available at runtime, and issue an error message if not. There is no runtime cost to that. In fact, it has to be done ANYWAY, as it isn't user friendly to seg fault trying to execute instructions that do not exist.

The author still needs to be able to control at compile-time what
min-spec shall be supported.
I agree the check is valuable, but I think it's an unrelated detail.
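
For reference, the startup check itself is small. This is only a sketch: `MinSpec`, `minSpec`, `cpuMeetsMinSpec` and `requireMinSpec` are hypothetical names, and the min-spec is hard-coded here precisely because there is currently no way to pick it up from the compiler settings. `core.cpuid.sse2` and `core.cpuid.sse41` are the existing druntime queries.

```d
import core.cpuid : sse2, sse41;
import core.stdc.stdlib : exit;
import std.stdio : stderr;

// Hypothetical: the min-spec this build targets, chosen by the author.
enum MinSpec { sse2, sse41 }
enum minSpec = MinSpec.sse41;

// True if the running CPU supports the compile-time min-spec.
bool cpuMeetsMinSpec()
{
    final switch (minSpec)
    {
        case MinSpec.sse2:  return sse2;
        case MinSpec.sse41: return sse41;
    }
}

// Fail with a message instead of faulting on an unsupported instruction.
void requireMinSpec()
{
    if (!cpuMeetsMinSpec())
    {
        stderr.writeln("this build requires a CPU with ", minSpec);
        exit(1);
    }
}
```

Calling `requireMinSpec()` first thing in `main` covers the user-friendliness part; it does not help the author tell the compiler what `minSpec` should be.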


>> Runtime selection is not practical in a broad sense. Emitting small
>> fragments of SIMD here and there will probably take a loss if they are
>> all surrounded by a runtime selector. SIMD is all about pipelining,
>> and runtime branches on SIMD version are the antithesis of good SIMD
>> usage; they can't be applied for small-scale deployment.
>> In my experience, runtime selection is desirable for large scale
>> instantiations at an outer level of the work loop. I've tried to
>> design this intent in my library, by making each simd API capable of
>> receiving SIMD version information via template arg, and within the
>> library, the version is always passed through to dependent calls.
>> The idea is, if you follow this pattern, propagating a SIMD version
>> template arg through to your outer function, then you can instantiate
>> your higher-level work function for any number of SIMD feature
>> combinations you feel is appropriate.
>
>
> Doing it at a high level is what I meant, not for each SIMD code fragment.

Sure, so you agree we need a mechanism for the author to tune the default selection then? Or are you suggesting SSE2 is 'fine' as a default? (ie, that is what is implied by D_SIMD)


>> Naturally, this process requires a default, otherwise this usage baggage will cloud the API everywhere (rather than in the few cases where a developer specifically wants to make use of it), and many developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1 in my applications, xbox developers would choose AVX1, it's very application/target-audience specific, but SSE2 is the only reasonable selection if we are not to accept a hint from the command line.
>
>
> I still don't see how it is a problem to do the switch at a high level.

It's not a problem, that's exactly my design, but it's not a universal solution.

> Heck, you could put the ENTIRE ENGINE inside a template, have a template parameter be the instruction set, and instantiate the template for each supported instruction set.
>
> Then,
>
>     void app(int simd)() { ... my fabulous app ... }
>
>     int main() {
>       auto fpu = core.cpuid.getfpu();
>       switch (fpu) {
>         case SIMD: app!(SIMD)(); break;
>         case SIMD4: app!(SIMD4)(); break;
>         default: error("unsupported FPU"); exit(1);
>       }
>     }

Sure, I've designed for this specifically, but it's not practical to
wind this all the way to the top of the stack.
Some hot code will make use of this pattern, but small fragments
that appear throughout the code don't want to have this baggage
applied. They should just work with the developer's deliberately
selected default. It's not worth runtime selection on small
deployments. You will likely end up with numerous helper functions,
which when involved in the runtime-selected loops, would have
different versions generated appropriately, but when these helper
functions appear on their own, they would want to use a sensible
default.
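
A rough sketch of the pattern described above, with made-up names (`Simd`, `defaultSimd`, `madd`, `accumulate`, `accumulateBest`) and a placeholder body where version-specific intrinsics would go. It is not taken from the actual library; only `core.cpuid.sse41` is an existing call.

```d
import core.cpuid : sse41;

enum Simd { sse2, sse41 }            // hypothetical version enum
enum Simd defaultSimd = Simd.sse2;   // the build's min-spec, assumed here

// Small helper: uses the default unless a caller passes a version through.
float madd(Simd ver = defaultSimd)(float a, float b, float c)
{
    return a * b + c;   // placeholder for version-specific intrinsics
}

// Hot loop: takes the version and forwards it to every helper it calls.
float accumulate(Simd ver = defaultSimd)(const(float)[] xs, float k)
{
    float acc = 0;
    foreach (x; xs) acc = madd!ver(x, k, acc);
    return acc;
}

// Runtime selection happens once, outside the loop.
float accumulateBest(const(float)[] xs, float k)
{
    return sse41 ? accumulate!(Simd.sse41)(xs, k)
                 : accumulate!(Simd.sse2)(xs, k);
}
```

A stray `madd(a, b, c)` elsewhere in the code compiles against `defaultSimd` with no selector around it, while `accumulateBest` gets a separately instantiated `madd` for each version it dispatches to.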

>> I've done it with a template arg because it can be manually
>> propagated, and users can extrapolate the pattern into their outer
>> work functions, which can then easily have multiple versions
>> instantiated for runtime selection.
>> I think it's also important to mangle it into the symbol name for the
>> reasons I mention above.
>
>
> Note that version identifiers are not usable directly as template parameters. You'd have to set up a mapping.

I guess you haven't looked at my code, but yes, it's all mapped to enums used by the templates. The versions would assign a constant used as the template's default arg.
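
Roughly like this, as a simplified sketch rather than the actual library code. The version identifiers `SIMD_AVX` and `SIMD_SSE41`, the `Simd` enum and `describe` are hypothetical; the identifiers would be set with `-version=` (or `-fversion=` for gdc) until the compiler can provide them.

```d
enum Simd { sse2, sse41, avx }   // hypothetical enum used by the templates

// Version identifiers cannot be template arguments, so map them to a
// constant once; with no flag given, the default falls back to SSE2.
version (SIMD_AVX)        enum Simd defaultSimd = Simd.avx;
else version (SIMD_SSE41) enum Simd defaultSimd = Simd.sse41;
else                      enum Simd defaultSimd = Simd.sse2;

// Any API takes the version as a template argument with that default.
string describe(Simd ver = defaultSimd)()
{
    static if (ver == Simd.avx)        return "avx";
    else static if (ver == Simd.sse41) return "sse41";
    else                               return "sse2";
}
```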
April 07, 2016
On 4/7/2016 3:52 AM, Kai Nacke wrote:
> On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
>> Then,
>>
>>     void app(int simd)() { ... my fabulous app ... }
>>
>>     int main() {
>>       auto fpu = core.cpuid.getfpu();
>>       switch (fpu) {
>>         case SIMD: app!(SIMD)(); break;
>>         case SIMD4: app!(SIMD4)(); break;
>>         default: error("unsupported FPU"); exit(1);
>>       }
>>     }
>>
>
> glibc has a special mechanism for resolving the called function during loading.
> See the section on the GNU Indirect Function Mechanism here:
> https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
>
>
> Would be awesome to have something similar in druntime/Phobos.

We already have core.cpuid, which covers most of what that article talks about. The indirect function thing appears to be a way to selectively load from various dlls. But that can be done anyway with core.cpuid and dynamic dll loading, so I'm not sure what advantage it brings.

April 07, 2016
On 4/7/2016 5:27 PM, Manu via Digitalmars-d wrote:
> You'll have noticed that C++ interaction is my recent focus, since
> that's directly related to my current day-job, and the path that I
> need to solve now to get D into my work.

We recognize C++ interoperability to be a key feature of D. I hope you like the support you got with the C++ virtual functions! I got bogged down recently with getting the C++ exception handling support working better, hopefully we've turned the corner on that one. I'd hoped to be further along at the moment with C++ interoperability (but it's always going to be a work in progress).


> That's consuming almost 100% of my D-time-allocation... if I could
> ever manage to just kick that goal, it might free me up >_< .. I keep
> on trying.

I do appreciate your efforts in this direction.


>> Doing it at a high level is what I meant, not for each SIMD code fragment.
> Sure, so you agree we need a mechanism for the author to tune the
> default selection then?

From the command line, probably not. I like the pragma thing better.


> Or are you suggesting SSE2 is 'fine' as a default? (ie, that is what is implied by D_SIMD)

It is fine as a default, as it is the baseline minimum machine D is expecting.

April 07, 2016
On 4/7/2016 3:15 AM, Johannes Pfau wrote:
> The problem is that march=x can set more than one
> feature flag. So instead of
>
> gdc -march=armv7-a
> you have to do
> gdc -march=armv7-a -fversion=ARM_FEATURE_CRC32
> -fversion=ARM_FEATURE_UNALIGNED ...
>
> You have to know exactly which features are supported for a CPU.
> Essentially you have to duplicate the CPU<=>feature database already
> present in GCC (and likely LLVM too) in your Makefile. And you'll need
> -march=armv7-a anyway to make sure the GCC codegen can use these
> features as well.
>
> So this issue is not a blocker, but what you propose is a workaround at
> best, not a solution.


Having a veritable blizzard of these predefined versions, with old ones constantly being obsoleted and new ones appearing, seems like a serious problem when trying to standardize the language.