April 06, 2016
On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
>> 1. This has been characterized as a blocker, it is not, as it does not
>> impede writing code that takes advantage of various SIMD code generation at
>> compile time.
>
> It's sufficiently blocking that I have not felt like working any
> further without this feature present. I can't feel like it 'works' or
> it's 'done', until I can demonstrate this functionality.
> Perhaps we can call it a psychological blocker, and I am personally
> highly susceptible to those.

I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround:

   gdc -simd=AFX foo.d

becomes:

   gdc -simd=AFX -version=AFX foo.d

It's even simpler if you use a makefile variable:

    FPU=AFX

    gdc -simd=$(FPU) -version=$(FPU)
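For readers unfamiliar with the mechanism: a `-version=X` flag defines a version identifier that `version(...)` blocks test at compile time (GDC actually spells the flag `-fversion=X`). A minimal sketch, with AFX as a placeholder identifier as in the example above:

```d
// foo.d -- build with: gdc -simd=AFX -fversion=AFX foo.d
import std.stdio;

string codePath()
{
    version (AFX)
        return "AFX";      // AFX-specific SIMD path, selected at compile time
    else
        return "baseline"; // portable fallback
}

void main()
{
    writeln("compiled for: ", codePath());
}
```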

You also mentioned being blocked (i.e. demotivated) for *years* by this, and I assume that may be because you believe we don't care about SIMD support. That would be wrong, as I care a lot about it. But I had no idea you were having a problem with this, as you did not file any bug reports. Suffering in silence is never going to work :-)


>> 2. I'm not sure these global settings are the best approach, especially if
>> one is writing applications that dynamically adjusts based on the CPU the
>> user is running on.
>
> They are necessary to provide a baseline. It is typical when building
> code that you specify a min-spec. This is what's used by default
> throughout the application.

It is not necessary to do it that way. Call std.cpuid to determine what is available at runtime, and issue an error message if not. There is no runtime cost to that. In fact, it has to be done ANYWAY, as it isn't user friendly to seg fault trying to execute instructions that do not exist.
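That runtime check is a few lines with core.cpuid (the druntime module that superseded the old std.cpuid); a sketch:

```d
import core.cpuid : sse42, avx;
import std.stdio;

void main()
{
    // Query the running CPU once at startup and refuse to continue,
    // rather than seg faulting later on an illegal instruction.
    if (!sse42)
    {
        stderr.writeln("error: this program requires SSE4.2");
        return;
    }
    writeln("SSE4.2 present; AVX: ", avx ? "yes" : "no");
}
```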


> Runtime selection is not practical in a broad sense. Emitting small
> fragments of SIMD here and there will probably take a loss if they are
> all surrounded by a runtime selector. SIMD is all about pipelining,
> and runtime branches on SIMD version are antithesis to good SIMD
> usage; they can't be applied for small-scale deployment.
> In my experience, runtime selection is desirable for large scale
> instantiations at an outer level of the work loop. I've tried to
> design this intent in my library, by making each simd API capable of
> receiving SIMD version information via template arg, and within the
> library, the version is always passed through to dependent calls.
> The Idea is, if you follow this pattern; propagating a SIMD version
> template arg through to your outer function, then you can instantiate
> your higher-level work function for any number of SIMD feature
> combinations you feel is appropriate.

Doing it at a high level is what I meant, not for each SIMD code fragment.


> Naturally, this process requires a default, otherwise this usage
> baggage will cloud the API everywhere (rather than in the few cases
> where a developer specifically wants to make use of it), and many
> developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1
> in my applications, xbox developers would choose AVX1, it's very
> application/target-audience specific, but SSE2 is the only reasonable
> selection if we are not to accept a hint from the command line.

I still don't see how it is a problem to do the switch at a high level. Heck, you could put the ENTIRE ENGINE inside a template, have a template parameter be the instruction set, and instantiate the template for each supported instruction set.

Then,

    void app(int simd)() { ... my fabulous app ... }

    int main() {
      auto fpu = core.cpuid.getfpu();
      switch (fpu) {
        case SIMD: app!(SIMD)(); break;
        case SIMD4: app!(SIMD4)(); break;
        default: error("unsupported FPU"); exit(1);
      }
    }
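The fragment above is pseudocode (core.cpuid has no getfpu, and SIMD/SIMD4 are stand-ins). A runnable rendering of the same idea, using an ordinary enum as the template argument and the real core.cpuid feature properties:

```d
import core.cpuid : avx, sse42;
import std.stdio;

enum Fpu { baseline, sse42, avx }

void app(Fpu fpu)()
{
    // ... my fabulous app, one full copy per instantiation ...
    writeln("engine instantiated for ", fpu);
}

void main()
{
    // One branch per supported instruction set, chosen once at startup.
    if (avx)        app!(Fpu.avx)();
    else if (sse42) app!(Fpu.sse42)();
    else            app!(Fpu.baseline)();
}
```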

> I've done it with a template arg because it can be manually
> propagated, and users can extrapolate the pattern into their outer
> work functions, which can then easily have multiple versions
> instantiated for runtime selection.
> I think it's also important to mangle it into the symbol name for the
> reasons I mention above.

Note that version identifiers are not usable directly as template parameters. You'd have to set up a mapping.
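One way to set up that mapping (a sketch; the AFX_* identifiers are hypothetical): fold each version identifier into an enum value once, in one module, and use the enum everywhere a template argument is needed.

```d
// Version identifiers only exist inside version() blocks, so
// translate them into an ordinary compile-time value once.
enum Simd { sse2, sse42, avx }

version (AFX_AVX)        enum Simd buildSimd = Simd.avx;
else version (AFX_SSE42) enum Simd buildSimd = Simd.sse42;
else                     enum Simd buildSimd = Simd.sse2;

// buildSimd is now a normal value, usable as a template argument
// and therefore mangled into every instantiated symbol name.
void kernel(Simd simd)() { /* SIMD work parameterized on simd */ }

void run() { kernel!buildSimd(); }
```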

And yes, if mangled in as part of the symbol, the linker won't pick the wrong one.
April 06, 2016
On 4/6/2016 7:25 PM, Manu via Digitalmars-d wrote:
> Sure, but it's an ongoing maintenance task, constantly requiring
> population with metadata for new processors that become available.
> Remember, most processors are arm processors, and there are like 20
> manufacturers of arm chips, and many of those come in a series of
> minor variations with/without sub-features present, and in a lot of
> cases, each permutation of features attached to random manufacturers
> arm chip 'X' doesn't actually have a name to describe it. It's also
> completely impractical to declare a particular arm chip by name when
> compiling for arm. It's a sloppy relationship comparing intel and AMD
> let alone the myriad of arm chips available.
> TL;DR, defining architectures with an intel-centric naming convention
> is a very bad idea.

You're not making a good case for a standard language defined set of definitions for all these (they'll always be obsolete, inadequate and probably wrong, as you point out).

April 07, 2016
On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
>
> I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround:
>
>    gdc -simd=AFX foo.d
>
> becomes:
>
>    gdc -simd=AFX -version=AFX foo.d
>
> It's even simpler if you use a makefile variable:
>
>     FPU=AFX
>
>     gdc -simd=$(FPU) -version=$(FPU)

    ldc -mcpu=native

becomes:

     ????
>
> I still don't see how it is a problem to do the switch at a high level. Heck, you could put the ENTIRE ENGINE inside a template, have a template parameter be the instruction set, and instantiate the template for each supported instruction set.
>
> Then,
>
>     void app(int simd)() { ... my fabulous app ... }
>
>     int main() {
>       auto fpu = core.cpuid.getfpu();
>       switch (fpu) {
>         case SIMD: app!(SIMD)(); break;
>         case SIMD4: app!(SIMD4)(); break;
>         default: error("unsupported FPU"); exit(1);
>       }
>     }

1. Executable size will grow with every instruction set release
2. BLAS already has a big executable size
And the main one:
3. This would not solve the problem for a generic BLAS implementation for Phobos at all! How would you force the compiler to USE and NOT USE specific vector permutations, for example, in the same object file? Yes, I know, DMD has no permutations. No, I don't want to write a permutation for each architecture. Why? I can write simple D code that generates a single LLVM IR code path that works for ALL targets!

Best regards,
Ilya

April 07, 2016
On 4/7/2016 12:59 AM, 9il wrote:
> 1. Executable size will grow with every instruction set release

Yes, and nobody cares. With virtual memory and demand loading, unexecuted code will never be loaded off of disk and will never consume memory space. And with a 64 bit address space, there will never be a shortage of virtual address space.

It will consume space on your 1 terabyte drive. Meh. I have several of those drives, and what consumes space is video, not code binaries :-)



> 3. This would not solve the problem for a generic BLAS implementation for Phobos
> at all! How would you force the compiler to USE and NOT USE specific vector
> permutations, for example, in the same object file? Yes, I know, DMD has no
> permutations. No, I don't want to write a permutation for each architecture.
> Why? I can write simple D code that generates a single LLVM IR code path that
> works for ALL targets!

There's no reason for the compiler to make target CPU information available when writing generic code.
April 07, 2016
Am Thu, 7 Apr 2016 12:25:03 +1000
schrieb Manu via Digitalmars-d <digitalmars-d@puremagic.com>:

> On 6 April 2016 at 23:26, 9il via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> > On Wednesday, 6 April 2016 at 12:40:04 UTC, Manu wrote:
> >>
> >> On 6 April 2016 at 07:41, Johan Engelen via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> >>>
> >>> [...]
> >>
> >>
> >> With respect to SIMD, knowing a processor model like 'broadwell' is not helpful, since we really want to know 'sse4'. If we know the processor model, then we need to keep a compile-time table in our code somewhere of every possible cpu ever known and its associated feature set. Knowing the feature we're interested in is what we need.
> >
> >
> > Yes, however this can be implemented in a special Phobos module. So compilers would need less work. --Ilya
> 
> Sure, but it's an ongoing maintenance task, constantly requiring
> population with metadata for new processors that become available.
> Remember, most processors are arm processors, and there are like 20
> manufacturers of arm chips, and many of those come in a series of
> minor variations with/without sub-features present, and in a lot of
> cases, each permutation of features attached to random manufacturers
> arm chip 'X' doesn't actually have a name to describe it. It's also
> completely impractical to declare a particular arm chip by name when
> compiling for arm. It's a sloppy relationship comparing intel and AMD
> let alone the myriad of arm chips available.
> TL;DR, defining architectures with an intel-centric naming convention
> is a very bad idea.

GCC already keeps a cpu <=> feature mapping (after all it needs to know what features it can use when you specify -mcpu) so for GDC exposing available features isn't more difficult than exposing the CPU type.

I'm not sure if you can actually enable/disable CPU features manually without -mcpu?

However, available features and even the type used to describe the CPU are completely architecture specific in GCC. This means for GDC we have to write custom code for every supported architecture. (We already have to do this for version(Architecture) though).



FYI this is handled in the gcc/config subsystem: https://github.com/gcc-mirror/gcc/tree/master/gcc/config

#defines for C/ARM: arm_cpu_builtins in https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm-c.c (__ARM_NEON__ etc)

As you can see, the only common requirement for backend architectures is to call def_or_undef_macro. This means we have to modify the gcc/config files and write replacements for arm_cpu_builtins and similar functions.

Known ARM cores and feature sets: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm-cores.def


I guess every backend architecture has to provide cpu names for -mcpu, so that's probably the one thing we could expose to D for all architectures. (Names are of course GCC specific, but I guess LLVM should use compatible names.) This is less work to implement in GDC, but you'd have to duplicate the GCC feature table in Phobos. OTOH, standardizing the names and available feature flags means somebody with knowledge in that area has to write down a spec.



TLDR:
If required we can always expose compiler specific versions
(GNU_NEON/LDC_NEON) even without DMD approval/integration. This should
be coordinated with LDC though. Somebody has to make a list of needed
identifiers, preferably mentioning the matching C macros.

Things get much more complicated if you need feature flags not currently used by / present in GCC.
April 07, 2016
On Thursday, 7 April 2016 at 09:41:06 UTC, Walter Bright wrote:
> On 4/7/2016 12:59 AM, 9il wrote:
>> 1. Executable size will grow with every instruction set release
>
> Yes, and nobody cares. With virtual memory and demand loading, unexecuted code will never be loaded off of disk and will never consume memory space. And with a 64 bit address space, there will never be a shortage of virtual address space.
>
> It will consume space on your 1 terabyte drive. Meh. I have several of those drives, and what consumes space is video, not code binaries :-)
>

What about a 1 GB 2D game for a phone, or maybe a watch?

>
>> 3. This would not solve the problem for a generic BLAS implementation for Phobos
>> at all! How would you force the compiler to USE and NOT USE specific vector
>> permutations, for example, in the same object file? Yes, I know, DMD has no
>> permutations. No, I don't want to write a permutation for each architecture.
>> Why? I can write simple D code that generates a single LLVM IR code path that
>> works for ALL targets!
>
> There's no reason for the compiler to make target CPU information available when writing generic code.

This is not true for a BLAS based on D. You don't want to see the opportunities. The end result of this dogmatic decision is that code will be slower for DMD, while LDC and GDC will implement the required simple features anyway. I just wanted to write fast code for DMD too.
April 07, 2016
Am Wed, 6 Apr 2016 17:42:30 -0700
schrieb Walter Bright <newshound2@digitalmars.com>:

> On 4/6/2016 5:36 AM, Manu via Digitalmars-d wrote:
> > But at very least, the important detail is that the version ID's are standardised and shared among all compilers.
> 
> It's a reasonable suggestion; some points:
> 
> 1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.
> 
> 2. I'm not sure these global settings are the best approach, especially if one is writing applications that dynamically adjusts based on the CPU the user is running on. The main trouble comes about when different modules are compiled with different settings. What happens with template code generation, when the templates are pulled from different modules? What happens when COMDAT functions are generated? (The linker picks one arbitrarily and discards the others.) Which settings wind up in the executable will be not easily predictable.
> 

That's my #1 argument why '-version' is dangerous and 'static if' is better ;-) If you've got a version() block in a template and compile two modules using the same template with different -version flags you'll have exactly that problem. Have an enum myFlag = x; in a config module + static if => problem solved.
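A sketch of that config-module pattern (file layout and names hypothetical):

```d
// config.d -- the one place the build-wide setting is defined
module config;
enum bool useAvx = false;

// engine.d -- every importer sees the same value, so templates
// instantiated from different modules cannot disagree the way
// per-file -version flags can.
module engine;
import config : useAvx;

void process()
{
    static if (useAvx)
    {
        // AVX code path
    }
    else
    {
        // baseline code path
    }
}
```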

The problem isn't having global settings, the problem is having to manually specify the same global setting for every source file.
April 07, 2016
Am Wed, 6 Apr 2016 20:27:31 -0700
schrieb Walter Bright <newshound2@digitalmars.com>:

> On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
> >> 1. This has been characterized as a blocker, it is not, as it does not impede writing code that takes advantage of various SIMD code generation at compile time.
> >
> > It's sufficiently blocking that I have not felt like working any further without this feature present. I can't feel like it 'works' or it's 'done', until I can demonstrate this functionality. Perhaps we can call it a psychological blocker, and I am personally highly susceptible to those.
> 
> I can understand that it might be demotivating for you, but that is not a blocker. A blocker has no reasonable workaround. This has a trivial workaround:
> 
>     gdc -simd=AFX foo.d
> 
> becomes:
> 
>     gdc -simd=AFX -version=AFX foo.d
> 

The problem is that -march=x can set more than one feature flag. So instead of

    gdc -march=armv7-a

you have to do

    gdc -march=armv7-a -fversion=ARM_FEATURE_CRC32 -fversion=ARM_FEATURE_UNALIGNED ...

So you have to know exactly which features are supported for a CPU. Essentially you have to duplicate the CPU<=>feature database already present in GCC (and likely LLVM too) in your Makefile. And you'll need -march=armv7-a anyway to make sure the GCC codegen can use these features as well.

So this issue is not a blocker, but what you propose is a workaround at best, not a solution.
April 07, 2016
Am Thu, 7 Apr 2016 02:41:06 -0700
schrieb Walter Bright <newshound2@digitalmars.com>:

> > 3. This would not solve the problem for a generic BLAS implementation for Phobos at all! How would you force the compiler to USE and NOT USE specific vector permutations, for example, in the same object file? Yes, I know, DMD has no permutations. No, I don't want to write a permutation for each architecture. Why? I can write simple D code that generates a single LLVM IR code path that works for ALL targets!
> 
> There's no reason for the compiler to make target CPU information available when writing generic code.

Actually for GDC/GCC you can't even write functions using certain SIMD stuff as 'generic' code. Unless you use -mavx or -march the builtins are not exposed to user code. IIRC the compiler even complains about inline ASM if you use unsupported instructions.

You also can't always compile with the 'biggest' feature set, as GCC might use these features in codegen.

TLDR;
For GCC/GDC you will have to use target flags / @attribute(target) to
mix feature sets.
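For reference, the per-function form looks roughly like this in GDC (hedged: the module has been spelled gcc.attribute in older releases and gcc.attributes in newer ones; LDC offers a similar @target in ldc.attributes):

```d
import gcc.attributes : attribute;

// The attribute widens the feature set for this one function only,
// so a single object file can mix AVX and baseline code paths.
@attribute("target", "avx")
void sumAvx(const float[] a) { /* compiler may emit AVX here */ }

void sumBaseline(const float[] a) { /* baseline codegen only */ }
```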
April 07, 2016
On Thursday, 7 April 2016 at 03:27:31 UTC, Walter Bright wrote:
> Then,
>
>     void app(int simd)() { ... my fabulous app ... }
>
>     int main() {
>       auto fpu = core.cpuid.getfpu();
>       switch (fpu) {
>         case SIMD: app!(SIMD)(); break;
>         case SIMD4: app!(SIMD4)(); break;
>         default: error("unsupported FPU"); exit(1);
>       }
>     }
>

glibc has a special mechanism for resolving the called function during loading. See the section on the GNU Indirect Function Mechanism here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Optimized%20Libraries

Would be awesome to have something similar in druntime/Phobos.
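Until then, the effect can be approximated in library code (a sketch, not glibc's loader-level ifunc): resolve a function pointer once at module initialization, so call sites pay only an indirect call, much like an ifunc-patched PLT slot.

```d
import core.cpuid : avx;

alias Kernel = void function(float[]);

void kernelAvx(float[] a)      { /* AVX implementation */ }
void kernelBaseline(float[] a) { /* portable implementation */ }

// Resolved exactly once, before main() runs.
__gshared Kernel kernel;

shared static this()
{
    kernel = avx ? &kernelAvx : &kernelBaseline;
}
```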

Regards,
Kai