Any usable SIMD implementation?

On 4/13/2016 3:58 AM, Marco Leise wrote:
> How about this style as an alternative?:
>
> immutable bool mmx;
> immutable bool hasPopcnt;
>
> shared static this()
> {
>     import gcc.builtins;
>     mmx = __builtin_cpu_supports("mmx") > 0;
>     hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
> }
>

Please do not invent an alternative interface, use the one in core.cpuid:

http://dlang.org/phobos/core_cpuid.html#.mmx

On Wed, 13 Apr 2016 04:14:48 -0700, Walter Bright <newshound2@digitalmars.com> wrote:
> On 4/13/2016 3:58 AM, Marco Leise wrote:
> > How about this style as an alternative?:
> >
> > immutable bool mmx;
> > immutable bool hasPopcnt;
> >
> > shared static this()
> > {
> >     import gcc.builtins;
> >     mmx = __builtin_cpu_supports("mmx") > 0;
> >     hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
> > }
> >
>
> Please do not invent an alternative interface, use the one in core.cpuid:
>
> http://dlang.org/phobos/core_cpuid.html#.mmx

Yes, they are all @property, and substituting direct access to the globals will work around GDC's lack of cross-module inlining. Otherwise these feature checks, which might be used in hot code, are more costly than they should be. I hate when things get in the way of efficiency. :)

-- 
Marco

On 4/13/2016 5:47 AM, Marco Leise wrote:
> Yes, they are all @property and a substitution with direct
> access to the globals will work around GDC's lack of
> cross-module inlining. Otherwise these feature checks which
> might be used in hot code, are more costly than they should be.
> I hate when things get in the way of efficiency. :)

It doesn't need to be efficient, because such checks should be done at a higher level in the program's logic, not in low-level code. Even so, the program could cache the result of the call.
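
A minimal sketch of the caching idea, assuming an x86 target where `core.cpuid` is available. The names `useSse2`, `sumScalar`, `sumSse2`, and `sum` are hypothetical, and the "SSE2" kernel is only a placeholder that forwards to the scalar one. The point is where the check lives: the property is queried once at startup, and the branch happens once per call of the outer routine, not per element.

```d
import core.cpuid : sse2;

// Cached once per process; later reads are plain loads of an immutable global.
immutable bool useSse2;

shared static this()
{
    useSse2 = sse2; // core.cpuid property, queried a single time
}

// Hypothetical kernels (names are illustrative only).
float sumScalar(const(float)[] a) pure nothrow @nogc @safe
{
    float s = 0;
    foreach (x; a)
        s += x;
    return s;
}

float sumSse2(const(float)[] a) pure nothrow @nogc @safe
{
    // Placeholder: a real version would use core.simd or inline asm.
    return sumScalar(a);
}

// The feature check sits here, once per call, not inside the inner loop.
float sum(const(float)[] a) nothrow @nogc @safe
{
    return useSse2 ? sumSse2(a) : sumScalar(a);
}
```

Callers use `sum(data)` and never touch the CPUID interface themselves, so the public interface stays the one in core.cpuid.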

On 13 April 2016 at 13:14, Walter Bright via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On 4/13/2016 3:58 AM, Marco Leise wrote:
>>
>> How about this style as an alternative?:
>>
>> immutable bool mmx;
>> immutable bool hasPopcnt;
>>
>> shared static this()
>> {
>>     import gcc.builtins;
>>     mmx = __builtin_cpu_supports("mmx") > 0;
>>     hasPopcnt = __builtin_cpu_supports("popcnt") > 0;
>> }
>>
>
> Please do not invent an alternative interface, use the one in core.cpuid:
>
> http://dlang.org/phobos/core_cpuid.html#.mmx

An alternative interface needs to be invented anyway for other CPUs.

On 4/14/2016 1:21 AM, Iain Buclaw via Digitalmars-d wrote:
> An alternative interface needs to be invented anyway for other CPUs.

That would be fine. But there is no reason to redo core.cpuid for x86 machines.

On Sunday, 3 April 2016 at 07:39:00 UTC, Manu wrote:
> On 3 April 2016 at 16:14, 9il via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>>
>> Is it possible to introduce compile time information about target platform? I am working on BLAS from scratch implementation. And it is no hope to create something useable without CT information about target.
>>
>> Best regards,
>> Ilya
>
> My SIMD implementation has been blocked on that for years too.
> I need to know the SIMD level flags passed to the compiler at least, and DMD needs to introduce the concept.

https://github.com/ldc-developers/ldc/pull/1434

On Tuesday, 12 April 2016 at 10:55:18 UTC, xenon325 wrote:
>
> Have you seen how GCC's function multiversioning [1] ?
>

I've been thinking about the gcc multiversioning since you mentioned it previously.

I keep thinking about how the optimal algorithm for something like matrix multiplication depends on the size of the matrices.

For instance, you might do something for very small matrices that just relies on one processor, then you add in SIMD as the size grows, then you add in multiple CPUs, then you add in the GPU (or maybe you add before CPUs), then you add in multiple computers.

I don't know how some of those choices would get made at compile time for dynamic arrays. Would need some kind of run-time approach. At least for static arrays, you could do multiple versions of the function and then use template constraints to call whichever function. Some tuning would be necessary.
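
A rough sketch of the static-array idea, assuming square `float` matrices. The name `multiply` and the cutoff `smallLimit` are made up for illustration, and a real threshold would have to come from the tuning mentioned above. Two overloads share one name, and template constraints on the deduced dimension `N` decide at compile time which one is instantiated.

```d
// Hypothetical cutoff; a real value would come from benchmarking.
enum size_t smallLimit = 4;

// Small matrices: straightforward i-j-k loops the compiler can unroll.
void multiply(size_t N)(ref const float[N][N] a, ref const float[N][N] b,
                        ref float[N][N] c)
    if (N <= smallLimit)
{
    foreach (i; 0 .. N)
        foreach (j; 0 .. N)
        {
            float s = 0;
            foreach (k; 0 .. N)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

// Larger matrices: i-k-j order, which walks rows of b and c contiguously.
// A SIMD or blocked kernel could be substituted here.
void multiply(size_t N)(ref const float[N][N] a, ref const float[N][N] b,
                        ref float[N][N] c)
    if (N > smallLimit)
{
    foreach (i; 0 .. N)
    {
        c[i][] = 0;
        foreach (k; 0 .. N)
        {
            immutable aik = a[i][k];
            foreach (j; 0 .. N)
                c[i][j] += aik * b[k][j];
        }
    }
}
```

The call site is the same in both cases, `multiply(a, b, c)`, and the choice costs nothing at run time. Dynamic arrays would still need a run-time size check on top of this.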

On Fri, 15 Apr 2016 18:54:12 +0000, jmh530 <john.michael.hall@gmail.com> wrote:
> On Tuesday, 12 April 2016 at 10:55:18 UTC, xenon325 wrote:
> >
> > Have you seen how GCC's function multiversioning [1] ?
> >
>
> I've been thinking about the gcc multiversioning since you mentioned it previously.
>
> I keep thinking about how the optimal algorithm for something like matrix multiplication depends on the size of the matrices.
>
> For instance, you might do something for very small matrices that just relies on one processor, then you add in SIMD as the size grows, then you add in multiple CPUs, then you add in the GPU (or maybe you add before CPUs), then you add in multiple computers.

GCC only has one architecture as a target at a time. As long as this is so, there is little point in contemplating how it handles multiple architectures and network traffic. :)

CPUs run the bulk of code, from booting through the kernel and drivers to applications, and there will always be something that can be optimized if it is statically known that a certain instruction set is supported.

To pick up your matrices example, imagine OpenGL code that has some 4x4 matrices that are in no direct relation to each other. The GPU is only good at bulk processing, and that doesn't apply here. So you need the general purpose processor and benefit from the knowledge that some SSE level is supported. In general, when you have to make many quick decisions on small amounts of data, the GPU or networking are out of the question.

-- 
Marco

April 16, 2016

Re: Any usable SIMD implementation?

Posted by Marco Leise
in reply to Walter Bright

On Tue, 12 Apr 2016 23:22:37 -0700,
Walter Bright <newshound2@digitalmars.com> wrote:

> >            "mulq %[y]"
> >            : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y;
> 
> I don't see anything elegant about those lines, starting with "mulq" is not in any of the AMD or Intel CPU manuals. The assembler should notice that 'y' is a ulong and select the 64 bit version of the MUL opcode automatically.
> 
> I can see nothing to recommend the:
> 
>      "=a" tmp.lo
> 
> syntax. How about something comprehensible like "tmp.lo = EAX"? I bet people could even figure that out without consulting stackoverflow! :-)
> 
> I have no idea what:
> 
>     "a" x
> 
> and:
> 
>      [y] "rm" y
> 
> mean, nor why the ":" appears sometimes and the "," other times.

Tell me again, what's more elegant!

        uint* pnb = cast(uint*)cf.processorNameBuffer.ptr;
        version(GNU)
        {
            asm { "cpuid" : "=a" pnb[0], "=b" pnb[1], "=c" pnb[ 2], "=d" pnb[ 3] : "a" 0x8000_0002; }
            asm { "cpuid" : "=a" pnb[4], "=b" pnb[5], "=c" pnb[ 6], "=d" pnb[ 7] : "a" 0x8000_0003; }
            asm { "cpuid" : "=a" pnb[8], "=b" pnb[9], "=c" pnb[10], "=d" pnb[11] : "a" 0x8000_0004; }
        }
        else version(D_InlineAsm_X86)
        {
            asm pure nothrow @nogc {
                push ESI;
                mov ESI, pnb;
                mov EAX, 0x8000_0002;
                cpuid;
                mov [ESI], EAX;
                mov [ESI+4], EBX;
                mov [ESI+8], ECX;
                mov [ESI+12], EDX;
                mov EAX, 0x8000_0003;
                cpuid;
                mov [ESI+16], EAX;
                mov [ESI+20], EBX;
                mov [ESI+24], ECX;
                mov [ESI+28], EDX;
                mov EAX, 0x8000_0004;
                cpuid;
                mov [ESI+32], EAX;
                mov [ESI+36], EBX;
                mov [ESI+40], ECX;
                mov [ESI+44], EDX;
                pop ESI;
            }
        }
        else version(D_InlineAsm_X86_64)
        {
            asm pure nothrow @nogc {
                push RSI;
                mov RSI, pnb;
                mov EAX, 0x8000_0002;
                cpuid;
                mov [RSI], EAX;
                mov [RSI+4], EBX;
                mov [RSI+8], ECX;
                mov [RSI+12], EDX;
                mov EAX, 0x8000_0003;
                cpuid;
                mov [RSI+16], EAX;
                mov [RSI+20], EBX;
                mov [RSI+24], ECX;
                mov [RSI+28], EDX;
                mov EAX, 0x8000_0004;
                cpuid;
                mov [RSI+32], EAX;
                mov [RSI+36], EBX;
                mov [RSI+40], ECX;
                mov [RSI+44], EDX;
                pop RSI;
            }
        }

-- 
Marco

On 4/16/2016 2:40 PM, Marco Leise wrote:
> Tell me again, what's more elegant!

If I wanted to write in assembler, I wouldn't write in a high level language, especially a weird one like GNU version.
