Thread overview
How to deal with inline asm functions in Phobos/druntime?
Apr 07, 2015
Johan Engelen
Apr 08, 2015
Kai Nacke
Apr 08, 2015
Johan Engelen
Apr 08, 2015
Kai Nacke
Apr 08, 2015
Johan Engelen
Apr 08, 2015
David Nadlinger
Apr 08, 2015
Kai Nacke
Apr 08, 2015
Johan Engelen
Apr 08, 2015
Daniel Murphy
Apr 08, 2015
David Nadlinger
Apr 08, 2015
Johan Engelen
Apr 12, 2015
Johan Engelen
Apr 13, 2015
Kai Nacke
April 07, 2015
Hi all,
  I've hit on a problem that I know how to fix, but I do not know how to properly do it. Thanks for your help.

A number of functions have inline assembly implementations in Phobos, e.g. std.math.ilogb(). I don't know why exactly they have asm implementations for Windows. The default path "core.stdc.math.ilogbl(x);" would be fine on Windows. The problem is that LDC would take that same assembly code, although it assumes DMD calling conventions. Also, _much_ better code is generated when ldc.llvmasm inline assembly code is used for non-naked asm functions: for D-style naked asm, the generated code contains huge function pro-/epilogues. The LDC implementation without assumptions about calling conventions for MSVC-compatible ilogb would be:

        import ldc.llvmasm;
        return __asm!int(
           `fldl     $1         ;
            fxam                ;
            fstsw   %AX         ;
            and     $$0x45, %AH ;
            cmp     $$0x40, %AH ;
            jz      Lzeronan    ;
            cmp     $$5, %AH    ;
            jz      Linfinity   ;
            cmp     $$1, %AH    ;
            jz      Lzeronan    ;
            fxtract             ;
            fstp    %ST(0)      ;
            fistpl  (%RSP)      ;
            mov     (%RSP), $0  ;
            jmp     Ldone       ;
        Lzeronan:
            mov     $$0x80000000, $0 ;
            fstp    %ST(0)           ;
            jmp     Ldone            ;

        Linfinity:
            mov     $$0x7FFFFFFF, $0 ;
            fstp    %ST(0)           ;

        Ldone: ;`, "=r,*m,~{ax},~{memory}", &x);

I think it'd be relatively straightforward to write the code such that it works for 80-bit and 64-bit reals.

My question: how do I fix our fork of Phobos? Do we just want to pass the call to core.stdc.math.ilogbl, and disregard the 'optimized' inline asm? Or do we add "version(LDC) version (Win64)" or similar and add our own asm implementations?

It is fun to write these small asm blobs, but I am not sure how maintainable all this will be.

Confused :S

Thanks!
  Johan
April 08, 2015
On Tuesday, 7 April 2015 at 18:15:58 UTC, Johan Engelen wrote:
> It is fun to write these small asm blobs, but I am not sure how maintainable all this will be.
>
> Confused :S

Hi Johan!

The functions in Druntime/Phobos exist because the MSVC runtime does not support 80-bit reals, which are a must-have for DMD.

If you are going to rewrite assembly snippets, then please use ldc.llvmasm because of all the advantages you have listed. Plus, ldc.llvmasm does not disable function inlining, which is also a big win.

Regarding std.math, there are already a lot of places with versions for LDC. One reason is that LLVM intrinsics exist. Another reason is that LDC supports PPC, ARM and MIPS, too. A final reason is that the DMD ASM does not work. I prefer having rewritten assembly functions. But please note that there must always be a fallback to the core.stdc function because of the many supported architectures.

BTW: Many users of LDC want fast math functions. Having a better implementation of math functions is therefore desirable.

Regards,
Kai
April 08, 2015
Hi Kai,
  OK!
I'd very much appreciate an example of how to add the code above to ilogb in std.math. The "version(...)" stuff quickly becomes a mess.
Could you take the code above and show me how you would put this into std.math?
I'm sorry if this is too much hand-holding, it is really very much appreciated.

-Johan
April 08, 2015
On Wednesday, 8 April 2015 at 09:29:56 UTC, Johan Engelen wrote:
> Hi Kai,
>   OK!
> I'd very much appreciate an example of how to add the code above to ilogb in std.math. The "version(...)" stuff quickly becomes a mess.
> Could you take the code above and show me how you would put this into std.math?
> I'm sorry if this is too much hand-holding, it is really very much appreciated.
>
> -Johan

Hi Johan!

The structure of ilogb() is:

int ilogb(real x)
{
    version(Win64_DMD_InlineAsm_X87)
        ....
    else version(CRuntime_Microsoft)
        ....
    else
        return core.stdc.math.ilogbl(x);
}

You can simply add your stuff at the top of the function, resulting in:

int ilogb(real x)
{
    version(LDC)
    {
        version(X86)
            ....
        else version(X86_64)
            ....
        else
            return core.stdc.math.ilogbl(x);
    }
    else
    version(Win64_DMD_InlineAsm_X87)
        ....
    else version(CRuntime_Microsoft)
        ....
    else
        return core.stdc.math.ilogbl(x);
}

Look at the 'misplaced' else: by doing the addition in this way, you do not touch the original implementation. This makes merging much easier. If you look closer at std.math, you see a lot of 'wrong' formatting/spacing where version(LDC) is involved. The goal here is to leave as much as possible of the original source untouched.

I hope this answers your question. If not, then do not hesitate to ask again.

Regards,
Kai
April 08, 2015
On Wed, Apr 8, 2015 at 8:31 AM, Kai Nacke via digitalmars-d-ldc <digitalmars-d-ldc@puremagic.com> wrote:
> BTW: Many users of LDC want fast math functions. Having a better implementation of math functions is therefore desirable.

Speaking about performance: Haven't we switched to 64 bit reals on Win64 for the time being? If so, we'd probably want a version of ilogb that uses SSE instead of the x87 FPU. This is probably (I'm rather sure, but did not measure it) a much bigger performance problem than any overly long general purpose register pushing/popping prologues generated by non-naked asm functions.

ilogb() does not seem to be a commonly used function, though, so it
might not be worth the effort.

 — David

April 08, 2015
On Wednesday, 8 April 2015 at 11:29:26 UTC, David Nadlinger wrote:
> Speaking about performance: Haven't we switched to 64 bit reals on
> Win64 for the time being? If so, we'd probably want a version of ilogb
> that uses SSE instead of the x87 FPU. This is probably (I'm rather
> sure, but did not measure it) a much bigger performance problem than
> any overly long general purpose register pushing/popping prologues
> generated by non-naked asm functions.

Yes, reals are 64-bit on Win64. Using SSE can really help here.
Speaking of performance: The only way to know that some piece of code is faster is to measure it. This can reveal surprising insights.... (Hint!)
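
A minimal sketch of such a measurement, assuming a recent Phobos that ships std.datetime.stopwatch (newer than this thread). The helper names viaCRuntime, viaPhobos, measure and sink are made up for illustration; the timings depend entirely on compiler, flags and target, so no numbers are given here.

```d
import core.stdc.math : cIlogb = ilogb;      // C runtime implementation
import std.datetime.stopwatch : benchmark;
import std.math : phobosIlogb = ilogb;       // Phobos implementation
import std.stdio : writefln;

int sink; // accumulates results so the calls are not optimized away

void viaCRuntime()
{
    foreach (i; 1 .. 1001)
        sink += cIlogb(cast(double) i);
}

void viaPhobos()
{
    foreach (i; 1 .. 1001)
        sink += phobosIlogb(cast(real) i);
}

// Hypothetical helper: runs both candidates `runs` times and
// returns one accumulated Duration per candidate.
auto measure(uint runs)
{
    return benchmark!(viaCRuntime, viaPhobos)(runs);
}

void main()
{
    auto t = measure(10_000);
    writefln("C runtime: %s", t[0]);
    writefln("Phobos:    %s", t[1]);
}
```

Build it with optimizations enabled, otherwise the comparison says little about what users will actually see.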

> ilogb() does not seem to be a commonly used function, though, so it
> might not be worth the effort.

There are a lot of other functions, too. E.g. the inline assembler in expi() is commented out because of a problem with the asm parser. I think Johan will find enough worthwhile targets.

Regards,
Kai
April 08, 2015
Thanks Kai!

Well, first we should get LDC fully functional on Win64, no? :P
That was my main goal, really.

About SSE: I can't vectorize the code for this one function with one real as argument! I had done a brief search for what instructions are available on xmm regs (argument is passed through xmm0), but it is mostly simple arithmetic I think, not the kind of stuff that is used in the original druntime asm code.
(Btw, the pro-epilogues also consist of pushing/popping all XMM regs, quite a bit of data, but indeed no clue how slow/fast that is. Didn't measure a thing, but it just looked kind of wasteful :)

But thinking about David's comment a bit more, if I understand ilogb correctly, all it needs to do is output the exponent of the real as an int. For that, one doesn't need floating point operations at all. Just a bit of bit shifting and masking, and subtracting the floating point format's exponent bias value. I think we can just express that as normal D code, which can then be optimized / vectorized by LLVM? That code could go upstream, with a few static ifs for all floating point formats supported.
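
A rough sketch of that idea, for the 64-bit IEEE double format only. The name ilogbBits is hypothetical, and the special-case return values are an assumption that simply mirrors the asm above (0x80000000 for zero and NaN, 0x7FFFFFFF for infinity). Subnormals need one extra step because their exponent field is zero.

```d
import core.bitop : bsr;

// Hypothetical helper, not part of Phobos or druntime.
int ilogbBits(double x) pure nothrow @nogc @trusted
{
    // Reinterpret the 64 bits of the double as an integer.
    immutable ulong bits = *cast(const(ulong)*) &x;
    immutable int   e    = cast(int)((bits >> 52) & 0x7FF); // 11-bit exponent field
    immutable ulong frac = bits & 0x000F_FFFF_FFFF_FFFF;    // 52-bit fraction field

    if (e == 0x7FF)                    // infinity or NaN
        return frac == 0 ? int.max : int.min;
    if (e == 0)
    {
        if (frac == 0)                 // +0.0 or -0.0
            return int.min;
        // Subnormal: value is frac * 2^-1074, so the exponent is
        // the index of the highest set fraction bit minus 1074.
        return bsr(frac) - 1074;
    }
    return e - 1023;                   // normal number: remove the exponent bias
}
```

The sign bit is masked away, so negative inputs give the same result as their absolute value. An 80-bit or 32-bit variant would need different field widths and bias, which is where the static ifs would come in.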

Again, I want to get LDC fully functional foremost, so all this is fun but distracting ;)
April 08, 2015
On Wednesday, 8 April 2015 at 11:09:48 UTC, Kai Nacke wrote:
>
> Look at the 'misplaced' else: by doing the addition in this way, you do not touch the original implementation. This makes merging much easier. If you look closer at std.math, you see a lot of 'wrong' formatting/spacing where version(LDC) is involved. The goal here is to leave as much as possible of the original source untouched.

Exactly!
April 08, 2015
"Johan Engelen"  wrote in message news:hqpjetsgaeoqkfyqexka@forum.dlang.org...

> About SSE: I can't vectorize the code for this one function with one real as argument! I had done a brief search for what instructions are available on xmm regs (argument is passed through xmm0), but it is mostly simple arithmetic I think, not the kind of stuff that is used in the original druntime asm code.
> (Btw, the pro-epilogues also consist of pushing/popping all XMM regs, quite a bit of data, but indeed no clue how slow/fast that is. Didn't measure a thing, but it just looked kind of wasteful :)

I don't think it's so much about vectorizing as it is about avoiding the x87 FPU, which you can do when 80-bit precision is not needed. 

April 08, 2015
On 8 Apr 2015, at 15:15, Daniel Murphy via digitalmars-d-ldc wrote:
> I don't think it's so much about vectorizing as it is about avoiding the x87 FPU, which you can do when 80-bit precision is not needed.

Indeed. On x86_64, the SSE registers (%xmm0 and so on) are used by default for single- and double-precision floating point operations. The x87 FPU is not particularly well-optimized on newer CPUs to begin with, and transferring data from the SSE registers to the FPU on function entry and then back again is quite costly too.

For example, this is what made us (all D compilers) look bad on that Perlin noise microbenchmark (the thread from a couple of months ago).

 — David