inline asm in inlined function / ECX clobbered / stack frame / naked
May 05, 2019
I am posting here instead of learn because it is my understanding that DMD will not inline a function that has inline asm, and because one of my questions relates to why LLVM is unsafely(?) using ECX.

Problem Statement: I am writing a faster round-up-to-nearest-power-of-2 function. An inline asm [1] version using BSR instruction is about twice as fast as (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x)).[2]
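
Spelled out as a function for readability, the same bit-twiddling expression is:

uint kroundup32(uint x)
{
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}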

I know about core.bitop.bsr and std.math.nextPow2 which uses it. My asm code block is 2.5x faster than codegen for (2 << bsr(x)) which surprises me...

Summary: I've inlined a function consisting mostly of inline assembly. In the initial approach, ECX is clobbered, but LDC/LLVM does not recognize this and tries to use it as the loop counter. In approach 2, I push RCX to the stack; LDC/LLVM sees me use RCX(?) and uses EDX instead, but now the inline asm referencing the stack-frame variable (x[EBP] or [EBP + x] -- side note, should that be RBP?) is off by 8 bytes. I can rescue this by manually writing [EBP + x + 8], which seems hinky and wrong. By moving the PUSH to after EAX is loaded from the stack, all is in order.

Questions at bottom.

The results below were produced with ldc2 -release -O2 (version 1.15.0).

Link to compiler explorer: https://godbolt.org/z/uLaIS5

Approach 1:
uint roundup32(uint x)
{
    pragma(inline, true);
    version(LDC) pragma(LDC_allow_inline);

    if (x <= 2) return x;

    asm
    {
        mov EAX, x[EBP];
        sub EAX, 1;
        bsr ECX, EAX;   // ecx = y = msb(x-1)
        mov EAX, 2;
        shl EAX, CL;    // return (2 << y)
    }   // returns EAX
}

Result 1:
This works well _WHEN THE FUNCTION IS NOT INLINED_.

When inlined, my code clobbers ECX, which the calling function was using to track the loop iteration.

.LBB1_1:
        mov     dword ptr [rsp + 8], ecx	; x param == loop iter
        mov     eax, dword ptr [rsp + 8]	; x param
        sub     eax, 1
        bsr     ecx, eax
        mov     eax, 2
        shl     eax, cl			; inline asm block ends here
        mov     eax, eax		; alignment?
        add     rbx, rax		; rbx == accumulator
        add     ecx, 1		; next round for loop, but ecx was chgd
        cmp     ecx, (# of loop iterations)
        jne     .LBB1_1

So, perhaps I can just save ECX.

Approach 2:
Same, but PUSH RCX / POP RCX at beginning and end of the asm block.
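
As a sketch, the block from Approach 1 with the save/restore added:

asm
{
    push RCX;           // save the caller's RCX
    mov EAX, x[EBP];    // load x from the stack frame (but see Result 2)
    sub EAX, 1;
    bsr ECX, EAX;       // ecx = y = msb(x-1)
    mov EAX, 2;
    shl EAX, CL;        // return (2 << y)
    pop RCX;            // restore RCX
}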

Result 2:
The compiler detects that I have used RCX and uses EDX as the loop counter instead.
New problem: since I have pushed onto the stack, the param x is now at offset rsp + 16, but the generated code still believes it is at rsp + 8:

.LBB1_1:
        mov     dword ptr [rsp + 8], edx	; x param == loop iter
        push    rcx				; approach 2
        mov     eax, dword ptr [rsp + 8]	; looking for x here :-(
        sub     eax, 1
        bsr     ecx, eax
        mov     eax, 2
        shl     eax, cl
        pop     rcx			; inline asm block ends here
        mov     eax, eax		; alignment?
        add     rbx, rax		; rbx == accumulator
        add     edx, 1		; next round for loop
        cmp     edx, (# of loop iterations)
        jne     .LBB1_1

So now, because we are looking in the wrong spot on the stack, the function ends up with the wrong value in EAX.

Approach 3:
Cheat, because I know RCX was pushed, and change the local reference to x from [RBP + x] to [RBP + x + 8].
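
As a sketch, the same block with only the load changed (writing the operand the way it is described above):

asm
{
    push RCX;
    mov EAX, [EBP + x + 8];  // compensate by hand for the extra 8 bytes
    sub EAX, 1;
    bsr ECX, EAX;
    mov EAX, 2;
    shl EAX, CL;
    pop RCX;
}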

Result 3:
This works, but feels wrong.

Final, successful approach:
I can just move the PUSH RCX to after EAX is loaded from [rsp + 8], and not worry about the altered stack frame.
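
As a sketch, the rearranged block (this matches the #APP block shown in my later post):

asm
{
    mov EAX, x[EBP];    // load x before touching the stack
    sub EAX, 1;
    push RCX;           // only now save the caller's RCX
    bsr ECX, EAX;       // ecx = y = msb(x-1)
    mov EAX, 2;
    shl EAX, CL;        // return (2 << y)
    pop RCX;
}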



Questions:
0. Should I quit using DMD style assembly and use LLVM style?

1. gcc inline asm has the capability to specify in/out values and what's clobbered. I assume this is a deficiency of DMD style inline asm?

2. LDC/LLVM can inline this function, which eliminates the prologue/epilogue, but there is still a local stack slot for the "passed" value x. What is going on -- this must be typical behavior when a function is inlined, yes?

3. Even though it is essentially a naked function, I cannot add the naked; statement, because then I can only access variables by name from the global scope. Why, given that when it is inlined I can still access the "passed" x?
3b. Is an inlined function de facto "naked", meaning there is no need for this keyword?

4a. How can I notify LDC/LLVM that ECX is being used? Or: why does it not notice in approach 1, but appear to in approach 2?
4b. Cdecl says ECX is a caller-saved register [3]; does this go out the window when the function is inlined? In that case, how can I be sure it is safe to use EAX, ECX, or EDX in an inline asm block in a function that will be inlined?
(Interestingly, when the function is not inlined, EBX is used to track the loop iteration, since the compiler knows ECX could be clobbered.)

5. Is the final solution of moving the PUSH RCX to after EAX is loaded from the stack the correct approach?


This was quite lengthy; as a novice, **I am very grateful for your expert help.**

James


References:
[1] DMD style assembly: https://dlang.org/spec/iasm.html

[2] You can read more about round up optimization here: http://locklessinc.com/articles/next_pow2/ (although note that in this very old article, the bitshift trick was the fastest whereas on modern processors the BSR instruction method is faster)

[3] https://www.agner.org/optimize/calling_conventions.pdf
May 06, 2019
On Monday, 6 May 2019 at 03:09:38 UTC, James Blachly wrote:
> I know about core.bitop.bsr and std.math.nextPow2 which uses it. My asm code block is 2.5x faster than codegen for (2 << bsr(x)) which surprises me...

Sorry, but I'll just focus on that and not on the asm questions. The reason is simple: I discourage anyone from going down to the asm level if it can be avoided.

So, I have:

pragma(inline, true)
uint roundup32(uint x)
{
    import core.bitop;
    //if (x <= 2) return x;
    return 2u << bsr(x-1);
}

`ldc2 -mtriple=x86_64-linux-gnu -O -output-s foo.d` (AT&T syntax...):

_D3foo9roundup32FkZk:
	addl	$-1, %edi
	bsrl	%edi, %ecx
	xorl	$31, %ecx
	xorb	$31, %cl
	movl	$2, %eax
	shll	%cl, %eax
	retq

I can't believe that's 2.5x slower than your almost identical asm block. And that code is portable, not just OS- and ABI-independent, but also architecture-wise. 1000x better than inline asm IMO.
May 06, 2019
Adding `-mcpu=haswell` to the LDC command-line, the generated asm is:

	addl	$-1, %edi
	lzcntl	%edi, %eax
	xorb	$31, %al
	movl	$2, %ecx
	shlxl	%eax, %ecx, %eax

[Add `-mcpu=native` to tune code-gen for *your* CPU.]
May 06, 2019
On 5/6/19 3:55 PM, kinke wrote:
> Adding `-mcpu=haswell` to the LDC command-line, the generated asm is:
> 
>      addl    $-1, %edi
>      lzcntl    %edi, %eax
>      xorb    $31, %al
>      movl    $2, %ecx
>      shlxl    %eax, %ecx, %eax
> 
> [Add `-mcpu=native` to tune code-gen for *your* CPU.]

Thanks kinke. Agree, I am all for portability. This is my first stab at asm.

The speed is not really the point; this is really me asking for help understanding compiler internals and fundamentals of inline assembly.

Nevertheless, running each of the 3 algorithms (my asm, the famous bitwise-OR trick for this problem, and core.bitop.bsr) 2^28 times, I get the following results:

assembly version:
Elapsed msec: 564
Sum: 48038395756849835
Stopwatch is reset: 0
kroundup32:
Elapsed msec: 776
Sum: 48038395756849835
Stopwatch is reset: 0
bitop_bsr:
Elapsed msec: 1299
Sum: 48038395756849835

-mcpu=native:
The performance of bitop_bsr is markedly improved with -mcpu=native on Linux (but only marginally better on macOS), which is really surprising(???), yet it is still slower than the hand-written asm: about 2.25x the asm runtime, down from 2.5x on Ivy Bridge (Mac) and 2.9x on Sandy Bridge (Linux). Maybe because I do not have the latest and greatest processors?

Assembly at bottom. The two xor instructions after bsr are unnecessary(?) and perhaps a contributor to slowdown.

Anyway, mostly I wanted help understanding register safety and proper use of inline asm.

Of note: looking at the .s output, on macOS the inline asm {} block is flagged in the assembly source file with ## InlineAsm Start ... ## InlineAsm End, whereas under Linux it is flagged with #APP ... #NO_APP. Since this is compiler-emitted, I would have expected it to be the same across platforms -- or is this determined by the LLVM version?

James


Assembly with -O2 -release -mcpu=native

Note that these are the non-inlined versions.
Also, I am not sure why roundup32(uint x) has so much more prologue than bitop_bsr(uint x).


_D9asminline9bitop_bsrFkZk:
        .cfi_startproc
        cmpl    $2, %edi
        ja      .LBB0_2
        movl    %edi, %eax
        retq
.LBB0_2:
        addl    $-1, %edi
        bsrl    %edi, %ecx
        xorl    $31, %ecx
        xorb    $31, %cl
        movl    $2, %eax
        shll    %cl, %eax
        retq


---

_D9asminline9roundup32FkZk:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movl    %edi, -4(%rbp)
        cmpl    $2, %edi
        ja      .LBB1_2
        movl    %edi, %eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
.LBB1_2:
        .cfi_def_cfa %rbp, 16
        #APP
        movl    -4(%rbp), %eax
        subl    $1, %eax
        pushq   %rcx
        bsrl    %eax, %ecx
        movl    $2, %eax
        shll    %cl, %eax
        popq    %rcx
        #NO_APP
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq

May 08, 2019
On Tuesday, 7 May 2019 at 01:20:43 UTC, James Blachly wrote:
> The performance of bitop_bsr is markedly improved with -mcpu=native on Linux (but only marginally better on macOS), which is really surprising(???), yet it is still slower than the hand-written asm: about 2.25x the asm runtime, down from 2.5x on Ivy Bridge (Mac) and 2.9x on Sandy Bridge (Linux).

I still couldn't believe this, so benchmarked myself.
First observation: you didn't re-start() the StopWatch after resetting it; the next stop() will then measure the elapsed time since it was originally started (!).
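
For reference, a minimal sketch of the reset-then-restart pattern that avoids this (a hypothetical skeleton, not the original benchmark code):

```
import std.datetime.stopwatch;

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    // ... first benchmark loop ...
    sw.stop();
    // report sw.peek() for the first algorithm

    sw.reset();   // clears the elapsed time, but does not start measuring
    sw.start();   // <-- this was the missing step
    // ... second benchmark loop ...
    sw.stop();
    // report sw.peek() for the second algorithm
}
```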

I modified the code a bit, switching to LLVM inline asm (highly recommended if DMD compatibility isn't that important) and adding a nextPow2-based version:

```
enum MAXITER = 1 << 28;

pragma(inline, true):

uint roundup32(uint x)
{
    if (x <= 2) return x;

    import ldc.llvmasm;
    return __asm!uint(
        `bsrl %eax, %ecx
         #xorl $$31, %ecx
         #xorb $$31, %cl
         movl $$2, %eax
         shll %cl, %eax`,
        "={eax},{eax},~{ecx}", x - 1);
}

uint bitop_bsr(uint x)
{
    import core.bitop;
    return x <= 2 ? x : 2 << bsr(x - 1);
}

uint nextPow2(uint x)
{
    import std.math;
    return x <= 2 ? x : std.math.nextPow2(x - 1);
}

uint kroundup32(uint x)
{
    x -= 1;
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    return x + 1;
}

void main()
{
    static void benchmark(alias Func)()
    {
        import std.datetime.stopwatch;
        import std.stdio;

        ulong sum;

        auto sw = StopWatch(AutoStart.yes);
        for(uint i; i < MAXITER; i++)
            sum += Func(i);
        sw.stop();

        writeln(__traits(identifier, Func), ":\t", sw.peek.total!"msecs", "ms\t", sum);
    }

    benchmark!roundup32();
    benchmark!bitop_bsr();
    benchmark!nextPow2();
    benchmark!kroundup32();
}
```

My results with an Intel Ivy Bridge CPU:

ldc2 -O -run asm.d:
roundup32:      254ms   48038395756849835
bitop_bsr:      386ms   48038395756849835
nextPow2:       381ms   48038395756849835
kroundup32:     326ms   48038395756849835

Observation 1: the inline asm version saving the 2 XORs is indeed faster, but by about 50% and not more than twice as fast. This was surprising to me; checking https://gmplib.org/~tege/x86-timing.pdf, BSR should take a surprisingly low 3 cycles vs. 5 cycles with the 2 additional dependent XORs on my Ivy Bridge.

Observation 2: std.math.nextPow2() is fine (equivalent to bitop_bsr).

Observation 3: kroundup32 is faster than nextPow2/bitop_bsr. With `-mcpu=native`, the first 3 timings are unchanged in my case, but the kroundup32 runtime shrinks to ~140ms. By then, it was clear that auto-vectorization must be interfering, and a look at the (inlined) asm confirmed it. Adding `-disable-loop-vectorization` to the cmdline, the kroundup32 time is exactly the same as nextPow2/bitop_bsr, ~380ms, independent from `-mcpu=native`.

The x86 timings PDF also shows that nextPow2 should be faster than the inline asm version on an AMD Zen CPU with tuned codegen, as BSR takes 4 cycles there, and LZCNT+XOR only 2 cycles. So whether inline asm really pays off in this case is highly CPU/codegen specific. It kills auto-vectorization for sure though, while nextPow2 may be vectorizable for some non-x86 targets. If the real-world loop body is vectorizable, then kroundup32 may be significantly faster than the other versions...
May 08, 2019
I should probably add that the inline asm version can be simplified to a much more readable:

return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);

Check out http://llvm.org/docs/LangRef.html#inline-asm-constraint-string if interested in LLVM inline assembly.
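
For context, a sketch of that one-liner dropped into the roundup32 function from the benchmark above:

```
pragma(inline, true)
uint roundup32(uint x)
{
    import ldc.llvmasm;
    // "=r,r": result in any general-purpose register, input likewise
    return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);
}
```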
May 28, 2019
On 5/8/19 11:42 AM, kinke wrote:
> I should probably add that the inline asm version can be simplified to a much more readable:
> 
> return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);
> 
> Check out http://llvm.org/docs/LangRef.html#inline-asm-constraint-string if interested in LLVM inline assembly.

Kinke,

Thank you for your extensive work and detailed answers. I will definitely take your advice and move to LLVM style inline asm.

Regarding my last (I think unanswered) question about why LDC/LLVM was not saving ECX when inlining the naked function:

Since the problem is not evident when using LLVM inline asm, this appears to be a bug related to the combination of function inlining and DMD-style asm.

May 29, 2019
On Tuesday, 28 May 2019 at 20:45:35 UTC, James Blachly wrote:
> Thank you for your extensive work and detailed answers. I will definitely take your advice and move to LLVM style inline asm.

You're welcome.

> Regarding my last (I think unanswered) question about why LDC/LLVM was not saving ECX when inlining the naked function:
>
> Since the problem is not evident when using LLVM inline asm, this appears to be a bug related to the combination of function inlining and DMD-style asm.

1. Your function wasn't naked ('essentially naked' != naked, there's a big difference; e.g., have a look at the LLVM IR).
2. Functions with DMD-style inline asm cannot be inlined (basically due to its lack of flexibility), and we normally bail out when applying `pragma(inline, true)`. You successfully circumvented that error by explicitly using `pragma(LDC_allow_inline)` (I can't think of a good use case for that pragma OTOH).