Thread overview
LDC2 win64 calling convention
Nov 28, 2018
realhet
Nov 28, 2018
kinke
Nov 28, 2018
kinke
Nov 28, 2018
realhet
Nov 28, 2018
kinke
Nov 29, 2018
realhet
Dec 01, 2018
Johan Engelen
November 28, 2018
Hi,

Is there any documentation on the win64 calling convention used by the LDC2 compiler?

So far I have been trying to use the Microsoft x64 calling convention, but I'm not sure that's the one I have to use. It doesn't seem to match exactly, because I think LDC2 uses the stack only.

https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention

I'd like your help with the following:

1. Are there register parameters? (I think not)
2. What are the volatile regs? RAX, RCX, RDX, XMM6..XMM15?
3. Is the stack pointer aligned to 16?
4. Is there a 32 byte shadow area on the stack?

Thank you!
November 28, 2018
On Wednesday, 28 November 2018 at 18:56:14 UTC, realhet wrote:
> 1. Are there register parameters? (I think not)

Of course, e.g., POD structs of power-of-2 sizes <= 8 bytes and integral scalars as well as float/double/vectors. The stack isn't used at all, aggregates > 8 bytes are passed by ref (caller makes a copy on its stack and passes a pointer to it to the callee); that seems not to be mentioned at all in the Wiki article.

> 2. What are the volatile regs? RAX, RCX, RDX, XMM6..XMM15?

See Microsoft's docs.

> 3. Is the stack pointer aligned to 16?

It is IIRC.

> 4. Is there a 32 byte shadow area on the stack?

Yes, IIRC.
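
As a rough illustration of the alignment and shadow-area answers, here is a simplified model of how many bytes a caller sets aside for outgoing arguments under the Microsoft x64 convention. The function name `outgoingArgBytes` is hypothetical, and the model ignores locals and saved registers; it only shows the arithmetic.

```d
// Simplified model, not compiler output: the caller always reserves at least
// 4 slots (the 32-byte shadow area), each slot is 8 bytes, and the total is
// rounded up to a multiple of 16 so RSP stays 16-byte aligned at the call.
size_t outgoingArgBytes(size_t argCount)
{
    immutable slots = argCount < 4 ? 4 : argCount;
    immutable bytes = slots * 8;
    return (bytes + 15) & ~cast(size_t) 15;
}
```

So a call with 0 to 4 arguments needs 32 bytes in this model, and a call with 5 arguments needs 48 (40 rounded up).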

---

LDC conforms to the regular Win64 ABI (incl. __vectorcall extension for vectors). The biggest difference is that `extern(D)` (as opposed to `extern(C)` or `extern(C++)`) reverses the arguments - `foo(1, 2, 3, 4)` becomes `foo(4, 3, 2, 1)`, so not the first 4 args are passed in registers if possible, but the last ones (incl. special cases wrt. struct-return + `this` pointers). Other than that, there are just very few special cases for delegates and dynamic arrays, which only apply to `extern(D)`.
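
A minimal sketch of the argument-order difference, assuming the LDC behaviour described in this post (Win64, late 2018). The functions `sumC` and `sumD` are hypothetical, and the register notes in the comments follow from the description rather than from verified compiler output.

```d
// extern(C): regular Win64 order.
// a -> RCX, b -> RDX, c -> R8, d -> R9
extern (C) long sumC(long a, long b, long c, long d)
{
    return a + 2 * b + 3 * c + 4 * d;
}

// extern(D): arguments are reversed before register assignment, so
// d -> RCX, c -> RDX, b -> R8, a -> R9 (per the description above).
extern (D) long sumD(long a, long b, long c, long d)
{
    return a + 2 * b + 3 * c + 4 * d;
}
```

Both functions behave identically at the source level; only hand-written asm that reads the registers directly has to care about the difference.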
November 28, 2018
On Wednesday, 28 November 2018 at 20:17:53 UTC, kinke wrote:
> The stack isn't used at all

To prevent confusion: it's used of course, e.g., if there are more than 4 total parameters. Just not in the classical sense, i.e., a 16-bytes struct isn't pushed directly onto the stack, but the caller makes the copy and passes a pointer, either in a register or on the stack.
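
To make the size rule concrete, here is a small sketch with two hypothetical POD structs. The comments restate the passing rules from this thread as an assumption; the code itself only demonstrates that by-value semantics are preserved in both cases.

```d
// Hypothetical types for illustration.
struct Small { int x; int y; }      // 8 bytes: can travel in one register
struct Big   { long lo; long hi; }  // 16 bytes: caller copies, passes a pointer

static assert(Small.sizeof == 8);
static assert(Big.sizeof == 16);

int sum(Small s)
{
    return s.x + s.y;
}

// The callee works on the caller-made copy, so this change is not visible
// to the caller even though a pointer is what actually gets passed.
long bump(Big b)
{
    b.lo += 1;
    return b.lo + b.hi;
}
```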
November 28, 2018
Thank You for the explanation!

But my tests have different results:

void* SSE_sobelRow(ubyte* src, ubyte* dst, size_t srcStride){ asm{
  push RDI;

  mov RAX, 0; mov RDX, 0; mov RCX, 0; //clear 'parameter' registers

  mov RAX, src;
  mov RDI, dst;

  //gen
  movups XMM0,[RAX];
  movaps XMM1,XMM0;
  pslldq XMM0,1;
  movaps XMM2,XMM1;
  psrldq XMM1,1;
  pavgb XMM1,XMM0;
  pavgb XMM1,XMM2;
  movups [RDI],XMM1;
  //gen end

  pop RDI;
}}

Even when I clear the volatile regs that are used for register calling, I still get correct results.
However, when I put "mov [RBP+8], 0" into the code, it generates an access violation, which is why I think the parameters are on the stack.

What I'm really unsure about is which registers I HAVE TO save in my asm routine.
Currently I think I am only allowed to trash the contents of RAX, RCX, RDX and XMM0..XMM5, based on the Microsoft calling model, but I'm not sure what the actual case is with LDC2 on Win64.

If my code is surrounded by the LDC2 compiler's SSE optimizations and I can't satisfy the requirements, I will get random errors in the future. I'd better avoid those.

On the 32-bit target the rule is simple: you can do what you want with all the XMM regs and a, c, d. Now at 64-bit I'm quite unsure. :S
November 28, 2018
You're not using naked asm; this entails a prologue (spilling the params to stack etc.). Additionally, LDC doesn't really like accessing params and locals in DMD-style inline asm, see https://github.com/ldc-developers/ldc/issues/2854.

You can check the final asm trivially online, e.g., https://run.dlang.io/is/e0c2Ly (click the ASM button). You'll see that your params are in R8, RDX and RCX (reversed order as mentioned earlier).
November 29, 2018
On Wednesday, 28 November 2018 at 21:58:16 UTC, kinke wrote:
> You're not using naked asm; this entails a prologue (spilling the params to stack etc.). Additionally, LDC doesn't really like accessing params and locals in DMD-style inline asm, see https://github.com/ldc-developers/ldc/issues/2854.
>
> You can check the final asm trivially online, e.g., https://run.dlang.io/is/e0c2Ly (click the ASM button). You'll see that your params are in R8, RDX and RCX (reversed order as mentioned earlier).

Hi again.

I just tried a new debugger: x64dbg. I really like it, it is not the bloatware I got used to nowadays.

It turns out that LDC2's parameter/register handling is really clever:

- Register saving/restoring: fully automatic. It analyzes my asm and saves/restores only those regs I overwrite.
- Parameters: Reversed Microsoft x64 calling convention, just as you said. Parameters in registers are 'spilled' onto the stack whether I refer to them by name or by register. Maybe this is not too clever, but since I can use the params by name from anywhere, it makes my code nicer.
- The "ret" instruction must not be used, because it is taken literally and skips the auto-generated exit code.

In conclusion: Maybe LDC2 generates a lot of extra code, but I always make longer asm routines, so it's not a problem for me at all while it helps me a lot.

December 01, 2018
On Thursday, 29 November 2018 at 15:10:41 UTC, realhet wrote:
>
> In conclusion: Maybe LDC2 generates a lot of extra code, but I always make longer asm routines, so it's not a problem for me at all while it helps me a lot.

An extra note: I recommend you look into using `ldc.llvmasm.__asm` to write inline assembly. Some advantages: no worrying about calling conventions (portability) and you'll have more instructions available.
If you care about performance, usually you should _not_ write assembly, but for the 1% of other cases: the compiler also understands your asm much better if you use __asm.

LDC's __asm syntax is very similar to (if not the same as) what GDC uses for inline assembly.
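
For illustration, a minimal sketch of what an `__asm` call can look like, assuming LDC on x86_64. The function name `addInts` is hypothetical; the instruction string uses AT&T operand order and the second string is an LLVM constraint list (output register, first input tied to the output, second input in a register, flags clobbered). On other compilers or targets the sketch falls back to plain D.

```d
version (LDC)
{
    version (X86_64) version = UseLlvmAsm;
}

version (UseLlvmAsm)
{
    import ldc.llvmasm;

    // $0 is the output and initially holds a (tied via "0"); $2 holds b.
    // The compiler picks the registers, so no calling convention is involved.
    int addInts(int a, int b)
    {
        return __asm!int("addl $2, $0", "=r,0,r,~{flags}", a, b);
    }
}
else
{
    // Fallback so the sketch still compiles without LDC or off x86_64.
    int addInts(int a, int b)
    {
        return a + b;
    }
}
```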

-Johan