move+forward as intrinsics, incl. revised forward semantics for perfect forwarding
kinke
October 13

IMO we need to make core.lifetime.{move,forward} compiler intrinsics, to enable further optimizations that aren't possible with a library solution.

Move

  • semantics: move an lvalue to a new rvalue, at a new memory address, 'hijacking' the lvalue resources; the lvalue is reset to T.init (blit, not assignment!) afterwards
  • will be complete with move ctor; syntax needs to be decided, but signature is (ref T) (yes, must be an explicit ref)
    • allows to opt out of the default blit (memcpy struct payload), e.g., to fix up interior pointers
    • move ctor interop with C++ should be doable (just getting the extern(C++) mangle right)
    • problem: handle/avoid all compiler-implicit moves/blits (would have to call move ctor and dtor now; emplace FTW!)
  • would be nice as intrinsic:
    • not to have to import core.lifetime everywhere and end up with complicated template bloat for a basically trivial operation
    • potential optimization: elide lvalue reset to T.init and its destruction iff:
      • it is a local (can skip destruction)
      • and not used after the move
      • and the destruction of T.init is a noop (modulo mods to the struct's own payload), so its elision is not observable
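
For reference, the current library behaviour that this elision would optimize away can be observed with a small sketch. Handle, dtorCalls and idAfterMove are names made up for this illustration; it assumes today's core.lifetime.move, which resets a source that has a destructor to T.init.

```d
import core.lifetime : move;

struct Handle
{
    int id = -1;            // Handle.init has id == -1
    static int dtorCalls;   // counts destructor runs (illustration only)
    ~this() { ++dtorCalls; }
}

// Returns the id left behind in the source after a library move.
int idAfterMove()
{
    auto a = Handle(42);
    auto b = move(a);       // b takes over the payload via blit
    assert(b.id == 42);
    return a.id;            // a has been reset to Handle.init
}   // both b and the reset a are destructed at scope exit

void main()
{
    Handle.dtorCalls = 0;
    assert(idAfterMove() == -1);
    assert(Handle.dtorCalls == 2);  // one of them only destructs Handle.init
}
```

The second destructor call, on the reset local, is the one the proposed intrinsic could skip when the conditions listed above hold.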

When move isn't sufficient: perfect forwarding

forward must become an intrinsic:

  • for vars with ref storage class: as-is, yields the original lvalue
  • non-ref lvalues (NEW semantics): 're-interpret as rvalue' - no move, and accordingly no destruction after forwarding (because the rvalue will already be destructed earlier)
    • only valid for locals (incl. params), the destruction of other lvalues cannot be skipped
    • invalid/undefined to access the original lvalue after forwarding it (has been destructed already)
    • probably only valid:
      • as function call argument expressions (glue layer needs to treat it like a frontend-generated temporary, passing it directly by ref)
      • as assignment right-hand-sides, for move-assign (dst = forward!src; => dst.opAssign(forward!src);)
      • as return expressions, for move-constructions (but prefer NRVO if possible, for direct emplace)
  • probably needs to keep template syntax (forward!x, not forward(x)) for backwards compatibility with druntime template

Let's take a look at an example:

import core.stdc.stdio;
import core.lifetime;

struct S {
    int x;

    this(int x) {
        this.x = x;
        printf("ctor: %p\n", &this);
    }

    this(this) {
        printf("copy: %p\n", &this);
    }

    ~this() {
        printf("dtor: %p\n", &this);
    }
}

void main() {
    {
        auto lval = S(1);
        printf("lval: %p\n", &lval);
        const r = bar1(lval);
        printf("   r: %p\n", &r);
    }

    {
        printf("\nrvalue:\n");
        const r = bar1(S(2));
        printf("   r: %p\n", &r);
    }
}

S bar1()(auto ref S s) {
    printf("bar1: %p\n", &s);
    return bar2(forward!s);
}

S bar2()(auto ref S s) {
    printf("bar2: %p\n", &s);
    return bar3(forward!s);
}

S bar3()(auto ref S s) {
    printf("bar3: %p\n", &s);
    return bar4(forward!s);
}

S bar4()(auto ref S s) {
    printf("bar4: %p, got a ref: %d\n", &s, __traits(isRef, s));
    return s; // copy parameter lvalue to return value
}

Output with DMD (and GDC), no backend optimizations:

ctor: 0x7ffebea26460
lval: 0x7ffebea26460
bar1: 0x7ffebea26460
bar2: 0x7ffebea26460
bar3: 0x7ffebea26460
bar4: 0x7ffebea26460, got a ref: 1
copy: 0x7ffebea263d0
   r: 0x7ffebea26464
dtor: 0x7ffebea26464
dtor: 0x7ffebea26460

rvalue:
ctor: 0x7ffebea2647c
bar1: 0x7ffebea26488
bar2: 0x7ffebea26424
bar3: 0x7ffebea263e4
bar4: 0x7ffebea263a4, got a ref: 0
copy: 0x7ffebea26358
dtor: 0x7ffebea263a4
dtor: 0x7ffebea263e4
dtor: 0x7ffebea26424
dtor: 0x7ffebea26488
   r: 0x7ffebea26478
dtor: 0x7ffebea26478

What we see is that current core.lifetime.forward propagates the ref-ness of the parameter, but has to core.lifetime.move it in the non-ref case, creating 3 explicit moves + destructions.

We also see that there are compiler-implicit moves ('optimized', i.e., no reset+destruction of the moved-from value):

  • when passing the S(2) rvalue to bar1 (not sure why, seems like a bug) - note the different addresses of ctor and bar1
  • for the return values - the addresses of copy and r diverge (constructed @ 0x7ffebea26358, destructed @ 0x7ffebea26478)

With LDC, we at least already get perfectly forwarded return values (the addresses of copy and r are identical):

ctor: 0x7ffda922edbc
lval: 0x7ffda922edbc
bar1: 0x7ffda922edbc
bar2: 0x7ffda922edbc
bar3: 0x7ffda922edbc
bar4: 0x7ffda922edbc, got a ref: 1
copy: 0x7ffda922edb8
   r: 0x7ffda922edb8
dtor: 0x7ffda922edb8
dtor: 0x7ffda922edbc

rvalue:
ctor: 0x7ffda922eda0
bar1: 0x7ffda922ed6c
bar2: 0x7ffda922ed1c
bar3: 0x7ffda922eccc
bar4: 0x7ffda922ecc8, got a ref: 0
copy: 0x7ffda922eda4
dtor: 0x7ffda922ecc8
dtor: 0x7ffda922eccc
dtor: 0x7ffda922ed1c
dtor: 0x7ffda922ed6c
   r: 0x7ffda922eda4
dtor: 0x7ffda922eda4

The compiler needs to implement RVO (Return Value Optimization, different to Named-RVO!) to enable perfect forwarding of the return values. In this example, r is allocated in main, then its address is passed and forwarded as a hidden pointer all the way to bar4, where it gets copy-constructed.

With the proposed forward semantics, we'd get perfect forwarding of the s parameters too, without the 3 explicit moves and destructions. The S(2) rvalue would be created in main, then passed and forwarded directly by ref all the way to bar4, where it would get destructed when the s param goes out of scope.

Cherry on top: Last-use optimization from DIP 1040

This would make the compiler automatically forward suitable lvalues. In the example, we wouldn't have to use a single explicit forward in the barN trampolines, and the copy-construction of the return value in the non-ref version of bar4 would be optimized to a move-construction (return forward!s).
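
As a rough illustration of that last point, the move-construction can already be requested by hand today by writing return forward!s in the final function. The sketch below assumes the current library forward (which moves a non-ref parameter); passFwd and the copies counter are hypothetical names, not part of the example above.

```d
import core.lifetime : forward;

struct S
{
    int x;
    static int copies;          // postblit counter, for illustration only
    this(this) { ++copies; }
    ~this() {}
}

// Like bar4, but forwards the parameter into the return value.
S passFwd()(auto ref S s)
{
    return forward!s;   // non-ref s: moved; ref s: copied
}

void main()
{
    S.copies = 0;
    auto a = passFwd(S(2));     // rvalue argument
    assert(a.x == 2 && S.copies == 0);

    auto lv = S(3);
    S.copies = 0;
    auto b = passFwd(lv);       // lvalue argument, passed by ref
    assert(b.x == 3 && lv.x == 3 && S.copies == 1);
}
```

With the last-use optimization, a plain return s would be expected to behave like the rvalue case here without the explicit forward.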

October 13

On Sunday, 13 October 2024 at 10:43:27 UTC, kinke wrote:

>   • non-ref lvalues (NEW semantics): 're-interpret as rvalue' - no move, and accordingly no destruction after forwarding (because the rvalue will already be destructed earlier)
>     • invalid/undefined to access the original lvalue after forwarding it (has been destructed already)

Unless the compiler can statically detect and prevent such accesses, this would make forward!x a @system operation, which IMO would be a step backward.
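
To make the concern concrete, here is a small sketch of what the current library forward guarantees; sink and relay are made-up names. Today the forwarded non-ref parameter is moved from and reset to S.init, so reading it afterwards is well-defined. Under the proposed semantics, the same read would touch an already-destructed object.

```d
import core.lifetime : forward;

struct S
{
    int x;
    ~this() {}  // having a destructor makes move reset the source to S.init
}

int sink(S s) { return s.x; }

int relay()(auto ref S s)
{
    const got = sink(forward!s);
    static if (!__traits(isRef, s))
        assert(s.x == 0);   // moved-from parameter is a valid S.init today
    return got;
}

void main()
{
    assert(relay(S(7)) == 7);   // rvalue: s is moved into sink
    auto lv = S(9);
    assert(relay(lv) == 9);     // lvalue: s is a ref, sink gets a copy
    assert(lv.x == 9);          // the caller's lvalue is untouched
}
```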

October 13
On 10/13/24 12:43, kinke wrote:
> IMO we need to make `core.lifetime.{move,forward}` compiler intrinsics, to enable further optimizations that aren't possible with a library solution.
> ...

Thanks for writing this up! I think this is a good starting point, but I would make some small tweaks.

> #### Move
> 
> * semantics: move an lvalue to a new rvalue, at a new memory address, 'hijacking' the lvalue resources; the lvalue is reset to T.init (blit, not assignment!) afterwards

Makes sense, though if the compiler can determine that something is a last use, it can optimize out the address change.

> * will be complete with move ctor; syntax needs to be decided, but signature is `(ref T)` (yes, must be an explicit ref)

I can see either idea working here. What is most important is that it is in fact treated as a constructor.

I guess the benefit of `this(S)` is uniformity with `this(ref S)`, and the benefit of `=this(ref S)` or `opMove(ref S)` is that it is obvious that the destructor will be called by the caller, potentially much later.

>    * allows to opt out of the default blit (memcpy struct payload), e.g., to fix up interior pointers
>    * move ctor interop with C++ should be doable (just getting the extern(C++) mangle right)
>    * problem: handle/avoid all compiler-implicit moves/blits (would have to call move ctor and dtor now; emplace FTW!)
> * would be nice as intrinsic:
>    * not to have to import `core.lifetime` everywhere and end up with complicated template bloat for a basically trivial operation
>    * potential optimization: elide lvalue reset to T.init and its destruction iff:
>      * it is a local (can skip destruction)
>      * and not used after the move
>      * and the destruction of T.init is a noop (modulo mods to the struct's own payload), so its elision is not observable
> ...

Well, as I alluded to earlier, I think in such cases the object should just keep its original address and the move constructor does not need to be called at all. It reduces to a safe version of `__rvalue` in this case.

> #### When move isn't sufficient: perfect forwarding
> 
> forward must become an intrinsic:
> * for vars with `ref` storage class: as-is, yields the original lvalue
> * non-ref lvalues (NEW semantics): 're-interpret as rvalue' - no move, and accordingly no destruction after forwarding (because the rvalue will already be destructed earlier)
>    * only valid for locals (incl. params), the destruction of other lvalues cannot be skipped
>    * invalid/undefined to access the original lvalue after forwarding it (has been destructed already)

I think it would be better to do a `move`, where the `move` will usually be optimized to a safe `__rvalue` as above. I think unsafe `__rvalue` should be possible, but not `@safe`.

>    * probably only valid:
>      * as function call argument expressions (glue layer needs to treat it like a frontend-generated temporary, passing it directly by ref)
>      * as assignment right-hand-sides, for move-assign (`dst = forward!src;` => `dst.opAssign(forward!src);`)
>      * as return expressions, for move-constructions (but prefer NRVO if possible, for direct emplace)
> * probably needs to keep template syntax (`forward!x`, not `forward(x)`) for backwards compatibility with druntime template
> 
> Let's take a look at an example:
> ```D
> import core.stdc.stdio;
> import core.lifetime;
> 
> struct S {
>      int x;
> 
>      this(int x) {
>          this.x = x;
>          printf("ctor: %p\n", &this);
>      }
> 
>      this(this) {
>          printf("copy: %p\n", &this);
>      }
> 
>      ~this() {
>          printf("dtor: %p\n", &this);
>      }
> }
> 
> void main() {
>      {
>          auto lval = S(1);
>          printf("lval: %p\n", &lval);
>          const r = bar1(lval);
>          printf("   r: %p\n", &r);
>      }
> 
>      {
>          printf("\nrvalue:\n");
>          const r = bar1(S(2));
>          printf("   r: %p\n", &r);
>      }
> }
> 
> S bar1()(auto ref S s) {
>      printf("bar1: %p\n", &s);
>      return bar2(forward!s);
> }
> 
> S bar2()(auto ref S s) {
>      printf("bar2: %p\n", &s);
>      return bar3(forward!s);
> }
> 
> S bar3()(auto ref S s) {
>      printf("bar3: %p\n", &s);
>      return bar4(forward!s);
> }
> 
> S bar4()(auto ref S s) {
>      printf("bar4: %p, got a ref: %d\n", &s, __traits(isRef, s));
>      return s; // copy parameter lvalue to return value
> }
> ```
> 
> Output with DMD (and GDC), no backend optimizations:
> ```
> ctor: 0x7ffebea26460
> lval: 0x7ffebea26460
> bar1: 0x7ffebea26460
> bar2: 0x7ffebea26460
> bar3: 0x7ffebea26460
> bar4: 0x7ffebea26460, got a ref: 1
> copy: 0x7ffebea263d0
>     r: 0x7ffebea26464
> dtor: 0x7ffebea26464
> dtor: 0x7ffebea26460
> 
> rvalue:
> ctor: 0x7ffebea2647c
> bar1: 0x7ffebea26488
> bar2: 0x7ffebea26424
> bar3: 0x7ffebea263e4
> bar4: 0x7ffebea263a4, got a ref: 0
> copy: 0x7ffebea26358
> dtor: 0x7ffebea263a4
> dtor: 0x7ffebea263e4
> dtor: 0x7ffebea26424
> dtor: 0x7ffebea26488
>     r: 0x7ffebea26478
> dtor: 0x7ffebea26478
> ```
> 
> What we see is that current `core.lifetime.forward` propagates the ref-ness of the parameter, but has to `core.lifetime.move` it in the non-ref case, creating 3 explicit moves + destructions.
> 
> We also see that there are compiler-implicit moves ('optimized', i.e., no reset+destruction of the moved-from value):
> * when passing the `S(2)` rvalue to `bar1` (not sure why, seems like a bug) - note the different addresses of `ctor` and `bar1`
> * for the return values - the addresses of `copy` and `r` diverge (constructed @ 0x7ffebea26358, destructed @ 0x7ffebea26478)
> 
> With LDC, we at least already get perfectly forwarded return values (the addresses of `copy` and `r` are identical):
> ```
> ctor: 0x7ffda922edbc
> lval: 0x7ffda922edbc
> bar1: 0x7ffda922edbc
> bar2: 0x7ffda922edbc
> bar3: 0x7ffda922edbc
> bar4: 0x7ffda922edbc, got a ref: 1
> copy: 0x7ffda922edb8
>     r: 0x7ffda922edb8
> dtor: 0x7ffda922edb8
> dtor: 0x7ffda922edbc
> 
> rvalue:
> ctor: 0x7ffda922eda0
> bar1: 0x7ffda922ed6c
> bar2: 0x7ffda922ed1c
> bar3: 0x7ffda922eccc
> bar4: 0x7ffda922ecc8, got a ref: 0
> copy: 0x7ffda922eda4
> dtor: 0x7ffda922ecc8
> dtor: 0x7ffda922eccc
> dtor: 0x7ffda922ed1c
> dtor: 0x7ffda922ed6c
>     r: 0x7ffda922eda4
> dtor: 0x7ffda922eda4
> ```
> 
> The compiler needs to implement RVO (Return Value Optimization, different to Named-RVO!) to enable perfect forwarding of the return values. In this example, `r` is allocated in `main`, then its address is passed and forwarded as a hidden pointer all the way to `bar4`, where it gets copy-constructed.
> 
> With the proposed `forward` semantics, we'd get perfect forwarding of the `s` parameters too, without the 3 explicit moves and destructions. The `S(2)` rvalue would be created in `main`, then passed and forwarded directly by ref all the way to `bar4`, where it would get destructed when the `s` param goes out of scope.
> 
> #### Cherry on top: Last-use optimization from DIP 1040
> 
> This would make the compiler automatically `forward` suitable lvalues. In the example, we wouldn't have to use a single explicit `forward` in the `barN` trampolines, *and* the copy-construction of the return value in the non-ref version of `bar4` would be optimized to a move-construction (`return forward!s`).

Sounds good, but I think simple cases like this one should be a priority. Even if there is no data-flow analysis as advanced as the one proposed in DIP1040, I think it is important that there is no copy in `bar4`.