On Wed, 11 Sept 2024 at 18:26, Walter Bright via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
On 9/11/2024 4:44 AM, Manu wrote:
> Okay, I see. You're depending on the optimiser to specifically collapse the goto
> into the branch as a simplification.

Actually, the same code is generated without optimization. All it's doing is
removing blocks that consist of nothing but "goto". It's a trivial optimization,
and was there in the earliest version of the compiler.


> Surely that's not even remotely reliable. There are several ways to optimise
> that function, and I see no reason an optimiser would reliably choose a
> construct like you show.

gcc -O does more or less the same thing.


> I'm actually a little surprised; a lifetime of experience with this sort of
> thing might have led me to predict that the optimiser would /actually/ shift
> the `return 0` up into the place of the goto, effectively eliminating the
> goto... I'm sure I've seen optimisers do that transformation before, but I can't
> recall ever noting an instance of code generation that looks like what you
> pasted... I reckon I might have spotted that before.

The goto remains in the gcc -O version.


> ... and turns out, I'm right. I was so surprised with the codegen you present
> that I pulled out compiler explorer and ran some experiments.
> I tested GCC and Clang for x86, MIPS, and PPC, all of which I am extremely
> familiar with, and all of them optimise the way I predicted. None of them showed
> a pattern like you presented here.

gcc -O produced:

```
foo:
     mov       EAX,0
     test      EDI,EDI
     jne       L1B
     sub       RSP,8
     call      bar@PC32
     mov       EAX,1
     add       RSP,8
L1B:    rep
     ret
baz:
     mov       EAX,0
     test      EDI,EDI
     jne       L38
     sub       RSP,8
     call      bar@PC32
     mov       EAX,1
     add       RSP,8
L38:    rep
     ret
```

> If I had to guess, I would actually imagine that GCC and Clang will very
> deliberately NOT make a transformation like the one you show, for the precise
> reason that such a transformation changes the nature of static branch prediction
> which someone might have written code to rely on. It would be dangerous for the
> optimiser to transform the code in the way you show, and so it doesn't.

The transformation is (intermediate code):
```
if (i) goto L2; else goto L4;
L2:
    goto L3;
L4:
    bar();
    return 1;
L3:
    return 0;
```
becomes:
```
if (!i) goto L3; else goto L4;
L4:
     bar();
     return 1;
L3:
     return 0;
```
I.e. the goto->goto was replaced with a single goto.
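
A minimal sketch of this kind of pass in C, assuming a toy block representation. The `struct block` layout and the names `resolve`, `thread_jumps`, and `build_example` are hypothetical and not taken from any real compiler; the example graph mirrors the intermediate code above.

```c
#include <assert.h>

/* Toy intermediate representation. Everything here is hypothetical and
   only illustrates the idea; it is not any real compiler's data structure. */
enum kind { K_GOTO, K_COND, K_RET };

struct block {
    enum kind kind;
    int t;  /* K_GOTO: target; K_COND: target when the condition is true */
    int f;  /* K_COND: target when the condition is false */
};

/* Follow a chain of goto-only blocks to the first block that does real work.
   The hop limit keeps a cycle of gotos from looping forever. */
static int resolve(const struct block *b, int n, int i)
{
    for (int hops = 0; hops < n && b[i].kind == K_GOTO; hops++) {
        assert(b[i].t >= 0 && b[i].t < n);
        i = b[i].t;
    }
    return i;
}

/* Retarget every edge that points at a goto-only block.
   Returns the number of edges changed. */
static int thread_jumps(struct block *b, int n)
{
    int changed = 0;
    for (int i = 0; i < n; i++) {
        if (b[i].kind == K_RET)
            continue;
        int t = resolve(b, n, b[i].t);
        if (t != b[i].t) { b[i].t = t; changed++; }
        if (b[i].kind == K_COND) {
            int f = resolve(b, n, b[i].f);
            if (f != b[i].f) { b[i].f = f; changed++; }
        }
    }
    return changed;
}

/* The example graph: 0 = entry, 1 = L2, 2 = L4, 3 = L3. */
static void build_example(struct block b[4])
{
    b[0] = (struct block){ K_COND, 1, 2 };  /* if (i) goto L2; else goto L4; */
    b[1] = (struct block){ K_GOTO, 3, 0 };  /* L2: goto L3; */
    b[2] = (struct block){ K_RET,  0, 0 };  /* L4: bar(); return 1; */
    b[3] = (struct block){ K_RET,  0, 0 };  /* L3: return 0; */
}
```

On the example graph the pass changes one edge: the entry block's true target moves from L2 to L3, after which L2 is unreachable and can be dropped. The sketch only retargets edges; it does not model block ordering or which arm ends up as the fall-through.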

It's not dangerous or weird at all, nor does it interfere with branch prediction.

It inverts the condition. In the case on trial, that inverts the branch prediction.
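
If the predicted direction matters, GCC and Clang let the source state it explicitly through `__builtin_expect`, rather than leaving it to whichever block ends up as the fall-through. A minimal sketch, assuming `baz` has roughly the shape implied by the intermediate code quoted above; the `bar` stand-in and the choice of which path is likely are assumptions for illustration only.

```c
/* Stand-in for the external bar() in this thread (an assumption); it counts
   calls so the behaviour can be checked. */
static int bar_calls = 0;
static void bar(void) { bar_calls++; }

/* Assumed shape of baz, reconstructed from the intermediate code:
   return 0 when i is nonzero, otherwise call bar() and return 1. */
int baz(int i)
{
    /* __builtin_expect(expr, expected) is a GCC/Clang builtin. It evaluates
       to expr and tells the compiler which value is expected at run time. */
    if (__builtin_expect(i != 0, 0))
        return 0;   /* hinted as the unlikely path */
    bar();
    return 1;       /* hinted as the likely path */
}
```

Whether the hint changes the generated code depends on the compiler, target, and optimisation level, so the output is still worth checking.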
 
But that aside, I'm even more confused; I couldn't reproduce that in any of my tests.
Here's a bunch of my test compiles... they all turn out the same:

gcc:

baz(int):
        test    edi, edi
        je      .L10
        xor     eax, eax
        ret
.L10:
        sub     rsp, 8
        call    bar()
        mov     eax, 1
        add     rsp, 8
        ret

clang:

baz(int):
        xor     eax, eax
        test    edi, edi
        je      .LBB0_1
        ret
.LBB0_1:
        push    rax
        call    bar()@PLT
        mov     eax, 1
        add     rsp, 8
        ret

gcc-powerpc:

baz(int):
        cmpwi 0,3,0
        beq- 0,.L9
        li 3,0
        blr
.L9:
        stwu 1,-16(1)
        mflr 0
        stw 0,20(1)
        bl bar()
        lwz 0,20(1)
        li 3,1
        addi 1,1,16
        mtlr 0
        blr

arm64:

baz(int):
        cbz     w0, .L9
        mov     w0, 0
        ret
.L9:
        stp     x29, x30, [sp, -16]!
        mov     x29, sp
        bl      bar()
        mov     w0, 1
        ldp     x29, x30, [sp], 16
        ret

clang-mips:

baz(int):
        beqz    $4, $BB0_2
        addiu   $2, $zero, 0
        jr      $ra
        nop
$BB0_2:
        addiu   $sp, $sp, -24
        sw      $ra, 20($sp)
        sw      $fp, 16($sp)
        move    $fp, $sp
        jal     bar()
        nop
        addiu   $2, $zero, 1
        move    $sp, $fp
        lw      $fp, 16($sp)
        lw      $ra, 20($sp)
        jr      $ra
        addiu   $sp, $sp, 24

Even if you can convince a compiler to emit the output you show, I would never consider that a reliable strategy. The optimiser could do all kinds of things, even though in all my experiments it does exactly what I predicted it would.