Thread overview
Switch codegen.
Aug 07, 2023
claptrap
Aug 13, 2023
Basile B.
Aug 13, 2023
claptrap
Aug 13, 2023
Johan
Aug 13, 2023
claptrap
Aug 13, 2023
Johan
Aug 13, 2023
claptrap
Aug 13, 2023
claptrap
Sep 01
claptrap
August 07, 2023

Given code like this...

const int x;
while(...)
{
    switch (x)
    {
        case 0: whatever; break;
        case 1: whatever; break;
        case 2: whatever; break;
        case 3: whatever; break;
        case 4: whatever; break;
    }
}

It generates a jump table if there's at least 4 or 5 cases, but at the start of the switch it generates 6 instructions (x64), check the bounds, lookup / calc the destination, and the actual jump. Like this...

        lea     r13, [rip + .LJTI0_0]
        cmp     r14d, 7
        ja      .LBB0_11
        movsxd  rax, dword ptr [r13 + 4*r12]
        add     rax, r13
        jmp     rax

So the point is if the switch is in a loop and x is constant, most of that could be done ahead, outside the loop. It could actually be reduced to a single indirect jump inside the loop. It seems there is a LLVM IR instruction that may be useful for this, "indirectbr", it's a sort of computed goto.

https://llvm.org/docs/LangRef.html#indirectbr-instruction

Is this at all feasible? If so how would this be done? In the actual IR codegen or as an optimization pass? I willing to do the work if it's possible.

August 13, 2023

On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:

>

Given code like this...

const int x;
while(...)
{
    switch (x)
    {
        case 0: whatever; break;
        case 1: whatever; break;
        case 2: whatever; break;
        case 3: whatever; break;
        case 4: whatever; break;
    }
}

It generates a jump table if there's at least 4 or 5 cases, but at the start of the switch it generates 6 instructions (x64), check the bounds, lookup / calc the destination, and the actual jump. Like this...

        lea     r13, [rip + .LJTI0_0]
        cmp     r14d, 7
        ja      .LBB0_11
        movsxd  rax, dword ptr [r13 + 4*r12]
        add     rax, r13
        jmp     rax

So the point is if the switch is in a loop and x is constant, most of that could be done ahead, outside the loop. It could actually be reduced to a single indirect jump inside the loop. It seems there is a LLVM IR instruction that may be useful for this, "indirectbr", it's a sort of computed goto.

https://llvm.org/docs/LangRef.html#indirectbr-instruction

Is this at all feasible? If so how would this be done? In the actual IR codegen or as an optimization pass? I willing to do the work if it's possible.

As I've heard good feedback about that alternative to LLVM switch and also because I'm also interested to implement that in another compiler, I've made a comparison of both way here : https://godbolt.org/z/G41P88EY7 (D code at the end as comments)

It seems that this would only work well for final switches (no default block) over enum members. The main problem is that you need to build a constant array made of blockaddress, meaning that

  1. if the first member value is not 0 then you'll need to apply an offset for the index used to select the right blockaddress. (Source 1, line 13)
  2. if there are gaps between members, you'll need to use a dummy block, containing unreachable, to fill them.

A note about the fact that I think that this would only works for enums: final switch over a value of an integral type is currently a lie (see issue 6060), there's still a fallback with a runtime error.

Otherwise, more concretly in LDC, the work would have to be done here. At first glance you'll need to define a flag, similarly to the already existing useSwitchInst (used for the infamous case where cases are not compile-time constants), let's say useIndirectBrInst.

To conclude, that seems faisable even if that would only be beneficial for a few specific cases.

August 13, 2023

I'm quite certain that if x is constant, the bounds check and table lookup will be hoisted out of the loop, indeed leading to just an indirect jump inside the loop.
Of course, this requires running the optimizer (-Ox, x>0).

-Johan

On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:

>

Given code like this...

const int x;
while(...)
{
    switch (x)
    {
        case 0: whatever; break;
        case 1: whatever; break;
        case 2: whatever; break;
        case 3: whatever; break;
        case 4: whatever; break;
    }
}

It generates a jump table if there's at least 4 or 5 cases, but at the start of the switch it generates 6 instructions (x64), check the bounds, lookup / calc the destination, and the actual jump. Like this...

        lea     r13, [rip + .LJTI0_0]
        cmp     r14d, 7
        ja      .LBB0_11
        movsxd  rax, dword ptr [r13 + 4*r12]
        add     rax, r13
        jmp     rax

So the point is if the switch is in a loop and x is constant, most of that could be done ahead, outside the loop. It could actually be reduced to a single indirect jump inside the loop. It seems there is a LLVM IR instruction that may be useful for this, "indirectbr", it's a sort of computed goto.

https://llvm.org/docs/LangRef.html#indirectbr-instruction

Is this at all feasible? If so how would this be done? In the actual IR codegen or as an optimization pass? I willing to do the work if it's possible.

August 13, 2023

On Sunday, 13 August 2023 at 10:45:11 UTC, Basile B. wrote:

>

As I've heard good feedback about that alternative to LLVM switch and also because I'm also interested to implement that in another compiler, I've made a comparison of both way here : https://godbolt.org/z/G41P88EY7 (D code at the end as comments)

It seems that this would only work well for final switches (no default block) over enum members. The main problem is that you need to build a constant array made of blockaddress, meaning that

  1. if the first member value is not 0 then you'll need to apply an offset for the index used to select the right blockaddress. (Source 1, line 13)
  2. if there are gaps between members, you'll need to use a dummy block, containing unreachable, to fill them.

A note about the fact that I think that this would only works for enums: final switch over a value of an integral type is currently a lie (see issue 6060), there's still a fallback with a runtime error.

Otherwise, more concretly in LDC, the work would have to be done here. At first glance you'll need to define a flag, similarly to the already existing useSwitchInst (used for the infamous case where cases are not compile-time constants), let's say useIndirectBrInst.

To conclude, that seems faisable even if that would only be beneficial for a few specific cases.

From what I can tell even "final switch" generates a bounds check, it does a compare and then a ja to the topmost case, i guess it's "better to at least jump to known code even if its the wrong code than it is to jump into the unknown. Im not sure if this is the IR generated by LDC, or what LLVM does.

Anyway I'm going to download the source and play around with it see what I can figure out.

Thanks.

August 13, 2023

On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:

>

I'm quite certain that if x is constant, the bounds check and table lookup will be hoisted out of the loop, indeed leading to just an indirect jump inside the loop.
Of course, this requires running the optimizer (-Ox, x>0).

I've tried as many variations as I can figure and it always generates this...

    cmp     r14d, 3
    ja      .LBB0_9
    movsxd  rax, dword ptr [r12 + 4*r15]
    add     rax, r12
    jmp     rax

https://d.godbolt.org/z/jzoevxoG4

That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization.

With my own code I use "asm int 3" to breakpoint in the function, so I can look at the disassembly even with O3 optimization, using visual studio, and it still generates those 5 instructions.

August 13, 2023

On Sunday, 13 August 2023 at 18:57:42 UTC, claptrap wrote:

>

On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:

>

I'm quite certain that if x is constant, the bounds check and table lookup will be hoisted out of the loop, indeed leading to just an indirect jump inside the loop.
Of course, this requires running the optimizer (-Ox, x>0).

I've tried as many variations as I can figure and it always generates this...

    cmp     r14d, 3
    ja      .LBB0_9
    movsxd  rax, dword ptr [r12 + 4*r15]
    add     rax, r12
    jmp     rax

https://d.godbolt.org/z/jzoevxoG4

That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization.

It's easier to look at LLVM IR in this case.
https://d.godbolt.org/z/b6b3TTeMW

  1. By providing the implementation of bar, you are giving the optimizer extra information. It may not inline bar, but it will make use of what it knows about the result of bar. That's why at -O3, there is insane optimization going on. Simply remove the implementation of bar for more insight (make it a function declaration, not a definition).

  2. Indeed at -O2, the optimizer deems it non-profitable or does not see that it can hoist the load out of the for loop. At -O3 it does hoist it out, like you wanted.

cheers,
Johan

August 13, 2023

On Sunday, 13 August 2023 at 20:52:18 UTC, Johan wrote:

>

On Sunday, 13 August 2023 at 18:57:42 UTC, claptrap wrote:

>

On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:

>

I'm quite certain that if x is constant, the bounds check and table lookup will be hoisted out of the loop, indeed leading to just an indirect jump inside the loop.
Of course, this requires running the optimizer (-Ox, x>0).

I've tried as many variations as I can figure and it always generates this...

    cmp     r14d, 3
    ja      .LBB0_9
    movsxd  rax, dword ptr [r12 + 4*r15]
    add     rax, r12
    jmp     rax

https://d.godbolt.org/z/jzoevxoG4

That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization.

It's easier to look at LLVM IR in this case.
https://d.godbolt.org/z/b6b3TTeMW

  1. By providing the implementation of bar, you are giving the optimizer extra information. It may not inline bar, but it will make use of what it knows about the result of bar. That's why at -O3, there is insane optimization going on. Simply remove the implementation of bar for more insight (make it a function declaration, not a definition).

  2. Indeed at -O2, the optimizer deems it non-profitable or does not see that it can hoist the load out of the for loop. At -O3 it does hoist it out, like you wanted.

cheers,
Johan

It hasnt hoisted the load out, it's transformed it from a loop containing a switch, to a switch containing 5 loops. The switch statement is never branched back to.

That does work for my case, probably the loop is too large, even with O3, the switch statement is still...

00007FF7C703075F movsxd rdi,dword ptr [rdx+rsi*4]
00007FF7C7030763 add rdi,rdx
00007FF7C7030766 jmp rdi

And it still checks the switch index inside the loop, it has moved it a bit earlier in the code. It seems it can tell that some of the code before the switch has no affect if the switch is bypassed.

August 13, 2023

On Sunday, 13 August 2023 at 22:55:12 UTC, claptrap wrote:

>

On Sunday, 13 August 2023 at 20:52:18 UTC, Johan wrote:

See..

https://d.godbolt.org/z/E3KaoP7o9

if you have enough extra stuff in the loop it goes back to a loop containing a switch, and 5 instructions for the switch.

September 01

On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:

>

I've given up looking into this for a number of reasons. Its taking me a long time to understand the code and it seems IndirectBr is different enough from switch that there's a lot of extra work that would need to be done. The case destinations need to be basic blocks and I cant figure how to get that from the switch cases. Plus it looks like you needs a way to build a lookup table with the addresses of the case blocks in. You then look up the target in that and pass it to IndirectBr. A lot of stuff I couldn't figure out.

And I was looking at this in the switch statement IR gen, so it still wouldn't help if the goal was to pull the address calculation & bounds check out of an enclosing loop.

It seems like its an optimization that should be done by the backend anyway. I mean the LLVM IR is SSA, so the address calc & check should be safe to move to where the condition variable is defined? If you switch(X), you can move the address lookup to where X is defined? Maybe?