Thread overview
[Bug 288] strange nonsensical x86-64 code generation with -O3 - rats' nest of useless conditional jumps
Mar 22, 2018
Iain Buclaw
Mar 23, 2018
Cecil Ward
Mar 23, 2018
Iain Buclaw
Mar 26, 2018
Cecil Ward
March 22, 2018
https://bugzilla.gdcproject.org/show_bug.cgi?id=288

--- Comment #1 from Iain Buclaw <ibuclaw@gdcproject.org> ---
> See the long list of useless conditional jumps towards the end of the first function in the asm output (whose demangled name is test.t1(unit))

Well, you'd never use -O3 if you care about speed anyway. :-)

And they are not useless jumps; it's just the foreach loop unrolled in its entirety.  You can see that it's a feature of the gcc-7 series and later: regardless of the target, they all produce the same unrolled loop.

https://explore.dgnu.org/g/vD3N4Y

It might be a nice experiment to add pragma(ivdep) and pragma(unroll) support
to give you more control.

https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html

I wouldn't hold my breath though (this is not strictly a bug).

-- 
You are receiving this mail because:
You are watching all bug changes.
March 23, 2018
On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
> https://bugzilla.gdcproject.org/show_bug.cgi?id=288
>
> --- Comment #1 from Iain Buclaw <ibuclaw@gdcproject.org> ---
>> See the long list of useless conditional jumps towards the end of the first function in the asm output (whose demangled name is test.t1(unit))
>
> Well, you'd never use -O3 if you care about speed anyway. :-)
>
> And they are not useless jumps; it's just the foreach loop unrolled in its entirety.  You can see that it's a feature of the gcc-7 series and later: regardless of the target, they all produce the same unrolled loop.
>
> https://explore.dgnu.org/g/vD3N4Y
>
> It might be a nice experiment to add pragma(ivdep) and pragma(unroll) support
> to give you more control.
>
> https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
>
> I wouldn't hold my breath though (this is not strictly a bug).

Agreed, it is possibly not a bug: I don't see that the code is dysfunctional, though I haven't looked all the way through it. But since the backend is performing an optimisation here with the unrolling, producing this weird sub-optimal code is imho a bug, in that the goal of _optimisation_ is not attained.

No, I understand this is nothing to do with D, and I understand that this is unrolling.

But notice that the conditional jumps all target the same location, and the sequence finishes off with an unconditional jump to that same location.

I feel this is in part just a quirk of unrolling, but that's not all, because the jumps don't make sense:
 cmp #n / jxx L3
 cmp #m / jxx L3
 jmp L3

is what we have, so it all basically does absolutely nothing, unless cmp 1 / cmp 2 / cmp 3 / cmp 4 is an incredibly bad way of testing ( x >= 1 && x <= 4 ); but with 30-odd tests it isn't very funny.

I know this is merely debug-only code, but I am wondering what else might happen if you are misguided enough to use the crazy -O3 with unrolled loops that have conditionals in them.

My other complaint about GCC back-end code generation is that it (sometimes) doesn't go for jump-less conditional-move (cmov-style) operations when it can. For x86/x64, LDC sometimes generates jump-free code using conditional instructions where GCC does not.

I can fix the problem with GDC by using a single & instead of a &&, which happens to be legal here. Sometimes in the past I have needed to take steps to make sure that I can do such an operator-substitution trick in order to get jump-free, far faster code: faster where the alternatives are extremely short (and side-effect-free) and branch-prediction failure is a certainty.

I don't know if there are ways in which the backend could try to ascertain whether the results of certain unrollings are really bad. In some cases they could be bad because the code is too long and causes problems with code-cache size, or won't fit into a loop buffer. A highly CPU-sub-variant-specific check on the generated code size would be needed, at least in all cases where there is still a loop left (as opposed to full unrolling of a known size), as every kind of AMD and Intel processor is different, as Agner Fog warns us. Here, though, I didn't even explicitly ask for unrolling, so you might harshly say that it is the compiler's job to work out whether it is actually an anti-optimisation, regardless of the possible reasons why the result may be bad news, never mind just based on the total generated code size not fitting into some per-CPU limit.

My reason for reporting this was to ask how loop unrolling behaves in later versions of the back end, to ask about jump generation vs jump-free alternatives (LDC showing the correct way to do things), and to ask whether there are any suboptimality nasties lurking in code that does not merely come down to driving an assert.

I would hope for an optimisation that handles the case of densely packed cmp #1 / cmp #2 / cmp #3 / cmp #4 etc, especially with no holes, in the case where _all the jumps go to the same target_, so that it can be reduced down to a two-test range check: a huge optimisation. I would also hope that a conditional jump followed by an unconditional jump to the same target could be spotted and handled too. (A general low-level peephole optimisation, then? i.e. jxx L1 / jmp L1 => jmp L1.)

Perhaps this is all being generated too late: the relevant optimisations have already happened, and the opportunities those optimisers provide have been and gone. Would it be possible for the backend to run a number of repeat optimisation passes of certain kinds after unrolled code is generated, or doesn't it work like that?

Anyway, this is not for you but for that particular backend, I suspect. I was wondering if someone could have a word and pass it on to the relevant people. I think it's worth making -O3 more generally usable, rather than crazy because of good ideas gone bad.
March 23, 2018
On Friday, 23 March 2018 at 00:39:13 UTC, Cecil Ward wrote:
> On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
>> https://bugzilla.gdcproject.org/show_bug.cgi?id=288
>>
>> --- Comment #1 from Iain Buclaw <ibuclaw@gdcproject.org> ---
>>> See the long list of useless conditional jumps towards the end of the first function in the asm output (whose demangled name is test.t1(unit))
>>
>> Well, you'd never use -O3 if you care about speed anyway. :-)
>>
>> And they are not useless jumps; it's just the foreach loop unrolled in its entirety.  You can see that it's a feature of the gcc-7 series and later: regardless of the target, they all produce the same unrolled loop.
>>
>> https://explore.dgnu.org/g/vD3N4Y
>>
>> It might be a nice experiment to add pragma(ivdep) and pragma(unroll) support
>> to give you more control.
>>
>> https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
>>
>> I wouldn't hold my breath though (this is not strictly a bug).
>
> Agreed, it is possibly not a bug: I don't see that the code is dysfunctional, though I haven't looked all the way through it. But since the backend is performing an optimisation here with the unrolling, producing this weird sub-optimal code is imho a bug, in that the goal of _optimisation_ is not attained.
>
> No I understand this is nothing to do with D, and I understand that this is unrolling.
>
> But notice that the conditional jumps all target the same location, and the sequence finishes off with an unconditional jump to that same location.
>

Not quite; if you look a little closer, some jump to other branches hidden in between.

> I feel this is in part just a quirk of unrolling, but that's not all, because the jumps don't make sense:
>  cmp #n / jxx L3
>  cmp #m / jxx L3
>  jmp L3
>
> is what we have, so it all basically does absolutely nothing, unless cmp 1 / cmp 2 / cmp 3 / cmp 4 is an incredibly bad way of testing ( x >= 1 && x <= 4 ); but with 30-odd tests it isn't very funny.
>

If you compile with -fdump-tree-optimized=stdout you will see that it's the middle-end that has lowered the code to a series of if jumps.

The backend consumer doesn't really have any chance for improving it.

> I know this is merely debug-only code, but I am wondering what else might happen if you are misguided enough to use the crazy -O3 with unrolled loops that have conditionals in them.
>
> My other complaint about GCC back-end code generation is that it (sometimes) doesn't go for jump-less conditional-move (cmov-style) operations when it can. For x86/x64, LDC sometimes generates jump-free code using conditional instructions where GCC does not.
>
> I can fix the problem with GDC by using a single & instead of a &&, which happens to be legal here. Sometimes in the past I have needed to take steps to make sure that I can do such an operator-substitution trick in order to get jump-free, far faster code: faster where the alternatives are extremely short (and side-effect-free) and branch-prediction failure is a certainty.
>

You could also try compiling with -O2.  I couldn't really see this in your given example, but honestly, if you want to optimize really aggressively you must be willing to coax the compiler in strange ways anyway.

> I don't know if there are ways in which the backend could try to ascertain whether the results of certain unrollings are really bad. In some cases they could be bad because the code is too long and causes problems with code-cache size, or won't fit into a loop buffer. A highly CPU-sub-variant-specific check on the generated code size would be needed, at least in all cases where there is still a loop left (as opposed to full unrolling of a known size), as every kind of AMD and Intel processor is different, as Agner Fog warns us. Here, though, I didn't even explicitly ask for unrolling, so you might harshly say that it is the compiler's job to work out whether it is actually an anti-optimisation, regardless of the possible reasons why the result may be bad news, never mind just based on the total generated code size not fitting into some per-CPU limit.
>

Well again, from past experience -O3 doesn't really care about code size or cache lines so much.  All optimization passes which lower the code this way do so during SSA transformations, irrespective of what is being targeted.

> My reason for reporting this was to ask how loop unrolling behaves in later versions of the back end, to ask about jump generation vs jump-free alternatives (LDC showing the correct way to do things), and to ask whether there are any suboptimality nasties lurking in code that does not merely come down to driving an assert.
>
> I would hope for an optimisation that handles the case of densely packed cmp #1 / cmp #2 / cmp #3 / cmp #4 etc, especially with no holes, in the case where _all the jumps go to the same target_, so that it can be reduced down to a two-test range check: a huge optimisation. I would also hope that a conditional jump followed by an unconditional jump to the same target could be spotted and handled too. (A general low-level peephole optimisation, then? i.e. jxx L1 / jmp L1 => jmp L1.)
>

But is what -O3 is doing really sub-optimal?  Have you benchmarked it?  I certainly didn't.

---
bool IsPowerOf2(T)(T x) pure nothrow @safe @nogc
out ( ret )
{
    assert( ret == true || ret == false );
    debug
    {
        bool b = false;
        foreach( s; 0.. 8 * x.sizeof )
            b = b || ( x == (1uL << s) );
        assert( ret == b );
    }
}
body
{
    return ( ( x & (x - 1) ) == 0 ) & (x > 0);
}

bool t1( int x )
{
    return IsPowerOf2( x );
}

void main(string[] args)
{
    int test()
    {
        return t1(cast(int)args.length);
    }

    import std.datetime.stopwatch;
    import std.stdio;

    auto r = benchmark!(test)(10000000);
    writeln(r);
}
---

[54 ms, 299 μs, and 5 hnsecs]: gdc -O3 -fdebug --param max-peel-branches=32
[168 ms, 71 μs, and 1 hnsec]:  gdc -O3 -fdebug --param max-peel-branches=24

Looks like, despite your complaint, it is 3x faster with what you call "strange nonsensical code generation". :-)

> Perhaps this is all being generated too late: the relevant optimisations have already happened, and the opportunities those optimisers provide have been and gone. Would it be possible for the backend to run a number of repeat optimisation passes of certain kinds after unrolled code is generated, or doesn't it work like that?
>

I had a quick look, and there is a control parameter for this: '--param max-peel-branches'.  The default value is 32; reducing it to 24 is enough to ensure that the complete unrolling never happens, but see the benchmarks for why changing this may not be a good idea.

> Anyway, this is not for you but for that particular backend, I suspect. I was wondering if someone could have a word and pass it on to the relevant people. I think it's worth making -O3 more generally usable, rather than crazy because of good ideas gone bad.

No; as language implementers, our job is only to guarantee that semantics never break.  Anything else is not relevant to us.

In this case, though, it looks like everything is fine and there's nothing to report.

March 26, 2018
On Friday, 23 March 2018 at 22:14:35 UTC, Iain Buclaw wrote:
> On Friday, 23 March 2018 at 00:39:13 UTC, Cecil Ward wrote:
>> On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
>>> https://bugzilla.gdcproject.org/show_bug.cgi?id=288
>>>
>>> --- Comment #1 from Iain Buclaw <ibuclaw@gdcproject.org> ---
>>>> See the long list of useless conditional jumps towards the end of the first function in the asm output (whose demangled name is test.t1(unit))
>>>
>>> Well, you'd never use -O3 if you care about speed anyway. :-)
>>>
>>> And they are not useless jumps; it's just the foreach loop unrolled in its entirety.  You can see that it's a feature of the gcc-7 series and later: regardless of the target, they all produce the same unrolled loop.
>>>
>>> https://explore.dgnu.org/g/vD3N4Y
>>>
>>> It might be a nice experiment to add pragma(ivdep) and pragma(unroll) support
>>> to give you more control.
>>>
>>> https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
>>>
>>> I wouldn't hold my breath though (this is not strictly a bug).
>>
>> Agreed, it is possibly not a bug: I don't see that the code is dysfunctional, though I haven't looked all the way through it. But since the backend is performing an optimisation here with the unrolling, producing this weird sub-optimal code is imho a bug, in that the goal of _optimisation_ is not attained.
>>
>> No I understand this is nothing to do with D, and I understand that this is unrolling.
>>
>> But notice that the conditional jumps all target the same location, and the sequence finishes off with an unconditional jump to that same location.
>>
>
> Not quite; if you look a little closer, some jump to other branches hidden in between.
>
>> I feel this is in part just a quirk of unrolling, but that's not all, because the jumps don't make sense:
>>  cmp #n / jxx L3
>>  cmp #m / jxx L3
>>  jmp L3
>>
>> is what we have, so it all basically does absolutely nothing, unless cmp 1 / cmp 2 / cmp 3 / cmp 4 is an incredibly bad way of testing ( x >= 1 && x <= 4 ); but with 30-odd tests it isn't very funny.
>>
>
> If you compile with -fdump-tree-optimized=stdout you will see that it's the middle-end that has lowered the code to a series of if jumps.
>
> The backend consumer doesn't really have any chance for improving it.
>
>> I know this is merely debug-only code, but I am wondering what else might happen if you are misguided enough to use the crazy -O3 with unrolled loops that have conditionals in them.
>>
>> My other complaint about GCC back-end code generation is that it (sometimes) doesn't go for jump-less conditional-move (cmov-style) operations when it can. For x86/x64, LDC sometimes generates jump-free code using conditional instructions where GCC does not.
>>
>> I can fix the problem with GDC by using a single & instead of a &&, which happens to be legal here. Sometimes in the past I have needed to take steps to make sure that I can do such an operator-substitution trick in order to get jump-free, far faster code: faster where the alternatives are extremely short (and side-effect-free) and branch-prediction failure is a certainty.
>>
>
> You could also try compiling with -O2.  I couldn't really see this in your given example, but honestly, if you want to optimize really aggressively you must be willing to coax the compiler in strange ways anyway.
>
>> I don't know if there are ways in which the backend could try to ascertain whether the results of certain unrollings are really bad. In some cases they could be bad because the code is too long and causes problems with code-cache size, or won't fit into a loop buffer. A highly CPU-sub-variant-specific check on the generated code size would be needed, at least in all cases where there is still a loop left (as opposed to full unrolling of a known size), as every kind of AMD and Intel processor is different, as Agner Fog warns us. Here, though, I didn't even explicitly ask for unrolling, so you might harshly say that it is the compiler's job to work out whether it is actually an anti-optimisation, regardless of the possible reasons why the result may be bad news, never mind just based on the total generated code size not fitting into some per-CPU limit.
>>
>
> Well again, from past experience -O3 doesn't really care about code size or cache lines so much.  All optimization passes which lower the code this way do so during SSA transformations, irrespective of what is being targeted.
>
>> My reason for reporting this was to ask how loop unrolling behaves in later versions of the back end, to ask about jump generation vs jump-free alternatives (LDC showing the correct way to do things), and to ask whether there are any suboptimality nasties lurking in code that does not merely come down to driving an assert.
>>
>> I would hope for an optimisation that handles the case of densely packed cmp #1 / cmp #2 / cmp #3 / cmp #4 etc, especially with no holes, in the case where _all the jumps go to the same target_, so that it can be reduced down to a two-test range check: a huge optimisation. I would also hope that a conditional jump followed by an unconditional jump to the same target could be spotted and handled too. (A general low-level peephole optimisation, then? i.e. jxx L1 / jmp L1 => jmp L1.)
>>
>
> But is what -O3 is doing really sub-optimal?  Have you benchmarked it?  I certainly didn't.
>
> ---
> bool IsPowerOf2(T)(T x) pure nothrow @safe @nogc
> out ( ret )
> {
>     assert( ret == true || ret == false );
>     debug
>     {
>         bool b = false;
>         foreach( s; 0.. 8 * x.sizeof )
>             b = b || ( x == (1uL << s) );
>         assert( ret == b );
>     }
> }
> body
> {
>     return ( ( x & (x - 1) ) == 0 ) & (x > 0);
> }
>
> bool t1( int x )
> {
>     return IsPowerOf2( x );
> }
>
> void main(string[] args)
> {
>     int test()
>     {
>         return t1(cast(int)args.length);
>     }
>
>     import std.datetime.stopwatch;
>     import std.stdio;
>
>     auto r = benchmark!(test)(10000000);
>     writeln(r);
> }
> ---
>
> [54 ms, 299 μs, and 5 hnsecs]: gdc -O3 -fdebug --param max-peel-branches=32
> [168 ms, 71 μs, and 1 hnsec]:  gdc -O3 -fdebug --param max-peel-branches=24
>
> Looks like, despite your complaint, it is 3x faster with what you call "strange nonsensical code generation". :-)
>
>> Perhaps this is all being generated too late: the relevant optimisations have already happened, and the opportunities those optimisers provide have been and gone. Would it be possible for the backend to run a number of repeat optimisation passes of certain kinds after unrolled code is generated, or doesn't it work like that?
>>
>
> I had a quick look, and there is a control parameter for this: '--param max-peel-branches'.  The default value is 32; reducing it to 24 is enough to ensure that the complete unrolling never happens, but see the benchmarks for why changing this may not be a good idea.
>
>> Anyway, this is not for you but for that particular backend, I suspect. I was wondering if someone could have a word and pass it on to the relevant people. I think it's worth making -O3 more generally usable, rather than crazy because of good ideas gone bad.
>
> No; as language implementers, our job is only to guarantee that semantics never break.  Anything else is not relevant to us.
>
> In this case, though, it looks like everything is fine and there's nothing to report.

Iain, you and I are fully in agreement, but I was hoping some feedback could be given to the relevant backend guys. Those runs of multiple cmps could in part be reduced down to range checks, and a conditional jump followed by an unconditional jump to the same target is silly machine-code generation.

Everything is indeed fine with GDC per se, as you say, and in no way was this meant to be a whinge about GDC. It's a tool I have a lot of admiration for, and we are lucky to have _two_ high-quality code generators for x64 for D source, in GDC and LDC. The backend implementors can maybe even keep an eye on one another's tricks.

When I used the terms optimal / suboptimal I was thinking like an asm programmer, at the level of replacing individual groups of instructions and removing instructions that either (a) literally have no useful reason to be there or (b) are very far from what any sane human asm programmer would write.

This was not intended as a comment about optimisation of an actual _program_; after all, the snippet of D code in question is merely a debug-build-only assert, so its very function is to slow the program down :-) because it is just sanity-checking a return value. So I certainly wouldn't have benchmarked it, as it's debug-only code and in the release build it's entirely gone.

I understand your helpful point about the relevant transformations already having been triggered at a higher level, so that they cannot be undone later. It must be, then, that there is no later last-chance, ultra-low-level x64 peephole optimiser that takes care of whatever happens to come out in the form cmp #n / jcc label1, cmp #n+1 / jcc label1, where the immediate values are dense; in that case the dozens of tests could be replaced by two. And the jcc label1 / jmp label1 pair is of course always daft when the two jump targets are identical, so that would be another opportunity for a peephole optimiser. I don't know if any backends do such a thing under some circumstances anyway.

I can't remember whether I spotted that some jump targets were different; good point you made, and I didn't mean to mislead you. I meant that there were _runs_: adjacent groups of identical jump targets with dense check values, and then the finish to it.

I asked about this because the GCC etc. compilers normally do not produce x64 code with the _latter_ idiom, in my limited experience. Perhaps this is only so because the middle ends protect all the backends, and the backends do not need to repeatedly reinvent the wheel by implementing those kinds of machine-instruction group checks themselves internally. I once wrote some transformational optimiser code analogous to this myself, not in a normal programming-language compiler: a very complex thing to minimise XML, with a dozen layers of transformations. I found that deciding the order of the optimiser transformations was a pain, so I ended up doing the same transformation in more than one layer, just in case an intermediate optimisation produced output containing a kind of bad idiom that had been caught and stripped out by a higher-level optimiser but was too late to catch at the lower layer, so I caught it below instead with a re-run.

One nuisance with the GCC family's code generation is that it is more willing to use conditional jumps than LLVM is, from my brief experience of comparing code quality between GDC and LDC. I wish there were a way of giving the compiler stack a big hint to tell it to avoid branches like the absolute plague wherever possible. The process of deciding which way to go must be very difficult and complex. Somehow LDC seems to get lucky.

There is also the question of the difference between a & and a &&. Changing the source code from a && to a & (which happens to be legitimate in this particular case) makes the jumping go away. There is something in the middle of these compilers, which I don't understand, that has to decide whether to do things one way or the other, within the limits of the per-language semantics guaranteeing that && and || (short-circuit evaluation) do not execute some of the intermediate tests. Of course, much of the time analysis could show that there is no issue at all, and that evaluating everything across all the &&s and ||s does no harm in some particular special case (because the tests are safe), despite what the language guarantees. These compilers are certainly good enough to exploit conditional moves and sequences of arithmetic operations to do ?:-type operations jump-free, and I may have seen generated code in some places that is jump-free and uses AND and OR instructions in conjunction with e.g. x86 SETcc instructions, doing a truly excellent job of staying away from conditional jumps that would overpower branch predictors and dominate execution time in certain small tight loops.

Apologies for bothering you, anyway. It's just some feedback aimed at other people, but I wouldn't have any idea where to take it, which is my own ignorant fault.

Many thanks for all you do.