Thread overview
How to prevent optimizer from reordering stuff?
Mar 14, 2015
Dan Olson
Mar 14, 2015
David Nadlinger
Mar 14, 2015
Dan Olson
Mar 15, 2015
David Nadlinger
Mar 16, 2015
Dan Olson
Mar 16, 2015
Dan Olson
Mar 16, 2015
Dan Olson
March 14, 2015
While tracking down std.math problems for ARM, I find that the optimizer will reorder instructions so that the FPSCR flags are read before the divide operation.

Is there a way to force instruction ordering here?  I tried llvm_memory_fence, but it doesn't do the job.

real zero = 0.0;

void foo()
{
    import std.math, std.c.stdio, ldc.llvmasm;

    real x = 1.0 / zero;

    auto f = __asm!uint("vmrs $0, fpscr", "=r");
    IeeeFlags flags = ieeeFlags();
    printf("%f, %u %d\n", x, f, flags.divByZero);
}

Compiled with -O -mtriple=thumbv7-apple-ios, you can see that vdiv comes after both my inline asm and the call to std.math ieeeFlags().

	vldr	d8, [r0]
	@ InlineAsm Start
	vmrs	r4, fpscr
	@ InlineAsm End
	mov	r0, r5
	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags
	vmov.f64	d16, #1.000000e+00
	mov	r0, r5
	vdiv.f64	d8, d16, d8

What to do?
--
Dan
March 14, 2015
On Saturday, 14 March 2015 at 18:42:45 UTC, Dan Olson wrote:
> While tracking down std.math problems for ARM, I find that the optimizer
> will reorder instructions so that the FPSCR flags are read before the
> divide operation.

IIRC FP flag/mode support is a tricky topic in LLVM in general, but this specific problem seems weird. What are the attributes for __D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags in the IR? The optimizer should never move code across arbitrary function calls…

David
March 14, 2015
"David Nadlinger" <code@klickverbot.at> writes:

> On Saturday, 14 March 2015 at 18:42:45 UTC, Dan Olson wrote:
>> While tracking down std.math problems for ARM, I find that the optimizer will reorder instructions so that the FPSCR flags are read before the divide operation.
>
> IIRC FP flag/mode support is a tricky topic in LLVM in general, but this specific problem seems weird. What are the attributes for __D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags in the IR? The optimizer should never move code across arbitrary function calls…
>
> David

Hi David.

I don't see any attributes for that function.  I will just paste some of the -output-ll results since nothing sticks out to me.

declare fastcc void @_D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags(%std.math.IeeeFlags* noalias sret)

define fastcc void @_D10unittester3fooFZv() {
  %flags = alloca %std.math.IeeeFlags, align 4
  %1 = load double* @_D10unittester4zeroe, align 8
  %2 = fdiv double 1.000000e+00, %1
  %3 = tail call i32 asm sideeffect "vmrs $0, fpscr", "=r"() #0
  call fastcc void @_D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags(%std.math.IeeeFlags* noalias sret %flags)
  %tmp = call fastcc i1 @_D3std4math9IeeeFlags9divByZeroMFNdZb(%std.math.IeeeFlags* %flags)
  %4 = zext i1 %tmp to i32
  %tmp1 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([11 x i8]* @.str12, i32 0, i32 0), double %2, i32 %3, i32 %4)
  ret void
}

The only guess I have right now for this is from:

http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042e/IHI0042E_aapcs.pdf

  The FPSCR is the only status register that may be accessed by
  conforming code. It is a global register with the following
  properties:

  - The condition code bits (28-31), the cumulative saturation (QC) bit
    (27) and the cumulative exception-status bits (0-4) are not
    preserved across a public interface.

  (snip)

Maybe that means the compiler can say the FPSCR state from my vdiv.f64 is undefined across function call boundaries, so ordering should not matter?
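
To make the raw FPSCR value printed by the test easier to read, here is a minimal C sketch that picks out the cumulative exception-status bits (0-4) named in the AAPCS excerpt. The individual bit names (IOC, DZC, OFC, UFC, IXC) come from the ARM architecture manual rather than from this thread, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Cumulative exception-status bits of the ARM FPSCR (bits 0-4).
   Names follow the ARM architecture manual; they are not from the thread. */
#define FPSCR_IOC (1u << 0)  /* invalid operation */
#define FPSCR_DZC (1u << 1)  /* division by zero */
#define FPSCR_OFC (1u << 2)  /* overflow */
#define FPSCR_UFC (1u << 3)  /* underflow */
#define FPSCR_IXC (1u << 4)  /* inexact */

/* Hypothetical helper: nonzero if the divide-by-zero flag is set. */
int fpscr_div_by_zero(uint32_t fpscr)
{
    return (fpscr & FPSCR_DZC) != 0;
}

/* Hypothetical helper: keep only the cumulative exception bits 0-4. */
uint32_t fpscr_exception_bits(uint32_t fpscr)
{
    return fpscr & (FPSCR_IOC | FPSCR_DZC | FPSCR_OFC | FPSCR_UFC | FPSCR_IXC);
}
```

With this, the value read by "vmrs $0, fpscr" can be passed to fpscr_div_by_zero() instead of being printed as a raw integer.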
March 15, 2015
Hi Dan,

On 03/14/2015 09:20 PM, Dan Olson via digitalmars-d-ldc wrote:
> I don't see any attributes for that function.  I will just paste
> some of the -output-ll results since nothing sticks out to me.

Yeah, seems like everything is in order (no pun intended) after the main IR-level optimizer. This suggests that the reordering happens on the target-specific optimization or instruction selection level. I suppose you could try disabling codegen optimizations if you wanted to investigate this further.

> Maybe that means the compiler can say the FPSCR state from my vdiv.f64
> is undefined across function call boundaries, so ordering should not
> matter?

This seems like a reasonable guess. Did you try asking on the LLVM IRC channel or mailing list? Depending on the outcome (i.e. if the ABI is really to be interpreted that way), we should probably discuss its implications for D's FP handling strategy on the main D mailing lists.

Best,
David
March 16, 2015
David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc@puremagic.com> writes:

> Hi Dan,
>
> On 03/14/2015 09:20 PM, Dan Olson via digitalmars-d-ldc wrote:
>> I don't see any attributes for that function.  I will just paste some of the -output-ll results since nothing sticks out to me.
>
> Yeah, seems like everything is in order (no pun intended) after the main IR-level optimizer. This suggests that the reordering happens on the target-specific optimization or instruction selection level. I suppose you could try disabling codegen optimizations if you wanted to investigate this further.

It is a good puzzle.  For what it is worth, clang does the same thing with similar code.

>> Maybe that means the compiler can say the FPSCR state from my vdiv.f64 is undefined across function call boundaries, so ordering should not matter?
>
> This seems like a reasonable guess. Did you try asking on the LLVM IRC channel or mailing list? Depending on the outcome (i.e. if the ABI is really to be interpreted that way), we should probably discuss its implications for D's FP handling strategy on the main D mailing lists.

I have not asked elsewhere yet.  I'm going to explore the problem a bit more, then ask.
March 16, 2015
Ok, I have stumbled into an old problem it seems.

C99 introduced "#pragma STDC FENV_ACCESS ON" to prevent the optimizer from reordering instructions that affect the floating-point environment.  See note [2] here:

http://en.wikipedia.org/wiki/C99#Example

And clang (LLVM) does not support this pragma:

https://llvm.org/bugs/show_bug.cgi?id=10409

The workaround in C is to use volatile variables to force ordering.

And one more reference:

http://wiki.musl-libc.org/wiki/Mathematical_Library#Fenv_and_error_handling
March 16, 2015
Dan Olson <zans.is.for.cans@yahoo.com> writes:

> While tracking down std.math problems for ARM, I find that the optimizer will reorder instructions so that the FPSCR flags are read before the divide operation.
>
> Is there a way to force instruction ordering here?  I tried llvm_memory_fence, but it doesn't do the job.
>
> real zero = 0.0;
>
> void foo()
> {
>     import std.math, std.c.stdio, ldc.llvmasm;
>
>     real x = 1.0 / zero;
>
>     auto f = __asm!uint("vmrs $0, fpscr", "=r");
>     IeeeFlags flags = ieeeFlags();
>     printf("%f, %u %d\n", x, f, flags.divByZero);
> }
>
> Compiled with -O -mtriple=thumbv7-apple-ios, you can see that vdiv is after both my inline asm and std.math ieeeFlags().
>
> 	vldr	d8, [r0]
> 	@ InlineAsm Start
> 	vmrs	r4, fpscr
> 	@ InlineAsm End
> 	mov	r0, r5
> 	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags
> 	vmov.f64	d16, #1.000000e+00
> 	mov	r0, r5
> 	vdiv.f64	d8, d16, d8
>
> What to do?

I have a solution, or at least a start.  Passing the result of the floating-point operation as an argument to an empty inline asm gives the correct ordering, and it avoids the unnecessary stores of the C volatile trick (the FORCE_EVAL macro).

For my use, I wrapped the inline asm in a function "use()" that is specific to ARM because of the 'w' constraint.  I am thinking it could be renamed FORCE_EVAL to match what is in Linux libm, and then generalized for other D CPU targets.

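void use(T)(T x) @nogc nothrow
{
    import std.traits, ldc.llvmasm;
    // Empty asm that takes x as an input operand: x must be computed
    // before this point, but no instruction is emitted for the asm itself.
    static if (isFloatingPoint!(T))
        __asm("", "w", x);   // 'w': ARM floating-point register
    else
        __asm("", "r", x);   // 'r': general-purpose register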
}


Compile as before (-O), but with use(x).

real zero = 0.0;

void foo()
{
    import std.math, std.c.stdio, ldc.llvmasm;

    real x = 1.0 / zero;
    use(x);

    // get float flags in an ARM-specific way
    auto f = __asm!uint("vmrs $0, fpscr", "=r");
    // get float flags the D way
    IeeeFlags flags = ieeeFlags();
    printf("%f, %u %d\n", x, f, flags.divByZero);
}

Now vdiv.f64 happens before all the flag fetching.

	vmov.f64	d16, #1.000000e+00
	add	r5, sp, #4
	vldr	d17, [r0]
	mov	r0, r5
	vdiv.f64	d8, d16, d17          <------ yeah!
	@ InlineAsm Start
	@ InlineAsm End
	@ InlineAsm Start
	vmrs	r4, fpscr
	@ InlineAsm End
	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags

--
Dan