Thread overview
Android/ARM codegen
Jul 16, 2015
Joakim
Jul 16, 2015
Dan Olson
Jul 16, 2015
Joakim
Jul 17, 2015
Dan Olson
Jul 17, 2015
Dan Olson
Jul 17, 2015
Joakim
Jul 17, 2015
Dan Olson
Jul 17, 2015
Joakim
Aug 03, 2015
Joakim
Aug 04, 2015
Dan Olson
July 16, 2015
Alright, I've been stepping through some failing tests: one that seems to be bad codegen is that many calls to std.algorithm.iteration.map seem to fail, for example, when running the tests for std.zip.  One of the std.zip tests calls std.random.uniform, which then gets its second parameter stomped by map from rndGen().  I've tracked it down to this function in the llvm IR:

define weak_odr void @_D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult(%"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"* noalias nocapture sret %.sret_arg, i8* %.nest_arg, %"std.range.Repeat!int.Repeat"* byval nocapture readonly %r_arg) #1
{

  %.structliteral = alloca %"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult", align 4 ; [#uses = 3 type = %"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"*]

  %1 = getelementptr inbounds %"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"* %.structliteral, i32 0, i32 0, i32 0 ; [#uses = 1 type = i32*]

  store i32 0, i32* %1, align 4

  %2 = getelementptr %"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"* %.structliteral, i32 0, i32 1 ; [#uses = 1 type = i8**]

  store i8* %.nest_arg, i8** %2, align 4

  %tmp = call %"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"* @_D3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult6__ctorMFNaNbNcNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult(%"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"* returned %.structliteral, %"std.range.Repeat!int.Repeat"* byval %r_arg) ; [#uses = 1 type = %"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"*]

  %3 = bitcast %"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"* %tmp to i64* ; [#uses = 1 type = i64*]

  %4 = bitcast %"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"* %.sret_arg to i64* ; [#uses = 1 type = i64*]

  %5 = load i64* %3, align 1                      ; [#uses = 1 type = i64]

  store i64 %5, i64* %4, align 4

  ret void
}

which gets translated to the following ARM assembly:

_D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult:
        .fnstart
.Leh_func_begin94:
        .pad    #8
        sub     sp, sp, #8
        .save   {r4, lr}
        push    {r4, lr}
        .pad    #8
        sub     sp, sp, #8
        mov     r4, r0
        mov     r0, #0
        str     r2, [sp, #20]
        stmib   sp, {r0, r1}
        add     r0, sp, #4
        ldr     r1, [sp, #20]
        bl      _D3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult6__ctorMFNaNbNcNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult(PLT)
        ldmib   sp, {r0, r1}
        stm     r4, {r0, r1}
        add     sp, sp, #8
        pop     {r4, lr}
        add     sp, sp, #8
        bx      lr

The problem appears to be that even though r4 is saved at the beginning of the function, it is overwritten by r1 in the stmib instruction afterwards.  Obviously there's no point in pushing r4 and popping it at the end of the function, if you've lost it by overwriting in between.  Specifically, "sub sp, sp, #8" advances the stack pointer by 8, then stmib increments it 4 _before_ storing r0 then r1, overwriting the contents of r4 saved at the beginning of the function.  Other calls to map also seem to fail in other instances of the same template function, but interestingly in different ways, ie no uses of stmib there.  I haven't tracked down exactly why those other instances fail.

I'm not sure how to reduce this further and if I need to file a bug for llvm: any pointers?
July 16, 2015
"Joakim" <dlang@joakim.fea.st> writes:
>
> The problem appears to be that even though r4 is saved at the beginning of the function, it is overwritten by r1 in the stmib instruction afterwards.  Obviously there's no point in pushing r4 and popping it at the end of the function, if you've lost it by overwriting in between.  Specifically, "sub sp, sp, #8" advances the stack pointer by 8, then stmib increments it 4 _before_ storing r0 then r1, overwriting the contents of r4 saved at the beginning of the function.  Other calls to map also seem to fail in other instances of the same template function, but interestingly in different ways, ie no uses of stmib there.  I haven't tracked down exactly why those other instances fail.

That doesn't look right.

What is your triple and other options (optimization), llvm version your patched src is based on, and ldc branch (merge-2.067 branch I think)? I'd like to compare the asm snippet with what I get get for thumbv7-apple-ios, because it should be similar, but it passes std.zip.

It could be I just haven't stumbled into the combination you are using.

-- 
Dan

July 16, 2015
On Thursday, 16 July 2015 at 16:40:31 UTC, Dan Olson wrote:
> "Joakim" <dlang@joakim.fea.st> writes:
>>
>> The problem appears to be that even though r4 is saved at the beginning of the function, it is overwritten by r1 in the stmib instruction afterwards.  Obviously there's no point in pushing r4 and popping it at the end of the function, if you've lost it by overwriting in between.  Specifically, "sub sp, sp, #8" advances the stack pointer by 8, then stmib increments it 4 _before_ storing r0 then r1, overwriting the contents of r4 saved at the beginning of the function.  Other calls to map also seem to fail in other instances of the same template function, but interestingly in different ways, ie no uses of stmib there.  I haven't tracked down exactly why those other instances fail.
>
> That doesn't look right.
>
> What is your triple and other options (optimization), llvm version your patched src is based on, and ldc branch (merge-2.067 branch I think)? I'd like to compare the asm snippet with what I get get for thumbv7-apple-ios, because it should be similar, but it passes std.zip.
>
> It could be I just haven't stumbled into the combination you are using.

llvm triple and other flags used:

--output-o -w -d -mtriple=armv7-none-linux-androideabi -relocation-model=pic -O3 -release -unittest

ldc was compiled against a locally-compiled llvm 3.6 from the Android NDK repo, which has some modifications:

https://android.googlesource.com/toolchain/llvm/+log/release_36

Yes, I'm using the merge-2.067 branch of ldc as of commit 122ea372d from a couple weeks ago, with inlining turned off, as noted in the earlier EH thread.  I would be curious to see what llvm IR and ARM assembly is generated for you on iOS and what other flags you're using.  The actual function shown is from the std.random module.
July 17, 2015
On Thursday, 16 July 2015 at 17:07:33 UTC, Joakim wrote:
> On Thursday, 16 July 2015 at 16:40:31 UTC, Dan Olson wrote:
>> [...]
>
> llvm triple and other flags used:
>
> --output-o -w -d -mtriple=armv7-none-linux-androideabi -relocation-model=pic -O3 -release -unittest
>
> ldc was compiled against a locally-compiled llvm 3.6 from the Android NDK repo, which has some modifications:
>
> https://android.googlesource.com/toolchain/llvm/+log/release_36
>
> Yes, I'm using the merge-2.067 branch of ldc as of commit 122ea372d from a couple weeks ago, with inlining turned off, as noted in the earlier EH thread.  I would be curious to see what llvm IR and ARM assembly is generated for you on iOS and what other flags you're using.  The actual function shown is from the std.random module.

And the nice thing about LLVM / LDC is that I can specify your triple and other options with one compiler.  I love that about LLVM: all cross compilers in one.
July 17, 2015
"Joakim" <dlang@joakim.fea.st> writes:

> Alright, I've been stepping through some failing tests: one that seems to be bad codegen is that many calls to std.algorithm.iteration.map seem to fail, for example, when running the tests for std.zip.  One of the std.zip tests calls std.random.uniform, which then gets its second parameter stomped by map from rndGen().

Joakim, I have a hunch.  Can you try changing your abi-android.cpp so passByVal() returns false always?  That is the only difference in our generated IR and in my experience byval causes incorrect code on ARM.

As an experiment, I changed my passByVal() back to return true for Tstruct (the LDC default), and then my IR is identical to yours and my ARM code, though slightly different, also is clobbering a saved reg on the stack.

Anyway, I pasted my GOOD assembly below yours for comparison.

> which gets translated to the following ARM assembly:
>
> _D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult:
>         .fnstart
> .Leh_func_begin94:
>         .pad    #8
>         sub     sp, sp, #8
>         .save   {r4, lr}
>         push    {r4, lr}
>         .pad    #8
>         sub     sp, sp, #8
>         mov     r4, r0
>         mov     r0, #0
>         str     r2, [sp, #20]
>         stmib   sp, {r0, r1}
>         add     r0, sp, #4
>         ldr     r1, [sp, #20]
>         bl
> _D3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult6__ctorMFNaNbNcNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult(PLT)
>         ldmib   sp, {r0, r1}
>         stm     r4, {r0, r1}
>         add     sp, sp, #8
>         pop     {r4, lr}
>         add     sp, sp, #8
>         bx      lr

Better assembly produced with -output-s -w -d -mtriple=armv7-apple-ios -relocation-model=pic -O3 -release -unittest -disable-inlining (needed to match up with your change).  It is similar but uses stm (e.g stmia) instead of stmib.

__D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult:
	push	{r4, r7, lr}
	add	r7, sp, #4
	sub	sp, sp, #8
	mov	r4, r0
	mov	r0, #0
	stm	sp, {r0, r1}
	mov	r0, sp
	mov	r1, r2
	bl	__D3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult6__ctorMFNaNbNcNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult
	ldm	sp, {r0, r1}
	strd	r0, r1, [r4]
	sub	sp, r7, #4
	pop	{r4, r7, pc}

I wanted to use your triple with OS of android, but std/random.d needs Android specific D code from elsewhere, so compile failed.

As aside, with inlining enabled, that function shrinks to 3 instructions:

__D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult:
	mov	r3, r1
	strd	r2, r3, [r0]
	bx	lr

Hope it helps
July 17, 2015
On Friday, 17 July 2015 at 08:58:30 UTC, Dan Olson wrote:
> "Joakim" <dlang@joakim.fea.st> writes:
>
>> Alright, I've been stepping through some failing tests: one that seems to be bad codegen is that many calls to std.algorithm.iteration.map seem to fail, for example, when running the tests for std.zip.  One of the std.zip tests calls std.random.uniform, which then gets its second parameter stomped by map from rndGen().
>
> Joakim, I have a hunch.  Can you try changing your abi-android.cpp so passByVal() returns false always?  That is the only difference in our generated IR and in my experience byval causes incorrect code on ARM.
>
> As an experiment, I changed my passByVal() back to return true for Tstruct (the LDC default), and then my IR is identical to yours and my ARM code, though slightly different, also is clobbering a saved reg on the stack.

Yep, that fixed it, std.zip passes its tests now. :) I'll run the rest of the tests and report back, thanks for your help.
July 17, 2015
"Joakim" <dlang@joakim.fea.st> writes:
> Yep, that fixed it, std.zip passes its tests now. :) I'll run the rest of the tests and report back, thanks for your help.

:-)
July 17, 2015
On Friday, 17 July 2015 at 16:33:07 UTC, Dan Olson wrote:
> "Joakim" <dlang@joakim.fea.st> writes:
>> Yep, that fixed it, std.zip passes its tests now. :) I'll run the rest of the tests and report back, thanks for your help.
>
> :-)

Around 30 more modules from std.phobos now pass their tests with that change, most of which used to segfault somewhere in a test before.  core.time from druntime also passes all its tests now.  That one change made a big difference, thanks for pointing it out.

Now to fix the rest.  Many are related to the real type, which I haven't patched for 64-bit yet.
August 03, 2015
On Friday, 17 July 2015 at 17:57:01 UTC, Joakim wrote:
> On Friday, 17 July 2015 at 16:33:07 UTC, Dan Olson wrote:
>> "Joakim" <dlang@joakim.fea.st> writes:
>>> Yep, that fixed it, std.zip passes its tests now. :) I'll run the rest of the tests and report back, thanks for your help.
>>
>> :-)
>
> Around 30 more modules from std.phobos now pass their tests with that change, most of which used to segfault somewhere in a test before.  core.time from druntime also passes all its tests now.  That one change made a big difference, thanks for pointing it out.
>
> Now to fix the rest.  Many are related to the real type, which I haven't patched for 64-bit yet.

Dan, as I said in the main forum, most of the druntime/phobos modules' tests pass on Android/ARM now.  However, I had to turn off optimizations for a handful of modules, have you had to do the same?  As noted before, one optimization pass was screwing up ldc.eh.  I also had to turn off all optimizations, ie -O0, for std.random and std.stream to get their unit tests to pass.  For one phobos module, std.regex, turning off all optimizations for druntime's core.memory got the regex tests to pass.

Other than those four modules, everything is compiled with -O3 and seems to work, except for the two modules that still segfault, std.net.isemail and std.regex.internal.tests, where compiling those modules with -O0 doesn't make a difference.  I haven't spent any time tracking down if other optimized modules might be causing those two to segfault, as was the case with std.regex and core.memory, or exactly which llvm optimizations are causing problems with core.memory, std.random, and std.stream.

Are you seeing similar results with your 2.067 branch of ldc with iOS?  Since ARM codegen should be similar for the two, I wonder if I'm the only one seeing this.
August 04, 2015
"Joakim" <dlang@joakim.fea.st> writes:

> On Friday, 17 July 2015 at 17:57:01 UTC, Joakim wrote:
>> On Friday, 17 July 2015 at 16:33:07 UTC, Dan Olson wrote:
>>> "Joakim" <dlang@joakim.fea.st> writes:
>
> Dan, as I said in the main forum, most of the druntime/phobos modules' tests pass on Android/ARM now.  However, I had to turn off optimizations for a handful of modules, have you had to do the same? As noted before, one optimization pass was screwing up ldc.eh.  I also had to turn off all optimizations, ie -O0, for std.random and std.stream to get their unit tests to pass.  For one phobos module, std.regex, turning off all optimizations for druntime's core.memory got the regex tests to pass.
>
> Other than those four modules, everything is compiled with -O3 and seems to work, except for the two modules that still segfault, std.net.isemail and std.regex.internal.tests, where compiling those modules with -O0 doesn't make a difference.  I haven't spent any time tracking down if other optimized modules might be causing those two to segfault, as was the case with std.regex and core.memory, or exactly which llvm optimizations are causing problems with core.memory, std.random, and std.stream.
>
> Are you seeing similar results with your 2.067 branch of ldc with iOS? Since ARM codegen should be similar for the two, I wonder if I'm the only one seeing this.

Hi Joakim - I have -O3 optimization on for all modules in the release build and I think I have tested with most of the other -O levels.  I did run into an alignment error with neon instructions in std.random unittest with LLVM 3.5.1 and 0.15.1 (2.066) and eventually disabled neon (-mattr=-neon) during optimization as a workaround.  I have not tried reenabling neon for merge-2.067 and LLVM 3.6.

The problem was a vst1.64 instruction requesting 128-bit alignment when the data was only 64-bit aligned:

0x52197a:  vst1.64 {d16, d17}, [r5:128]  // r5 addr not properly aligned

If it can happen in std.random it could happen elsewhere.

You might try -mattr=-neon and see what happens.

One other difference to think about is that Android is AAPCS and iOS is a variant of the older APCS.  LLVM has some different paths for these. I studied it some a few weeks ago as I made the extern(C) ABI compatible with clang.
-- 
Dan