Improving codegen for ARM Cortex-M
Jul 20, 2018
Mike Franklin
July 20, 2018
I've finally succeeded in getting a build of my STM32 ARM Cortex-M proof of concept in LDC and GDC, thanks to the recent changes in both compilers.  So, I now have a way to compare code generation between the two compilers.

The project is extremely simple; it just generates a bunch of random rectangles on its small LCD screen.  This is done by simply writing to memory in a frame buffer.

Unfortunately, GDC's code executes quite a bit slower than LDC's code.  The difference is quite noticeable: the status LED blinks much more slowly with GDC than with LDC.

The code to do this is below (I simplified it for this discussion, but tested to ensure reproduction of the symptoms.  I also did away with the random behavior to remove that variable).

a block of code in main.d
---
uint i = 0;
while(true)
{
    lcd.fillRect(x, y, width, height, color);
    if ((i % 1000) == 0)
    {
        statusLED.toggle();
    }

    i++;
}

in lcd.d
---
@noinline pragma(inline, false) void fillRect(int x, int y, uint width, uint height, ushort color)
{
    int y2 = y + height;
    for(int _y = y; _y <= y2; _y++)
    {
        ltdc.fillSpan(x, _y, width, color);
    }
}

from ltdc.d
-----------
void fillSpan(int x, int y, uint spanWidth, ushort color)
{
    int start = y * width + x;
    for(int i = 0; i < spanWidth; i++)
    {
        frameBuffer[start + i] = color;
    }
}

LDC disassembly
---------------
ldc2 -conf= -disable-simplify-libcalls -c -Os  -mtriple=thumb-none-eabi -float-abi=hard -mcpu=cortex-m4 -Isource/runtime -boundscheck=off

<_D5board3lcd8fillRectFiikktZv>:
80000b8:  e92d 43f0   stmdb  sp!, {r4, r5, r6, r7, r8, r9, lr}
80000bc:  eb03 0e01   add.w  lr, r3, r1
80000c0:  459e        cmp  lr, r3
80000c2:  bfb8        it  lt
80000c4:  e8bd 83f0   ldmialt.w  sp!, {r4, r5, r6, r7, r8, r9, pc}
80000c8:  f8dd c01c   ldr.w  ip, [sp, #28]
80000cc:  ebc3 1503   rsb  r5, r3, r3, lsl #4
80000d0:  f240 0800   movw  r8, #0
80000d4:  f002 0103   and.w  r1, r2, #3
80000d8:  f1a2 0901   sub.w  r9, r2, #1
80000dc:  f2c2 0800   movt  r8, #8192  ; 0x2000
80000e0:  1a54        subs  r4, r2, r1
80000e2:  eb0c 1505   add.w  r5, ip, r5, lsl #4
80000e6:  eb08 0545   add.w  r5, r8, r5, lsl #1
80000ea:  1d2f        adds  r7, r5, #4
80000ec:  b1f2        cbz  r2, 800012c <_D5board3lcd8fillRectFiikktZv+0x74>
80000ee:  2500        movs  r5, #0
80000f0:  f1b9 0f03   cmp.w  r9, #3
80000f4:  d30a        bcc.n  800010c <_D5board3lcd8fillRectFiikktZv+0x54>
80000f6:  463e        mov  r6, r7
80000f8:  3504        adds  r5, #4
80000fa:  f826 0c02   strh.w  r0, [r6, #-2]
80000fe:  f826 0c04   strh.w  r0, [r6, #-4]
8000102:  8030        strh  r0, [r6, #0]
8000104:  8070        strh  r0, [r6, #2]
8000106:  3608        adds  r6, #8
8000108:  42ac        cmp  r4, r5
800010a:  d1f5        bne.n  80000f8 <_D5board3lcd8fillRectFiikktZv+0x40>
800010c:  b171        cbz  r1, 800012c <_D5board3lcd8fillRectFiikktZv+0x74>
800010e:  ebc3 1603   rsb  r6, r3, r3, lsl #4
8000112:  2901        cmp  r1, #1
8000114:  eb0c 1606   add.w  r6, ip, r6, lsl #4
8000118:  4435        add  r5, r6
800011a:  f828 0015   strh.w  r0, [r8, r5, lsl #1]
800011e:  d005        beq.n  800012c <_D5board3lcd8fillRectFiikktZv+0x74>
8000120:  eb08 0545   add.w  r5, r8, r5, lsl #1
8000124:  2902        cmp  r1, #2
8000126:  8068        strh  r0, [r5, #2]
8000128:  bf18        it  ne
800012a:  80a8        strhne  r0, [r5, #4]
800012c:  3301        adds  r3, #1
800012e:  f507 77f0   add.w  r7, r7, #480  ; 0x1e0
8000132:  4573        cmp  r3, lr
8000134:  ddda        ble.n  80000ec <_D5board3lcd8fillRectFiikktZv+0x34>
8000136:  e8bd 83f0   ldmia.w  sp!, {r4, r5, r6, r7, r8, r9, pc}

GDC disassembly
---------------
arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 -mfloat-abi=hard -Isource/runtime -fno-bounds-check -ffunction-sections -fdata-sections -fno-weak

<_D5board3lcd8fillRectFiikktZv>:
800049c:  b470        push  {r4, r5, r6}
800049e:  440b        add  r3, r1
80004a0:  4299        cmp  r1, r3
80004a2:  f8bd 500c   ldrh.w  r5, [sp, #12]
80004a6:  dc15        bgt.n  80004d4 <_D5board3lcd8fillRectFiikktZv+0x38>
80004a8:  ebc1 1401   rsb  r4, r1, r1, lsl #4
80004ac:  eb00 1004   add.w  r0, r0, r4, lsl #4
80004b0:  4c09        ldr  r4, [pc, #36]  ; (80004d8 <_D5board3lcd8fillRectFiikktZv+0x3c>)
80004b2:  4410        add  r0, r2
80004b4:  ebc2 76c2   rsb  r6, r2, r2, lsl #31
80004b8:  eb04 0440   add.w  r4, r4, r0, lsl #1
80004bc:  0076        lsls  r6, r6, #1
80004be:  b122        cbz  r2, 80004ca <_D5board3lcd8fillRectFiikktZv+0x2e>
80004c0:  19a0        adds  r0, r4, r6
80004c2:  f820 5b02   strh.w  r5, [r0], #2
80004c6:  42a0        cmp  r0, r4
80004c8:  d1fb        bne.n  80004c2 <_D5board3lcd8fillRectFiikktZv+0x26>
80004ca:  3101        adds  r1, #1
80004cc:  428b        cmp  r3, r1
80004ce:  f504 74f0   add.w  r4, r4, #480  ; 0x1e0
80004d2:  daf4        bge.n  80004be <_D5board3lcd8fillRectFiikktZv+0x22>
80004d4:  bc70        pop  {r4, r5, r6}
80004d6:  4770        bx  lr
80004d8:  20000000   .word  0x20000000

For both LDC and GDC, `fillSpan` gets inlined into `fillRect`.  I had to disable inlining for `fillRect` to make it easier to compare the disassembly; otherwise all I get is a huge `main`.
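
For reference, here is a minimal sketch of controlling inlining per function in D.  Only the two pragmas are standard D; the function names are hypothetical and not from the project.  The `@noinline` attribute used in `lcd.d` is left out because it is not part of the core language, so this sketch should build with any D compiler.

```d
// Hypothetical helper names; only the pragmas are standard D.

// Request that this stays a real call, so it shows up as its own
// symbol in the disassembly instead of being folded into the caller.
pragma(inline, false)
int scaleNoInline(int v)
{
    return v * 15;
}

// Request that this is inlined into its callers.
pragma(inline, true)
int scaleInline(int v)
{
    return v * 15;
}

unittest
{
    // Inlining changes code generation, not results.
    assert(scaleNoInline(4) == scaleInline(4));
}
```

Whether a request is honored can still depend on the compiler and optimization level, so checking the disassembly, as done here, remains the reliable test.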

I used `-O2` for GDC because `-Os` was even slower and didn't inline `fillSpan`.

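Although GDC's code is shorter, LDC's code is faster.  My first guess was the `ldm` and `stm` instructions in the LDC disassembly (load multiple and store multiple), but those are not SIMD instructions; here they only save and restore registers on entry and exit.  The more likely difference is in the inner loop: LDC unrolled it to four `strh` stores per iteration, while GDC emits one `strh` per iteration.  I'm not sure, though.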

I've tried a number of different optimization permutations (too many to list here), but they didn't seem to make any difference.

I ask for any insight you might have, should you wish to give this your attention.  Regardless, I'll keep investigating.

Thanks,
Mike


July 20, 2018
Actually the assembly output from objdump isn't quite accurate.  Here's the generated assembly from the compiler.

LDC
---
ldc2 -conf= -disable-simplify-libcalls -c -Os  -mtriple=thumb-none-eabi -float-abi=hard -mcpu=cortex-m4 -Isource/runtime -boundscheck=off

_D5board3lcd8fillRectFiikktZv:
	.fnstart
	.save	{r4, r5, r6, r7, r8, r9, lr}
	push.w	{r4, r5, r6, r7, r8, r9, lr}
	add.w	lr, r3, r1
	cmp	lr, r3
	it	lt
	poplt.w	{r4, r5, r6, r7, r8, r9, pc}
	ldr.w	r12, [sp, #28]
	rsb	r5, r3, r3, lsl #4
	movw	r8, :lower16:_D5board4ltdc11frameBufferG76800t
	and	r1, r2, #3
	sub.w	r9, r2, #1
	movt	r8, :upper16:_D5board4ltdc11frameBufferG76800t
	subs	r4, r2, r1
	add.w	r5, r12, r5, lsl #4
	add.w	r5, r8, r5, lsl #1
	adds	r7, r5, #4
.LBB1_1:
	cbz	r2, .LBB1_8
	movs	r5, #0
	cmp.w	r9, #3
	blo	.LBB1_5
	mov	r6, r7
.LBB1_4:
	adds	r5, #4
	strh	r0, [r6, #-2]
	strh	r0, [r6, #-4]
	strh	r0, [r6]
	strh	r0, [r6, #2]
	adds	r6, #8
	cmp	r4, r5
	bne	.LBB1_4
.LBB1_5:
	cbz	r1, .LBB1_8
	rsb	r6, r3, r3, lsl #4
	cmp	r1, #1
	add.w	r6, r12, r6, lsl #4
	add	r5, r6
	strh.w	r0, [r8, r5, lsl #1]
	beq	.LBB1_8
	add.w	r5, r8, r5, lsl #1
	cmp	r1, #2
	strh	r0, [r5, #2]
	it	ne
	strhne	r0, [r5, #4]
.LBB1_8:
	adds	r3, #1
	add.w	r7, r7, #480
	cmp	r3, lr
	ble	.LBB1_1
	pop.w	{r4, r5, r6, r7, r8, r9, pc}

GDC
---
arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 -mfloat-abi=hard -Isource/runtime -fno-bounds-check -ffunction-sections -fdata-sections -fno-weak

_D5board3lcd8fillRectFiikktZv:
	.fnstart
.LFB4:
	@ args = 4, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	push	{r4, r5, r6}
	add	r3, r3, r1
	cmp	r1, r3
	ldrh	r5, [sp, #12]
	bgt	.L47
	rsb	r4, r1, r1, lsl #4
	add	r0, r0, r4, lsl #4
	ldr	r4, .L58
	add	r0, r0, r2
	rsb	r6, r2, r2, lsl #31
	add	r4, r4, r0, lsl #1
	lsls	r6, r6, #1
.L51:
	cbz	r2, .L49
	adds	r0, r4, r6
.L50:
	strh	r5, [r0], #2	@ movhi
	cmp	r0, r4
	bne	.L50
.L49:
	adds	r1, r1, #1
	cmp	r3, r1
	add	r4, r4, #480
	bge	.L51
.L47:
	pop	{r4, r5, r6}
	bx	lr

Mike

July 20, 2018
On Friday, 20 July 2018 at 12:49:59 UTC, Mike Franklin wrote:

> GDC
> ---
> arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 -mfloat-abi=hard -Isource/runtime -fno-bounds-check -ffunction-sections -fdata-sections -fno-weak
>
> _D5board3lcd8fillRectFiikktZv:
> 	.fnstart
> .LFB4:
> 	@ args = 4, pretend = 0, frame = 0
> 	@ frame_needed = 0, uses_anonymous_args = 0
> 	@ link register save eliminated.
> 	push	{r4, r5, r6}
> 	add	r3, r3, r1
> 	cmp	r1, r3
> 	ldrh	r5, [sp, #12]
> 	bgt	.L47
> 	rsb	r4, r1, r1, lsl #4
> 	add	r0, r0, r4, lsl #4
> 	ldr	r4, .L58
> 	add	r0, r0, r2
> 	rsb	r6, r2, r2, lsl #31
> 	add	r4, r4, r0, lsl #1
> 	lsls	r6, r6, #1
> .L51:
> 	cbz	r2, .L49
> 	adds	r0, r4, r6
> .L50:
> 	strh	r5, [r0], #2	@ movhi
> 	cmp	r0, r4
> 	bne	.L50
> .L49:
> 	adds	r1, r1, #1
> 	cmp	r3, r1
> 	add	r4, r4, #480
> 	bge	.L51
> .L47:
> 	pop	{r4, r5, r6}
> 	bx	lr

Gah.  Sorry folks.  I keep screwing up.  I can see above that the `fillSpan` function is not being inlined.  I must be doing something wrong.

Please ignore this thread.

Sorry,
Mike

July 20, 2018
On Friday, 20 July 2018 at 11:11:12 UTC, Mike Franklin wrote:

> I ask for any insight you might have, should you wish to give this your attention.  Regardless, I'll keep investigating.

Just to follow up, after I enabled `-funroll-loops` for GDC, it was almost twice as fast as LDC, though the code size was a little larger.
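
For anyone curious what unrolling does to a loop like `fillSpan`, here is a hand-unrolled sketch that can be run on a host machine.  It is an illustration only, not what either compiler emits verbatim.  The `width` of 240 is an assumption inferred from the `#480` byte row stride in the disassembly, and the `frameBuffer` length of 76800 is taken from the mangled symbol name; `fillSpanUnrolled` is a hypothetical name.

```d
// Host-side stand-ins for the project's globals (assumptions, see above).
enum width = 240;                       // 480-byte stride / 2 bytes per pixel
__gshared ushort[76_800] frameBuffer;   // 240 * 320 pixels

// Original shape: one halfword store per loop iteration.
void fillSpan(int x, int y, uint spanWidth, ushort color)
{
    int start = y * width + x;
    for (uint i = 0; i < spanWidth; i++)
        frameBuffer[start + i] = color;
}

// Hand-unrolled sketch: four stores per iteration, then a 0..3 pixel tail.
void fillSpanUnrolled(int x, int y, uint spanWidth, ushort color)
{
    int start = y * width + x;
    const uint bulk = spanWidth & ~3u;  // largest multiple of 4 <= spanWidth
    uint i = 0;
    for (; i < bulk; i += 4)
    {
        frameBuffer[start + i]     = color;
        frameBuffer[start + i + 1] = color;
        frameBuffer[start + i + 2] = color;
        frameBuffer[start + i + 3] = color;
    }
    for (; i < spanWidth; i++)          // remainder pixels
        frameBuffer[start + i] = color;
}

unittest
{
    // Both versions must write exactly the same pixels.
    fillSpan(3, 2, 7, 0xAAAA);
    auto expected = frameBuffer.dup;
    frameBuffer[] = 0;
    fillSpanUnrolled(3, 2, 7, 0xAAAA);
    assert(frameBuffer[] == expected);
}
```

The unrolled version runs the compare-and-branch once per four pixels instead of once per pixel, which is the usual reason unrolling helps a tight store loop, at the cost of larger code.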

Bottom line is:  I just need to learn the compilers better (both of them) and learn how to tune them for the application.

Mike