June 28, 2017 Re: Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M
Posted in reply to Iain Buclaw | On Wednesday, 28 June 2017 at 16:52:26 UTC, Iain Buclaw wrote:
> You probably want to tone down on optimizations as well. -O3 will be doing a lot of work, sometimes for little or no gain. In most cases, -O2 -finline-functions is good enough, which can be abbreviated further as simply -Os. [for full list of enabled/disabled passes: gdc -Q -Os --help=optimizers]
>
> You can see a breakdown of what areas the compiler spends the most time in with -ftime-report

Compiling with -O3
------------------
phase opt and generate : 50.74 (96%) usr 21.24 (99%) sys 72.94 (97%) wall 2426962 kB (94%) ggc
TOTAL : 53.02 21.49 75.55 2589984 kB

real 1m21.339s
user 0m57.086s
sys 0m22.297s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
   6228       0  153600  159828   27054 binary/firmware

Compiling with -O2 -finline-functions
-------------------------------------
phase opt and generate : 50.71 (96%) usr 20.58 (98%) sys 72.04 (97%) wall 2381419 kB (94%) ggc
TOTAL : 52.89 20.93 74.63 2544441 kB

real 1m20.755s
user 0m56.857s
sys 0m21.826s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
   5912       0  153600  159512   26f18 binary/firmware

Compiling with -O0
------------------
phase opt and generate : 22.95 (91%) usr 5.42 (94%) sys 28.38 (92%) wall 1777106 kB (92%) ggc
TOTAL : 25.14 5.74 30.94 1940102 kB

real 0m36.476s
user 0m29.600s
sys 0m6.647s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
  45250       0  153600  198850   308c2 binary/firmware

-------------------------------------------------------------------------
The vast majority of time is spent in "phase opt and generate". A few observations:

* Elapsed time isn't much different between -O3 and -O2 -finline-functions
* -O2 -finline-functions gave me a smaller binary :)
* -O0 reduced time significantly, but "phase opt and generate" still takes an awfully long time relative to everything else

What exactly is "phase opt and generate"? I'm assuming "opt" means optimizer, but why is it taking such a long time even with -O0? Maybe it's the "generate" part of that that's the most significant.

With -O0 there's still quite a few things enabled, so maybe I'll start appending a "-fno" to each one and see if I can find a culprit.

-O0 -Q --help=optimizers
-faggressive-loop-optimizations [enabled]
-fauto-inc-dec [enabled]
-fdce [enabled]
-fdelete-null-pointer-checks [enabled]
-fdse [enabled]
-fearly-inlining [enabled]
-ffp-contract=[off|on|fast] fast
-ffp-int-builtin-inexact [enabled]
-ffunction-cse [enabled]
-fgcse-lm [enabled]
-finline [enabled]
-finline-atomics [enabled]
-fira-hoist-pressure [enabled]
-fira-share-save-slots [enabled]
-fira-share-spill-slots [enabled]
-fivopts [enabled]
-fjump-tables [enabled]
-flifetime-dse [enabled]
-fmath-errno [enabled]
-fpeephole [enabled]
-fplt [enabled]
-fprefetch-loop-arrays [enabled]
-fprintf-return-value [enabled]
-freg-struct-return [enabled]
-frename-registers [enabled]
-frtti [enabled]
-fsched-critical-path-heuristic [enabled]
-fsched-dep-count-heuristic [enabled]
-fsched-group-heuristic [enabled]
-fsched-interblock [enabled]
-fsched-last-insn-heuristic [enabled]
-fsched-rank-heuristic [enabled]
-fsched-spec [enabled]
-fsched-spec-insn-heuristic [enabled]
-fsched-stalled-insns-dep [enabled]
-fschedule-fusion [enabled]
-fshort-enums [enabled]
-fshrink-wrap-separate [enabled]
-fsigned-zeros [enabled]
-fsimd-cost-model=[unlimited|dynamic|cheap] unlimited
-fsplit-ivs-in-unroller [enabled]
-fssa-backprop [enabled]
-fstack-reuse=[all|named_vars|none] all
-fstdarg-opt [enabled]
-fstrict-volatile-bitfields [enabled]
-fno-threadsafe-statics [enabled]
-ftrapping-math [enabled]
-ftree-cselim [enabled]
-ftree-forwprop [enabled]
-ftree-loop-if-convert [enabled]
-ftree-loop-im [enabled]
-ftree-loop-ivcanon [enabled]
-ftree-loop-optimize [enabled]
-ftree-phiprop [enabled]
-ftree-reassoc [enabled]
-ftree-scev-cprop [enabled]
-fvar-tracking [enabled]
-fvar-tracking-assignments [enabled]
-fweb [enabled]
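The plan of appending "-fno" to each enabled option can be scripted rather than typed by hand. What follows is only a minimal sketch: the function names to_fno_flags and try_each_flag, and the GDC and SOURCES variables, are made up for illustration and do not come from the thread. It assumes the listing has one option per line with its state in the second column, as in the list above. Some of the generated flags may be rejected or ignored for D (for example -fno-rtti), so treat the output as candidates to try, not as a verified set.

```shell
#!/bin/sh
# Read `gdc -O0 -Q --help=optimizers` output on stdin and print one
# -fno-<name> flag for every plain on/off option reported as [enabled].
# Options that take a value (e.g. -ffp-contract=...) and options that are
# already negated (e.g. -fno-threadsafe-statics) are skipped.
to_fno_flags() {
    awk '$2 == "[enabled]" && $1 ~ /^-f/ && $1 !~ /^-fno-/ {
        print "-fno-" substr($1, 3)
    }'
}

# Hypothetical driver (defined only, never called here): rebuild once per
# disabled option and show the TOTAL line of -ftime-report for each run.
# GDC and SOURCES are placeholders for the cross compiler and module list.
try_each_flag() {
    "$GDC" -O0 -Q --help=optimizers | to_fno_flags | while read -r flag; do
        echo "== $flag"
        # SOURCES is left unquoted on purpose so it splits into file names.
        "$GDC" -c -O0 "$flag" -ftime-report $SOURCES -o /dev/null 2>&1 | grep 'TOTAL'
    done
}
```

With GDC=arm-none-eabi-gdc and SOURCES set to the module list, calling try_each_flag would run one full compile per option, so at roughly 30 seconds each the whole sweep takes a while.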
June 29, 2017 Re: Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M
Posted in reply to Mike | On 28 June 2017 at 23:15, Mike via D.gnu <d.gnu@puremagic.com> wrote:
> -------------------------------------------------------------------------
> The vast majority of time is spent in "phase opt and generate". A few observations:
>
> * Elapsed time isn't much different between -O3 and -O2 -finline-functions
> * -O2 -finline-functions gave me a smaller binary :)
I did say that -Os (optimize for size) is practically identical to this. So I'm not surprised. ;-)
And yeah, one of the big differences between -O2 and -O3 is that when it comes to inlining, -O3 mostly disregards size and cost heuristics.
> * -O0 reduced time significantly, but "phase opt and generate" still takes an awfully long time relative to everything else
>
> What exactly is "phase opt and generate"? I'm assuming "opt" means optimizer, but why is it taking such a long time even with -O0? Maybe it's the "generate" part of that that's the most significant.
>
Phase opt and generate is the top-level timer for the entire "backend" compilation phase. I was expecting to see more of a breakdown of individual passes.
> With -O0 there's still quite a few things enabled, so maybe I'll start appending a "-fno" to each one and see if I can find a culprit.
>
A thought just occurred to me, you are compiling the entire program + object.d right? Nothing else will link/be linked to the binary?
If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
Regards,
Iain.
June 28, 2017 Re: Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M
Posted in reply to Iain Buclaw | On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
> Phase opt and generate is the top-level timer for the entire "backend" compilation phase. I was expecting to see more of a breakdown of individual passes.
Sorry, it didn't look broken down to me. Here's the full report.

arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -Isource/runtime -fno-bounds-check -fno-invariants -fno-in -fno-out -ffunction-sections -fdata-sections -ftime-report source/gcc/attribute.d source/board/package.d source/board/ILI9341.d source/board/lcd.d source/board/spi5.d source/board/statusLED.d source/board/random.d source/board/ltdc.d source/stm32f42/bus.d source/stm32f42/scb.d source/stm32f42/trace.d source/stm32f42/dma2d.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/rcc.d source/stm32f42/rng.d source/stm32f42/nvic.d source/stm32f42/mmio.d source/stm32f42/flash.d source/stm32f42/gpio.d source/stm32f42/ltdc.d source/main.d -o binary/firmware.o

Execution times (seconds)
phase setup : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 2310 kB ( 0%) ggc
phase parsing : 2.21 ( 4%) usr 0.32 ( 2%) sys 2.55 ( 3%) wall 160684 kB ( 6%) ggc
phase opt and generate : 51.89 (96%) usr 20.13 (98%) sys 72.29 (97%) wall 2381419 kB (94%) ggc
phase last asm : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 26 kB ( 0%) ggc
phase finalize : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
garbage collection : 0.90 ( 2%) usr 0.04 ( 0%) sys 1.05 ( 1%) wall 0 kB ( 0%) ggc
dump files : 4.17 ( 8%) usr 1.96 (10%) sys 5.67 ( 8%) wall 0 kB ( 0%) ggc
callgraph construction : 0.66 ( 1%) usr 0.20 ( 1%) sys 1.07 ( 1%) wall 26036 kB ( 1%) ggc
callgraph optimization : 1.55 ( 3%) usr 0.78 ( 4%) sys 1.89 ( 3%) wall 1689 kB ( 0%) ggc
ipa dead code removal : 0.29 ( 1%) usr 0.00 ( 0%) sys 0.28 ( 0%) wall 0 kB ( 0%) ggc
ipa inheritance graph : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
ipa devirtualization : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
ipa cp : 0.21 ( 0%) usr 0.01 ( 0%) sys 0.18 ( 0%) wall 6160 kB ( 0%) ggc
ipa inlining heuristics : 0.69 ( 1%) usr 0.15 ( 1%) sys 0.67 ( 1%) wall 88573 kB ( 3%) ggc
ipa function splitting : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 9 kB ( 0%) ggc
ipa comdats : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 0 kB ( 0%) ggc
ipa various optimizations: 0.08 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc
ipa reference : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc
ipa profile : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc
ipa pure const : 0.38 ( 1%) usr 0.09 ( 0%) sys 0.54 ( 1%) wall 0 kB ( 0%) ggc
ipa icf : 1.59 ( 3%) usr 0.01 ( 0%) sys 1.60 ( 2%) wall 11 kB ( 0%) ggc
ipa SRA : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
ipa free lang data : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
ipa free inline summary : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc
cfg construction : 0.15 ( 0%) usr 0.06 ( 0%) sys 0.12 ( 0%) wall 5 kB ( 0%) ggc
cfg cleanup : 0.66 ( 1%) usr 0.27 ( 1%) sys 1.04 ( 1%) wall 17 kB ( 0%) ggc
trivially dead code : 0.12 ( 0%) usr 0.05 ( 0%) sys 0.38 ( 1%) wall 0 kB ( 0%) ggc
df scan insns : 0.45 ( 1%) usr 0.19 ( 1%) sys 0.56 ( 1%) wall 5569 kB ( 0%) ggc
df multiple defs : 0.24 ( 0%) usr 0.06 ( 0%) sys 0.28 ( 0%) wall 0 kB ( 0%) ggc
df reaching defs : 0.15 ( 0%) usr 0.03 ( 0%) sys 0.26 ( 0%) wall 0 kB ( 0%) ggc
df live regs : 0.60 ( 1%) usr 0.25 ( 1%) sys 0.70 ( 1%) wall 0 kB ( 0%) ggc
df live&initialized regs: 0.32 ( 1%) usr 0.13 ( 1%) sys 0.57 ( 1%) wall 0 kB ( 0%) ggc
df use-def / def-use chains: 0.05 ( 0%) usr 0.03 ( 0%) sys 0.11 ( 0%) wall 0 kB ( 0%) ggc
df reg dead/unused notes: 0.56 ( 1%) usr 0.18 ( 1%) sys 0.86 ( 1%) wall 2562 kB ( 0%) ggc
register information : 0.14 ( 0%) usr 0.13 ( 1%) sys 0.40 ( 1%) wall 0 kB ( 0%) ggc
alias analysis : 0.79 ( 1%) usr 0.34 ( 2%) sys 1.14 ( 2%) wall 28569 kB ( 1%) ggc
alias stmt walking : 0.10 ( 0%) usr 0.02 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc
register scan : 0.07 ( 0%) usr 0.01 ( 0%) sys 0.11 ( 0%) wall 106 kB ( 0%) ggc
rebuild jump labels : 0.05 ( 0%) usr 0.05 ( 0%) sys 0.15 ( 0%) wall 0 kB ( 0%) ggc
parser (global) : 2.19 ( 4%) usr 0.32 ( 2%) sys 2.51 ( 3%) wall 160144 kB ( 6%) ggc
early inlining heuristics: 0.17 ( 0%) usr 0.09 ( 0%) sys 0.24 ( 0%) wall 19510 kB ( 1%) ggc
inline parameters : 0.35 ( 1%) usr 0.18 ( 1%) sys 0.44 ( 1%) wall 58124 kB ( 2%) ggc
integration : 0.63 ( 1%) usr 0.24 ( 1%) sys 0.85 ( 1%) wall 80071 kB ( 3%) ggc
tree gimplify : 0.48 ( 1%) usr 0.17 ( 1%) sys 0.53 ( 1%) wall 109681 kB ( 4%) ggc
tree eh : 0.13 ( 0%) usr 0.07 ( 0%) sys 0.20 ( 0%) wall 13982 kB ( 1%) ggc
tree CFG construction : 0.19 ( 0%) usr 0.05 ( 0%) sys 0.17 ( 0%) wall 54230 kB ( 2%) ggc
tree CFG cleanup : 0.69 ( 1%) usr 0.38 ( 2%) sys 1.19 ( 2%) wall 1131 kB ( 0%) ggc
tree tail merge : 0.11 ( 0%) usr 0.02 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc
tree VRP : 0.93 ( 2%) usr 0.35 ( 2%) sys 1.29 ( 2%) wall 89761 kB ( 4%) ggc
tree Early VRP : 0.21 ( 0%) usr 0.08 ( 0%) sys 0.31 ( 0%) wall 42204 kB ( 2%) ggc
tree copy propagation : 0.06 ( 0%) usr 0.03 ( 0%) sys 0.10 ( 0%) wall 0 kB ( 0%) ggc
tree PTA : 1.78 ( 3%) usr 0.85 ( 4%) sys 2.50 ( 3%) wall 4103 kB ( 0%) ggc
tree PHI insertion : 0.07 ( 0%) usr 0.02 ( 0%) sys 0.03 ( 0%) wall 6571 kB ( 0%) ggc
tree SSA rewrite : 0.16 ( 0%) usr 0.06 ( 0%) sys 0.20 ( 0%) wall 20087 kB ( 1%) ggc
tree SSA other : 0.21 ( 0%) usr 0.13 ( 1%) sys 0.51 ( 1%) wall 5602 kB ( 0%) ggc
tree SSA incremental : 0.15 ( 0%) usr 0.10 ( 0%) sys 0.30 ( 0%) wall 60 kB ( 0%) ggc
tree operand scan : 0.34 ( 1%) usr 0.22 ( 1%) sys 0.56 ( 1%) wall 56364 kB ( 2%) ggc
dominator optimization : 0.73 ( 1%) usr 0.22 ( 1%) sys 0.75 ( 1%) wall 7545 kB ( 0%) ggc
backwards jump threading: 0.30 ( 1%) usr 0.09 ( 0%) sys 0.25 ( 0%) wall 111 kB ( 0%) ggc
tree SRA : 0.13 ( 0%) usr 0.04 ( 0%) sys 0.17 ( 0%) wall 28 kB ( 0%) ggc
isolate eroneous paths : 0.04 ( 0%) usr 0.03 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc
tree CCP : 0.68 ( 1%) usr 0.24 ( 1%) sys 0.85 ( 1%) wall 7302 kB ( 0%) ggc
tree PHI const/copy prop: 0.05 ( 0%) usr 0.02 ( 0%) sys 0.10 ( 0%) wall 0 kB ( 0%) ggc
tree split crit edges : 0.05 ( 0%) usr 0.06 ( 0%) sys 0.17 ( 0%) wall 19 kB ( 0%) ggc
tree reassociation : 0.23 ( 0%) usr 0.07 ( 0%) sys 0.38 ( 1%) wall 6 kB ( 0%) ggc
tree PRE : 1.28 ( 2%) usr 0.48 ( 2%) sys 1.78 ( 2%) wall 50466 kB ( 2%) ggc
tree FRE : 0.69 ( 1%) usr 0.36 ( 2%) sys 1.22 ( 2%) wall 17297 kB ( 1%) ggc
tree code sinking : 0.10 ( 0%) usr 0.05 ( 0%) sys 0.13 ( 0%) wall 6 kB ( 0%) ggc
tree linearize phis : 0.19 ( 0%) usr 0.08 ( 0%) sys 0.27 ( 0%) wall 41714 kB ( 2%) ggc
tree backward propagate : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc
tree forward propagate : 0.23 ( 0%) usr 0.08 ( 0%) sys 0.38 ( 1%) wall 62 kB ( 0%) ggc
tree phiprop : 0.06 ( 0%) usr 0.01 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc
tree conservative DCE : 0.21 ( 0%) usr 0.15 ( 1%) sys 0.36 ( 0%) wall 209 kB ( 0%) ggc
tree aggressive DCE : 0.28 ( 1%) usr 0.12 ( 1%) sys 0.44 ( 1%) wall 83438 kB ( 3%) ggc
tree buildin call DCE : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
tree DSE : 0.09 ( 0%) usr 0.09 ( 0%) sys 0.21 ( 0%) wall 0 kB ( 0%) ggc
PHI merge : 0.07 ( 0%) usr 0.04 ( 0%) sys 0.11 ( 0%) wall 0 kB ( 0%) ggc
tree loop optimization : 0.02 ( 0%) usr 0.01 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
loopless fn : 0.04 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
tree loop invariant motion: 0.01 ( 0%) usr 0.02 ( 0%) sys 0.07 ( 0%) wall 1 kB ( 0%) ggc
complete unrolling : 0.05 ( 0%) usr 0.04 ( 0%) sys 0.12 ( 0%) wall 136 kB ( 0%) ggc
tree iv optimization : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 120 kB ( 0%) ggc
tree copy headers : 0.03 ( 0%) usr 0.02 ( 0%) sys 0.03 ( 0%) wall 7 kB ( 0%) ggc
tree SSA uncprop : 0.28 ( 1%) usr 0.13 ( 1%) sys 0.31 ( 0%) wall 0 kB ( 0%) ggc
tree NRV optimization : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 849 kB ( 0%) ggc
tree switch conversion : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
tree strlen optimization: 0.03 ( 0%) usr 0.01 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc
dominance frontiers : 0.09 ( 0%) usr 0.02 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc
dominance computation : 1.42 ( 3%) usr 0.51 ( 2%) sys 1.94 ( 3%) wall 0 kB ( 0%) ggc
control dependences : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc
out of ssa : 0.22 ( 0%) usr 0.10 ( 0%) sys 0.35 ( 0%) wall 7465 kB ( 0%) ggc
expand vars : 0.02 ( 0%) usr 0.02 ( 0%) sys 0.04 ( 0%) wall 506 kB ( 0%) ggc
expand : 0.63 ( 1%) usr 0.24 ( 1%) sys 1.12 ( 1%) wall 63840 kB ( 3%) ggc
post expand cleanups : 0.24 ( 0%) usr 0.04 ( 0%) sys 0.23 ( 0%) wall 18401 kB ( 1%) ggc
varconst : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 539 kB ( 0%) ggc
lower subreg : 0.07 ( 0%) usr 0.01 ( 0%) sys 0.05 ( 0%) wall 0 kB ( 0%) ggc
jump : 0.13 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc
forward prop : 0.73 ( 1%) usr 0.26 ( 1%) sys 0.86 ( 1%) wall 2110 kB ( 0%) ggc
CSE : 0.50 ( 1%) usr 0.19 ( 1%) sys 0.73 ( 1%) wall 1053 kB ( 0%) ggc
dead code elimination : 0.23 ( 0%) usr 0.07 ( 0%) sys 0.38 ( 1%) wall 0 kB ( 0%) ggc
dead store elim1 : 0.24 ( 0%) usr 0.09 ( 0%) sys 0.48 ( 1%) wall 1039 kB ( 0%) ggc
dead store elim2 : 0.27 ( 0%) usr 0.14 ( 1%) sys 0.39 ( 1%) wall 960 kB ( 0%) ggc
loop analysis : 0.10 ( 0%) usr 0.06 ( 0%) sys 0.11 ( 0%) wall 0 kB ( 0%) ggc
loop init : 1.34 ( 2%) usr 0.51 ( 2%) sys 1.93 ( 3%) wall 183463 kB ( 7%) ggc
loop invariant motion : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall 1 kB ( 0%) ggc
loop doloop : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 36 kB ( 0%) ggc
loop fini : 0.61 ( 1%) usr 0.31 ( 2%) sys 0.94 ( 1%) wall 0 kB ( 0%) ggc
CPROP : 0.21 ( 0%) usr 0.05 ( 0%) sys 0.21 ( 0%) wall 295 kB ( 0%) ggc
PRE : 0.09 ( 0%) usr 0.01 ( 0%) sys 0.06 ( 0%) wall 4 kB ( 0%) ggc
auto inc dec : 0.12 ( 0%) usr 0.06 ( 0%) sys 0.15 ( 0%) wall 934 kB ( 0%) ggc
CSE 2 : 0.29 ( 1%) usr 0.18 ( 1%) sys 0.44 ( 1%) wall 171 kB ( 0%) ggc
branch prediction : 0.20 ( 0%) usr 0.06 ( 0%) sys 0.16 ( 0%) wall 4067 kB ( 0%) ggc
combiner : 0.84 ( 2%) usr 0.22 ( 1%) sys 1.42 ( 2%) wall 13624 kB ( 1%) ggc
if-conversion : 0.28 ( 1%) usr 0.08 ( 0%) sys 0.41 ( 1%) wall 2 kB ( 0%) ggc
scheduling : 1.45 ( 3%) usr 0.63 ( 3%) sys 2.15 ( 3%) wall 4177 kB ( 0%) ggc
integrated RA : 1.83 ( 3%) usr 0.70 ( 3%) sys 2.45 ( 3%) wall 964084 kB (38%) ggc
LRA non-specific : 0.69 ( 1%) usr 0.33 ( 2%) sys 0.90 ( 1%) wall 2272 kB ( 0%) ggc
LRA virtuals elimination: 0.27 ( 0%) usr 0.15 ( 1%) sys 0.36 ( 0%) wall 1881 kB ( 0%) ggc
LRA reload inheritance : 0.09 ( 0%) usr 0.04 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc
LRA create live ranges : 0.12 ( 0%) usr 0.06 ( 0%) sys 0.12 ( 0%) wall 1 kB ( 0%) ggc
LRA hard reg assignment : 0.09 ( 0%) usr 0.05 ( 0%) sys 0.20 ( 0%) wall 0 kB ( 0%) ggc
reload : 0.12 ( 0%) usr 0.06 ( 0%) sys 0.13 ( 0%) wall 0 kB ( 0%) ggc
reload CSE regs : 0.46 ( 1%) usr 0.09 ( 0%) sys 0.43 ( 1%) wall 2852 kB ( 0%) ggc
thread pro- & epilogue : 0.25 ( 0%) usr 0.13 ( 1%) sys 0.45 ( 1%) wall 37093 kB ( 1%) ggc
if-conversion 2 : 0.06 ( 0%) usr 0.02 ( 0%) sys 0.18 ( 0%) wall 0 kB ( 0%) ggc
peephole 2 : 0.11 ( 0%) usr 0.04 ( 0%) sys 0.18 ( 0%) wall 11 kB ( 0%) ggc
hard reg cprop : 0.12 ( 0%) usr 0.05 ( 0%) sys 0.19 ( 0%) wall 0 kB ( 0%) ggc
scheduling 2 : 1.05 ( 2%) usr 0.44 ( 2%) sys 1.67 ( 2%) wall 3203 kB ( 0%) ggc
machine dep reorg : 0.21 ( 0%) usr 0.05 ( 0%) sys 0.26 ( 0%) wall 10319 kB ( 0%) ggc
reorder blocks : 0.10 ( 0%) usr 0.03 ( 0%) sys 0.20 ( 0%) wall 20 kB ( 0%) ggc
shorten branches : 0.16 ( 0%) usr 0.05 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc
final : 0.88 ( 2%) usr 0.47 ( 2%) sys 1.51 ( 2%) wall 15600 kB ( 1%) ggc
variable output : 0.30 ( 1%) usr 0.03 ( 0%) sys 0.33 ( 0%) wall 10352 kB ( 0%) ggc
symout : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
tree if-combine : 0.05 ( 0%) usr 0.02 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
straight-line strength reduction: 0.13 ( 0%) usr 0.07 ( 0%) sys 0.22 ( 0%) wall 0 kB ( 0%) ggc
store merging : 0.07 ( 0%) usr 0.03 ( 0%) sys 0.01 ( 0%) wall 9 kB ( 0%) ggc
address lowering : 0.01 ( 0%) usr 0.01 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
early local passes : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
unaccounted optimizations: 0.01 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
rest of compilation : 5.83 (11%) usr 2.63 (13%) sys 8.49 (11%) wall 101391 kB ( 4%) ggc
unaccounted post reload : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
unaccounted late compilation: 0.03 ( 0%) usr 0.01 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
remove unused locals : 0.18 ( 0%) usr 0.08 ( 0%) sys 0.22 ( 0%) wall 0 kB ( 0%) ggc
address taken : 0.13 ( 0%) usr 0.03 ( 0%) sys 0.13 ( 0%) wall 0 kB ( 0%) ggc
rebuild frequencies : 0.03 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
repair loop structures : 0.07 ( 0%) usr 0.03 ( 0%) sys 0.14 ( 0%) wall 0 kB ( 0%) ggc
TOTAL : 54.11 20.45 74.91 2544441 kB

> A thought just occurred to me, you are compiling the entire program + object.d right? Nothing else will link/be linked to the binary?
I'm passing all files except druntime files via the command line. druntime files are imported via -Isource/runtime. But essentially yes, I'm compiling the entire application in one command so I can get cross-module inlining.
I tried moving all runtime files to the command line, but I get errors about __entrypoint.

cc1d: error: module __entrypoint is in file '__entrypoint.d' which cannot be read
Specify path to file '__entrypoint.d' with -I switch

> If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
If I only import __entrypoint.d and pass the rest of the runtime files on the command line and compile with -fwhole-program, it compiles in 5s, but I only get an 8byte binary. I suspect this is due to the error above about __entrypoint. That is, if there's no entry point, the whole program gets garbage collected.
I think you might be on to something here though.
I'm out of time now; gotta catch a plane soon. I'll try to do more troubleshooting when I return.
Mike
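A report of this length is easier to read when sorted by wall-clock time. The following is a rough sketch, not part of the original posts: top_wall_timers is a made-up name, and the sed pattern assumes the row layout shown in the -ftime-report output in this thread (usr, sys, wall, ggc columns in that order), which other GCC versions may not share. Note that the "phase ..." rows are aggregate timers, so they overlap with the individual rows beneath them.

```shell
#!/bin/sh
# Read a GCC -ftime-report listing on stdin and print
# "<wall seconds> <timer name>" for the N timers with the largest
# wall-clock time (N defaults to 10). Rows without a wall column,
# such as the TOTAL line, do not match the pattern and are skipped.
top_wall_timers() {
    sed -n 's/^ *\(.*[^ ]\) *: .* sys *\([0-9.][0-9.]*\) *([ 0-9]*%) wall.*$/\2 \1/p' |
        sort -rn | head -n "${1:-10}"
}
```

Since -ftime-report writes to stderr, a typical use would be to add 2>&1 to the compile command and pipe it into top_wall_timers 15. Applied to the report in this thread, the largest non-phase entries would be "rest of compilation" (8.49 s wall) and "dump files" (5.67 s wall).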
June 29, 2017 Re: Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M
Posted in reply to Mike | On 29 June 2017 at 00:50, Mike via D.gnu <d.gnu@puremagic.com> wrote:
> On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
>> If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
>
>
> If I only import __entrypoint.d and pass the rest of the runtime files on the command line and compile with -fwhole-program, it compiles in 5s, but I only get an 8byte binary. I suspect this is due to the error above about __entrypoint. That is, if there's no entry point, the whole program gets garbage collected.
>
> I think you might be on to something here though.
>
Yes, it seems like there are more than a few hundred functions being emitted, and as they are all considered externally visible, none can be removed during the optimization pass. If they are all considered static (in the C sense), then unused and inlined functions can be removed immediately, giving the backend less work. I think the only caveat with using -fwhole-program is that C main must be present in the compilation, otherwise as you've noted, everything gets removed as unused code.
Regards,
Iain.
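Given the caveat that C main must be present, one quick sanity check is to look for it in the object file before blaming -fwhole-program for an empty result. This is only an illustrative sketch: has_c_main is a hypothetical helper, and it assumes the default (BSD-style) nm output of "address type name" per line.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the nm listing on stdin contains a defined,
# externally visible text symbol (type T) named exactly "main".
has_c_main() {
    grep -Eq '^[0-9a-fA-F]+ +T +main$'
}
```

For example, arm-none-eabi-nm binary/firmware.o | has_c_main || echo "no C main found" would flag an object that the compiler may have emptied for lack of an entry point.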
July 17, 2017 Re: Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M
Posted in reply to Iain Buclaw | On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
> A thought just occurred to me, you are compiling the entire program + object.d right? Nothing else will link/be linked to the binary?
>
> If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
I'm back and have spent the last two days trying to get my project compiled with -fwhole-program in an effort to reduce the compile times, but I haven't had any success.
Regardless of what I do, the compiler doesn't emit anything except main.
arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -fwhole-program -Isource/entrypoint -fno-bounds-check -ffunction-sections -fdata-sections source/gcc/attribute.d source/runtime/exception.d source/runtime/invariant.d source/runtime/object.d source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d source/board/package.d source/board/statusLED.d source/board/ltdc.d source/board/random.d source/board/spi5.d source/stm32f42/nvic.d source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d source/main.d -o binary/firmware.o
arm-none-eabi-nm binary/firmware.o
U _Dmain
U _d_run_main
00000000 T main
Are you sure this works with multiple modules passed to the compiler on one line?
Mike
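When comparing builds with and without -fwhole-program, a one-line summary of the symbol table is quicker to scan than the raw nm listing. A small sketch under stated assumptions: summarize_nm is an invented name, and it relies on default nm output where symbols that are only referenced have two fields (type and name) and defined symbols have three (address, type, name).

```shell
#!/bin/sh
# Read `nm` output on stdin and report how many symbols the object defines
# versus how many it only references (two-field lines such as "U name").
summarize_nm() {
    awk 'NF == 2 { nundef++ }
         NF == 3 { ndef++ }
         END     { printf "defined=%d undefined=%d\n", ndef, nundef }'
}
```

Run as arm-none-eabi-nm binary/firmware.o | summarize_nm; for the three-line listing above it would report one defined and two undefined symbols.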
July 18, 2017 Re: Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M
Posted in reply to Mike | On 17 July 2017 at 22:26, Mike via D.gnu <d.gnu@puremagic.com> wrote:
> On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
>
>> A thought just occurred to me, you are compiling the entire program + object.d right? Nothing else will link/be linked to the binary?
>>
>> If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
>
>
> I'm back and have spent the last two days trying to get my project compiled with -fwhole-program in an effort to reduce the compile times, but I haven't had any success.
>
> Regardless of what I do, the compiler doesn't emit anything except main.
>
> arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -fwhole-program -Isource/entrypoint -fno-bounds-check -ffunction-sections -fdata-sections source/gcc/attribute.d source/runtime/exception.d source/runtime/invariant.d source/runtime/object.d source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d source/board/package.d source/board/statusLED.d source/board/ltdc.d source/board/random.d source/board/spi5.d source/stm32f42/nvic.d source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d source/main.d -o binary/firmware.o
>
> arm-none-eabi-nm binary/firmware.o
> U _Dmain
> U _d_run_main
> 00000000 T main
>
> Are you sure this works with multiple modules passed to the compiler on one line?
>
Humm, it looks like it can't sufficiently determine that both the _d_run_main declarations are for the same symbol.
After making a tweak, it compiles a program that does a little bit more.
https://gist.github.com/ibuclaw/d1fad77b074fded2682a16df93369a84
This is the compiled result:
https://gist.github.com/ibuclaw/93c707f4729ffc4376539283defdc709
Iain.
July 18, 2017 Re: Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M
On 18 July 2017 at 01:19, Iain Buclaw <ibuclaw@gdcproject.org> wrote:
> On 17 July 2017 at 22:26, Mike via D.gnu <d.gnu@puremagic.com> wrote:
>> On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
>>
>>> A thought just occurred to me, you are compiling the entire program + object.d right? Nothing else will link/be linked to the binary?
>>>
>>> If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
>>
>>
>> I'm back and have spent the last two days trying to get my project compiled with -fwhole-program in an effort to reduce the compile times, but I haven't had any success.
>>
>> Regardless of what I do, the compiler doesn't emit anything except main.
>>
>> arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -fwhole-program -Isource/entrypoint -fno-bounds-check -ffunction-sections -fdata-sections source/gcc/attribute.d source/runtime/exception.d source/runtime/invariant.d source/runtime/object.d source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d source/board/package.d source/board/statusLED.d source/board/ltdc.d source/board/random.d source/board/spi5.d source/stm32f42/nvic.d source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d source/main.d -o binary/firmware.o
>>
>> arm-none-eabi-nm binary/firmware.o
>> U _Dmain
>> U _d_run_main
>> 00000000 T main
>>
>> Are you sure this works with multiple modules passed to the compiler on one line?
>>
>
> Humm, it looks like it can't sufficiently determine that both the _d_run_main declarations are for the same symbol.
>
> After making a tweak, it compiles a program that does a little bit more.
In fact, after inspecting the unoptimized result, I think I can safely say this is the entire program, as it is intended to be built.
Now, what can we learn from this? What can and should be done better?
Could this be achieved without -fwhole-program?
So far, the compiler really does just force every symbol to be emitted, at the cost of compilation time (no function can be removed during optimization). I think this is really just a workaround for not properly setting the visibility of compiled symbols (there's a lack of documentation in D about this). I think we could do better and be more explicit in setting this up. However, the problem has almost always been templates that get cut but shouldn't have been, and so end up undefined at link time.
There's probably a few bug reports to be raised.
Regards
Iain.
July 18, 2017 Re: Testing GDC (GCC 7.1) on Runtime-less ARM Cortex-M
Posted in reply to Iain Buclaw | On Monday, 17 July 2017 at 23:57:02 UTC, Iain Buclaw wrote:
> In fact, after inspecting the unoptimized result, I think I can safely say this is the entire program, as it is intended to be built.
>
If I compile without optimizations, I am able to get a complete binary also, even without the changes you made. But with optimizations, everything gets removed.
Anyway, I'm going to move on from this for now and just try to get a working binary. I believe once I properly implement volatileLoad and volatileStore, I'll be able to get a working binary.
Mike