June 28, 2017
On Wednesday, 28 June 2017 at 16:52:26 UTC, Iain Buclaw wrote:

> You probably want to tone down on optimizations as well.  -O3 will be doing a lot of work, sometimes for little or no gain.  In most cases, -O2 -finline-functions is good enough, which can be abbreviated further as simply -Os.  [for full list of enabled/disabled passes: gdc -Q -Os --help=optimizers]
>
> You can see a breakdown of what areas the compiler spends the most time in with -ftime-report

Compiling with -O3
------------------
phase opt and generate  :  50.74 (96%) usr  21.24 (99%) sys  72.94 (97%) wall 2426962 kB (94%) ggc
TOTAL                   :  53.02            21.49            75.55            2589984 kB

real    1m21.339s
user    0m57.086s
sys     0m22.297s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
   6228       0  153600  159828   27054 binary/firmware


Compiling with -O2 -finline-functions
-------------------------------------
phase opt and generate  :  50.71 (96%) usr  20.58 (98%) sys  72.04 (97%) wall 2381419 kB (94%) ggc
TOTAL                   :  52.89            20.93            74.63            2544441 kB

real    1m20.755s
user    0m56.857s
sys     0m21.826s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
   5912       0  153600  159512   26f18 binary/firmware

Compiling with -O0
------------------
phase opt and generate  :  22.95 (91%) usr   5.42 (94%) sys  28.38 (92%) wall 1777106 kB (92%) ggc
TOTAL                   :  25.14             5.74            30.94            1940102 kB

real    0m36.476s
user    0m29.600s
sys     0m6.647s

arm-none-eabi-size binary/firmware
   text    data     bss     dec     hex filename
  45250       0  153600  198850   308c2 binary/firmware


-------------------------------------------------------------------------
The vast majority of time is spent in "phase opt and generate".  A few observations:

* Elapsed time isn't much different between -O3 and -O2 -finline-functions
* -O2 -finline-functions gave me a smaller binary :)
* -O0 reduced time significantly, but "phase opt and generate" still takes an awfully long time relative to everything else
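
The observations above can be checked with a little arithmetic. The following Python sketch uses only the TOTAL wall times and text sizes reported in this post; the dictionary layout and the helper name `percent_change` are illustrative choices, not part of any tool.

```python
# Figures copied from the -ftime-report TOTAL lines and the
# arm-none-eabi-size output quoted above.
runs = {
    "-O3": {"wall_s": 75.55, "text_bytes": 6228},
    "-O2 -finline-functions": {"wall_s": 74.63, "text_bytes": 5912},
    "-O0": {"wall_s": 30.94, "text_bytes": 45250},
}


def percent_change(baseline: float, value: float) -> float:
    """Signed change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Compare every run against the -O3 build.
baseline = runs["-O3"]
for flags, run in runs.items():
    wall = percent_change(baseline["wall_s"], run["wall_s"])
    text = percent_change(baseline["text_bytes"], run["text_bytes"])
    print(f"{flags:24} wall {wall:+6.1f}%   text {text:+7.1f}%")
```

Relative to -O3, -O2 -finline-functions finishes about 1% sooner with a text section about 5% smaller, while -O0 cuts wall time by roughly 59% but produces a text section about seven times larger.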

What exactly is "phase opt and generate"?  I'm assuming "opt" means optimizer, but why is it taking such a long time even with -O0?  Maybe it's the "generate" part of that that's the most significant.

With -O0 there's still quite a few things enabled, so maybe I'll start appending a "-fno" to each one and see if I can find a culprit.

-O0 -Q --help=optimizers
  -faggressive-loop-optimizations       [enabled]
  -fauto-inc-dec                        [enabled]
  -fdce                                 [enabled]
  -fdelete-null-pointer-checks          [enabled]
  -fdse                                 [enabled]
  -fearly-inlining                      [enabled]
  -ffp-contract=[off|on|fast]           fast
  -ffp-int-builtin-inexact              [enabled]
  -ffunction-cse                        [enabled]
  -fgcse-lm                             [enabled]
  -finline                              [enabled]
  -finline-atomics                      [enabled]
  -fira-hoist-pressure                  [enabled]
  -fira-share-save-slots                [enabled]
  -fira-share-spill-slots               [enabled]
  -fivopts                              [enabled]
  -fjump-tables                         [enabled]
  -flifetime-dse                        [enabled]
  -fmath-errno                          [enabled]
  -fpeephole                            [enabled]
  -fplt                                 [enabled]
  -fprefetch-loop-arrays                [enabled]
  -fprintf-return-value                 [enabled]
  -freg-struct-return                   [enabled]
  -frename-registers                    [enabled]
  -frtti                                [enabled]
  -fsched-critical-path-heuristic       [enabled]
  -fsched-dep-count-heuristic           [enabled]
  -fsched-group-heuristic               [enabled]
  -fsched-interblock                    [enabled]
  -fsched-last-insn-heuristic           [enabled]
  -fsched-rank-heuristic                [enabled]
  -fsched-spec                          [enabled]
  -fsched-spec-insn-heuristic           [enabled]
  -fsched-stalled-insns-dep             [enabled]
  -fschedule-fusion                     [enabled]
  -fshort-enums                         [enabled]
  -fshrink-wrap-separate                [enabled]
  -fsigned-zeros                        [enabled]
  -fsimd-cost-model=[unlimited|dynamic|cheap]   unlimited
  -fsplit-ivs-in-unroller               [enabled]
  -fssa-backprop                        [enabled]
  -fstack-reuse=[all|named_vars|none]   all
  -fstdarg-opt                          [enabled]
  -fstrict-volatile-bitfields           [enabled]
  -fno-threadsafe-statics               [enabled]
  -ftrapping-math                       [enabled]
  -ftree-cselim                         [enabled]
  -ftree-forwprop                       [enabled]
  -ftree-loop-if-convert                [enabled]
  -ftree-loop-im                        [enabled]
  -ftree-loop-ivcanon                   [enabled]
  -ftree-loop-optimize                  [enabled]
  -ftree-phiprop                        [enabled]
  -ftree-reassoc                        [enabled]
  -ftree-scev-cprop                     [enabled]
  -fvar-tracking                        [enabled]
  -fvar-tracking-assignments            [enabled]
  -fweb                                 [enabled]
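
The "append -fno to each one" experiment could be scripted. Below is a minimal Python sketch that turns lines in the format shown above into their negated forms. The function name `negated_flags` and the sample excerpt are illustrative, and whether a particular gdc build accepts every negated flag has not been verified here.

```python
import re

# A few lines in the format printed by `-O0 -Q --help=optimizers`,
# copied from the list above.
SAMPLE = """\
  -faggressive-loop-optimizations       [enabled]
  -ffp-contract=[off|on|fast]           fast
  -fno-threadsafe-statics               [enabled]
  -fweb                                 [enabled]
"""

# Matches only simple on/off flags that are reported as enabled.
FLAG_LINE = re.compile(r"^\s*-f(?P<name>[\w-]+)\s+\[enabled\]\s*$")


def negated_flags(help_output: str) -> list:
    """Return the opposite form of every enabled on/off flag."""
    flags = []
    for line in help_output.splitlines():
        match = FLAG_LINE.match(line)
        if not match:
            continue  # valued options such as -ffp-contract=... are skipped
        name = match.group("name")
        if name.startswith("no-"):
            # An enabled -fno-xyz is negated by dropping the "no-".
            flags.append("-f" + name[3:])
        else:
            flags.append("-fno-" + name)
    return flags


for flag in negated_flags(SAMPLE):
    print(flag)
```

Each printed flag could then be appended to the -O0 command line one at a time, timing each build to see whether any single option makes a measurable difference.
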
June 29, 2017
On 28 June 2017 at 23:15, Mike via D.gnu <d.gnu@puremagic.com> wrote:
> -------------------------------------------------------------------------
> The vast majority of time is spent in "phase opt and generate".  A few observations:
>
> * Elapsed time isn't much different between -O3 and -O2 -finline-functions
> * -O2 -finline-functions gave me a smaller binary :)

I did say that -Os (optimize for size) is practically identical to
this. So I'm not surprised. ;-)

And yeah, one of the big differences between -O2 and -O3 is that when it comes to inlining, -O3 mostly disregards size and cost heuristics.


> * -O0 reduced time significantly, but "phase opt and generate" still takes an awfully long time relative to everything else
>
> What exactly is "phase opt and generate"?  I'm assuming "opt" means optimizer, but why is it taking such a long time even with -O0?  Maybe it's the "generate" part of that that's the most significant.
>

Phase opt and generate is the top-level timer for the entire "backend" compilation phase.  I was expecting to see more of a breakdown of individual passes.


> With -O0 there's still quite a few things enabled, so maybe I'll start appending a "-fno" to each one and see if I can find a culprit.
>

A thought just occurred to me, you are compiling the entire program + object.d right?  Nothing else will link/be linked to the binary?

If that is the case, you should definitely compile with -fwhole-program.  I suspect that may cut down your compilation time by half or even more.


Regards,
Iain.
June 28, 2017
On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:

> Phase opt and generate is the top-level timer for the entire "backend" compilation phase.  I was expecting to see more of a breakdown of individual passes.

Sorry, it didn't look broken down to me.  Here's the full report.

arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -Isource/runtime -fno-bounds-check -fno-invariants -fno-in -fno-out -ffunction-sections -fdata-sections -ftime-report source/gcc/attribute.d source/board/package.d source/board/ILI9341.d source/board/lcd.d source/board/spi5.d source/board/statusLED.d source/board/random.d source/board/ltdc.d source/stm32f42/bus.d source/stm32f42/scb.d source/stm32f42/trace.d source/stm32f42/dma2d.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/rcc.d source/stm32f42/rng.d source/stm32f42/nvic.d source/stm32f42/mmio.d source/stm32f42/flash.d source/stm32f42/gpio.d source/stm32f42/ltdc.d source/main.d -o binary/firmware.o

Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    2310 kB ( 0%) ggc
 phase parsing           :   2.21 ( 4%) usr   0.32 ( 2%) sys   2.55 ( 3%) wall  160684 kB ( 6%) ggc
 phase opt and generate  :  51.89 (96%) usr  20.13 (98%) sys  72.29 (97%) wall 2381419 kB (94%) ggc
 phase last asm          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      26 kB ( 0%) ggc
 phase finalize          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 garbage collection      :   0.90 ( 2%) usr   0.04 ( 0%) sys   1.05 ( 1%) wall       0 kB ( 0%) ggc
 dump files              :   4.17 ( 8%) usr   1.96 (10%) sys   5.67 ( 8%) wall       0 kB ( 0%) ggc
 callgraph construction  :   0.66 ( 1%) usr   0.20 ( 1%) sys   1.07 ( 1%) wall   26036 kB ( 1%) ggc
 callgraph optimization  :   1.55 ( 3%) usr   0.78 ( 4%) sys   1.89 ( 3%) wall    1689 kB ( 0%) ggc
 ipa dead code removal   :   0.29 ( 1%) usr   0.00 ( 0%) sys   0.28 ( 0%) wall       0 kB ( 0%) ggc
 ipa inheritance graph   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 ipa devirtualization    :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.21 ( 0%) usr   0.01 ( 0%) sys   0.18 ( 0%) wall    6160 kB ( 0%) ggc
 ipa inlining heuristics :   0.69 ( 1%) usr   0.15 ( 1%) sys   0.67 ( 1%) wall   88573 kB ( 3%) ggc
 ipa function splitting  :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       9 kB ( 0%) ggc
 ipa comdats             :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 ipa various optimizations:   0.08 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 ipa reference           :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   0.38 ( 1%) usr   0.09 ( 0%) sys   0.54 ( 1%) wall       0 kB ( 0%) ggc
 ipa icf                 :   1.59 ( 3%) usr   0.01 ( 0%) sys   1.60 ( 2%) wall      11 kB ( 0%) ggc
 ipa SRA                 :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 ipa free lang data      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 ipa free inline summary :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 cfg construction        :   0.15 ( 0%) usr   0.06 ( 0%) sys   0.12 ( 0%) wall       5 kB ( 0%) ggc
 cfg cleanup             :   0.66 ( 1%) usr   0.27 ( 1%) sys   1.04 ( 1%) wall      17 kB ( 0%) ggc
 trivially dead code     :   0.12 ( 0%) usr   0.05 ( 0%) sys   0.38 ( 1%) wall       0 kB ( 0%) ggc
 df scan insns           :   0.45 ( 1%) usr   0.19 ( 1%) sys   0.56 ( 1%) wall    5569 kB ( 0%) ggc
 df multiple defs        :   0.24 ( 0%) usr   0.06 ( 0%) sys   0.28 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs        :   0.15 ( 0%) usr   0.03 ( 0%) sys   0.26 ( 0%) wall       0 kB ( 0%) ggc
 df live regs            :   0.60 ( 1%) usr   0.25 ( 1%) sys   0.70 ( 1%) wall       0 kB ( 0%) ggc
 df live&initialized regs:   0.32 ( 1%) usr   0.13 ( 1%) sys   0.57 ( 1%) wall       0 kB ( 0%) ggc
 df use-def / def-use chains:   0.05 ( 0%) usr   0.03 ( 0%) sys   0.11 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.56 ( 1%) usr   0.18 ( 1%) sys   0.86 ( 1%) wall    2562 kB ( 0%) ggc
 register information    :   0.14 ( 0%) usr   0.13 ( 1%) sys   0.40 ( 1%) wall       0 kB ( 0%) ggc
 alias analysis          :   0.79 ( 1%) usr   0.34 ( 2%) sys   1.14 ( 2%) wall   28569 kB ( 1%) ggc
 alias stmt walking      :   0.10 ( 0%) usr   0.02 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 register scan           :   0.07 ( 0%) usr   0.01 ( 0%) sys   0.11 ( 0%) wall     106 kB ( 0%) ggc
 rebuild jump labels     :   0.05 ( 0%) usr   0.05 ( 0%) sys   0.15 ( 0%) wall       0 kB ( 0%) ggc
 parser (global)         :   2.19 ( 4%) usr   0.32 ( 2%) sys   2.51 ( 3%) wall  160144 kB ( 6%) ggc
 early inlining heuristics:   0.17 ( 0%) usr   0.09 ( 0%) sys   0.24 ( 0%) wall   19510 kB ( 1%) ggc
 inline parameters       :   0.35 ( 1%) usr   0.18 ( 1%) sys   0.44 ( 1%) wall   58124 kB ( 2%) ggc
 integration             :   0.63 ( 1%) usr   0.24 ( 1%) sys   0.85 ( 1%) wall   80071 kB ( 3%) ggc
 tree gimplify           :   0.48 ( 1%) usr   0.17 ( 1%) sys   0.53 ( 1%) wall  109681 kB ( 4%) ggc
 tree eh                 :   0.13 ( 0%) usr   0.07 ( 0%) sys   0.20 ( 0%) wall   13982 kB ( 1%) ggc
 tree CFG construction   :   0.19 ( 0%) usr   0.05 ( 0%) sys   0.17 ( 0%) wall   54230 kB ( 2%) ggc
 tree CFG cleanup        :   0.69 ( 1%) usr   0.38 ( 2%) sys   1.19 ( 2%) wall    1131 kB ( 0%) ggc
 tree tail merge         :   0.11 ( 0%) usr   0.02 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 tree VRP                :   0.93 ( 2%) usr   0.35 ( 2%) sys   1.29 ( 2%) wall   89761 kB ( 4%) ggc
 tree Early VRP          :   0.21 ( 0%) usr   0.08 ( 0%) sys   0.31 ( 0%) wall   42204 kB ( 2%) ggc
 tree copy propagation   :   0.06 ( 0%) usr   0.03 ( 0%) sys   0.10 ( 0%) wall       0 kB ( 0%) ggc
 tree PTA                :   1.78 ( 3%) usr   0.85 ( 4%) sys   2.50 ( 3%) wall    4103 kB ( 0%) ggc
 tree PHI insertion      :   0.07 ( 0%) usr   0.02 ( 0%) sys   0.03 ( 0%) wall    6571 kB ( 0%) ggc
 tree SSA rewrite        :   0.16 ( 0%) usr   0.06 ( 0%) sys   0.20 ( 0%) wall   20087 kB ( 1%) ggc
 tree SSA other          :   0.21 ( 0%) usr   0.13 ( 1%) sys   0.51 ( 1%) wall    5602 kB ( 0%) ggc
 tree SSA incremental    :   0.15 ( 0%) usr   0.10 ( 0%) sys   0.30 ( 0%) wall      60 kB ( 0%) ggc
 tree operand scan       :   0.34 ( 1%) usr   0.22 ( 1%) sys   0.56 ( 1%) wall   56364 kB ( 2%) ggc
 dominator optimization  :   0.73 ( 1%) usr   0.22 ( 1%) sys   0.75 ( 1%) wall    7545 kB ( 0%) ggc
 backwards jump threading:   0.30 ( 1%) usr   0.09 ( 0%) sys   0.25 ( 0%) wall     111 kB ( 0%) ggc
 tree SRA                :   0.13 ( 0%) usr   0.04 ( 0%) sys   0.17 ( 0%) wall      28 kB ( 0%) ggc
 isolate eroneous paths  :   0.04 ( 0%) usr   0.03 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 tree CCP                :   0.68 ( 1%) usr   0.24 ( 1%) sys   0.85 ( 1%) wall    7302 kB ( 0%) ggc
 tree PHI const/copy prop:   0.05 ( 0%) usr   0.02 ( 0%) sys   0.10 ( 0%) wall       0 kB ( 0%) ggc
 tree split crit edges   :   0.05 ( 0%) usr   0.06 ( 0%) sys   0.17 ( 0%) wall      19 kB ( 0%) ggc
 tree reassociation      :   0.23 ( 0%) usr   0.07 ( 0%) sys   0.38 ( 1%) wall       6 kB ( 0%) ggc
 tree PRE                :   1.28 ( 2%) usr   0.48 ( 2%) sys   1.78 ( 2%) wall   50466 kB ( 2%) ggc
 tree FRE                :   0.69 ( 1%) usr   0.36 ( 2%) sys   1.22 ( 2%) wall   17297 kB ( 1%) ggc
 tree code sinking       :   0.10 ( 0%) usr   0.05 ( 0%) sys   0.13 ( 0%) wall       6 kB ( 0%) ggc
 tree linearize phis     :   0.19 ( 0%) usr   0.08 ( 0%) sys   0.27 ( 0%) wall   41714 kB ( 2%) ggc
 tree backward propagate :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 tree forward propagate  :   0.23 ( 0%) usr   0.08 ( 0%) sys   0.38 ( 1%) wall      62 kB ( 0%) ggc
 tree phiprop            :   0.06 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree conservative DCE   :   0.21 ( 0%) usr   0.15 ( 1%) sys   0.36 ( 0%) wall     209 kB ( 0%) ggc
 tree aggressive DCE     :   0.28 ( 1%) usr   0.12 ( 1%) sys   0.44 ( 1%) wall   83438 kB ( 3%) ggc
 tree buildin call DCE   :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE                :   0.09 ( 0%) usr   0.09 ( 0%) sys   0.21 ( 0%) wall       0 kB ( 0%) ggc
 PHI merge               :   0.07 ( 0%) usr   0.04 ( 0%) sys   0.11 ( 0%) wall       0 kB ( 0%) ggc
 tree loop optimization  :   0.02 ( 0%) usr   0.01 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 loopless fn             :   0.04 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree loop invariant motion:   0.01 ( 0%) usr   0.02 ( 0%) sys   0.07 ( 0%) wall       1 kB ( 0%) ggc
 complete unrolling      :   0.05 ( 0%) usr   0.04 ( 0%) sys   0.12 ( 0%) wall     136 kB ( 0%) ggc
 tree iv optimization    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     120 kB ( 0%) ggc
 tree copy headers       :   0.03 ( 0%) usr   0.02 ( 0%) sys   0.03 ( 0%) wall       7 kB ( 0%) ggc
 tree SSA uncprop        :   0.28 ( 1%) usr   0.13 ( 1%) sys   0.31 ( 0%) wall       0 kB ( 0%) ggc
 tree NRV optimization   :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     849 kB ( 0%) ggc
 tree switch conversion  :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree strlen optimization:   0.03 ( 0%) usr   0.01 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 dominance frontiers     :   0.09 ( 0%) usr   0.02 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   1.42 ( 3%) usr   0.51 ( 2%) sys   1.94 ( 3%) wall       0 kB ( 0%) ggc
 control dependences     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 out of ssa              :   0.22 ( 0%) usr   0.10 ( 0%) sys   0.35 ( 0%) wall    7465 kB ( 0%) ggc
 expand vars             :   0.02 ( 0%) usr   0.02 ( 0%) sys   0.04 ( 0%) wall     506 kB ( 0%) ggc
 expand                  :   0.63 ( 1%) usr   0.24 ( 1%) sys   1.12 ( 1%) wall   63840 kB ( 3%) ggc
 post expand cleanups    :   0.24 ( 0%) usr   0.04 ( 0%) sys   0.23 ( 0%) wall   18401 kB ( 1%) ggc
 varconst                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     539 kB ( 0%) ggc
 lower subreg            :   0.07 ( 0%) usr   0.01 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 jump                    :   0.13 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 forward prop            :   0.73 ( 1%) usr   0.26 ( 1%) sys   0.86 ( 1%) wall    2110 kB ( 0%) ggc
 CSE                     :   0.50 ( 1%) usr   0.19 ( 1%) sys   0.73 ( 1%) wall    1053 kB ( 0%) ggc
 dead code elimination   :   0.23 ( 0%) usr   0.07 ( 0%) sys   0.38 ( 1%) wall       0 kB ( 0%) ggc
 dead store elim1        :   0.24 ( 0%) usr   0.09 ( 0%) sys   0.48 ( 1%) wall    1039 kB ( 0%) ggc
 dead store elim2        :   0.27 ( 0%) usr   0.14 ( 1%) sys   0.39 ( 1%) wall     960 kB ( 0%) ggc
 loop analysis           :   0.10 ( 0%) usr   0.06 ( 0%) sys   0.11 ( 0%) wall       0 kB ( 0%) ggc
 loop init               :   1.34 ( 2%) usr   0.51 ( 2%) sys   1.93 ( 3%) wall  183463 kB ( 7%) ggc
 loop invariant motion   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall       1 kB ( 0%) ggc
 loop doloop             :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall      36 kB ( 0%) ggc
 loop fini               :   0.61 ( 1%) usr   0.31 ( 2%) sys   0.94 ( 1%) wall       0 kB ( 0%) ggc
 CPROP                   :   0.21 ( 0%) usr   0.05 ( 0%) sys   0.21 ( 0%) wall     295 kB ( 0%) ggc
 PRE                     :   0.09 ( 0%) usr   0.01 ( 0%) sys   0.06 ( 0%) wall       4 kB ( 0%) ggc
 auto inc dec            :   0.12 ( 0%) usr   0.06 ( 0%) sys   0.15 ( 0%) wall     934 kB ( 0%) ggc
 CSE 2                   :   0.29 ( 1%) usr   0.18 ( 1%) sys   0.44 ( 1%) wall     171 kB ( 0%) ggc
 branch prediction       :   0.20 ( 0%) usr   0.06 ( 0%) sys   0.16 ( 0%) wall    4067 kB ( 0%) ggc
 combiner                :   0.84 ( 2%) usr   0.22 ( 1%) sys   1.42 ( 2%) wall   13624 kB ( 1%) ggc
 if-conversion           :   0.28 ( 1%) usr   0.08 ( 0%) sys   0.41 ( 1%) wall       2 kB ( 0%) ggc
 scheduling              :   1.45 ( 3%) usr   0.63 ( 3%) sys   2.15 ( 3%) wall    4177 kB ( 0%) ggc
 integrated RA           :   1.83 ( 3%) usr   0.70 ( 3%) sys   2.45 ( 3%) wall  964084 kB (38%) ggc
 LRA non-specific        :   0.69 ( 1%) usr   0.33 ( 2%) sys   0.90 ( 1%) wall    2272 kB ( 0%) ggc
 LRA virtuals elimination:   0.27 ( 0%) usr   0.15 ( 1%) sys   0.36 ( 0%) wall    1881 kB ( 0%) ggc
 LRA reload inheritance  :   0.09 ( 0%) usr   0.04 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 LRA create live ranges  :   0.12 ( 0%) usr   0.06 ( 0%) sys   0.12 ( 0%) wall       1 kB ( 0%) ggc
 LRA hard reg assignment :   0.09 ( 0%) usr   0.05 ( 0%) sys   0.20 ( 0%) wall       0 kB ( 0%) ggc
 reload                  :   0.12 ( 0%) usr   0.06 ( 0%) sys   0.13 ( 0%) wall       0 kB ( 0%) ggc
 reload CSE regs         :   0.46 ( 1%) usr   0.09 ( 0%) sys   0.43 ( 1%) wall    2852 kB ( 0%) ggc
 thread pro- & epilogue  :   0.25 ( 0%) usr   0.13 ( 1%) sys   0.45 ( 1%) wall   37093 kB ( 1%) ggc
 if-conversion 2         :   0.06 ( 0%) usr   0.02 ( 0%) sys   0.18 ( 0%) wall       0 kB ( 0%) ggc
 peephole 2              :   0.11 ( 0%) usr   0.04 ( 0%) sys   0.18 ( 0%) wall      11 kB ( 0%) ggc
 hard reg cprop          :   0.12 ( 0%) usr   0.05 ( 0%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
 scheduling 2            :   1.05 ( 2%) usr   0.44 ( 2%) sys   1.67 ( 2%) wall    3203 kB ( 0%) ggc
 machine dep reorg       :   0.21 ( 0%) usr   0.05 ( 0%) sys   0.26 ( 0%) wall   10319 kB ( 0%) ggc
 reorder blocks          :   0.10 ( 0%) usr   0.03 ( 0%) sys   0.20 ( 0%) wall      20 kB ( 0%) ggc
 shorten branches        :   0.16 ( 0%) usr   0.05 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 final                   :   0.88 ( 2%) usr   0.47 ( 2%) sys   1.51 ( 2%) wall   15600 kB ( 1%) ggc
 variable output         :   0.30 ( 1%) usr   0.03 ( 0%) sys   0.33 ( 0%) wall   10352 kB ( 0%) ggc
 symout                  :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 tree if-combine         :   0.05 ( 0%) usr   0.02 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 straight-line strength reduction:   0.13 ( 0%) usr   0.07 ( 0%) sys   0.22 ( 0%) wall       0 kB ( 0%) ggc
 store merging           :   0.07 ( 0%) usr   0.03 ( 0%) sys   0.01 ( 0%) wall       9 kB ( 0%) ggc
 address lowering        :   0.01 ( 0%) usr   0.01 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 early local passes      :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted optimizations:   0.01 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 rest of compilation     :   5.83 (11%) usr   2.63 (13%) sys   8.49 (11%) wall  101391 kB ( 4%) ggc
 unaccounted post reload :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted late compilation:   0.03 ( 0%) usr   0.01 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 remove unused locals    :   0.18 ( 0%) usr   0.08 ( 0%) sys   0.22 ( 0%) wall       0 kB ( 0%) ggc
 address taken           :   0.13 ( 0%) usr   0.03 ( 0%) sys   0.13 ( 0%) wall       0 kB ( 0%) ggc
 rebuild frequencies     :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 repair loop structures  :   0.07 ( 0%) usr   0.03 ( 0%) sys   0.14 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 :  54.11            20.45            74.91            2544441 kB
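
A report this long is easier to read once it is sorted. The Python sketch below parses rows in the format shown above and ranks the individual timers by wall time or by GC memory. The regular expression and the names `parse_report` and `top_passes` are assumptions made for illustration; the sample rows are copied from the report, and the "phase ..." rows are excluded from the ranking because they are top-level timers.

```python
import re

# Rows copied from the report above. The TOTAL line has a different
# shape and is deliberately not matched by the pattern.
SAMPLE = """\
 phase parsing           :   2.21 ( 4%) usr   0.32 ( 2%) sys   2.55 ( 3%) wall  160684 kB ( 6%) ggc
 phase opt and generate  :  51.89 (96%) usr  20.13 (98%) sys  72.29 (97%) wall 2381419 kB (94%) ggc
 dump files              :   4.17 ( 8%) usr   1.96 (10%) sys   5.67 ( 8%) wall       0 kB ( 0%) ggc
 tree PTA                :   1.78 ( 3%) usr   0.85 ( 4%) sys   2.50 ( 3%) wall    4103 kB ( 0%) ggc
 integrated RA           :   1.83 ( 3%) usr   0.70 ( 3%) sys   2.45 ( 3%) wall  964084 kB (38%) ggc
 rest of compilation     :   5.83 (11%) usr   2.63 (13%) sys   8.49 (11%) wall  101391 kB ( 4%) ggc
 TOTAL                 :  54.11            20.45            74.91            2544441 kB
"""

ROW = re.compile(
    r"^\s*(?P<name>.+?)\s*:"
    r"\s+(?P<usr>[\d.]+)\s+\(\s*\d+%\)\s+usr"
    r"\s+(?P<sys>[\d.]+)\s+\(\s*\d+%\)\s+sys"
    r"\s+(?P<wall>[\d.]+)\s+\(\s*\d+%\)\s+wall"
    r"\s+(?P<ggc>\d+)\s+kB\s+\(\s*\d+%\)\s+ggc\s*$"
)


def parse_report(text):
    """Parse timer rows into dicts; lines that do not match are ignored."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append({
                "name": match.group("name"),
                "usr": float(match.group("usr")),
                "sys": float(match.group("sys")),
                "wall": float(match.group("wall")),
                "ggc_kb": int(match.group("ggc")),
            })
    return rows


def top_passes(rows, key, n):
    """Return the n largest non-phase timers, ordered by `key`."""
    passes = [r for r in rows if not r["name"].startswith("phase ")]
    return sorted(passes, key=lambda r: r[key], reverse=True)[:n]


for row in top_passes(parse_report(SAMPLE), "wall", 3):
    print(f'{row["name"]:24} {row["wall"]:6.2f} s wall  {row["ggc_kb"]:8d} kB')
```

Sorted this way, the full report shows no single pass dominating the 72 s of wall time: the largest individual entries are "rest of compilation" (8.49 s) and "dump files" (5.67 s), while "integrated RA" stands out for memory at 38% of the GC total.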

> A thought just occurred to me, you are compiling the entire program + object.d right?  Nothing else will link/be linked to the binary?

I'm passing all files except druntime files via the command line.  druntime files are imported via -Isource/runtime.   But essentially yes, I'm compiling the entire application in one command so I can get cross-module inlining.

I tried moving all runtime files to the command line, but I get errors about __entrypoint.

cc1d: error: module __entrypoint is in file '__entrypoint.d' which cannot be read
Specify path to file '__entrypoint.d' with -I switch


> If that is the case, you should definitely compile with -fwhole-program.  I suspect that may cut down your compilation time by half or even more.

If I only import __entrypoint.d and pass the rest of the runtime files on the command line and compile with -fwhole-program, it compiles in 5s, but I only get an 8-byte binary.  I suspect this is due to the error above about __entrypoint.  That is, if there's no entry point, the whole program gets garbage collected.

I think you might be on to something here though.

I'm out of time now; gotta catch a plane soon.  I'll try to do more troubleshooting when I return.

Mike

June 29, 2017
On 29 June 2017 at 00:50, Mike via D.gnu <d.gnu@puremagic.com> wrote:
> On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
>> If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
>
>
> If I only import __entrypoint.d and pass the rest of the runtime files on the command line and compile with -fwhole-program, it compiles in 5s, but I only get an 8-byte binary.  I suspect this is due to the error above about __entrypoint.  That is, if there's no entry point, the whole program gets garbage collected.
>
> I think you might be on to something here though.
>

Yes, it seems like there are more than a few hundred functions being emitted, and as they are all considered externally visible, none can be removed during the optimization pass.  If they are all considered static (in the C sense), then unused and inlined functions can be removed immediately, giving the backend less work.  I think the only caveat with using -fwhole-program is that C main must be present in the compilation; otherwise, as you've noted, everything gets removed as unused code.
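
The effect described here can be pictured with a toy reachability model. The Python sketch below is not how GCC implements symbol removal; it only illustrates the idea that a function survives if it can be reached from a root. The names `main`, `_d_run_main` and `_Dmain` appear elsewhere in this thread, while `initLCD`, `unusedHelper` and `unusedLeaf` are hypothetical.

```python
def surviving(call_graph, roots):
    """Return the set of functions reachable from `roots`."""
    seen = set()
    stack = list(roots)
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(call_graph.get(fn, ()))
    return seen


# Toy call graph: caller -> callees.
call_graph = {
    "main": ["_d_run_main"],
    "_d_run_main": ["_Dmain"],
    "_Dmain": ["initLCD"],
    "initLCD": [],
    "unusedHelper": ["unusedLeaf"],
    "unusedLeaf": [],
}

# Default build: every function is externally visible, so each one is a root
# and nothing can be dropped.
default_build = surviving(call_graph, call_graph.keys())

# Whole-program view: only C main is a root, so unreachable code can go.
whole_program = surviving(call_graph, ["main"])

# No entry point in the compilation: there are no roots, so nothing survives.
no_entry = surviving(call_graph, [])

print(sorted(default_build))
print(sorted(whole_program))
print(sorted(no_entry))
```

In this model the third case corresponds to the near-empty binary reported above: without an entry point there is no root to keep anything alive.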

Regards,
Iain.
July 17, 2017
On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:

> A thought just occurred to me, you are compiling the entire program + object.d right?  Nothing else will link/be linked to the binary?
>
> If that is the case, you should definitely compile with -fwhole-program.  I suspect that may cut down your compilation time by half or even more.

I'm back and have spent the last two days trying to get my project compiled with -fwhole-program in an effort to reduce the compile times, but I haven't had any success.

Regardless of what I do, the compiler doesn't emit anything except main.

arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -fwhole-program -Isource/entrypoint -fno-bounds-check -ffunction-sections -fdata-sections source/gcc/attribute.d source/runtime/exception.d source/runtime/invariant.d source/runtime/object.d source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d source/board/package.d source/board/statusLED.d source/board/ltdc.d source/board/random.d source/board/spi5.d source/stm32f42/nvic.d source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d source/main.d -o binary/firmware.o

arm-none-eabi-nm binary/firmware.o
         U _Dmain
         U _d_run_main
00000000 T main

Are you sure this works with multiple modules passed to the compiler on one line?
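
For reading the nm listing above: a `U` entry is a symbol the object file references but does not define, and `T` marks a symbol defined in the text (code) section. A small Python sketch, with the illustrative helper name `split_symbols`, separates the two groups from nm output in that format.

```python
# Output of `arm-none-eabi-nm binary/firmware.o` as quoted above.
NM_OUTPUT = """\
         U _Dmain
         U _d_run_main
00000000 T main
"""


def split_symbols(nm_text):
    """Split nm output into (defined, undefined) symbol name lists."""
    defined, undefined = [], []
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "U":
            # Undefined symbols have no address column.
            undefined.append(parts[1])
        elif len(parts) == 3:
            # Defined symbols are listed as: address, type, name.
            defined.append(parts[2])
    return defined, undefined


defined, undefined = split_symbols(NM_OUTPUT)
print("defined:  ", defined)
print("undefined:", undefined)
```

Here only `main` is defined, while `_Dmain` and `_d_run_main` are left as unresolved references, which matches the observation that nothing except main was emitted.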

Mike


July 18, 2017
On 17 July 2017 at 22:26, Mike via D.gnu <d.gnu@puremagic.com> wrote:
> On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
>
>> A thought just occurred to me, you are compiling the entire program + object.d right?  Nothing else will link/be linked to the binary?
>>
>> If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
>
>
> I'm back and have spent the last two days trying to get my project compiled with -fwhole-program in an effort to reduce the compile times, but I haven't had any success.
>
> Regardless of what I do, the compiler doesn't emit anything except main.
>
> arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -fwhole-program -Isource/entrypoint -fno-bounds-check -ffunction-sections -fdata-sections source/gcc/attribute.d source/runtime/exception.d source/runtime/invariant.d source/runtime/object.d source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d source/board/package.d source/board/statusLED.d source/board/ltdc.d source/board/random.d source/board/spi5.d source/stm32f42/nvic.d source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d source/main.d -o binary/firmware.o
>
> arm-none-eabi-nm binary/firmware.o
>          U _Dmain
>          U _d_run_main
> 00000000 T main
>
> Are you sure this works with multiple modules passed to the compiler on one line?
>

Humm, it looks like it can't sufficiently determine that both the _d_run_main declarations are for the same symbol.

After making a tweak, it compiles a program that does a little bit more.

https://gist.github.com/ibuclaw/d1fad77b074fded2682a16df93369a84

This is the compiled result:

https://gist.github.com/ibuclaw/93c707f4729ffc4376539283defdc709

Iain.
July 18, 2017
On 18 July 2017 at 01:19, Iain Buclaw <ibuclaw@gdcproject.org> wrote:
> On 17 July 2017 at 22:26, Mike via D.gnu <d.gnu@puremagic.com> wrote:
>> On Wednesday, 28 June 2017 at 22:17:09 UTC, Iain Buclaw wrote:
>>
>>> A thought just occurred to me, you are compiling the entire program + object.d right?  Nothing else will link/be linked to the binary?
>>>
>>> If that is the case, you should definitely compile with -fwhole-program. I suspect that may cut down your compilation time by half or even more.
>>
>>
>> I'm back and have spent the last two days trying to get my project compiled with -fwhole-program in an effort to reduce the compile times, but I haven't had any success.
>>
>> Regardless of what I do, the compiler doesn't emit anything except main.
>>
>> arm-none-eabi-gdc -c -O2 -finline-functions -nophoboslib -nostdinc -nodefaultlibs -nostdlib -fno-emit-moduleinfo -mthumb -mcpu=cortex-m4 -fwhole-program -Isource/entrypoint -fno-bounds-check -ffunction-sections -fdata-sections source/gcc/attribute.d source/runtime/exception.d source/runtime/invariant.d source/runtime/object.d source/runtime/dmain2.d source/board/lcd.d source/board/ILI9341.d source/board/package.d source/board/statusLED.d source/board/ltdc.d source/board/random.d source/board/spi5.d source/stm32f42/nvic.d source/stm32f42/gpio.d source/stm32f42/flash.d source/stm32f42/scb.d source/stm32f42/spi.d source/stm32f42/pwr.d source/stm32f42/ltdc.d source/stm32f42/trace.d source/stm32f42/rcc.d source/stm32f42/bus.d source/stm32f42/rng.d source/stm32f42/dma2d.d source/stm32f42/mmio.d source/main.d -o binary/firmware.o
>>
>> arm-none-eabi-nm binary/firmware.o
>>          U _Dmain
>>          U _d_run_main
>> 00000000 T main
>>
>> Are you sure this works with multiple modules passed to the compiler on one line?
>>
>
> Humm, it looks like it can't sufficiently determine that both the _d_run_main declarations are for the same symbol.
>
> After making a tweak, it compiles a program that does a little bit more.


In fact, after inspecting the unoptimized result, I think I can safely say this is the entire program, as it is intended to be built.

Now, what can we learn from this?  What can and should be done better?  Could this be achieved without -fwhole-program?

So far, the compiler really does just force every symbol to be emitted, at the cost of compilation time (no function can be removed during optimization).  I think this is really just a workaround for not properly setting the visibility of compiled symbols (there's a lack of documentation in D about this).  I think we could do better and be more explicit in setting this up.  However, the problem has almost always been templates that get cut when they shouldn't have been, and so end up undefined at link time.

There are probably a few bug reports to be raised.

Regards
Iain.
July 18, 2017
On Monday, 17 July 2017 at 23:57:02 UTC, Iain Buclaw wrote:

> In fact, after inspecting the unoptimized result, I think I can safely say this is the entire program, as it is intended to be built.
>

If I compile without optimizations, I am able to get a complete binary also, even without the changes you made.  But with optimizations, everything gets removed.

Anyway, I'm going to move on from this for now and just try to get a working binary.  I believe once I properly implement volatileLoad and volatileStore, I'll be able to get a working binary.

Mike

