On Sunday, 9 January 2022 at 02:58:43 UTC, Walter Bright wrote:
>I've never seen one. What's the switch for gcc to do the same thing?
For GCC/Clang you'd want -S (and -masm=intel to turn the default output, beautiful to nobody but the blind, into something readable). This dumps the assembly to a file, which isn't exactly the same as what -vasm does, but I have already started piping the -vasm output to a file anyway, since even hello world yields a thousand lines of output, which is much easier to consume in a text editor.
To do it with ldc the flag is --output-s. I have opened a PR to make the ldc dmd-compatibility wrapper (ldmd2) mimic -vasm.
Intel (and to a lesser extent Clang) actually annotate the generated assembly with comments intended to be read by humans.
e.g. Intel C++ (which is in the process of being replaced with Clang relabeled as Intel C++) prints its estimates of the branch probabilities (hopeless unless you are using PGO, but still):
test al, al #5.8
je ..B1.4 # Prob 22% #5.8
# LOE rbx rbp r12 r13 r14 r15
# Execution count [7.80e-01]
You can also ask the compiler to generate an optimization report inline with the assembly. This is useful when tuning, since you can tell what the compiler is or isn't getting right (e.g. figuring out which loops to force unrolling on). The Intel compiler also has a reputation for having an arsenal of dirty tricks to make your code "faster", which it will deploy in the hope that you don't notice that (say) your floating point numbers are now less precise.
-qopt-report-phase=vec yields:
# optimization report
# LOOP WITH UNSIGNED INDUCTION VARIABLE
# LOOP WAS VECTORIZED
# REMAINDER LOOP FOR VECTORIZATION
# MASKED VECTORIZATION
# VECTORIZATION HAS UNALIGNED MEMORY REFERENCES
# VECTORIZATION SPEEDUP COEFFECIENT 3.554688
# VECTOR TRIP COUNT IS ESTIMATED CONSTANT
# VECTOR LENGTH 16
# NORMALIZED VECTORIZATION OVERHEAD 0.687500
# MAIN VECTOR TYPE: 32-bits integer
vpcmpuq k1, zmm16, zmm18, 6 #5.5
vpcmpuq k0, zmm16, zmm17, 6 #5.5
vpaddq zmm18, zmm18, zmm19 #5.5
vpaddq zmm17, zmm17, zmm19 #5.5
kunpckbw k2, k0, k1 #5.5
vmovdqu32 zmm20{k2}{z}, ZMMWORD PTR [rcx+r8*4] #7.9
vpxord zmm21{k2}{z}, zmm20, ZMMWORD PTR [rax+r8*4] #7.9
vmovdqu32 ZMMWORD PTR [rcx+r8*4]{k2}, zmm21 #7.9
add r8, 16 #5.5
cmp r8, rdx #5.5
jb ..B1.15 # Prob 82% #5.5
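The report above corresponds to a loop of roughly this shape, as far as I can reconstruct from the memory operands (shown here in D purely for illustration; the original source was not D):

// A sketch of the kind of loop behind the report: an unsigned induction
// variable and an in-place xor, which the vectoriser turns into masked
// 512-bit loads, xors and stores (the vmovdqu32/vpxord above).
// Assumes src is at least as long as dst.
void xorInto(uint[] dst, const(uint)[] src)
{
    foreach (i; 0 .. dst.length)
        dst[i] ^= src[i];
}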
People don't seem to care about SPEC numbers too much anymore, but the Intel Compilers still have many features for gaming standard test scores.
If you look at http://www.spec.org/cpu2006/results/res2007q3/cpu2006-20070821-01880.html you'd think Intel had just managed a huge improvement on libquantum that we could all get on our own code, but it turns out they realised they could just tell the compiler to automagically parallelize the code while still nominally running a single process. See https://stackoverflow.com/questions/61016358/why-can-gcc-only-do-loop-interchange-optimization-when-the-int-size-is-a-compile for more overfitting.
>Compilers that take a detour through an assembler to generate code are inherently slower.
Certainly, although in my experience not by much. Time spent in the assembler is dominated by time spent in the linker, and by just about everywhere else in the compiler (especially once you turn optimizations on). Hello World spends about 4ms in the assembler on my machine.
GCC and Clang have very different architectures in this regard but end up being pretty similar in terms of compile times. The linker is an exception to that rule of thumb, however, in that the LLVM linker is much faster than any current GNU offering.
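(If you want to reproduce that kind of per-phase number, GCC's driver has a -time flag that reports how long each subprocess took; a rough sketch, assuming gdc forwards it the same way mainline GCC does:)

// hello.d
import std.stdio;
void main() { writeln("hello"); }

// Assumed measurement, not taken from anything above:
//   gdc -time hello.d
// prints one line per subprocess (the compiler proper, the assembler,
// and the link step) with the user/system time it took, which is where
// a number like "4ms in the assembler" comes from.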
> >It doesn't have a distinct IR like LLVM does but the final stage of the RTL is basically a 1:1 representation of the instruction set:
That looks like intermediate code, not assembler.
It is the (final) intermediate code, but it's barely intermediate at this stage, i.e. these are effectively just the target instructions printed in LISP syntax.
It is, unfortunately, quite obfuscated: some of that is technical baggage, and some of it is due to the way GCC was explicitly directed to be difficult to consume from the outside.
I'm not suggesting any normal programmer should use it, just showing what GCC does since I mentioned LLVM.
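If you want to look at that final RTL yourself, GCC can dump it per pass; a minimal sketch, assuming gdc accepts the -fdump-rtl-* family the same way mainline GCC does:

// square.d -- a trivial function to inspect.
int square(int x) { return x * x; }

// Assumed invocations, not from the original post:
//   gdc -O2 -S square.d                    # the plain assembly, for comparison
//   gdc -O2 -S -fdump-rtl-final square.d
// The second one also writes a dump of the "final" RTL pass, which is the
// LISP-syntax, near-1:1 form mentioned above.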
Anyway, I've been playing with -vasm and it seems pretty good so far. There are some formatting issues, which shouldn't be hard to fix at all (this is why we asked for some basic tests of the shape of the output), but so far I've only found one (touch wood) situation where it actually gets an instruction wrong.
Testing it has led to me finding some bugs in the dmd inline assembler, which I am in the process of filing.