On Sunday, 9 January 2022 at 02:58:43 UTC, Walter Bright wrote:
>I've never seen one. What's the switch for gcc to do the same thing?
For GCC/Clang you'd want -S (and -masm=intel to turn the default output, beautiful to nobody but the blind, into something readable). This dumps the assembly to a file, which isn't exactly the same as what -vasm does, but I have already started piping the -vasm output to a file anyway, since even hello world yields a thousand lines of output, which is much easier to consume in a text editor.
To do it with ldc the flag is --output-s. I have opened a PR to make the ldc dmd-compatibility wrapper (ldmd2) mimic -vasm.
Intel (and to a lesser extent Clang) actually annotate the generated assembly with comments intended to be read by humans.
e.g. Intel C++ (which is in the process of being replaced with Clang relabeled as Intel C++) prints its estimates of the branch probabilities (hopeless unless you are using PGO, but still):
test al, al #5.8
je ..B1.4 # Prob 22% #5.8
# LOE rbx rbp r12 r13 r14 r15
# Execution count [7.80e-01]
You can also ask the compiler to generate an optimization report inline with the assembly. This is useful when tuning, since you can tell what the compiler is or isn't getting right (e.g. figuring out which loops to force unrolling on). The Intel compiler also has a reputation for having an arsenal of dirty tricks to make your code "faster", which it will deploy in the hope that you don't notice that (say) your floating point numbers are now less precise.
-qopt-report-phase=vec yields:
# optimization report
# LOOP WITH UNSIGNED INDUCTION VARIABLE
# LOOP WAS VECTORIZED
# REMAINDER LOOP FOR VECTORIZATION
# MASKED VECTORIZATION
# VECTORIZATION HAS UNALIGNED MEMORY REFERENCES
# VECTORIZATION SPEEDUP COEFFECIENT 3.554688
# VECTOR TRIP COUNT IS ESTIMATED CONSTANT
# VECTOR LENGTH 16
# NORMALIZED VECTORIZATION OVERHEAD 0.687500
# MAIN VECTOR TYPE: 32-bits integer
vpcmpuq k1, zmm16, zmm18, 6 #5.5
vpcmpuq k0, zmm16, zmm17, 6 #5.5
vpaddq zmm18, zmm18, zmm19 #5.5
vpaddq zmm17, zmm17, zmm19 #5.5
kunpckbw k2, k0, k1 #5.5
vmovdqu32 zmm20{k2}{z}, ZMMWORD PTR [rcx+r8*4] #7.9
vpxord zmm21{k2}{z}, zmm20, ZMMWORD PTR [rax+r8*4] #7.9
vmovdqu32 ZMMWORD PTR [rcx+r8*4]{k2}, zmm21 #7.9
add r8, 16 #5.5
cmp r8, rdx #5.5
jb ..B1.15 # Prob 82% #5.5
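The report above corresponds to a loop of roughly this shape, as far as I can reconstruct from the memory operands (shown here in D purely for illustration; the original source was not D):

// A sketch of the kind of loop behind the report: an unsigned induction
// variable and an in-place xor, which the vectoriser turns into masked
// 512-bit loads, xors and stores (the vmovdqu32/vpxord above).
// Assumes src is at least as long as dst.
void xorInto(uint[] dst, const(uint)[] src)
{
    foreach (i; 0 .. dst.length)
        dst[i] ^= src[i];
}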
People don't seem to care about SPEC numbers too much anymore, but the Intel Compilers still have many features for gaming standard test scores.
If you look at http://www.spec.org/cpu2006/results/res2007q3/cpu2006-20070821-01880.html you'd think Intel had just managed a huge improvement on libquantum that we could all get on our own code, but it turns out they realised they could just tell the compiler to automagically parallelize the code while still nominally running a single process. See https://stackoverflow.com/questions/61016358/why-can-gcc-only-do-loop-interchange-optimization-when-the-int-size-is-a-compile for more overfitting.
>Compilers that take a detour through an assembler to generate code are inherently slower.
Certainly, although in my experience not by much. Time spent in the assembler is dominated by time spent in the linker, and by just about everywhere else in the compiler (especially once you turn optimizations on). Hello World spends about 4ms in the assembler on my machine.
GCC and Clang have very different architectures in this regard but end up being pretty similar in terms of compile times. The linker is an exception to that rule of thumb, however, in that the LLVM linker is much faster than any current GNU offering.
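(If you want to reproduce that kind of per-phase number, GCC's driver has a -time flag that reports how long each subprocess took; a rough sketch, assuming gdc forwards it the same way mainline GCC does:)

// hello.d
import std.stdio;
void main() { writeln("hello"); }

// Assumed measurement, not taken from anything above:
//   gdc -time hello.d
// prints one line per subprocess (the compiler proper, the assembler,
// and the link step) with the user/system time it took, which is where
// a number like "4ms in the assembler" comes from.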
> >It doesn't have a distinct IR like LLVM does but the final stage of the RTL is basically a 1:1 representation of the instruction set:
That looks like intermediate code, not assembler.
It is the (final) intermediate code, but it's barely intermediate at this stage, i.e. these are effectively just the target instructions printed in LISP syntax.
It is, unfortunately, quite obfuscated: some of that is technical baggage, and some of it is due to the way GCC was explicitly directed to be difficult to consume from the outside.
I'm not suggesting any normal programmer should use it, just showing what GCC does since I mentioned LLVM.
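If you want to look at that final RTL yourself, GCC can dump it per pass; a minimal sketch, assuming gdc accepts the -fdump-rtl-* family the same way mainline GCC does:

// square.d -- a trivial function to inspect.
int square(int x) { return x * x; }

// Assumed invocations, not from the original post:
//   gdc -O2 -S square.d                    # the plain assembly, for comparison
//   gdc -O2 -S -fdump-rtl-final square.d
// The second one also writes a dump of the "final" RTL pass, which is the
// LISP-syntax, near-1:1 form mentioned above.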
Anyway, I've been playing with -vasm and it seems pretty good so far. There are some formatting issues, which shouldn't be hard to fix at all (this is why we asked for some basic tests of the shape of the output), but so far I've only found one (touch wood) situation where it actually gets an instruction wrong.
Testing it has led to me finding some bugs in the dmd inline assembler, which I am in the process of filing.