A small comparison

Nov 20, 2013

bearophile

Nov 21, 2013

jerro

Nov 21, 2013

Nov 22, 2013

Dec 13, 2013

November 20, 2013

Posted by bearophile

Permalink

bearophile

Permalink

I've found a post about a functional library for LuaJIT code:

http://rtsisyk.github.io/luafun/intro.html


This dynamically typed Lua code:

require "fun" ()
 n = 100
 x = sum(map(function(x) return x^2 end, take(n, tabulate(math.sin))))
 -- calculate sum(sin(x)^2 for x in 0..n-1)
 print(x)
 50.011981355266


Gets compiled to:

 ->LOOP:
 0bcaffd0  movsd [rsp+0x8], xmm7
 0bcaffd6  addsd xmm4, xmm5
 0bcaffda  ucomisd xmm6, xmm1
 0bcaffde  jnb 0x0bca0028        ->6
 0bcaffe4  addsd xmm6, xmm0
 0bcaffe8  addsd xmm7, xmm0
 0bcaffec  fld qword [rsp+0x8]
 0bcafff0  fsin
 0bcafff2  fstp qword [rsp]
 0bcafff5  movsd xmm5, [rsp]
 0bcafffa  mulsd xmm5, xmm5
 0bcafffe  jmp 0x0bcaffd0        ->LOOP


So I've written a similar D version using Phobos:

void main() {
    import std.stdio, std.algorithm, std.range;
    enum n = 100;
    immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a + b};
    r.writeln;
}


LDC2 compiles it to (32 bit):

LBB0_1:
    fxch
    fstpt   44(%esp)
    fld %st(0)
    fstpt   (%esp)
    fld1
    faddp   %st(1)
    fstpt   32(%esp)
    calll   __D3std4math3sinFNaNbNfeZe
    subl    $12, %esp
    fmul    %st(0), %st(0)
    fldt    44(%esp)
    faddp   %st(1)
    fstpt   44(%esp)
    fldt    32(%esp)
    fldt    44(%esp)
    decl    %esi
    fxch
    jne LBB0_1


As you see the D version doesn't use the fsin instruction as LuaJIT, it calls a library function.


If I import the sin from core.stdc.math ldc2 produces (notice the usage of xmm registers):

LBB0_1:
    movsd   %xmm1, 32(%esp)
    movsd   %xmm0, 40(%esp)
    movsd   %xmm1, (%esp)
    calll   _sin
    fstpl   48(%esp)
    movsd   48(%esp), %xmm0
    mulsd   %xmm0, %xmm0
    movsd   40(%esp), %xmm1
    addsd   %xmm0, %xmm1
    movsd   %xmm1, 40(%esp)
    movsd   32(%esp), %xmm1
    movsd   40(%esp), %xmm0
    addsd   LCPI0_0, %xmm1
    decl    %esi
    jne LBB0_1


I have also written a normal for loop in both C and D. This is C code:


#include "stdio.h"
#include "math.h"

int main() {
    double r = 0.0;
    int i;
    for (i = 0; i < 100; i++) {
        const double aux = sin(i);
        r += aux * aux;
    }
    printf("%f\n", r);
    return 0;
}


GCC compiles it to:


.L3:
    cvtsi2sd    %ebx, %xmm0
    call    sin
.L2:
    mulsd   %xmm0, %xmm0
    addl    $1, %ebx
    cmpl    $100, %ebx
    addsd   8(%rsp), %xmm0
    movsd   %xmm0, 8(%rsp)
    jne .L3


The Intel compiler compiles it to this (this is even vectorized, sin2 and mulpd work on two doubles at a time):

..B1.2:                         # Preds ..B1.8 ..B1.7
        cvtdq2pd  %xmm8, %xmm0                                  #8.32
        call      __svml_sin2                                   #8.28
        mulpd     %xmm0, %xmm0                                  #9.20
        addb      $2, %r12b                                     #7.5
        paddd     %xmm9, %xmm8                                  #5.14
        addpd     %xmm0, %xmm10                                 #9.9
        cmpb      $100, %r12b                                   #7.5
        jb        ..B1.2        # Prob 99%                      #7.5


In scientific code it's common to have small kernels that take most of the run-time of a program, so it's important to shave every instructions from such loops.

Do you think Phobos/LDC2 are doing well enough here?

Bye,
bearophile

> void main() { > import std.stdio, std.algorithm, std.range; > enum n = 100; > immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a + b}; > r.writeln; > } If I use core.stdc.math.sin and compile the code with: gdc c.d -o c -O3 -fno-bounds-check -frelease I get: 404e80: cvtsi2sd %ebx,%xmm0 404e84: callq 403070 <sin@plt> 404e89: mulsd %xmm0,%xmm0 404e8d: add $0x1,%ebx 404e90: cmp $0x64,%ebx 404e93: addsd 0x28(%rsp),%xmm0 404e99: movsd %xmm0,0x28(%rsp) 404e9f: jne 404e80 <_Dmain+0x60> This is almost exactly the same code as > .L3: > cvtsi2sd %ebx, %xmm0 > call sin > .L2: > mulsd %xmm0, %xmm0 > addl $1, %ebx > cmpl $100, %ebx > addsd 8(%rsp), %xmm0 > movsd %xmm0, 8(%rsp) > jne .L3

jerro: > If I use core.stdc.math.sin and compile the code with: > > gdc c.d -o c -O3 -fno-bounds-check -frelease > > I get: > > 404e80: cvtsi2sd %ebx,%xmm0 > 404e84: callq 403070 <sin@plt> > 404e89: mulsd %xmm0,%xmm0 > 404e8d: add $0x1,%ebx > 404e90: cmp $0x64,%ebx > 404e93: addsd 0x28(%rsp),%xmm0 > 404e99: movsd %xmm0,0x28(%rsp) > 404e9f: jne 404e80 <_Dmain+0x60> > > This is almost exactly the same code as > >> .L3: >> cvtsi2sd %ebx, %xmm0 >> call sin >> .L2: >> mulsd %xmm0, %xmm0 >> addl $1, %ebx >> cmpl $100, %ebx >> addsd 8(%rsp), %xmm0 >> movsd %xmm0, 8(%rsp) >> jne .L3 I guess the back-end is very similar, and it's able to optimize the code well :-) What's the difference between std.math.sin and core.stdc.math.sin? Bye, bearophile

Hi bearophile! On Wednesday, 20 November 2013 at 22:12:29 UTC, bearophile wrote: > > In scientific code it's common to have small kernels that take most of the run-time of a program, so it's important to shave every instructions from such loops. > > Do you think Phobos/LDC2 are doing well enough here? The difference between std.math.sin and core.stdc.math.sin is that std.math.sin is mapped to the LLVM intrinsic llvm.sin while core.stdc.math.sin is the sine function from the C library. I would expect that the LLVM intrinsic is replaced with fsin (at least with -m32). I am unsure why this does not happen. My understanding is that the auto-vectorizer should kick-in here (see http://llvm.org/docs/Vectorizers.html). I really need to go deeper here to understand what's happening in LLVM. For sure, Phobos/LDC2 should do better here!!! Regards, Kai

On Friday, 22 November 2013 at 21:44:28 UTC, Kai Nacke wrote: > I would expect that the LLVM intrinsic is replaced with fsin (at least with -m32). I am unsure why this does not happen. The answer is simple: it is yet not implemented. Just look at the slides from last LLVM developer meeting: http://www.llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf Regards, Kai

Forums