OOP, faster data layouts, compilers

Re: OOP, faster data layouts, compilers

May 03, 2011

Sep 02, 2015

Sean Cavanaugh: > In many ways the biggest thing I use regularly in game development that I would lose by moving to D would be good built-in SIMD support. Don has given a nice answer about how D2 plans to face this. To focus more what Don was saying I think a small exaple will help. This is a C implementation of one Computer Shootout benchmarks, that generates a binary PPM image of the Mandelbrot set: http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4 This is an important part of that C version: typedef double v2df __attribute__ ((vector_size(16))); /* vector of two doubles */ const v2df zero = { 0.0, 0.0 }; const v2df four = { 4.0, 4.0 }; // Constant throughout the program, value depends on N int bytes_per_row; double inverse_w; double inverse_h; // Program argument: height and width of the image int N; // Lookup table for initial real-axis value v2df *Crvs; // Mandelbrot bitmap uint8_t *bitmap; static void calc_row(int y) { uint8_t *row_bitmap = bitmap + (bytes_per_row * y); int x; const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 }; for (x = 0; x < N; x += 2) { v2df Crv = Crvs[x >> 1]; v2df Civ = Civ_init; v2df Zrv = zero; v2df Ziv = zero; v2df Trv = zero; v2df Tiv = zero; int i = 50; int two_pixels; v2df is_still_bounded; do { Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ; Zrv = Trv - Tiv + Crv; Trv = Zrv * Zrv; Tiv = Ziv * Ziv; // All bits will be set to 1 if 'Trv + Tiv' is less than 4 // and all bits will be set to 0 otherwise. Two elements // are calculated in parallel here. is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four); // Move the sign-bit of the low element to bit 0, move the // sign-bit of the high element to bit 1. The result is // that the pixel will be set if the calculation was // bounded. two_pixels = __builtin_ia32_movmskpd(is_still_bounded); } while (--i > 0 && two_pixels); // The pixel bits must be in the most and second most // significant position two_pixels <<= 6; // Add the two pixels to the bitmap, all bits are // initially zero since the area was allocated with calloc() row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7)); } } GCC 4.6 compiles the inner do-while loop of calc_row() to just this very clean assembly, that in my opinion is quite _beautiful_, it shows one of the most important final purposes of a good compiler: L9: subl $1, %ecx addpd %xmm0, %xmm0 mulpd %xmm0, %xmm1 movapd %xmm4, %xmm0 addpd %xmm6, %xmm1 addpd %xmm5, %xmm0 subpd %xmm3, %xmm0 movapd %xmm1, %xmm3 movapd %xmm0, %xmm4 mulpd %xmm1, %xmm3 mulpd %xmm0, %xmm4 movapd %xmm3, %xmm2 addpd %xmm4, %xmm2 cmplepd %xmm7, %xmm2 movmskpd %xmm2, %ebx je L18 testl %ebx, %ebx jne L9 Those addpd, subpd, mulpd, movapd, etc, instructions work on pairs of doubles (those v2df). And the code uses the cmplepd and movmskpd instructions too, in a very clean way, that I think not even GCC 4.6 is normally able to use by itself. A good language + compiler have many purposes, but producing ASM code like that is one of the most important purposes, expecially if you write numerical code. A numerical programmer really wants to write code that somehow produces equally clean and powerful code (or better, using AVX 256-bit registers and 3-way instructions) in numerical processing kernels (often such kernels are small, often just bodies of inner loops). D2 allows to write code almost as clean as this C one (but I think currently no D compiler is able to turn this into clean inlined addpd, subpd, mulpd, movapd instructions. This is a compiler issue, not a language one): v2df Zrv = zero; ... Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ; Zrv = Trv - Tiv + Crv; Trv = Zrv * Zrv; Tiv = Ziv * Ziv; In D it becomes: double[2] Zrv = zero; ... Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[]; Zrv[] = Trv[] - Tiv[] + Crv[]; Trv[] = Zrv[] * Zrv[]; Tiv[] = Ziv[] * Ziv[]; But then how do you write this in a clean way in D2/D3? do { ... is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four); two_pixels = __builtin_ia32_movmskpd(is_still_bounded); } while (--i > 0 && two_pixels); Using those __builtin_ia32_cmplepd() and __builtin_ia32_movmskpd() is not easy, so there is a tradeoff between allowing easy to write code, and giving power. So it's acceptable for a language to give a bit less power if the code is simpler to write. Yet, in a system language if you don't give people a way to produce ASM code as clean as the one I've shown in the inner loops of numerical processing code, some D2 programmers will be forced to write down inline asm, and that's sometimes worse than using intrinsics like __builtin_ia32_cmplepd(). Writing efficient inner loops is very important for numerical processing code, and I think numerical processing code is important for D2. Time ago I have suggested to extend the D2 vector operations to code like this, but I think this is not enough still: float[4] a, b, c, d; c = a[] == b[]; d = a[] >= b[]; Bye, bearophile

September 02, 2015

Re: OOP, faster data layouts, compilers

Posted by qznc
in reply to bearophile

Permalink

qznc

Posted in reply to bearophile

Permalink

On Tuesday, 3 May 2011 at 20:51:37 UTC, bearophile wrote:
> Sean Cavanaugh:
>
>> In many ways the biggest thing I use regularly in game development that I would lose by moving to D would be good built-in SIMD support.
>
> Don has given a nice answer about how D2 plans to face this.
>
> To focus more what Don was saying I think a small exaple will help. This is a C implementation of one Computer Shootout benchmarks, that generates a binary PPM image of the Mandelbrot set:
>
> http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4
>
> This is an important part of that C version:
>
>
> typedef double v2df __attribute__ ((vector_size(16))); /* vector of two doubles */
> const v2df zero = { 0.0, 0.0 };
> const v2df four = { 4.0, 4.0 };
>
> // Constant throughout the program, value depends on N
> int bytes_per_row;
> double inverse_w;
> double inverse_h;
>
> // Program argument: height and width of the image
> int N;
>
> // Lookup table for initial real-axis value
> v2df *Crvs;
>
> // Mandelbrot bitmap
> uint8_t *bitmap;
>
> static void calc_row(int y) {
>   uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
>   int x;
>   const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };
>
>   for (x = 0; x < N; x += 2) {
>     v2df Crv = Crvs[x >> 1];
>     v2df Civ = Civ_init;
>     v2df Zrv = zero;
>     v2df Ziv = zero;
>     v2df Trv = zero;
>     v2df Tiv = zero;
>     int i = 50;
>     int two_pixels;
>     v2df is_still_bounded;
>
>     do {
>       Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
>       Zrv = Trv - Tiv + Crv;
>       Trv = Zrv * Zrv;
>       Tiv = Ziv * Ziv;
>
>       // All bits will be set to 1 if 'Trv + Tiv' is less than 4
>       // and all bits will be set to 0 otherwise. Two elements
>       // are calculated in parallel here.
>       is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
>
>       // Move the sign-bit of the low element to bit 0, move the
>       // sign-bit of the high element to bit 1. The result is
>       // that the pixel will be set if the calculation was
>       // bounded.
>       two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
>     } while (--i > 0 && two_pixels);
>
>     // The pixel bits must be in the most and second most
>     // significant position
>     two_pixels <<= 6;
>
>     // Add the two pixels to the bitmap, all bits are
>     // initially zero since the area was allocated with calloc()
>     row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
>   }
> }
>
>
> GCC 4.6 compiles the inner do-while loop of calc_row() to just this very clean assembly, that in my opinion is quite _beautiful_, it shows one of the most important final purposes of a good compiler:
>
> L9:
>     subl    $1, %ecx
>     addpd   %xmm0, %xmm0
>     mulpd   %xmm0, %xmm1
>     movapd  %xmm4, %xmm0
>     addpd   %xmm6, %xmm1
>     addpd   %xmm5, %xmm0
>     subpd   %xmm3, %xmm0
>     movapd  %xmm1, %xmm3
>     movapd  %xmm0, %xmm4
>     mulpd   %xmm1, %xmm3
>     mulpd   %xmm0, %xmm4
>     movapd  %xmm3, %xmm2
>     addpd   %xmm4, %xmm2
>     cmplepd %xmm7, %xmm2
>     movmskpd    %xmm2, %ebx
>     je  L18
>     testl   %ebx, %ebx
>     jne L9
>
>
> Those addpd, subpd, mulpd, movapd, etc, instructions work on pairs of doubles (those v2df). And the code uses the cmplepd and movmskpd instructions too, in a very clean way, that I think not even GCC 4.6 is normally able to use by itself. A good language + compiler have many purposes, but producing ASM code like that is one of the most important purposes, expecially if you write numerical code.
>
> A numerical programmer really wants to write code that somehow produces equally clean and powerful code (or better, using AVX 256-bit registers and 3-way instructions) in numerical processing kernels (often such kernels are small, often just bodies of inner loops).
>
> D2 allows to write code almost as clean as this C one (but I think currently no D compiler is able to turn this into clean inlined addpd, subpd, mulpd, movapd instructions. This is a compiler issue, not a language one):
>
> v2df Zrv = zero;
> ...
> Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
> Zrv = Trv - Tiv + Crv;
> Trv = Zrv * Zrv;
> Tiv = Ziv * Ziv;
>
>
> In D it becomes:
>
> double[2] Zrv = zero;
> ...
> Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
> Zrv[] = Trv[] - Tiv[] + Crv[];
> Trv[] = Zrv[] * Zrv[];
> Tiv[] = Ziv[] * Ziv[];
>
>
> But then how do you write this in a clean way in D2/D3?
>
> do {
>     ...
>     is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
>     two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
> } while (--i > 0 && two_pixels);
>
>
>
> Using those __builtin_ia32_cmplepd() and __builtin_ia32_movmskpd() is not easy, so there is a tradeoff between allowing easy to write code, and giving power. So it's acceptable for a language to give a bit less power if the code is simpler to write. Yet, in a system language if you don't give people a way to produce ASM code as clean as the one I've shown in the inner loops of numerical processing code, some D2 programmers will be forced to write down inline asm, and that's sometimes worse than using intrinsics like __builtin_ia32_cmplepd().
>
> Writing efficient inner loops is very important for numerical processing code, and I think numerical processing code is important for D2.
>
> Time ago I have suggested to extend the D2 vector operations to code like this, but I think this is not enough still:
>
> float[4] a, b, c, d;
> c = a[] == b[];
> d = a[] >= b[];
>
> Bye,
> bearophile

Just found this old post, since I'm tuning mandelbrot.d right now [0].

The good news: LDC produces code, which is quite close to the C version.

mulsd  %xmm6,%xmm4
subsd  %xmm1,%xmm7
addsd  %xmm4,%xmm4
addsd  %xmm5,%xmm7
addsd  %xmm0,%xmm4
movaps %xmm7,%xmm6
mulsd  %xmm6,%xmm6
movaps %xmm4,%xmm2
mulsd  %xmm2,%xmm2
movaps %xmm2,%xmm1
addsd  %xmm6,%xmm1
ucomisd %xmm1,%xmm3
jb     4026f0 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0x130>
jl     402680 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0xc0>

Even better, the code is produce from the following (inlined!) source,
which is pretty much the mathematical definition.

for(auto i = 0; i < iter && norm(Z) <= lim; i++)
        Z = Z*Z + C;

The bad news: cmplepd and movmskpd are not used. Is that possible somehow four years later?

The gcc code is roughly twice as fast at the moment, but I don't know if cmplepd and movmskpd is the only thing missing.

[0] https://github.com/qznc/d-shootout

On Wednesday, 2 September 2015 at 19:04:10 UTC, qznc wrote: > The bad news: cmplepd and movmskpd are not used. Is that possible somehow four years later? I just checked, and LLVM does not know how to automatically vectorize that loop. You would need to write it manually using vector types (like in the C version). > [0] https://github.com/qznc/d-shootout As a general note, you might want to add "-boundscheck=off -mcpu=native" to the flags for LDC too for a fair comparison to the other compilers. Also, if you use the DMD-style flags (e.g. -O -inline), you should use the ldmd2 wrapper instead of ldc2. You might also want to use 2.067 branch of ldc2 (just released as an alpha version) for better comparability to DMD. — David

Forums