Optimization problem: bulk Boolean operations on vectors

Dec 23, 2016

Andrei Alexandrescu

Dec 23, 2016

Dec 23, 2016

Dec 23, 2016

Dec 25, 2016

Dec 23, 2016

Dec 23, 2016

Dec 23, 2016

Dec 24, 2016

Dec 24, 2016

Dec 24, 2016

Dec 24, 2016

Dec 24, 2016

Dec 24, 2016

Dec 24, 2016

Dec 24, 2016

Dec 24, 2016

Dec 23, 2016

Dec 27, 2016

On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote: > An interesting problem to look at: > > https://github.com/dlang/dmd/pull/6352 > > Andrei Ok some hand written assembler to and-assign ints... enum SIZE = 100000000; void foo3(int* a, int* b) { asm { mov EAX,a; mov EDX,b; sub EDX,EAX; mov ECX,EAX; add ECX,SIZE*4; l1:; cmp EAX,ECX; jae done; mov EBX,[EAX]; and EBX,[EAX+EDX]; mov [EAX],EBX; add EAX,4; jmp l1; done:; } } I get.. 456ms for unconditional 505ms for conditional 268ms for assembler So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem.

On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote: > An interesting problem to look at: > The foreach macro (src/tk/vec.h#L62) looks like low hanging fruit for optimization as well.

On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote: > I get.. > > 456ms for unconditional > 505ms for conditional > 268ms for assembler > > So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem. Has anyone considered using ldc for DMD release compilations? > dmd -O -release -inline -run foo.d 138 ms, 603 μs, and 6 hnsecs > ldmd -O -release -inline -run foo.d 84 ms and 719 μs

On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote: > On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote: >> I get.. >> >> 456ms for unconditional >> 505ms for conditional >> 268ms for assembler >> >> So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem. > > Has anyone considered using ldc for DMD release compilations? > >> dmd -O -release -inline -run foo.d > 138 ms, 603 μs, and 6 hnsecs > >> ldmd -O -release -inline -run foo.d > 84 ms and 719 μs This is indeed already done.

December 23, 2016

Re: Optimization problem: bulk Boolean operations on vectors

Posted by Walter Bright
in reply to hardreset

Permalink

Walter Bright

Posted in reply to hardreset

Permalink

On 12/23/2016 10:03 AM, hardreset wrote:
> enum SIZE = 100000000;
>
> void foo3(int* a, int* b)
> {
>     asm
>     {
>         mov   EAX,a;
>         mov   EDX,b;
>         sub   EDX,EAX;
>         mov   ECX,EAX;
>         add   ECX,SIZE*4;
>     l1:;
>         cmp   EAX,ECX;
>         jae   done;
>         mov   EBX,[EAX];
>         and   EBX,[EAX+EDX];
>         mov   [EAX],EBX;
>         add   EAX,4;
>         jmp l1;
>     done:;
>     }
> }
>
> I get..
>
> 456ms for unconditional
> 505ms for conditional
> 268ms for assembler
>
> So the asm is about 40% faster than the best of the alternatives. The point
> being that it's the generated code not the source that's the problem.

For this D code:

enum SIZE = 100000000;

void foo(int* a, int* b) {
    int* atop = a + 1000;
    ptrdiff_t offset = b - a;
    for (; a < atop; ++a)
	*a &= *(a + offset);
}

The following asm is generated by DMD:

                push    EBX
                mov     EBX,8[ESP]
                sub     EAX,EBX
                push    ESI
                cdq
                and     EDX,3
                add     EAX,EDX
                sar     EAX,2
                lea     ECX,0FA0h[EBX]
                mov     ESI,EAX
                cmp     EBX,ECX
                jae     L2C
L20:            mov     EDX,[ESI*4][EBX]
                and     [EBX],EDX
                add     EBX,4
                cmp     EBX,ECX
                jb      L20
L2C:            pop     ESI
                pop     EBX
                ret     4

The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop.

On 12/23/2016 05:11 PM, Walter Bright wrote: > void foo(int* a, int* b) { > int* atop = a + 1000; > ptrdiff_t offset = b - a; > for (; a < atop; ++a) > *a &= *(a + offset); > } That's a neat trick, thanks hardreset (and Walter for making it high-level). There are a bunch of places in druntime and phobos that could use it. -- Andrei

On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote: > > enum SIZE = 100000000; > > void foo(int* a, int* b) { > int* atop = a + 1000; // should be `a + SIZE`, right? > ptrdiff_t offset = b - a; > for (; a < atop; ++a) > *a &= *(a + offset); > } This code is not equivalent with the plain foreach loop. Execution is different when a > (size_t.max-SIZE). Thus the "ptrdiff" loop cannot be vectorized (or can only be vectorized when guarded with a check for potential overflow first). LDC: https://godbolt.org/g/YcCJdZ (note the funny jmp .LBB1_6 to a ret instruction... what's that?!) GDC does something more complex: https://godbolt.org/g/3XeI9p Just for info. Don't know which is faster, but I'm guessing the vectorized foreach loop.

On 12/23/2016 3:35 PM, Johan Engelen wrote: > On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote: >> >> enum SIZE = 100000000; >> >> void foo(int* a, int* b) { >> int* atop = a + 1000; // should be `a + SIZE`, right? >> ptrdiff_t offset = b - a; >> for (; a < atop; ++a) >> *a &= *(a + offset); >> } > > This code is not equivalent with the plain foreach loop. Execution is different > when a > (size_t.max-SIZE). The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow. > Thus the "ptrdiff" loop cannot be vectorized (or can > only be vectorized when guarded with a check for potential overflow first). > LDC: > https://godbolt.org/g/YcCJdZ > (note the funny jmp .LBB1_6 to a ret instruction... what's that?!) > > GDC does something more complex: > https://godbolt.org/g/3XeI9p > > Just for info. Don't know which is faster, but I'm guessing the vectorized > foreach loop. The vectorized one probably is. But it sure is a lot of code. (The loop speed could possibly be doubled just by unrolling it a few times.)

On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote: > On 12/23/2016 10:03 AM, hardreset wrote: > > For this D code: > > enum SIZE = 100000000; > > void foo(int* a, int* b) { > int* atop = a + 1000; > ptrdiff_t offset = b - a; > for (; a < atop; ++a) > *a &= *(a + offset); > } > > The following asm is generated by DMD: > > push EBX > mov EBX,8[ESP] > sub EAX,EBX > push ESI > cdq > and EDX,3 > add EAX,EDX > sar EAX,2 > lea ECX,0FA0h[EBX] > mov ESI,EAX > cmp EBX,ECX > jae L2C > L20: mov EDX,[ESI*4][EBX] > and [EBX],EDX > add EBX,4 > cmp EBX,ECX > jb L20 > L2C: pop ESI > pop EBX > ret 4 > > The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop. I patched up the prolog code and timed it and it came out identical to my asm. I tried the ptrdiff C-like code and that still comes out 20% slower here. I'm compiling with... rdmd test.d -O -release -inline Am I missing something? How do I get the asm output?

Forums