Jump to page: 1 2
Thread overview
Optimization problem: bulk Boolean operations on vectors
Dec 23, 2016
hardreset
Dec 23, 2016
Seb
Dec 23, 2016
Stefan Koch
Dec 25, 2016
Martin Nowak
Dec 23, 2016
Walter Bright
Dec 23, 2016
Johan Engelen
Dec 24, 2016
Walter Bright
Dec 24, 2016
Johan Engelen
Dec 24, 2016
Walter Bright
Dec 24, 2016
hardreset
Dec 24, 2016
Walter Bright
Dec 24, 2016
Jacob Carlborg
Dec 24, 2016
safety0ff
Dec 24, 2016
Stefan Koch
Dec 23, 2016
safety0ff
Dec 27, 2016
Observer
December 23, 2016
An interesting problem to look at:

https://github.com/dlang/dmd/pull/6352

Andrei
December 23, 2016
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote:
> An interesting problem to look at:
>
> https://github.com/dlang/dmd/pull/6352
>
> Andrei

Ok some hand written assembler to and-assign ints...

enum SIZE = 100000000;

void foo3(int* a, int* b)
{
    asm
    {
        mov   EAX,a;
        mov   EDX,b;
        sub   EDX,EAX;
        mov   ECX,EAX;
        add   ECX,SIZE*4;
    l1:;
        cmp   EAX,ECX;
        jae   done;
        mov   EBX,[EAX];
        and   EBX,[EAX+EDX];
        mov   [EAX],EBX;
        add   EAX,4;
        jmp l1;
    done:;
    }
}

I get..

456ms for unconditional
505ms for conditional
268ms for assembler

So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem.
December 23, 2016
On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote:
> An interesting problem to look at:
>

The foreach macro (src/tk/vec.h#L62) looks like low hanging fruit for optimization as well.
December 23, 2016
On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
> I get..
>
> 456ms for unconditional
> 505ms for conditional
> 268ms for assembler
>
> So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem.

Has anyone considered using ldc for DMD release compilations?

>  dmd -O -release -inline -run foo.d
138 ms, 603 μs, and 6 hnsecs

> ldmd -O -release -inline -run foo.d
84 ms and 719 μs
December 23, 2016
On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote:
> On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
>> I get..
>>
>> 456ms for unconditional
>> 505ms for conditional
>> 268ms for assembler
>>
>> So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem.
>
> Has anyone considered using ldc for DMD release compilations?
>
>>  dmd -O -release -inline -run foo.d
> 138 ms, 603 μs, and 6 hnsecs
>
>> ldmd -O -release -inline -run foo.d
> 84 ms and 719 μs

This is indeed already done.

December 23, 2016
On 12/23/2016 10:03 AM, hardreset wrote:
> enum SIZE = 100000000;
>
> void foo3(int* a, int* b)
> {
>     asm
>     {
>         mov   EAX,a;
>         mov   EDX,b;
>         sub   EDX,EAX;
>         mov   ECX,EAX;
>         add   ECX,SIZE*4;
>     l1:;
>         cmp   EAX,ECX;
>         jae   done;
>         mov   EBX,[EAX];
>         and   EBX,[EAX+EDX];
>         mov   [EAX],EBX;
>         add   EAX,4;
>         jmp l1;
>     done:;
>     }
> }
>
> I get..
>
> 456ms for unconditional
> 505ms for conditional
> 268ms for assembler
>
> So the asm is about 40% faster than the best of the alternatives. The point
> being that it's the generated code not the source that's the problem.

For this D code:

enum SIZE = 100000000;

void foo(int* a, int* b) {
    int* atop = a + 1000;
    ptrdiff_t offset = b - a;
    for (; a < atop; ++a)
	*a &= *(a + offset);
}

The following asm is generated by DMD:

                push    EBX
                mov     EBX,8[ESP]
                sub     EAX,EBX
                push    ESI
                cdq
                and     EDX,3
                add     EAX,EDX
                sar     EAX,2
                lea     ECX,0FA0h[EBX]
                mov     ESI,EAX
                cmp     EBX,ECX
                jae     L2C
L20:            mov     EDX,[ESI*4][EBX]
                and     [EBX],EDX
                add     EBX,4
                cmp     EBX,ECX
                jb      L20
L2C:            pop     ESI
                pop     EBX
                ret     4

The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop.
December 23, 2016
On 12/23/2016 05:11 PM, Walter Bright wrote:
> void foo(int* a, int* b) {
>     int* atop = a + 1000;
>     ptrdiff_t offset = b - a;
>     for (; a < atop; ++a)
>     *a &= *(a + offset);
> }

That's a neat trick, thanks hardreset (and Walter for making it high-level). There are a bunch of places in druntime and phobos that could use it. -- Andrei
December 23, 2016
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
> 
> enum SIZE = 100000000;
>
> void foo(int* a, int* b) {
>     int* atop = a + 1000; // should be `a + SIZE`, right?
>     ptrdiff_t offset = b - a;
>     for (; a < atop; ++a)
> 	*a &= *(a + offset);
> }

This code is not equivalent with the plain foreach loop. Execution is different when a > (size_t.max-SIZE). Thus the "ptrdiff" loop cannot be vectorized (or can only be vectorized when guarded with a check for potential overflow first).
LDC:
https://godbolt.org/g/YcCJdZ
(note the funny jmp .LBB1_6 to a ret instruction... what's that?!)

GDC does something more complex:
https://godbolt.org/g/3XeI9p

Just for info. Don't know which is faster, but I'm guessing the vectorized foreach loop.

December 23, 2016
On 12/23/2016 3:35 PM, Johan Engelen wrote:
> On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
>>
>> enum SIZE = 100000000;
>>
>> void foo(int* a, int* b) {
>>     int* atop = a + 1000; // should be `a + SIZE`, right?
>>     ptrdiff_t offset = b - a;
>>     for (; a < atop; ++a)
>>     *a &= *(a + offset);
>> }
>
> This code is not equivalent with the plain foreach loop. Execution is different
> when a > (size_t.max-SIZE).

The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow.

> Thus the "ptrdiff" loop cannot be vectorized (or can
> only be vectorized when guarded with a check for potential overflow first).
> LDC:
> https://godbolt.org/g/YcCJdZ
> (note the funny jmp .LBB1_6 to a ret instruction... what's that?!)
>
> GDC does something more complex:
> https://godbolt.org/g/3XeI9p
>
> Just for info. Don't know which is faster, but I'm guessing the vectorized
> foreach loop.

The vectorized one probably is. But it sure is a lot of code. (The loop speed could possibly be doubled just by unrolling it a few times.)

December 24, 2016
On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
> On 12/23/2016 10:03 AM, hardreset wrote:
>
> For this D code:
>
> enum SIZE = 100000000;
>
> void foo(int* a, int* b) {
>     int* atop = a + 1000;
>     ptrdiff_t offset = b - a;
>     for (; a < atop; ++a)
> 	*a &= *(a + offset);
> }
>
> The following asm is generated by DMD:
>
>                 push    EBX
>                 mov     EBX,8[ESP]
>                 sub     EAX,EBX
>                 push    ESI
>                 cdq
>                 and     EDX,3
>                 add     EAX,EDX
>                 sar     EAX,2
>                 lea     ECX,0FA0h[EBX]
>                 mov     ESI,EAX
>                 cmp     EBX,ECX
>                 jae     L2C
> L20:            mov     EDX,[ESI*4][EBX]
>                 and     [EBX],EDX
>                 add     EBX,4
>                 cmp     EBX,ECX
>                 jb      L20
> L2C:            pop     ESI
>                 pop     EBX
>                 ret     4
>
> The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop.

I patched up the prolog code and timed it and it came out identical to my asm. I tried the ptrdiff C-like code and that still comes out 20% slower here. I'm compiling with...

rdmd test.d -O -release -inline

Am I missing something? How do I get the asm output?



« First   ‹ Prev
1 2