Thread overview | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
December 23, 2016 Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
An interesting problem to look at: https://github.com/dlang/dmd/pull/6352 Andrei |
December 23, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote:
> An interesting problem to look at:
>
> https://github.com/dlang/dmd/pull/6352
>
> Andrei
Ok some hand written assembler to and-assign ints...
enum SIZE = 100000000;
void foo3(int* a, int* b)
{
asm
{
mov EAX,a;
mov EDX,b;
sub EDX,EAX;
mov ECX,EAX;
add ECX,SIZE*4;
l1:;
cmp EAX,ECX;
jae done;
mov EBX,[EAX];
and EBX,[EAX+EDX];
mov [EAX],EBX;
add EAX,4;
jmp l1;
done:;
}
}
I get..
456ms for unconditional
505ms for conditional
268ms for assembler
So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem.
|
December 23, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote:
> An interesting problem to look at:
>
The foreach macro (src/tk/vec.h#L62) looks like low hanging fruit for optimization as well.
|
December 23, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to hardreset | On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote: > I get.. > > 456ms for unconditional > 505ms for conditional > 268ms for assembler > > So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem. Has anyone considered using ldc for DMD release compilations? > dmd -O -release -inline -run foo.d 138 ms, 603 μs, and 6 hnsecs > ldmd -O -release -inline -run foo.d 84 ms and 719 μs |
December 23, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to Seb | On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote:
> On Friday, 23 December 2016 at 18:03:54 UTC, hardreset wrote:
>> I get..
>>
>> 456ms for unconditional
>> 505ms for conditional
>> 268ms for assembler
>>
>> So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem.
>
> Has anyone considered using ldc for DMD release compilations?
>
>> dmd -O -release -inline -run foo.d
> 138 ms, 603 μs, and 6 hnsecs
>
>> ldmd -O -release -inline -run foo.d
> 84 ms and 719 μs
This is indeed already done.
|
December 23, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to hardreset | On 12/23/2016 10:03 AM, hardreset wrote:
> enum SIZE = 100000000;
>
> void foo3(int* a, int* b)
> {
> asm
> {
> mov EAX,a;
> mov EDX,b;
> sub EDX,EAX;
> mov ECX,EAX;
> add ECX,SIZE*4;
> l1:;
> cmp EAX,ECX;
> jae done;
> mov EBX,[EAX];
> and EBX,[EAX+EDX];
> mov [EAX],EBX;
> add EAX,4;
> jmp l1;
> done:;
> }
> }
>
> I get..
>
> 456ms for unconditional
> 505ms for conditional
> 268ms for assembler
>
> So the asm is about 40% faster than the best of the alternatives. The point
> being that it's the generated code not the source that's the problem.
For this D code:
enum SIZE = 100000000;
void foo(int* a, int* b) {
int* atop = a + 1000;
ptrdiff_t offset = b - a;
for (; a < atop; ++a)
*a &= *(a + offset);
}
The following asm is generated by DMD:
push EBX
mov EBX,8[ESP]
sub EAX,EBX
push ESI
cdq
and EDX,3
add EAX,EDX
sar EAX,2
lea ECX,0FA0h[EBX]
mov ESI,EAX
cmp EBX,ECX
jae L2C
L20: mov EDX,[ESI*4][EBX]
and [EBX],EDX
add EBX,4
cmp EBX,ECX
jb L20
L2C: pop ESI
pop EBX
ret 4
The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop.
|
December 23, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On 12/23/2016 05:11 PM, Walter Bright wrote:
> void foo(int* a, int* b) {
> int* atop = a + 1000;
> ptrdiff_t offset = b - a;
> for (; a < atop; ++a)
> *a &= *(a + offset);
> }
That's a neat trick, thanks hardreset (and Walter for making it high-level). There are a bunch of places in druntime and phobos that could use it. -- Andrei
|
December 23, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote: > > enum SIZE = 100000000; > > void foo(int* a, int* b) { > int* atop = a + 1000; // should be `a + SIZE`, right? > ptrdiff_t offset = b - a; > for (; a < atop; ++a) > *a &= *(a + offset); > } This code is not equivalent with the plain foreach loop. Execution is different when a > (size_t.max-SIZE). Thus the "ptrdiff" loop cannot be vectorized (or can only be vectorized when guarded with a check for potential overflow first). LDC: https://godbolt.org/g/YcCJdZ (note the funny jmp .LBB1_6 to a ret instruction... what's that?!) GDC does something more complex: https://godbolt.org/g/3XeI9p Just for info. Don't know which is faster, but I'm guessing the vectorized foreach loop. |
December 23, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to Johan Engelen | On 12/23/2016 3:35 PM, Johan Engelen wrote: > On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote: >> >> enum SIZE = 100000000; >> >> void foo(int* a, int* b) { >> int* atop = a + 1000; // should be `a + SIZE`, right? >> ptrdiff_t offset = b - a; >> for (; a < atop; ++a) >> *a &= *(a + offset); >> } > > This code is not equivalent with the plain foreach loop. Execution is different > when a > (size_t.max-SIZE). The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow. > Thus the "ptrdiff" loop cannot be vectorized (or can > only be vectorized when guarded with a check for potential overflow first). > LDC: > https://godbolt.org/g/YcCJdZ > (note the funny jmp .LBB1_6 to a ret instruction... what's that?!) > > GDC does something more complex: > https://godbolt.org/g/3XeI9p > > Just for info. Don't know which is faster, but I'm guessing the vectorized > foreach loop. The vectorized one probably is. But it sure is a lot of code. (The loop speed could possibly be doubled just by unrolling it a few times.) |
December 24, 2016 Re: Optimization problem: bulk Boolean operations on vectors | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
> On 12/23/2016 10:03 AM, hardreset wrote:
>
> For this D code:
>
> enum SIZE = 100000000;
>
> void foo(int* a, int* b) {
> int* atop = a + 1000;
> ptrdiff_t offset = b - a;
> for (; a < atop; ++a)
> *a &= *(a + offset);
> }
>
> The following asm is generated by DMD:
>
> push EBX
> mov EBX,8[ESP]
> sub EAX,EBX
> push ESI
> cdq
> and EDX,3
> add EAX,EDX
> sar EAX,2
> lea ECX,0FA0h[EBX]
> mov ESI,EAX
> cmp EBX,ECX
> jae L2C
> L20: mov EDX,[ESI*4][EBX]
> and [EBX],EDX
> add EBX,4
> cmp EBX,ECX
> jb L20
> L2C: pop ESI
> pop EBX
> ret 4
>
> The inner loop is 5 instructions, whereas the one you wrote is 7 instructions (I didn't benchmark it). With some more source code manipulation the divide can be eliminated, but that is irrelevant to the inner loop.
I patched up the prolog code and timed it and it came out identical to my asm. I tried the ptrdiff C-like code and that still comes out 20% slower here. I'm compiling with...
rdmd test.d -O -release -inline
Am I missing something? How do I get the asm output?
|
Copyright © 1999-2021 by the D Language Foundation