Optimization problem: bulk Boolean operations on vectors (page 2)

On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote: > > For this D code: > > enum SIZE = 100000000; > > void foo(int* a, int* b) { > int* atop = a + 1000; > ptrdiff_t offset = b - a; > for (; a < atop; ++a) > *a &= *(a + offset); > } Is subtraction of pointers which do not belong to the same array defined behavior in D?

On 12/23/2016 5:06 PM, hardreset wrote: > I patched up the prolog code and timed it and it came out identical to my asm. I > tried the ptrdiff C-like code and that still comes out 20% slower here. I'm > compiling with... > > rdmd test.d -O -release -inline > > Am I missing something? How do I get the asm output? dmd test -O obj2asm test.obj

On Saturday, 24 December 2016 at 01:38:24 UTC, safety0ff wrote: > On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote: >> >> For this D code: >> >> enum SIZE = 100000000; >> >> void foo(int* a, int* b) { >> int* atop = a + 1000; >> ptrdiff_t offset = b - a; >> for (; a < atop; ++a) >> *a &= *(a + offset); >> } > > Is subtraction of pointers which do not belong to the same array defined behavior in D? Yes, I think so. There are not crazy rules to make pointer math ub.

December 24, 2016

Re: Optimization problem: bulk Boolean operations on vectors

Posted by Johan Engelen
in reply to Walter Bright

Permalink

Johan Engelen

Posted in reply to Walter Bright

Permalink

On Saturday, 24 December 2016 at 00:04:48 UTC, Walter Bright wrote:
> On 12/23/2016 3:35 PM, Johan Engelen wrote:
>> On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
>>>
>>> enum SIZE = 100000000;
>>>
>>> void foo(int* a, int* b) {
>>>     int* atop = a + 1000; // should be `a + SIZE`, right?
>>>     ptrdiff_t offset = b - a;
>>>     for (; a < atop; ++a)
>>>     *a &= *(a + offset);
>>> }
>>
>> This code is not equivalent with the plain foreach loop. Execution is different
>> when a > (size_t.max-SIZE).
>
> The assumption is that 'a' points to the start of an array [0..SIZE], so there's no overflow.

Of course, but it's not codified in the source and thus the function is different from the foreach version.

More importantly, I think it is broken. I find it very hard to reason about the ptrdiff_t version, because of all the potential overflows and the mixing of unsigned and signed math. So I wrote a little helper program to print out pointer values.
```
// file a.d
import std.stdio;
void main()
{
    int* a = cast(int*) (size_t.max - 4000000LU + 2000000LU);
    int* b = cast(int*) (1000000LU);
    // ptrdiff_t cannot represent the large difference between
    //arrays at the start and end of memory.
    ptrdiff_t offset = b - a;

    writeln("a          = ", a);
    writeln("b          = ", b);
    writeln("a + offset = ", a + offset);
}
```
Then:
```
❯ dmd -run a.d
a          = FFFFFFFFFFC2F6FF
b          = F4240
a + offset = F423F
```
a+offset is not equal to b for the given pointer values. Whoops.

I find it amazing that the GCC backend manages to come up with a bunch of checks for different overflow cases and creates a vectorized inner loop.

-Johan

On 2016-12-24 02:06, hardreset wrote: > How do I get the asm output? You cannot get the assembly output from DMD since it directly outputs object code. Do get the assembly you need to use a disassembler. -- /Jacob Carlborg

On 12/24/2016 2:28 AM, Johan Engelen wrote: > Of course, but it's not codified in the source and thus the function is > different from the foreach version. > > More importantly, I think it is broken. I find it very hard to reason about the > ptrdiff_t version, because of all the potential overflows and the mixing of > unsigned and signed math. You are pedantically correct. It's possible someone would write a bizarre loop, and compiler optimizers need to take that into account when doing loop transformations. This is one of the reasons why D's array syntax and semantics is superior - it guarantees no overflow.

On Friday, 23 December 2016 at 18:33:52 UTC, Seb wrote: >> So the asm is about 40% faster than the best of the alternatives. The point being that it's the generated code not the source that's the problem. > > Has anyone considered using ldc for DMD release compilations? We considered using gdc (or ldc) to compensate for the slowdown after the ddmd transition, but nobody complained enough for that to happen ;). Also Windows users were stuck with dmd at that point. Using a proper optimizer, profiles, and LTO it was fairly simple to speedup dmd by some 20+% (IIRC). Unfortunately wasn't considered before 2.069.0, but that's one of the main reasons why we setup Travis-CI to test gdc/ldc builds. https://trello.com/c/OT6jlFNa/85-rebench-ddmd-vs-dmd-compiler-speed

On Friday, 23 December 2016 at 16:15:44 UTC, Andrei Alexandrescu wrote: > An interesting problem to look at: > > https://github.com/dlang/dmd/pull/6352 With respect to bulk operations on vectors, of course I recognize the desire to use high-level code which is portable across platforms. But I also wonder whether this is the sort of thing that is really best handled these days by using the multi-media instruction set instead of the general-purpose instruction set. There's a lot of special-purpose arithmetic that is provided by the CPU in such a context, and it seems a bit silly to ignore that possibility. Also more generally regarding low-level stuff like this, perhaps this reference might be of interest: https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0321842685

Forums