January 16, 2012
On 1/16/2012 2:06 PM, Manu wrote:
> Surely it would be possible for them to be overlapping slices?

Not allowed, treated like array bounds checking.
January 17, 2012
On Mon, 16 Jan 2012 20:25:28 +0100, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/16/2012 11:16 AM, Iain Buclaw wrote:
>>> But don't worry, I'm not planning on working on that at the moment :-)
>> Leave that sort of optimisation for the backend to handle please. ;-)
>
> Of course.
>
> I suspect Intel's compiler does that one, does gcc?
>

Thought of that too, but it's rather tough to manage slots in vector registers.
Could probably dust off Don's BLADE library.


It seems that gcc and icc limit auto-vectorisation to loops.

http://gcc.gnu.org/projects/tree-ssa/vectorization.html
http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/
January 17, 2012
On 16 January 2012 22:36, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> On 16 January 2012 21:57, Peter Alexander <peter.alexander.au@gmail.com> wrote:
>> On 16/01/12 8:56 PM, Iain Buclaw wrote:
>>>
>>> On 16 January 2012 19:25, Walter Bright<newshound2@digitalmars.com>  wrote:
>>>>
>>>> On 1/16/2012 11:16 AM, Iain Buclaw wrote:
>>>>
>>>>>>
>>>>>> But don't worry, I'm not planning on working on that at the moment :-)
>>>>>
>>>>>
>>>>> Leave that sort of optimisation for the backend to handle please. ;-)
>>>>
>>>>
>>>>
>>>> Of course.
>>>>
>>>> I suspect Intel's compiler does that one, does gcc?
>>>>
>>>
>>> There's auto-vectorisation for for(), foreach(), and foreach_reverse()
>>> loops that I have written support for.  I am not aware of GCC
>>> vectorising anything else.
>>>
>>> example:
>>>
>>> int a[256], b[256], c[256];
>>> void foo () {
>>>   for (int i=0; i<256; i++)
>>>     a[i] = b[i] + c[i];
>>> }
>>>
>>
>> Unfortunately, if the function was this:
>>
>> void foo(int[] a, int[] b, int[] c) {
>>
>>  for (int i=0; i<256; i++)
>>    a[i] = b[i] + c[i];
>> }
>>
>> Then it can't vectorize due to aliasing.
>
> Compile with -fstrict-aliasing then?
>
> I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for @safe code?)
>
>
> Regards
> --
> Iain Buclaw
>
> *(p < e ? p++ : p) = (c & 0x0f) + '0';

OK, have turned on strict aliasing by default for D2.  You should now be able to vectorise loops that use locals and parameters. :-)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
January 17, 2012
Walter:

> But don't worry, I'm not planning on working on that at the moment :-)

Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows an array is very small, it unrolls the operation in place:

int n = 5;
auto a = new int[n];
auto b = new int[n];
a[] += b[];

==>

int n = 5;
auto a = new int[n]; // a and b are dynamic arrays,
auto b = new int[n]; // but their length is easy to find at compile-time
a[0] += b[0];
a[1] += b[1];
a[2] += b[2];
a[3] += b[3];
a[4] += b[4];

Bye,
bearophile
January 17, 2012
On 17 January 2012 03:56, Iain Buclaw <ibuclaw@ubuntu.com> wrote:

> On 16 January 2012 22:36, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> > On 16 January 2012 21:57, Peter Alexander <peter.alexander.au@gmail.com>
> wrote:
> >> On 16/01/12 8:56 PM, Iain Buclaw wrote:
> >>>
> >>> On 16 January 2012 19:25, Walter Bright<newshound2@digitalmars.com>
> >>>  wrote:
> >>>>
> >>>> On 1/16/2012 11:16 AM, Iain Buclaw wrote:
> >>>>
> >>>>>>
> >>>>>> But don't worry, I'm not planning on working on that at the moment
> :-)
> >>>>>
> >>>>>
> >>>>> Leave that sort of optimisation for the backend to handle please. ;-)
> >>>>
> >>>>
> >>>>
> >>>> Of course.
> >>>>
> >>>> I suspect Intel's compiler does that one, does gcc?
> >>>>
> >>>
> >>> There's auto-vectorisation for for(), foreach(), and foreach_reverse()
> >>> loops that I have written support for.  I am not aware of GCC
> >>> vectorising anything else.
> >>>
> >>> example:
> >>>
> >>> int a[256], b[256], c[256];
> >>> void foo () {
> >>>   for (int i=0; i<256; i++)
> >>>     a[i] = b[i] + c[i];
> >>> }
> >>>
> >>
> >> Unfortunately, if the function was this:
> >>
> >> void foo(int[] a, int[] b, int[] c) {
> >>
> >>  for (int i=0; i<256; i++)
> >>    a[i] = b[i] + c[i];
> >> }
> >>
> >> Then it can't vectorize due to aliasing.
> >
> > Compile with -fstrict-aliasing then?
> >
> > I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for @safe code?)
>
> OK, have turned on strict aliasing by default for D2.  You should now be able to vectorise loops that use locals and parameters. :-)
>

What protects these ranges from being overlapping? What if they were
sourced from pointers?
Are we just to say in D that aliasing is not allowed, and 'you shouldn't do
it'? People almost never alias intentionally; it's usually the
most insidious of bugs. :/


January 17, 2012
On 17 January 2012 05:55, bearophile <bearophileHUGS@lycos.com> wrote:

> Walter:
>
> > But don't worry, I'm not planning on working on that at the moment :-)
>
> Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows an array is very small, it unrolls the operation in place:
>
> int n = 5;
> auto a = new int[n];
> auto b = new int[n];
> a[] += b[];
>
> ==>
>
> int n = 5;
> auto a = new int[n]; // a and b are dynamic arrays,
> auto b = new int[n]; // but their length is easy to find at compile-time
> a[0] += b[0];
> a[1] += b[1];
> a[2] += b[2];
> a[3] += b[3];
> a[4] += b[4];
>

If this doesn't already exist, I think it's quite important. Just last night I
was thinking I'd need to repeatedly specialise a template for a bunch of
short array lengths, for this exact reason.
Unrolling short loops must be one of the most trivial and worthwhile
optimisations...


January 17, 2012
On 16/01/12 17:51, Martin Nowak wrote:
> On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 1/15/12 12:56 AM, Walter Bright wrote:
>>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
>>> Anyhow, it's good enough now to play around with. Consider it alpha
>>> quality. Expect bugs - but make bug reports, as there's a serious lack
>>> of source code to test it with.
>>> -----------------------
>>> import core.simd;
>>>
>>> void test1a(float[4] a) { }
>>>
>>> void test1()
>>> {
>>> float[4] a = 1.2;
>>> a[] = a[] * 3 + 7;
>>> test1a(a);
>>> }
>>>
>>> void test2a(float4 a) { }
>>>
>>> void test2()
>>> {
>>> float4 a = 1.2;
>>> a = a * 3 + 7;
>>> test2a(a);
>>> }
>>
>> These two functions should have the same speed. The function that
>> ought to be slower is:
>>
>> void test1()
>> {
>> float[5] a = 1.2;
>> float[] b = a[1 .. $];
>> b[] = b[] * 3 + 7;
>> test1a(a);
>> }
>>
>>
>> Andrei
>
> Unfortunately druntime's array ops are a mess and fail
> to speed up anything below 16 floats.
> Additionally there is overhead for a function call and
> they have to check alignment at runtime.
>
> martin

Yes. The structural problem in the compiler is that array ops get turned into function calls far too early. It happens in the semantic pass, but it shouldn't happen in the front-end at all -- it should be done in the glue layer, at the beginning of code generation.

Incidentally, this is the reason that CTFE doesn't work with array ops.



January 17, 2012
On Tue, 17 Jan 2012 09:20:43 +0100, Manu <turkeyman@gmail.com> wrote:

> On 17 January 2012 05:55, bearophile <bearophileHUGS@lycos.com> wrote:
>
>> Walter:
>>
>> > But don't worry, I'm not planning on working on that at the moment :-)
>>
>> Until better optimizations are implemented, I see a "simple" optimization
>> for vector ops. When the compiler knows an array is very small, it unrolls
>> the operation in place:
>>
>> int n = 5;
>> auto a = new int[n];
>> auto b = new int[n];
>> a[] += b[];
>>
>> ==>
>>
>> int n = 5;
>> auto a = new int[n]; // a and b are dynamic arrays,
>> auto b = new int[n]; // but their length is easy to find at compile-time
>> a[0] += b[0];
>> a[1] += b[1];
>> a[2] += b[2];
>> a[3] += b[3];
>> a[4] += b[4];
>>
>
> If this doesn't already exist, I think it's quite important. Just last night I
> was thinking I'd need to repeatedly specialise a template for a bunch of
> short array lengths, for this exact reason.
> Unrolling short loops must be one of the most trivial and worthwhile
> optimisations...

If the compiler knows the length is a compile-time constant,
you could use a static foreach.
January 17, 2012
On Tue, 17 Jan 2012 09:42:12 +0100, Don Clugston <dac@nospam.com> wrote:

> On 16/01/12 17:51, Martin Nowak wrote:
>> On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> wrote:
>>
>>> On 1/15/12 12:56 AM, Walter Bright wrote:
>>>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
>>>> Anyhow, it's good enough now to play around with. Consider it alpha
>>>> quality. Expect bugs - but make bug reports, as there's a serious lack
>>>> of source code to test it with.
>>>> -----------------------
>>>> import core.simd;
>>>>
>>>> void test1a(float[4] a) { }
>>>>
>>>> void test1()
>>>> {
>>>> float[4] a = 1.2;
>>>> a[] = a[] * 3 + 7;
>>>> test1a(a);
>>>> }
>>>>
>>>> void test2a(float4 a) { }
>>>>
>>>> void test2()
>>>> {
>>>> float4 a = 1.2;
>>>> a = a * 3 + 7;
>>>> test2a(a);
>>>> }
>>>
>>> These two functions should have the same speed. The function that
>>> ought to be slower is:
>>>
>>> void test1()
>>> {
>>> float[5] a = 1.2;
>>> float[] b = a[1 .. $];
>>> b[] = b[] * 3 + 7;
>>> test1a(a);
>>> }
>>>
>>>
>>> Andrei
>>
>> Unfortunately druntime's array ops are a mess and fail
>> to speed up anything below 16 floats.
>> Additionally there is overhead for a function call and
>> they have to check alignment at runtime.
>>
>> martin
>
> Yes. The structural problem in the compiler is that array ops get turned into function calls far too early. It happens in the semantic pass, but it shouldn't happen in the front-end at all -- it should be done in the glue layer, at the beginning of code generation.
>
> Incidentally, this is the reason that CTFE doesn't work with array ops.
>
>
>
Oh, I was literally speaking of the runtime implementation.
It should loop with 4 XMM regs, then continue with 1 XMM reg,
and finish up scalar.
Right now it quantizes on 16 floats and does the remaining
ones scalar, which is really bad for very small arrays.

I was about to rewrite it at some point.
https://gist.github.com/1235470

I think having a runtime template is better than
doing this massive extern(C) interface that has
to be kept in sync. That would also open up room
for better CTFE integration.

martin
January 17, 2012
On 1/17/2012 12:17 AM, Manu wrote:
> What protects these ranges from being overlapping?

A runtime check, like array bounds checking.