January 16, 2012
On 1/16/2012 2:06 PM, Manu wrote:
> Surely it would be possible for them to be overlapping slices?

Not allowed, treated like array bounds checking.
January 17, 2012
On Mon, 16 Jan 2012 20:25:28 +0100, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/16/2012 11:16 AM, Iain Buclaw wrote:
>>> But don't worry, I'm not planning on working on that at the moment :-)
>> Leave that sort of optimisation for the backend to handle please. ;-)
>
> Of course.
>
> I suspect Intel's compiler does that one, does gcc?
>

Thought of that too, but it's rather tough to manage slots in vector registers.
Could probably dust off Don's BLADE library.


It seems that gcc and icc limit auto-vectorisation to loops.

http://gcc.gnu.org/projects/tree-ssa/vectorization.html
http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/
January 17, 2012
On 16 January 2012 22:36, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> On 16 January 2012 21:57, Peter Alexander <peter.alexander.au@gmail.com> wrote:
>> On 16/01/12 8:56 PM, Iain Buclaw wrote:
>>>
>>> On 16 January 2012 19:25, Walter Bright<newshound2@digitalmars.com>  wrote:
>>>>
>>>> On 1/16/2012 11:16 AM, Iain Buclaw wrote:
>>>>
>>>>>>
>>>>>> But don't worry, I'm not planning on working on that at the moment :-)
>>>>>
>>>>>
>>>>> Leave that sort of optimisation for the backend to handle please. ;-)
>>>>
>>>>
>>>>
>>>> Of course.
>>>>
>>>> I suspect Intel's compiler does that one, does gcc?
>>>>
>>>
>>> There's auto-vectorisation for for(), foreach(), and foreach_reverse()
>>> loops that I have written support for.  I am not aware of GCC
>>> vectorising anything else.
>>>
>>> example:
>>>
>>> int a[256], b[256], c[256];
>>> void foo () {
>>>   for (int i=0; i<256; i++)
>>>     a[i] = b[i] + c[i];
>>> }
>>>
>>
>> Unfortunately, if the function was this:
>>
>> void foo(int[] a, int[] b, int[] c) {
>>
>>  for (int i=0; i<256; i++)
>>    a[i] = b[i] + c[i];
>> }
>>
>> Then it can't vectorize due to aliasing.
>
> Compile with -fstrict-aliasing then?
>
> I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for @safe code?)
>
>
> Regards
> --
> Iain Buclaw
>
> *(p < e ? p++ : p) = (c & 0x0f) + '0';

OK, have turned on strict aliasing by default for D2.  You should now be able to vectorise loops that use locals and parameters. :-)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
January 17, 2012
Walter:

> But don't worry, I'm not planning on working on that at the moment :-)

Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows an array is very small, it unrolls the operation in place:

int n = 5;
auto a = new int[n];
auto b = new int[n];
a[] += b[];

==>

int n = 5;
auto a = new int[n]; // a and b are dynamic arrays,
auto b = new int[n]; // but their length is easy to find at compile-time
a[0] += b[0];
a[1] += b[1];
a[2] += b[2];
a[3] += b[3];
a[4] += b[4];

Bye,
bearophile
January 17, 2012
On 17 January 2012 03:56, Iain Buclaw <ibuclaw@ubuntu.com> wrote:

> On 16 January 2012 22:36, Iain Buclaw <ibuclaw@ubuntu.com> wrote:
> > On 16 January 2012 21:57, Peter Alexander <peter.alexander.au@gmail.com>
> wrote:
> >> On 16/01/12 8:56 PM, Iain Buclaw wrote:
> >>>
> >>> On 16 January 2012 19:25, Walter Bright<newshound2@digitalmars.com>
> >>>  wrote:
> >>>>
> >>>> On 1/16/2012 11:16 AM, Iain Buclaw wrote:
> >>>>
> >>>>>>
> >>>>>> But don't worry, I'm not planning on working on that at the moment
> :-)
> >>>>>
> >>>>>
> >>>>> Leave that sort of optimisation for the backend to handle please. ;-)
> >>>>
> >>>>
> >>>>
> >>>> Of course.
> >>>>
> >>>> I suspect Intel's compiler does that one, does gcc?
> >>>>
> >>>
> >>> There's auto-vectorisation for for(), foreach(), and foreach_reverse()
> >>> loops that I have written support for.  I am not aware of GCC
> >>> vectorising anything else.
> >>>
> >>> example:
> >>>
> >>> int a[256], b[256], c[256];
> >>> void foo () {
> >>>   for (int i=0; i<256; i++)
> >>>     a[i] = b[i] + c[i];
> >>> }
> >>>
> >>
> >> Unfortunately, if the function was this:
> >>
> >> void foo(int[] a, int[] b, int[] c) {
> >>
> >>  for (int i=0; i<256; i++)
> >>    a[i] = b[i] + c[i];
> >> }
> >>
> >> Then it can't vectorize due to aliasing.
> >
> > Compile with -fstrict-aliasing then?
> >
> > I could certainly play about with having this enabled by default, but I foresee there may be issues (maybe have it on for @safe code?)
>
> OK, have turned on strict aliasing by default for D2.  You should now be able to vectorise loops that use locals and parameters. :-)
>

What protects these ranges from being overlapping? What if they were
sourced from pointers?
Are we just to say in D that aliasing is not allowed, and 'you shouldn't do
it'? People almost never alias intentionally; it's usually the
most insidious of bugs. :/


January 17, 2012
On 17 January 2012 05:55, bearophile <bearophileHUGS@lycos.com> wrote:

> Walter:
>
> > But don't worry, I'm not planning on working on that at the moment :-)
>
> Until better optimizations are implemented, I see a "simple" optimization for vector ops. When the compiler knows an array is very small, it unrolls the operation in place:
>
> int n = 5;
> auto a = new int[n];
> auto b = new int[n];
> a[] += b[];
>
> ==>
>
> int n = 5;
> auto a = new int[n]; // a and b are dynamic arrays,
> auto b = new int[n]; // but their length is easy to find at compile-time
> a[0] += b[0];
> a[1] += b[1];
> a[2] += b[2];
> a[3] += b[3];
> a[4] += b[4];
>

If this doesn't already exist, I think it's quite important. Just last night I
was thinking I'd need to repeatedly specialise a template for a bunch of
short array lengths, for this exact reason.
Unrolling short loops must be one of the most trivial and worthwhile
optimisations...


January 17, 2012
On 16/01/12 17:51, Martin Nowak wrote:
> On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 1/15/12 12:56 AM, Walter Bright wrote:
>>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
>>> Anyhow, it's good enough now to play around with. Consider it alpha
>>> quality. Expect bugs - but make bug reports, as there's a serious lack
>>> of source code to test it with.
>>> -----------------------
>>> import core.simd;
>>>
>>> void test1a(float[4] a) { }
>>>
>>> void test1()
>>> {
>>> float[4] a = 1.2;
>>> a[] = a[] * 3 + 7;
>>> test1a(a);
>>> }
>>>
>>> void test2a(float4 a) { }
>>>
>>> void test2()
>>> {
>>> float4 a = 1.2;
>>> a = a * 3 + 7;
>>> test2a(a);
>>> }
>>
>> These two functions should have the same speed. The function that
>> ought to be slower is:
>>
>> void test1()
>> {
>> float[5] a = 1.2;
>> float[] b = a[1 .. $];
>> b[] = b[] * 3 + 7;
>> test1a(a);
>> }
>>
>>
>> Andrei
>
> Unfortunately druntime's array ops are a mess and fail
> to speed up anything below 16 floats.
> Additionally there is overhead for a function call and
> they have to check alignment at runtime.
>
> martin

Yes. The structural problem in the compiler is that array ops get turned into function calls far too early. It happens in the semantic pass, but it shouldn't happen in the front-end at all -- it should be done in the glue layer, at the beginning of code generation.

Incidentally, this is the reason that CTFE doesn't work with array ops.



January 17, 2012
On Tue, 17 Jan 2012 09:20:43 +0100, Manu <turkeyman@gmail.com> wrote:

> On 17 January 2012 05:55, bearophile <bearophileHUGS@lycos.com> wrote:
>
>> Walter:
>>
>> > But don't worry, I'm not planning on working on that at the moment :-)
>>
>> Until better optimizations are implemented, I see a "simple" optimization
>> for vector ops. When the compiler knows an array is very small, it unrolls
>> the operation in place:
>>
>> int n = 5;
>> auto a = new int[n];
>> auto b = new int[n];
>> a[] += b[];
>>
>> ==>
>>
>> int n = 5;
>> auto a = new int[n]; // a and b are dynamic arrays,
>> auto b = new int[n]; // but their length is easy to find at compile-time
>> a[0] += b[0];
>> a[1] += b[1];
>> a[2] += b[2];
>> a[3] += b[3];
>> a[4] += b[4];
>>
>
> If this doesn't already exist, I think it's quite important. Just last night I
> was thinking I'd need to repeatedly specialise a template for a bunch of
> short array lengths, for this exact reason.
> Unrolling short loops must be one of the most trivial and worthwhile
> optimisations...

If the compiler knows the length is a compile-time constant,
you could use a static foreach.
January 17, 2012
On Tue, 17 Jan 2012 09:42:12 +0100, Don Clugston <dac@nospam.com> wrote:

> On 16/01/12 17:51, Martin Nowak wrote:
>> On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> wrote:
>>
>>> On 1/15/12 12:56 AM, Walter Bright wrote:
>>>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
>>>> Anyhow, it's good enough now to play around with. Consider it alpha
>>>> quality. Expect bugs - but make bug reports, as there's a serious lack
>>>> of source code to test it with.
>>>> -----------------------
>>>> import core.simd;
>>>>
>>>> void test1a(float[4] a) { }
>>>>
>>>> void test1()
>>>> {
>>>> float[4] a = 1.2;
>>>> a[] = a[] * 3 + 7;
>>>> test1a(a);
>>>> }
>>>>
>>>> void test2a(float4 a) { }
>>>>
>>>> void test2()
>>>> {
>>>> float4 a = 1.2;
>>>> a = a * 3 + 7;
>>>> test2a(a);
>>>> }
>>>
>>> These two functions should have the same speed. The function that
>>> ought to be slower is:
>>>
>>> void test1()
>>> {
>>> float[5] a = 1.2;
>>> float[] b = a[1 .. $];
>>> b[] = b[] * 3 + 7;
>>> test1a(a);
>>> }
>>>
>>>
>>> Andrei
>>
>> Unfortunately druntime's array ops are a mess and fail
>> to speed up anything below 16 floats.
>> Additionally there is overhead for a function call and
>> they have to check alignment at runtime.
>>
>> martin
>
> Yes. The structural problem in the compiler is that array ops get turned into function calls far too early. It happens in the semantic pass, but it shouldn't happen in the front-end at all -- it should be done in the glue layer, at the beginning of code generation.
>
> Incidentally, this is the reason that CTFE doesn't work with array ops.
>
>
>
Oh, I was literally speaking of the runtime implementation.
It should loop with 4 XMM regs, then continue with 1 XMM reg,
and finish up scalar.
Right now it quantizes on 16 floats and does the remaining
ones scalar, which is really bad for very small arrays.

I was about to rewrite it at some point.
https://gist.github.com/1235470

I think having a runtime template is better than
doing this massive extern(C) interface that has
to be kept in sync. That would also open up room
for better CTFE integration.

martin
January 17, 2012
On 1/17/2012 12:17 AM, Manu wrote:
> What protects these ranges from being overlapping?

A runtime check, like array bounds checking.