Thread overview: Vector performance
January 10, 2012
Just thought I might share a real-life case study today. There's been a lot of talk of SIMD stuff lately; some people might be interested.

Working on an Android product today, I noticed the matrix library was
burning a ridiculous amount of our frame time.
The disassembly looked like pretty normal ARM float code, so after rewriting
a couple of the key routines to use the VFPU (carefully), our key device
moved from 19fps -> 34fps (limited at 30, so we can now ship).
The Galaxy S2 is now running at 170fps, and devices we previously considered
unviable can now actually get a release! Most devices saw around a 25-45%
speed improvement.

Imagine if all the vector code throughout was using the vector hardware
nicely, and not just one or two key functions...
If we get the API right (intuitively encouraging proper usage and
disallowing inefficient operations), it'll make a big difference!
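To give a feel for the kind of rewrite involved, here's a hedged sketch (not the shipped code, and the `__vector`/`float4` syntax wasn't final at the time) of a column-major 4x4 matrix * vector transform kept entirely in 128-bit vector registers, instead of 16 scalar multiplies and adds:

```d
import core.simd;

// A sketch of a vectorized matrix * vector transform: each column is
// scaled by the broadcast of one input component and accumulated, so
// the whole thing stays in SIMD registers.
float4 transform(ref const float4[4] cols, float4 v)
{
    float4 x = v.array[0];  // assigning a scalar broadcasts it to all lanes
    float4 y = v.array[1];
    float4 z = v.array[2];
    float4 w = v.array[3];
    return cols[0] * x + cols[1] * y + cols[2] * z + cols[3] * w;
}
```

The scalar version the compiler generates from a plain `float x, y, z, w` struct does this one component at a time; the win comes from keeping everything in vector registers between operations.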


January 10, 2012
Manu:

> Imagine if all vector code throughout was using the vector hardware nicely, and not just one or 2 key functions...

Is Walter adding types/ops for the 256-bit YMM registers too? (AVX2 is not here yet, but AVX is.)

Bye,
bearophile
January 10, 2012
On 10 January 2012 16:31, bearophile <bearophileHUGS@lycos.com> wrote:

> Manu:
>
> > Imagine if all vector code throughout was using the vector hardware
> nicely,
> > and not just one or 2 key functions...
>
> Is Walter adding types/ops for 256 bit YMM registers too? (AVX2 is not
> here yet, but AVX is).
>

Eventually.
I don't think we need to do that until we have gotten the API right though.


January 10, 2012
On 1/10/2012 6:39 AM, Manu wrote:
> On 10 January 2012 16:31, bearophile <bearophileHUGS@lycos.com
> <mailto:bearophileHUGS@lycos.com>> wrote:
>
>     Manu:
>
>      > Imagine if all vector code throughout was using the vector hardware nicely,
>      > and not just one or 2 key functions...
>
>     Is Walter adding types/ops for 256 bit YMM registers too? (AVX2 is not here
>     yet, but AVX is).
>
>
> Eventually.
> I don't think we need to do that until we have gotten the API right though.

Right. We'll see how the 128 bit SIMD works out before doing the work to extend it.
January 11, 2012
On Tuesday, 10 January 2012 at 14:14:41 UTC, Manu wrote:
> Just thought I might share a real-life case study today. Been a lot of talk
> of SIMD stuff, some people might be interested.
>
> Working on an android product today, I noticed the matrix library was
> burning a ridiculous amount of our frame time.
> The disassembly looked like pretty normal ARM float code, so rewriting a
> couple of the key routines to use the VFPU (carefully), our key device
> moved from 19fps -> 34fps (limited at 30, we can now ship).
> GalaxyS 2 is now running at 170fps, and devices we previously considered
> un-viable can now actually get a release! .. Most devices saw around 25-45%
> speed improvement.
>
> Imagine if all vector code throughout was using the vector hardware nicely,
> and not just one or 2 key functions...
> Getting the API right (intuitively encouraging proper usage and disallowing
> inefficient operations), it'll make a big difference!

Wow, impressive difference.

In the future, how will [your idea of] D's SIMD vector libraries affect my math libraries? Will I simply replace:

   struct Vector4(T) {
       T x, y, z, w;
   }

with something like:

   struct Vector4(T) {
       __vector(T[4]) values;
   }

or will std.simd automatically provide a full range of vector operations (normalize, dot, cross, etc.) like Mono.Simd? I can't help but hope for the latter; even if it does make my current efforts redundant, it would definitely be a benefit to future D pioneers.
January 11, 2012
On 11 January 2012 02:47, F i L <witte2008@gmail.com> wrote:

> On Tuesday, 10 January 2012 at 14:14:41 UTC, Manu wrote:
>
>> Just thought I might share a real-life case study today. Been a lot of
>> talk
>> of SIMD stuff, some people might be interested.
>>
>> Working on an android product today, I noticed the matrix library was
>> burning a ridiculous amount of our frame time.
>> The disassembly looked like pretty normal ARM float code, so rewriting a
>> couple of the key routines to use the VFPU (carefully), our key device
>> moved from 19fps -> 34fps (limited at 30, we can now ship).
>> GalaxyS 2 is now running at 170fps, and devices we previously considered
>> un-viable can now actually get a release! .. Most devices saw around
>> 25-45%
>> speed improvement.
>>
>> Imagine if all vector code throughout was using the vector hardware
>> nicely,
>> and not just one or 2 key functions...
>> Getting the API right (intuitively encouraging proper usage and
>> disallowing
>> inefficient operations), it'll make a big difference!
>>
>
> Wow, impressive difference.
>
> In the future, how will [your idea of] D's SIMD vector libraries effect my math libraries? Will I simply replace:
>
>   struct Vector4(T) {
>       T x, y, z, w;
>   }
>
> with something like:
>
>   struct Vector4(T) {
>       __vector(T[4]) values;
>   }
>

This is too simple an example, but yes, that's basically the idea. Do you have some code with more complex operations?


> or will std.simd automatically provide a full range of vector operations (normalize, dot, cross, etc) like mono.simd? I can't help but hope for the latter, even if it does make my current efforts redundant, it would defiantly be a benefit to future D pioneers.
>

Yes, the lib would supply standard operations, probably even a matrix type or two.
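As a hypothetical sketch of the kind of helpers such a library might supply (the names and signatures here are my assumptions, not an actual std.simd API):

```d
import core.simd;
import std.math : sqrt;

// Hypothetical helpers of the sort a SIMD math lib might provide.
float dot(float4 a, float4 b)
{
    float4 m = a * b;  // lane-wise multiply
    return m.array[0] + m.array[1] + m.array[2] + m.array[3];
}

float4 normalize(float4 v)
{
    float4 inv = 1.0f / sqrt(dot(v, v));  // broadcast reciprocal length
    return v * inv;
}
```

A real implementation would keep the horizontal sum and the reciprocal square root in vector instructions too; this just shows the shape of the interface.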


January 11, 2012
Manu wrote:
> Yes the lib would supply standard operations, probably even a matrix type or 2.

Okay cool. That's basically what I wanted to know. However, I'm still wondering exactly how flexible these libraries will be.

> Have some code of more complex operations?

My main concern is with my "transition" objects. Example:

   struct Transition(T) {
       T value, start, target;
       alias value this;

       void update(U)(U iteration) {
           value = start + ((target - start) * iteration);
       }
   }


   struct Vector4(T) {
       T x, y, z, w;

       auto abs() { ... }
       auto dot() { ... }
       auto norm() { ... }
       // etc...

       static if (isTransition!T) {
           void update(U)(U iteration) {
               x.update(iteration);
               y.update(iteration);
               z.update(iteration);
               w.update(iteration);
           }
       }
   }


   void main() {
       // Simple transition vector
       auto tranVec = Transition!(Vector4!float)();
       tranVec.target = Vector4!float(50f, 36f);
       tranVec.update(0.5f);

       // Or transition per channel
       auto vecTran = Vector4!(Transition!float)();
       vecTran.x.target = 50f;
       vecTran.y.target = 36f;
       vecTran.update(0.5f);
   }

I could make a free function "auto Linear(U)(U start, U target)", but it's best to keep things in object-oriented containers, IMO. I've illustrated a simple linear transition here, but the goal is to make many different transition types: Bezier, EaseIn, Circular, Bounce, etc., and continuous/physics ones like SmoothLookAt, Giggly, Shaky, etc.
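Each of those types can reuse the same shape as the linear transition, with only the easing of the parameter changing. A sketch (the quadratic curve here is my assumption of what an EaseIn would do):

```d
// A non-linear transition slotting into the same pattern as the linear
// one: only the easing applied to 't' changes.
struct EaseIn(T)
{
    T value, start, target;
    alias value this;

    void update(U)(U t)
    {
        auto eased = t * t;  // quadratic ease-in: slow start, fast finish
        value = start + ((target - start) * eased);
    }
}
```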

My matrix code also looks something like:

   struct Matrix4(T)
    if (isVector!T || isTransitionOfVector!T) {
       T x, y, z, w;
   }

So Transitions potentially work with matrices in some areas. I'm still new to Quaternion math, but I'm guessing these might be able to apply there as well.

So my main concern is how SIMD will affect this sort of flexibility, or if I'm going to have to rethink my whole model here to accommodate SSE operations. SIMD registers are usually 128-bit, right? So making a Vector4!double doesn't really work... unless it was something like:

   struct Vector4(T) {
       version (SIMD_128) {
           static if (T.sizeof == 4) {
               __v128 xyzw;
           }
           else static if (T.sizeof == 8) {
               __v128 xy;
               __v128 zw;
           }
       }
       version (SIMD_256) {
           // ...
       }
   }

Of course, that would obviously complicate the method code quite a bit. IDK, your thoughts?
January 11, 2012
On 12 January 2012 01:15, F i L <witte2008@gmail.com> wrote:

> Manu wrote:
>
>> Yes the lib would supply standard operations, probably even a matrix type or 2.
>>
>
> Okay cool. That's basically what I wanted to know. However, I'm still wondering exactly how flexible these libraries will be.


Define 'flexible'?
Probably not very flexible, they will be fast!


> Have some code of more complex operations?
>>
>
> My main concern is with my "transition" objects. Example:
>
>   struct Transition(T) {
>       T value, start, target;
>       alias value this;
>
>       void update(U)(U iteration) {
>           value = start + ((target - start) * iteration);
>
>       }
>   }
>
>
>   struct Vector4(T) {
>       T x, y, z, w;
>
>       auto abs() { ... }
>       auto dot() { ... }
>       auto norm() { ... }
>       // ect...
>
>       static if (isTransition(T)) {
>           void update(U)(U iteration) {
>               x.update(iteration);
>               y.update(iteration);
>               z.update(iteration);
>               w.update(iteration);
>           }
>       }
>   }
>
>
>   void main() {
>       // Simple transition vector
>       auto tranVec = Transition!(Vector4!float)();
>       tranVec.target = {50f, 36f}
>       tranVec.update(0.5f);
>
>       // Or transition per channel
>       auto vecTran = Vector4!(Transition!float)();
>       vecTran.x.target = 50f;
>       vecTran.y.target = 36f;
>       vecTran.update();
>   }
>
> I could make a free function "auto Linear(U)(U start, U target)" but it's but best to keep things in object oriented containers, IMO. I've illustrated a simple linear transition here, but the goal is to make many different transition types: Bezier, EaseIn, Circular, Bounce, etc and continuous/physics one like: SmoothLookAt, Giggly, Shaky, etc.
>

I don't see any problem here. This looks trivial. It depends on basically
nothing; it might even work with what Walter has already added, and no libs
:)
I think the term 'iteration' is a bit ugly/misleading though; it should be
't' or 'time'.


> My matrix code also looks something like:
>
>   struct Matrix4(T)
>    if (isVector(T) || isTransitionOfVector(T)) {
>
>       T x, y, z, w;
>   }
>
> So Transitions potentially work with matrices in some areas. I'm still new to Quarternion math, but I'm guessing these might be able to apply there as well.
>

I would probably make a transition of matrices, rather than a matrix of vector transitions (so you can get references to the internal matrices)... but aside from that, I don't see any problems here either.


> So my main concern is how SIMD will effect this sort of flexibility, or if
> I'm going to have to rethink my whole model here to accommodate SSE operations. SIMD is usually 128 bit right? So making a Vector4!double doesn't really work... unless it was something like:
>
>   struct Vector4(T) {
>       version (SIMD_128) {
>           static if (T.sizeof == 32) {
>               __v128 xyzw;
>           }
>           else if (T.sizeof == 64) {
>               __v128 xy;
>               __v128 zw;
>           }
>       }
>       version (SIMD_256) {
>           // ...
>       }
>   }
>
> Of course, that would obviously complicate the method code quite a bit. IDK, your thoughts?
>

I think that is also possible if that's what you want to do, and I see no reason why any of these constructs wouldn't be efficient (or supported). You can probably even try it out now with what Walter has already done...


January 12, 2012
Manu wrote:
> Define 'flexible'?
> Probably not very flexible, they will be fast!

Flexible as in my examples.


> I think the term 'iteration' is a bit ugly/misleading though, it should be
> 't' or 'time'.

I've tried to come up with a better term. I guess the logic behind 'iteration' (which I got from someone else) is that an iteration of 2 gives you a value two distances from start to target, whereas 'time' (or 't') could imply any measurement, e.g. seconds or hours. Maybe 'tween', as in between? I don't know, I'll keep looking.
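That "two distances" reading is just linear extrapolation; nothing in the formula clamps the parameter to [0, 1]:

```d
// The linear update formula from earlier in the thread, as a free
// function; with t > 1 the value simply extrapolates past the target.
float lerp(float start, float target, float t)
{
    return start + (target - start) * t;
}
```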


> I would probably make a transition of matrices, rather than a matrix of
> vector transitions (so you can get references to the internal matrices)...

Well the idea is you can have both. You could even have a:

   Vector2!(Transition!(Vector4!(Transition!float))) // headache
   or something more practical...

   Vector4!(Vector4!float) // Matrix4f
   Vector4!(Transition!(Vector4!float)) // Smooth Matrix4f

Or anything like that. I should point out that my example didn't make it clear that a Matrix4!(Transition!float) would be pointless compared to Transition!(Matrix4!float) unless each Transition held its own iteration value. Example:

   struct Transition(T, bool isTimer = false) {
       T value, start, target;
       alias value this;

       static if (isTimer) {
           float time = 0, speed = 0; // floats default to NaN in D, so initialize

           void update() {
               time += speed;
               value = start + ((target - start) * time);
           }
       }
   }

That way each channel could update on its own time frame. There may even be a way to have each channel be its own separate Transition type, which could be interesting. I'm still playing with possibilities.


> I think that is also possible if that's what you want to do, and I see no
> reason why any of these constructs wouldn't be efficient (or supported).
> You can probably even try it out now with what Walter has already done...

Cool, I was unaware Walter had begun implementing SIMD operations. I'll have to build DMD and test them out. What's the syntax like right now?

I was under the impression you would be helping him here, or that you would be building the SIMD-based math libraries, or something like that. That's why I was posting my examples, asking how the std.simd lib would compare.
January 12, 2012
On 1/11/2012 4:46 PM, F i L wrote:
>> I think that is also possible if that's what you want to do, and I see no
>> reason why any of these constructs wouldn't be efficient (or supported).
>> You can probably even try it out now with what Walter has already done...
>
> Cool, I was unaware Walter had begun implementing SIMD operations. I'll have to
> build DMD and test them out. What's the syntax like right now?

It's not ready yet. Give me some more time ;-)