
# Performance optimization SSE vector dot/normalize

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

20 replies to this topic

### #1 lipsryme (Members)

Posted 20 September 2013 - 09:26 AM

Hey guys, I've done a "quick" first implementation of a vector normalize and dot product using SSE intrinsics and was wondering if there's still something that could be optimized further.

Here's my code:

```cpp
__declspec(align(16))
struct Vec4
{
    float x, y, z, w;

    inline const Vec4 Normalize()
    {
        // copy the components into a 128-bit register
        __m128 tmp = _mm_set_ps(w, z, y, x);

        // 0x7F = 0111 1111: multiply x, y, z (not w) and
        // broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(tmp, tmp, 0x7F);

        // compute rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);

        // vec * rsqrt(dot(vec, vec))
        tmp = _mm_mul_ps(tmp, dp);

        Vec4 vec;
        union { __m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = tmp;

        vec.x = uf.f[0];
        vec.y = uf.f[1];
        vec.z = uf.f[2];
        vec.w = 1.0f;

        return vec;
    }

    inline const float Dot(const Vec4 &v2)
    {
        // copy both vectors into 128-bit registers
        __m128 a = _mm_set_ps(w, z, y, x);
        __m128 b = _mm_set_ps(v2.w, v2.z, v2.y, v2.x);

        // 0x7F = 0111 1111: multiply x, y, z (not w) and
        // broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(a, b, 0x7F);

        union { __m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = dp;

        return uf.f[0];
    }
};
```

### #2 Hodgman (Moderators)

Posted 20 September 2013 - 10:59 AM

I don't know if this is the case here, but I've often seen that when you've got lots of functions like this that move data back and forth between float registers and SSE registers, the shuffling ends up wasting so much time that SSE becomes the same speed as (or slower than) regular float code!

I'd try to make your Vec4 class store its members as an __m128.

In some engines I've even seen a floatInVec class (or Vec1), which internally is an __m128 but to the user appears like a float. This lets you keep more of your math code working with SSE data rather than juggling between float and SSE.

Regarding the Normalize and Dot functions: considering that you're really only operating on a Vec3, if you're running these over large data sets it might be worthwhile making an SoAVec3 class (alongside the normal Vec4 class) which stores 4 x values, then 4 y values, then 4 z values in 3 __m128 variables. This reduces the total working size of the data set and often lets you write more optimized (shorter) functions.
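
The SoA idea might look something like this (a minimal sketch with illustrative names, not code from the thread; it sticks to SSE1/SSE2 intrinsics, so no dpps is needed):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cmath>

// Four 3D vectors stored channel-by-channel: all four x's in one
// register, then the y's, then the z's (structure-of-arrays layout).
struct SoAVec3
{
    __m128 x, y, z;

    // Normalize all four vectors at once. Each SSE lane holds an
    // independent vector, so no shuffles or dot-product instruction
    // are needed.
    void Normalize()
    {
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                            _mm_mul_ps(y, y)),
                                 _mm_mul_ps(z, z));
        __m128 inv = _mm_rsqrt_ps(len2); // ~12-bit approximation
        x = _mm_mul_ps(x, inv);
        y = _mm_mul_ps(y, inv);
        z = _mm_mul_ps(z, inv);
    }
};
```

Because rsqrtps is only approximate, the resulting lengths are accurate to roughly 3 decimal digits; a Newton-Raphson refinement step can be added if that matters.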

### #3 achild (Members)

Posted 20 September 2013 - 11:15 AM

[deleted - stupid post suggesting the use of the dot product intrinsic when it is clearly already being used]

Edited by achild, 20 September 2013 - 11:31 AM.

### #4 lipsryme (Members)

Posted 20 September 2013 - 11:25 AM

@achild As you can see above I'm already using that intrinsic for the dot product

@hodgman I was thinking about that...I'll give it a try and see if it gives me some boost

### #5 achild (Members)

Posted 20 September 2013 - 11:29 AM

> @achild As you can see above I'm already using that intrinsic for the dot product
>
> @hodgman I was thinking about that...I'll give it a try and see if it gives me some boost

Wow. Not much you can say to that. Talk about not paying attention.

### #6 AllEightUp (Moderators)

Posted 20 September 2013 - 01:40 PM

As Hodge says, due to the load and then store costs, the code is likely losing so much performance that it may not be giving you any gains at all. Having said that, get rid of the union bits and replace them with the appropriate SSE calls. The union trick causes the compiler to flush the registers back to memory just so you can access them to put them in different memory. Here's the modified code (comments removed for brevity):

```cpp
__declspec(align(16))
struct Vec4
{
    float x, y, z, w;

    inline const Vec4 Normalize()
    {
        __m128 tmp = _mm_load_ps(&x);
        __m128 dp = _mm_dp_ps(tmp, tmp, 0x7F);
        dp = _mm_rsqrt_ps(dp);
        tmp = _mm_mul_ps(tmp, dp);

        Vec4 vec;
        _mm_store_ps(&vec.x, tmp);
        vec.w = 1.0f;
        return vec;
    }

    inline const float Dot(const Vec4 &v2)
    {
        __m128 a = _mm_load_ps(&x);
        __m128 b = _mm_load_ps(&v2.x);
        __m128 dp = _mm_dp_ps(a, b, 0x7F);

        float result;
        _mm_store_ss(&result, dp);
        return result;
    }
};
```
That should be a bit faster since it removes the unneeded register flushes and leverages aligned loads, given that the class is 16-byte aligned. In effect, even though you are not using an __m128 for storage in the class, this treats the class as one anyway.

NOTE: these two functions are also full of wait states due to the latencies of the operations being performed. If you are doing batches of normalizations/dot products, running 2 or 4 at a time interleaved will effectively triple the throughput. Given SSE4, you don't actually need the SoA data reorg Hodge suggests; you just need to keep more than one in flight at a time.
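
The "more than one in flight" idea could be sketched like this (illustrative code, not from the thread; dot3 is a hypothetical helper built from SSE1/SSE2 shuffles rather than dpps, so it compiles without SSE4):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cmath>

// Hypothetical helper: x*x + y*y + z*z broadcast to all four lanes,
// built from shuffles/adds so it works without SSE4's dpps.
static inline __m128 dot3(__m128 v)
{
    __m128 m = _mm_mul_ps(v, v);
    __m128 s = _mm_add_ss(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)));
    s        = _mm_add_ss(s, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2)));
    return _mm_shuffle_ps(s, s, _MM_SHUFFLE(0, 0, 0, 0));
}

// Two independent normalize chains interleaved: while chain A is
// waiting on its multiply/rsqrt latency, the CPU can make progress
// on chain B, and vice versa. (Note: unlike the 0x7F version in the
// thread, w gets scaled here too.)
static inline void Normalize2(__m128 &a, __m128 &b)
{
    __m128 da = dot3(a);
    __m128 db = dot3(b);
    da = _mm_rsqrt_ps(da);
    db = _mm_rsqrt_ps(db);
    a  = _mm_mul_ps(a, da);
    b  = _mm_mul_ps(b, db);
}
```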

Edited by AllEightUp, 20 September 2013 - 01:52 PM.

### #7 lipsryme (Members)

Posted 20 September 2013 - 02:09 PM

I just tried your updated code AllEightUp and it actually got slower

Edited by lipsryme, 20 September 2013 - 02:11 PM.

### #8 AllEightUp (Moderators)

Posted 20 September 2013 - 02:16 PM

> I just tried your updated code AllEightUp and it actually got slower

Erm, hmmmm... That doesn't seem possible unless the compiler is making a mess, which also seems unlikely. That code *should* be about as close to the fewest cycles as possible without your vector class itself containing an __m128 (so you could pass by register instead of by reference). I'll have to play with it myself a bit and see if I missed something. What compiler are you using to test the code? And, obviously, is it a release build?

### #9 lipsryme (Members)

Posted 20 September 2013 - 02:22 PM

Yea, release build, and the compiler is MSVC (Visual Studio 2008).

edit: Doing it like this makes it a little bit faster than before:

```cpp
// 4-component vector class using SIMD instructions
__declspec(align(16))
struct Vec4
{
    __m128 v;

    Vec4() {}

    Vec4(float x, float y, float z, float w)
    {
        v = _mm_set_ps(w, z, y, x);
    }

    inline const float X()
    {
        union { __m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[0];
    }

    inline const float Y()
    {
        union { __m128 v; float f[4]; } uf;
        uf.v = v;
        return uf.f[1];
    }

    inline const float Z()
    {
        union { __m128 v; float f[4]; } uf;
        uf.v = v;
        return uf.f[2];
    }

    inline const void Normalize()
    {
        // 0x7F = 0111 1111: multiply x, y, z (not w) and
        // broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(v, v, 0x7F);

        // compute rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);

        // vec * rsqrt(dot(vec, vec))
        v = _mm_mul_ps(v, dp);
    }

    inline const float Dot(const Vec4 &v2) const
    {
        __m128 dp = _mm_dp_ps(v, v2.v, 0x7F);

        float result;
        _mm_store_ss(&result, dp);
        return result;
    }
};
```

Edited by lipsryme, 20 September 2013 - 02:48 PM.

### #10 AllEightUp (Moderators)

Posted 20 September 2013 - 02:34 PM

Oh... VC2008 was exceptionally bad with SIMD; it could be doing some really stupid stuff behind the scenes. 2010 and 2012 have massively improved SIMD handling. You might set a breakpoint in the function, switch to disassembly view, and see if it is doing anything obviously stupid. Drop the disassembly here and I can take a peek at it too.

### #11 lipsryme (Members)

Posted 20 September 2013 - 02:54 PM

edit: holy cow VC2012 is almost twice as fast...problem is I'm forced to use 2008 for this project.

Here's the disassembly from the new code:

```
inline const void Normalize()
{
00413480  push        ebx
00413481  mov         ebx,esp
00413483  sub         esp,8
00413486  and         esp,0FFFFFFF0h
0041348C  push        ebp
0041348D  mov         ebp,dword ptr [ebx+4]
00413490  mov         dword ptr [esp+4],ebp
00413494  mov         ebp,esp
00413496  sub         esp,158h
0041349C  push        esi
0041349D  push        edi
0041349E  push        ecx
0041349F  lea         edi,[ebp-158h]
004134A5  mov         ecx,56h
004134AA  mov         eax,0CCCCCCCCh
004134AF  rep stos    dword ptr es:[edi]
004134B1  pop         ecx
004134B2  mov         eax,dword ptr [___security_cookie (42F0D0h)]
004134B7  xor         eax,ebp
004134B9  mov         dword ptr [ebp-4],eax
004134BC  mov         dword ptr [ebp-0Ch],ecx
// 0x7F = 0111 1111 ~ means we don't want the w-component multiplied
// and the result written to all 4 components
__m128 dp = _mm_dp_ps(v, v, 0x7F);
004134BF  mov         eax,dword ptr [ebp-0Ch]
004134C2  movaps      xmm0,xmmword ptr [eax]
004134C5  mov         ecx,dword ptr [ebp-0Ch]
004134C8  movaps      xmm1,xmmword ptr [ecx]
004134CB  dpps        xmm1,xmm0,7Fh
004134D1  movaps      xmmword ptr [ebp-150h],xmm1
004134D8  movaps      xmm0,xmmword ptr [ebp-150h]
004134DF  movaps      xmmword ptr [ebp-30h],xmm0

// compute rsqrt of the dot product
dp = _mm_rsqrt_ps(dp);
004134E3  rsqrtps     xmm0,xmmword ptr [ebp-30h]
004134E7  movaps      xmmword ptr [ebp-130h],xmm0
004134EE  movaps      xmm0,xmmword ptr [ebp-130h]
004134F5  movaps      xmmword ptr [ebp-30h],xmm0

// vec * rsqrt(dot(vec, vec))
v = _mm_mul_ps(v, dp);
004134F9  movaps      xmm0,xmmword ptr [ebp-30h]
004134FD  mov         eax,dword ptr [ebp-0Ch]
00413500  movaps      xmm1,xmmword ptr [eax]
00413503  mulps       xmm1,xmm0
00413506  movaps      xmmword ptr [ebp-110h],xmm1
0041350D  mov         ecx,dword ptr [ebp-0Ch]
00413510  movaps      xmm0,xmmword ptr [ebp-110h]
00413517  movaps      xmmword ptr [ecx],xmm0
}
0041351A  pop         edi
0041351B  pop         esi
0041351C  mov         ecx,dword ptr [ebp-4]
0041351F  xor         ecx,ebp
00413526  mov         esp,ebp
00413528  pop         ebp
00413529  mov         esp,ebx
0041352B  pop         ebx
0041352C  ret
```

Edited by lipsryme, 20 September 2013 - 02:58 PM.

### #12 AllEightUp (Moderators)

Posted 20 September 2013 - 03:07 PM

Eef, there is bad and then there is *that* holy disaster. VC is completely destroying the function, moving and flushing the data multiple times for no apparent reason. Are you sure you are patched all the way up, have the processor pack, and all that? It was bad, sure, but that is absolutely pitiful. I'll post my assembly output for the same thing in a minute, as soon as I get a release build with symbols.

Here's the VC2012 codegen:

```
00B81A06  movaps      xmm1,xmmword ptr [__xmm@0000000040400000400000003f800000 (0BDF4C0h)]
00B81A0D  movaps      xmm0,xmm1
00B81A10  dpps        xmm0,xmm1,77h
00B81A16  sqrtps      xmm0,xmm0
00B81A19  movss       xmm2,dword ptr [__real@3c23d70a (0BDF384h)]
00B81A21  divps       xmm1,xmm0
```

Just a bit more reasonable.

That misses the store back to the wrapper because my code passes by register, and this is in a unit test, so I had to extract the code from the surrounding setup/teardown. Oh, and it uses the full sqrt and div due to needing a bit more accuracy in the calling code.
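
As an aside, a common middle ground between the fast-but-rough rsqrtps and a full sqrtps/divps is one Newton-Raphson refinement step, which brings the ~12-bit estimate close to full single precision (a sketch under my own naming, not code from the thread):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cmath>

// Refine y = rsqrtps(x) with one Newton-Raphson iteration:
//   y' = y * (3 - x*y*y) / 2
static inline __m128 rsqrt_nr(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y   = _mm_rsqrt_ps(x);                 // ~12-bit estimate
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y)); // x*y*y, close to 1
    return _mm_mul_ps(_mm_mul_ps(half, y),
                      _mm_sub_ps(three, xyy));    // refined estimate
}
```

This costs three extra multiplies and a subtract per vector, which is still typically cheaper than sqrtps followed by divps.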

Edited by AllEightUp, 20 September 2013 - 03:24 PM.

### #13 lipsryme (Members)

Posted 20 September 2013 - 03:49 PM

Just installed SP1 -> still the same...

### #14 AllEightUp (Moderators)

Posted 20 September 2013 - 03:57 PM

Unfortunately then, given that codegen, you might want to go back to the FPU; it likely couldn't be any slower...

Though I suppose you could try actual inline assembly to avoid the codegen. Hmm...

Edited by AllEightUp, 20 September 2013 - 04:03 PM.

### #15 lipsryme (Members)

Posted 20 September 2013 - 04:04 PM

Well, the most recent code I've posted here is now a little bit faster than the first one up there, so it's already a small improvement.

Edited by lipsryme, 20 September 2013 - 04:07 PM.

### #16 Hodgman (Moderators)

Posted 20 September 2013 - 07:54 PM

With VC2008, go into your project properties -> C/C++ and set:

- Enable Enhanced Instruction Set to Streaming SIMD Extensions 2 (`/arch:SSE2`)
- Floating Point Model to Fast (`/fp:fast`)

I've found that it produces much less ridiculous float code with these options than with the defaults.

### #17 lipsryme (Members)

Posted 21 September 2013 - 03:39 AM

Already had /fp:fast, but yeah, it got a little faster now (almost 15 percent faster than before). Thanks!

Edited by lipsryme, 21 September 2013 - 03:39 AM.

### #18 DT.... (Members)

Posted 22 September 2013 - 02:45 AM

http://www.agner.org/optimize/

### #19 momotte (Members)

Posted 09 October 2013 - 11:14 AM

> edit: holy cow VC2012 is almost twice as fast...problem is I'm forced to use 2008 for this project.
>
> Here's the disassembly from the new code:
>
> [same disassembly as quoted in full in post #11]

This looks awfully like a debug build.

### #20 Nypyren (Members)

Posted 09 October 2013 - 01:46 PM

Make sure to thoroughly check your compile settings. The Debug/Release switch won't necessarily give you the results you'd expect if you (or someone else on your team) have disabled the relevant optimizations in the Release build's settings.
