# Performance optimization: SSE vector dot/normalize


## Recommended Posts

Hey guys, I've done a "quick" first implementation of a vector normalize and dot product using SSE intrinsics, and was wondering if there's still something that could be optimized further.

Here's my code:

```cpp
#include <smmintrin.h>  // SSE4.1, for _mm_dp_ps

__declspec(align(16))
struct Vec4
{
    float x, y, z, w;

    inline Vec4 Normalize() const
    {
        // copy the data into a 128-bit register
        __m128 tmp = _mm_set_ps(w, z, y, x);

        // 0x7F = 0111 1111 ~ high nibble: multiply x, y and z but not w;
        // low nibble: broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(tmp, tmp, 0x7F);

        // compute rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);

        // vec * rsqrt(dot(vec, vec))
        tmp = _mm_mul_ps(tmp, dp);

        Vec4 vec;
        union { __m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = tmp;

        vec.x = uf.f[0];
        vec.y = uf.f[1];
        vec.z = uf.f[2];
        vec.w = 1.0f;

        return vec;
    }

    inline float Dot(const Vec4 &v2) const
    {
        // copy the data into 128-bit registers
        __m128 a = _mm_set_ps(w, z, y, x);
        __m128 b = _mm_set_ps(v2.w, v2.z, v2.y, v2.x);

        // 0x7F = 0111 1111 ~ high nibble: multiply x, y and z but not w;
        // low nibble: broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(a, b, 0x7F);

        union { __m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = dp;

        return uf.f[0];
    }
};
```
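One caveat on the `_mm_rsqrt_ps` used above: it returns only an approximation, accurate to roughly 12 bits. If that isn't precise enough, the standard fix is a single Newton-Raphson refinement step, sketched here as a standalone helper (the name `RSqrtNR` is illustrative):

```cpp
#include <xmmintrin.h>  // SSE

// One Newton-Raphson refinement of the approximate reciprocal square root:
// y' = y * (3 - x*y*y) / 2, roughly doubling the bits of precision.
inline __m128 RSqrtNR(__m128 x)
{
    const __m128 half  = _mm_set_ps1(0.5f);
    const __m128 three = _mm_set_ps1(3.0f);

    __m128 y   = _mm_rsqrt_ps(x);            // ~12-bit approximation
    __m128 yy  = _mm_mul_ps(y, y);           // y*y
    __m128 xyy = _mm_mul_ps(x, yy);          // x*y*y
    return _mm_mul_ps(_mm_mul_ps(half, y),   // 0.5 * y * (3 - x*y*y)
                      _mm_sub_ps(three, xyy));
}
```

This costs a few extra multiplies and a subtract per normalize, so it's only worth it when the ~0.04% error of the raw approximation actually matters.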


##### Share on other sites

I don't know if this is the case here, but I've often seen that if you've got lots of functions like this that move data back and forth between float registers and SSE registers, then so much time is wasted on the moves that SSE ends up the same speed as (or slower than) regular float code!

I'd try to make your Vec4 class store its members as an __m128.

In some engines I've even seen a floatInVec class (or Vec1), which internally is an __m128, but to the user it appears like a float. This lets you keep more of your math code working with SSE data, rather than having it juggle between float and SSE.
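A minimal sketch of what such a class might look like (the name `FloatInVec` and the exact interface are illustrative, not from any particular engine): the value lives in lane 0 of an __m128 and only crosses back to a scalar float when explicitly asked.

```cpp
#include <xmmintrin.h>  // SSE

// A "float" that lives in an SSE register so scalar math stays in SSE.
struct FloatInVec
{
    __m128 v;

    FloatInVec() {}
    explicit FloatInVec(float f)  : v(_mm_set_ss(f)) {}
    explicit FloatInVec(__m128 m) : v(m) {}

    // Arithmetic stays in SSE registers; only lane 0 is meaningful.
    FloatInVec operator*(FloatInVec o) const { return FloatInVec(_mm_mul_ss(v, o.v)); }
    FloatInVec operator+(FloatInVec o) const { return FloatInVec(_mm_add_ss(v, o.v)); }

    // Cross the SSE/float boundary only when the caller really needs a float.
    float ToFloat() const { return _mm_cvtss_f32(v); }
};
```

The point is that `Dot` could return a `FloatInVec` instead of a `float`, so a chain like `v.Dot(n) * scale` never leaves the SSE register file.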

Regarding the Normalize and Dot functions - considering that you're only really operating on a Vec3, then if you're running these over large data-sets, it might be worthwhile making an SoAVec3 class (as well as the normal Vec4 class), which stores 4 x values, then 4 y values, then 4 z values in 3 __m128 variables. This reduces the total working size of the data-set, and often lets you write more optimized (shorter) functions.
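A sketch of that SoA layout idea (the name `SoAVec3` is illustrative): four vectors stored as x0..x3, y0..y3, z0..z3 so one instruction sequence normalizes all four at once, with no dot-product shuffling needed.

```cpp
#include <xmmintrin.h>  // SSE

// Four Vec3s in structure-of-arrays form: one __m128 per component.
struct SoAVec3
{
    __m128 x, y, z;

    void Normalize()
    {
        // lenSq = x*x + y*y + z*z, computed for all four vectors at once
        __m128 lenSq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                             _mm_mul_ps(y, y)),
                                  _mm_mul_ps(z, z));
        __m128 inv = _mm_rsqrt_ps(lenSq);  // four approximate 1/sqrt at once
        x = _mm_mul_ps(x, inv);
        y = _mm_mul_ps(y, inv);
        z = _mm_mul_ps(z, inv);
    }
};
```

Note this only needs plain SSE (no `_mm_dp_ps`), since the "dot product" falls out of the layout for free.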

##### Share on other sites

[deleted - stupid post suggesting the use of the dot product intrinsic when it is clearly already being used]

Edited by achild

##### Share on other sites

@achild As you can see above I'm already using that intrinsic for the dot product

@hodgman I was thinking about that...I'll give it a try and see if it gives me some boost

##### Share on other sites

> @achild As you can see above I'm already using that intrinsic for the dot product
>
> @hodgman I was thinking about that...I'll give it a try and see if it gives me some boost

Wow. Not much you can say to that. Talk about not paying attention.

##### Share on other sites
As Hodge says, due to the load and then store costs, the code is likely losing so much performance that it may not be giving you any gains at all. Having said that, get rid of the union bits and replace them with the appropriate SSE calls. Using that union trick forces the compiler to flush the registers back to memory just so you can read them and put them in different memory. Here's the modified code (comments removed for brevity):

```cpp
#include <smmintrin.h>  // SSE4.1, for _mm_dp_ps

__declspec(align(16))
struct Vec4
{
    float x, y, z, w;

    inline Vec4 Normalize() const
    {
        __m128 tmp = _mm_load_ps(&x);  // aligned load of x, y, z, w
        __m128 dp = _mm_dp_ps(tmp, tmp, 0x7F);
        dp = _mm_rsqrt_ps(dp);
        tmp = _mm_mul_ps(tmp, dp);

        Vec4 vec;
        _mm_store_ps(&vec.x, tmp);
        vec.w = 1.0f;
        return vec;
    }

    inline float Dot(const Vec4 &v2) const
    {
        __m128 a = _mm_load_ps(&x);
        __m128 b = _mm_load_ps(&v2.x);
        __m128 dp = _mm_dp_ps(a, b, 0x7F);

        float result;
        _mm_store_ss(&result, dp);
        return result;
    }
};
```

That should be a bit faster since it removes the unneeded register flushes and leverages aligned loads, given that the class is 16-byte aligned. In effect, even though you are not using an __m128 for storage in the class, this treats the class as one anyway.

NOTE: Also note that these two functions are full of wait states due to the latencies of the operations being performed. If you are doing batches of normalizations/dot products, running 2 or 4 at a time, interleaved, will effectively triple the throughput of the function. Given SSE4, you don't actually need the SoA data reorg Hodge suggests; you just need to keep more than one in flight at a time. Edited by Hiwas

##### Share on other sites

I just tried your updated code AllEightUp and it actually got slower

Edited by lipsryme

##### Share on other sites

> I just tried your updated code AllEightUp and it actually got slower

Erm, hmmmmmm.... Doesn't seem possible unless the compiler is making a mess, which also seems unlikely. That code *should* be about as close to the fewest cycles as possible without your vector class itself containing an __m128, so you can pass by register instead of by reference. I'll have to play with it myself a bit and see if I missed something. What compiler are you using to test the code? And obviously, is it a release build?

##### Share on other sites

Yea, release build, and the compiler is MSVC (Visual Studio 2008).

edit: Doing it like this makes it a little bit faster than before now:

```cpp
// 4-component vector class using SIMD instructions
#include <smmintrin.h>  // SSE4.1, for _mm_dp_ps

__declspec(align(16))
struct Vec4
{
    __m128 v;

    Vec4()
    {
    }

    Vec4(float x, float y, float z, float w)
    {
        v = _mm_set_ps(w, z, y, x);
    }

    inline float X() const
    {
        union { __m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[0];
    }

    inline float Y() const
    {
        union { __m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[1];
    }

    inline float Z() const
    {
        union { __m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[2];
    }

    inline void Normalize()
    {
        // 0x7F = 0111 1111 ~ high nibble: multiply x, y and z but not w;
        // low nibble: broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(v, v, 0x7F);

        // compute rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);

        // vec * rsqrt(dot(vec, vec))
        v = _mm_mul_ps(v, dp);
    }

    inline float Dot(const Vec4 &v2) const
    {
        // 0x7F = 0111 1111 ~ high nibble: multiply x, y and z but not w;
        // low nibble: broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(v, v2.v, 0x7F);

        float result;
        _mm_store_ss(&result, dp);

        return result;
    }
};
```

Edited by lipsryme

##### Share on other sites
Oh.... VC 2008 was exceptionally bad with SIMD; it could be doing some really stupid stuff behind the scenes. 2010 and 2012 have massively improved SIMD handling. You might set a breakpoint on the function, switch to disassembly view, and see if it is doing anything obviously stupid. Drop the disassembly here and I can take a peek at it also.
