lipsryme

Performance optimization SSE vector dot/normalize


Hey guys, I've done a "quick" first implementation of a vector normalize and dot product using SSE intrinsics, and I was wondering if there's anything that could still be optimized further.

 

Here's my code:

#include <smmintrin.h> // SSE4.1, for _mm_dp_ps

__declspec(align(16))
struct Vec4
{
	float x, y, z, w;

	inline Vec4 Normalize() const
	{
		__m128 tmp;

		// copy data into the 128bit register
		tmp = _mm_set_ps(w, z, y, x);

		// 0x7F = 0111 1111: the high nibble (0111) includes x, y, z in the
		// dot product but excludes w; the low nibble (1111) broadcasts the
		// result to all 4 components
		__m128 dp = _mm_dp_ps(tmp, tmp, 0x7F);

		// compute rsqrt of the dot product (_mm_rsqrt_ps is a fast
		// approximation, roughly 12 bits of precision)
		dp = _mm_rsqrt_ps(dp);

		// vec * rsqrt(dot(vec, vec))
		tmp = _mm_mul_ps(tmp, dp);
		
		Vec4 vec;
		union {__m128 v; float f[4]; } uf; // to access the 4 floats
		uf.v = tmp;

		vec.x = uf.f[0];
		vec.y = uf.f[1];
		vec.z = uf.f[2];
		vec.w = 1.0f;

		return vec;
	}

	inline float Dot(const Vec4 &v2) const
	{
		__m128 a;
		
		// copy data into the 128bit register
		a = _mm_set_ps(w, z, y, x);
		__m128 b = _mm_set_ps(v2.w, v2.z, v2.y, v2.x);

		// 0x7F: include x, y, z in the dot product (exclude w) and
		// broadcast the result to all 4 components
		__m128 dp = _mm_dp_ps(a, b, 0x7F);
		
		Vec4 vec;
		union {__m128 v; float f[4]; } uf; // to access the 4 floats
		uf.v = dp;

		return uf.f[0];
	}
};

Thanks in advance!

Hodgman

I don't know if this is the case here, but I've often seen that if you have lots of functions like this that move data back and forth between float registers and SSE registers, the shuffling ends up wasting so much time that SSE becomes the same speed as (or slower than) regular float code!

I'd try making your Vec4 class store its members as an __m128.

In some engines I've even seen a floatInVec class (or Vec1), which is internally an __m128 but appears to the user like a float. This lets more of your math code keep working with SSE data, rather than juggling between float and SSE.
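For illustration, a minimal sketch of what such a class might look like (the floatInVec name and the operators shown are assumptions for illustration, not code from a specific engine):

#include <xmmintrin.h>

// The scalar lives in the low lane of an __m128, so chained math stays
// in SSE registers and only converts back to a plain float at the edges.
class floatInVec
{
public:
	explicit floatInVec(float f)  : m(_mm_set_ss(f)) {}
	explicit floatInVec(__m128 v) : m(v) {}

	// convert back to a scalar only at the very edge of the math code
	operator float() const { return _mm_cvtss_f32(m); }

	floatInVec operator+(floatInVec o) const { return floatInVec(_mm_add_ss(m, o.m)); }
	floatInVec operator*(floatInVec o) const { return floatInVec(_mm_mul_ss(m, o.m)); }

private:
	__m128 m; // value in lane 0; lanes 1-3 are don't-care
};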


Regarding the Normalize and Dot functions: considering that you're only really operating on a Vec3, if you're running these over large data sets it might be worthwhile to make an SoAVec3 class (as well as the normal Vec4 class), which stores 4 x values, then 4 y values, then 4 z values in 3 __m128 variables. This reduces the total working size of the data set and often lets you write more optimized (shorter) functions.
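As a rough sketch of that idea (the SoAVec3 name and Normalize4 helper here are made up for illustration), normalizing four vectors at once becomes nothing but vertical SSE ops, with no dot-product or shuffle instructions needed:

#include <xmmintrin.h>

__declspec(align(16))
struct SoAVec3
{
	__m128 x, y, z; // x = {x0,x1,x2,x3}, y = {y0,y1,y2,y3}, z = {z0,z1,z2,z3}

	// normalize all four vectors in one pass
	inline void Normalize4()
	{
		// per-lane squared lengths of the four vectors
		__m128 lenSq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)), _mm_mul_ps(z, z));

		// four approximate reciprocal square roots in a single instruction
		__m128 inv = _mm_rsqrt_ps(lenSq);

		x = _mm_mul_ps(x, inv);
		y = _mm_mul_ps(y, inv);
		z = _mm_mul_ps(z, inv);
	}
};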

lipsryme

@achild As you can see above, I'm already using that intrinsic for the dot product.

@hodgman I was thinking about that... I'll give it a try and see if it gives me a boost.

achild

> @achild As you can see above, I'm already using that intrinsic for the dot product.
>
> @hodgman I was thinking about that... I'll give it a try and see if it gives me a boost.

 

Wow. Not much you can say to that. Talk about not paying attention.

Guest Hiwas
As Hodge says, due to the load-then-store costs the code is likely losing so much performance that it may not be giving you any gains at all. Having said that, get rid of the union bits and replace them with the appropriate SSE calls. The union trick forces the compiler to flush the registers back to memory just so you can read the values and copy them into different memory. Here's the modified code (comments removed for brevity):
 
#include <smmintrin.h> // SSE4.1, for _mm_dp_ps

__declspec(align(16))
struct Vec4
{
	float x, y, z, w;

	inline Vec4 Normalize() const
	{
		__m128 tmp = _mm_load_ps( &x );
		__m128 dp = _mm_dp_ps(tmp, tmp, 0x7F); 
		dp = _mm_rsqrt_ps(dp);
		tmp = _mm_mul_ps(tmp, dp);
		
		Vec4 vec;
		_mm_store_ps( &vec.x, tmp );
		vec.w = 1.0f;
		return vec;
	}

	inline float Dot(const Vec4 &v2) const
	{
		__m128 a = _mm_load_ps( &x );
		__m128 b = _mm_load_ps( &v2.x );
		__m128 dp = _mm_dp_ps(a, b, 0x7F); 

		float result;
		_mm_store_ss( &result, dp );
		return result;
	}
};
That should be a bit faster, since it removes the unneeded register flushes and leverages aligned-load speeds given that the class is 16-byte aligned. In effect, even though you are not using an __m128 for storage in the class, this treats the class as one anyway.

NOTE: These two functions are also full of wait states due to the latencies of the operations involved. If you are doing batches of normalizations/dot products, running 2 or 4 at a time, interleaved, will effectively triple the throughput of the function. Given SSE4, you don't actually need the SoA data reorganization Hodge suggests; you just need to keep more than one in flight at a time.
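To illustrate the interleaving (a hypothetical helper built on the Vec4 above, not tested code from the thread):

// Two normalizations interleaved: the second dp/rsqrt chain issues
// while the first is still completing, hiding most of the latency.
inline void Normalize2(Vec4 &a, Vec4 &b)
{
	__m128 va = _mm_load_ps(&a.x);
	__m128 vb = _mm_load_ps(&b.x);

	// issue both dot products back to back...
	__m128 dpa = _mm_dp_ps(va, va, 0x7F);
	__m128 dpb = _mm_dp_ps(vb, vb, 0x7F);

	// ...then both rsqrts, so neither chain stalls waiting on the other
	dpa = _mm_rsqrt_ps(dpa);
	dpb = _mm_rsqrt_ps(dpb);

	_mm_store_ps(&a.x, _mm_mul_ps(va, dpa));
	_mm_store_ps(&b.x, _mm_mul_ps(vb, dpb));
	a.w = b.w = 1.0f;
}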

Guest Hiwas

> I just tried your updated code AllEightUp and it actually got slower.


Erm, hmmmm... that doesn't seem possible unless the compiler is making a mess of it, which also seems unlikely. That code *should* be about as close to the fewest cycles as possible without your vector class itself containing an __m128, so that you can pass by register instead of by reference. I'll have to play with it myself a bit and see if I missed something. What compiler are you using to test the code? And, obviously, is it a release build?
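(As a sketch of the pass-by-register point: a hypothetical free function, with the caveat that whether __m128 arguments actually travel in XMM registers depends on the calling convention and target.)

#include <smmintrin.h> // SSE4.1, for _mm_dp_ps

// Taking __m128 by value gives the compiler a chance to keep both
// operands in XMM registers rather than reloading them through a
// Vec4 reference in memory.
inline float Dot3(__m128 a, __m128 b)
{
	return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x7F));
}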

lipsryme

Yeah, release build, and the compiler is MSVC (Visual Studio 2008).


Edit: Doing it something like this makes it a little bit faster than before:

#include <smmintrin.h> // SSE4.1, for _mm_dp_ps

// 4-component vector class using SIMD instructions
__declspec(align(16))
struct Vec4
{
    __m128 v;

    Vec4() {}

    Vec4(float x, float y, float z, float w)
    {
        v = _mm_set_ps(w, z, y, x);
    }

    inline float X() const
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[0];
    }

    inline float Y() const
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[1];
    }

    inline float Z() const
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[2];
    }

    inline void Normalize()
    {
        // 0x7F: include x, y, z in the dot product (exclude w) and
        // broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(v, v, 0x7F);

        // compute approximate rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);

        // vec * rsqrt(dot(vec, vec))
        v = _mm_mul_ps(v, dp);
    }

    inline float Dot(const Vec4 &v2) const
    {
        // 0x7F: include x, y, z in the dot product (exclude w) and
        // broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(v, v2.v, 0x7F);

        float result;
        _mm_store_ss(&result, dp);

        return result;
    }
};

Guest Hiwas
Oh... VC 2008 was exceptionally bad with SIMD; it could be doing some really stupid stuff behind the scenes. 2010 and 2012 have massively improved SIMD handling. You might set a breakpoint on the function, switch to disassembly view, and see if it's doing anything obviously stupid. Drop the disassembly here and I can take a peek at it as well.
