Performance optimization: SSE vector dot/normalize

19 comments, last by Vilem Otte 10 years, 6 months ago

Hey guys, I've done a "quick" first implementation of a vector normalize and dot product using SSE intrinsics, and was wondering if there's anything that could still be optimized further.

Here's my code:


__declspec(align(16))
struct Vec4
{
	float x, y, z, w;

	inline const Vec4 Normalize()
	{
		__m128 tmp;

		// copy the data into a 128-bit register
		// (_mm_set_ps takes its arguments from the high lane down to the low lane, hence w, z, y, x)
		tmp = _mm_set_ps(w, z, y, x);

		// mask 0x7F = 0111 1111: the high nibble selects x, y, z (but not w) for the multiply,
		// the low nibble broadcasts the result to all 4 lanes
		__m128 dp = _mm_dp_ps(tmp, tmp, 0x7F); 
		
		// approximate reciprocal square root of the dot product (rsqrt is only a ~12-bit approximation)
		dp = _mm_rsqrt_ps(dp);

		// vec * rsqrt(dot(vec, vec))
		tmp = _mm_mul_ps(tmp, dp);
		
		Vec4 vec;
		union {__m128 v; float f[4]; } uf; // to access the 4 floats
		uf.v = tmp;

		vec.x = uf.f[0];
		vec.y = uf.f[1];
		vec.z = uf.f[2];
		vec.w = 1.0f;

		return vec;
	}

	inline const float Dot(const Vec4 &v2)
	{
		__m128 a;
		
		// copy both vectors into 128-bit registers (arguments go from the high lane down to the low lane)
		a = _mm_set_ps(w, z, y, x);
		__m128 b = _mm_set_ps(v2.w, v2.z, v2.y, v2.x);

		// mask 0x7F = 0111 1111: the high nibble selects x, y, z (but not w) for the multiply,
		// the low nibble broadcasts the result to all 4 lanes
		__m128 dp = _mm_dp_ps(a, b, 0x7F); 
		
		Vec4 vec;
		union {__m128 v; float f[4]; } uf; // to access the 4 floats
		uf.v = dp;

		return uf.f[0];
	}
};

Thanks in advance!

I don't know if this is the case here, but I've often seen that when you've got lots of functions like this that move data back and forth between float registers and SSE registers, the shuffling ends up wasting so much time that SSE becomes the same speed as (or slower than) regular float code!

I'd try to make your Vec4 class store its members as an __m128.

In some engines I've even seen a floatInVec class (or Vec1), which internally is an __m128, but to the user it appears like a float. This lets you keep more of your math code working with SSE data, rather than having it juggle between float and SSE.
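
Roughly, the idea looks something like this; just a sketch, with illustrative names rather than anything taken from a particular engine:

#include <xmmintrin.h>

// A floatInVec-style wrapper: the scalar lives in the low lane of an __m128,
// so results like a dot product never have to leave the SSE registers until
// you explicitly ask for a plain float.
struct FloatInVec
{
    __m128 v;

    FloatInVec() {}
    explicit FloatInVec(__m128 value) : v(value) {}
    explicit FloatInVec(float value)  : v(_mm_set_ss(value)) {}

    // math stays in SSE registers
    FloatInVec operator*(const FloatInVec &rhs) const
    {
        return FloatInVec(_mm_mul_ss(v, rhs.v));
    }

    // the float<->SSE transfer only happens when you explicitly request it
    float ToFloat() const
    {
        float f;
        _mm_store_ss(&f, v);
        return f;
    }
};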

Regarding the Normalize and Dot functions - considering that you're only really operating on a Vec3, then if you're running these over large data-sets, it might be worthwhile making an SoAVec3 class (as well as the normal Vec4 class), which stores 4 x values, then 4 y values, then 4 z values in 3 __m128 variables. This reduces the total working size of the data-set, and often lets you write more optimized (shorter) functions.
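
For example, something along these lines (again just a sketch with illustrative names, showing the general shape rather than a finished class):

#include <xmmintrin.h>

// Four Vec3s packed structure-of-arrays style: xxxx / yyyy / zzzz.
// One call normalizes four vectors at once using plain muls/adds, with no
// dot-product or shuffle instructions needed.
struct SoAVec3
{
    __m128 x, y, z; // x = {x0,x1,x2,x3}, y = {y0,y1,y2,y3}, z = {z0,z1,z2,z3}

    inline void Normalize()
    {
        // four squared lengths at once, one per lane
        __m128 lenSq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                             _mm_mul_ps(y, y)),
                                  _mm_mul_ps(z, z));

        // approximate reciprocal square roots, also four at once
        __m128 invLen = _mm_rsqrt_ps(lenSq);

        x = _mm_mul_ps(x, invLen);
        y = _mm_mul_ps(y, invLen);
        z = _mm_mul_ps(z, invLen);
    }

    // four dot products at once, one per lane of the returned register
    inline __m128 Dot(const SoAVec3 &o) const
    {
        return _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, o.x),
                                     _mm_mul_ps(y, o.y)),
                          _mm_mul_ps(z, o.z));
    }
};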

[deleted - stupid post suggesting the use of the dot product intrinsic when it is clearly already being used]

@achild As you can see above, I'm already using that intrinsic for the dot product.

@hodgman I was thinking about that... I'll give it a try and see if it gives me some boost.

Wow. Not much you can say to that. Talk about not paying attention.

As Hodge says, due to the load and then store costs, the code is likely losing so much performance that it may not be giving you any gains at all. Having said that, get rid of the union bits and replace them with the appropriate SSE calls. That union trick forces the compiler to flush the registers back to memory just so you can read the values and copy them into different memory. Here's the modified code (comments removed for brevity):

__declspec(align(16))
struct Vec4
{
	float x, y, z, w;

	inline const Vec4 Normalize()
	{
		__m128 tmp = _mm_load_ps( &x );
		__m128 dp = _mm_dp_ps(tmp, tmp, 0x7F); 
		dp = _mm_rsqrt_ps(dp);
		tmp = _mm_mul_ps(tmp, dp);
		
		Vec4 vec;
		_mm_store_ps( &vec.x, tmp );
		vec.w = 1.0f;
		return vec;
	}

	inline const float Dot(const Vec4 &v2)
	{
		__m128 a = _mm_load_ps( &x );
		__m128 b = _mm_load_ps( &v2.x );
		__m128 dp = _mm_dp_ps(a, b, 0x7F); 

		float result;
		_mm_store_ss( &result, dp );
		return result;
	}
};

That should be a bit faster, since it removes the unneeded register flushes and leverages the aligned load speeds given that the class is 16-byte aligned. In effect, even though you are not using an __m128 for storage in the class, this is treating the class as one anyway.

NOTE: these two functions are also full of wait states due to the latencies of the operations involved. If you are doing batches of normalizations/dot products, running 2 or 4 at a time, interleaved, will effectively triple the throughput of the function. Given SSE4, you don't actually need the SoA data reorg Hodge suggests; you just need to keep more than one in flight at a time, along the lines of the sketch below.
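
Something like this, as a rough sketch building on the Vec4 above (Normalize2 is just a hypothetical helper name, and you'd want to check what the compiler actually generates, but it shows the interleaving):

#include <smmintrin.h> // _mm_dp_ps needs SSE4.1

// Normalize two vectors per call so the two dp/rsqrt dependency chains
// overlap instead of the CPU stalling on one chain at a time.
inline void Normalize2(Vec4 &a, Vec4 &b)
{
	__m128 va = _mm_load_ps( &a.x );
	__m128 vb = _mm_load_ps( &b.x );

	// start both dot products back to back...
	__m128 dpa = _mm_dp_ps(va, va, 0x7F);
	__m128 dpb = _mm_dp_ps(vb, vb, 0x7F);

	// ...and both reciprocal square roots, so the second one can issue
	// while the first is still working through its latency
	dpa = _mm_rsqrt_ps(dpa);
	dpb = _mm_rsqrt_ps(dpb);

	va = _mm_mul_ps(va, dpa);
	vb = _mm_mul_ps(vb, dpb);

	_mm_store_ps( &a.x, va );
	_mm_store_ps( &b.x, vb );
	a.w = 1.0f;
	b.w = 1.0f;
}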

I just tried your updated code, AllEightUp, and it actually got slower.

Erm, hmmmm... That doesn't seem possible unless the compiler is making a mess of it, which also seems unlikely. That code *should* be about as close to the minimum cycle count as possible without the vector class itself containing an __m128 so you can pass by register instead of by reference. I'll have to play with it myself a bit and see if I missed something. What compiler are you using to test the code? And, obviously, is it a release build?

Yeah, release build, and the compiler is MSVC (Visual Studio 2008).

edit: Doing it something like this makes it a little bit faster than before:


// 4-component vector class using SIMD instructions
__declspec(align(16))
struct Vec4
{
    __m128 v;

    Vec4()
    {

    }

    Vec4(float x, float y, float z, float w)
    {
        v = _mm_set_ps(w, z, y, x);
    }

    inline const float X()
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[0];
    }

    inline const float Y()
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[1];
    }

    inline const float Z()
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[2];
    }

    inline void Normalize()
    {
        // mask 0x7F = 0111 1111: the high nibble selects x, y, z (but not w) for the multiply,
        // the low nibble broadcasts the result to all 4 lanes
        __m128 dp = _mm_dp_ps(v, v, 0x7F); 

        // compute rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);

        // vec * rsqrt(dot(vec, vec))
        v = _mm_mul_ps(v, dp);
    }

    inline const float Dot(const Vec4 &v2) const
    {
        // mask 0x7F = 0111 1111: the high nibble selects x, y, z (but not w) for the multiply,
        // the low nibble broadcasts the result to all 4 lanes
        __m128 dp = _mm_dp_ps(v, v2.v, 0x7F); 

        float result;
        _mm_store_ss(&result, dp);

        return result;
    }
};

Oh... VC 2008 was exceptionally bad with SIMD; it could be doing some really stupid stuff behind the scenes. 2010 and 2012 have massively improved SIMD handling. You might set a breakpoint on the function, switch to disassembly view, and see if it is doing anything obviously stupid. Drop the disassembly here and I can take a peek at it as well.

This topic is closed to new replies.
