cignox1

25% speedup with SSE

Hi all, I think I have finally written some SSE code that actually performs faster than the compiler's :-) It is a matrix * vector multiplication and it is approximately 25% faster. But I don't know if it is correct: it leaves a vector unchanged when multiplied by the identity matrix, but since I is the same as its transpose (and its inverse), that test cannot tell me whether the code is wrong (the layout of my matrix is, IIRC (I wrote the class some time ago), COLxROW).
class vector3
{
    public:     
		union
		{
			struct
			{
				real x, y, z, w;
			};
			real values[4];
			__m128 ssevalues;
		};
.
.
.
};


const vector3 matrix::operator * (const vector3 &v) const
{
    vector3 t(v);

	//Apply(t); //Plain old c++ code

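	// _mm_load_ps requires each mat[i] to be 16-byte aligned.
	__m128 row0, row1, row2, col, res, res_t;
	row0 = _mm_load_ps(mat[0]);
	row1 = _mm_load_ps(mat[1]);
	row2 = _mm_load_ps(mat[2]);

	// res = mat[0]*v.x + mat[1]*v.y + mat[2]*v.z
	// This equals M*v only if mat[i] holds column i of M; if mat[i] holds
	// row i, it computes transpose(M)*v. mat[3] (scaled by w) is never added.
	col = _mm_set1_ps(v.ssevalues.m128_f32[0]);
	res = _mm_mul_ps(row0, col);
	col = _mm_set1_ps(v.ssevalues.m128_f32[1]);
	res_t = _mm_mul_ps(row1, col);
	res = _mm_add_ps(res, res_t);
	col = _mm_set1_ps(v.ssevalues.m128_f32[2]);
	res_t = _mm_mul_ps(row2, col);
	res = _mm_add_ps(res, res_t);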

	t.ssevalues = res;

	return t;
}


Please consider that I'm just experimenting at the moment and that I have only just started with intrinsics... Any tips on how to improve it? Thank you all!
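Since the identity matrix cannot reveal a layout mix-up, a non-symmetric matrix and a unit vector might be a quicker probe. The sketch below is only an illustration under assumptions: `Mat4`, `broadcast_mul` and `layout_probe` are hypothetical names, not part of the class above, and `Mat4` simply stands in for whatever `mat` really is. It repeats the same broadcast-and-accumulate pattern on plain float arrays.

```cpp
#include <xmmintrin.h>

// Hypothetical stand-in for the matrix class: four groups of four floats,
// 16-byte aligned so that _mm_load_ps is legal on each m[i].
struct alignas(16) Mat4 { float m[4][4]; };

// Same pattern as the posted operator*: m[0]*v[0] + m[1]*v[1] + m[2]*v[2].
inline __m128 broadcast_mul(const Mat4 &a, const float v[4])
{
    __m128 res = _mm_mul_ps(_mm_load_ps(a.m[0]), _mm_set1_ps(v[0]));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(a.m[1]), _mm_set1_ps(v[1])));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(a.m[2]), _mm_set1_ps(v[2])));
    return res;
}

// Runs broadcast_mul and stores the four result lanes into out
// (unaligned store, so out needs no special alignment).
inline void layout_probe(const Mat4 &a, const float v[4], float out[4])
{
    _mm_storeu_ps(out, broadcast_mul(a, v));
}
```

With `m[0] = {1,2,3,0}`, `m[1] = {4,5,6,0}`, `m[2] = {7,8,9,0}` and `v = {1,0,0,0}`, the result is simply `m[0]`. If the plain C++ path returns `{1,4,7,...}` for the same input, then the two versions disagree about whether `m[i]` is a row or a column.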

What you should do is compare it against a known good reference implementation in C. Just generate a couple million random matrices and vectors, run them through both, and confirm that the results are the same (or at least extremely close).
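A rough sketch of that kind of check is below. It makes several assumptions: `Mat4`, `mul_ref`, `mul_sse` and `count_mismatches` are hypothetical names, the matrix is taken to store column i in `m[i]`, and only the first three columns are used, as in the posted code. The comparison uses a relative tolerance rather than exact equality, since the two paths may round differently.

```cpp
#include <xmmintrin.h>
#include <cmath>
#include <random>

// Hypothetical matrix type; m[i] is assumed to hold column i.
struct alignas(16) Mat4 { float m[4][4]; };

// Scalar reference: out[r] = m[0][r]*v[0] + m[1][r]*v[1] + m[2][r]*v[2].
inline void mul_ref(const Mat4 &a, const float v[4], float out[4])
{
    for (int r = 0; r < 4; ++r)
        out[r] = a.m[0][r] * v[0] + a.m[1][r] * v[1] + a.m[2][r] * v[2];
}

// SSE version using the broadcast-and-accumulate pattern from the post.
inline void mul_sse(const Mat4 &a, const float v[4], float out[4])
{
    __m128 res = _mm_mul_ps(_mm_load_ps(a.m[0]), _mm_set1_ps(v[0]));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(a.m[1]), _mm_set1_ps(v[1])));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(a.m[2]), _mm_set1_ps(v[2])));
    _mm_storeu_ps(out, res);
}

// Feeds `trials` random matrices and vectors through both versions and
// returns how many trials had at least one component outside the tolerance.
inline int count_mismatches(int trials, float tol = 1e-4f)
{
    std::mt19937 rng(12345);  // fixed seed so a failure is reproducible
    std::uniform_real_distribution<float> dist(-10.0f, 10.0f);
    int bad = 0;
    for (int t = 0; t < trials; ++t) {
        Mat4 a;
        float v[4], expected[4], actual[4];
        for (auto &column : a.m)
            for (float &x : column) x = dist(rng);
        for (float &x : v) x = dist(rng);
        mul_ref(a, v, expected);
        mul_sse(a, v, actual);
        for (int i = 0; i < 4; ++i) {
            if (std::fabs(expected[i] - actual[i]) > tol * (1.0f + std::fabs(expected[i]))) {
                ++bad;
                break;
            }
        }
    }
    return bad;
}
```

Calling `count_mismatches(1000000)` and checking for zero would cover far more cases than the identity test. Swapping `mul_ref` for the real plain C++ path is what would actually expose a layout or translation disagreement.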

Quote:
Original post by kohlrak
It's no mystery that SSE is faster than FPU code. The compiler uses FPU. =p


Yep, that's why I'm happy: I recently had a few problems because the compiler's non-SSE code using double was much faster than my ASM and intrinsics float code... (and faster than its own float code as well).
Still, VC++ actually does a better job if I enable the 'generate SSE code' compiler option: its SSE version is 1/10 of a second faster in my test.
But I'm quite happy with it, even though there is an error that makes the intrinsics version compute different values from those calculated by the C++ version. I will see...
