25% speedup with SSE

2 comments, last by cignox1 16 years, 8 months ago
Hi all, I think I have finally written some SSE code that actually performs faster than the compiler's :-) It is a matrix * vector multiplication and it is approximately 25% faster. But I don't know if it is correct: it leaves a vector unchanged when multiplied by the identity matrix, but since I is the same as its inverse (and its transpose), that test cannot catch a layout mix-up, so the code could still be wrong (the layout of my matrix is, IIRC, COLxROW; I wrote the class some time ago).

class vector3
{
    public:     
		union
		{
			struct
			{
				real x, y, z, w;
			};
			real values[4];
			__m128 ssevalues;
		};
.
.
.
};



const vector3 matrix::operator * (const vector3 &v) const
{
    vector3 t(v);

	//Apply(t); //Plain old c++ code

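	// Load three 4-float lines of the matrix (_mm_load_ps needs 16-byte alignment).
	__m128 row0, row1, row2, col, res, res_t;
	row0 = _mm_load_ps(mat[0]);
	row1 = _mm_load_ps(mat[1]);
	row2 = _mm_load_ps(mat[2]);
	
	// res = x*mat[0] + y*mat[1] + z*mat[2]. This is matrix * vector only if
	// each mat[i] holds a column; v.w and mat[3] are not used.
	// Note: m128_f32 is an MSVC-specific member of __m128.
	col = _mm_set1_ps(v.ssevalues.m128_f32[0]);
	res = _mm_mul_ps(row0, col);
	col = _mm_set1_ps(v.ssevalues.m128_f32[1]);
	res_t = _mm_mul_ps(row1, col);
	res = _mm_add_ps(res, res_t);
	col = _mm_set1_ps(v.ssevalues.m128_f32[2]);
	res_t = _mm_mul_ps(row2, col);
	res = _mm_add_ps(res, res_t);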

	t.ssevalues = res;

	return t;
}


Please consider that I'm just experimenting at the moment and that I have only just begun with intrinsics... Any tips on how to improve it? Thank you all!
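Since the identity matrix cannot reveal a layout mix-up, one way to settle the doubt is to feed the same arithmetic a deliberately asymmetric matrix. The following is only a sketch: `combine` and `treats_mat_i_as_column` are hypothetical free functions, not part of the posted classes, and the matrix is assumed to be 16 contiguous, 16-byte-aligned floats where `m + 4*i` corresponds to `mat[i]` in the post.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)
#include <cassert>
#include <cstdint>

// Hypothetical free-function version of the posted operator*.
// Computes x*mat[0] + y*mat[1] + z*mat[2], exactly like the post.
inline __m128 combine(const float* m, float x, float y, float z)
{
    // _mm_load_ps faults on unaligned addresses, so check in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(m) % 16 == 0);

    __m128 res = _mm_mul_ps(_mm_load_ps(m + 0), _mm_set1_ps(x));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(m + 4), _mm_set1_ps(y)));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(m + 8), _mm_set1_ps(z)));
    return res;
}

// Returns true if combine() behaves as if each mat[i] were a COLUMN.
inline bool treats_mat_i_as_column()
{
    // Asymmetric test matrix: the only off-diagonal entry sits in mat[0][1].
    alignas(16) const float m[16] = {
        1, 2, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1};
    alignas(16) float out[4];

    // Multiplying by (1, 0, 0) selects the first column of a matrix.
    _mm_store_ps(out, combine(m, 1.0f, 0.0f, 0.0f));

    // If mat[0] is that column, the result is (1, 2, 0). If mat[0] were a
    // row, the mathematically correct answer would be (1, 0, 0) instead.
    return out[0] == 1.0f && out[1] == 2.0f && out[2] == 0.0f;
}
```

Calling `treats_mat_i_as_column()` from `main` returns `true`, which suggests the posted sequence is a correct matrix * vector product only when the class stores one column per `mat[i]`; with one row per `mat[i]` it would compute the product with the transpose.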
What you should do is compare it against a known good reference implementation in C. Just generate a couple million random matrices and vectors, run them through both, and confirm that the results are the same (or at least extremely close).
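A minimal sketch of such a comparison, under stated assumptions: `mul_ref`, `mul_sse`, and `max_abs_diff` are hypothetical names, the matrix is taken to be 16 contiguous, 16-byte-aligned floats, and the reference assumes (the thread does not confirm this) that `m[4*i + j]` is element `j` of column `i`. Results are compared with a tolerance rather than for exact equality, since the compiler may round the scalar path differently.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)
#include <algorithm>
#include <cmath>
#include <random>

// Plain C++ reference: out = x*col0 + y*col1 + z*col2 (assumed layout).
inline void mul_ref(const float* m, const float* v, float* out)
{
    for (int j = 0; j < 4; ++j)
        out[j] = m[j] * v[0] + m[4 + j] * v[1] + m[8 + j] * v[2];
}

// Same sequence of intrinsics as the posted operator*.
// m and out must be 16-byte aligned.
inline void mul_sse(const float* m, const float* v, float* out)
{
    __m128 res = _mm_mul_ps(_mm_load_ps(m + 0), _mm_set1_ps(v[0]));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(m + 4), _mm_set1_ps(v[1])));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(m + 8), _mm_set1_ps(v[2])));
    _mm_store_ps(out, res);
}

// Runs both versions on random inputs and returns the largest
// absolute difference seen in any output component.
inline float max_abs_diff(int trials, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-10.0f, 10.0f);

    alignas(16) float m[16];
    alignas(16) float v[4];
    alignas(16) float a[4];
    alignas(16) float b[4];

    float worst = 0.0f;
    for (int t = 0; t < trials; ++t) {
        for (float& e : m) e = dist(rng);
        for (float& e : v) e = dist(rng);
        mul_ref(m, v, a);
        mul_sse(m, v, b);
        for (int j = 0; j < 4; ++j)
            worst = std::max(worst, std::fabs(a[j] - b[j]));
    }
    return worst;
}
```

A call such as `max_abs_diff(2000000, 42u)` should return a value close to zero if the two versions agree. A large value points either to a bug or to the reference and the SSE code disagreeing about the matrix layout.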
It's no mystery that SSE is faster than FPU code. The compiler uses FPU. =p
Quote:Original post by kohlrak
It's no mystery that SSE is faster than FPU code. The compiler uses FPU. =p


Yep, that's why I'm happy: I have had a few problems recently because the compiler's non-SSE double code was much faster than my ASM and intrinsics float code... (and its float code as well).
Still, VC++ is actually able to do a better job if I set 'generate SSE code' as a compiler option. Its SSE version is 1/10 of a second faster in my test.
But I'm quite happy with it, even though there is an error that makes the intrinsics version compute different values from those calculated by the C++ version. I will see...
