Hi all, I think I have finally written some SSE code that actually performs faster than what the compiler generates :-) It is a matrix * vector multiplication and it is approximately 25% faster. But I don't know if it is correct: a vector multiplied by the identity matrix comes back unchanged, but since I is the same as its own transpose (and inverse), that test cannot tell me whether I mixed up the layout (the layout of my matrix is, IIRC, COLxROW; I wrote the class some time ago).
class vector3
{
public:
    union
    {
        struct
        {
            real x, y, z, w;
        };
        real values[4];
        __m128 ssevalues;
    };
    .
    .
    .
};
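const vector3 matrix::operator * (const vector3 &v) const
{
    vector3 t(v);
    //Apply(t); //Plain old c++ code
    __m128 row0, row1, row2, col, res, res_t;
    // _mm_load_ps requires each mat[i] to be 16-byte aligned.
    row0 = _mm_load_ps(mat[0]);
    row1 = _mm_load_ps(mat[1]);
    row2 = _mm_load_ps(mat[2]);
    // Broadcast v.x to all four lanes and scale mat[0] by it.
    // Note: the m128_f32 member is MSVC-specific.
    col = _mm_set1_ps(v.ssevalues.m128_f32[0]);
    res = _mm_mul_ps(row0, col);
    // Accumulate mat[1] * v.y.
    col = _mm_set1_ps(v.ssevalues.m128_f32[1]);
    res_t = _mm_mul_ps(row1, col);
    res = _mm_add_ps(res, res_t);
    // Accumulate mat[2] * v.z.
    col = _mm_set1_ps(v.ssevalues.m128_f32[2]);
    res_t = _mm_mul_ps(row2, col);
    res = _mm_add_ps(res, res_t);
    // res = mat[0]*x + mat[1]*y + mat[2]*z; mat[3] and v.w are not used.
    t.ssevalues = res;
    return t;
}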
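To settle the layout question, I suppose I could compare the SSE result against two plain scalar versions using a matrix that is not symmetric, so that M and its transpose give different answers. Below is a minimal sketch of that idea. Mat4, mul_sse, mul_columns, mul_rows, same and self_check are hypothetical names made up for this test, not part of my classes, and I am assuming the matrix is stored as four 16-byte-aligned groups of four floats.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the matrix storage: four groups of four floats,
// aligned to 16 bytes so that _mm_load_ps is legal.
struct Mat4 {
    alignas(16) float m[4][4];
};

// Same sequence of operations as the operator* above:
// out = m[0]*v.x + m[1]*v.y + m[2]*v.z
void mul_sse(const Mat4& a, const float v[4], float out[4]) {
    __m128 res = _mm_mul_ps(_mm_load_ps(a.m[0]), _mm_set1_ps(v[0]));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(a.m[1]), _mm_set1_ps(v[1])));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_load_ps(a.m[2]), _mm_set1_ps(v[2])));
    _mm_storeu_ps(out, res);  // unaligned store, out may live anywhere
}

// Scalar reference that treats m[i] as COLUMN i of the matrix.
void mul_columns(const Mat4& a, const float v[4], float out[4]) {
    for (int j = 0; j < 4; ++j)
        out[j] = a.m[0][j] * v[0] + a.m[1][j] * v[1] + a.m[2][j] * v[2];
}

// Scalar reference that treats m[i] as ROW i of the matrix.
void mul_rows(const Mat4& a, const float v[4], float out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = a.m[i][0] * v[0] + a.m[i][1] * v[1] + a.m[i][2] * v[2];
}

// Compare only x, y, z with a small tolerance.
bool same(const float a[4], const float b[4]) {
    for (int i = 0; i < 3; ++i)
        if (std::fabs(a[i] - b[i]) > 1e-5f) return false;
    return true;
}

// Non-symmetric test matrix, so the two interpretations disagree.
void self_check() {
    Mat4 a = {{{1.0f, 2.0f, 3.0f, 0.0f},
               {4.0f, 5.0f, 6.0f, 0.0f},
               {7.0f, 8.0f, 9.0f, 0.0f},
               {0.0f, 0.0f, 0.0f, 1.0f}}};
    float v[4] = {1.0f, 2.0f, 3.0f, 0.0f};
    float s[4], c[4], r[4];
    mul_sse(a, v, s);
    mul_columns(a, v, c);
    mul_rows(a, v, r);
    assert(same(s, c));   // the SSE sequence matches the column interpretation
    assert(!same(s, r));  // and differs from the row interpretation
}
```

If I read this right, the SSE sequence computes M * v only when mat[0], mat[1], mat[2] hold the columns of M. If they hold the rows, the same code computes the product with the transpose instead, and an identity test would never show the difference.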
Please keep in mind that I'm just experimenting at the moment and have only just started with intrinsics... Any tips on how to improve it?
Thank you all!