# 25% speedup with SSE

cignox1    735
Hi all, I think I have finally written some SSE code that actually performs faster than the compiler's :-) It is a matrix * vector multiplication and it is approximately 25% faster. But I don't know if it is correct: a vector multiplied by the identity matrix comes out unchanged, but since I is its own inverse and its own transpose, that test would pass even with the wrong layout (the layout of my matrix is, IIRC, COLxROW; I wrote the class some time ago).

    class vector3
    {
    public:
        union
        {
            struct
            {
                real x, y, z, w;
            };
            real values[4];
            __m128 ssevalues;
        };
        .
        .
        .
    };


    const vector3 matrix::operator * (const vector3 &v) const
    {
        vector3 t(v);

        //Apply(t); //Plain old c++ code

        __m128 row0, row1, row2, col, res, res_t;

        col = _mm_set1_ps(v.ssevalues.m128_f32[0]);
        res = _mm_mul_ps(row0, col);
        col = _mm_set1_ps(v.ssevalues.m128_f32[1]);
        res_t = _mm_mul_ps(row1, col);
        col = _mm_set1_ps(v.ssevalues.m128_f32[2]);
        res_t = _mm_mul_ps(row2, col);

        t.ssevalues = res;

        return t;
    }


Please bear in mind that I'm just experimenting at the moment and have only just started with intrinsics... Any tips on how to improve it? Thank you all!
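
Two things stand out in the snippet above: `row0`, `row1` and `row2` are declared but never loaded from the matrix, and the second and third products are written to `res_t` but never added into `res`. A minimal sketch of the full accumulation might look like the following. It assumes a hypothetical `Mat4` that stores four columns of four floats contiguously and treats `w` as a fourth component; neither detail is confirmed by the original classes, and the names `Mat4`, `Vec4`, `mul_sse`, `make_translation` and `check_translation` are made up for illustration. The check uses a translation matrix rather than the identity, because a non-symmetric matrix gives a different answer when the layout is transposed.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (_mm_loadu_ps, _mm_mul_ps, ...)

// Hypothetical storage, assumed for illustration only: column-major,
// so cols[c][r] is the element in column c, row r.
struct Mat4 { float cols[4][4]; };
struct Vec4 { float x, y, z, w; };

// Matrix * column vector: result = col0*x + col1*y + col2*z + col3*w.
inline Vec4 mul_sse(const Mat4& m, const Vec4& v) {
    // Unaligned loads, because this struct makes no alignment promise.
    __m128 res = _mm_mul_ps(_mm_loadu_ps(m.cols[0]), _mm_set1_ps(v.x));
    // Each further partial product is added into the running result.
    res = _mm_add_ps(res, _mm_mul_ps(_mm_loadu_ps(m.cols[1]), _mm_set1_ps(v.y)));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_loadu_ps(m.cols[2]), _mm_set1_ps(v.z)));
    res = _mm_add_ps(res, _mm_mul_ps(_mm_loadu_ps(m.cols[3]), _mm_set1_ps(v.w)));

    float r[4];
    _mm_storeu_ps(r, res);
    return Vec4{r[0], r[1], r[2], r[3]};
}

// Translation by (tx, ty, tz). The translation sits in the last column,
// so the matrix is not symmetric and a transposed layout becomes visible.
inline Mat4 make_translation(float tx, float ty, float tz) {
    Mat4 m = {{{1.0f, 0.0f, 0.0f, 0.0f},
               {0.0f, 1.0f, 0.0f, 0.0f},
               {0.0f, 0.0f, 1.0f, 0.0f},
               {tx,   ty,   tz,   1.0f}}};
    return m;
}

// A point (w = 1) should be moved by the translation.
inline void check_translation() {
    Vec4 p = mul_sse(make_translation(1.0f, 2.0f, 3.0f),
                     Vec4{10.0f, 20.0f, 30.0f, 1.0f});
    assert(p.x == 11.0f && p.y == 22.0f && p.z == 33.0f && p.w == 1.0f);
}
```

If the real class keeps rows rather than columns contiguous, the same multiply-and-add pattern would compute the product with the transposed matrix, which is exactly what the translation check is meant to catch.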

Promit    13246
What you should do is compare it against a known good reference implementation in C. Just generate a couple million random matrices and vectors, run them through both, and confirm that the results are the same (or at least extremely close).
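
A comparison along these lines could be sketched as below. This is only an illustration under assumptions: `Mat4`, `Vec4`, `MulFn`, `mul_reference`, `mul_transposed` and `count_mismatches` are hypothetical names, the column-major layout is a guess, and the tolerance is arbitrary. The harness takes two function pointers, so the SSE routine can be plugged in as the candidate; here a deliberately transposed scalar version stands in for a buggy candidate to show that random inputs expose it.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical types, assumed for illustration: cols[c][r] is column c, row r.
struct Mat4 { float cols[4][4]; };
struct Vec4 { float v[4]; };

using MulFn = Vec4 (*)(const Mat4&, const Vec4&);

// Plain scalar reference: out[r] = sum over c of M(r, c) * x[c].
Vec4 mul_reference(const Mat4& m, const Vec4& x) {
    Vec4 out = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out.v[r] += m.cols[c][r] * x.v[c];
    return out;
}

// Deliberately wrong stand-in for a buggy candidate: reads the matrix transposed.
Vec4 mul_transposed(const Mat4& m, const Vec4& x) {
    Vec4 out = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out.v[r] += m.cols[r][c] * x.v[c];
    return out;
}

// Runs both functions on random inputs and returns how many trials
// differ by more than tol in at least one component.
int count_mismatches(MulFn reference, MulFn candidate,
                     int trials, float tol, unsigned seed) {
    assert(trials > 0 && tol >= 0.0f);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    int mismatches = 0;
    for (int t = 0; t < trials; ++t) {
        Mat4 m;
        Vec4 x;
        for (auto& column : m.cols)
            for (float& f : column) f = dist(rng);
        for (float& f : x.v) f = dist(rng);

        const Vec4 a = reference(m, x);
        const Vec4 b = candidate(m, x);
        for (int i = 0; i < 4; ++i) {
            if (std::fabs(a.v[i] - b.v[i]) > tol) {
                ++mismatches;
                break;  // count each trial at most once
            }
        }
    }
    return mismatches;
}
```

Calling `count_mismatches(mul_reference, candidate, 1000000, 1e-4f, 42u)` with the SSE routine as `candidate` should return 0 if the two agree; a small non-zero tolerance allows for rounding differences between scalar and SIMD code paths.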

kohlrak    100
It's no mystery that SSE is faster than FPU code. The compiler uses FPU. =p

cignox1    735
Quote:
Original post by kohlrak: It's no mystery that SSE is faster than FPU code. The compiler uses FPU. =p

Yep, that's why I'm happy: I have had a few problems recently with the compiler generating non-SSE code using double that was much faster than my ASM and intrinsics float code... (and faster than its own float code as well).
Still, VC++ is actually able to do a better job if I set 'generate SSE code' as an option for the compiler. Its SSE version is 1/10 of a second faster in my test.
But I'm quite happy with it, even though there is an error that makes the intrinsics version compute different values from those calculated by the C++ version. I will see...