I'm trying to learn SSE intrinsics and have written a dot product function with the following code:
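inline float Dot(const CVector4 &vVector)
{
    __m128 vec1;
    __m128 vec2;
    vec1 = _mm_mul_ps(this->v, vVector.v);   // element-wise products p0..p3
    vec2 = _mm_movehl_ps(vec1, vec1);        // copy the upper two products (p2, p3) into lanes 0 and 1
    vec2 = _mm_add_ss(vec2, vec1);           // lane 0 = p2 + p0
    vec1 = _mm_shuffle_ps(vec1, vec1, _MM_SHUFFLE(0,0,0,1)); // lane 0 = p1 (immediate 1, as in the shufps below)
    vec1 = _mm_add_ss(vec1, vec2);           // lane 0 = p1 + (p2 + p0)
    return vec1.m128_f32[0];                 // read lane 0 (m128_f32 is MSVC-specific)
}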
Now when I look at the assembly generated from the code above, it seems that the compiler (VC++ 2005) emits unneeded movaps instructions:
mov eax, DWORD PTR _vVector$[ebp]
movaps xmm1, XMMWORD PTR [ecx]
movaps xmm0, XMMWORD PTR [eax]
mulps xmm0, xmm1
movaps xmm1, xmm0 ; <--- ???
movhlps xmm1, xmm0
movaps xmm2, xmm0 ; <--- ???
addss xmm1, xmm0
shufps xmm2, xmm0, 1
addss xmm2, xmm1
movaps XMMWORD PTR _vec1$[esp+16], xmm2
movss xmm0, DWORD PTR _vec1$[esp+16]
Can I modify the code so that the compiler doesn't add these seemingly unnecessary instructions, or is there no way around it?
I'd rather not use inline assembly, since the compiler usually adds a bunch of code before and after the assembly block, which makes the generated code a lot slower.
Any suggestions?
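In case anyone wants to reproduce this outside my class, here is a minimal standalone sketch of the same reduction. The names Dot3SSE and Dot3Scalar are made up for this post and are not part of my vector class. I'm assuming the intended result is the sum of the first three products (the fourth lane is ignored), and I'm using the shuffle immediate 1 that shows up in the shufps instruction of the listing. It reads the result with _mm_cvtss_f32 instead of the MSVC-only m128_f32 member so that it should also build with other compilers; I haven't checked what VC++ 2005 generates for this variant.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_mul_ps, ...)

// Hypothetical free-function version of the member Dot() from the question.
// Sums lanes 0..2 of the element-wise product; lane 3 is ignored.
inline float Dot3SSE(__m128 a, __m128 b)
{
    __m128 prod = _mm_mul_ps(a, b);            // [p0, p1, p2, p3]
    __m128 hi   = _mm_movehl_ps(prod, prod);   // [p2, p3, p2, p3]
    hi          = _mm_add_ss(hi, prod);        // lane 0 = p2 + p0
    // Assumption: immediate 1, i.e. _MM_SHUFFLE(0,0,0,1), puts p1 in lane 0.
    __m128 p1   = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(0, 0, 0, 1));
    return _mm_cvtss_f32(_mm_add_ss(p1, hi));  // p1 + (p2 + p0)
}

// Scalar reference for comparison (first three elements only).
inline float Dot3Scalar(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

Note that _mm_set_ps takes its arguments from the highest lane down, so _mm_set_ps(4, 3, 2, 1) puts 1 in lane 0. The order in which the three products are added differs between the two functions, so results can differ in the last bits for inputs that are not exactly representable.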