I ran your code; it works fine and outputs 18.283001 for that input.
Always post a complete working sample that compiles and runs with the stated behavior when asking these kinds of questions. Your error is in another part of your program.
Yes, you are completely right. I had another test that I forgot about, and it was messing up my variables. I'm really sorry about that, and thanks a lot for your time and answer.
I have cleaned things up a bit, but I have run into another problem. Here is the complete code of the function:
/////////////////////////////////////////////////////////////////////////////
// Multiply two 4x4 row-major matrices using SSE instructions.
__forceinline void SSEMultAlligned(const f32 *a, const f32 *b, f32 *c)
{
    __asm
    {
        // Get pointers to matrices
        // into registers.
        mov eax, a
        mov ecx, b
        mov edx, c
        movss xmm0, dword ptr [eax] // Move a[0] into xmm0 first element.
        movaps xmm1, xmmword ptr [ecx] // Move row 0 of b into xmm1.
        shufps xmm0, xmm0, 0h // Broadcast a[0] to all four elements of xmm0.
        // mulps xmm0, xmm1 // Multiply a[0] with row 0 of b.
        /*
        // Row 0.
        movss xmm0, dword ptr [eax] // Move a[0] into xmm0 first element.
        movaps xmm1, xmmword ptr [ecx] // Move row 0 of b into xmm1.
        shufps xmm0, xmm0, 0h // Broadcast a[0] in all xmm0.
        movss xmm2, dword ptr [eax+10h] // Move a[1] into xmm2 first element.
        mulps xmm0, xmm1 // Multiply a[0] with row 0 of b.
        shufps xmm2, xmm2, 0h // Broadcast a[1] in all xmm2.
        movaps xmm3, xmmword ptr [ecx+10h] // Move row 1 of b into xmm3.
        movss xmm4, dword ptr [eax+20h] // Move a[2] into xmm4.
        mulps xmm2, xmm3 // Multiply a[1] with row 1 of b.
        shufps xmm4, xmm4, 0h // Broadcast a[2] into xmm4.
        addps xmm0, xmm2 // Accumulate result into xmm0.
        movaps xmm2, xmmword ptr [ecx+20h] // Move row 2 of b into xmm2.
        mulps xmm4, xmm2 // Multiply a[2] with row 2 of b.
        movss xmm1, dword ptr [eax + 30h] // Load a[3] into xmm1 first element.
        addps xmm0, xmm4 // Accumulate result into xmm0.
        */
        movaps xmmword ptr [edx], xmm0 // Store first line of result into c.
    }
}
It is called like this:
__declspec(align(16)) float aa[16] = {1.20f, 0.50f, 1.30f, 1.82f,
                                      6.28f, 3.40f, 2.27f, 1.55f,
                                      1.40f, 0.25f, 9.82f, 1.75f,
                                      2.20f, 1.80f, 1.10f, 3.17f};
__declspec(align(16)) float bb[16] = {0.10f, -1.1f, 1.25f, 0.82f,
                                      2.01f, 6.10f, 4.02f, -1.87f,
                                      1.12f, 2.25f, 1.10f, 7.30f,
                                      2.40f, 1.75f, 6.10f, 4.20f};
__declspec(align(16)) float cc[16];
SSEMultAlligned(aa, bb, cc);
If you run the code with the function as it is and put aa and cc in the debugger watch, cc will contain 1.200000, 1.200000, 1.200000, 1.200000, as it should. If you change xmm0 to xmm1 in the last line of the function and repeat, cc will contain 0.10000000, -1.1000000, 1.2500000 and 0.81999999, as it should as well (the first row of bb).
However, if you uncomment the line mulps xmm0, xmm1 and store xmm0 into cc again, only the first component is multiplied correctly. The others are not. I get:
[0] 0.12000000 (which is ok, 0.12 = 1.2 * 0.1)
[1] -1.3200001 (wrong, should be 0.5 * -1.1 = -0.55)
[2] 1.5000000 (wrong again, should be 1.3 * 1.25 = 1.625)
[3] 0.98400003 (wrong too, should be 1.82 * 0.82 = 1.4924).
It looks as though only the first floating-point component is multiplied correctly. I don't know where the others come from, but they are quite far from what they should be.
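For comparison, each of the four values listed above equals 1.2 times the matching element of the first row of bb (1.2 * -1.1 = -1.32, 1.2 * 1.25 = 1.5, 1.2 * 0.82 = 0.984). That is what you would get from a register holding a[0] in all four elements multiplied element by element with row 0 of b. Below is a minimal scalar sketch of those two steps in plain C, not the SSE code itself. BroadcastMul is a hypothetical helper name, and plain float is used on the assumption that f32 is a typedef for float.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of "shufps x, x, 0h" followed by "mulps x, row".
   Hypothetical helper for illustration; not part of the original code. */
void BroadcastMul(float s, const float *row, float *out)
{
    assert(row != NULL && out != NULL);

    /* shufps with an immediate of 0 copies element 0 into all four elements. */
    float lanes[4] = { s, s, s, s };

    /* mulps multiplies the two registers element by element. */
    for (int i = 0; i < 4; ++i)
        out[i] = lanes[i] * row[i];
}
```

Calling BroadcastMul(aa[0], bb, out) with a four-float array out should give approximately 0.12, -1.32, 1.5 and 0.984, the same values that show up in cc.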
We think in generalities, but we live in details.
- Alfred North Whitehead