Floating point behavior of xmm registers

5 comments, last by Laval B 12 years, 7 months ago
Hi everyone.

I'm working on a 4x4 matrix product using the SSE instruction set. I'm developing with inline assembly in C++ under Visual Studio 2010; I'm not using intrinsics. I just stumbled across something I can't explain. For the sake of discussion, let's assume a function defined like this:



inline void Multiply(const float *a, const float *b, float *c)
{
__asm
{
mov eax, a
// mov ecx, b
mov edx, c

movss xmm0, dword ptr [eax]
movaps xmmword ptr [edx], xmm0

}
}


Basically, I load a 32-bit floating-point number into the lowest 32-bit element of the xmm0 register, then store the whole register back into a float array (with room for 16 floats). Yes, my data is aligned on 16 bytes (no exception is raised). The problem is that the value in c is not the value I loaded from a: a[0] contained 18.283001, but the value returned in c[0] is 1.2000000. If I do


shufps xmm0, xmm0, 0h


before storing to c, the value in the first component of xmm0 is propagated to the other components, but it is still the same 1.2000000. I don't understand why putting a value in an xmm register makes that value change.

Does anyone have an idea?
We think in generalities, but we live in details.
- Alfred North Whitehead
You move 'c' into edx, not 'b', so you store in the wrong place. Use ecx, or move 'b' into edx instead of 'c'. As it stands, you overwrite memory you shouldn't.
OK, I edited my original post: c is also a pointer, as it is in my code. So I load a[0] into xmm0, then store xmm0 back into c. The result is the same.
We think in generalities, but we live in details.
- Alfred North Whitehead
I ran your code; it works perfectly fine and outputs 18.283001 if that is the input.
Always post a complete working sample that compiles and runs with the stated behavior when asking this kind of question. Your error is in another part of your program.
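
For readers who want to check the expected behavior without a debugger, here is a minimal sketch that models the three instructions from the question (movss load, shufps with immediate 0, movaps store) in portable C++. It is not the poster's code and it does not execute any SSE instructions. The type Vec4 and the Model* function names are hypothetical, introduced only for this illustration.

```cpp
#include <array>
#include <cassert>

// Hypothetical model type: the four float lanes of one xmm register, lane 0 first.
using Vec4 = std::array<float, 4>;

// Models "movss xmm, dword ptr [mem]": lane 0 receives the float at p,
// and the upper three lanes are cleared to zero.
inline Vec4 ModelMovssLoad(const float* p) {
    return Vec4{*p, 0.0f, 0.0f, 0.0f};
}

// Models "shufps dst, src, imm8": the two low result lanes are selected from
// dst and the two high result lanes from src, two selector bits per lane.
inline Vec4 ModelShufps(const Vec4& dst, const Vec4& src, unsigned imm8) {
    assert(imm8 <= 0xFFu);  // the immediate is a single byte
    return Vec4{dst[imm8 & 3u],
                dst[(imm8 >> 2) & 3u],
                src[(imm8 >> 4) & 3u],
                src[(imm8 >> 6) & 3u]};
}

// Models "movaps xmmword ptr [mem], xmm": all four lanes are written to memory.
// (The real instruction also requires p to be 16-byte aligned.)
inline void ModelMovapsStore(float* p, const Vec4& v) {
    for (int i = 0; i < 4; ++i) {
        p[i] = v[i];
    }
}
```

With a[0] = 18.283001f, ModelMovssLoad followed by ModelShufps(x, x, 0) and ModelMovapsStore leaves four copies of a[0] in the destination. None of these steps changes the value, which is consistent with the bug being elsewhere in the program.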




Yes, you are completely right. I had another test that I forgot about, and it was messing with my variables. I'm really sorry about that, and thanks a lot for your time and answer.

I have cleaned things up a bit, and I've run into another problem. Here is the complete code of the function:



/////////////////////////////////////////////////////////////////////////////
// Multiply two 4x4 row-major matrices using SSE instructions.
__forceinline void SSEMultAlligned(const f32 *a, const f32 *b, f32 *c)
{
__asm
{
// Get pointers to matrices
// into registers.
mov eax, a
mov ecx, b
mov edx, c

movss xmm0, dword ptr [eax] // Move a[0] into xmm0 first element.
movaps xmm1, xmmword ptr [ecx] // Move row 0 of b into xmm1.
shufps xmm0, xmm0, 0h // Broadcast a[0] in all xmm0.
// mulps xmm0, xmm1 // Multiply a[0] with row 0 of b.
/*
// Row 0.
movss xmm0, dword ptr [eax] // Move a[0] into xmm0 first element.
movaps xmm1, xmmword ptr [ecx] // Move row 0 of b into xmm1.
shufps xmm0, xmm0, 0h // Broadcast a[0] in all xmm0.
movss xmm2, dword ptr [eax+4h] // Move a[1] into xmm2 first element.
mulps xmm0, xmm1 // Multiply a[0] with row 0 of b.
shufps xmm2, xmm2, 0h // Broadcast a[1] in all xmm2.
movaps xmm3, xmmword ptr [ecx+10h] // Move row 1 of b into xmm3.
movss xmm4, dword ptr [eax+8h] // Move a[2] into xmm4 first element.
mulps xmm2, xmm3 // Multiply a[1] with row 1 of b.
shufps xmm4, xmm4, 0h // Broadcast a[2] in all xmm4.
addps xmm0, xmm2 // Accumulate result into xmm0.
movaps xmm2, xmmword ptr [ecx+20h] // Move row 2 of b into xmm2.
mulps xmm4, xmm2 // Multiply a[2] with row 2 of b.
movss xmm1, dword ptr [eax+0Ch] // Load a[3] into xmm1 first element.
addps xmm0, xmm4 // Accumulate result into xmm0.
*/
movaps xmmword ptr [edx], xmm0 // Store first line of result into c.

}
}
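
When testing a routine like this, it can help to have a plain scalar reference to compare the output against. The following is a minimal sketch, assuming row-major storage as stated in the function's header comment. The name MultiplyReference is hypothetical and is not part of the original code.

```cpp
#include <cassert>

// Hypothetical scalar reference: c = a * b for 4x4 row-major matrices.
// Element (i, j) of a row-major 4x4 matrix m is stored at m[i * 4 + j].
inline void MultiplyReference(const float* a, const float* b, float* c) {
    assert(c != a && c != b);  // the output must not alias an input
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                // Row i of a combined with column j of b.
                sum += a[i * 4 + k] * b[k * 4 + j];
            }
            c[i * 4 + j] = sum;
        }
    }
}
```

Seen row by row, row i of c is a[i*4+0] times row 0 of b, plus a[i*4+1] times row 1 of b, and so on. That is the same broadcast, multiply and accumulate pattern the SSE version builds with shufps, mulps and addps, so intermediate SSE results can be compared against partial sums of this loop (allowing for small rounding differences).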



SSEMultAlligned is called like this:



__declspec(align(16)) float aa[16] = {1.20f, 0.50f, 1.30f, 1.82f,
6.28f, 3.40f, 2.27f, 1.55f,
1.40f, 0.25f, 9.82f, 1.75f,
2.20f, 1.80f, 1.10f, 3.17f};

__declspec(align(16)) float bb[16] = {0.10f, -1.1f, 1.25f, 0.82f,
2.01f, 6.10f, 4.02f,-1.87f,
1.12f, 2.25f, 1.10f, 7.30f,
2.40f, 1.75f, 6.10f, 4.20f};

__declspec(align(16)) float cc[16];


SSEMultAlligned(aa, bb, cc);



If you run the code with the function as it is and put aa and cc in the debugger watch, cc will contain 1.200000, 1.200000, 1.200000, 1.200000, as it should. If you change xmm0 to xmm1 in the last line of the function and repeat, cc will contain 0.10000000, -1.1000000, 1.2500000 and 0.81999999, as it should as well (the first row of bb).

However, if you uncomment the line mulps xmm0, xmm1 and store xmm0 into cc again, only the first component seems to be multiplied correctly. The others are not. I get:

[0] 0.12000000 (which is ok, 0.12 = 1.2 * 0.1)
[1] -1.3200001 (wrong, should be 0.5 * -1.1 = -0.55)
[2] 1.5000000 (wrong again, should be 1.3 * 1.25 = 1.625)
[3] 0.98400003 (wrong too, should be 1.82 * 0.82 = 1.4924).

It looks as though only the first component is multiplied correctly. I don't know where the others come from, but they are quite far from what they should be.
We think in generalities, but we live in details.
- Alfred North Whitehead
xmm0 is filled with 1.2 1.2 1.2 1.2 and is then multiplied by xmm1, so no: every value in xmm1 should be multiplied by 1.2 (1.2 * -1.1 = -1.32, 1.2 * 1.25 = 1.5, 1.2 * 0.82 = 0.984), which is exactly the output you get.
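
The arithmetic can be checked with a small scalar sketch of what the broadcast followed by mulps computes, using the values from the thread. The names Vec4, ModelMulps, BroadcastTimesRow, NearlyEqual and CheckThreadExample are hypothetical, and the code only models the lane-wise behavior in portable C++.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical model type: the four float lanes of one xmm register.
using Vec4 = std::array<float, 4>;

// Models "mulps x, y": each lane of x is multiplied by the same lane of y.
inline Vec4 ModelMulps(const Vec4& x, const Vec4& y) {
    return Vec4{x[0] * y[0], x[1] * y[1], x[2] * y[2], x[3] * y[3]};
}

// Broadcasts the scalar s to all four lanes, then multiplies by one matrix row.
inline Vec4 BroadcastTimesRow(float s, const float* row) {
    const Vec4 broadcast = Vec4{s, s, s, s};
    const Vec4 r = Vec4{row[0], row[1], row[2], row[3]};
    return ModelMulps(broadcast, r);
}

// Float results such as 0.98400003 are compared with a small tolerance.
inline bool NearlyEqual(float x, float y, float tol = 1e-5f) {
    return std::fabs(x - y) <= tol;
}

// Reproduces the numbers from the thread: a[0] = 1.2 times row 0 of bb.
inline void CheckThreadExample() {
    const float row0[4] = {0.10f, -1.1f, 1.25f, 0.82f};
    const Vec4 r = BroadcastTimesRow(1.20f, row0);
    assert(NearlyEqual(r[0], 0.12f));
    assert(NearlyEqual(r[1], -1.32f));
    assert(NearlyEqual(r[2], 1.5f));
    assert(NearlyEqual(r[3], 0.984f));
}
```

The values the original poster expected (0.5 * -1.1, 1.3 * 1.25, 1.82 * 0.82) would be a lane-wise product of row 0 of aa with row 0 of bb, which is a different operation from the broadcast of a[0] used here.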



That's right, I was looking at the wrong variable in my watch. Everything is just fine.

Thanks a lot.
We think in generalities, but we live in details.
- Alfred North Whitehead

