Extended instruction sets typically improve performance by letting you perform multiple operations at the same time. One example is the mulps SSE instruction, which performs four single precision floating point multiplications at once. Can't think of any operation that needs four multiplications at once? How about the dot product? The usefulness of these instructions can be debated, especially as it relates to game development. Not all operations can be easily expressed using extended instructions, and not all operations are as efficient using these instructions as they would be using standard x87 instructions. Furthermore, instruction sets like MMX and SSE/SSE2 impose strict address alignment requirements on variables. These requirements are somewhat alleviated by instructions that can load from unaligned addresses; however, those instructions are slower than the aligned versions.
I decided to use a common function for my tests. While perhaps not the best example of what can be done with SSE instructions, this operation is common enough and simple enough that following along shouldn't be too hard. The operation in question is the cross product. We will examine the code used to perform the cross product in a few ways. The first will be the standard cross product using floating point instructions; no real surprises there. Then we'll see how well the Visual Studio optimizer does when streaming instructions are enabled. Finally we'll compare a hand coded assembly cross product using SSE against a version built from the compiler's intrinsic functions.
The first cross product we will examine is the simplest one: a version that computes the result using plain floating point operations:
void crossp(float v1[4], float v2[4], float out[4]) {
    out[0] = v1[1] * v2[2] - v1[2] * v2[1];
    out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
    out[2] = v1[0] * v2[1] - v1[1] * v2[0];
    out[3] = 0;
}

void crossp_safe(float v1[4], float v2[4], float out[4]) {
00401080  mov    ecx,dword ptr [esp+4]
    out[0] = v1[1] * v2[2] - v1[2] * v2[1];
00401084  fld    dword ptr [eax+8]
00401087  fmul   dword ptr [ecx+4]
0040108A  fld    dword ptr [ecx+8]
0040108D  fmul   dword ptr [eax+4]
00401090  fsubp  st(1),st
00401092  fstp   dword ptr [edx]
    out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
00401094  fld    dword ptr [ecx]
00401096  fmul   dword ptr [eax+8]
00401099  fld    dword ptr [ecx+8]
0040109C  fmul   dword ptr [eax]
0040109E  fsubp  st(1),st
004010A0  fmul   dword ptr [__real@bf800000 (40215Ch)]
004010A6  fstp   dword ptr [edx+4]
    out[2] = v1[0] * v2[1] - v1[1] * v2[0];
004010A9  fld    dword ptr [ecx]
004010AB  fmul   dword ptr [eax+4]
004010AE  fld    dword ptr [eax]
004010B0  fmul   dword ptr [ecx+4]
004010B3  fsubp  st(1),st
004010B5  fstp   dword ptr [edx+8]
    out[3] = 0;
004010B8  fldz
004010BA  fstp   dword ptr [edx+0Ch]
}
004010BD  ret
Compiling this produces the output following the function definition. For this example I have disabled inlining; however, the result with inlining is almost exactly the same. We can see that there are quite a few memory reads here, which can be rather costly, although with caching it shouldn't be too bad.
When we enable SSE in the compiler options, we expect the result to be optimized using SSE instructions where possible. The output in this case rather surprised me: all the compiler did was replace the x87 floating point operations with their scalar SSE equivalents. This code is no better optimized than the previous version.
void crossp(float v1[4], float v2[4], float out[4]) {
00401080  mov    ecx,dword ptr [esp+4]
    out[0] = v1[1] * v2[2] - v1[2] * v2[1];
00401084  movss  xmm0,dword ptr [eax+8]
00401089  mulss  xmm0,dword ptr [ecx+4]
0040108E  movss  xmm1,dword ptr [ecx+8]
00401093  mulss  xmm1,dword ptr [eax+4]
00401098  subss  xmm0,xmm1
0040109C  movss  dword ptr [edx],xmm0
    out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
004010A0  movss  xmm0,dword ptr [ecx]
004010A4  mulss  xmm0,dword ptr [eax+8]
004010A9  movss  xmm1,dword ptr [ecx+8]
004010AE  mulss  xmm1,dword ptr [eax]
004010B2  subss  xmm0,xmm1
004010B6  mulss  xmm0,dword ptr [__real@bf800000 (402160h)]
004010BE  movss  dword ptr [edx+4],xmm0
    out[2] = v1[0] * v2[1] - v1[1] * v2[0];
004010C3  movss  xmm0,dword ptr [ecx]
004010C7  mulss  xmm0,dword ptr [eax+4]
004010CC  movss  xmm1,dword ptr [eax]
004010D0  mulss  xmm1,dword ptr [ecx+4]
004010D5  subss  xmm0,xmm1
004010D9  movss  dword ptr [edx+8],xmm0
    out[3] = 0;
004010DE  xorps  xmm0,xmm0
004010E1  movss  dword ptr [edx+0Ch],xmm0
}
004010E6  ret
I was curious whether this was just an artifact of this particular operation, so I wrote some code that should be easy for the compiler to optimize using SSE instructions.
void sidemul(float v1[4], float v2[4]) {
    v1[0] *= v2[0];
    v1[1] *= v2[1];
    v1[2] *= v2[2];
    v1[3] *= v2[3];
}

void sidemul(float v1[4], float v2[4]) {
    v1[0] *= v2[0];
00401080  movss  xmm0,dword ptr [eax]
00401084  mulss  xmm0,dword ptr [ecx]
00401088  movss  dword ptr [eax],xmm0
    v1[1] *= v2[1];
0040108C  movss  xmm0,dword ptr [ecx+4]
00401091  mulss  xmm0,dword ptr [eax+4]
00401096  movss  dword ptr [eax+4],xmm0
    v1[2] *= v2[2];
0040109B  movss  xmm0,dword ptr [ecx+8]
004010A0  mulss  xmm0,dword ptr [eax+8]
004010A5  movss  dword ptr [eax+8],xmm0
    v1[3] *= v2[3];
004010AA  movss  xmm0,dword ptr [ecx+0Ch]
004010AF  mulss  xmm0,dword ptr [eax+0Ch]
004010B4  movss  dword ptr [eax+0Ch],xmm0
}
004010B9  ret
Much to my horror, we can see that this function isn't vectorized at all. Well, OK, it is minorly optimized, in that the scalar SSE instructions operate on single precision floats, while the x87 instructions operate on 80 bit extended precision numbers. We can quickly see that a better optimized version would be:
void sidemul_sse(float v1[4], float v2[4]) {
004010C0  push   ebp
004010C1  mov    ebp,esp
004010C3  and    esp,0FFFFFFF0h
004010C6  mov    eax,dword ptr [v1]
    __m128 vec1, vec2;
    vec1 = _mm_load_ps(v2);
004010C9  movaps xmm0,xmmword ptr [ecx]
    vec2 = _mm_load_ps(v1);
004010CC  movaps xmm1,xmmword ptr [eax]
    vec2 = _mm_mul_ps(vec2, vec1);
004010CF  mulps  xmm0,xmm1
    _mm_store_ps(v1, vec2);
004010D2  movaps xmmword ptr [eax],xmm0
}
004010D5  mov    esp,ebp
004010D7  pop    ebp
004010D8  ret
Clearly the optimizer must be borked (note that I am using Microsoft Visual Studio 2005 Team Edition for Software Developers).
Next up, we want to compare a hand coded cross product using SSE instructions against an SSE cross product generated from compiler intrinsics.
void crossp_asm(float v1[4], float v2[4], float outv[4]) {
    __asm {
        mov    ecx, [v1]
        movaps xmm0, [ecx]
        mov    ecx, [v2]
        movaps xmm1, [ecx]
        movaps xmm4, xmm1
        movaps xmm3, xmm0
        shufps xmm4, xmm0, 0xCA
        shufps xmm3, xmm1, 0xD1
        mulps  xmm4, xmm3
        movaps xmm2, xmm0
        shufps xmm2, xmm1, 0xCA
        shufps xmm1, xmm0, 0xD1
        mulps  xmm2, xmm1
        subps  xmm4, xmm2
        mov    ecx, [outv]
        movaps [ecx], xmm4
    }
    outv[1] *= -1;
}

void crossp_asm(float v1[4], float v2[4], float outv[4]) {
00401000  mov    eax,dword ptr [esp+0Ch]
    __asm {
        mov ecx, [v1]
00401004  mov    ecx,dword ptr [esp+4]
        movaps xmm0, [ecx]
00401008  movaps xmm0,xmmword ptr [ecx]
        mov ecx, [v2]
0040100B  mov    ecx,dword ptr [esp+8]
        movaps xmm1, [ecx]
0040100F  movaps xmm1,xmmword ptr [ecx]
        movaps xmm4, xmm1
00401012  movaps xmm4,xmm1
        movaps xmm3, xmm0
00401015  movaps xmm3,xmm0
        shufps xmm4, xmm0, 0xCA
00401018  shufps xmm4,xmm0,0CAh
        shufps xmm3, xmm1, 0xD1
0040101C  shufps xmm3,xmm1,0D1h
        mulps xmm4, xmm3
00401020  mulps  xmm4,xmm3
        movaps xmm2, xmm0
00401023  movaps xmm2,xmm0
        shufps xmm2, xmm1, 0xCA
00401026  shufps xmm2,xmm1,0CAh
        shufps xmm1, xmm0, 0xD1
0040102A  shufps xmm1,xmm0,0D1h
        mulps xmm2, xmm1
0040102E  mulps  xmm2,xmm1
        subps xmm4, xmm2
00401031  subps  xmm4,xmm2
        mov ecx, [outv]
00401034  mov    ecx,dword ptr [esp+0Ch]
        movaps [ecx], xmm4
00401038  movaps xmmword ptr [ecx],xmm4
    }
    outv[1] *= -1;
0040103B  fld    dword ptr [eax+4]
0040103E  fmul   dword ptr [__real@bf800000 (40215Ch)]
00401044  fstp   dword ptr [eax+4]
}
00401047  ret
Because of how the parameters are passed to the function, we must perform the strange little trick of moving the address of the variable into a register and then dereferencing that register to get the data into the SSE registers. Other than that, the code is fairly straightforward: we use the shuffle instructions to arrange the data so that we can perform our multiplications, and we do a little extra work to make sure our signs come out right. Again, the assembly dump is below the function definition. It does appear to be somewhat shorter than our regular version, by about two instructions.
Finally, we have the intrinsic version of our cross product:
void crossp_sse(float v1[4], float v2[4], float out[4]) {
    __m128 vector1, vector2, vector3, vector4, vector5;
    vector1 = _mm_load_ps(v1);
    vector2 = _mm_load_ps(v2);
    vector3 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 0, 2, 2));
    vector4 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 1, 0, 1));
    vector5 = _mm_mul_ps(vector3, vector4);
    vector3 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 0, 2, 2));
    vector4 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 1, 0, 1));
    vector3 = _mm_mul_ps(vector3, vector4);
    vector3 = _mm_sub_ps(vector5, vector3);
    _mm_store_ps(out, vector3);
    out[1] *= -1;
}

void crossp_sse(float v1[4], float v2[4], float out[4]) {
00401050  push   ebp
00401051  mov    ebp,esp
00401053  and    esp,0FFFFFFF0h
    __m128 vector1, vector2, vector3, vector4, vector5;
    vector1 = _mm_load_ps(v1);
00401056  movaps xmm1,xmmword ptr [ecx]
    vector2 = _mm_load_ps(v2);
00401059  movaps xmm0,xmmword ptr [edx]
0040105C  mov    eax,dword ptr [out]
    vector3 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 0, 2, 2));
    vector4 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 1, 0, 1));
    vector5 = _mm_mul_ps(vector3, vector4);
    vector3 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 0, 2, 2));
    vector4 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 1, 0, 1));
0040105F  movaps xmm3,xmm0
00401062  shufps xmm3,xmm1,0D1h
00401066  movaps xmm2,xmm1
00401069  shufps xmm2,xmm0,0CAh
    vector3 = _mm_mul_ps(vector3, vector4);
0040106D  mulps  xmm2,xmm3
00401070  movaps xmm3,xmm1
00401073  shufps xmm3,xmm0,0D1h
00401077  shufps xmm0,xmm1,0CAh
0040107B  mulps  xmm3,xmm0
    vector3 = _mm_sub_ps(vector5, vector3);
0040107E  subps  xmm3,xmm2
    _mm_store_ps(out, vector3);
00401081  movaps xmmword ptr [eax],xmm3
    out[1] *= -1;
00401084  fld    dword ptr [eax+4]
00401087  fmul   dword ptr [__real@bf800000 (40215Ch)]
0040108D  fstp   dword ptr [eax+4]
}
00401090  mov    esp,ebp
00401092  pop    ebp
00401093  ret
As we can see, the non-inlined intrinsic version is just slightly longer than the assembly version, mostly due to the frame pointer setup. One thing to note, though: this function receives the array pointers in registers, instead of requiring reads from memory to obtain them.
So the next question is: how do these two perform when inlined? After inlining, the intrinsic function eliminates the frame pointers entirely, while the assembly version ends up with more overhead than the intrinsic version. So the conclusion is: use the compiler's intrinsics. Not only does the compiler understand intrinsics better than it understands your own assembly, it can perform optimizations on them that it cannot perform on hand crafted assembly.
I'll have to remember this entry - the number of times you get bitching between micro vs macro optimization in code and neither side seems to have any particularly conclusive evidence either way [rolleyes]
Would be nice to have a set of examples and results to use [smile]
Cheers,
Jack