Compiler Intrinsics

Published December 18, 2005
For a long time now I've argued against using assembly language in applications. The only areas where I really saw it as necessary involved extended instruction sets, such as SSE. Recently I decided to re-examine that belief and see if assembly was even needed in those areas.

Extended instruction sets typically improve performance by performing multiple operations at the same time. One example is the mulps SSE instruction, which performs four single precision floating point multiplications at once. Can't think of an operation that requires four multiplications at once? How about the dot product? The usefulness of these instructions can be debated, especially as it relates to game development. Not all operations can be easily expressed using extended instructions, and not all operations are as efficient using these instructions as they would be using standard x87 instructions. Furthermore, instruction sets like MMX and SSE/SSE2 have strict address alignment requirements for variables. These requirements are somewhat alleviated by instructions that can load from unaligned addresses, but those instructions are slower than the aligned versions.

I decided to use a common function for my tests. While perhaps not the best example of what can be done with SSE instructions, this operation is common enough and simple enough that following along shouldn't be too hard. The operation in question is the cross product. We will examine the code used to perform the cross product in a few ways. The first will be the standard cross product using floating point numbers; no real surprises there. Then we'll see how well the Visual Studio optimizer works when streaming instructions are enabled. Finally we'll compare a hand coded assembly cross product using SSE against a version written with the compiler's intrinsics.

The first cross product we will examine is the simplest one, a version of the cross product that simply uses floating point operations to compute the cross product:
void crossp(float v1[4], float v2[4], float out[4]) {
    out[0] = v1[1] * v2[2] - v1[2] * v2[1];
    out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
    out[2] = v1[0] * v2[1] - v1[1] * v2[0];
    out[3] = 0;
}

void crossp_safe(float v1[4], float v2[4], float out[4]) {
00401080  mov         ecx,dword ptr [esp+4]
    out[0] = v1[1] * v2[2] - v1[2] * v2[1];
00401084  fld         dword ptr [eax+8]
00401087  fmul        dword ptr [ecx+4]
0040108A  fld         dword ptr [ecx+8]
0040108D  fmul        dword ptr [eax+4]
00401090  fsubp       st(1),st
00401092  fstp        dword ptr [edx]
    out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
00401094  fld         dword ptr [ecx]
00401096  fmul        dword ptr [eax+8]
00401099  fld         dword ptr [ecx+8]
0040109C  fmul        dword ptr [eax]
0040109E  fsubp       st(1),st
004010A0  fmul        dword ptr [__real@bf800000 (40215Ch)]
004010A6  fstp        dword ptr [edx+4]
    out[2] = v1[0] * v2[1] - v1[1] * v2[0];
004010A9  fld         dword ptr [ecx]
004010AB  fmul        dword ptr [eax+4]
004010AE  fld         dword ptr [eax]
004010B0  fmul        dword ptr [ecx+4]
004010B3  fsubp       st(1),st
004010B5  fstp        dword ptr [edx+8]
    out[3] = 0;
004010B8  fldz
004010BA  fstp        dword ptr [edx+0Ch]
}
004010BD  ret

The compilation of this results in the output following the function definition. For this example I have disabled inlining; however, the result with inlining is almost exactly the same. We can see that there are quite a few memory reads here, which can be rather costly, although with caching it shouldn't be too bad.

When we enable SSE in the compiler options, we expect the result to be optimized using SSE instructions where possible. The output in this case rather surprised me: all the compiler did was replace the x87 floating point operations with their scalar SSE equivalents. This code is no more optimized than the previous code was.
void crossp(float v1[4], float v2[4], float out[4]) {
00401080  mov         ecx,dword ptr [esp+4]
    out[0] = v1[1] * v2[2] - v1[2] * v2[1];
00401084  movss       xmm0,dword ptr [eax+8]
00401089  mulss       xmm0,dword ptr [ecx+4]
0040108E  movss       xmm1,dword ptr [ecx+8]
00401093  mulss       xmm1,dword ptr [eax+4]
00401098  subss       xmm0,xmm1
0040109C  movss       dword ptr [edx],xmm0
    out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
004010A0  movss       xmm0,dword ptr [ecx]
004010A4  mulss       xmm0,dword ptr [eax+8]
004010A9  movss       xmm1,dword ptr [ecx+8]
004010AE  mulss       xmm1,dword ptr [eax]
004010B2  subss       xmm0,xmm1
004010B6  mulss       xmm0,dword ptr [__real@bf800000 (402160h)]
004010BE  movss       dword ptr [edx+4],xmm0
    out[2] = v1[0] * v2[1] - v1[1] * v2[0];
004010C3  movss       xmm0,dword ptr [ecx]
004010C7  mulss       xmm0,dword ptr [eax+4]
004010CC  movss       xmm1,dword ptr [eax]
004010D0  mulss       xmm1,dword ptr [ecx+4]
004010D5  subss       xmm0,xmm1
004010D9  movss       dword ptr [edx+8],xmm0
    out[3] = 0;
004010DE  xorps       xmm0,xmm0
004010E1  movss       dword ptr [edx+0Ch],xmm0
}
004010E6  ret

I was curious whether this was just an artifact of this particular operation, so I decided to perform a test and wrote some code that should be easy for the compiler to optimize using SSE instructions.
void sidemul(float v1[4], float v2[4]) {
    v1[0] *= v2[0];
    v1[1] *= v2[1];
    v1[2] *= v2[2];
    v1[3] *= v2[3];
}

void sidemul(float v1[4], float v2[4]) {
    v1[0] *= v2[0];
00401080  movss       xmm0,dword ptr [eax]
00401084  mulss       xmm0,dword ptr [ecx]
00401088  movss       dword ptr [eax],xmm0
    v1[1] *= v2[1];
0040108C  movss       xmm0,dword ptr [ecx+4]
00401091  mulss       xmm0,dword ptr [eax+4]
00401096  movss       dword ptr [eax+4],xmm0
    v1[2] *= v2[2];
0040109B  movss       xmm0,dword ptr [ecx+8]
004010A0  mulss       xmm0,dword ptr [eax+8]
004010A5  movss       dword ptr [eax+8],xmm0
    v1[3] *= v2[3];
004010AA  movss       xmm0,dword ptr [ecx+0Ch]
004010AF  mulss       xmm0,dword ptr [eax+0Ch]
004010B4  movss       dword ptr [eax+0Ch],xmm0
}
004010B9  ret

Much to my horror, we can see that this function isn't vectorized at all. Well, ok, it's optimized slightly, in that the scalar SSE instructions operate on single precision floats, while the x87 instructions operate on 80 bit floating point numbers. We can quickly see that a better optimized version would be:
void sidemul_sse(float v1[4], float v2[4]) {
004010C0  push        ebp
004010C1  mov         ebp,esp
004010C3  and         esp,0FFFFFFF0h
004010C6  mov         eax,dword ptr [v1]
    __m128 vec1, vec2;
    vec1 = _mm_load_ps(v2);
004010C9  movaps      xmm0,xmmword ptr [ecx]
    vec2 = _mm_load_ps(v1);
004010CC  movaps      xmm1,xmmword ptr [eax]
    vec2 = _mm_mul_ps(vec2, vec1);
004010CF  mulps       xmm0,xmm1
    _mm_store_ps(v1, vec2);
004010D2  movaps      xmmword ptr [eax],xmm0
}
004010D5  mov         esp,ebp
004010D7  pop         ebp
004010D8  ret

Clearly the optimizer must be borked (note that I am using Microsoft Visual Studio 2005 Team Edition for Software Developers).

Next up we want to compare a hand coded cross product using SSE instructions to an SSE cross product generated using compiler intrinsics.
void crossp_asm(float v1[4], float v2[4], float outv[4]) {
    __asm {
        mov ecx, [v1]
        movaps xmm0, [ecx]
        mov ecx, [v2]
        movaps xmm1, [ecx]
        movaps xmm4, xmm1
        movaps xmm3, xmm0
        shufps xmm4, xmm0, 0xCA
        shufps xmm3, xmm1, 0xD1
        mulps xmm4, xmm3
        movaps xmm2, xmm0
        shufps xmm2, xmm1, 0xCA
        shufps xmm1, xmm0, 0xD1
        mulps xmm2, xmm1
        subps xmm4, xmm2
        mov ecx, [outv]
        movaps [ecx], xmm4
    }
    outv[1] *= -1;
}

void crossp_asm(float v1[4], float v2[4], float outv[4]) {
00401000  mov         eax,dword ptr [esp+0Ch]
    __asm {
        mov ecx, [v1]
00401004  mov         ecx,dword ptr [esp+4]
        movaps xmm0, [ecx]
00401008  movaps      xmm0,xmmword ptr [ecx]
        mov ecx, [v2]
0040100B  mov         ecx,dword ptr [esp+8]
        movaps xmm1, [ecx]
0040100F  movaps      xmm1,xmmword ptr [ecx]
        movaps xmm4, xmm1
00401012  movaps      xmm4,xmm1
        movaps xmm3, xmm0
00401015  movaps      xmm3,xmm0
        shufps xmm4, xmm0, 0xCA
00401018  shufps      xmm4,xmm0,0CAh
        shufps xmm3, xmm1, 0xD1
0040101C  shufps      xmm3,xmm1,0D1h
        mulps xmm4, xmm3
00401020  mulps       xmm4,xmm3
        movaps xmm2, xmm0
00401023  movaps      xmm2,xmm0
        shufps xmm2, xmm1, 0xCA
00401026  shufps      xmm2,xmm1,0CAh
        shufps xmm1, xmm0, 0xD1
0040102A  shufps      xmm1,xmm0,0D1h
        mulps xmm2, xmm1
0040102E  mulps       xmm2,xmm1
        subps xmm4, xmm2
00401031  subps       xmm4,xmm2
        mov ecx, [outv]
00401034  mov         ecx,dword ptr [esp+0Ch]
        movaps [ecx], xmm4
00401038  movaps      xmmword ptr [ecx],xmm4
    }
    outv[1] *= -1;
0040103B  fld         dword ptr [eax+4]
0040103E  fmul        dword ptr [__real@bf800000 (40215Ch)]
00401044  fstp        dword ptr [eax+4]
}
00401047  ret

Because of how the parameters are passed to the function, we must perform the odd little trick of moving the address of each variable into a register and then dereferencing the register to get the data into the SSE registers. Other than that, the code is fairly straightforward: we use the shuffle instructions to arrange the data so that we can perform our multiplications, and we also have to do some work to make sure our signs are right. Again, the assembly dump is below the function definition. It turns out to be somewhat shorter than our regular version, by about two instructions.

Finally, we have the intrinsic version of our cross product:
void crossp_sse(float v1[4], float v2[4], float out[4]) {
    __m128 vector1, vector2, vector3, vector4, vector5;
    vector1 = _mm_load_ps(v1);
    vector2 = _mm_load_ps(v2);
    vector3 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 0, 2, 2));
    vector4 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 1, 0, 1));
    vector5 = _mm_mul_ps(vector3, vector4);
    vector3 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 0, 2, 2));
    vector4 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 1, 0, 1));
    vector3 = _mm_mul_ps(vector3, vector4);
    vector3 = _mm_sub_ps(vector5, vector3);
    _mm_store_ps(out, vector3);
    out[1] *= -1;
}

void crossp_sse(float v1[4], float v2[4], float out[4]) {
00401050  push        ebp
00401051  mov         ebp,esp
00401053  and         esp,0FFFFFFF0h
    __m128 vector1, vector2, vector3, vector4, vector5;
    vector1 = _mm_load_ps(v1);
00401056  movaps      xmm1,xmmword ptr [ecx]
    vector2 = _mm_load_ps(v2);
00401059  movaps      xmm0,xmmword ptr [edx]
0040105C  mov         eax,dword ptr [out]
    vector3 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 0, 2, 2));
    vector4 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 1, 0, 1));
    vector5 = _mm_mul_ps(vector3, vector4);
    vector3 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 0, 2, 2));
    vector4 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 1, 0, 1));
0040105F  movaps      xmm3,xmm0
00401062  shufps      xmm3,xmm1,0D1h
00401066  movaps      xmm2,xmm1
00401069  shufps      xmm2,xmm0,0CAh
    vector3 = _mm_mul_ps(vector3, vector4);
0040106D  mulps       xmm2,xmm3
00401070  movaps      xmm3,xmm1
00401073  shufps      xmm3,xmm0,0D1h
00401077  shufps      xmm0,xmm1,0CAh
0040107B  mulps       xmm3,xmm0
    vector3 = _mm_sub_ps(vector5, vector3);
0040107E  subps       xmm3,xmm2
    _mm_store_ps(out, vector3);
00401081  movaps      xmmword ptr [eax],xmm3
    out[1] *= -1;
00401084  fld         dword ptr [eax+4]
00401087  fmul        dword ptr [__real@bf800000 (40215Ch)]
0040108D  fstp        dword ptr [eax+4]
}
00401090  mov         esp,ebp
00401092  pop         ebp
00401093  ret

As we can see, the non-inlined intrinsic version is just slightly longer than the assembly version, mostly due to the frame pointer setup. One thing to note, though: this function takes the addresses of the arrays in registers, instead of requiring a read from memory to obtain pointers to the arrays.

So the next question is: how do these two perform when inlined? After inlining, the intrinsic function eliminates the frame pointer manipulation entirely, while the assembly version ends up having more overhead than the intrinsic version. So the conclusion is: using the compiler's intrinsics is recommended. Not only does the compiler understand intrinsics better than it understands your own assembly, it can also perform optimizations on intrinsics that it wouldn't be able to perform on your own hand crafted assembly.

Comments

jollyjeffers
Very interesting read.

I'll have to remember this entry - the number of times you get bitching between micro vs macro optimization in code and neither side seems to have any particularly conclusive evidence either way [rolleyes]

Would be nice to have a set of examples and results to use [smile]

Cheers,
Jack
December 18, 2005 07:54 AM
Muhammad Haggag
Washu, I so much hate you. If there were a single language that you didn't know, I'd want to learn it and OWN YOU at it. Try, at least. [grin]

A really useful entry, as usual. Thanks!
December 18, 2005 12:18 PM
Extrarius
Quote:
Not only is the compiler able to better understand intrinsic instructions than it can your own assembly, it can perform optimizations on intrinsic instructions that it wouldn't be able to perform with your own hand crafted assembly.
Would you agree that the problem with assembly is likely due to the way MSVS accepts raw assembly rather than some annotated version like GCC? It seems that the way inline assembly works in GCC that the compiler should be able to treat it as well as any intrinsic function, at least with respect to inlining.
December 19, 2005 10:09 AM
Washu
Because that's the next entry!
December 22, 2005 02:43 AM