Compiler Intrinsics

Washu

For a long time now I've been an advocate against using assembly language in applications. The only areas where I really saw it as necessary dealt with extended instruction sets, such as SSE. Recently I decided to re-examine that belief and see if it was even needed in those areas.

Extended instruction sets typically improve performance by performing multiple operations at the same time. One example is the mulps SSE instruction, which performs four single-precision floating point multiplications at once. Can't think of an operation that requires four multiplications at once? How about the dot product? The usefulness of these instructions can be debated, especially as it relates to game development. Not all operations can be easily expressed using extended instructions, and not all operations are as efficient using these instructions as they would be using standard x87 instructions. Furthermore, instruction sets like MMX and SSE/SSE2 impose strict address alignment requirements on variables. These requirements are somewhat alleviated by instructions that can load from unaligned addresses, but those are slower than the aligned versions.

I decided to use a common function for my tests. While perhaps not the best example of what can be done with SSE instructions, this operation is common enough and simple enough that following along shouldn't be too hard. The operation in question is the cross product. We will examine the code used to perform the cross product in a few ways. The first will be the standard cross product using floating point numbers. No real surprises there. Then we'll see how well the Visual Studio optimizer works by enabling streaming instructions. Finally we'll compare a hand-coded assembly cross product using SSE against a version written with the compiler's intrinsics.

The first cross product we will examine is the simplest one, a version of the cross product that simply uses floating point operations to compute the cross product:

void crossp(float v1[4], float v2[4], float out[4]) {
out[0] = v1[1] * v2[2] - v1[2] * v2[1];
out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
out[2] = v1[0] * v2[1] - v1[1] * v2[0];
out[3] = 0;
}

void crossp_safe(float v1[4], float v2[4], float out[4]) {
00401080 mov ecx,dword ptr [esp+4]
out[0] = v1[1] * v2[2] - v1[2] * v2[1];
00401084 fld dword ptr [eax+8]
00401087 fmul dword ptr [ecx+4]
0040108A fld dword ptr [ecx+8]
0040108D fmul dword ptr [eax+4]
00401090 fsubp st(1),st
00401092 fstp dword ptr [edx]
out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
00401094 fld dword ptr [ecx]
00401096 fmul dword ptr [eax+8]
00401099 fld dword ptr [ecx+8]
0040109C fmul dword ptr [eax]
0040109E fsubp st(1),st
004010A0 fmul dword ptr [__real@bf800000 (40215Ch)]
004010A6 fstp dword ptr [edx+4]
out[2] = v1[0] * v2[1] - v1[1] * v2[0];
004010A9 fld dword ptr [ecx]
004010AB fmul dword ptr [eax+4]
004010AE fld dword ptr [eax]
004010B0 fmul dword ptr [ecx+4]
004010B3 fsubp st(1),st
004010B5 fstp dword ptr [edx+8]
out[3] = 0;
004010B8 fldz
004010BA fstp dword ptr [edx+0Ch]
}
004010BD ret



Compiling this produces the output shown after the function definition. For this example, I have disabled inlining; however, the result with inlining is almost exactly the same. We can see that there are quite a few memory reads here, which can be rather costly, though with caching it shouldn't be too bad.

When we enable SSE in the compiler options, we expect the result to be optimized using SSE instructions where possible. The output in this case was rather surprising to me: all the compiler did was replace the x87 floating point operations with their scalar SSE equivalents. This code is no more optimized than the previous code was.

void crossp(float v1[4], float v2[4], float out[4]) {
00401080 mov ecx,dword ptr [esp+4]
out[0] = v1[1] * v2[2] - v1[2] * v2[1];
00401084 movss xmm0,dword ptr [eax+8]
00401089 mulss xmm0,dword ptr [ecx+4]
0040108E movss xmm1,dword ptr [ecx+8]
00401093 mulss xmm1,dword ptr [eax+4]
00401098 subss xmm0,xmm1
0040109C movss dword ptr [edx],xmm0
out[1] = -1 * (v1[0] * v2[2] - v1[2] * v2[0]);
004010A0 movss xmm0,dword ptr [ecx]
004010A4 mulss xmm0,dword ptr [eax+8]
004010A9 movss xmm1,dword ptr [ecx+8]
004010AE mulss xmm1,dword ptr [eax]
004010B2 subss xmm0,xmm1
004010B6 mulss xmm0,dword ptr [__real@bf800000 (402160h)]
004010BE movss dword ptr [edx+4],xmm0
out[2] = v1[0] * v2[1] - v1[1] * v2[0];
004010C3 movss xmm0,dword ptr [ecx]
004010C7 mulss xmm0,dword ptr [eax+4]
004010CC movss xmm1,dword ptr [eax]
004010D0 mulss xmm1,dword ptr [ecx+4]
004010D5 subss xmm0,xmm1
004010D9 movss dword ptr [edx+8],xmm0
out[3] = 0;
004010DE xorps xmm0,xmm0
004010E1 movss dword ptr [edx+0Ch],xmm0
}
004010E6 ret



I was curious whether this was just an artifact of this operation in particular, so I performed a test and wrote some code that should be easy for the compiler to optimize using SSE instructions.

void sidemul(float v1[4], float v2[4]) {
v1[0] *= v2[0];
v1[1] *= v2[1];
v1[2] *= v2[2];
v1[3] *= v2[3];
}

void sidemul(float v1[4], float v2[4]) {
v1[0] *= v2[0];
00401080 movss xmm0,dword ptr [eax]
00401084 mulss xmm0,dword ptr [ecx]
00401088 movss dword ptr [eax],xmm0
v1[1] *= v2[1];
0040108C movss xmm0,dword ptr [ecx+4]
00401091 mulss xmm0,dword ptr [eax+4]
00401096 movss dword ptr [eax+4],xmm0
v1[2] *= v2[2];
0040109B movss xmm0,dword ptr [ecx+8]
004010A0 mulss xmm0,dword ptr [eax+8]
004010A5 movss dword ptr [eax+8],xmm0
v1[3] *= v2[3];
004010AA movss xmm0,dword ptr [ecx+0Ch]
004010AF mulss xmm0,dword ptr [eax+0Ch]
004010B4 movss dword ptr [eax+0Ch],xmm0
}
004010B9 ret



Much to my horror, we can see that this function isn't optimized at all. Well, OK, it's marginally optimized, in that the scalar SSE instructions operate on single-precision floats, while the x87 floating point instructions operate on 80-bit floating point numbers. We can quickly see that a better-optimized version would be:

void sidemul_sse(float v1[4], float v2[4]) {
004010C0 push ebp
004010C1 mov ebp,esp
004010C3 and esp,0FFFFFFF0h
004010C6 mov eax,dword ptr [v1]
__m128 vec1, vec2;
vec1 = _mm_load_ps(v2);
004010C9 movaps xmm0,xmmword ptr [ecx]
vec2 = _mm_load_ps(v1);
004010CC movaps xmm1,xmmword ptr [eax]
vec2 = _mm_mul_ps(vec2, vec1);
004010CF mulps xmm0,xmm1
_mm_store_ps(v1, vec2);
004010D2 movaps xmmword ptr [eax],xmm0
}
004010D5 mov esp,ebp
004010D7 pop ebp
004010D8 ret



Clearly the optimizer must be borked (note that I am using Microsoft Visual Studio 2005 Team Edition for Software Developers).

Next up, we want to compare a hand-coded cross product using SSE instructions against an SSE cross product generated from compiler intrinsics.

void crossp_asm(float v1[4], float v2[4], float outv[4]) {
__asm {
mov ecx, [v1]
movaps xmm0, [ecx]
mov ecx, [v2]
movaps xmm1, [ecx]

movaps xmm4, xmm1
movaps xmm3, xmm0

shufps xmm4, xmm0, 0xCA
shufps xmm3, xmm1, 0xD1

mulps xmm4, xmm3
movaps xmm2, xmm0

shufps xmm2, xmm1, 0xCA
shufps xmm1, xmm0, 0xD1

mulps xmm2, xmm1
subps xmm4, xmm2

mov ecx, [outv]
movaps [ecx], xmm4
}
outv[1] *= -1;
}

void crossp_asm(float v1[4], float v2[4], float outv[4]) {
00401000 mov eax,dword ptr [esp+0Ch]
__asm {
mov ecx, [v1]
00401004 mov ecx,dword ptr [esp+4]
movaps xmm0, [ecx]
00401008 movaps xmm0,xmmword ptr [ecx]
mov ecx, [v2]
0040100B mov ecx,dword ptr [esp+8]
movaps xmm1, [ecx]
0040100F movaps xmm1,xmmword ptr [ecx]

movaps xmm4, xmm1
00401012 movaps xmm4,xmm1
movaps xmm3, xmm0
00401015 movaps xmm3,xmm0

shufps xmm4, xmm0, 0xCA
00401018 shufps xmm4,xmm0,0CAh
shufps xmm3, xmm1, 0xD1
0040101C shufps xmm3,xmm1,0D1h

mulps xmm4, xmm3
00401020 mulps xmm4,xmm3
movaps xmm2, xmm0
00401023 movaps xmm2,xmm0

shufps xmm2, xmm1, 0xCA
00401026 shufps xmm2,xmm1,0CAh
shufps xmm1, xmm0, 0xD1
0040102A shufps xmm1,xmm0,0D1h

mulps xmm2, xmm1
0040102E mulps xmm2,xmm1
subps xmm4, xmm2
00401031 subps xmm4,xmm2

mov ecx, [outv]
00401034 mov ecx,dword ptr [esp+0Ch]
movaps [ecx], xmm4
00401038 movaps xmmword ptr [ecx],xmm4
}
outv[1] *= -1;
0040103B fld dword ptr [eax+4]
0040103E fmul dword ptr [__real@bf800000 (40215Ch)]
00401044 fstp dword ptr [eax+4]
}
00401047 ret



Because of how the parameters are passed to the function, we must perform the strange little trick of moving the address of each array into a register, and then dereferencing the register to get the data into the SSE registers. Other than that, the code is fairly straightforward. We use the shuffle instructions to arrange the data so that we can perform our multiplications; we also have to do some work to make sure our signs are right. Again, the assembly dump is below the function definition. This is pretty straightforward, and it does appear to be somewhat shorter than our regular version, by about two instructions.

Finally, we have the intrinsic version of our cross product:

void crossp_sse(float v1[4], float v2[4], float out[4]) {
__m128 vector1, vector2, vector3, vector4, vector5;

vector1 = _mm_load_ps(v1);
vector2 = _mm_load_ps(v2);

vector3 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 0, 2, 2));
vector4 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 1, 0, 1));

vector5 = _mm_mul_ps(vector3, vector4);

vector3 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 0, 2, 2));
vector4 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 1, 0, 1));

vector3 = _mm_mul_ps(vector3, vector4);
vector3 = _mm_sub_ps(vector5, vector3);

_mm_store_ps(out, vector3);

out[1] *= -1;
}

void crossp_sse(float v1[4], float v2[4], float out[4]) {
00401050 push ebp
00401051 mov ebp,esp
00401053 and esp,0FFFFFFF0h
__m128 vector1, vector2, vector3, vector4, vector5;

vector1 = _mm_load_ps(v1);
00401056 movaps xmm1,xmmword ptr [ecx]
vector2 = _mm_load_ps(v2);
00401059 movaps xmm0,xmmword ptr [edx]
0040105C mov eax,dword ptr [out]

vector3 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 0, 2, 2));
vector4 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 1, 0, 1));

vector5 = _mm_mul_ps(vector3, vector4);

vector3 = _mm_shuffle_ps(vector1, vector2, _MM_SHUFFLE(3, 0, 2, 2));
vector4 = _mm_shuffle_ps(vector2, vector1, _MM_SHUFFLE(3, 1, 0, 1));
0040105F movaps xmm3,xmm0
00401062 shufps xmm3,xmm1,0D1h
00401066 movaps xmm2,xmm1
00401069 shufps xmm2,xmm0,0CAh

vector3 = _mm_mul_ps(vector3, vector4);
0040106D mulps xmm2,xmm3
00401070 movaps xmm3,xmm1
00401073 shufps xmm3,xmm0,0D1h
00401077 shufps xmm0,xmm1,0CAh
0040107B mulps xmm3,xmm0
vector3 = _mm_sub_ps(vector5, vector3);
0040107E subps xmm3,xmm2

_mm_store_ps(out, vector3);
00401081 movaps xmmword ptr [eax],xmm3

out[1] *= -1;
00401084 fld dword ptr [eax+4]
00401087 fmul dword ptr [__real@bf800000 (40215Ch)]
0040108D fstp dword ptr [eax+4]
}
00401090 mov esp,ebp
00401092 pop ebp
00401093 ret



As we can see, the non-inlined intrinsic version is just slightly longer than the assembly version, mostly due to the frame pointer setup. One thing to note, though: this function receives the array pointers in registers, instead of requiring a read from memory to obtain them.

So the next question is: how do these two perform when inlined? After inlining, the intrinsic function eliminates the frame pointers entirely. The assembly version, however, ends up with more overhead than the intrinsic version. So the conclusion is: use the compiler's intrinsics. Not only does the compiler understand intrinsics better than it can your own assembly, it can also perform optimizations on intrinsics that it could never perform on your hand-crafted assembly.


5 Comments



Very interesting read.

I'll have to remember this entry - the number of times you get bitching between micro vs macro optimization in code and neither side seems to have any particularly conclusive evidence either way [rolleyes]

Would be nice to have a set of examples and results to use [smile]

Cheers,
Jack

Washu, I so much hate you. If there were a single language that you didn't know, I'd want to learn it and OWN YOU at it. Try, at least. [grin]

A really useful entry, as usual. Thanks!

Quote:
Not only is the compiler able to better understand intrinsic instructions than it can your own assembly, it can perform optimizations on intrinsic instructions that it wouldn’t be able to perform with your own hand crafted assembly.
Would you agree that the problem with assembly is likely due to the way MSVS accepts raw assembly rather than some annotated version like GCC? It seems that the way inline assembly works in GCC that the compiler should be able to treat it as well as any intrinsic function, at least with respect to inlining.

Guest Anonymous Poster

Not to be picky, but why no timing data?

