
# Compiler Intrinsics, Part 2


As discussed last time, compiler intrinsics provide a nice, clean way to add SIMD capabilities to your applications. The next question is: do they help?

SIMD functionality allows you to perform a single operation on multiple pieces of data. For this to be optimal, though, your data must be arranged in such a way as to be vectorizable, and the operation you are trying to perform must be one that can be vectorized.

In the previous example I chose the cross product as my example. I actually had an ulterior motive for choosing it: it's not something that is very conducive to vectorization. Because of this, the performance gains we obtain by vectorizing it are not significant enough to justify the time spent researching, testing, debugging, and writing a vectorized version.
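To illustrate why, here is a hypothetical sketch (not the previous entry's code) of a cross product written with SSE intrinsics. Notice that half of the work is shuffling: each operand has to be rotated into place before the multiplies, and the result rotated back afterwards, so much of the instruction stream is data rearrangement rather than arithmetic.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Cross product via the identity cross(a, b) = (a * b.yzx - a.yzx * b).yzx.
// Three of the six SSE instructions here are shuffles (pure data
// rearrangement), which is what makes this operation a poor fit for SIMD.
__m128 cross_sse(__m128 a, __m128 b)
{
    __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1)); // (y, z, x, w)
    __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 c = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1)); // rotate back
}
```

For example, crossing (1, 2, 3) with (4, 5, 6) via `cross_sse(_mm_setr_ps(1,2,3,0), _mm_setr_ps(4,5,6,0))` yields (-3, 6, -3).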

We, as programmers, like to think that we know what we are doing, and that we can do better than the machine (because if a machine can out-program us... dear god!). The truth is, however, that we are human. Because of this we have some rather large limitations on what we can do and on what we can keep track of; things like this are why computers were invented in the first place. It is also because of these limitations that compilers can most often outperform a human when optimizing. Why? The simple answer is: state. The compiler is capable of building a complete picture of the state of all of the registers, and much else besides, at any particular point in a program. It knows of most code paths into a particular piece of code, and hence it can make decisions based on what it knows. For a human this is also possible... for very small and very linear programs. Because the compiler can keep all of this information available, it can make optimization decisions that we would otherwise miss.

The thing is, compilers aren't very good at vectorizing code; at least, most aren't. There are some compilers on the market designed specifically to vectorize code, Intel's optimizing compiler for instance. Vectorizing compilers tend to have to break language rules in order to rearrange data and code, and even alter the behavior of code, in order to vectorize it. This has two consequences:
1) Code can break, easily. The compiler may not fully understand the functionality of the code, so the changes it makes can easily break existing behavior.
2) The application becomes harder to debug, because the changes the compiler makes to the code make it harder to tell bugs introduced by the compiler apart from bugs in your own code.

So, now the question becomes: how does our vectorized cross product perform against the floating-point, or compiler-"optimized", cross product? To test the performance I had each version perform 100,000 cross products on randomly generated vectors, repeated this process over 100 iterations, and averaged the time of each iteration. For each iteration the list of vectors (200,000 in all) was regenerated, and the resulting vectors were then saved out to a file. The test was set up for the most likely case, that of inlining being impossible (function pointers were used). However, even when inlining was possible, the numbers did not change significantly enough to report.
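For reference, here is a minimal sketch of that kind of harness. The names `run_bench` and `crossp_fn`, and the scalar body standing in for `crossp_safe`, are mine, not the original test code, and the file output is omitted.

```cpp
#include <chrono>
#include <cstdlib>
#include <vector>

struct vec4 { float x, y, z, w; };

// Plain floating-point cross product, standing in for crossp_safe.
void crossp_safe(vec4* out, const vec4* a, const vec4* b)
{
    out->x = a->y * b->z - a->z * b->y;
    out->y = a->z * b->x - a->x * b->z;
    out->z = a->x * b->y - a->y * b->x;
    out->w = 0.0f;
}

using crossp_fn = void (*)(vec4*, const vec4*, const vec4*);

static float frand() { return float(std::rand()) / float(RAND_MAX); }

// Returns the average seconds per iteration. Each iteration regenerates
// the input lists, then times `count` cross products called through a
// function pointer, which prevents inlining (the most likely real case).
double run_bench(crossp_fn fn, int iterations, int count)
{
    std::vector<vec4> out(count), in1(count), in2(count);
    double total = 0.0;
    for (int it = 0; it < iterations; ++it) {
        for (int i = 0; i < count; ++i) {
            in1[i] = { frand(), frand(), frand(), 0.0f };
            in2[i] = { frand(), frand(), frand(), 0.0f };
        }
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < count; ++i)
            fn(&out[i], &in1[i], &in2[i]);
        auto t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(t1 - t0).count();
    }
    return total / iterations;
}
```

Usage would be along the lines of `double avg = run_bench(crossp_safe, 100, 100000);`, repeated for each implementation under test.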
First, the results without SSE enabled (all times are in seconds):
crossp_safe: 0.00551182
crossp_asm: 0.00499535
crossp_sse: 0.00496931

As we can see, the vectorized cross products executed only marginally faster than the floating-point cross product. The intrinsic cross product appears to be slightly faster than the hand-coded version, but the difference is so small that it was most likely just that particular test run. This is good news in that vectorizing the operation wasn't a complete waste of time; however, the improvement wasn't significant enough for this operation. (Let's be frank: you don't actually execute that many cross products on the CPU. Your graphics shaders might, but that's another matter entirely.)

Enabling SSE optimizations doesn't do a whole lot to improve our safe cross product:
crossp_safe: 0.00552355
crossp_asm: 0.00507521
crossp_sse: 0.0050099

As we can see, it still runs in about the same amount of time; a pity, that.

We can draw a few conclusions from these results. First of all, for commonly executed operations, vectorization can make a significant difference in performance, provided the operations are suitable for it. Secondly, operations that aren't commonly executed, or that do not lend themselves well to vectorization, tend not to show performance gains large enough to offset the cost of vectorizing them. Some operations, such as the dot product, do lend themselves well to vectorization, which with SSE3 becomes something as simple as:
#include <pmmintrin.h> // SSE3 intrinsics (_mm_hadd_ps)
#include <utility>     // std::pair

float dotp_sse(float v1[4], float v2[4]) {
    float result;
    __m128 vec1, vec2;
    vec1 = _mm_load_ps(v1);
    vec2 = _mm_load_ps(v2);
    vec1 = _mm_mul_ps(vec1, vec2);
    vec1 = _mm_hadd_ps(vec1, vec1);
    vec1 = _mm_hadd_ps(vec1, vec1);
    _mm_store_ss(&result, vec1);
    return result;
}

std::pair<float, float> dotp_sse(float v1[4], float v2[4], float shared[4]) {
    float vec[4];
    __m128 vec1, vec2, vec3;
    vec1 = _mm_load_ps(v1);
    vec2 = _mm_load_ps(v2);
    vec3 = _mm_load_ps(shared);
    vec1 = _mm_mul_ps(vec1, vec3);
    vec2 = _mm_mul_ps(vec2, vec3);
    vec1 = _mm_hadd_ps(vec1, vec2);
    vec1 = _mm_hadd_ps(vec1, vec1);
    _mm_storeu_ps(vec, vec1); // unaligned store: a local float[4] isn't guaranteed 16-byte aligned
    return std::make_pair(vec[0], vec[1]);
}

std::pair<float, float> dotp_sse(float v1[4], float v2[4], float v3[4], float v4[4]) {
    float vec[4];
    __m128 vec1, vec2, vec3, vec4;
    vec1 = _mm_load_ps(v1);
    vec2 = _mm_load_ps(v2);
    vec3 = _mm_load_ps(v3);
    vec4 = _mm_load_ps(v4);
    vec1 = _mm_mul_ps(vec1, vec2);
    vec2 = _mm_mul_ps(vec3, vec4);
    vec1 = _mm_hadd_ps(vec1, vec2); // vec2 holds the v3*v4 products; vec3 would be the raw v3
    vec1 = _mm_hadd_ps(vec1, vec1);
    _mm_storeu_ps(vec, vec1);
    return std::make_pair(vec[0], vec[1]);
}

Now, you may wonder why there are three dot products here. The simple answer is that with SSE3 horizontal additions we can just as easily perform one dot product, two dependent dot products, or two independent dot products, which is what the code above does. The first function is your standard dot product. The second computes the dot product of v1 with shared, along with v2 with shared. Finally, the third computes two dot products between two independent pairs of vectors: v1 dotted with v2, and v3 dotted with v4. Coming up with versions of this common operation that do three or four dot products at once is equally simple.

There are a variety of other places you can apply SIMD operations, such as IK solvers. You should consider the function you are trying to vectorize, along with profiling to determine if you really need to vectorize a particular operation. If the function is an operation that doesn't vectorize well, then perhaps you should leave it as it is. If the operation isn't performed commonly, then there is no need to waste time vectorizing it.

I'll admit that I don't read your journal often and that you may already cover stuff like this, but as you didn't mention anything to do with cache behavior in your tests, I thought I'd mention it ;)

You may be thrashing the cache, which could be a significant factor in your timing results. That's not to say it'll totally skew the results, but it'll likely skew their values: the relative benefits may be more clearly superior than they first appear, because the cost of cache misses is being added into your timings.

I'd recommend changing your test so that all your vectors fit in the L1 cache at the same time (even L2 cache hits will be expensive), prewarming your cache and then performing your tests.
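That suggestion might look something like this (a sketch only; the 32 KB L1 data cache size and the array names are assumptions, and the arrays mirror the example below):

```cpp
#include <cstddef>

struct vec4 { float v[4]; };

// Assuming a 32 KB L1 data cache: all three arrays must fit in it at once.
const std::size_t kL1Bytes = 32 * 1024;
const std::size_t kVecs = kL1Bytes / (3 * sizeof(vec4)); // 682 vec4s per array

vec4 asOutVecs[kVecs], asInVecs1[kVecs], asInVecs2[kVecs];

// Touch every element once before timing, so the timed loop starts with a
// warm cache instead of paying compulsory misses inside the measurement.
void prewarm()
{
    for (std::size_t i = 0; i < kVecs; ++i) {
        asOutVecs[i].v[0] = 0.0f;
        asInVecs1[i].v[0] += 0.0f;
        asInVecs2[i].v[0] += 0.0f;
    }
}
```

After `prewarm()`, the benchmark loop below would run over `kVecs` elements instead of 100,000, repeating the loop as many times as needed to get a measurable duration.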

Here is a potentially pathological test. Hopefully you didn't do your tests like this ;)

//(64 byte cache line means each array has 25000 cache lines)
vec4 asOutVecs[100000]; // aligned to 64 byte boundary?
vec4 asInVecs1[100000]; // aligned to 64 byte boundary?
vec4 asInVecs2[100000]; // aligned to 64 byte boundary?

// one iteration
for (int i = 0; i < 100000; i++)
{
// this will definitely cause cache misses, because of your
// data size (and the only 2-wayedness of modern caches)
whatever_cross_product(&asOutVecs[i], &asInVecs1[i], &asInVecs2[i]);
}



Of course, you may already have taken this stuff into account and if you have, please ignore me! ;D

The input vectors and the output vector were arranged to be in the same array, adjacent to each other. Even in the worst case (where you can ensure a cache miss every time), the performance doesn't change significantly (+0.001 s or so).

Adjusting the optimization settings some (i.e. messing with them to get the absolute best results), I managed to get these numbers:
crossp_safe: 0.0045557
crossp_asm: 0.00446192
crossp_sse: 0.00440686

As we can see, there is even LESS reason to use the hand-coded versions. But again, in the case of a math library, you could very well see a reason for it.