Well, there are many good, sensible points here. And I've heard all of them before. And they are not true...
__forceinline ...can blow out your code locality, instruction-cache, and the very small cache of decoded instructions too
That's true. But so does loop unrolling. That can't be the basis for dismissing __forceinline; it's just another factor to take into consideration. I've never encountered any I$ problems, and I have put __forceinline on 40-line functions more than once.
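For readers who want to try this outside Visual Studio, here is a minimal sketch of how forced inlining is usually spelled per compiler. FORCE_INLINE, madd and madd_first_lane are hypothetical names made up for this illustration, not something from the original post; __forceinline is the MSVC keyword, and the always_inline attribute is the GCC/Clang counterpart.

```cpp
#include <xmmintrin.h>

// Hypothetical portability macro (FORCE_INLINE is not a standard name).
#if defined(_MSC_VER)
  #define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define FORCE_INLINE inline __attribute__((always_inline))
#else
  #define FORCE_INLINE inline  // fallback: an ordinary inline hint only
#endif

// Small hot helper: computes a * b + c on four packed floats.
// Forcing it inline keeps the operands in registers at the call site.
FORCE_INLINE __m128 madd(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

// Hypothetical caller: broadcasts three scalars, runs the helper,
// and returns the lowest lane of the result.
float madd_first_lane(float a, float b, float c)
{
    const __m128 r = madd(_mm_set1_ps(a), _mm_set1_ps(b), _mm_set1_ps(c));
    return _mm_cvtss_f32(r);
}
```

Whether forcing the inline actually helps still has to be measured per call site; the macro only removes the compiler's veto.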
the compiler is in a much better position than you to determine whether to inline a particular call-site or not
Not always true, not even with the new, much improved VS12 and Intel Compiler 13. Compilers are not, and probably never will be, perfect, especially when they need to support a wide variety of CPUs. Sometimes you do know better, so why not help the compiler?
Forceinline is just stupid
Hmmm... That stupid thing gave me very nice performance improvements.
static const int swizzle1 = _MM_SHUFFLE( 3, 0, 2, 1 );
static const int swizzle2 = _MM_SHUFFLE( 3, 1, 0, 2 );
__m128 v1 = _mm_shuffle_ps( lhs, lhs, swizzle1 );
__m128 v2 = _mm_shuffle_ps( rhs, rhs, swizzle2 );
__m128 v3 = _mm_shuffle_ps( lhs, lhs, swizzle2 );
__m128 v4 = _mm_shuffle_ps( rhs, rhs, swizzle1 );
__m128 p1 = _mm_mul_ps( v1, v2 );
__m128 p2 = _mm_mul_ps( v3, v4 );
__m128 result = _mm_sub_ps( p1, p2 );
That's even worse than the original post. You are assuming that this code will be used sparsely, and that the output of this sequence will not be needed immediately. Relying too much on the compiler and the CPU's out-of-order execution is one of the main reasons for sub-optimal SSE performance. If you are sure your assumptions are true, then that code is fine. If you are implementing a library to be used by others - not so good.
Stop constantly assigning things to V & T - It just introduces dependency chains for no good reason
As V & T are local variables, the compiler will alias them using other registers. The dependency chain comes from the fact that the output of one instruction is the input to another instruction; it has nothing to do with the number of local variables.
I define a new __m128 at each step, the compiler optimizes better due to that, usually
Again, variable aliasing is Compilers 101; it's really one of those places where the compiler does very well on its own.
It does make the code much more readable, though.
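To make the point about local variables concrete, here is a small sketch of the same cross product written both ways. The function names cross_fresh and cross_reuse are hypothetical, and the V/T version is my own reconstruction of the "reuse two temporaries" style, not the original poster's code. In both versions the subtraction depends on the two multiplies, and each multiply depends on its two shuffles, so the data dependencies are identical; an optimizing compiler is free to pick registers independently of the variable names.

```cpp
#include <xmmintrin.h>

// Style A: a fresh, named __m128 for every intermediate value.
inline __m128 cross_fresh(__m128 lhs, __m128 rhs)
{
    const __m128 a_yzx = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 b_zxy = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 a_zxy = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 b_yzx = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 p1 = _mm_mul_ps(a_yzx, b_zxy);
    const __m128 p2 = _mm_mul_ps(a_zxy, b_yzx);
    return _mm_sub_ps(p1, p2);
}

// Style B: two temporaries, V and T, reassigned at each step.
// The values flowing between instructions are the same as in style A.
inline __m128 cross_reuse(__m128 lhs, __m128 rhs)
{
    __m128 V = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 T = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 1, 0, 2));
    V = _mm_mul_ps(V, T);  // V now holds the first product
    T = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 1, 0, 2));
    T = _mm_mul_ps(T, _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 0, 2, 1)));
    return _mm_sub_ps(V, T);
}
```

Comparing the generated assembly of the two functions is the real test; the sketch only shows that the source-level style does not change what depends on what.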
VC 2012 and Clang though... it is REALLY impressive to see the compilers doing such things
So true...
Other things:
1. If you are using a CPU with AVX, use the '/arch:AVX' compiler flag. It will magically convert your SSE code to use AVX128 (VEX-encoded) instructions. AVX instructions have a non-destructive destination register, which translates to better register utilization and better performance.
2. 64-bit applications have access to more registers (16 instead of 8). The drawback is larger code size, but the extra registers more than make up for that. Compile to 64-bit if you can.
Now, before everybody starts thumbing me down - my job for the past 9 years has been to squeeze the hell out of Intel's CPUs, especially using vectorization, so I have some experience with the subject.
As a final note, there are a lot of fine details when doing performance optimizations. There's a lot of trial-and-error involved, a lot of profiling, manually inspecting the assembly, scratching your head not understanding why things don't work as expected...
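A quick way to confirm that a build really picked up those two settings is to check the predefined macros. This is only a sketch: built_with_avx and built_for_x64 are hypothetical helper names. It relies on __AVX__ being defined by MSVC under /arch:AVX (and by GCC/Clang under -mavx), and on _M_X64 (MSVC) or __x86_64__ (GCC/Clang) being defined for 64-bit x86 targets.

```cpp
// True when this translation unit was compiled with AVX code generation
// enabled (for example /arch:AVX on MSVC or -mavx on GCC/Clang).
constexpr bool built_with_avx()
{
#if defined(__AVX__)
    return true;
#else
    return false;
#endif
}

// True when this translation unit targets 64-bit x86, where 16 XMM
// registers are available instead of 8.
constexpr bool built_for_x64()
{
#if defined(_M_X64) || defined(__x86_64__)
    return true;
#else
    return false;
#endif
}
```

Because both functions are constexpr, they can also feed a static_assert in a project that must never be built without these options.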
But most importantly, it's not black-and-white, there are no absolute truths. Getting to 70% of the optimal performance is easy. The performance cookbook will help you get to 90%. It's the last 10% that makes perf-opt so challenging (and so rewarding...).