How to profile SIMD

12 comments, last by satanir 10 years, 5 months ago

Actually, the topic name should be "How to profile code".

Well, I am writing a SIMD math library. I have two implementations: SSE and scalar.

I'm not sure how to measure the code's speed. Currently I'm not using optimization, and no debug symbols are generated for profiling.

I'm creating a loop that repeats the operation...

The compiler is cl (MSVC).

I was expecting the SSE dot product to be slower than the scalar version.

But the cross product is also slower!?


SGE_FORCE_INLINE SGVector vec3_cross(const SGVector& a, const SGVector& b)
{
#if defined(SGE_MATH_USE_SSE)
__m128 T = _mm_shuffle_ps(a.m_M128, a.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)
__m128 V = _mm_shuffle_ps(b.m_M128, b.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)


//i(ay*bz - by*az)  + j(bx*az - ax*bz)  + k(ax*by - bx*ay)
T = _mm_mul_ps(T, b.m_M128);//bx * ay, by * az, bz * ax
V = _mm_mul_ps(V, a.m_M128);//ax * by, ay * bz, az * bx
V = _mm_sub_ps(V, T);


V = _mm_shuffle_ps(V, V, SGE_SIMD_SHUFFLE(1, 2, 0, 3));
return SGVector(V);
#else
const float x = (a.y*b.z) - (b.y*a.z);
const float y = (b.x*a.z) - (a.x*b.z);
const float z = (a.x*b.y) - (b.x*a.y);


return SGVector(x, y, z, 0.f);
#endif
}

where SGVector is a struct containing union { struct { float x, y, z; }; float arr[4]; __m128 m_M128; }. (Maybe that is the problem?!)
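Written out, that layout is roughly the following (a sketch reconstructed from the post; the anonymous struct inside a union is a widely supported compiler extension, not standard C++):

```cpp
#include <xmmintrin.h>

// Sketch of the layout described above (member names taken from the post).
struct SGVector
{
    union
    {
        struct { float x, y, z; };
        float  arr[4];
        __m128 m_M128;
    };
};

// The __m128 member forces 16-byte size and alignment, so stack and static
// instances are fine; heap allocations must also provide 16-byte-aligned
// storage, or aligned loads/stores through m_M128 will fault.
static_assert(sizeof(SGVector) == 16, "one XMM register wide");
static_assert(alignof(SGVector) == 16, "16-byte aligned");
```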

EDIT: maybe __forceinline is involved too!? I will remove it.



Currently I'm not using optimization

Ah? If you want to measure performance you need to compile with optimizations enabled, at least /O2; otherwise every value will go through memory.

The best tool for profiling is Intel's VTune; they have a 30-day free trial. Visual Studio comes with a profiler you can use as well. If you want to profile yourself, you can use the timestamp counter (the rdtsc instruction; Visual Studio has an intrinsic for it).

__forceinline is very good for performance, I use it almost everywhere in performance critical code.

The problem is probably the shuffles. Not sure what CPU you are using, but shuffle performance is limited. You have 3 shuffles and 3 math instructions, which is not a good ratio. The SSE math instructions have data dependency on the shuffles, which will stall the CPU pipeline.

Also, your inputs and outputs are structs. Even though you pass them by reference, both the shuffles and the return values will go through memory (probably; you need to look at the generated assembly).

To maximize SSE performance you need to design your whole code around it. Have long SSE math sequences, and reduce the amount of memory acceses and shuffles.

BTW, you can tell CL to use scalar SSE for floating-point instructions by using /arch:SSE. Check this.
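Putting the compile advice together, an illustrative cl command line might be (flag set is my assumption, not from the thread; /Zi keeps debug info so a profiler can resolve symbols):

```shell
:: Illustrative MSVC build for profiling: optimizations on, symbols kept,
:: SSE code generation enabled for scalar floating point.
cl /O2 /Zi /arch:SSE2 /Fe:bench.exe bench.cpp
```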


__forceinline is very good for performance, I use it almost everywhere in performance critical code.

That's probably not great blanket advice. __forceinline is a blunt instrument: it causes the function to be inlined even when it's not called from a performance-critical section of code, which bloats your executable and can blow out your code locality, your instruction cache, and the very small cache of decoded instructions too, I think. In general, the compiler is in a much better position than you to determine whether to inline a particular call site or not. That's precisely why regular 'inline' started as an imperative command (which __forceinline is today) but became just a hint to the compiler -- programmer-specified inlining was hurting performance the majority of the time.

Just use regular inline if you want -- of course, with sufficient optimizations enabled, the compiler might inline a function at its own discretion, regardless of whether you marked it inline or not -- but don't use __forceinline unless you're really sure it'll help, you can back that up with hard data, and you verify the results afterwards. To verify, micro-benchmarks (say, a single tight loop that multiplies a large number of matrices) are insufficient in most cases, because they won't suffer the ill effects of __forceinline.

throw table_exception("(╯°□°)╯︵ ┻━┻");

Don't profile code, profile apps. Pass vectors by value, not reference (although that only works for the first 3 or so args, the rest will need to be by reference). Forceinline is just stupid. Stop using it. Stop constantly assigning things to V & T - It just introduces dependency chains for no good reason. Don't fear the use of multiple variables - they'll just end up as registers instead...

[source]
const float x = (a.y*b.z) - (b.y*a.z);
const float y = (a.z*b.x) - (b.z*a.x);
const float z = (a.x*b.y) - (b.x*a.y);
const float w = (0 * 0) - (0 * 0);
[/source]
4 shuffles, 2 mults, 1 sub = obvious, and better! ;)
I'd have to say there are a couple of good and bad suggestions here. Rob has the best one: don't profile a function, profile it within a real application.

The other suggestion, about using force inline: don't. You are overriding a compiler's decision making, and given that recent compilers are generally more knowledgeable than you are about when it makes sense to inline, that is a bad idea. (Plain inline is effectively a deprecated and ignored C++ feature for this reason.)

Calling a single function repeatedly to get a timing only tells you the absolute minimum number of cycles the instruction sequence can execute in. I can get that without a computer, using the Intel Intrinsics Guide and about five minutes of looking up the instructions, so it is not very useful. With SIMD you have instruction latencies (other operations have latencies too, just usually less severe), and compilers are very good at interleaving other operations with the SIMD code to hide those latencies. My cross product is actually:


static const int swizzle1 = _MM_SHUFFLE( 3, 0, 2, 1 );
static const int swizzle2 = _MM_SHUFFLE( 3, 1, 0, 2 );

__m128 v1 = _mm_shuffle_ps( lhs, lhs, swizzle1 );
__m128 v2 = _mm_shuffle_ps( rhs, rhs, swizzle2 );
__m128 v3 = _mm_shuffle_ps( lhs, lhs, swizzle2 );
__m128 v4 = _mm_shuffle_ps( rhs, rhs, swizzle1 );

__m128 p1 = _mm_mul_ps( v1, v2 );
__m128 p2 = _mm_mul_ps( v3, v4 );

__m128 result = _mm_sub_ps( p1, p2 );
Yup, one more shuffle in that code, but both in latency counting and in real code this is the better solution for the compiler to optimize with, given that the primary stall is between the last shuffle and the first multiply. The compiler can easily find ALU/FPU operations from other code around this inlined function to hide the single big stall, and it only needs a couple more instructions to hide the smaller stalls between the muls and the sub; of course, actually "using" the result is delayed a bit.

There is some give and take involved, of course. It is best if the compiler can find instructions to insert between each of the shuffles, since on P3 machines you could only issue a shuffle every other cycle and it then took 6 cycles before the resulting register was usable. On any P4 and later, though, shuffles generally have a throughput of 1, so issuing them one after another is not a bad thing.

Overall though, the key thing to remember is that when you hand-optimize with intrinsics (NOT with inline assembly etc.), what you are doing is acting as an extension of the compiler's optimizer. The things that are too tricky or complex (dot, cross, etc.) for the compiler's optimizer to auto-vectorize, you do by hand. But you want to write your optimization in such a manner that the compiler can then combine it with the optimizations it can make, which include inserting other instructions to hide latencies, reordering register usage (note that in the above I define a new __m128 at each step; the compiler usually optimizes better due to that), and a whole slew of other things most modern compilers are doing for you all the time.

One last note: ignore everything above if you are using VC 2008 or earlier; my cat can produce better code than that POS. VC 2010 and Clang are the minimums; GCC is 50/50 on whether it does things well, though supposedly it is getting better. VC 2012 and Clang, though, both do things like unrolling tight loops of the above cross product and hiding pretty much all the latencies; it is REALLY impressive to see compilers doing such things.

Well, there are many good, sensible points here. And I've heard all of them before. And they are not true...


__forceinline ...can blow out your code locality, instruction-cache, and the very small cache of decoded instructions too

That's true. But so does loop unrolling. That can't be the basis for dismissing __forceinline; it's just another factor to take into consideration. I've never encountered any I$ problems, and I've put __forceinline on 40-line functions more than once.


the compiler is in a much better position than you to determine whether to inline a particular call-site or not

Not always true, not even with the new, much-improved VS12 and Intel Compiler 13. Compilers are not, and probably never will be, perfect, especially when they need to support a wide variety of CPUs. Sometimes you do know better, so why not help the compiler?


Forceinline is just stupid

Hmmm... That stupid thing gave me very nice performance improvements.



static const int swizzle1 = _MM_SHUFFLE( 3, 0, 2, 1 ); 
static const int swizzle2 = _MM_SHUFFLE( 3, 1, 0, 2 ); 
__m128 v1 = _mm_shuffle_ps( lhs, lhs, swizzle1 ); 
__m128 v2 = _mm_shuffle_ps( rhs, rhs, swizzle2 ); 
__m128 v3 = _mm_shuffle_ps( lhs, lhs, swizzle2 ); 
__m128 v4 = _mm_shuffle_ps( rhs, rhs, swizzle1 ); 
__m128 p1 = _mm_mul_ps( v1, v2 ); 
__m128 p2 = _mm_mul_ps( v3, v4 ); 
__m128 result = _mm_sub_ps( p1, p2 );

That's even worse than the original post. You are assuming that this code will be used sparsely, and that the output of this sequence will not be needed immediately. Relying too much on the compiler and the CPU's out-of-order execution is one of the main reasons for not-optimal SSE performance. If you are sure your assumptions hold, then that code is fine. If you are implementing a library to be used by others - not so good.


Stop constantly assigning things to V & T - It just introduces dependency chains for no good reason

As V and T are local variables, the compiler will alias them onto other registers. The dependency chain comes from the fact that the output of one instruction is the input to another; it has nothing to do with the number of local variables.


I define a new __m128 at each step, the compiler optimizes better due to that, usually

Again, variable aliasing is Compilers 101; it's really one of those places where the compiler does very well on its own.

It does make the code much more readable, though.


VC 2012 and Clang though... it is REALLY impressive to see the compilers doing such things

So true...

Other things:

1. If you are using a CPU with AVX, use the '/arch:AVX' compiler flag. It will magically convert your SSE code to use AVX-128 instructions. AVX instructions have a non-destructive destination register, which translates to better register utilization and better performance.

2. 64-bit applications have access to more registers (16 instead of 8). The drawback is larger code size, but the extra registers more than make up for that. Compile for 64-bit if you can.

Now, before everybody starts thumbing me down - my job for the past 9 years has been to squeeze the hell out of Intel's CPUs, especially using vectorization, so I have some experience with the subject.

As a final note, there are a lot of fine details when doing performance optimizations. There's a lot of trial-and-error involved, a lot of profiling, manually inspecting the assembly, scratching your head, not understanding why things don't work as expected...

But most importantly, it's not black-and-white, there are no absolute truths. Getting to 70% of the optimal performance is easy. The performance cookbook will help you get to 90%. It's the last 10% that makes perf-opt so challenging (and so rewarding...).


As a final note, there are a lot of fine details when doing performance optimizations. There's a lot of trial-and-error involved, a lot of profiling, manually inspecting the assembly, scratching your head, not understanding why things don't work as expected...
But most importantly, it's not black-and-white, there are no absolute truths.
^^That bit. Things like __forceinline aren't wonderful and they also aren't useless -- they're both of those things! They're both terrible and good.

Taken to the extreme, it's very obviously bad -- imagine force-inlining every function in the code-base, so all the code ends up in main... Or the other extreme, where you tell the compiler to never inline anything, no matter how much of a good idea it is...

Force-inlining is specifically an optimization where you're overriding the compiler and saying you'll do a better job than it does. If you're ever in that situation, you're basically ASM programming, which is a lot of effort. So to make sure that effort is worthwhile, you'd better get scientific and do a damn lot of in-depth profiling with real loads.

If you're doing multi-platform stuff, you also have to repeat this work over each platform. e.g. Maybe you don't blow out your x86 I$, but you do over-fill your PPC/ARM/SPU I$...

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?

I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense
compared to the size of the type (I wouldn't pass a char by reference)

Pass vectors by value, not reference


Hi Rob,

Why do you recommend passing vectors by value rather than reference?
I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense
compared to the size of the type (I wouldn't pass a char by reference)

Native datatypes should be passed by value (i.e. float, double, char, int, short, long, size_t). Consider __m128 as a native data type. In Win32, up to 3 __m128 parameters will be passed through xmm registers. This post is also a good read.
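The difference in signatures can be sketched like this (hypothetical functions, not from the thread; whether the reference version actually goes through memory depends on inlining and the calling convention in use):

```cpp
#include <xmmintrin.h>

// With by-value __m128 parameters the arguments can travel in XMM registers.
__m128 add_by_value(__m128 a, __m128 b)             // a, b arrive in registers
{
    return _mm_add_ps(a, b);
}

// A reference parameter may force a round-trip through memory when the
// call is not inlined, since a reference is an address.
__m128 add_by_ref(const __m128& a, const __m128& b)
{
    return _mm_add_ps(a, b);
}
```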


Well, there are many good, sensible points here. And I've heard all of them before. And they are not true...

...

That's true. But so does loop unrolling. That can't be the basis for dismissing __forceinline; it's just another factor to take into consideration. I've never encountered any I$ problems, and I've put __forceinline on 40-line functions more than once.

...


Not always true, not even with the new, much-improved VS12 and Intel Compiler 13. Compilers are not, and probably never will be, perfect, especially when they need to support a wide variety of CPUs. Sometimes you do know better, so why not help the compiler?

That was kind of my point, and why I said it wasn't good *blanket* advice -- you can't just say use force-inline or don't. Use it when it works, absolutely, but the default should really be not to. The general audience here aren't professionals with multiple years of optimization experience. Not half of them even know the tools they'd need to investigate performance, and fewer still know the proper procedures for doing so. Most would, at best, bang up a tight loop that multiplies a bunch of vectors or matrices together, time the competing implementations, pick the winner, and never look back afterwards. A good portion of the audience here, being the youngest and least experienced, are simply content to follow hearsay and urban legend as their optimization strategy. In short, you really have to be careful around here not to make statements that could appear absolute to less-experienced folks, because that will do more harm than good. It's better to prescribe the 90% rule, and then identify when a programmer might want to deviate from it.

Anyways, we're always glad to have someone with your experience around here (it certainly far exceeds my own), and I'm sure there'll be many more interesting conversations about optimization to come. If you'd ever care to go into depth about optimization work, consider writing an article for the site.

throw table_exception("(╯°□°)╯︵ ┻━┻");

This topic is closed to new replies.
