# How to profile SIMD

13 replies to this topic

### #1imoogiBG  Members   -  Reputation: 776

Posted 23 October 2013 - 03:30 PM

Actually the topic name should be "How to profile code".

I am writing a SIMD math library and have two implementations: SSE and scalar.

I'm not sure how to measure the code speed. Currently I'm not using optimization, and no debug symbols are generated for profiling.

I'm creating a loop that repeats the operation...

The compiler is cl.

I'm expecting the SSE dot product to be slower than the scalar version?

But the cross product is also slower!?

SGE_FORCE_INLINE SGVector vec3_cross(const SGVector& a, const SGVector& b)
{
#if defined(SGE_MATH_USE_SSE)
__m128 T = _mm_shuffle_ps(a.m_M128, a.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)
__m128 V = _mm_shuffle_ps(b.m_M128, b.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)

//i(ay*bz - by*az)  + j(bx*az - ax*bz)  + k(ax*by - bx*ay)
T = _mm_mul_ps(T, b.m_M128);//bx * ay, by * az, bz * ax
V = _mm_mul_ps(V, a.m_M128);//ax * by, ay * bz, az * bx
V = _mm_sub_ps(V, T);

V = _mm_shuffle_ps(V, V, SGE_SIMD_SHUFFLE(1, 2, 0, 3));
return SGVector(V);
#else
const float x = (a.y*b.z) - (b.y*a.z);
const float y = (b.x*a.z) - (a.x*b.z);
const float z = (a.x*b.y) - (b.x*a.y);

return SGVector(x, y, z, 0.f);
#endif
}

where SGVector is a struct containing union{ struct {float x,y,z;}; float arr[4]; __m128 m_M128; } (maybe that is the problem?!)

EDIT: maybe __forceinline is involved too!? I will remove it.

Edited by imoogiBG, 23 October 2013 - 03:33 PM.

### #2N.I.B.  Members   -  Reputation: 1022

Posted 23 October 2013 - 04:27 PM

Currently I'm not using optimization

Ah? If you want to measure performance you need to compile with optimizations, at least /O2; otherwise nearly every operation will go through memory.

The best tool for profiling is Intel's VTune, which has a 30-day free trial. Visual Studio also comes with a profiler you can use. If you want to profile yourself, you can use the timestamp counter (the rdtsc instruction; Visual Studio has an intrinsic for it, __rdtsc).
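For a do-it-yourself measurement, a minimal sketch might look like the following. It uses std::chrono::steady_clock instead of the timestamp counter so that it stays portable; the names Vec3, cross_scalar and time_per_call_ns are hypothetical and not part of the library discussed in this thread. The result of each call is fed back in as input and finally written to a volatile sink, so an optimizing build (e.g. /O2) cannot throw the loop away.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical plain-float vector, used only for this sketch.
struct Vec3 { float x, y, z; };

inline Vec3 cross_scalar(const Vec3& a, const Vec3& b)
{
    return Vec3{ a.y * b.z - b.y * a.z,
                 a.z * b.x - b.z * a.x,
                 a.x * b.y - b.x * a.y };
}

// Writes to a volatile object are observable, so the optimizer
// must keep the computation that produces the stored value.
volatile float g_sink = 0.0f;

// Runs `iterations` dependent cross products and returns the
// average wall-clock time per call in nanoseconds.
double time_per_call_ns(int iterations)
{
    assert(iterations > 0);

    Vec3 a{ 1.0f, 2.0f, 3.0f };
    const Vec3 b{ 0.5f, 0.5f, 0.5f };

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        // Each iteration depends on the previous result.
        a = cross_scalar(a, b);
        a.x += 1.0f; // keeps the values from collapsing to zero
    }
    const auto stop = std::chrono::steady_clock::now();

    g_sink = a.x + a.y + a.z;
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}
```

Calling time_per_call_ns(10000000) a few times and taking the lowest value gives a rough per-call figure. Because every iteration waits on the previous one, this measures latency in isolation rather than how the function behaves inside a real application.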

__forceinline is very good for performance, I use it almost everywhere in performance critical code.

The problem is probably the shuffles. Not sure what CPU you are using, but shuffle performance is limited. You have 3 shuffles and 3 math instructions, which is not a good ratio. The SSE math instructions have data dependency on the shuffles, which will stall the CPU pipeline.

Also, your inputs and outputs are structs. Even though you pass them by reference, both shuffles and return values will probably go through memory (you need to look at the generated assembly to be sure).

To maximize SSE performance you need to design your whole code around it. Have long SSE math sequences, and reduce the number of memory accesses and shuffles.

BTW, you can tell CL to use scalar SSE for floating-point instructions by using /arch:SSE. Check this.

Edited by satanir, 23 October 2013 - 04:34 PM.

### #3Ravyne  Crossbones+   -  Reputation: 5757

Posted 23 October 2013 - 05:54 PM

__forceinline is very good for performance, I use it almost everywhere in performance critical code.

That's probably not great blanket advice. __forceinline is a blunt instrument and causes the function to be inlined, even when it's not called from a performance-critical section of code, which bloats your executable, and can blow out your code locality, instruction-cache, and the very small cache of decoded instructions too, I think. In general, the compiler is in a much better position than you to determine whether to inline a particular call-site or not. That's precisely why regular 'inline' started as an imperative command (which __forceinline is today) but became just a hint to the compiler -- programmer-specified inline was hurting performance the majority of the time.

Just use regular inline if you want -- of course, with sufficient optimizations enabled, the compiler might inline a function at its own discretion, regardless of whether you marked it inline or not -- but don't use __forceinline unless you're really sure it'll help, you can back that up with hard data, and you verify results afterwards. To verify, micro-benchmarks (say, a single tight loop that multiplies a large number of matrices) are insufficient in most cases because they won't suffer the ill effects of __forceinline.

### #4RobTheBloke  Crossbones+   -  Reputation: 2231

Posted 23 October 2013 - 06:32 PM

Don't profile code, profile apps. Pass vectors by value, not reference (although that only works for the first 3 or so args, the rest will need to be by reference). Forceinline is just stupid. Stop using it. Stop constantly assigning things to V & T - It just introduces dependency chains for no good reason. Don't fear the use of multiple variables - they'll just end up as registers instead...

const float x = (a.y*b.z) - (b.y*a.z);
const float y = (a.z*b.x) - (b.z*a.x);
const float z = (a.x*b.y) - (b.x*a.y);
const float w = (0 * 0) - (0 * 0);


4 shuffles, 2 mults, 1 sub = obvious, and better! ;)

Edited by RobTheBloke, 23 October 2013 - 06:37 PM.

### #5AllEightUp  Moderators   -  Reputation: 3896

Posted 23 October 2013 - 11:06 PM

I'd have to say there are a couple of good and bad suggestions here. Rob has the best one: don't profile a function, profile it within a real application. As for the suggestion to use force inline: don't. You are overriding the compiler's decision making, which is a bad idea given that recent compilers generally know better than you do when it makes sense to inline. (This is also why compilers treat plain inline as no more than a hint.) Timing a single function by calling it repeatedly only tells you the absolute minimum number of cycles that set of instructions can execute in. I can work that out without a computer, using the Intel Intrinsics Guide and about 5 minutes of looking up the instructions, so it is not very useful. With SIMD you have instruction latencies (other operations have latencies also, just not usually as severe) and the compilers are very good at interleaving other operations with the SIMD code to hide those latencies. My cross product is actually:

static const int	swizzle1	=	_MM_SHUFFLE( 3, 0, 2, 1 );
static const int	swizzle2	=	_MM_SHUFFLE( 3, 1, 0, 2 );

__m128	v1	=	_mm_shuffle_ps( lhs, lhs, swizzle1 );
__m128	v2	=	_mm_shuffle_ps( rhs, rhs, swizzle2 );
__m128	v3	=	_mm_shuffle_ps( lhs, lhs, swizzle2 );
__m128	v4	=	_mm_shuffle_ps( rhs, rhs, swizzle1 );

__m128	p1	=	_mm_mul_ps( v1, v2 );
__m128	p2	=	_mm_mul_ps( v3, v4 );

__m128	result	=	_mm_sub_ps( p1, p2 );

Yup, one more shuffle in that code, but both in latency counting and in real code, this is the better solution for the compiler to optimize with given that the primary stall is between the last shuffle and the first multiply. The compiler can easily find ALU/FPU operations from other code around this inlined function to hide the single big stall and only needs a couple more instructions to hide the smaller stalls between the muls and the sub and of course actually "using" the result is delayed a bit.
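As a rough, self-contained version of the snippet above, the sketch below wraps it in a function that takes its arguments by value and adds a scalar cross-check. The names cross_sse and cross_matches_scalar are hypothetical; only the shuffle/multiply/subtract sequence comes from the post.

```cpp
#include <xmmintrin.h> // SSE intrinsics
#include <cassert>
#include <cmath>

// Cross product of the xyz lanes. The w lane ends up as lw*rw - lw*rw,
// which is 0 for finite inputs.
inline __m128 cross_sse(__m128 lhs, __m128 rhs)
{
    const __m128 l_yzx = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 r_zxy = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 l_zxy = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 r_yzx = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 0, 2, 1));

    const __m128 p1 = _mm_mul_ps(l_yzx, r_zxy);
    const __m128 p2 = _mm_mul_ps(l_zxy, r_yzx);
    return _mm_sub_ps(p1, p2);
}

// Returns true when the SSE result matches a scalar reference within eps.
inline bool cross_matches_scalar(float ax, float ay, float az,
                                 float bx, float by, float bz,
                                 float eps = 1e-5f)
{
    assert(eps >= 0.0f);

    // _mm_set_ps takes its arguments in w, z, y, x order.
    const __m128 r = cross_sse(_mm_set_ps(0.0f, az, ay, ax),
                               _mm_set_ps(0.0f, bz, by, bx));
    float out[4];
    _mm_storeu_ps(out, r);

    const float x = ay * bz - az * by;
    const float y = az * bx - ax * bz;
    const float z = ax * by - ay * bx;

    return std::fabs(out[0] - x) <= eps &&
           std::fabs(out[1] - y) <= eps &&
           std::fabs(out[2] - z) <= eps;
}
```

A check like cross_matches_scalar(1, 2, 3, 4, 5, 6) only confirms correctness; whether the four-shuffle form is actually faster still has to be measured in the surrounding application code.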

There are some give and takes involved of course. It is best if the compiler can find instructions to insert between each of the shuffles since on P3 machines you could only issue a shuffle every other cycle and then it took 6 cycles before the resulting register was usable. Of course on any P4 and later it is generally a 1 throughput so issuing them one after another is not a bad thing.

Overall though, the key to remember is that when you hand optimize with intrinsics (NOT with inline assembly etc) what you are doing is acting as an extension of the compiler's optimizer. Those things which are too tricky/complex (dot/cross/etc) for the compiler's optimizer to auto vectorize, you do. But you want to write your optimization in such a manner that the compiler can then use it in combination with the optimizations it can make, which includes inserting other instructions to hide latencies, reordering the register usages (note in the above, I define a new __m128 at each step, the compiler optimizes better due to that, usually) and a whole slew of other things most modern compilers are doing for you all the time.

The last note: ignore everything above if you are using VC 2008 or prior, my cat can produce better code than that POS. VC 2010 and Clang are the minimums; GCC is 50/50 on whether it does things well, though supposedly it is getting better. VC 2012 and Clang though, both do things like unrolling tight loops of the above cross product and hiding pretty much all the latencies; it is REALLY impressive to see the compilers doing such things.

### #6N.I.B.  Members   -  Reputation: 1022

Posted 24 October 2013 - 01:06 AM

Well, there are many good, sensible points here. And I've heard all of them before. And they are not true...

__forceinline ...can blow out your code locality, instruction-cache, and the very small cache of decoded instructions too

That's true. But so does loop unrolling. That can't be the basis for dismissing __forceinline, it's just another factor to take into consideration. I've never encountered any I$ problems, and I put __forceinline on 40-lines-of-code functions more than once.

the compiler is in a much better position than you to determine whether to inline a particular call-site or not

Not always true, not even with the new, much improved VS12 and Intel Compiler 13. Compilers are not, and probably will never be perfect, especially when they need to support a wide variety of CPUs. Sometimes you do know better, so why not help the compiler?

Forceinline is just stupid

Hmmm... That stupid thing gave me very nice performance improvements.

static const int swizzle1 = _MM_SHUFFLE( 3, 0, 2, 1 );
static const int swizzle2 = _MM_SHUFFLE( 3, 1, 0, 2 );

__m128 v1 = _mm_shuffle_ps( lhs, lhs, swizzle1 );
__m128 v2 = _mm_shuffle_ps( rhs, rhs, swizzle2 );
__m128 v3 = _mm_shuffle_ps( lhs, lhs, swizzle2 );
__m128 v4 = _mm_shuffle_ps( rhs, rhs, swizzle1 );

__m128 p1 = _mm_mul_ps( v1, v2 );
__m128 p2 = _mm_mul_ps( v3, v4 );

__m128 result = _mm_sub_ps( p1, p2 );

That's even worse than the original post. You are assuming that this code will be used sparsely, and that the output of this sequence will not be needed immediately. Relying too much on the compiler and CPU OOO is one of the main reasons for not-optimal SSE performance. If you are sure your assumptions are true, then that code is fine. If you are implementing a library to be used by others - not so good.

Stop constantly assigning things to V & T - It just introduces dependency chains for no good reason

As V&T are local variables, the compiler will alias them using other registers. The dependency chain comes from the fact that the output of one instruction is the input to another instruction; it has nothing to do with the number of local variables.

I define a new __m128 at each step, the compiler optimizes better due to that, usually

Again, variable aliasing is from Compiler101, it's really one of those places where the compiler does very well on its own. It does make the code much more readable, though.

VC 2012 and Clang though... it is REALLY impressive to see the compilers doing such things

So true...

Other things:

1. If you are using a CPU with AVX, use the '/arch:AVX' compiler flag. It will magically convert your SSE code to use AVX128 instructions. AVX instructions have a non-destructive destination register, which translates to better register utilization and better performance.
2. 64-bit applications have access to more registers (16 instead of 8). The drawback is larger code size, but the extra registers more than make up for that. Compile to 64-bit if you can.

Now, before everybody starts thumbing me down - my job for the past 9 years was to squeeze the hell out of Intel's CPUs, especially using vectorization, so I have some experience with the subject.

As a final note, there are a lot of fine details when doing performance optimizations. There's a lot of trial-and-error involved, a lot of profiling, manually inspecting the assembly, scratching your head not understanding why things don't work as expected... But most importantly, it's not black-and-white, there are no absolute truths. Getting to 70% of the optimal performance is easy. The performance cookbook will help you get to 90%. It's the last 10% that makes perf-opt so challenging (and so rewarding...).

Edited by satanir, 24 October 2013 - 01:11 AM.

### #7Hodgman  Moderators   -  Reputation: 24321

Posted 24 October 2013 - 01:44 AM

As a final note, there are a lot of fine details when doing performance optimizations. There's a lot of trial-and-error involved, a lot of profiling, manually inspecting the assembly, scratching your head not understanding why things don't work as expected...
But most importantly, it's not black-and-white, there are no absolute truths.

^^That bit.

Things like __forceinline aren't wonderful and they also aren't useless -- they're both of those things! They're both terrible and good.

Taken to the extreme, it's very obviously bad -- imagine force-inlining every function in the code-base, so all the code ends up in main... Or the other extreme, where you tell the compiler to never inline anything, no matter how much of a good idea it is...

Force-inlining is specifically an optimization where you're overriding the compiler, and saying you'll do a better job than it does. If you're ever in that situation, you're basically ASM programming, which is a lot of effort. So to make sure that effort is worthwhile, you better get scientific and do a damn lot of in-depth profiling with real loads.

If you're doing multi-platform stuff, you also have to repeat this work over each platform. e.g. Maybe you don't blow out your x86 I$, but you do over-fill your PPC/ARM/SPU I$...

### #8SuperVGA  Members   -  Reputation: 1118

Posted 24 October 2013 - 05:11 AM

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?

I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense compared to the size of the type (I wouldn't pass a char by reference)

### #9Matias Goldberg  Crossbones+   -  Reputation: 2765

Posted 24 October 2013 - 08:45 AM

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?

I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense compared to the size of the type (I wouldn't pass a char by reference)

Native datatypes should be passed by value (e.g. float, double, char, int, short, long, size_t). Consider __m128 as a native data type.

In Win32, up to 3 __m128 parameters will be passed through xmm registers.

This post is also a good read.

Twitter: @matiasgoldberg

### #10Ravyne  Crossbones+   -  Reputation: 5757

Posted 24 October 2013 - 01:04 PM

Well, there are many good, sensible points here. And I've heard all of them before. And they are not true...

...

That's true. But so does loop unrolling. That can't be the basis for dismissing __forceinline, it's just another factor to take into consideration. I've never encountered any I$ problems, and I put __forceinline on a 40-lines-of-code-functions more than once.

...

Not always true, not even with the new, much improved VS12 and Intel Compiler 13. Compilers are not, and probably will never be perfect, especially when they need to support a wide variety of CPUs. Sometime you do know better, so why not help the compiler?

That was kind of my point, and why I said it wasn't good *blanket* advice -- you can't just say use force-inline or don't. Use it when it works, absolutely, but the default should really be to not. The general audience here aren't professionals with multiple years of optimization experience. Not half of them even know the tools they'd need to investigate performance, and fewer still know proper procedures for doing so. Most would, at best, bang up a tight loop that multiplies a bunch of vectors or matrices together, time the competing implementations, pick the winner and never look back afterwards. A good portion of the audience here, being the youngest and least experienced, are simply content to follow hearsay and urban legend as their optimization strategy. In short, you really have to be careful around here to not make statements that could appear as absolute to less-experienced folks because that will do more harm than good. It's better to prescribe the 90% rule, and then identify when a programmer might want to deviate from it.

Anyways, we're always glad to have someone with your experience around here (it certainly far exceeds my own) and I'm sure there'll be many more interesting conversations about optimization to come. If you'd ever care to go into depth about optimization work, consider writing an article for the site

Edited by Ravyne, 24 October 2013 - 01:06 PM.

### #11N.I.B.  Members   -  Reputation: 1022

Posted 24 October 2013 - 02:02 PM

If you'd ever care to go into depth about optimization work, consider writing an article for the site

Maybe I will

### #12imoogiBG  Members   -  Reputation: 776

Posted 24 October 2013 - 04:11 PM

Guys, I was looking at the ASM and this is what I've got:

//pure _128
__m128 a, b , c;
c = _mm_set_ps1(f);
movss       xmm1,dword ptr [esp+0Ch]
b = _mm_set_ps1(ff);
movss       xmm0,dword ptr [esp+10h]
shufps      xmm0,xmm0,0
shufps      xmm1,xmm1,0
a = _mm_add_ps(c, b);

/////////////////////////////////////////////////////
//__m128 as a member of a struct
SGVector v, v2, d;
d.m_M128 = _mm_set_ps1(f);
movss       xmm1,dword ptr [esp+0Ch]
v2.m_M128 = _mm_set_ps1(ff);
movss       xmm0,dword ptr [esp+10h]
shufps      xmm0,xmm0,0
shufps      xmm1,xmm1,0
v.m_M128 = _mm_add_ps(d.m_M128, v2.m_M128);

/////////////////////////////////////////////////////
//__m128 as a member of a struct. Calling a custom function
SGVector v, v2, d;
d.m_M128 = _mm_set_ps1(f);
movss       xmm1,dword ptr [esp+0Ch]
v2.m_M128 = _mm_set_ps1(ff);
movss       xmm0,dword ptr [esp+10h]
shufps      xmm0,xmm0,0
shufps      xmm1,xmm1,0

////////////////////////////////////////////
//retval is SGVector

SGVector v, v2, d;
d.m_M128 = _mm_set_ps1(f);
movss       xmm1,dword ptr [esp+0Ch]
v2.m_M128 = _mm_set_ps1(ff);
movss       xmm0,dword ptr [esp+10h]
shufps      xmm0,xmm0,0
shufps      xmm1,xmm1,0
v = vec3_add(d.m_M128, v2.m_M128);
SGE_FORCE_INLINE void vec3_add2(const SGVector& a, const SGVector& b, SGVector& c)
{
#if defined(SGE_MATH_USE_SSE)
c.m_M128 = _mm_add_ps(a.m_M128, b.m_M128);
#endif
}
Refs, retvals do not change anything. For add, of course. Tomorrow I will try the cross product.

(CL, x86, /O2)

PS:
shufps      xmm0,xmm0,0

shufps      xmm1,xmm1,0

Why is shufps needed?

PS 2:

SuperVGA, on 24 Oct 2013 - 2:11 PM, said:

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?

I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense
compared to the size of the type (I wouldn't pass a char by reference)

If the function is inlined then the refs do not change anything.

Edited by imoogiBG, 24 October 2013 - 04:21 PM.

### #13King Mir  Members   -  Reputation: 1818

Posted 24 October 2013 - 11:24 PM

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?
I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense
compared to the size of the type (I wouldn't pass a char by reference)

As of C++11, the new rule of thumb is to pass vectors and other objects like it by value when the operation performed on them requires a copy, otherwise pass by const ref. The rule for passing by reference is still the same. The reason is that this can enable move semantics when the arguments passed are rvalues. With virtual functions, the rule becomes when the operation logically needs a copy (even though every overload may not).

EDIT: that's std::vector. SIMD vectors that your architecture natively supports are basically primitive types since they can fit into a single register by definition.
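The std::vector rule of thumb can be sketched as follows. The Mesh class and its member names are hypothetical, made up only to show one function that needs its own copy of the argument and one that merely reads it.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical class, used only to illustrate the rule of thumb.
class Mesh
{
public:
    // The setter keeps its own copy, so it takes the argument by value
    // and moves it into place. An rvalue argument is moved all the way;
    // an lvalue argument is copied once and then moved.
    void set_indices(std::vector<int> indices) { m_indices = std::move(indices); }

    // Read-only access needs no copy, so const reference is still preferred.
    static long sum(const std::vector<int>& values)
    {
        long total = 0;
        for (int v : values) total += v;
        return total;
    }

    int index_at(std::size_t i) const
    {
        assert(i < m_indices.size());
        return m_indices[i];
    }

    std::size_t index_count() const { return m_indices.size(); }

private:
    std::vector<int> m_indices;
};
```

With this shape, mesh.set_indices(std::vector<int>{7, 8}) moves the temporary instead of copying it, while mesh.set_indices(existing) leaves the caller's vector untouched. None of this applies to __m128, which is better treated like a float.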

Edited by King Mir, 25 October 2013 - 12:38 AM.

### #14N.I.B.  Members   -  Reputation: 1022

Posted 25 October 2013 - 12:27 AM

shufps xmm0,xmm0,0

shufps xmm1,xmm1,0

Why is shufps needed?

That's what _mm_set_ps1() does - it broadcasts a single float value into all of the __m128 components. It has to be movss and shufps.
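To make the movss + shufps pair concrete, here is a small sketch that writes the broadcast out by hand with plain SSE intrinsics and compares it with _mm_set_ps1. The helper names broadcast_manual, all_lanes_equal and check_broadcast are hypothetical.

```cpp
#include <xmmintrin.h> // SSE intrinsics
#include <cassert>

// Broadcast written out by hand: put the scalar in lane 0 (the movss step),
// then copy lane 0 into every lane (shufps with an immediate of 0).
inline __m128 broadcast_manual(float f)
{
    const __m128 lane0 = _mm_set_ss(f);      // (f, 0, 0, 0)
    return _mm_shuffle_ps(lane0, lane0, 0);  // (f, f, f, f)
}

// Returns true when all four lanes of v compare equal to f.
inline bool all_lanes_equal(__m128 v, float f)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] == f && out[1] == f && out[2] == f && out[3] == f;
}

// Both forms should produce the same four lanes.
inline void check_broadcast(float f)
{
    assert(all_lanes_equal(_mm_set_ps1(f), f));
    assert(all_lanes_equal(broadcast_manual(f), f));
}
```

The shuffle immediate 0 selects source lane 0 for every destination lane, which is why the listing shows shufps xmmN,xmmN,0 right after each movss.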

Refs, retvals do not change anything

This is because of the inlining. You usually don't need to use references for __m128. I use references for __m128 only when passing more than 3 vectors to a function - no calling convention supports that, and due to stack alignment you can't pass a vector on the stack.
