# How to profile SIMD


## Recommended Posts

Actually the topic name should be "How to profile code".

Well, I am writing a SIMD math library. I have two implementations: SSE and scalar.

I'm not sure how to measure the code's speed. Currently I'm not using optimizations, and no debug symbols are generated for profiling.

I'm creating a loop that repeats the operation...

The compiler is cl (MSVC).

I was expecting the SSE dot product to be slower than the scalar version.

But the cross product is also slower!?

SGE_FORCE_INLINE SGVector vec3_cross(const SGVector& a, const SGVector& b)
{
#if defined(SGE_MATH_USE_SSE)
    __m128 T = _mm_shuffle_ps(a.m_M128, a.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); // (Y Z X 0)
    __m128 V = _mm_shuffle_ps(b.m_M128, b.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); // (Y Z X 0)

    // i(ay*bz - by*az) + j(bx*az - ax*bz) + k(ax*by - bx*ay)
    T = _mm_mul_ps(T, b.m_M128); // bx*ay, by*az, bz*ax
    V = _mm_mul_ps(V, a.m_M128); // ax*by, ay*bz, az*bx
    V = _mm_sub_ps(V, T);

    V = _mm_shuffle_ps(V, V, SGE_SIMD_SHUFFLE(1, 2, 0, 3));
    return SGVector(V);
#else
    const float x = (a.y*b.z) - (b.y*a.z);
    const float y = (b.x*a.z) - (a.x*b.z);
    const float z = (a.x*b.y) - (b.x*a.y); // was (b.x,a.y) - comma typo, not a multiply

    return SGVector(x, y, z, 0.f);
#endif
}

where SGVector is a struct with union { struct { float x, y, z; }; float arr[4]; __m128 m_M128; }. (Maybe that is the problem?!)

EDIT: maybe __forceinline is involved too!? I will remove it.

Edited by imoogiBG

##### Share on other sites

Currently I'm not using optimization

Ah? If you want to measure performance you need to compile with optimizations, at least /O2; otherwise every intermediate value will go through memory.

The best tool for profiling is Intel's VTune; they have a 30-day free trial. Visual Studio comes with a profiler you can use as well. If you want to profile yourself, you can use the timestamp counter (the rdtsc instruction; Visual Studio has an intrinsic for it).

__forceinline is very good for performance, I use it almost everywhere in performance critical code.

The problem is probably the shuffles. Not sure what CPU you are using, but shuffle performance is limited. You have 3 shuffles and 3 math instructions, which is not a good ratio. The SSE math instructions have data dependency on the shuffles, which will stall the CPU pipeline.

Also, your inputs and outputs are structs. Even though you pass them by reference, both the inputs and the return value will probably go through memory (you'd need to look at the generated assembly to be sure).

To maximize SSE performance you need to design your whole code around it: have long SSE math sequences, and reduce the number of memory accesses and shuffles.

BTW, you can tell CL to use scalar SSE for floating-point instructions by using /arch:SSE. Check this.

Edited by satanir

##### Share on other sites
I'd have to say there are a couple of good and bad suggestions here. Rob has the best one: don't profile a function in isolation, profile it within a real application.

As for the other suggestion about using force inline: don't. You are overriding the compiler's decision making, and recent compilers are generally more knowledgeable than you are about when inlining makes sense. (The `inline` keyword in general is deprecated as an optimization hint and largely ignored for this reason.)

Calling a single function repeatedly to get a timing only tells you the absolute minimum number of cycles the instruction sequence can execute in. I could get that without a computer, using the Intel Intrinsics Guide and about 5 minutes of looking up the instructions, so it is not very useful. With SIMD you have instruction latencies (other operations have latencies too, just usually less severe), and compilers are very good at interleaving other operations with the SIMD code to hide those latencies. My cross product is actually:

static const int	swizzle1	=	_MM_SHUFFLE( 3, 0, 2, 1 );
static const int	swizzle2	=	_MM_SHUFFLE( 3, 1, 0, 2 );

__m128	v1	=	_mm_shuffle_ps( lhs, lhs, swizzle1 );
__m128	v2	=	_mm_shuffle_ps( rhs, rhs, swizzle2 );
__m128	v3	=	_mm_shuffle_ps( lhs, lhs, swizzle2 );
__m128	v4	=	_mm_shuffle_ps( rhs, rhs, swizzle1 );

__m128	p1	=	_mm_mul_ps( v1, v2 );
__m128	p2	=	_mm_mul_ps( v3, v4 );

__m128	result	=	_mm_sub_ps( p1, p2 );

Yup, one more shuffle in that code, but both by latency counting and in real code this is the better version for the compiler to optimize, given that the primary stall is between the last shuffle and the first multiply. The compiler can easily find ALU/FPU operations from the code around this inlined function to hide the single big stall, and it only needs a couple more instructions to hide the smaller stalls between the muls and the sub; actually "using" the result is delayed a bit as well.

There is some give and take involved, of course. It is best if the compiler can find instructions to insert between each of the shuffles: on P3-class machines you could only issue a shuffle every other cycle, and it then took 6 cycles before the resulting register was usable. On any P4 and later, shuffles generally have a throughput of 1, so issuing them back to back is not a bad thing.

Overall though, the key to remember is that when you hand optimize with intrinsics (NOT with inline assembly etc) what you are doing is acting as an extension of the compilers optimizer. Those things which are too tricky/complex (dot/cross/etc) for the compilers optimizer to auto vectorize you do. But, you want to write your optimization in such a manner that the compiler can then use it in combination with the optimizations it can make which includes inserting other instructions to hide latencies, reordering the register usages (note in the above, I define a new __m128 at each step, the compiler optimizes better due to that, usually) and a whole slew of other things most modern compilers are doing for you all the time.

The last note: ignore everything above if you are using VC 2008 or earlier; my cat can produce better code than that POS. VC 2010 and Clang are the minimums; GCC is 50/50 on whether it does things well, though supposedly it is getting better. VC 2012 and Clang, though, both do things like unrolling tight loops of the above cross product and hiding pretty much all the latencies. It is REALLY impressive to see compilers doing such things.

##### Share on other sites

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?

I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense
compared to the size of the type (I wouldn't pass a char by reference)

##### Share on other sites

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?
I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense
compared to the size of the type (I wouldn't pass a char by reference)

Native datatypes should be passed by value (i.e. float, double, char, int, short, long, size_t). Consider __m128 as a native data type. In Win32, up to 3 __m128 parameters will be passed through xmm registers. This post is also a good read.

##### Share on other sites

Well, there are many good, sensible points here. And I've heard all of them before. And they are not true...

...

That's true. But so does loop unrolling. That can't be the basis for dismissing __forceinline; it's just another factor to take into consideration. I've never encountered any I$ (instruction-cache) problems, and I've put __forceinline on 40-line functions more than once.

...

Not always true, not even with the new, much improved VS12 and Intel Compiler 13. Compilers are not, and probably never will be, perfect, especially when they need to support a wide variety of CPUs. Sometimes you do know better, so why not help the compiler?

That was kind of my point, and why I said it wasn't good *blanket* advice: you can't say "just use force-inline" or "don't". Use it when it works, absolutely, but the default should really be not to. The general audience here aren't professionals with multiple years of optimization experience. Half of them don't even know the tools they'd need to investigate performance, and fewer still know proper procedures for doing so. Most would, at best, bang up a tight loop that multiplies a bunch of vectors or matrices together, time the competing implementations, pick the winner, and never look back. A good portion of the audience here, being the youngest and least experienced, are simply content to follow hearsay and urban legend as their optimization strategy. In short, you really have to be careful around here not to make statements that could appear absolute to less-experienced folks, because that will do more harm than good. It's better to prescribe the 90% rule, and then identify when a programmer might want to deviate from it.

Anyway, we're always glad to have someone with your experience around here (it certainly far exceeds my own), and I'm sure there'll be many more interesting conversations about optimization to come. If you'd ever care to go into depth about optimization work, consider writing an article for the site.

Edited by Ravyne

##### Share on other sites

If you'd ever care to go into depth about optimization work, consider writing an article for the site

Maybe I will

##### Share on other sites

Guys, I was looking at the ASM and this is what I've got:

// pure __m128
__m128 a, b , c;
c = _mm_set_ps1(f);
movss       xmm1,dword ptr [esp+0Ch]
b = _mm_set_ps1(ff);
movss       xmm0,dword ptr [esp+10h]
shufps      xmm0,xmm0,0
shufps      xmm1,xmm1,0

/////////////////////////////////////////////////////
//__m128 as a member of a struct
SGVector v, v2, d;
d.m_M128 = _mm_set_ps1(f);
movss       xmm1,dword ptr [esp+0Ch]
v2.m_M128 = _mm_set_ps1(ff);
movss       xmm0,dword ptr [esp+10h]
shufps      xmm0,xmm0,0
shufps      xmm1,xmm1,0

/////////////////////////////////////////////////////
// __m128 as a member of a struct, calling a custom function
SGVector v, v2, d;
d.m_M128 = _mm_set_ps1(f);
movss       xmm1,dword ptr [esp+0Ch]
v2.m_M128 = _mm_set_ps1(ff);
movss       xmm0,dword ptr [esp+10h]
shufps      xmm0,xmm0,0
shufps      xmm1,xmm1,0

////////////////////////////////////////////
//retval is SGVector

SGVector v, v2, d;
d.m_M128 = _mm_set_ps1(f);
movss       xmm1,dword ptr [esp+0Ch]
v2.m_M128 = _mm_set_ps1(ff);
movss       xmm0,dword ptr [esp+10h]
shufps      xmm0,xmm0,0
shufps      xmm1,xmm1,0
SGE_FORCE_INLINE void vec3_add2(const SGVector& a, const SGVector& b, SGVector& c)
{
#if defined(SGE_MATH_USE_SSE)
    c.m_M128 = _mm_add_ps(a.m_M128, b.m_M128);
#endif
}
Refs and retvals do not change anything (for add, of course). Tomorrow I will try the cross product.

Compiler: CL, x86, /O2

PS:
shufps      xmm0,xmm0,0

shufps      xmm1,xmm1,0

Why is shufps needed?

PS 2:

SuperVGA, on 24 Oct 2013 - 2:11 PM, said:

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?

I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense
compared to the size of the type (I wouldn't pass a char by reference)

If the function is inlined, then the refs do not change anything.

Edited by imoogiBG

##### Share on other sites

Pass vectors by value, not reference

Hi Rob,

Why do you recommend passing vectors by value rather than reference?
I've made it a habit to use const ref parameters wherever possible, and wherever ref makes sense
compared to the size of the type (I wouldn't pass a char by reference)

As of C++11, the new rule of thumb is to pass vectors and other objects like them by value when the operation performed on them requires a copy; otherwise pass by const ref. The rule for passing by non-const reference is still the same. The reason is that passing by value can enable move semantics when the arguments are rvalues. With virtual functions, the rule becomes: pass by value when the operation logically needs a copy (even though every override may not).

EDIT: that's for std::vector. SIMD vectors that your architecture natively supports are basically primitive types, since by definition they fit into a single register. Edited by King Mir

##### Share on other sites

shufps xmm0,xmm0,0

shufps xmm1,xmm1,0

Why shufps is needed?

That's what _mm_set_ps1() does - it broadcasts a single float value into all of the __m128 components. That takes a movss followed by a shufps.

Refs, retvals do not change anything

This is because of the inlining. You usually don't need to use a reference for __m128. I use references for __m128 only when passing more than 3 vectors to a function - no calling convention supports that, and due to stack alignment you can't pass a vector on the stack.