Actually the topic name should be "How to profile code".

Well, I am writing a SIMD math library. I have two implementations: SSE and scalar.

I'm not sure how to measure the code's speed. Currently I'm not using optimization, and no debug symbols are generated for profiling.

I'm creating a loop that repeats the operation...
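A minimal sketch of that kind of timing loop, assuming std::chrono is available. Vec3, cross_scalar, time_cross_ns and g_sink are placeholder names made up for this post, not the library's real ones. The input changes every iteration and the results go into a volatile sink, so an optimizing build should not be able to throw the loop away.

```cpp
#include <cassert>
#include <chrono>

// Placeholder vector type and operation under test (not the real SGVector code).
struct Vec3 { float x, y, z; };

inline Vec3 cross_scalar(const Vec3& a, const Vec3& b) {
    return Vec3{ a.y * b.z - b.y * a.z,
                 b.x * a.z - a.x * b.z,
                 a.x * b.y - b.x * a.y };
}

// Written once after the loop so the compiler must keep the computed results.
volatile float g_sink = 0.0f;

// Returns the average time of one cross_scalar call, in nanoseconds.
double time_cross_ns(long iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals

    Vec3 a{1.0f, 2.0f, 3.0f};
    const Vec3 b{0.5f, -1.0f, 0.25f};
    float checksum = 0.0f;

    const clock::time_point start = clock::now();
    for (long i = 0; i < iterations; ++i) {
        a.x = static_cast<float>(i & 7);      // vary the input each iteration
        const Vec3 c = cross_scalar(a, b);
        checksum += c.x + c.y + c.z;          // make every result observable
    }
    const clock::time_point stop = clock::now();

    g_sink = checksum;
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling something like time_cross_ns(10000000) for each implementation, in a build with optimization turned on (/O2 with cl), should give numbers that are easier to compare than an unoptimized build, where call and inlining overhead can dominate.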

The compiler is cl (MSVC).

I'd expect the SSE dot product to be slower than the scalar version, right?

But the cross product is also slower!?

SGE_FORCE_INLINE SGVector vec3_cross(const SGVector& a, const SGVector& b)
{
#if defined(SGE_MATH_USE_SSE)
    __m128 T = _mm_shuffle_ps(a.m_M128, a.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)
    __m128 V = _mm_shuffle_ps(b.m_M128, b.m_M128, SGE_SIMD_SHUFFLE(1, 2, 0, 3)); //(Y Z X 0)

    //i(ay*bz - by*az) + j(bx*az - ax*bz) + k(ax*by - bx*ay)
    T = _mm_mul_ps(T, b.m_M128); //bx * ay, by * az, bz * ax
    V = _mm_mul_ps(V, a.m_M128); //ax * by, ay * bz, az * bx
    V = _mm_sub_ps(V, T);

    V = _mm_shuffle_ps(V, V, SGE_SIMD_SHUFFLE(1, 2, 0, 3));
    return SGVector(V);
#else
    const float x = (a.y*b.z) - (b.y*a.z);
    const float y = (b.x*a.z) - (a.x*b.z);
    const float z = (a.x*b.y) - (b.x*a.y);

    return SGVector(x, y, z, 0.f);
#endif
}

where SGVector is a struct containing union { struct { float x, y, z; }; float arr[4]; __m128 m_M128; }. (Maybe that is the problem?!)
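To make that layout concrete, here is a rough sketch of such a union. It is a reconstruction for this post under the name Vec4Sketch, not the real SGVector; the constructors, add and lane are assumptions added for illustration. The anonymous struct inside the union is a compiler extension (cl, GCC and Clang accept it), and reading one union member after writing another relies on compiler behavior rather than on the C++ standard.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_add_ps

// Hypothetical reconstruction of the described layout, not the real SGVector.
struct Vec4Sketch {
    union {
        struct { float x, y, z; };  // anonymous struct: compiler extension
        float  arr[4];              // arr[3] is the fourth lane
        __m128 m_M128;              // same 16 bytes viewed as one SSE register
    };

    // _mm_set_ps takes lanes from highest to lowest, so x ends up in lane 0.
    Vec4Sketch(float x_, float y_, float z_, float w_)
        : m_M128(_mm_set_ps(w_, z_, y_, x_)) {}
    explicit Vec4Sketch(__m128 v) : m_M128(v) {}
};

// The __m128 member fixes both the size and the alignment of the whole struct.
static_assert(sizeof(Vec4Sketch) == 16, "four packed floats");
static_assert(alignof(Vec4Sketch) == 16, "__m128 requires 16-byte alignment");

// Lane-wise addition that stays entirely in the SSE view of the data.
inline Vec4Sketch add(const Vec4Sketch& a, const Vec4Sketch& b) {
    return Vec4Sketch(_mm_add_ps(a.m_M128, b.m_M128));
}

// Reads one lane through the float-array view (union type punning).
inline float lane(const Vec4Sketch& v, int i) {
    assert(i >= 0 && i < 4);
    return v.arr[i];
}
```

With a layout like this, code that mixes the float view (x, y, z or arr) with the m_M128 view may push the compiler to move values through memory instead of keeping them in a register. That is only a possible cost and I have not measured it.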

EDIT: maybe __forceinline is involved too!? I will remove it.