While there is no inherent instruction transition cost, there are some serious gotchas when working with SSE inside a C++ project. Take the following structure:
#include <xmmintrin.h> // __m128

struct Vector3f
{
    Vector3f Cross(const Vector3f& rhs) const; // rhs is passed by reference
    __m128 mVector;
};
The signature of the Cross member shows the primary performance hit, but it is subtle. Any time I call 'Cross' I am passing a reference to a Vector3f wrapper, not the __m128 register itself. To the compiler this means that, instead of just issuing the appropriate _mm instructions, it must first store the value to memory, pass a pointer to that memory, and then load the value back from memory before it can compute the cross product. The store effectively flushes the SSE pipeline for that register, which often means waiting out the latency of several in-flight SSE instructions before the write can take place. The return has the same problem: the compiler doesn't understand that it could simply leave the value in a register for further SSE instructions to use directly, so it flushes at the point of return as well.
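To make the store, pointer pass, and reload sequence concrete, here is a minimal sketch that spells it out by hand. AddViaMemory and AddInRegisters are hypothetical helpers invented for illustration, not part of Vector3f, and a simple add stands in for the cross product. An optimizer is free to remove the explicit round trip in code this small; the point is only to show the sequence that a by-reference argument implies when the call is not inlined.

```cpp
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_store_ps, _mm_load_ps, _mm_add_ps

// Hypothetical helper: writes out by hand the memory round trip that a
// by-reference argument implies when the call is not inlined.
inline __m128 AddViaMemory(__m128 a, __m128 b)
{
    alignas(16) float spill[4];            // stack slot the value is written to
    _mm_store_ps(spill, a);                // 1. store the register to memory
    const float* passed = spill;           // 2. this pointer is what crosses the call
    __m128 reloaded = _mm_load_ps(passed); // 3. the callee loads the value back
    return _mm_add_ps(reloaded, b);
}

// Hypothetical helper: the same operation with both operands kept in registers.
inline __m128 AddInRegisters(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}
```

Both functions return the same result; the first just takes a detour through memory to get there.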
What all this means is that the cost of computing the cross product itself may be greatly reduced, but the call 'to' the Cross member is constantly flushing the SSE pipeline, which can be very costly. To fix this constant flushing you would need to rewrite the structure as something like the following:
struct Vector3f
{
    __m128 Cross(__m128 rhs) const; // rhs and the result are passed by value
    __m128 mVector;
};
This breaks encapsulation but corrects the constant pipeline flushing. The following article shows the timing differences caused by correcting this issue in a single function within a raytracer testbed:
http://www.gamedev.net/page/resources/_/technical/game-programming/practical-cross-platform-simd-math-part-2-r3101
That article shows a 4% gain simply by correcting this one item in a single function call. I eventually went and hand-optimized the entire codebase to follow pass-by-register semantics to see what would happen, and saw nearly a 50% gain with no changes other than the calling convention fixes.
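For reference, here is a rough sketch of how the by-value Cross could be implemented. The body is my own assumption, using the common three-shuffle formulation rather than anything taken from the article, and it treats lanes 0 to 2 as x, y, z with lane 3 unused. Whether rhs and the return value really stay in registers still depends on the platform's calling convention and on inlining.

```cpp
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_shuffle_ps, _mm_mul_ps, _mm_sub_ps

struct Vector3f
{
    // rhs and the result are plain __m128 values, so no Vector3f wrapper has
    // to be written to memory just to make the call.
    __m128 Cross(__m128 rhs) const
    {
        // Rotate lanes (x, y, z, w) -> (y, z, x, w).
        const __m128 a_yzx = _mm_shuffle_ps(mVector, mVector, _MM_SHUFFLE(3, 0, 2, 1));
        const __m128 b_yzx = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 0, 2, 1));

        // c = a * b.yzx - a.yzx * b, which holds (cross.z, cross.x, cross.y, 0).
        const __m128 c = _mm_sub_ps(_mm_mul_ps(mVector, b_yzx),
                                    _mm_mul_ps(a_yzx, rhs));

        // Rotate once more to get (cross.x, cross.y, cross.z, 0).
        return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
    }

    __m128 mVector;
};
```

A caller would then chain results directly, for example `__m128 n = a.Cross(b.mVector);`, and feed n straight into the next intrinsic without wrapping it back into a Vector3f.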