Is there a cost to switching between SSE and non-SSE instructions?


Hi all:

I'm curious: is there some kind of cost incurred by switching between SSE and non-SSE instructions?

Something like how switching between the compute and rendering pipelines incurs driver overhead.

I tried an SSE-optimized cross product and it turned out to be effective: almost three times faster than the non-SSE version.

However, the whole ray tracer doesn't improve at all, for some unknown reason.

(One possible reason is that the cross product is not the bottleneck. However, the original cross product function was not written well, and after a little optimization the performance of both the function and the whole ray tracer went up more than 5%, so it does affect performance.)

Any tips?

Thanks


Yes. Whenever you move values between SSE and non-SSE code, the CPU has to move those values between different sets of registers, causing (at least) an expensive load-hit-store. I'm sure someone can give you a more technical explanation, but if possible you should try to do as many calculations as possible in SSE (like a cross product followed by normalization and multiplication) if you want to get any performance gain.
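For illustration, here's a rough sketch of what that means in practice (SumXYZ is a made-up example, nothing from your ray tracer): the slow version bounces the value through memory just to reach the individual floats, while the fast version keeps the work in SSE registers and moves a single float out at the very end.

#include <xmmintrin.h>

// Anti-pattern: bouncing a value between the SSE and scalar register sets.
float SumXYZ_Slow(__m128 v)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);           // store to memory just to read lanes as floats
    return tmp[0] + tmp[1] + tmp[2]; // scalar adds after a round trip through memory
}

// Better: do the arithmetic in SSE registers, extract once at the end.
float SumXYZ_Fast(__m128 v)
{
    __m128 yzx = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 0, 2, 1)); // (y, z, x, w)
    __m128 zxy = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 1, 0, 2)); // (z, x, y, w)
    __m128 sum = _mm_add_ps(_mm_add_ps(v, yzx), zxy);           // x+y+z in the low lane
    return _mm_cvtss_f32(sum);       // single transfer out of the SSE domain
}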

While there is no inherent instruction transition cost, there are some serious gotchas when working with SSE inside a C++ project. Take the following structure:

struct Vector3f
{
    Vector3f Cross(const Vector3f& rhs) const;

    __m128 mVector;
};

The signature of the Cross member contains the primary performance hit, but it is subtle. Any time I call Cross, I am passing a reference to a Vector3f wrapper rather than the __m128 register itself. To the compiler this means that instead of simply issuing the appropriate _mm instructions, it must first store the register to memory, pass a pointer to that memory, and then load the value back in order to compute the cross product. That store flushes the SSE pipeline for the given register, which often means waiting out the latency of several in-flight SSE instructions before the write can take place. The return path causes the same problem: the compiler doesn't realize it could simply leave the result in a register for further SSE instructions to use directly, so it flushes at the point of return as well.

What all this means is that the cost of computing the cross product itself may be greatly reduced, but every call to the Cross member flushes the SSE pipeline, which can be very costly. To fix the constant flushing, you would need to rewrite the structure as something like the following:

struct Vector3f
{
    __m128 Cross(__m128 rhs) const;

    __m128 mVector;
};

This breaks encapsulation but corrects the constant pipeline flushing. If you look at the following article: http://www.gamedev.net/page/resources/_/technical/game-programming/practical-cross-platform-simd-math-part-2-r3101 you can see the timing differences from correcting this issue in a single function within a raytracer testbed. The article shows a 4% gain simply from fixing this one item in a single function call. I eventually went and hand-optimized the entire codebase to follow pass-by-register semantics to see what would happen, and saw nearly a 50% gain with no changes other than the calling-convention fixes.
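For reference, one possible body for that register-passing Cross, using the standard shuffle trick (this particular implementation is just a sketch of mine, not code from the article):

#include <xmmintrin.h>

struct Vector3f
{
    __m128 mVector;

    // Argument and result are raw __m128 values, so the whole chain of
    // SSE operations can stay in xmm registers with no store/load round trip.
    __m128 Cross(__m128 rhs) const
    {
        __m128 aYZX = _mm_shuffle_ps(mVector, mVector, _MM_SHUFFLE(3, 0, 2, 1));
        __m128 bYZX = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 0, 2, 1));
        __m128 c    = _mm_sub_ps(_mm_mul_ps(mVector, bYZX),
                                 _mm_mul_ps(aYZX, rhs));
        // c holds (z, x, y); shuffle back into (x, y, z) order.
        // The unused w lane conveniently comes out as zero.
        return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
    }
};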
IIRC, on x86-64 all float math is now done using the SSE registers, so there's not much of a penalty for intermingling scalar and vector data (perhaps some shuffling still).
...but as Juliean said above, on x86, regular floats and SSE __m128s are stored in different kinds of registers, meaning there's a cost involved in moving data from being a float to being a member of a vector (and vice versa).
There's a little more to it than that, but that can be an issue too. You want all your calculations of the same kind to happen together, with no unnecessary reads, all in a big list. For a ray tracer this should be achievable; it's a problem that lends itself well to both SSE and multithreading. You can also divide the screen into quadrants or octants and assign each one to its own thread, which should speed things up a lot. But I think your main costs will depend on what collision structures you use and how you arrange your data for cache-friendly computation.
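As a rough sketch of the screen-splitting idea (RenderImage and TracePixel are made-up names standing in for your own routines):

#include <algorithm>
#include <thread>
#include <vector>

void RenderImage(int width, int height, void (*TracePixel)(int x, int y))
{
    unsigned bands = std::thread::hardware_concurrency();
    if (bands == 0)
        bands = 4; // hardware_concurrency may return 0 when unknown

    int bandHeight = (height + static_cast<int>(bands) - 1) / static_cast<int>(bands);
    std::vector<std::thread> workers;

    for (unsigned b = 0; b < bands; ++b)
    {
        int y0 = static_cast<int>(b) * bandHeight;
        int y1 = std::min(height, y0 + bandHeight);
        // Each worker owns one horizontal band, so no two threads
        // ever touch the same pixel.
        workers.emplace_back([=]
        {
            for (int y = y0; y < y1; ++y)
                for (int x = 0; x < width; ++x)
                    TracePixel(x, y);
        });
    }

    for (std::thread& t : workers)
        t.join();
}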

This is my thread. There are many threads like it, but this one is mine.

Thanks for all the answers.

I think I've got a general idea of how to improve my ray tracer.

While there is no inherent instruction transition cost, there are some serious gotchas when working with SSE inside a C++ project. [...] I eventually went and hand-optimized the entire codebase to follow pass-by-register semantics [...] and saw nearly a 50% gain with no changes other than the calling-convention fixes.

I'm trying your method, and it does avoid the SSE pipeline flushing.

However, I may have to pass the __m128 data instead of the vector itself. The higher-level code will look something like this:

v0.data = Cross(v1.data, v2.data);

I'm not a C++ ninja. I can only think of the following solutions:

  • Use a macro to hide the .data field. The field is still there, but at least it's hidden:
    D(v0) = Cross( D(v1), D(v2) );
  • Use a conversion operator. It may add a couple of copy operations, but hopefully the compiler can optimise them away (rough sketch below).
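For the operator option, I'm thinking of something like this (untested sketch, assuming the register-passing Cross free function from the replies above):

#include <xmmintrin.h>

__m128 Cross(__m128 a, __m128 b); // the register-passing version from above

struct Vector3f
{
    __m128 data;

    Vector3f() = default;
    Vector3f(__m128 v) : data(v) {}           // wrap a raw register
    operator __m128() const { return data; }  // hand the register back out
};

// Both conversions are trivial and should inline away, so this ought to
// compile to the same code as v0.data = Cross(v1.data, v2.data):
//     Vector3f v0 = Cross(v1, v2);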

The reason I don't like that style of higher-level code is that I will also use the code this way in the non-SSE math path when SSE is not supported. (I know SSE has almost 100% coverage; let's just assume it isn't supported on certain machines.)
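For that fallback, maybe I can switch the backing type at compile time so the call sites stay identical on both paths (vec4 and SIMD_VECTOR_SSE are just made-up names):

// SIMD_VECTOR_SSE would be a project-level define.
#if SIMD_VECTOR_SSE
    #include <xmmintrin.h>
    typedef __m128 vec4;               // SSE-backed storage
#else
    struct vec4 { float x, y, z, w; }; // plain-float fallback
#endif

vec4 Cross(vec4 a, vec4 b); // same signature, one body per backend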

Any tips are welcome.

