Practical Cross Platform SIMD Math: Part 2

Published May 04, 2013 by All8Up

In order to finish the discussion of practical SIMD usage, it is important to put the code into use and also discuss some performance issues. Finally, in order to make the code really useful, debugging tools will be added to catch errors as quickly as possible.

Having created a starting point for writing a math library using basic SIMD abstractions, it is time to put in some real testing. While unit testing gets you halfway to a solid library, a primary reason for using SIMD is the extra performance. Unfortunately, with so many variations of SIMD instructions, it is easy to slow things down by accident, or to break things in ways which pass the unit tests but are not really correct in practice.

We need something which uses the current Vector3fv class in a manner similar to game math, but with enough variation that performance and odd cases are easily tested. Additionally, we need something simple that doesn't take a lot of work to implement, which brings us to a simple answer: a ray tracer.

What exactly does a ray tracer have to do with a game? The required math is generally shared between a game and a ray tracer; additionally, a ray tracer abuses memory in ways similar to a game, and finally, a ray tracer condenses several minutes' worth of gameplay-related math into something which runs for only a couple of seconds.

Presented here is a ray tracing system implemented in a manner eventually usable by the unit tests through fixtures. This article starts by implementing it as a standalone application in order to get some basics taken care of. The application may also continue to be useful while discussing features of C++11 and less common ways to use them. Given that this is a testbed, it is worth noting that the ray tracer is not intended as a real rendering starting point; it is dirty, hacked together and probably full of errors and bugs.

As with prior work, the CMake articles (here) cover the build environment, the prior overview of how the library is set up can be found here, and the source repository can be found here.

Preparing For The Future

Before moving forward, it is important to note that a couple of changes have been made to the SIMD instruction abstraction layer in preparation for future extensions. The first item is renaming the types used; additionally, some extra utilities have been added which will be discussed later.

Instead of the name Vec4f_t, for instance, the types have been renamed to follow the intrinsic types supplied by Clang and GCC a little more closely. The primary reason for this is the greatly expanded set of types which will eventually need to be covered; while the old naming convention would have worked, it was better for GCC and Clang to follow a slightly modified form which matches the compiler conventions more closely. The names are now: U8x16_t, I8x16_t, U16x8_t, etc., up to F32x4_t, F64x2_t and beyond.
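
As a rough illustration, the SSE versions of the new names might map onto the compiler's intrinsic types along these lines (a hedged sketch; the exact typedefs in the repository may differ):

#include <emmintrin.h>	// SSE2

typedef __m128i U8x16_t;	// sixteen unsigned 8-bit lanes
typedef __m128i I8x16_t;	// sixteen signed 8-bit lanes
typedef __m128i U16x8_t;	// eight unsigned 16-bit lanes
typedef __m128  F32x4_t;	// four 32-bit float lanes
typedef __m128d F64x2_t;	// two 64-bit float lanes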

Again, while not strictly necessary, this made the code a bit more consistent and easier to map to and from Neon types and other intrinsics. There are also some additions to the build in order to support MMX. It is important to realize, though, that prior to SSE2, MMX shared registers with the FPU, which is undesirable, so MMX is only enabled in the presence of SSE2. The SSE2 implementation of MMX is actually a full replacement which uses the XMM registers, meaning that you use the __m128i structured data instead of the __m64 structure; so this is not really MMX, simply SSE support for integral types.

It should also be noted that the SSE3-specific versions of Dot3 and other functions have been removed: in the first article I made a mistake and used timing sheets from two differently aged CPUs without catching the inconsistency. The SSE 3 implementation turned out to be slower than the SSE 1 implementation; the _mm_hadd_ps instruction is unfortunately too slow to beat the SSE 1 combination of instructions when the timings are all measured consistently on a single CPU target. These are just a couple of the changes since the first article, and some will be covered in more detail later. For other changes, you will likely just want to browse the source and take a look.
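
For reference, the two variants look roughly like the following (a hedged sketch, not the repository code; it assumes the hidden fourth lane is kept at 0.0f as described later). The SSE3 version has fewer instructions, but the latency of _mm_hadd_ps makes it the slower of the two:

#include <pmmintrin.h>	// SSE3 (also pulls in SSE/SSE2)

static float Dot3_SSE1( __m128 a, __m128 b )
{
	__m128 m = _mm_mul_ps( a, b );	// x*x, y*y, z*z, w*w
	__m128 y = _mm_shuffle_ps( m, m, _MM_SHUFFLE( 1, 1, 1, 1 ) );
	__m128 z = _mm_shuffle_ps( m, m, _MM_SHUFFLE( 2, 2, 2, 2 ) );
	return _mm_cvtss_f32( _mm_add_ss( _mm_add_ss( m, y ), z ) );
}

static float Dot3_SSE3( __m128 a, __m128 b )
{
	__m128 m = _mm_mul_ps( a, b );	// w lane assumed to be 0.0f
	m = _mm_hadd_ps( m, m );	// x+y, z+w, x+y, z+w
	m = _mm_hadd_ps( m, m );	// full horizontal sum in every lane
	return _mm_cvtss_f32( m );
}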

For instance, a starting point for a matrix exists, but it is just that: a starting point, and not particularly well implemented as of yet. It is suggested not to use these additional classes for now, as they are experimental and still being validated, which may cause notable changes as I move forward.

Library Validation

As important as the performance gains of vectorization, or arguably more so, is making sure the library is usable without undue hassle and that common bug patterns are avoided. The only way to really validate code is to use it. This is one of the reasons a ray tracer is a decent testbed: with a small project there is less resistance to change if problems with the code are noticed.

Additionally, as with the unit tests, it is useful to switch between the different versions of the vectorized code quickly to make sure the results are consistent. While the unit tests should catch outright errors, subtle differences in execution are easier to catch with the ray tracer since it outputs an image you can inspect for correctness. Eventually it will be useful to automate the ray tracer tests and note output differences within the unit test runs, but for the time being we'll just validate times via the log and images via the version 1.0 eyeball. The intention in using the supplied code is to find and correct the obvious errors which will exist in any amount of code.

While I don't intend to walk through each error and its correction, at least 10 very obvious errors were fixed in the first couple of hours of implementing the ray tracer. It doesn't take much to make a mistake, and often unit tests just don't catch the details. An example of a bug which showed up almost immediately was in the division-by-scalar implementation: it was initializing the hidden element of the SIMD vector to 0.0f and generating a NaN during division.

While we don't care about the hidden element itself, some of the functions expect it to always be 0.0f, and as such the NaN corrupted later calculations. The fix was simply to initialize the hidden element of the divisor to 1.0f in that particular function, and everything started working as expected. This is just one example of a trivial-to-make mistake which takes less time to fix than to diagnose, but which can exist in even relatively small amounts of code.
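
In rough outline the bug looked something like this (a hypothetical reconstruction, not the repository code):

#include <xmmintrin.h>	// SSE

static __m128 DivideByScalar_Broken( __m128 v, float s )
{
	// _mm_set_ps takes the highest lane first: the hidden w lane of the
	// divisor is 0.0f, so 0.0f / 0.0f produces a NaN in that lane.
	return _mm_div_ps( v, _mm_set_ps( 0.0f, s, s, s ) );
}

static __m128 DivideByScalar_Fixed( __m128 v, float s )
{
	// With the divisor's hidden lane at 1.0f, the result's hidden lane
	// stays 0.0f, as the rest of the library expects.
	return _mm_div_ps( v, _mm_set_ps( 1.0f, s, s, s ) );
}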

NaN And Infinity

Testing for NaN and infinity values is usually performed by calling standard library functions such as isnan or isfinite. Unfortunately there are two problems. The first is that Microsoft decided to rename such functions to _isnan and _isfinite, which of course means a little preprocessor work to get around the naming inconsistency.

The second problem is that with the SIMD fundamental types you have to extract each element individually in order to pass them to such functions. Thankfully, by leveraging the rules of IEEE 754 floating point, we can avoid both issues and perform the validation relatively quickly. What is a NaN or infinity value in terms of floats? The simple definition is that the binary representation of the float is set in such a way as to flag an error. For our purposes, we don't care what causes the error results; we just want to notice them quickly.

While some SIMD instruction sets do not fully conform to IEEE 754, the handling of NaN and infinity values usually does follow the common rules. To summarize: any operation involving a NaN returns a NaN, and most operations involving an infinity (when not involving a NaN) result in an infinity, positive or negative; the notable exception is that multiplying an infinity by zero is invalid and produces a NaN.

Additionally, the comparison rules are defined such that NaN can never compare equal to anything, even another NaN. This gives us a very effective method of testing for NaN and infinity using SIMD instructions: multiply the SIMD register by zero, and if the result does not compare equal to zero, then the value was either a NaN or a positive or negative infinity (both of which produce NaN when multiplied by zero). Again, we don't care which; we just care whether the values were valid. Using Intel SSE, this is represented by the following code:

static bool IsValid( F32x4_t v )
{
	// NaN * 0 = NaN and infinity * 0 = NaN, so any invalid lane fails
	// the equality test below (NaN never compares equal to anything).
	F32x4_t test = _mm_mul_ps( v, _mm_setzero_ps() );
	test = _mm_cmpeq_ps( test, _mm_setzero_ps() );
	// All four lanes must compare equal to zero for the vector to be valid.
	return( 0x0f == _mm_movemask_ps( test ) );
}

With this abstraction in hand, we are able to insert additional testing into the Vector3fv classes. Even though this is a fast test, the extra validation, placed in too many locations, can quickly drag debug build performance to its knees. We want to enable the advanced testing only as required, and as such it will be a compile-time flag added to CMake.

Additionally, debugging a problem is sometimes only possible in release builds, so we want the flag to be independent of the debug build flags. Presenting the new option is quite simple with the CMake setup; we will be exposing SIMD_ADVANCED_DEBUG to the configuration file. This leaves one additional problem: the standard assert macro compiles out of release builds, so our flag would still have no effect there. We need a customized function to break on the tests even in release builds.

Since this is a handy item to have for other reasons, it is placed in the Core library header as a macro helper. CORE_BREAK will cause a breakpoint in the program, even in release builds. On Windows we simply call __debugbreak(), and on *nix-style OSs we use raise( SIGTRAP ). When everything is set up, the excessive level of debugging is handy in catching problems but comes at a notable cost: in heavy math, it is nearly a 50% slowdown to the library.
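
Wired together, the pieces might look roughly like the following (a minimal sketch; the macro bodies for Windows and *nix come from the description above, while the SIMD_VALIDATE helper name is an assumption):

#if defined( _MSC_VER )
#	include <intrin.h>
#	define CORE_BREAK()	__debugbreak()
#else
#	include <signal.h>
#	define CORE_BREAK()	raise( SIGTRAP )
#endif

// Only perform the expensive per-operation validation when the CMake
// option has defined SIMD_ADVANCED_DEBUG, independent of NDEBUG.
#if defined( SIMD_ADVANCED_DEBUG )
#	define SIMD_VALIDATE( v )	do { if( !IsValid( v ) ) CORE_BREAK(); } while( false )
#else
#	define SIMD_VALIDATE( v )	do {} while( false )
#endif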

We may need to cut back on the level of debugging, or possibly use two flags for quick and detailed variations. For the time being, though, the benefits outweigh the costs. Of course, the base functionality of IsValid always exists, so it is possible to put in spot checks as required to catch common problems and rely on the heavy-handed solution only when absolutely needed.

Performance Measurement

Why is making sure that the math library is decently optimized so important? Donald Knuth stated something which is often paraphrased as: "premature optimization is the root of all evil." The full quotation, though, is more appropriate to a math library: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

A math library is so fundamental to all of your game code that there is a very strong argument it falls in the 3% of code which it is not evil to optimize as soon as possible. Of course, this has to be taken with a grain of salt: since you can't go changing all the interfaces and usage patterns during an optimization pass, this is most definitely the sort of function-by-function optimization which is generally frowned upon.

But, unlike 90+% of other functions, optimizing something like a dot product will easily affect hundreds, or more likely thousands, of calls per frame. The reason to optimize it early is that a dot product is a small, often-inlined function which will not show up as a hot spot in a profiler; the only reliable way to know the optimization helps is to measure with it enabled and disabled.

Classes And SIMD

Part of the performance gain from SIMD math ends up being lost because, as mentioned briefly in the first article, classes which wrap SIMD types often do not perform as well as they could. While compilers continue to get better at optimizing in this area, it is still a fairly weak point for most. The difficulty is not so much within single functions, since the compilers will inline and optimize away (almost) all of the wrapper and simply use the intrinsics on the fundamental type. Instead, the difficulty is in calls between functions which are not inlined.
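
The wrapper pattern in question looks roughly like this (a simplified sketch; the real class adds the full set of math operators):

class Vector3fv
{
public:
	Vector3fv( F32x4_t value ) : mValue( value ) {}

	// The conversion operator lets the wrapper decay back to the
	// fundamental type, so inlined code optimizes down to raw intrinsics.
	operator F32x4_t() const { return mValue; }

	// ... constructors and math operators built on the abstraction layer ...

private:
	F32x4_t mValue;
};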

In such cases the compiler is forced to flush the SIMD register back to memory and use normal class-style parameter passing when calling the function. Once in the called function, the data is of course loaded back into a SIMD register. This is unneeded overhead which can, in most cases, be avoided by passing the fundamental type. Before you rush out and try to fix things, though, there are several items to consider. The first is: do you really care?

If a function is called rarely, this overhead is trivial and can be ignored while you continue to get the primary gains from the vectorized types. On the other hand, if a function is called continually, as with the intersection calls in the ray tracer, the overhead is not trivial and causes a notable loss of performance. Even in the case of highly called functions, though, the vectorized code still outperforms the native C++ by a notable amount, just not by as much as it could.

Is the work of fixing this problem worth it? That is left to the reader to determine for their specific cases; we'll simply be covering the solutions provided by the wrapper and abstraction architecture.

Isolation Of SIMD Performance

Clang and GCC both disallow the use of intrinsics from higher-level instruction sets unless you also enable the compiler itself to use those instruction sets. This is unfortunate in a number of ways, but mostly when attempting to get a baseline cost/benefit number for our isolated SIMD work. Enabling compiler support for higher instruction sets makes the entire program faster, of course, but what we really want is an idea of the gains from just the bits we have vectorized ourselves.

Using MSVC, it is possible to tell the compiler not to use SIMD instruction sets in its generated code, yet still use the vectorized math and hand-coded intrinsics we have written. In this way the isolated changes can be timed properly. Admittedly, this is not a fully accurate method of measuring performance, since interactions between the compiler optimizations and the library are just as important as the raw vectorized changes, but it does give a baseline set of numbers showing how the wrappers are performing in general.

For these reasons, unless otherwise noted, all performance information will be measured with MSVC on an Intel i7 with the compiler set to generate non-SSE code.

The Ray Tracer

While the ray tracer is intended to be simple, it is also written as a complete application with object-oriented design and driving principles. The ray tracer also introduces a small open source utility class which has been dropped in to parse JSON files (see http://mjpa.in/json). I don't intend to detail the parsing of JSON, but I will quickly review why it was chosen.

JSON as a format is considerably simpler than XML and supports a differentiation of concepts which is quite useful for programming: every value has one of a handful of simple types, such as string, number, array or object. With these concepts, parsing is simplified in many cases, and error checks can start by simply checking the type before trying to break down any content. For instance, a 3D vector can be written in JSON as: "MyVector" : [0, 0, 0]. With the library in use, the first error check is simply: isArray() && AsArray().size()==3.
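
Reading the vector out of a parsed document might look something like this (a hedged sketch following the IsArray/AsArray style above; the exact method names and ownership rules of the bundled library may differ):

// Convert a parsed JSON value such as [0, 0, 0] into three floats.
static bool ReadVector3( const JSONValue& value, float out[ 3 ] )
{
	if( !value.IsArray() || value.AsArray().size() != 3 )
		return false;
	for( size_t i = 0; i < 3; ++i )
		out[ i ] = static_cast< float >( value.AsArray()[ i ]->AsNumber() );
	return true;
}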

XML, on the other hand, has no such concepts built in: everything is a string, and you have to parse the content yourself. This is not to say XML doesn't have a place; it is just a more difficult format to parse, and for this testbed JSON was a better fit. The only real downside to JSON is its lack of comment support; it would be nice to have comments, but people argue adamantly against supporting such an extension. The scene files used by the test application are quite simple; there are only six key sections listing out specific items.

If you open the Sphere.json file in the Assets directory, you can see a very simple example. Hopefully the format is easy enough to understand, as I won't be going into details. As a little safety item, it does include a version section, which is currently 1.0. As mentioned, though, the ray tracer is a quickly put together testbed and not properly coded in a number of areas: the lighting model is incorrect, the super-sampling is just a quick approximation, and some of the settings are probably used incorrectly in general. While the ray tracer may be improved eventually, it currently serves its goal of testing the math vectorization in non-trivial ways.

First Performance Information

There are four different timings we are interested in when testing the ray tracer. The first is the pure C++ implementation with all vectorization disabled. The second is the reference SIMD implementation which, due to working on the hidden element, will be a bit slower than the pure C++ version.

Finally, we test the base SSE modification and then the SSE 4 upgrades. As mentioned, all SSE optimizations will be turned off in the compiler, so the relative performance will be completely attributable to the vectorization of the math in use. We will be using a more complicated scene for testing, found in the BasicScene.json file; it contains a couple of planes and a number of spheres with varying materials. With a couple of minor additions to the SSE code since the first article, the first run results in the following numbers (command line: RayTracer BasicScene.json):

Total time      Relative performance    Description
41.6 seconds    1.00                    C++
47.4 seconds    0.88                    Reference
36.3 seconds    1.15                    SSE 1
33.0 seconds    1.26                    SSE 4

For a fairly simple starting point, 15% and 26% performance gains are quite respectable. The reference implementation is 12% slower than the normal C++; we need to fix that eventually, but won't be doing so in this series of articles, it will simply be fixed in the repository at some point in the future. These numbers are not the huge gains you may have been expecting, but remember that the only change in this code is the single Vector3fv class; all other code is left with only basic optimizations.

Passing SIMD Fundamental Types By Register

Passing the wrapper class to functions, instead of passing the fundamental type, breaks part of the compiler's ability to optimize. It is fairly easy to gain back a portion of the optimizations with a little additional work. Since the class wrapper contains a conversion operator and constructor for the fundamental type, there is no reason function calls can't be optimized by passing the fundamental type, but how do we do that?

In the SIMD instruction abstraction layer a utility structure has been added; for Intel it is in the Mmx.hpp file. The purpose of the structure is to provide the most appropriate parameter passing convention for each of the abstractions. In the reference implementation the preferred passing style is a const reference, while in the MMX+ versions we pass by the unqualified basic type. This type indirection maintains compatibility between the various implementations and has little effect on overall code usage or implementation details.

At any time you can reference the type directly via Simd::Param< F32x4_t >::Type_t, though, as is done in the Vector3f classes, we pull the type definition into the wrapper class for ease of use and simply name it ParamType_t. There are a number of downsides to passing by the fundamental type, all related to ease of use: when you want to use the fundamental type, you either need to use the SIMD instruction abstraction layer directly or wrap it back up in the Vector3fv wrapper class.

In practice, it is a bit more typing and looks odd, but re-wrapping the passed-in register has no fundamental effect on code generation with the given compilers. The wrapper is stripped right back out during optimization, and you retain the benefits of the class wrapper without additional costs.
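
The utility structure might be sketched as follows (a simplified reconstruction; the specialization trigger and exact layout in Mmx.hpp may differ):

namespace Simd
{
	// Default: pass by const reference, as the reference implementation prefers.
	template< typename T >
	struct Param
	{
		typedef const T& Type_t;
	};

	// For hardware SIMD types, pass the unqualified type so it can travel
	// in a register. (Keying the specialization off an SSE build here is
	// an assumption for the sketch.)
	template<>
	struct Param< F32x4_t >
	{
		typedef F32x4_t Type_t;
	};
}

// Pulled into the wrapper for convenience, as described above:
// typedef Simd::Param< F32x4_t >::Type_t ParamType_t;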

Intersection Optimization

The primary ray tracer loop, Scene::Render, cycles through all the pixels of the output image and builds a ray for each. In the case of super sampling, it also builds intermediate rays which are averaged together. Each ray is sent into the scene via the recursive Scene::Trace function, and from within Scene::Trace the rays are checked against the contents of the scene via the ModelSet::Intersect and ModelSet::Shadowed functions.

In all of these functions, the ray to be computed is packaged in a simple wrapper and passed by const reference. Given the two vectors in the Ray class, the optimization loss is fairly significant. Since the per-model intersection routines are the most called within the testbed, we will proceed to optimize them in a simple manner.

At this point, it is important to mention a rule of optimization: fix the algorithms before proceeding to optimize single functions. In the case of the ray tracer, its brute force nature is the real problem. We do not want to fix this, though; the purpose is not to write a fast ray tracer but to have a good, consistent method of abusing the math types.

So, for our purposes we are breaking the rule for good cause. Always remember not to focus on little details as is being done here, or, stated another way: "Do as I say, not as I do." In order to get a good idea of the performance difference, we'll start by optimizing the sphere and plane intersection routines.

The function declaration is currently:

virtual bool Intersect( const Math::Ray& r, float& t ) const = 0;

We will change this to pass the two components of the ray by register:

virtual bool Intersect( Vector3::ParamType_t origin, Vector3::ParamType_t dir, float& t ) const = 0;
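
A sphere intersection using the new signature might look roughly like this (a hedged sketch of the standard quadratic ray/sphere test, not the repository code; it assumes a normalized direction and hypothetical Dot, mCenter and mRadius names):

#include <cmath>

bool Sphere::Intersect( Vector3::ParamType_t origin, Vector3::ParamType_t dir, float& t ) const
{
	// Re-wrap the registers once; the wrapper is optimized back out.
	const Vector3 oc = Vector3( origin ) - mCenter;
	const float b = 2.0f * Dot( oc, Vector3( dir ) );
	const float c = Dot( oc, oc ) - mRadius * mRadius;
	const float disc = b * b - 4.0f * c;	// a == 1 for a normalized direction
	if( disc < 0.0f )
		return false;
	t = ( -b - std::sqrt( disc ) ) * 0.5f;	// nearest hit along the ray
	return t > 0.0f;
}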

We are not expecting a huge win from this change, since the optimizations are still broken further up the chain of function calls, but there should be a noticeable difference:

Total time      Relative performance    Description
41.6 seconds    1.00                    C++
47.4 seconds    0.88                    Reference
35.8 seconds    1.16                    SSE 1
31.9 seconds    1.30                    SSE 4

For SSE 1 this is not a great win, but 4% for SSE 4 is notable. Now, why would SSE 4 get such a benefit while SSE 1 shows only about one percent? If you look at the intersection code, the first thing which happens in both cases is a dot product. SSE 4's dot product is so much faster than the SSE 1 implementation that the latencies involved in SSE 1 were hiding most of the performance losses.

SSE 4, though, has to load the registers just the same, but there are fewer instructions in which the latencies can hide. Passing by register under SSE 4 gets a notable speed benefit in this case specifically because we removed wait states in the SIMD unit. Of course, changing all the code up to the main loop to pass by register will supply further, potentially even more notable, performance gains.
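
For context, the SSE 4 dot product mentioned here collapses the whole SSE 1 shuffle-and-add sequence into a single instruction (a minimal sketch, assuming the three-component form with the hidden lane ignored):

#include <smmintrin.h>	// SSE4.1

static float Dot3_SSE4( __m128 a, __m128 b )
{
	// Mask 0x71: multiply the x, y and z lanes (high nibble 0x7) and
	// place the sum in lane 0 of the result (low nibble 0x1).
	return _mm_cvtss_f32( _mm_dp_ps( a, b, 0x71 ) );
}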

Conclusion

The SIMD wrapper classes can provide a significant performance gain without requiring any changes to a code base. With some extra work, the pass-by-register modifications can recover a significant portion of the performance lost by wrapping the fundamental types. Once pass-by-register is pushed throughout the code base, even hand-coding SIMD intrinsics will not usually provide a notable amount of extra performance.

Of course, as with all things optimization, you need to keep in mind where you apply your effort. Low-level optimizations of the math library and types are unlikely to provide your greatest gains, even with a fair SIMD implementation. Other math types will be added and used in further articles, but this concludes the overview of the practical usage of SIMD.

It is highly suggested that you learn more about SIMD in general, because eventually using the intrinsics directly can optimize many portions of your game code well beyond the specific math involved.
