[c++] slow/fast math functions


Hi,
Lately I found quite an interesting thing. Everybody writes that the standard mathematical functions in the C++ math library are very slow, and therefore inadequate for game programming. I made a few tests comparing so-called "fast" functions against the standard ones in VC++ 2010 Express Edition. Everything written in many articles held true (the fast functions were faster than the standard ones) until I switched from debug mode to release mode. Then it turned out that the standard math functions were 4 to 6 times faster than the "fast" ones. The tested functions were sin(x), sqrt(x), and 1.0f/sqrt(x).

Am I doing something wrong?

Performance was tested by measuring the time of, for example, 10^6 calls to each function.

Regards,
Misery

PS: of course I have used 'inline' and so on.

Which articles? How old are they? It may be that Visual Studio is using processor extensions or instructions that weren't commonplace at the time the articles were written.

Quote:
Original post by Misery
Lately I found quite an interesting thing. Everybody writes that the standard mathematical functions in the C++ math library are very slow, and therefore inadequate for game programming. I made a few tests comparing so-called "fast" functions against the standard ones in VC++ 2010 Express Edition. Everything written in many articles held true (the fast functions were faster than the standard ones) until I switched from debug mode to release mode. Then it turned out that the standard math functions were 4 to 6 times faster than the "fast" ones. The tested functions were sin(x), sqrt(x), and 1.0f/sqrt(x).


Who is "everybody"? The compiler is most likely generating calls to the floating-point unit for these functions; that is, you have hardware support for the function evaluations. "Fast" functions tend to be software implementations that trade accuracy for speed. Whether such a function outperforms the FPU-supported one depends on the function and the implementation (run a profiler to measure). The FPU-supported functions are required to produce correctly rounded results (as if you had computed in infinite precision and rounded to the nearest floating-point number). The fast functions are usually not correctly rounded, but good enough for the application.

Do you have a particular "fast" function that is slower than the math-library counterpart? Maybe you can post it so that folks can comment on it.

Quote:

Performance was tested by measuring the time of, for example, 10^6 calls to each function.

First off, check your test. Make sure the compiler didn't optimize the code into something different from what you expected.

Quote:

Everybody writes that the standard mathematical functions in the C++ math library are very slow

Who's "everybody", and what is the target platform? Not all platforms support the same feature sets. Consider "Carmack's Inverse Sqrt()". Back then it was much faster than the alternatives; now the _mm_rsqrt_ss intrinsic is probably the faster/better solution.
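
For illustration, here is a minimal sketch of how _mm_rsqrt_ss can be used from C++, with one Newton-Raphson step tacked on since the hardware estimate alone is only good to about 12 bits. This is just an example, not code from any of the articles mentioned:

#include <xmmintrin.h>  // SSE intrinsics

inline float RsqrtSSE(float x)
{
    __m128 v = _mm_set_ss(x);
    __m128 e = _mm_rsqrt_ss(v);                 // hardware estimate of 1/sqrt(x), ~12-bit accuracy
    // One Newton-Raphson refinement: e = e * (1.5 - 0.5*x*e*e)
    __m128 halfx = _mm_mul_ss(_mm_set_ss(0.5f), v);
    __m128 e2    = _mm_mul_ss(e, e);
    __m128 corr  = _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(halfx, e2));
    return _mm_cvtss_f32(_mm_mul_ss(e, corr));
}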

Quote:
Original post by Rycross
Which articles? How old are they? It may be that Visual Studio is using processor extensions or instructions that weren't commonplace at the time the articles were written.


This.

Remember, at one point using a lookup table would have been much faster for certain operations (back when I started, sin/cos lookup tables were very common), but as CPU speeds have increased and memory latency has become relatively more expensive, it's often much faster to use instructions than to take a round trip into RAM and back.*

*exceptions still apply; profile before deciding what's best. YMMV; this advice does not come with a money-back guarantee nor with a cool hat
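
For reference, a minimal sketch of the kind of sine lookup table being described above (the table size and indexing scheme here are made up for illustration; accuracy is limited by the table resolution):

#include <cmath>

static const int kTableSize = 1024;             // power of two so we can wrap with a mask
static float g_sinTable[kTableSize];

void InitSinTable()
{
    for (int i = 0; i < kTableSize; ++i)
        g_sinTable[i] = std::sin(2.0f * 3.14159265f * i / kTableSize);
}

inline float TableSin(float radians)
{
    // Convert the angle to a table index; the mask wraps both positive and negative angles.
    int index = (int)(radians * (kTableSize / (2.0f * 3.14159265f)));
    return g_sinTable[index & (kTableSize - 1)];
}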

I am completely new to game programming, so I read everything I can possibly find. However, I have been a fluid mechanics computations programmer for a few years, so I understand the mechanics of programming and things like that, just not game programming yet.

So... 'Everybody' is 'Game Programming Gems' and everything you can find on Google by typing, for example, "fast sin function", "fast sqrt", "fast round", etc.
Sorry for not giving the sources the first time.

I have also tried "Carmack's Inverse Sqrt()", and it still performs slower than the standard C++ 1.0f/sqrt(x).

Quote:
First off, check your test. Make sure the compiler didn't optimize the code into something different from what you expected.

Quote:
Whether such a function outperforms the FPU-supported one depends on the function and the implementation (run a profiler to measure)

How can I do that?

Tested functions:

inline float Sqrt(const float x)
{
    union
    {
        int i;
        float f;
    } u;                                   // 'static' removed: it served no purpose here
    u.f = x;
    u.i = (1<<29) + (u.i >> 1) - (1<<22);  // bit trick: roughly halve the exponent to get an initial guess
    u.f = 0.5f * (u.f + x/u.f);            // one Newton-Raphson step to refine the guess
    return u.f;
}


inline float InvSqrt (float x)
{
    float xhalf = 0.5f*x;          // note: 'static' here was a bug; static locals are initialised only on the first call
    int i = *(int*)&x;
    i = 0x5f3759df - (i >> 1);     // magic-constant initial guess
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);      // one Newton-Raphson step
    return x;
}

// Lagrange interpolation polynomial computed with Horner's scheme, for the range -2*Pi to 2*Pi
inline float Sin2Pi(float x_rad)
{
    if (x_rad >= 0.0f)
    {
        return x_rad*((((-0.00535050099910763f*x_rad+0.08404547315910650f)*x_rad-0.37606003000535748f)*x_rad+0.22629999558973990f)*x_rad+0.91587156796416036f);
    }
    else
    {
        x_rad = -x_rad;
        return x_rad*((((0.00535050099910763f*x_rad-0.08404547315910650f)*x_rad+0.37606003000535748f)*x_rad-0.22629999558973990f)*x_rad-0.91587156796416036f);
    }
}

Project options:
Release mode,
Optimizations: Speed
inline functions: tested on default and AnySuitable
Enable intrinsic functions: no
Favor fast code
the rest of the options are left at their defaults.


Thank you all for your help. :)

Both the sqrt and invsqrt functions involve moving data between integer and floating-point registers; this may be slow on your modern target hardware, some consoles in particular.

Additionally, instruction pipelines may well be deeper than they were when these functions were written, making incorrect branch predictions more expensive than they used to be.

To profile functions, make hundreds of thousands of calls to them in a loop (you might want to vary the inputs), read a timer before the start of the loop (google QueryPerformanceCounter) and again at the end, and print out how long it took.
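
A minimal sketch of that timing approach might look like the following (the volatile accumulator is just one way to keep the optimizer from throwing the loop away; the function being timed is the standard sqrt here purely as an example):

#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency
#include <cmath>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);

    volatile float sink = 0.0f;           // gives the loop an observable side effect
    QueryPerformanceCounter(&start);
    for (int i = 1; i <= 1000000; ++i)
        sink += std::sqrt((float)i);      // vary the input every iteration
    QueryPerformanceCounter(&end);

    double seconds = double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
    std::printf("10^6 sqrt calls took %f seconds (sink = %f)\n", seconds, (float)sink);
    return 0;
}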

My advice, if you're new to games programming: don't worry too much about the speed of the 'little things' until they show up in your profile. There is a whole world of interesting algorithms used in game development, which is where the real speed comes from, that and memory access patterns. When a little maths function appears in your profile you can usually optimise it out; if not, at least you have a context and a test case for trying to optimise the maths function itself. That is, of course, unless it's optimising the little things which really interests you.

In my current engine we actually do have approximate versions of a number of standard maths functions which are quicker on some platforms, and we have a unit test for each which fails if the approximate version is not faster (or does not produce the expected results).
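
To illustrate the idea, here is a hypothetical sketch of such a test; the approximation, tolerance, and timing helper are all stand-ins rather than the actual engine code, and on a typical PC the speed assertion may well fail, which is exactly the kind of regression a test like this is meant to catch:

#include <cassert>
#include <cmath>
#include <ctime>

static float ApproxSqrt(float x)                  // stand-in approximation under test
{
    union { int i; float f; } u;
    u.f = x;
    u.i = (1 << 29) + (u.i >> 1) - (1 << 22);
    return 0.5f * (u.f + x / u.f);
}

static float StdSqrt(float x) { return std::sqrt(x); }

static double TimeCalls(float (*fn)(float), int count)
{
    volatile float sink = 0.0f;                   // keeps the loop from being optimised away
    std::clock_t start = std::clock();
    for (int i = 1; i <= count; ++i)
        sink += fn((float)i);
    return double(std::clock() - start) / CLOCKS_PER_SEC;
}

void TestApproxSqrt()
{
    // Accuracy: stay within 1% of the standard library over a test range.
    for (float x = 0.01f; x < 1000.0f; x += 0.37f)
        assert(std::fabs(ApproxSqrt(x) - StdSqrt(x)) <= 0.01f * StdSqrt(x));

    // Speed: the approximation must actually beat the standard function.
    assert(TimeCalls(&ApproxSqrt, 1000000) < TimeCalls(&StdSqrt, 1000000));
}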

Quote:
Original post by Martin
To profile functions, make hundreds of thousands of calls to them in a loop (you might want to vary the inputs), read a timer before the start of the loop (google QueryPerformanceCounter) and again at the end, and print out how long it took.

No, that's how you get an artificial benchmark that's completely irrelevant to the performance of actual production code. Profiling properly requires you to use the function in an environment comparable to that of its final target.

Quote:
Original post by Martin
To profile functions, make hundreds of thousands of calls to them in a loop (you might want to vary the inputs), read a timer before the start of the loop (google QueryPerformanceCounter) and again at the end, and print out how long it took.

Get your values from an array, and write your answers to an array that you dump to a file after the loop; anything to keep the compiler from optimizing the calls out entirely. Compilers are often smart enough to remove things with no actual side effect other than wasting CPU time. They are also smart enough to constant-fold math functions they know about (i.e. sum(1,100) could be replaced with just the constant 5050). Check the disassembly in the debugger and make sure the compiler built the test case you thought it did. The better idea, though, is to make an actual benchmark out of working code that does something real, like a particle system or animation system.
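
As a sketch of that idea (the file name and buffer size are arbitrary), storing every result and dumping the buffer afterwards gives the loop an observable side effect the compiler cannot remove:

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const int kCount = 1000000;
    std::vector<float> results(kCount);

    for (int i = 0; i < kCount; ++i)
        results[i] = std::sqrt((float)(i + 1));   // inputs vary, so nothing can be constant-folded

    // Writing the buffer out is what makes the work above observable.
    if (FILE* f = std::fopen("results.bin", "wb"))
    {
        std::fwrite(&results[0], sizeof(float), results.size(), f);
        std::fclose(f);
    }
    return 0;
}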

Quote:

Original post by Misery
How can I do that?

In the more general case, get a profiler like AMD's CodeAnalyst and run it on an application that is actually doing something normal (i.e. not a contrived benchmark). This will show you where the problems are in the code. sqrt() might be slow, but you have to call it a lot before it becomes a problem. When the profiler shows you that something big is taking a lot of time, optimize that first. If you end up having to optimize something with a tight inner loop (like an animation or particle subsystem), then worry about how many times you call sqrt() and how much of an impact that has on the game.

Here's my experience with trigonometric functions under GCC, which of course is a totally different compiler, but it probably is not that much different with VC++.

Trig functions weigh in at around 360-380 cycles each, which really isn't so terrible unless you need to calculate millions of them.

An inlined super optimized minimax approximation with (only) 4 terms that works within +/- Pi runs in 25-26 cycles on my machine. With correct range reduction for arbitrary input, we're at 40-45 cycles. This is for a dataset that fits entirely into the level 2 cache.
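
For a rough idea of the structure being described, here is a sketch of a short polynomial sine with range reduction. The coefficients below are plain truncated Taylor terms, not the tuned minimax coefficients the poster is talking about, so the accuracy is noticeably worse than a real minimax fit; only the overall shape (reduce to [-Pi, Pi], then evaluate a 4-term odd polynomial with Horner's scheme) is the same:

#include <cmath>

inline float FastSin(float x)
{
    const float kPi    = 3.14159265358979f;
    const float kTwoPi = 6.28318530717959f;

    // Range reduction: wrap x into [-Pi, Pi].
    x = std::fmod(x, kTwoPi);
    if (x >  kPi) x -= kTwoPi;
    if (x < -kPi) x += kTwoPi;

    // 4-term odd polynomial evaluated with Horner's scheme:
    // x - x^3/6 + x^5/120 - x^7/5040
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f/6.0f + x2 * (1.0f/120.0f + x2 * (-1.0f/5040.0f))));
}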

Simply turning on all optimizations and throwing the right code gen flags at the compiler gives me 110-120 cycles per call. No extra code, no testing, no debugging, no possible bugs. Full precision, and takes only a bit more than twice as long as the approximation.

Now, if you calculate 50 sines per frame, it wouldn't even matter if one call took 100,000 cycles.

On the other hand, if you calculate 5,000,000 sines per frame (hopefully not!), then you'll need to take great care to have a cache-friendly access pattern, or you will be bound by the 200-300 cycles of memory latency on cache misses. Which, in fact, will make it entirely unimportant whether the sine takes 50 or 500 cycles.

sin() benchmarked at 62 million calls per second when loading the values from an array and accumulating the result in a variable.
Simply calling sin() in a tight loop nets 132 million/sec, but that is a less realistic figure since there are no stores or cache misses going on.

VS2008 uses SSE code for its trig functions with the proper optimisations switched on. I can't imagine doing much better without sacrificing accuracy and using hand-written assembly.

Quote:
Original post by Antheus
Why would you even want to calculate sin or sqrt?


sqrt for the quadratic formula. atan2 is probably a more useful function than sin though.

Quote:
Original post by taz0010
sqrt for the quadratic formula.

As well as many other cases.

But why would you need to calculate the roots of a quadratic equation?

I'm seriously asking, because the answer to this may point out why looking at the sqrt and sin functions is the wrong place to look when trying to optimize.

Quote:
But why would you need to calculate the roots of a quadratic equation?

One scenario would be ray-tracing a sphere/cone/cylinder. There, the speed of a single sqrt operation would be important enough to worry about. But there's a sqrtps instruction in SSE that can compute four of them at a time, so you're not going to find anything faster.
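
For illustration, a minimal sketch of using sqrtps through the _mm_sqrt_ps intrinsic (the wrapper function and array layout are just an example):

#include <xmmintrin.h>   // SSE intrinsics

inline void Sqrt4(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);    // load four floats (unaligned load for simplicity)
    __m128 r = _mm_sqrt_ps(v);      // sqrtps: four square roots in one instruction
    _mm_storeu_ps(out, r);
}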

Quote:
I'm seriously asking, because the answer to this may point out why looking at the sqrt and sin functions is the wrong place to look when trying to optimize.

I think the OP was just looking for the fastest tools for the job, without any particular job in mind. If there's a known weakness in a vendor's implementation of a certain feature, it isn't a bad idea to pick something better.
The other day I wanted to see if unordered_set was better than std::set, so I tested both boost's and tr1's implementations. Knowing that boost's version is faster in all areas, I'll pick it over VS2008's tr1::unordered_set if the situation comes up.
