[c++] slow/fast math functions

Started by
12 comments, last by taz0010 13 years, 6 months ago
Hi,
Lately I found quite an interesting thing. Everybody writes that the standard mathematical functions in the C++ math library are very slow, and therefore inadequate for game programming. I made a few tests comparing so-called "fast" functions against the standard ones in VC++ 2010 Express Edition. Everything written in many articles held true (the fast functions were faster than the standard ones) until I switched from debug mode to release mode. Then it turned out that the standard math functions were 4 to 6 times faster than those "fast" ones. The tested functions were sin(x), sqrt(x), and 1.0f/sqrt(x).

Am I doing something wrong?

Performance has been tested by measuring time of for example 10^6 calls of function.

Regards,
Misery

PS: of course I have used 'inline' and so on.
Which articles? How old are they? It may be that Visual Studio is using processor extensions or instructions that weren't commonplace at the time the articles were written.
Quote:Original post by Misery
Lately I found quite an interesting thing. Everybody writes that the standard mathematical functions in the C++ math library are very slow, and therefore inadequate for game programming. I made a few tests comparing so-called "fast" functions against the standard ones in VC++ 2010 Express Edition. Everything written in many articles held true (the fast functions were faster than the standard ones) until I switched from debug mode to release mode. Then it turned out that the standard math functions were 4 to 6 times faster than those "fast" ones. The tested functions were sin(x), sqrt(x), and 1.0f/sqrt(x).


Who is "Everybody"? The compiler is most likely generating calls to the floating-point unit for these functions; that is, you have hardware support for the function evaluations. "Fast" functions tend to be software implementations that trade accuracy for speed. Whether such a function outperforms the FPU-supported one depends on the function and the implementation (run a profiler to measure). The FPU-supported functions are required to produce correctly rounded results (as if you computed in infinite precision and rounded to the nearest floating-point number). The fast functions are usually not correctly rounded, but good enough for the application.

Do you have a particular "fast" function that is slower than the math-library counterpart? Maybe you can post it so that folks can comment on it.
Quote:
Performance has been tested by measuring time of for example 10^6 calls of function.

First off, check your test. Make sure the compiler didn't optimize the code differently from what you expected from the program.

Quote:
Everybody writes that standard mathematical functions in math library in c++ are very slow

Who's "everybody", and what is the target platform? Not all platforms support the same feature sets. Consider "Carmack's Inverse Sqrt()": back then it was much faster than the alternatives; now the _mm_rsqrt_ss intrinsic is probably the faster/better solution.
Quote:Original post by Rycross
Which articles? How old are they? It may be that Visual Studio is using processor extensions or instructions that weren't commonplace at the time the articles were written.


This.

Remember, at one point using a lookup table would have been much faster for certain operations (back when I started, sin/cos lookup tables were very common), but as CPU speeds have grown faster than memory latency has improved, it's often much quicker to use instructions than to take a round trip into RAM and back.*

*Exceptions still apply; profile before deciding what's best. YMMV; this advice comes with neither a money-back guarantee nor a cool hat.
I am completely new to game programming, so I read everything I can possibly find. However, I have been a fluid-mechanics computation programmer for a few years, so I understand programming mechanics and such quite well, just not game programming yet.

So... 'Everybody' is 'Game Programming Gems' and everything you can find on Google by typing e.g. "fast sin function", "fast sqrt", "fast round", etc.
Sorry for not giving the sources the first time.

I also tried "Carmack's Inverse Sqrt()", and it still performs slower than the standard C++ 1.0f/sqrt(x).

Quote:
First off, check your test. Make sure the compiler didn't optimize the code differently from what you expected from the program.

Quote:
Whether such a function outperforms the FPU-supported one depends on the function and the implementation (run a profiler to measure)

How can I do that?

Tested functions:

inline float Sqrt(const float x)
{
    // Union for the bit-level initial guess. The original 'static' is removed:
    // it was unnecessary and not thread-safe. The member is renamed 'f' so it
    // no longer shadows the parameter x.
    union
    {
        int i;
        float f;
    } u;
    u.f = x;
    u.i = (1 << 29) + (u.i >> 1) - (1 << 22);  // crude guess via exponent halving
    u.f = 0.5f * (u.f + x / u.f);              // one Newton-Raphson step
    return u.f;
}


inline float InvSqrt(float x)
{
    float xhalf = 0.5f * x;          // 'static' here was a bug: a static local is
    int i = *(int*)&x;               // initialized once, freezing the first call's
    i = 0x5f3759df - (i >> 1);       // value (missing semicolons also added)
    x = *(float*)&i;
    x = x * (1.5f - xhalf * x * x);  // one Newton-Raphson step
    return x;
}

//Lagrange interpolation polynomial, evaluated with Horner's scheme, for the range -2*Pi to 2*Pi
inline float Sin2Pi(float x_rad)
{
    if (x_rad >= 0.0f)
    {
        return x_rad*((((-0.00535050099910763f*x_rad+0.08404547315910650f)*x_rad-0.37606003000535748f)*x_rad+0.22629999558973990f)*x_rad+0.91587156796416036f);
    }
    else
    {
        //sin(-x) = -sin(x): negate the input and all coefficients
        x_rad = -x_rad;
        return x_rad*((((0.00535050099910763f*x_rad-0.08404547315910650f)*x_rad+0.37606003000535748f)*x_rad-0.22629999558973990f)*x_rad-0.91587156796416036f);
    }
}

Project options:
Release mode,
Optimizations: Speed
inline functions: tested on default and AnySuitable
Enable intrinsic functions: no
Favor fast code
rest of options is default.


Thank You all for your help. :)
Both the sqrt and invsqrt functions involve moving data between integer and floating-point registers; this can be slow on modern target hardware, some consoles in particular.

Additionally, instruction pipelines may well be deeper than they were when these functions were written, making incorrect branch predictions more expensive than they used to be.

To profile functions, make hundreds of thousands of calls to them in a loop (you might want to vary the inputs). Read a timer before the start of the loop (google QueryPerformanceCounter) and again at the end, and print out how long it took.

My advice, if you're new to games programming: don't worry too much about the speed of the 'little things' until they show up in your profile. There is a whole world of interesting algorithms used in game development, which is where the real speed comes from, that and memory access patterns. When a little maths function appears in your profile you can usually optimise it out; if not, at least you have a context and a test case for trying to optimise the maths function itself. That is, of course, unless optimising the little things is what really interests you.

In my current engine we actually do have approximate versions of a number of standard maths functions which are quicker on some platforms; we have unit tests for each which fail if the approximate versions are not faster (or do not produce the expected results).

Cheers, Martin
Quote:Original post by Martin
To profile functions make 100,000's of calls to them in a loop. (You might want to vary inputs) read a timer before the start of the loop (google QueryPerformanceCounter) and again at the end and print out how long it took.

No, that's how you get an artificial benchmark that's completely irrelevant to the performance of actual production code. Profiling properly requires you to use the function in an environment comparable to that of its final target.


Quote:
Original post by Martin
To profile functions make 100,000's of calls to them in a loop. (You might want to vary inputs) read a timer before the start of the loop (google QueryPerformanceCounter) and again at the end and print out how long it took.

Get your values / write your answers to an array that you dump to a file after the loop; anything to keep the compiler from optimizing the call out entirely. Compilers are often smart enough to remove things with no actual side effect other than wasting CPU time. They are also smart enough to constant-fold math functions they know about (i.e. sum(1,100) could be replaced with just the constant 5050). Check the disassembly in the debugger, and make sure the compiler generated the test case you thought it did. The better idea, though, is to make an actual benchmark out of working code that does something real, like a particle system or animation system.

Quote:
Original post by Misery
How can I do that?

In the more general case, get a profiler like AMD's CodeAnalyst, and run it on an application that is actually doing something normal (i.e. not a contrived benchmark). This will show you where the problems are in the code. sqrt() might be slow, but you have to call it a lot before it becomes a problem. When the profiler shows you that something big is taking a lot of time, optimize that first. If you end up having to optimize something with a tight inner loop (like an animation or particle subsystem), then worry about how many times you call sqrt() and how much of an impact that has on the game.
Here's my experience with trigonometric functions under GCC, which of course is a totally different compiler, but it probably is not that much different with VC++.

Trig functions weigh in at around 360-380 cycles each, which really isn't so terrible unless you need to calculate millions of them.

An inlined super optimized minimax approximation with (only) 4 terms that works within +/- Pi runs in 25-26 cycles on my machine. With correct range reduction for arbitrary input, we're at 40-45 cycles. This is for a dataset that fits entirely into the level 2 cache.

Simply turning on all optimizations and throwing the right code gen flags at the compiler gives me 110-120 cycles per call. No extra code, no testing, no debugging, no possible bugs. Full precision, and takes only a bit more than twice as long as the approximation.

Now, if you calculate 50 sines per frame, it wouldn't even matter if one call took 100,000 cycles.

On the other hand, if you calculate 5,000,000 sines per frame (hopefully not!), then you'll need to take great care to have a cache friendly access pattern, or you will be bound by the 200-300 cycles memory latency on cache misses. Which, in fact, will make it entirely unimportant whether the sine takes 50 or 500 cycles.

This topic is closed to new replies.
