Sign in to follow this  
johnstanp

Optimized invSqrt( double )

Recommended Posts

Well, I'm looking for the double version of the famous function invSqrt() of the Quake 3 Engine. Could someone provide it? Thanks. [Edited by - johnstanp on July 5, 2008 6:41:02 PM]

Share this post


Link to post
Share on other sites
Well, using google, I've found a version for the function taking a double parameter.



inline double invSqrt( const double& x )
{
double y = x;
double xhalf = ( double )0.5 * y;
long long i = *( long long* )( &y );
i = 0x5fe6ec85e7de30daLL - ( i >> 1 );//LL suffix for (long long) type for GCC
y = *( double* )( &i );
y = y * ( ( double )1.5 - xhalf * y * y );

return y;
}







[Edited by - johnstanp on July 5, 2008 6:23:20 PM]

Share this post


Link to post
Share on other sites
I read an article about that function some time ago, in that article there was a warning about the speed of it. It was a lot faster to use "back then", but as compilers have gotten better, that code might actually make your program run slower.

Share this post


Link to post
Share on other sites
This function leads to undefined behaviour. Particularly:

long long i = *( long long* )( &y ); // pointer cast
y = *( double* )( &i ); // pointer cast
( i >> 1 ) // (not strictly UB, but right shifts are implementation depenedent for signed integers


Apart from that, it relies on implemenation specific factors like double implementation and size, long long size, etc.

Lastly, try to profile what's really faster. My bet is that 1/std::sqrt(x) should be faster on modern CPUs.

Share this post


Link to post
Share on other sites
Quote:
Original post by rozz666
This function leads to undefined behaviour. Particularly:

long long i = *( long long* )( &y ); // pointer cast
y = *( double* )( &i ); // pointer cast
( i >> 1 ) // (not strictly UB, but right shifts are implementation depenedent for signed integers


Apart from that, it relies on implemenation specific factors like double implementation and size, long long size, etc.

Lastly, try to profile what's really faster. My bet is that 1/std::sqrt(x) should be faster on modern CPUs.


I use a laptop with a Pentium 4( 2.66Ghz ). That implementation is faster than 1/std::sqrt(x), actually seven times faster. My sqrt() is an inline member function in a template class.

If that sqrt() implementation yields faster results, to overcome the problem you cite, writing different functions for each targeted platform solves the problem.

PS: I use dev-cpp for cross-platform development. I have yet to test and profile it on a Unix machine.

Share this post


Link to post
Share on other sites
On my CPU (Pentium 4 1.4 GHz) invSqrt runs about 20% slower than 1 / std::sqrt.
As for the pointer casts - it's undefined behaviour. It's undefined what will happen. Sure, many compilers allows this, but then you have to check your code with every compiler you use and still can't be sure when it's going to fail.

Share this post


Link to post
Share on other sites
Quote:
Original post by h3ro
I read an article about that function some time ago, in that article there was a warning about the speed of it. It was a lot faster to use "back then", but as compilers have gotten better, that code might actually make your program run slower.

While current CPUs use radix 16 division, SQRT remained the same. That said inline 64 bit ASM might be actually faster. That function is hardcoded into CPU when asembly programmers need a fast aproximation of square root. Also, you will get rid of an ambiguity of a C++ code.
Quote:
Original post by dmatter
Optimising things like square roots is an old trick - in modern programming, we would prefer to optimise our use of square roots instead: try to change your high level algorithms so that they make more economical use of such maths functions.

Thousand times nothing killed donkey. While programmers might do effort to invent, or choose, algorithms with low comutional cost, and high numerical accuracy. At the end they still might like to speed up theirs fast algorithm.

Share this post


Link to post
Share on other sites
Quote:
Original post by dmatter
Optimising things like square roots is an old trick - in modern programming, we would prefer to optimise our use of square roots instead: try to change your high level algorithms so that they make more economical use of such maths functions.
I tend to think that thing are a bit more complicated than that. True, new algorithms are typically the only way to improvements beyond an order of magnitude, but on the other hand implementing more complex specialized algorithms will affect far more code and is usually more brittle. That is to say that if this (admittedly bizarre) bithack allows you not to re-engineer dozens of systems around the game to keep values normalized then you've saved a great deal of work, at the cost of figuring out whether the hack is reliable.
Not to say that there are any hard rules as to when to perform micro and macro optimize something but the "modern" view that low-level optimizations are merely a last resort when you absolutely can't convolute your algorithms or datastructures anymore is dangerous.

Quote:
Original post by rozz666
On my CPU (Pentium 4 1.4 GHz) invSqrt runs about 20% slower than 1 / std::sqrt.
As for the pointer casts - it's undefined behaviour. It's undefined what will happen. Sure, many compilers allows this, but then you have to check your code with every compiler you use and still can't be sure when it's going to fail.
Odd.. What's your x87 precision set to?
Of course systems with SSE/3DNow have fast inverse square-root approximations in hardware already so you should really be using those on recent hardware.
As for the casts how about a memcpy() then? That ought to be portable and, assuming that the compiler has the proper intrinsics, about as fast.

Share this post


Link to post
Share on other sites
Quote:
Original post by rozz666
On my CPU (Pentium 4 1.4 GHz) invSqrt runs about 20% slower than 1 / std::sqrt.


You probably messed up your profiling somehow.

A division op is just about the slowest thing you can do on any CPU, old or new. I'm not talking 2 times slower than the simpler ops, or even 5 times slower.. try up to 60 times slower in some cases and no better than 20 times slower on the freshest cpu's available. Division times rival those of L2 and L3 cache misses, and cannot be paired with other divisions. Its a rare day when avoiding a division isnt the right thing to do if performance is a consideration.

Now maybe you have found a compiler that converts 1 / sqrt() into an SSE reciprocal square root, but I am skeptical unless you blatently told the compiler to use as little precision as possible in floating point calculations (which completely defeats the purpose of using doubles instead of floats everywhere else, not just at the site of the rsqrt)

Share this post


Link to post
Share on other sites
Core 2 Duo

div a/b 20 cycles
sqrt(a) 60 cycles

implicit
It's not a bit hack, it's a reasonable implementation of some algorithm. The standard organisation made FP64 in a way to permit algorithms like this.

Though quite a lot of people doesn't have the right education to know these algorithms (from experience).

Share this post


Link to post
Share on other sites
Quote:
Original post by Raghar
It's not a bit hack, it's a reasonable implementation of some algorithm. The standard organisation made FP64 in a way to permit algorithms like this.

Though quite a lot of people doesn't have the right education to know these algorithms (from experience).
Dude, if the way the initial guess is made by shifting the exponent and mantissa together and adjusting the result with a magic constant *isn't* a bithack then I don't know what is. I seriously doubt that the IEEE committee had this kind of abuse in mind when choosing their format, or anything much beyond the ability to easily compare floats as integers really.

Share this post


Link to post
Share on other sites
On my machine (AMD Sempron 1.91GHz) also, invSqrt() runs 20% slower than 1/sqrt(). I'm using Visual Studio 2005, invSqrt() takes 33.98 ns in average while 1/sqrt() takes 28.53 ns.

And I disassembled the code generated by compiler and confirmed that 1/sqrt() is implemented by fsqrt and fdiv instructions.

What the hell is going on?

And yes, floating point model is set to /fp:precise

Share this post


Link to post
Share on other sites
Well, I don't know if that will make a difference but here is the code that I did find:


float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;

x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed

#ifndef Q3_VM
#ifdef __linux__
assert( !isnan(y) ); // bk010122 - FPE?
#endif
#endif
return y;
}




double InvSqrt(double number)
{
__int64 i;
double x2, y;
const double threehalfs = 1.5F;

x2 = number * 0.5F;
y = number;
i = * ( __int64 * ) &y;
i = 0x5fe6ec85e7de30da - ( i >> 1 );
y = * ( double * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}




I used ( long long ) instead of __int64 for GCC.

Share this post


Link to post
Share on other sites
Quote:
Original post by rozz666
As for the pointer casts - it's undefined behaviour. It's undefined what will happen. Sure, many compilers allows this, but then you have to check your code with every compiler you use and still can't be sure when it's going to fail.


Can't you get the same result without the undefined behavior by using a union instead of a pointer cast?

Share this post


Link to post
Share on other sites
Quote:
Original post by rozz666
Sure, many compilers allows this, but then you have to check your code with every compiler you use and still can't be sure when it's going to fail.
Enter: Unit tests ;)
Set up your build system to run the unit tests as a post-compile step, then if the code is broken it will show up as a compile-time error.

Share this post


Link to post
Share on other sites
Quote:
Original post by lol
Quote:
Original post by rozz666
As for the pointer casts - it's undefined behaviour. It's undefined what will happen. Sure, many compilers allows this, but then you have to check your code with every compiler you use and still can't be sure when it's going to fail.


Can't you get the same result without the undefined behavior by using a union instead of a pointer cast?

No. Accessing a union through alternate members is undefined behavior. Once you set a union via a certain member you're only supposed to access it through that member.

Using unions to convert between types is an all too common abuse which few people know about.

Back on topic, any processor that supports SSE already has an inverse square root instruction built in, RSQRTSS. You can access it via a compiler intrinsic (_mm_rsqrt_ss) in MSVC 2008. However, even using the old-style FPU math will outperform Quake's inverse square root on any modern processors.

This kind of thing is a micro-optimization that shouldn't be performed until you've determined, with a profiler, that you actually need it. Which means the profiler is reporting that you're spending 80% of your time on sqrt calls. Writing out a square root function using integer math and bit hacks is a really bad idea, as it'll only turn out much slower than the built-in instructions on modern processors.

Share this post


Link to post
Share on other sites
Quote:
Original post by Ra
Quote:
Original post by lol
Quote:
Original post by rozz666
As for the pointer casts - it's undefined behaviour. It's undefined what will happen. Sure, many compilers allows this, but then you have to check your code with every compiler you use and still can't be sure when it's going to fail.


Can't you get the same result without the undefined behavior by using a union instead of a pointer cast?

No. Accessing a union through alternate members is undefined behavior. Once you set a union via a certain member you're only supposed to access it through that member.

Using unions to convert between types is an all too common abuse which few people know about.

Back on topic, any processor that supports SSE already has an inverse square root instruction built in, RSQRTSS. You can access it via a compiler intrinsic (_mm_rsqrt_ss) in MSVC 2008. However, even using the old-style FPU math will outperform Quake's inverse square root on any modern processors.

This kind of thing is a micro-optimization that shouldn't be performed until you've determined, with a profiler, that you actually need it. Which means the profiler is reporting that you're spending 80% of your time on sqrt calls. Writing out a square root function using integer math and bit hacks is a really bad idea, as it'll only turn out much slower than the built-in instructions on modern processors.


As far as I know this rsqrt_ss is using a lookup table which is too unprecise for most people's need.

Did there ever someone represent the square root as a cubic or quadratic spline and implement the evaluation with sse?

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this