Optimized invSqrt( double )

Started by
17 comments, last by Basiror 15 years, 9 months ago
Well, I'm looking for the double version of the famous function invSqrt() of the Quake 3 Engine. Could someone provide it? Thanks. [Edited by - johnstanp on July 5, 2008 6:41:02 PM]
Advertisement
Well, using google, I've found a version for the function taking a double parameter.

inline double invSqrt( const double& x ){    double y = x;    double xhalf = ( double )0.5 * y;    long long i = *( long long* )( &y );    i = 0x5fe6ec85e7de30daLL - ( i >> 1 );//LL suffix for (long long) type for GCC    y = *( double* )( &i );    y = y * ( ( double )1.5 - xhalf * y * y );        return y;}


[Edited by - johnstanp on July 5, 2008 6:23:20 PM]
I read an article about that function some time ago, in that article there was a warning about the speed of it. It was a lot faster to use "back then", but as compilers have gotten better, that code might actually make your program run slower.
This function leads to undefined behaviour. Particularly:
long long i = *( long long* )( &y ); // pointer cast    y = *( double* )( &i ); // pointer cast( i >> 1 ) // (not strictly UB, but right shifts are implementation depenedent for signed integers


Apart from that, it relies on implemenation specific factors like double implementation and size, long long size, etc.

Lastly, try to profile what's really faster. My bet is that 1/std::sqrt(x) should be faster on modern CPUs.
Quote:Original post by rozz666
This function leads to undefined behaviour. Particularly:
long long i = *( long long* )( &y ); // pointer cast    y = *( double* )( &i ); // pointer cast( i >> 1 ) // (not strictly UB, but right shifts are implementation depenedent for signed integers


Apart from that, it relies on implemenation specific factors like double implementation and size, long long size, etc.

Lastly, try to profile what's really faster. My bet is that 1/std::sqrt(x) should be faster on modern CPUs.


I use a laptop with a Pentium 4( 2.66Ghz ). That implementation is faster than 1/std::sqrt(x), actually seven times faster. My sqrt() is an inline member function in a template class.

If that sqrt() implementation yields faster results, to overcome the problem you cite, writing different functions for each targeted platform solves the problem.

PS: I use dev-cpp for cross-platform development. I have yet to test and profile it on a Unix machine.
Optimising things like square roots is an old trick - in modern programming, we would prefer to optimise our use of square roots instead: try to change your high level algorithms so that they make more economical use of such maths functions.
On my CPU (Pentium 4 1.4 GHz) invSqrt runs about 20% slower than 1 / std::sqrt.
As for the pointer casts - it's undefined behaviour. It's undefined what will happen. Sure, many compilers allows this, but then you have to check your code with every compiler you use and still can't be sure when it's going to fail.
Quote:Original post by h3ro
I read an article about that function some time ago, in that article there was a warning about the speed of it. It was a lot faster to use "back then", but as compilers have gotten better, that code might actually make your program run slower.

While current CPUs use radix 16 division, SQRT remained the same. That said inline 64 bit ASM might be actually faster. That function is hardcoded into CPU when asembly programmers need a fast aproximation of square root. Also, you will get rid of an ambiguity of a C++ code.
Quote:Original post by dmatter
Optimising things like square roots is an old trick - in modern programming, we would prefer to optimise our use of square roots instead: try to change your high level algorithms so that they make more economical use of such maths functions.

Thousand times nothing killed donkey. While programmers might do effort to invent, or choose, algorithms with low comutional cost, and high numerical accuracy. At the end they still might like to speed up theirs fast algorithm.
Quote:Original post by dmatter
Optimising things like square roots is an old trick - in modern programming, we would prefer to optimise our use of square roots instead: try to change your high level algorithms so that they make more economical use of such maths functions.
I tend to think that thing are a bit more complicated than that. True, new algorithms are typically the only way to improvements beyond an order of magnitude, but on the other hand implementing more complex specialized algorithms will affect far more code and is usually more brittle. That is to say that if this (admittedly bizarre) bithack allows you not to re-engineer dozens of systems around the game to keep values normalized then you've saved a great deal of work, at the cost of figuring out whether the hack is reliable.
Not to say that there are any hard rules as to when to perform micro and macro optimize something but the "modern" view that low-level optimizations are merely a last resort when you absolutely can't convolute your algorithms or datastructures anymore is dangerous.

Quote:Original post by rozz666
On my CPU (Pentium 4 1.4 GHz) invSqrt runs about 20% slower than 1 / std::sqrt.
As for the pointer casts - it's undefined behaviour. It's undefined what will happen. Sure, many compilers allows this, but then you have to check your code with every compiler you use and still can't be sure when it's going to fail.
Odd.. What's your x87 precision set to?
Of course systems with SSE/3DNow have fast inverse square-root approximations in hardware already so you should really be using those on recent hardware.
As for the casts how about a memcpy() then? That ought to be portable and, assuming that the compiler has the proper intrinsics, about as fast.
Quote:Original post by rozz666
On my CPU (Pentium 4 1.4 GHz) invSqrt runs about 20% slower than 1 / std::sqrt.


You probably messed up your profiling somehow.

A division op is just about the slowest thing you can do on any CPU, old or new. I'm not talking 2 times slower than the simpler ops, or even 5 times slower.. try up to 60 times slower in some cases and no better than 20 times slower on the freshest cpu's available. Division times rival those of L2 and L3 cache misses, and cannot be paired with other divisions. Its a rare day when avoiding a division isnt the right thing to do if performance is a consideration.

Now maybe you have found a compiler that converts 1 / sqrt() into an SSE reciprocal square root, but I am skeptical unless you blatently told the compiler to use as little precision as possible in floating point calculations (which completely defeats the purpose of using doubles instead of floats everywhere else, not just at the site of the rsqrt)

This topic is closed to new replies.

Advertisement