Fast sqrt for 64-bit



Now, if you want a super fast sqrt on x86, you can use Greg Walsh's (in)famous approximation
float Sqrt(float x){
    int i = *(int*)&x;          // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i>>1);    // magic constant minus half the bits ~ bits of 1/sqrt(x)
    float r = *(float*)&i;      // back to float: rough estimate of 1/sqrt(x)
    r = r*(1.5f - 0.5f*x*r*r);  // one Newton-Raphson step to refine the estimate
    return r * x;               // x * (1/sqrt(x)) == sqrt(x)
}
Just don't ask me how it works...

If I'm not mistaken, we have no solid evidence that Greg Walsh is really the author.

There are benchmarks of different sqrt versions in the MathGeoLib benchmark suite. Experience gained from toying around there:
- the x87 FPU sqrt is slow, since modern math code is all SSE and there's a penalty for mixing the two.
- the "Greg Walsh"/"Carmack's (inv)Sqrt" method should never be used; it is both slower and less precise than what SSE gives you.
- don't assume std::sqrt or sqrtf is fast; in practice it might or might not be, depending on compiler and OS. Good compilers lower these calls to an intrinsic (gated on -msse or similar), bad ones emit an actual function call.
- for SSE1-capable hardware there are exactly two methods worth considering: _mm_sqrt_ss(x), or _mm_mul_ss(x, _mm_rsqrt_ss(x)).
- _mm_rcp_ss(_mm_rsqrt_ss(x)) is inferior to mul+rsqrt in both performance and precision, at least on Intel.
In general, there is no free lunch, and when measuring what is fast, you are blind if you ignore the other side of the equation: precision. So whenever you benchmark performance, remember to also benchmark precision, since there is always a tradeoff. Extra Newton-Raphson steps don't improve the hacks enough to make any of them better than SSE.
The results of the above sqrt variants on a rather old Core 2 Quad and 64-bit Win 8.1 and Visual Studio 2013 targeting 64-bit with SSE2:
Carmack's: time 10 clocks, max relative error (within x=[0, 1e20] range) to std::sqrt(): 1.7e-3
sqrtss: time 12 clocks, relative error 8.4e-8
mulss(rsqrtss): time 4.5 clocks, relative error 3.3e-4

Therefore the rule of thumb (on Intel at least): for a precise sqrt, use _mm_sqrt_ss(x), and for the fastest possible sqrt, use _mm_mul_ss(x, _mm_rsqrt_ss(x)).
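To make that concrete, here's a rough scalar sketch of the two SSE variants (the wrapper names are mine, not from MathGeoLib); note the caveat in the comment about x == 0:

#include <xmmintrin.h>  // SSE1 intrinsics

// Precise: the hardware sqrtss instruction.
inline float Sqrt_precise(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

// Fast: x * rsqrt(x) ~= sqrt(x), with roughly rsqrtss precision (~12 bits).
// Note: returns NaN for x == 0, because rsqrt(0) == inf and 0 * inf == NaN.
inline float Sqrt_fast(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_mul_ss(v, _mm_rsqrt_ss(v)));
}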

I can pretty much guarantee that sqrt is not the source of your performance problems...

Micro-optimizations like "fast SQRT" rarely get you anything these days. Modern processors are so fast that you're going to get much more bang for your buck by looking at memory usage patterns and lowering cache misses.

I was writing some SIMD code the other day and replacing a single instruction with an approximation gave me a 25% performance boost. "Fast SQRT" could be really damn useful if you are calculating 4 billion of them every second.


float Sqrt(float x){
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    float r = *(float*)&i;
    r = r*(1.5f - 0.5f*x*r*r);
    return r * x;
}
Just don't ask me how it works...


That code computes a fast approximation to 1/sqrt(x) and multiplies it by x.

On modern CPUs, casting from float to int and back like this moves data between different register files (and potentially through memory), which can cause stalls.

I wouldn't be surprised if it performs poorly today.

SSE2 and AVX2 give you integer SIMD instructions that operate on the same register file as the floating-point instructions. This means you get float -> int punning for free. Edit: see clb's post below; apparently on Intel there is still a penalty for switching between float and integer SIMD instructions. It seems this has to do with the way their pipeline is structured and how forwarding of results from one operation to the next is handled.
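To illustrate what that punning looks like in practice, here's a rough SSE2 sketch of the same trick done entirely in XMM registers via the cast intrinsics (the function name is mine, and as noted above, the integer ops can still cost a bypass-delay cycle on Intel):

#include <emmintrin.h>  // SSE2

// Fast approximate 1/sqrt(x) for four floats at once, without any
// float <-> general-purpose-register round trip. The casts compile to
// nothing; only the shift and subtract are real integer instructions.
inline __m128 FastInvSqrt_ps(__m128 x)
{
    __m128i i = _mm_castps_si128(x);                  // reinterpret the bits
    i = _mm_sub_epi32(_mm_set1_epi32(0x5f3759df),
                      _mm_srli_epi32(i, 1));          // magic constant guess
    __m128 r = _mm_castsi128_ps(i);
    // one Newton-Raphson step: r = r * (1.5 - 0.5 * x * r * r)
    __m128 half_xrr = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x),
                                 _mm_mul_ps(r, r));
    return _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f), half_xrr));
}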

Something else that should be mentioned (although most of you certainly know this, it's important for beginners): A large fraction of the uses of sqrt by beginners are avoidable. Here's the prototypical example:

  if (std::sqrt((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2)) < radius) ... // BAD!
  if ((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2) < radius * radius) ... // GOOD!
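
If you want to keep that pattern from sneaking back in, a tiny helper does the job (the names here are just illustrative):

inline float DistanceSq(float x1, float y1, float x2, float y2)
{
    float dx = x1 - x2, dy = y1 - y2;
    return dx*dx + dy*dy;   // squared distance: no sqrt needed for range checks
}

// usage: if (DistanceSq(x1, y1, x2, y2) < radius * radius) ...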

If you really need a fast square root and you aren't too concerned about precision, you can use this. It may actually not be that much faster, depending on your architecture.


//use 0x1FBD1DF5 for (0.0, 1.0) RME: 1.42%
//use 0x1FBD22DF for (0.0, 1000.0) RME: 1.62%
inline float sqrt(float f)
{
    int i = 0x1FBD1DF5 + (*(int*)&f >> 1);
    return *(float*)&i;
}

Source

Edit: If you want to do the same for doubles you might have luck with this magic number 0x1FF776E75B46DF3D.
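For what it's worth, here's a sketch of what the double version might look like with that constant (I haven't verified the error bounds; the name and memcpy-based punning are my own choices):

#include <cstdint>
#include <cstring>

inline double ApproxSqrtD(double d)
{
    uint64_t i;
    std::memcpy(&i, &d, sizeof(i));          // bit-copy instead of pointer punning
    i = 0x1FF776E75B46DF3DULL + (i >> 1);    // same halve-the-exponent trick as the float version
    std::memcpy(&d, &i, sizeof(d));
    return d;
}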

I can pretty much guarantee that sqrt is not the source of your performance problems...

Micro-optimizations like "fast SQRT" rarely get you anything these days. Modern processors are so fast that you're going to get much more bang for your buck by looking at memory usage patterns and lowering cache misses.


I was writing some SIMD code the other day and replacing a single instruction with an approximation gave me a 25% performance boost. "Fast SQRT" could be really damn useful if you are calculating 4 billion of them every second.


One assumes you profiled your project and saw that optimizing that single instruction was worth your time. I intentionally did not say "fast sqrt" is useless, because it isn't, if it is the correct optimization for your program.

Most people looking for "fast X" don't profile and have other major flaws in their system that would give them huge performance gains over using a "fast X".

So what I'm really saying is: Profile your code - chances are a "fast X" won't save you because you've got bigger algorithmic or memory-related problems.

SSE2 and AVX2 give you integer SIMD instructions that operate on the same register file as the floating-point instructions. This means you get float -> int punning for free.

There is a crossover penalty, or "bypass delay", at least on Intel hardware, for passing data between the float and integer execution domains; see https://software.intel.com/en-us/forums/topic/279382 . So float -> int punning comes at the cost of about one cycle.

If you really need a fast square root and you aren't too concerned about precision, you can use this. It may actually not be that much faster, depending on your architecture.


//use 0x1FBD1DF5 for (0.0, 1.0) RME: 1.42%
//use 0x1FBD22DF for (0.0, 1000.0) RME: 1.62%
inline float sqrt(float f)
{
    int i = 0x1FBD1DF5 + (*(int*)&f >> 1);
    return *(float*)&i;
}

Source

Edit: If you want to do the same for doubles you might have luck with this magic number 0x1FF776E75B46DF3D.

Testing this on a Haswell Core i7 5960X, at https://github.com/juj/MathGeoLib/commit/e6f667e89a6b51f7c1e9bf2c35c074f3ad3509ac , with VS 2013 Community Update 4 targeting AVX, yields the following (reposting the above numbers since this is a different computer and compiler):

Carmack's: 6.16 clocks, precision: 1.75e-3

sqrt_ss(x): 5.92 clocks, precision: 8.4e-8

mulss(rsqrtss(x)): 2.08 clocks, precision: 3e-4

0x1FBD1DF5: 2.32 clocks, precision: 4.3e-2

Looks like using the SSE1 instructions is better at least on the Haswell.


Carmack's: 6.16 clocks, precision: 1.75e-3
sqrt_ss(x): 5.92 clocks, precision: 8.4e-8

mulss(rsqrtss(x)): 2.08 clocks, precision: 3e-4
0x1FBD1DF5: 2.32 clocks, precision: 4.3e-2

Though I think we all agree that the sqrt itself is rather irrelevant when the number of actual CPU cycles is this close to 1, it is certainly interesting, and I did some tests as well.

For one, I think you're measuring loads and stores, as I managed to reproduce almost exactly those numbers (also on Haswell), and the ~2 clocks for rsqrt were identical to a plain mem0 = mem1 copy (unless your test compensates for the load/store time?).

I glanced at the code and it looked as if it was included in the timing.

I guess in order to make anything like this meaningful at all we have to assume that we're at a point in a program where we have our float values in an XMM register and want to measure only the time it takes to get the sqrt of that value into another XMM register, which means we measure latency.

However if we have the sqrt in a loop and the compiler is smart, it might order operations so that the throughput matters. The loop in your test does independent sqrts, so the second sqrt doesn't wait for the first, and therefore measures throughput, but hides latency.

I tried it in ASM and for throughput got something like sqrtss ~= 7, mulss rsqrtss ~= 2.5, psrld paddd ~= 1.5.

When measuring latency, however, I got sqrtss ~= 10, mulss rsqrtss ~= 10, psrld paddd ~= 2. Looking at Intel's specifications this matches pretty closely, as mulss for example has a throughput of 1 cycle but a latency of 4 cycles. sqrtss is actually listed as 13 cycles of latency (if I managed to look at the right column), which probably means my test wasn't constructed carefully enough and ended up hiding part of such a long latency.

psrld has a latency of 1 and a reciprocal throughput of 1, whereas paddd has a latency of 1 but a reciprocal throughput of 0.5, so it can actually do 2 adds per cycle.

If you have a long sequence of floating-point calculations and at one point want to turn one of your floats into its square root with minimal latency, and you don't care much about precision, psrld + paddd is probably the way to go.
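For anyone wanting to reproduce the distinction, here's a toy sketch of the two kinds of loop (my own illustration, not the benchmark code from the thread; the actual timing code around it is omitted):

#include <xmmintrin.h>

// Dependent chain: each sqrt needs the previous result, so this exposes latency.
__m128 LatencyChain(__m128 x, int iters)
{
    for (int i = 0; i < iters; ++i)
        x = _mm_sqrt_ss(x);
    return x;
}

// Independent operations: the four chains can overlap in the pipeline,
// so this approaches the instruction's throughput instead.
__m128 ThroughputLoop(__m128 a, __m128 b, __m128 c, __m128 d, int iters)
{
    for (int i = 0; i < iters; ++i)
    {
        a = _mm_sqrt_ss(a);
        b = _mm_sqrt_ss(b);
        c = _mm_sqrt_ss(c);
        d = _mm_sqrt_ss(d);
    }
    return _mm_add_ss(_mm_add_ss(a, b), _mm_add_ss(c, d)); // keep results live
}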

Also note that the doc linked by Chris_F actually talked about GPUs, doing sqrts in a pixel shader for example, while we're discussing using the same methods on completely different architectures. I would guess these things matter a lot more on GPUs nowadays, since there we're much more often at a point where a few instructions can actually make a difference in performance, and latency is already hidden as well as possible by the GPU, so once it starts to matter, parallelism has already been maximized and can't be improved further.

On CPUs I would guess this case is rarer and much harder to predict in general code... and depending on the code around the sqrt, different methods can give the best results.

I think a script that brute-force tested every sqrt in a program individually with all these methods would find that performance would benefit from one of them in some cases and from another in another case.

Very good points. My benchmark does include loads and stores (identical in each tested variant) to a hot cache, and it stresses throughput rather than latency. It is difficult to say which one is more important, since it depends on whether the surrounding algorithm has something else to do to hide the latency. Sometimes the latency can be hidden so well that it doesn't matter, other times it cannot. Also, indeed, none of these timings apply to GPUs or ARM or anything like that.

