#### Archived

This topic is now archived and is closed to further replies.

# Carmacks sqrt()

This topic is 4972 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Last week i saw a topic in which someone posted a copy of the sqrt function used in quake 3 but i was at work and couldn''t grab it. If anyone has it can i have a copy plz. Also how much of a loss in accuracy does it take for speed?

##### Share on other sites

      float InvSqrt (float x){    float xhalf = 0.5f*x;    int i = *(int*)&x;    i = 0x5f3759df - (i >> 1);    x = *(float*)&i;    x = x*(1.5f - xhalf*x*x);    return x;}  I believe it was inverse square root.  To test it I used:      int main(){    cout << "Using Carmacks: " << 1 / InvSqrt(3) << endl;    cout << "math.h: " << sqrt(3) << endl;}

And it was pretty accurate.

I did a quick (and I believe innacurate, because I used SDL's getticks instead of other means due to inexperience) test using full compiler optimization in Dev-c++ of the time required to do 10,000,000 loops on both math.h's and carmacks.

Carmack: 28ms
math.h: 232ms

Hmm. That's an awfully large difference. Please correct me if I'm wrong.

[edited by - Ronin Magus on February 15, 2003 11:35:18 AM]

##### Share on other sites
Doh. This method (Newton-Rhapson) has been used for years, and the theory behind it is hundreds of years old. It's a damn shame if people begin to call it Carmack's sqrt now, just because Carmack has a guru's status and a cool sounding name.

- Mikko Kauppila (whose name doesn't sound cool

[edited by - uutee on February 15, 2003 11:59:23 AM]

##### Share on other sites
Well I just called it that because I''m no math person, I saw this post on the front page and remembered copying that code.

##### Share on other sites
sorry, bt what does it do actually? (in code)

.lick

##### Share on other sites
I wouldn''t say it is awefully large. Rather ten to one is what I generally use as a rule of thumb as to whether something is worth while. Within performance you can get 100 to 1 and 1000 to 1 improvements, i.e. bubble sort versus quicksort. With 10 to 1 you might as well go ahead and do it without compelling reasons otherwise. With 10% it better be a big part of your processing. 90% of 10% is just as big as 10% of 90%, but a whole lot less work as a general rule.

##### Share on other sites
i''m not seeing how this works.
i tried to do it by hand and i''m getting weird numbers...

float x == 4

float xhalf == 0.5f * 4 == 2

int i == 2

i == 5f3759df (which is 1597463007 in decimal) - (2 >> 1)
== 1597463006

x == 1597463006

x == 1597463006*(1.5f - 2*x*x)

x == some ridiculously long number:
(8153093528352233345939813923)

and the inverse is (1/x): 1.22 e-28.
in other words not 2.

can someone tell me where i went wrong with the calculation?

##### Share on other sites
quote:
Original post by Alpha_ProgDes
can someone tell me where i went wrong with the calculation?

You are using integers in your calculation.
The function relies on the different representations of integers and floats in memory. This is why there are pointer dereferences instead of simple assignments.

##### Share on other sites
I know that in the SSE instruction set there is a similar instruction (rsqrt*s) that uses an on chip LUT.

Then just toss in a (rcp*s) to take a fast reciprocal of that, and you have the square root.

SSE is supported by Pentium 3/4 and Athlon XP, but you might want to check out 3DNOW! stuff for AMD processors because you might happen to find a quicker way of doing it.

##### Share on other sites
There''s a thing I don''t understand. Why the: int i = *(int*)&x; instead of: int i = (int)x;
Darookie mentioned something about it, but what are the details behind that method?
Could anyone explain it briefly?

Click NOW!

##### Share on other sites
Ok. I''ll try to explain it

On x86 platforms, floats are represented in IEEE format:

Single Precision (32 bits)
The IEEE single precision floating point standard representation requires a 32 bit word, which may be represented as numbered from 0 to 31, left to right. The first bit is the sign bit, S, the next eight bits are the exponent bits, ''E'', and the final 23 bits are the fraction ''F'':
  S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF  0 1      8 9                    31The value V represented by the word may be determined as follows:If E=255 and F is nonzero, then V=NaN ("Not a number") If E=255 and F is zero and S is 1, then V=-Infinity If E=255 and F is zero and S is 0, then V=Infinity If 0If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F) These are "unnormalized" values. If E=0 and F is zero and S is 1, then V=-0 If E=0 and F is zero and S is 0, then V=0 In particular,  0 00000000 00000000000000000000000 = 0  1 00000000 00000000000000000000000 = -0  0 11111111 00000000000000000000000 = Infinity  1 11111111 00000000000000000000000 = -Infinity  0 11111111 00000100000000000000000 = NaN  1 11111111 00100010001001010101010 = NaN  0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2  0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5  1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5  0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)  0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127)   0 00000000 00000000000000000000001 = +1 * 2**(-126) *                                        0.00000000000000000000001 =                                        2**(-149)  (Smallest positive value)

If you write int i = x, the compiler will convert x to the nearest integer that is lower than x, which is not what you want.
Since you need to do some bit shifting and masking tricks that are only available for integers, you need to fool the compiler to have an integer variable that contains the exact bits of x.

This is why pointer dereferencing i used instead of simple assignments.

##### Share on other sites
Thanks a lot. For one or another reason I didn''t know that bitwise operations are not permitted on floats. (probably I never needed to perform them on floats )
Now it''s clear.

Click NOW!

##### Share on other sites
Plug that "magic constant" into Google. The first result is an explanation of this code. Interesting stuff.

##### Share on other sites
If anyone''s interested I did some testing on this formula:

taking the standard c sqrt and the ''magic'' sqrt of all #s from 1-5000, assuming the standard sqrt is the true value:

the average % error is: 0.0877298% (and no, I didn''t forget to multipy by 100 - it''s really that low)

the average difference between the ''magic'' and c sqrts is: 0.0395671

if 0.0395671 is subracted from every ''magic'' sqrt the average % error drops to: 0.0746919%
However, the %error for smaller #s skyrockets.

Going from 0 to 4809 the percent error continually drops.
At 4810 to 4817, my level of precision could not tell if there was any difference.
from 4818 to 5000 the % error slowly increased.
For 4793 to 4809 and 4818 to 4837, the percent error output was put in scientific notation (using E)

I didn''t time the functions though - maybe I''ll do that later.

##### Share on other sites
That was fun - I decided to see if I could optimise the function any further. Here''s what I came up with:
float InvSqrt2 (float x){	union{		int intPart;		float floatPart;	} convertor;	convertor.floatPart = x;	convertor.intPart = 0x5f3759df - (convertor.intPart >> 1);	return 0.5f*convertor.floatPart*(3.0f - x*convertor.floatPart*convertor.floatPart);}

I''m not sure how much accuracy it''ll lose by pulling the 0.5f out in the return line. I''m guessing it''ll be less than the error already introduced by the method. Just in case though I also tried:
float InvSqrt3 (float x){	union{		int intPart;		float floatPart;	} convertor;	convertor.floatPart = x;	convertor.intPart = 0x5f3759df - (convertor.intPart >> 1);	return convertor.floatPart*(1.5f - 0.5f*x*convertor.floatPart*convertor.floatPart);}

I timed them and averaged over several runs and got the following times for performing 500 000 000 invsqrts:

InvSqrt (original): 29.233s = 58.466ns per invsqrt
InvSqrt2: 20.957 = 41.914ns per invsqrt
InvSqrt3: 21.550 = 43.100ns per invsqrt

So my best attempt was 28.3% faster than the original function. If you take off the loop overhead (4.48694s) the speed increase is 33.4%!

Can anyone beat that?

Enigma

##### Share on other sites
quote:
Original post by Enigma
I''m not sure how much accuracy it''ll lose by pulling the 0.5f out in the return line. I''m guessing it''ll be less than the error already introduced by the method.

Try ''none''. If there was any difference it wasn''t obvious - the avg. difference and the avg. % error are the same for InvSqrt and InvSqrt2, compared to sqrt().

I think the only way to make it even faster is either to write in assembly(maybe use sse or mmx?) or double check to make sure the optimizer for your compiler is on full throttle. Oh, and maybe make the function inline.

Anyway, can this formula work with doubles and long doubles?

##### Share on other sites
This code in it''s self would not work directly with doubles (or long doubles), but I think the principle could be adapted to work with doubles. However, I''m not sure how you''d work it since you need to do 64bit integer arithmetic & I don''t know of any standard implementation of C that allow 64 bit integer processing (MSVC has __int64 - but I''m unsure of how it codes instructions). If you don''t mind it being un-portable, then you could use x86 asm, which has a weird & wonderful instruction (that I''ve forgotten!) that would make the right shift operation to be done fairly easily & efficiently with 32 bit registers (of course, there''s MMX - but for the sake of 2 instructions it''d kill the speed gain. Something to think about!

##### Share on other sites
with vc++ 6.0 optimizations on:

CONST SEGMENT
__real@3f000000 DD 03f000000r ; 0.5
CONST ENDS
CONST SEGMENT
__real@3fc00000 DD 03fc00000r ; 1.5
; Function compile flags: /Ogty
CONST ENDS
_TEXT SEGMENT
_x$= 8 _i$ = 8
?InvSqrt@@YAMM@Z PROC NEAR
mov eax, DWORD PTR _x$[esp-4] mov ecx, 1597463007 ; 5f3759dfH fld DWORD PTR _x$[esp-4]
fmul DWORD PTR __real@3f000000
sar eax, 1
sub ecx, eax

mov DWORD PTR _i$[esp-4], ecx fld DWORD PTR _i$[esp-4]
fmul DWORD PTR _i$[esp-4] fmulp ST(1), ST(0) fsubr DWORD PTR __real@3fc00000 fmul DWORD PTR _i$[esp-4]
ret 0
?InvSqrt@@YAMM@Z ENDP
_TEXT ENDS

the code seem good optimized, and of course use bit rotation, but I think that the fsqrt FPU instrucction its enough fast and accurate, or the equivalents in 3dnow/SSE/SSE2, not mmx because it works with integers

##### Share on other sites
Apparently the C# square root function is a lot more optimized than this one. I did the test with the functions under C# and got the following results (Tested on all integers in the set [1,10e5]:

Sqrt: 4.01 ms
Test: 11.0 ms
Diff: 7.01 ms
%err: 0.095%

Sorry about the previous numbers, I put an extra timer call in that messed some stuff up. So the C# sqrt is still faster than the Newton method when used under C# however the speed difference is very small.
-- Exitus Acta Probat --

[edited by - jperalta on February 18, 2003 12:55:14 PM]

[edited by - jperalta on February 18, 2003 12:56:02 PM]

##### Share on other sites
I figured out how this thing works, and have improved the accuracy. The new code is

/// Chris Lomont version of fast inverse sqrt,
/// based on Newton method, 1 iteration, more accurate
float InvSqrt_Lomont(float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f375a84 - (i>>1); // hidden initial guess, fast
x = *(float*)&i;
x = x*(1.5f-xhalf*x*x);
return x;
} // InvSqrt_Lomont

I changed the const from 0x5f3759df to 0x5f375a84 which reduces the maximum relative error over the entire range of floating point numbers. If someone else will like to test this, please post anything you find.

I hope to write up an explanation of how this works, and post it to my site (www.math.purdue.edu/~clomont) eventually under papers. The explanation by David Eberly at www.magic-software.com/Documentation/FastInverseSqrt.pdf is not entirely correct, since he ignores a lot of bit errors and shifts.

Let me know if this works for everyone. I am pretty certain about my analysis.

Chris Lomont

##### Share on other sites
Again, testing in C# as it''s my current language of choice and using your new algorithm results, on the set [1,10e5], 100 iterations are as follows:

Average percent error: 0.096%
Average time difference: 4.79 ms in favor of .NET''s sqrt()

-- Exitus Acta Probat --

##### Share on other sites
From my tests I determined that using 0x5f375a84 was slightly more inaccurate than the original over a range of [1,1000], with incriments of 0.5. However, it was only around 3/1000 of a difference, so it doesn''t make much of a difference which is used, and over some ranges it was more accurate. What I am interested in is the math behind the calculations you did to arrive at that number.

##### Share on other sites
quote:
Original post by Anonymous Poster
I figured out how this thing works, and have improved the accuracy. The new code is

/// Chris Lomont version of fast inverse sqrt,
/// based on Newton method, 1 iteration, more accurate
float InvSqrt_Lomont(float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f375a84 - (i>>1); // hidden initial guess, fast
x = *(float*)&i;
x = x*(1.5f-xhalf*x*x);
return x;
} // InvSqrt_Lomont

I changed the const from 0x5f3759df to 0x5f375a84 which reduces the maximum relative error over the entire range of floating point numbers. If someone else will like to test this, please post anything you find.

I hope to write up an explanation of how this works, and post it to my site (www.math.purdue.edu/~clomont) eventually under papers. The explanation by David Eberly at www.magic-software.com/Documentation/FastInverseSqrt.pdf is not entirely correct, since he ignores a lot of bit errors and shifts.

Let me know if this works for everyone. I am pretty certain about my analysis.

Chris Lomont

Ok, I plugged it into my test program and here''s what I got, with the C sqrt() being the ''true value''. The 1st style uses 0x5f3759df and the 2nd method uses 0x5f375a84.

first 500 integers:
Avg. Pct Error for 1st style: 0.0946063
Avg. Difference for 1st style: 0.0142623
Avg. Pct Error for 2nd style: 0.0946675
Avg. Difference for 2nd style: 0.0142718
first 5000 integers:
Avg. Pct Error for 1st style: 0.0971904
Avg. Difference for 1st style: 0.0409933
Avg. Pct Error for 2nd style: 0.0972366
Avg. Difference for 2nd style: 0.0410079
first 20000 integers:
Avg. Pct Error for 1st style: 0.112022
Avg. Difference for 1st style: 0.0893813
Avg. Pct Error for 2nd style: 0.112074
Avg. Difference for 2nd style: 0.0894122
first 100000 integers:
Avg. Pct Error for 1st style: 0.10822
Avg. Difference for 1st style: 0.192221
Avg. Pct Error for 2nd style: 0.108278
Avg. Difference for 2nd style: 0.192317
first 1e+006 integers:
Avg. Pct Error for 1st style: 0.105428
Avg. Difference for 1st style: 0.647431
Avg. Pct Error for 2nd style: 0.105498
Avg. Difference for 2nd style: 0.647852

Looks like there''s a difference between the two methods, though these results show the 1st method is only very slightly more accurate. I agree with YoshiN on this one. However, both methods are just as suitable because they''re approximations anyway.

##### Share on other sites
quote:
Original post by jperalta
Again, testing in C# as it''s my current language of choice and using your new algorithm results, on the set [1,10e5], 100 iterations are as follows:

Average percent error: 0.096%
Average time difference: 4.79 ms in favor of .NET''s sqrt()

-- Exitus Acta Probat --

It occurs to me that .NET might be exploiting processor features w/o telling you it is, hence the speed of that particular function.

##UnknownPlayer##

##### Share on other sites
quote:
Original post by UnknownPlayer

It occurs to me that .NET might be exploiting processor features w/o telling you it is, hence the speed of that particular function.

##UnknownPlayer##

Just goes to show you how ''portable'' their languages are.