# Floating Point Resolution/Precision Errors


## Recommended Posts

I'm having problems with the resolution of floating-point calculations and am unsure how I should handle them. Here's my problem and my (failed) solution. The motion of a game object is governed by the following equation (movement as a function of the parameter r, which increases by a very small amount every loop iteration):

[source="cpp"]
float b = 26.087315f;
float b2 = 680.54797f;
float Rs = 10.f;
float position = 0.f;
float EPSILON = 0.00001f;

// Movement loop
while (r is still valid)
{
    // r is the independent variable; b, b2 and Rs are constants
    float val = 1.f - b2 / (r * r) * (1.f - Rs / r);
    float movement = b / (r * r * sqrt(val));
    position = position + movement;

    r = r + EPSILON;
}
[/source]
The movement term blows up where (1. - (b2 / r^2) * (1. - Rs / r)) reaches zero, depending on the values of b2 and Rs. In this case, the critical value is r = 15.840763. I have code that solves the above expression to find the critical point, so this is not a problem: I simply start r at a value above 15.840763 and do not allow r to fall to or below it. Above the critical value, the function (1. - (b2 / r^2) * (1. - Rs / r)) is smooth and continuous, increasing from 0 toward 1 as r approaches infinity.

However, certain values above the critical value still result in a divide-by-zero error, because val evaluates to 0. I reasoned that this was caused by the imprecision of floating-point numbers: too close to the critical value, val is so small that it rounds to 0. So I first ran the following code to determine the lowest possible value of r (while still remaining above my critical value) such that val is still evaluated to be greater than 0:
[source="cpp"]
float b2 = 680.54797f;
float Rs = 10.f;
float critical_r = 15.840763f; // Calculated earlier in the code
float low_limit = critical_r;  // Any r equal to or below this is known to
                               // return 0 upon evaluation of "val"
float high_limit = 20.f;       // A value chosen because I KNOW it to be valid
                               // from earlier code. Any r equal to or greater
                               // than high_limit should be guaranteed to give
                               // a value of "val" greater than 0
float min_r = high_limit;      // min_r will be the smallest possible r above
                               // critical_r that we know will not give 0 for
                               // the evaluation of "val"

while (1)
{
    // Test a value midway between the low limit and the high limit
    float test_r = low_limit + 0.5f * (high_limit - low_limit);

    if (test_r - low_limit <= 0 || high_limit - test_r <= 0)
    {
        // We've reached the floating-point resolution limit, where we
        // can't fine-tune the value of min_r anymore
        break;
    }

    float val = 1.f - b2 / (test_r * test_r) * (1.f - Rs / test_r);

    if (val <= 0)
    {
        // Mathematically, we know val should be greater than 0.
        // Due to floating-point precision limits, however, it has
        // evaluated to 0 or less, so min_r must be greater than our
        // current value of test_r
        low_limit = test_r;
    }
    else
    {
        // Evaluating "val" with r = test_r gives a greater-than-zero
        // value, so test_r is currently the lowest value of r known to
        // give a valid "val" despite floating-point precision errors
        high_limit = test_r;
        min_r = test_r;
    }
}

// Since our tolerance limits our resolution to this difference at the most
// sensitive point anyway, use it as the minimum step size
EPSILON = min_r - critical_r;
[/source]

So my code attempts to narrow down on the smallest possible value of r that is above critical_r (15.840763) AND also yields a greater-than-zero value for "val." I then forbid my movement code from ever allowing r to go below this value (min_r).

Unfortunately, I've found that while min_r yields a greater-than-zero value for "val," there are other values of r GREATER than min_r which return 0 for "val." For example, running the above code with these constants (b2 = 680.54797, Rs = 10., critical_r calculated to be 15.840763), I find min_r = 15.840767. When I evaluate "val" at min_r:

val = 1. - b2 / (min_r * min_r) * (1. - Rs / min_r) = 5.9604645e-008

Interestingly, if I evaluate val at r = 15.840769 (which is greater than min_r):

val = 1. - b2 / (15.840769 * 15.840769) * (1. - Rs / 15.840769) = 5.9604645e-008

So val has the same value, even though r is now slightly higher. Fine, I can chalk this up to imprecision in the floating-point calculation. However, if I evaluate val at r = 15.840771 (which is greater than both min_r and the last test value, 15.840769):

val = 1. - b2 / (15.840771 * 15.840771) * (1. - Rs / 15.840771) = 0

I expect val to be equal to the previous values, if not larger (as my r value is larger). Instead, not only is it smaller, it's actually 0, which leads to a divide-by-zero error! Mathematically, this makes no sense, so I conclude it must be due to floating-point resolution and precision errors.

Clearly, my algorithm for determining the smallest possible r value to use (while remaining valid given the limitations of my machine's floating-point arithmetic) is not completely effective. How should I approach and solve this problem? How can I effectively determine which values of r I can use in the equations above, given values for b2 and Rs?

##### Share on other sites
If you want an in-depth answer, you need to understand how floating-point math works. There are a lot of details. Google should find an old but good paper, "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

As for computing the precision, you don't have to.

There are constants in the system that tell you the safe amount of precision you are given.

Since you are speaking in terms of base-10 (decimal) digits, you need to look at the decimal digits of precision available to you; that is, the number of digits that can be safely stored without change when the value is expressed as X.XXXeYY.

In C, use float.h for the constants FLT_DIG and DBL_DIG.
In C++, use the std::numeric_limits&lt;T&gt;::digits10 constant (a static data member, not a function).

Those are the number of decimal digits of precision. The language standard only specifies minimum values: for float it is 6 decimal digits, for double it is 10. On typical IEEE-754 hardware you actually get 6 for float and 15 for double.

##### Share on other sites
If it's for a game, pick a different formula that approximates the behaviour. If that's simply impossible, check val for 0 and substitute a close-enough value in those cases. Or just add 0.00001 to val in all cases, to keep it away from 0 or, worse, negative values. Who's going to be able to tell?
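A minimal sketch of the "keep val away from 0" idea; the helper name is mine, and the 0.00001 floor is just the number suggested above:

```cpp
#include <algorithm>

// Hypothetical helper: evaluate the "val" expression from the original post,
// but clamp it to a small positive floor so sqrt(val) and the later divide
// never see zero or a negative number.
float safe_val(float r, float b2, float Rs) {
    float val = 1.f - b2 / (r * r) * (1.f - Rs / r);
    return std::max(val, 0.00001f);  // who's going to be able to tell?
}
```

The error this introduces is bounded by the floor value, which should be well below anything visible in a game.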

It's a sad fact that formulas that work in textbooks don't always work in computers; computers have finite precision and your formula is obviously stressing the limits. Maybe using double would save you here, but I'd work hard to lose that formula and find something else that gets you close enough but doesn't require such twitchy divides.

##### Share on other sites
Quote:
 Original post by frob: In C, use float.h for the constants FLT_DIG and DBL_DIG. In C++, use the std::numeric_limits&lt;T&gt;::digits10 constant.

I checked both float.h and std::numeric_limits&lt;double&gt;::digits10. Both indicate that with doubles I should expect 15 decimal digits of precision.

However, when debugging, I find the following problem:

[source="cpp"]
double dphi = -0.00013619533274322748;
double dlambda = 0.00011334598093526438;
double theta = 2.6116697788238525;

double X = dphi * dlambda;
// Debugger now says X = -1.5437192857348236e-008
// Debugger also says theta = 2.6116697788238525

theta = theta + X;
// Debugger still says theta = 2.6116697788238525
// i.e., it was UNCHANGED by the addition operation
[/source]

When I add a double on the order of 10^-8 to a double on the order of 1, the small double is treated as if it were 0. However, I should have 15 digits of precision. Even with the guaranteed minimum of 10 for doubles, as you suggested, I should STILL expect this addition to work, as the small value is only about 8 orders of magnitude smaller than the large one.

What is going on? Do I not understand how the limits of precision work in these operations?

##### Share on other sites
Think of it this way:

Between 0 and 1, you have the maximum precision possible.

Between 1 and 2 you have half as much precision (you lose a bit to the integral part of the float).
Between 2 and 4 you lose half as much again (another bit to the wind).

And so on, for each power of two that your floating-point number reaches.

The further away from 0 you get, the more bits go to representing the integral part, and the more gaps there are that you can't represent with a float.

Basically, you should test not for an exact value (especially one with a very long and specific mantissa), but for a range of values within a particular delta (or epsilon). Again, games don't need to be exact, just close enough.

If it looks right, it is right :)

##### Share on other sites
Quote:
 Original post by Matt_D: basically, you should test not for an exact value, especially with a very long and specific mantissa, but for a range of values within a particular delta (or epsilon). again, games dont need to be exact, just close enough.

I believe I understand this concept, but if you look at my post just above, you'll see the issue is not about exact comparisons: it is about floating-point arithmetic behaving in a way that does not fall in line with what I'm being told (by you and others!). After checking other sources, I believe I am still correct: the arithmetic does not make sense!

I will use the same example I just used in my latest post above, but walk through my reasoning more explicitly. If there is a flaw, please point it out!

Note that I am assuming 15 digits of decimal precision. This is supported by the value of DBL_DIG in float.h as well as by std::numeric_limits&lt;double&gt;::digits10 in my code. Furthermore, a double has 52 bits for the significand, and log10(2^52) is just a bit more than 15. So you are correct that I cannot have exact precision for every number, but I CAN expect about 15 significant decimal digits!

I will be using the floating-point addition arithmetic example on wikipedia as a model:

The arithmetic is the addition of X to Y, where X = -.000000015437192857348236 and Y = 2.6116697788238525.

Quote:
 double X = -.000000015437192857348236 = -1.5437192857348236 * 10^-8
 double Y = 2.6116697788238525 = 2.6116697788238525 * 10^0

 Hence:
   -.000000015437192857348236 + 2.6116697788238525
 = (-1.5437192857348236 * 10^-8) + (2.6116697788238525 * 10^0)
 = (-.000000015437192857348236 * 10^0) + (2.6116697788238525 * 10^0)
 = (-.000000015437192857348236 + 2.6116697788238525) * 10^0
 = 2.611669763386659642651764 * 10^0

 In detail:
   e=-8; s=-1.5437192857348236   (-.000000015437192857348236)
 + e=0;  s=2.6116697788238525    (2.6116697788238525)

   e=0;  s=-0.000000015437192857348236  (after shifting)
 + e=0;  s=2.6116697788238525
 ------------------------------------------------
   e=0;  s=2.611669763386659642651764   (true sum)

 This is the true result, the exact sum of the operands. It will be rounded to fifteen significant digits and then normalized if necessary. The final result is:

   e=0; s=2.61166976338666

 Note that the low nine digits of the first operand (857348236) are essentially lost. This is round-off error.

By my understanding, this is what should happen. But as I stated above, when I perform this operation I get a result of 2.6116697788238525, NOT something near 2.61166976338666. It is as if all the digits of the first operand are lost.

I would expect this behavior with floats, which have only 6 decimal digits of precision. But even with doubles and their 15 digits, I lose all of it.

What is going on?!

##### Share on other sites
I don't know what specifically the issue is in your case, but some processors can be told to reduce their floating-point precision. There is software that does this for the slight speed increase it yields. Are you calling anything that might futz with the precision? Such futzing violates the assumptions behind the language guarantees; the definitions of DBL_DIG and so on are based on what the compiler thinks the processor is going to do.

Other sources of error: some compilers do calculations at compile time if they can (if you don't want that, declare the variables volatile), and compile-time precision can differ from runtime precision. Interesting errors arise when the compiler does some of the math at compile time and the rest happens at run time. A related possibility is a buggy compiler or optimizer, though I wouldn't point a finger at the compiler unless I'd at least looked at the assembly it generated. Probably not the issue in this case, but something to check.

You aren't running on an old Pentium, are you? [smile]

Maybe if you post a tiny (as few operations as possible to make the point) but complete program that very clearly demonstrates excessive round-off, someone will take pity and try it in various compilers. I agree that if you are using double and adding 1e-8 to a value around 2, on any hardware I've used recently the add shouldn't be a no-op.

##### Share on other sites
So it looks like you have this addition:

  +2.611669778xxx
+ -0.000000001xxx

So for large values of the first one and small values of the second one, the result can still round to the initial number. Also be aware that error accumulates over each math operation.

As for your own system not getting full precision, are you doing something else that could mess with the floating-point state? The FPU's precision mode is global: it cannot give you double precision in one place and float precision in another at the same time.

Many libraries (such as those in DirectX) will force the system back into 32-bit float mode. Any library can manually alter the floating point settings, and many will do it without storing your old state. You can change it with _fpreset() and similar functions, but be aware that it may disturb the functions of those other libraries.

##### Share on other sites
When you initialize D3D9 it drops the CPU floating point precision to effectively treat everything as a float. You can pass the D3DCREATE_FPU_PRESERVE flag to stop it doing that.

Also be aware that depending on the compiler optimization settings, floating point precision can vary significantly, so you should always aim to write code that's as tolerant of inaccuracies as possible as that lets you use the best optimization settings, and is also least likely to have precision related bugs.
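One concrete way to write that tolerance is a relative-plus-absolute epsilon comparison; the helper below and its default tolerances are illustrative, not anything from this thread:

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: treat a and b as equal when they differ by at most
// `rel` of the larger magnitude, or by `abs_tol` for values near zero.
bool nearlyEqual(float a, float b, float rel = 1e-5f, float abs_tol = 1e-8f) {
    float diff = std::fabs(a - b);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel * scale, abs_tol);
}
```

Comparisons written this way keep working whether the FPU is running at full double precision or has been dropped to float precision by D3D9.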

##### Share on other sites
Quote:
 Original post by Adam_42When you initialize D3D9 it drops the CPU floating point precision to effectively treat everything as a float. You can pass the D3DCREATE_FPU_PRESERVE flag to stop it doing that.

Wow. What I find unbelievable is that DX9 didn't make PRESERVE the default. It's one thing to offer a speed-up if you ask for it. It's another to silently clobber someone's calculations by default.

I understand DX10 stopped doing this evil little trick.
