
Floating-Point Operations


Hi. What I am trying to do (as part of something I'm working on) is determine the accuracy of floating-point calculations. That is, I want to be able to know whether a+b gives an exact answer, a*b, etc. I THINK I have managed to do it for addition and multiplication, but I have no idea about division. Does anyone know how? Or maybe it's not possible? EDIT: Maybe it's possible to say that (a/b)*b will equal a only if a/b gives an exact answer?

Guest Anonymous Poster
"...I want to be able to know if a+b gives an exact answer,..."
Short answer, it doesn't. There is nothing exact about floats or doubles when you consider it's decimal places in binary.
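To see this concretely, here is a quick C++ illustration of my own (not from the thread): the decimal 0.1 has no finite binary representation, so the stored double is already slightly off before any arithmetic happens.

#include <cstdio>

int main() {
    // 0.1 rounds to the nearest representable double, which is a hair above 1/10.
    std::printf("%.20f\n", 0.1); // typically prints 0.10000000000000000555
    return 0;
}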

Guest Anonymous Poster
Maybe I missed the point, but how do you mean?
"...determine the accuracy of floating-point calculations...if a+b gives an exact answer..."

The IEEE standard for floating-point arithmetic might help you. Floating-point numbers aren't mathematical reals, so you can't assume they obey the usual algebraic laws. In general:

(x + y) + z != x + (y + z)
(x * y) * z != x * (y * z)
x * (y + z) != (x * y) + (x * z)

Many numbers cannot be represented exactly. NaNs (Not a Number) also have this interesting property:

x != x

So don't rely on their accuracy. If you need accurate numbers, use a library that gives you arbitrary-precision real numbers.
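If you want to see the broken associativity for yourself, here is a minimal C++ sketch of my own (assuming IEEE 754 doubles with round-to-nearest):

#include <cstdio>

int main() {
    double x = 0.1, y = 0.2, z = 0.3;
    double left  = (x + y) + z; // rounds after x + y, then rounds again
    double right = x + (y + z); // different intermediate rounding
    // On IEEE 754 doubles this typically prints:
    // 0.60000000000000009 vs 0.59999999999999998 (equal: 0)
    std::printf("%.17g vs %.17g (equal: %d)\n", left, right, left == right);
    return 0;
}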

Ok, I'm curious: how did you do it for addition/multiplication?

The only way I can think of to test this would be to work through each digit, performing the operations manually and storing the result in a fixed-point format with enough precision, then checking whether the results match, digit by digit.

Needless to say, if you go that route, it'd be a lot easier to just avoid floats entirely.

Addition is exact if no 1-bit is right-shifted out of the mantissa.

For example, that holds for all integers below 2^23 (they fit in a float's significand): 1.0f + 2424.0f == 2425.0f
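Building on this: there is a classical trick, Knuth's TwoSum (not mentioned above, so treat this as my own addition), that recovers the exact rounding error of an addition using only a few more additions. If the recovered error is zero, a + b was exact. A minimal C++ sketch, assuming IEEE 754 doubles, round-to-nearest, and no overflow:

#include <cstdio>

// Knuth's TwoSum: returns the exact rounding error of a + b.
// a + b was computed exactly if and only if the result is zero
// (assumes round-to-nearest and no overflow).
double add_error(double a, double b) {
    double s  = a + b;          // rounded sum
    double bv = s - a;          // portion of b that made it into s
    double av = s - bv;         // portion of a that made it into s
    return (a - av) + (b - bv); // what rounding threw away
}

int main() {
    std::printf("%g\n", add_error(1.0, 2424.0)); // 0 -> exact
    std::printf("%g\n", add_error(0.1, 0.2));    // nonzero -> inexact
    return 0;
}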

Quote:
Original post by Anonymous Poster
"...I want to be able to know if a+b gives an exact answer,..."
Short answer, it doesn't. There is nothing exact about floats or doubles when you consider it's decimal places in binary.


Floats and doubles are perfectly exact; it's the operations on them that are inexact.

I'm aware that lots of numbers are unrepresentable. But I still don't know how to handle division. I think I did addition much as suggested. Try not to laugh, but I THINK multiplication is exact if the result fits in the 52-bit mantissa of a double, and the number of significant bits in the result is at most the bits in the first operand plus the bits in the second?
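For what it's worth, on hardware with a fused multiply-add there is a direct test for multiplication and for the division case you're stuck on, because fma(a, b, c) computes a*b + c with a single rounding. This is my own sketch rather than anything from the thread; it assumes IEEE 754 doubles, round-to-nearest, a correctly rounded std::fma, and no overflow or underflow in the intermediates:

#include <cmath>
#include <cstdio>

// a * b is exact iff the residual a*b - round(a*b) is zero.
bool mul_is_exact(double a, double b) {
    double p = a * b;                 // rounded product
    return std::fma(a, b, -p) == 0.0; // residual computed with one rounding
}

// a / b is exact iff the remainder round(a/b)*b - a is zero.
// This is the rigorous version of the (a/b)*b == a idea: the fma
// residual avoids the extra rounding that a plain back-multiplication adds.
bool div_is_exact(double a, double b) {
    double q = a / b;                 // rounded quotient
    return std::fma(q, b, -a) == 0.0; // q*b - a, computed with one rounding
}

int main() {
    std::printf("%d %d\n", mul_is_exact(3.0, 0.25), mul_is_exact(0.1, 0.1)); // 1 0
    std::printf("%d %d\n", div_is_exact(1.0, 4.0),  div_is_exact(1.0, 3.0)); // 1 0
    return 0;
}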

Can I ask why you want to do this, and what exactly you are trying to achieve?

Good question. I am attempting to write a very basic computer algebra system (and after a long time am finally getting somewhere!). Basically, I don't want to lose any information. That means that sqrt(2) stays sqrt(2) but sqrt(4) = 2, and 1/3 stays 1/3 but 4/2 = 2. I know that doing this with roots/powers will be impossible, but division is the last of the basic operations I have yet to work out.

Then I really recommend you use a real-number library and stop using hardware-precision floats. Prefer correct arithmetic to fast arithmetic.

An excellent example of something like what you are trying to do is Frink. Try out the web applet; precision is kept throughout calculations.

Hmm... you have a point. There was a reason why I decided to use doubles rather than integers as my basic data type... can't remember what it was, though.

Possibly that an int is limited to 32 bits and will cause you just as many headaches as a double.

Serious math libraries use their own datatypes that aren't restricted to 32 or 64 bits, and they don't use a floating-point representation that leaks precision everywhere... [grin]

The trouble with floating-point (FP) numbers is that they are just an approximation of the real numbers, represented by a binary mantissa and exponent of finite length. This means the following: the representable FP numbers are isolated points on the real number axis, separated by gaps. If you try to store a real number that falls in such a gap, you end up approximating its value within the FP model. The approximation is written like this:

x = approx(x) + delta(x)

where x is the real number, approx(x) is its floating-point representation closest to x on the real axis, and delta(x) is the difference between x and approx(x). So delta(x) is the error of the approximation of x.

Before we get into calculations with these representations, let's ask a simple question: what can the maximum absolute value of delta(x) be? FP numbers are distributed evenly on the real axis as long as the exponent is the same; with increasing exponent, the gap between successive FP numbers increases as well. So let's reformulate the question: what can the maximum absolute value of delta(x) be at a specified exponent? Answer: it's half of the distance between two successive FP numbers (half of the gap). This value is closely related to the machine epsilon (you can get the machine epsilon for your FP implementation using an algorithm from here: http://en.wikipedia.org/wiki/Machine_epsilon ). Let's call this bound for the real number x: eps(x).

The above means that our real number x is guaranteed to lie in the interval [approx(x) - eps(x), approx(x) + eps(x)]. So we can write the representation like this:

x = approx(x) +- eps(x)

where eps(x) is the worst possible error of the approximation.

How does this error propagate across calculations? You can easily derive this by carrying out the basic arithmetic operations on numbers written in the form above:

Let a,b be real numbers and A +- eps(A), B +- eps(B) be their approximations.

1) addition
[A +- eps(A)] + [B +- eps(B)] = A + B +- [eps(A) + eps(B)]

so error of addition is: eps(A) + eps(B)

2) subtraction
[A +- eps(A)] - [B +- eps(B)] = A - B +- [eps(A) + eps(B)]

so error of subtraction is: eps(A) + eps(B)

3) multiplication
[A +- eps(A)] * [B +- eps(B)] = A * B +- [|A|*eps(B) + |B|*eps(A) + eps(A)*eps(B)]

since eps(A)*eps(B) is negligibly small (and unlikely to be representable by the FP implementation anyway), we can just write:

[A +- eps(A)] * [B +- eps(B)] = A * B +- [|A|*eps(B) + |B|*eps(A)]

so error of multiplication is: |A|*eps(B) + |B|*eps(A)
where |A| and |B| are the absolute values of A and B

4) division

[A +- eps(A)] / [B +- eps(B)]
  = ([A +- eps(A)] * [B +- eps(B)]) / ([B +- eps(B)] * [B +- eps(B)])
  = (A * B +- [|A|*eps(B) + |B|*eps(A) + eps(A)*eps(B)]) / (B^2 +- [2*|B|*eps(B) + eps(B)^2])

We can (I'm not going to explain why) neglect [2*|B|*eps(B) + eps(B)^2] in the denominator and eps(A)*eps(B) in the numerator, so we get:

  = (A * B +- [|A|*eps(B) + |B|*eps(A)]) / B^2

thus the error of division is: [|A|*eps(B) + |B|*eps(A)] / B^2

So dividing by a number greater in absolute value gives a smaller error.

Hope this helps a little :)
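Dinsdale's rules translate almost directly into code. Here is a small C++ sketch of my own (not from the post) that carries a value together with its worst-case error bound through the four operations; note it only propagates the input errors and, like the derivation above, neglects the fresh half-ulp rounding that each operation itself adds:

#include <cfloat>
#include <cmath>
#include <cstdio>

// A value paired with a worst-case absolute error bound,
// propagated with the first-order rules derived above.
struct Approx {
    double value; // approx(x)
    double eps;   // worst possible error of the approximation
};

// Half-ulp starting bound for a freshly stored double (an assumption
// of this sketch, matching the "half of the gap" definition above).
Approx make(double x) { return { x, std::fabs(x) * DBL_EPSILON / 2 }; }

Approx add(Approx a, Approx b) { return { a.value + b.value, a.eps + b.eps }; }
Approx sub(Approx a, Approx b) { return { a.value - b.value, a.eps + b.eps }; }

Approx mul(Approx a, Approx b) {
    // error: |A|*eps(B) + |B|*eps(A), neglecting eps(A)*eps(B)
    return { a.value * b.value,
             std::fabs(a.value) * b.eps + std::fabs(b.value) * a.eps };
}

Approx div(Approx a, Approx b) {
    // error: (|A|*eps(B) + |B|*eps(A)) / B^2
    return { a.value / b.value,
             (std::fabs(a.value) * b.eps + std::fabs(b.value) * a.eps)
                 / (b.value * b.value) };
}

int main() {
    Approx q = div(make(0.1), make(0.3));
    std::printf("0.1/0.3 ~ %.17g, error bound %g\n", q.value, q.eps);
    return 0;
}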

[Edited by - Dinsdale on July 28, 2006 9:42:01 AM]

Quote:
Original post by DaBookshah
Hi. What I am trying to do (as part of something I'm working on) is determine the accuracy of floating-point calculations. That is, I want to be able to know whether a+b gives an exact answer, a*b, etc. I THINK I have managed to do it for addition and multiplication, but I have no idea about division. Does anyone know how? Or maybe it's not possible?

EDIT: Maybe it's possible to say that (a/b)*b will equal a only if a/b gives an exact answer?

"What Every Computer Scientists Should Know about Floating Point Arithmetic" has an in depth discussion of the accuracy of floating-point calculations. And by "in depth", I mean "will give you a headache and make you realize you really don't care." Google finds it pretty easily if you're really interested.

CM

Quote:
Original post by DaBookshah
Good question. I am attempting to write a very basic computer algebra system (and after a long time am finally getting somewhere!). Basically, I don't want to lose any information. That means that sqrt(2) stays sqrt(2) but sqrt(4) = 2, and 1/3 stays 1/3 but 4/2 = 2. I know that doing this with roots/powers will be impossible, but division is the last of the basic operations I have yet to work out.
The best method for such a problem is to use some sort of lazy evaluation. This means building up an expression object that contains a sqrt object, which in turn contains another object representing the number 2.

All mathematical operations and inputs are then derived from some kind of base class. E.g. you'd have a division class that contains two objects: the dividend and the divisor. The divisor could itself be, say, a sqrt object, and that sqrt object could point to an object that simply stores the literal value 3. Then, when it comes to multiplying this whole expression object by sqrt(3), you can eventually work out that it cancels with the sqrt(3) on the bottom and you are left with the top part, etc. There are obviously a lot of complicated rules to follow to simplify an expression in this manner, but it guarantees results that are as accurate as possible, and you can easily evaluate the value of the expression at any time.
BTW, this is very similar to building a parse tree and performing algebraic simplifications, just as a compiler would.
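To make that concrete, here is a bare-bones C++ sketch of such an expression tree (my own illustration of the approach iMalc describes; all class names are hypothetical):

#include <cmath>
#include <cstdio>
#include <memory>

// Base class for every node in the expression tree.
struct Expr {
    virtual ~Expr() = default;
    virtual double evaluate() const = 0; // numeric value, computed on demand
};

// A literal input value.
struct Literal : Expr {
    double value;
    explicit Literal(double v) : value(v) {}
    double evaluate() const override { return value; }
};

// sqrt of a sub-expression.
struct Sqrt : Expr {
    std::shared_ptr<Expr> arg;
    explicit Sqrt(std::shared_ptr<Expr> a) : arg(std::move(a)) {}
    double evaluate() const override { return std::sqrt(arg->evaluate()); }
};

// A division node holding its dividend and divisor as sub-expressions.
struct Div : Expr {
    std::shared_ptr<Expr> dividend, divisor;
    Div(std::shared_ptr<Expr> n, std::shared_ptr<Expr> d)
        : dividend(std::move(n)), divisor(std::move(d)) {}
    double evaluate() const override {
        return dividend->evaluate() / divisor->evaluate();
    }
};

int main() {
    // Represents 2 / sqrt(3) symbolically; nothing is rounded until evaluate().
    auto expr = std::make_shared<Div>(
        std::make_shared<Literal>(2.0),
        std::make_shared<Sqrt>(std::make_shared<Literal>(3.0)));
    std::printf("%.17g\n", expr->evaluate());
    return 0;
}

The simplification rules iMalc mentions (e.g. cancelling a shared sqrt(3) factor) would then be pattern matches over this tree, which is where the real work lies.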

Thanks to Dinsdale for the mathematical explanation. But I guess this is somewhat irrelevant now, because I decided it was too much trouble determining error values for hardware floats. Instead, I'm going to use real numbers of the form a/b as my basic number.

iMalc:
BTW, in this case it's not the symbolic calculations that I am fiddling with, just the underlying numeric representation. At the moment, the program can determine that:

(x+2) / ((x+1)(x+2)) == 1 / (x+1)

but it cannot yet work out that 2^2 = 4.

Thanks anyway.
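Since the plan is now to build everything on numbers of the form a/b, here is a minimal C++ sketch of a normalized rational type (my own illustration; a real implementation would sit on arbitrary-precision integers, since int64_t overflows quickly under repeated arithmetic):

#include <cstdint>
#include <cstdio>
#include <numeric> // std::gcd (C++17)

// A rational number num/den kept in lowest terms with den > 0.
struct Rational {
    std::int64_t num, den;

    Rational(std::int64_t n, std::int64_t d) : num(n), den(d) { normalize(); }

    void normalize() {
        if (den < 0) { num = -num; den = -den; } // keep the sign in the numerator
        std::int64_t g = std::gcd(num, den);     // gcd of the absolute values
        if (g != 0) { num /= g; den /= g; }
    }

    Rational operator+(Rational o) const {
        return { num * o.den + o.num * den, den * o.den };
    }
    Rational operator*(Rational o) const {
        return { num * o.num, den * o.den };
    }
    Rational operator/(Rational o) const {   // assumes o.num != 0
        return { num * o.den, den * o.num }; // exact: no information lost
    }
};

int main() {
    Rational third = Rational{1, 1} / Rational{3, 1}; // stays exactly 1/3
    Rational sum = third + third + third;             // exactly 1/1
    std::printf("%lld/%lld\n", (long long)sum.num, (long long)sum.den);
    return 0;
}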

Quote:
Original post by DaBookshah
Instead I'm going to use real numbers of the form a/b as my basic number.

Pi appreciates your hard work.

Quote:
Original post by Conner McCloud
Quote:
Original post by DaBookshah
Instead I'm going to use real numbers of the form a/b as my basic number.

Pi appreciates your hard work.


lol.


I'll make it a symbolic constant then.

Seriously, just use an existing real-number library. There are some great libraries out there that support arbitrary-precision, arbitrary-magnitude numbers (even complex numbers, in many cases) and are almost guaranteed to work better than anything you could hack up by hand in a reasonable amount of time.

I didn't catch a mention of which language you're working in; that would probably help the people around here who are wise in the ways of libraries to make some solid recommendations for you.

Share this post


Link to post
Share on other sites