High precision floats

8 comments, last by Samith 20 years, 3 months ago
How would I go about making my own floating point type variable that has a super high accuracy? If it's really hard though, do any of you know of any free libraries or something that I can use for some higher precision floats? I need it for a fractal drawer, because you can't zoom in very well with a normal float.
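To make the zoom problem concrete, here is a minimal sketch that counts how many different x coordinates a row of pixels actually gets when the coordinates are computed in a given type. The function name, the center value and the pixel size are made up for illustration, and the result assumes ordinary IEEE 754 float and double arithmetic.

```cpp
#include <cassert>

// Counts how many distinct x coordinates a row of `width` pixels receives
// when each coordinate is computed as center + i * pixel_size in type T.
// (Hypothetical helper, not taken from any fractal library.)
template <typename T>
int distinct_coordinates(T center, T pixel_size, int width) {
    assert(width > 0);
    int distinct = 0;
    T previous = 0;
    for (int i = 0; i < width; ++i) {
        T x = center + static_cast<T>(i) * pixel_size;
        // Coordinates never decrease, so comparing with the previous
        // one is enough to detect a new value.
        if (i == 0 || x != previous) ++distinct;
        previous = x;
    }
    return distinct;
}
```

With a center of -0.75 and a pixel size of 1e-9, distinct_coordinates<float>(-0.75f, 1e-9f, 100) should report only a handful of distinct values, because neighbouring floats near 0.75 are about 6e-8 apart, while the double version should give every one of the 100 pixels its own coordinate. Once several pixels share a coordinate, the image turns blocky.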
Just as an interim solution, have you tried doubles? I'm guessing probably so.

-=[ Megahertz ]=-

float-------32 bit -> 1E-37 to 1E+37 with six digits of precision
double------64 bit -> 1E-37 to 1E+37 with ten digits of precision
long double-80 bit -> 1E-37 to 1E+37 with ten digits of precision
quote:Original post by Anonymous Poster

float-------32 bit -> 1E-37 to 1E+37 with six digits of precision
double------64 bit -> 1E-37 to 1E+37 with ten digits of precision
long double-80 bit -> 1E-37 to 1E+37 with ten digits of precision


Then what's the difference between long double and normal doubles?
long double and double are not always different. In VC++ 6.0 both long double and double have 64 bits, with about 15 digits of accuracy, but on GCC (on x86) long double has 80 bits (IIRC about 18 digits of accuracy) while double has 64 bits.

Instead of float, try double or long double.

Colin Jeanne | Invader's Realm
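Rather than relying on remembered figures, you can ask the compiler what each type provides. This is a small sketch using std::numeric_limits from the standard library; the helper names are made up for illustration, and the numbers it reports depend on the compiler and platform.

```cpp
#include <cstdio>
#include <limits>

// Decimal digits that survive a round trip through type T.
template <typename T>
constexpr int decimal_digits() {
    return std::numeric_limits<T>::digits10;
}

// Mantissa bits of type T, counting the implied leading bit.
template <typename T>
constexpr int mantissa_bits() {
    return std::numeric_limits<T>::digits;
}

// Prints one summary line for type T (illustrative helper).
template <typename T>
void describe(const char* name) {
    std::printf("%-12s %2u bytes, %2d mantissa bits, %2d decimal digits\n",
                name, static_cast<unsigned>(sizeof(T)),
                mantissa_bits<T>(), decimal_digits<T>());
}
```

Calling describe<float>("float"), describe<double>("double") and describe<long double>("long double") from main shows whether long double is really wider than double on your compiler. On typical IEEE 754 platforms float reports 6 decimal digits and double reports 15.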
long double is a Java thing, you will not find it in C++ (it's either not there or the same as double)...


edit: ohh gcc has it? wow...


T2k

[edited by - T2k on January 11, 2004 2:27:32 PM]
Actually, long double isn't a Java thing at all. It exists in C/C++, but in most standard compilers it's the same as a double.
GCC implements it with 80 bits, and I think ICC does as well?
Anyway, creating a floating-point class with very high precision seems kind of overkill, and it's not the easiest thing. Ever thought of using fixed point instead? For example 64:64. That would give you HUGE precision, but would of course be kind of slow.


--
MFC is sorta like the Swedish police... It's full of crap, and nothing can communicate with anything else.
quote:Original post by tok_junior
Actually, long double isn't a Java thing at all. It exists in C/C++, but in most standard compilers it's the same as a double.
GCC implements it with 80 bits, and I think ICC does as well?
Anyway, creating a floating-point class with very high precision seems kind of overkill, and it's not the easiest thing. Ever thought of using fixed point instead? For example 64:64. That would give you HUGE precision, but would of course be kind of slow.


--
MFC is sorta like the Swedish police... It's full of crap, and nothing can communicate with anything else.


I don't know what you mean by fixed point, what is it and how do I use it?
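Fixed-point means storing the number as an integer with an implied scale factor, so a fixed number of the bits sit after the point. All the arithmetic is then ordinary integer arithmetic.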

Colin Jeanne | Invader's Realm
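A minimal sketch of the fixed-point idea, assuming a 16.16 layout (16 integer bits, 16 fraction bits) stored in a 32-bit integer. All names here are hypothetical. The 64:64 format suggested earlier works the same way but needs 128 bits of storage and a 256-bit intermediate product, which you would have to build yourself from several machine words.

```cpp
#include <cstdint>

constexpr int kFracBits = 16;                                // bits after the point
constexpr std::int64_t kOne = std::int64_t{1} << kFracBits;  // 1.0 in raw units

// The stored integer is the real value multiplied by 2^16.
struct Fixed16 {
    std::int32_t raw;
};

// Conversion truncates toward zero; the value must fit in 16 integer bits.
constexpr Fixed16 from_double(double v) {
    return Fixed16{static_cast<std::int32_t>(v * kOne)};
}

constexpr double to_double(Fixed16 f) {
    return static_cast<double>(f.raw) / kOne;
}

// Addition is plain integer addition because both scales match.
constexpr Fixed16 add(Fixed16 a, Fixed16 b) {
    return Fixed16{a.raw + b.raw};
}

// The product of two scaled values carries the scale twice, so compute it
// in 64 bits and divide one scale factor back out.
constexpr Fixed16 mul(Fixed16 a, Fixed16 b) {
    return Fixed16{static_cast<std::int32_t>(
        static_cast<std::int64_t>(a.raw) * b.raw / kOne)};
}
```

For example, to_double(mul(from_double(1.5), from_double(-2.0))) gives -3.0. The precision is the same everywhere in the range (here 1/65536), which is the property that makes fixed point attractive for zooming into a small region.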
quote:Original post by Anonymous Poster

float-------32 bit -> 1E-37 to 1E+37 with six digits of precision
double------64 bit -> 1E-37 to 1E+37 with ten digits of precision
long double-80 bit -> 1E-37 to 1E+37 with ten digits of precision


These ranges are off. The larger floating point types store bigger ranges as well as providing more precision.

float = 1 bit sign, 23 bits mantissa, 8 bit exponent
double = 1 bit sign, 52 bits mantissa, 11 bit exponent
long double = 1 bit sign, 64 bits mantissa, 15 bit exponent in the 80-bit x87 format, where the leading mantissa bit is stored explicitly instead of being implied.
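The float layout (1 sign bit, 8 exponent bits, 23 mantissa bits) can be checked by pulling the fields out of a value. This is only a sketch, assuming float is a 32-bit IEEE 754 single; the struct and function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The three fields of a 32-bit IEEE 754 single.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits, leading 1 is implied
};

FloatFields decode(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bit pattern
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

float encode(FloatFields f) {
    assert(f.sign <= 1u && f.exponent <= 0xFFu && f.mantissa <= 0x7FFFFFu);
    std::uint32_t bits = (f.sign << 31) | (f.exponent << 23) | f.mantissa;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

decode(1.0f) yields sign 0, exponent 127 and mantissa 0: the exponent is stored with a bias of 127, and the leading 1 of the mantissa is not stored at all.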

The availability of "long double" is kind of a hardware thing, really. How the numbers behave (the allocation of bits to mantissa/exponent, etc.) is specified by the relevant IEEE standard, #754.

However:
- Java doesn't provide access to a "long double" type.
- Some C/C++ compilers will interpret "long double" as "double", even though the hardware is capable (and almost all desktop PC hardware is, apparently)
- In the old days of K&R C, operations between two floats would always use double internally, and I think operations between two doubles would similarly use long double, but I could be wrong on that one. Now the type coercion rules are simplified; the shorter FP value is promoted to the type of the longer one, but two floats still mean the work is done in float values. The result is that errors can accumulate in the last bit. (this is from what I remember about the long PDF referenced at the end of this post.)
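The current promotion rule can be checked at compile time. The sketch below uses a made-up alias, SumType, to name the type the compiler gives to a + b. It only shows the type of the result; a compiler may still evaluate intermediates in a wider format, which the FLT_EVAL_METHOD macro in <cfloat> reports.

```cpp
#include <type_traits>
#include <utility>

// The type the compiler assigns to the expression a + b
// (hypothetical alias for illustration).
template <typename A, typename B>
using SumType = decltype(std::declval<A>() + std::declval<B>());

// Two floats stay float: there is no automatic widening to double.
static_assert(std::is_same<SumType<float, float>, float>::value,
              "float + float is float");

// Mixed operands are promoted to the longer type.
static_assert(std::is_same<SumType<float, double>, double>::value,
              "float + double is double");
static_assert(std::is_same<SumType<double, long double>, long double>::value,
              "double + long double is long double");
```

If you want float inputs to be combined in double, you have to ask for it, for example by casting one operand or by using a double accumulator.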

Numerical stability is not a simple bit of study, BTW; some of the rules of thumb like "you only need a couple more bits as 'guard' on your calculation" fail catastrophically for some formulas. It's not difficult to construct things where using double internally, when the initial values are floats, really is needed to get the right result.
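One easy case to construct is a long running sum. The sketch below adds the same float value many times, once with a float accumulator and once with a double accumulator; the function names are invented for the example, and the behaviour described assumes float arithmetic is really carried out in single precision (as on typical x86-64 builds).

```cpp
#include <cassert>
#include <cmath>

// Adds `value` to an accumulator of the given type `count` times.
template <typename Accumulator>
Accumulator sum_repeated(float value, long count) {
    assert(count >= 0);
    Accumulator total = 0;
    for (long i = 0; i < count; ++i) {
        total += value;  // each addition rounds to the accumulator's precision
    }
    return total;
}

// Distance between a computed result and the value we expected.
double absolute_error(double computed, double expected) {
    return std::fabs(computed - expected);
}
```

Summing 0.1f ten million times should give about 1,000,000. With sum_repeated<double> the result lands within a fraction of a unit of that, while sum_repeated<float> ends up off by far more than a thousand: once the total is large, 0.1 is smaller than the spacing between neighbouring floats, so every addition gets rounded to a coarser step.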

Interesting references on the subject:
http://cch.loria.fr/documentation/IEEE754/
www.cs.nyu.edu/cs/faculty/overton/book/docs/KahanTalk.pdf
www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf (80 pages, but I read it all and so should you.)

Apparently this Kahan guy is authoritative on the subject. :s

