# 64-bit fixed point precision



Hi, I'm trying to create a class that will allow me to have greater precision in numerical calculations. I need a way to convert each value to a 64-bit fixed-point integer and keep most of the precision.

I'm basically just applying a scaling factor to each x, y, z component of a point. Let's say I'm using an __int64 (in VS2008) with 32:32 precision. That means the scaling factor would be 1 << 32 (which gives 4294967296). From here I scale up each number for storing (I pass 3 doubles to the constructor), for example:
//BaseDataType is __int64
//x is 12534706.185 as a double
//y is 43465665.472 as a double
//z is 26742463.219 as a double
//m_factor is 4294967296

this->m_x = static_cast<BaseDataType>(x * m_factor);
this->m_y = static_cast<BaseDataType>(y * m_factor);
this->m_z = static_cast<BaseDataType>(z * m_factor);

This should give me 53836153129543925 (the .76 is dropped when the value is stored in m_x). However, it gives me 53836152334974976, which is the result if the .185 wasn't included in the multiplication. I need to include that, and I'm not sure why it's being excluded.

I scale back down to float / double in the same way, using division instead of multiplication. I really don't want to revert to just using doubles, as it's not really a good solution, but I would really like some ideas as to where I'm going wrong and how I could improve what I'm doing. Any help would be great. Thanks.

##### Share on other sites
Well, you do your calculations in the domain of double, and your result has 17 digits, which, if I'm not too far off here, is right at the limit of double's precision (a 53-bit significand, roughly 15 to 17 significant decimal digits).
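To put a number on that limit, here is a minimal sketch (not from the original post) that checks how a double stores the 17-digit value from the question. The helper name `round_trip_through_double` is made up for illustration, and the code assumes an IEEE 754 double.

```cpp
#include <cstdint>
#include <limits>

// Assumption: double is IEEE 754 binary64 with a 53-bit significand.
static_assert(std::numeric_limits<double>::digits == 53,
              "this sketch assumes a 53-bit significand");

// Hypothetical helper: store a 64-bit integer in a double and read it back.
std::int64_t round_trip_through_double(std::int64_t value) {
    const double as_double = static_cast<double>(value);  // rounds to 53 bits
    return static_cast<std::int64_t>(as_double);
}
```

Integers up to 2^53 (9007199254740992) survive the round trip unchanged. The expected result 53836153129543925 lies between 2^55 and 2^56, where adjacent doubles are 8 apart, so it comes back as a nearby multiple of 8 rather than the exact value.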

If you want to have the benefit of working accurately with values exceeding the precision range of double, you would have to implement your own custom operators that operate in your number format's domain. Otherwise you'd just store them differently, while the error you seek to avoid has already occurred in an earlier step along the way.
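As a rough illustration of operators that stay in the fixed-point domain, the sketch below adds and multiplies 32.32 values using integer arithmetic only. The type `Fixed3232` and the functions `fixed_add` and `fixed_mul` are hypothetical names, not part of the original class, and the code assumes results stay within the 32.32 range.

```cpp
#include <cstdint>

// Hypothetical 32.32 fixed-point type: raw = real value * 2^32.
struct Fixed3232 {
    std::int64_t raw;
};

// Addition needs no rescaling because both operands share the same scale.
// Assumes the sum stays within the int64 range.
Fixed3232 fixed_add(Fixed3232 a, Fixed3232 b) {
    return Fixed3232{a.raw + b.raw};
}

// Multiplication: the raw product carries a factor of 2^64, so it must be
// divided by 2^32. The full product can need 128 bits, so it is assembled
// from 32-bit halves of the magnitudes. Truncates toward zero.
// Assumes the true result fits in 32.32; otherwise the value wraps.
Fixed3232 fixed_mul(Fixed3232 a, Fixed3232 b) {
    const bool negative = (a.raw < 0) != (b.raw < 0);

    // Magnitudes as unsigned values (well defined even for INT64_MIN).
    const std::uint64_t ua = a.raw < 0
        ? std::uint64_t{0} - static_cast<std::uint64_t>(a.raw)
        : static_cast<std::uint64_t>(a.raw);
    const std::uint64_t ub = b.raw < 0
        ? std::uint64_t{0} - static_cast<std::uint64_t>(b.raw)
        : static_cast<std::uint64_t>(b.raw);

    const std::uint64_t ah = ua >> 32, al = ua & 0xFFFFFFFFu;
    const std::uint64_t bh = ub >> 32, bl = ub & 0xFFFFFFFFu;

    // (ah*2^32 + al) * (bh*2^32 + bl) / 2^32, term by term.
    const std::uint64_t magnitude =
        ((ah * bh) << 32) + ah * bl + al * bh + ((al * bl) >> 32);

    const std::int64_t result = static_cast<std::int64_t>(magnitude);
    return Fixed3232{negative ? -result : result};
}
```

With this layout, 1.5 is stored as `3LL << 31` and 2.0 as `2LL << 32`; multiplying them yields `3LL << 32`, which is 3.0, without ever passing through a double.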

Ideally, the lines should look more like:
this->m_x = static_cast<BaseDataType>(x) * static_cast<BaseDataType>(m_factor);

The whole process only makes sense if you never use unscaled values in doubles, especially not within calculations that may produce large intermediate results.

And on a side note: if you aim at more precision than double, you're not doing well using another representation that uses the exact same number of bits. The overall precision will stay the same; it'll only be distributed differently.
You might want to consider a hierarchical coordinate system, which, as I read some time ago, was used to enable the vastness yet good precision of the game Freelancer, for example.
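One way such a hierarchical scheme could look is sketched below for a single axis. This is only an illustration of the general idea, not how Freelancer actually implements it; `SectorCoord`, `normalize`, `delta`, and the sector size of 1024 units are all assumptions.

```cpp
#include <cmath>
#include <cstdint>

// Assumed sector edge length in world units; a real engine would tune this.
const double kSectorSize = 1024.0;

// Hypothetical one-axis coordinate: integer sector index plus a small offset.
struct SectorCoord {
    std::int64_t sector;  // which sector along this axis
    double local;         // small offset inside that sector
};

// Move whole sectors out of the offset so `local` stays small and precise.
SectorCoord normalize(SectorCoord c) {
    const double whole = std::floor(c.local / kSectorSize);
    c.sector += static_cast<std::int64_t>(whole);
    c.local -= whole * kSectorSize;
    return c;
}

// Distance from a to b along the axis. The sector difference is an exact
// integer, so only the small local offsets carry rounding error.
double delta(SectorCoord a, SectorCoord b) {
    return static_cast<double>(b.sector - a.sector) * kSectorSize
         + (b.local - a.local);
}
```

Because nearby objects share the same or adjacent sectors, their relative positions are computed from small numbers, which is where a double has the most fractional bits to spare.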

##### Share on other sites
It sounds like the double is being converted to an __int64 before the multiplication. If that is the case, static casting m_factor to a double should stop that from happening. You still won't get exactly 53836153129543925 (it uses more than 53 bits), but it should be close.
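The effect of the cast order can be checked with a small sketch. `BaseDataType` and `m_factor` follow the question; the function names `scale_truncating`, `scale_in_double`, and `unscale` are made up here, and `std::int64_t` stands in for `__int64`.

```cpp
#include <cstdint>

typedef std::int64_t BaseDataType;           // stands in for __int64
const BaseDataType m_factor = 4294967296LL;  // 1 << 32, as in the question

// Converting x to an integer first drops its fractional part.
BaseDataType scale_truncating(double x) {
    return static_cast<BaseDataType>(x) * m_factor;
}

// Multiplying in double first keeps the fraction until the final cast.
BaseDataType scale_in_double(double x) {
    return static_cast<BaseDataType>(x * static_cast<double>(m_factor));
}

// Scale back down by dividing, as described in the question.
double unscale(BaseDataType raw) {
    return static_cast<double>(raw) / static_cast<double>(m_factor);
}
```

For x = 12534706.185, `scale_truncating` returns 53836152334974976, the value reported in the question, while `scale_in_double` lands within a few units of 53836153129543925.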

You could improve the conversion by unpacking the double into its significand and exponent. From there you can obtain the fixed-point value by shifting the significand bits based on the exponent value. Check out the IEEE 754 format if you need an idea on how to do this.
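A minimal sketch of that idea follows, using `std::frexp` and `std::ldexp` from the standard library instead of picking the IEEE 754 bit fields apart by hand. The function name `to_fixed_32_32` is hypothetical, and the code assumes a finite input with magnitude below 2^31.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: double -> signed 32.32 fixed point by shifting the
// significand, truncating toward zero. Assumes a finite |value| < 2^31.
std::int64_t to_fixed_32_32(double value) {
    assert(std::isfinite(value) && std::fabs(value) < 2147483648.0);

    int exponent = 0;
    // frexp splits |value| into fraction * 2^exponent; for nonzero input
    // the fraction satisfies 0.5 <= fraction < 1.
    const double fraction = std::frexp(std::fabs(value), &exponent);

    // Scale the fraction to a 53-bit integer; this is exact for a double.
    const std::uint64_t significand =
        static_cast<std::uint64_t>(std::ldexp(fraction, 53));

    // |value| = significand * 2^(exponent - 53), so the 32.32 value is
    // significand * 2^(exponent - 53 + 32) = significand * 2^(exponent - 21).
    const int shift = exponent - 21;
    std::uint64_t magnitude = 0;
    if (shift >= 0) {
        magnitude = significand << shift;     // shift <= 10 given the range
    } else if (shift > -64) {
        magnitude = significand >> (-shift);  // drops bits below 2^-32
    }

    const std::int64_t result = static_cast<std::int64_t>(magnitude);
    return value < 0 ? -result : result;
}
```

For in-range values this returns the same number as multiplying by 4294967296.0 and casting, because scaling a double by a power of two is exact. The shift-based form mainly helps if you want explicit control over rounding or overflow handling.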

As Medium9 said, using a 32.32 fixed-point format will not give you more precision than a double in general. It only resolves finer steps once your values need more than about 21 integer bits, because a double's 53-bit significand then leaves fewer than 32 bits for the fraction, and even then the gain is limited to a few bits. If you really need more precision, an arbitrary-precision math library might be a better choice. Fixed-point arithmetic is better reserved for situations where you need more performance, such as on systems that do not have hardware floating-point units.
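To compare the two formats at a given magnitude, the sketch below measures the gap between adjacent doubles and sets it against the fixed 32.32 step. The function names are made up for illustration, and an IEEE 754 double is assumed.

```cpp
#include <cmath>
#include <limits>

// Gap between `value` and the next larger double (one unit in the last place).
double double_step_at(double value) {
    return std::nextafter(value, std::numeric_limits<double>::infinity()) - value;
}

// A 32.32 format has the same step everywhere: 2^-32.
double fixed_32_32_step() {
    return std::ldexp(1.0, -32);
}
```

For the x value from the question, `double_step_at(12534706.185)` is 2^-29 (about 1.9e-9), while the 32.32 step is 2^-32 (about 2.3e-10). Near 1.0 the double step is 2^-52, far finer than anything 32.32 can represent.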