# how's my fixed point code

This topic is 2543 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I'm planning on making a game that uses fixed point math for everything. Currently, I'm working on making the basic math features.
One thing I've noticed is that for some reason, existing fp libraries stick to a single precision for all values. The problem is that this leads to lots of overflow and rounding errors when doing multiplication. My idea was to use templates to track the precision of each variable so that math could be done more accurately without the runtime overhead of actual floating point emulation.

However, I'm new to template programming and fixed point math as well, so I thought I would post what I have so far to get comments on my coding style and whether I am doing things properly.

Another worry I had was that I use 64 bit variables as an intermediate in order to get more accurate results. If 64 bit variables aren't available on the target platform (GBA), will the compiler just emulate it? Is this something I need to be concerned about?

Note: the floating point conversions are just for convenience when testing. When finished, the actual game will be pure fixed point.
#include <cstdint> #include <cmath> #include "macros.h" //for static assertions typedef int32_t s32; typedef int64_t s64; //typedef uint32_t u32; //typedef uint64_t u64; template <int a, int b> struct Max{enum {val = (b<a)?a:b};}; template <int a, int b> struct Min{enum {val = (b>a)?a:b};}; template<int E> struct Fixed{ s32 val; Fixed(): val(0) {} explicit Fixed(s32 x): val(x) {} explicit Fixed(double x); static int Prec() {return E;} Fixed<E> operator-() const {return Fixed<E>(-val);} }; template<int E> //Since float->int rounds to 0 instead of -inf, the extra 0.5 must have the same sign as the argument to ensure desired rounding Fixed<E>::Fixed(double x): val(x * pow(2, E) + copysign(0.5, x)) {} template<int E> double Float(Fixed<E> x) {return (double)x.val * pow(2, -E);} // Operations /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// template<int ret, int E1, int E2> Fixed<ret> Add(Fixed<E1> a, Fixed<E2> b) { //Choose an intermediate width const int Eint = Min<E1, E2>::val + 31; const int shiftA = Eint - E1; STATICASSERT(shiftA >= 0 && shiftA < 64); const int shiftB = Eint - E2; STATICASSERT(shiftB >= 0 && shiftB < 64); s64 sum = ((s64)a.val << shiftA) + ((s64)b.val << shiftB); const int shift = Eint - ret; STATICASSERT(shift >= 0 && shift < 64); s64 sum_adjusted = sum + (1LL << shift)/2; //Add half an ulp to round to nearest instead of negative inf s64 newv = sum_adjusted >> shift; if ((sum != (newv << shift))) { std::cout << "Possible error in Add<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n"; } if ((s32)newv != newv) { std::cout << "\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"; std::cout << "Overflow in Add<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n"; } return Fixed<ret>((s32) newv); } //Addition and subtration operators don't do any scaling, so they may overflow. template<int E> Fixed<E>& operator+= (Fixed<E>& a, const Fixed<E>& b) {a.val += b.val; return a;} template<int E> Fixed<E> operator+ (Fixed<E> a, const Fixed<E>& b) {return a += b;} template<int E> Fixed<E>& operator-= (Fixed<E>& a, const Fixed<E>& b) {a.val -= b.val; return a;} template<int E> Fixed<E> operator- (Fixed<E> a, const Fixed<E>& b) {return a -= b;} template<int ret, int E1, int E2> Fixed<ret> Mult(Fixed<E1> a, Fixed<E2> b) { const int shift = E1 + E2 - ret; STATICASSERT(shift >= 0 && shift < 64); s64 p = (s64)a.val * (s64)b.val; s64 p_adjusted = p + (1LL << shift)/2; //Add half an ulp to round to nearest instead of negative inf s64 newv = p_adjusted >> shift; if ((p != (newv << shift))) { std::cout << "Possible error in Mult<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n"; } if ((s32)newv != newv) { std::cout << "\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"; std::cout << "Overflow in Mult<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n"; } return Fixed<ret>((s32) newv); } template<int E1, int E2> Fixed<E1 + E2 - 31> operator* (Fixed<E1> a, Fixed<E2> b) { return Mult<E1 + E2 - 31>(a,b); //Technically, in order to garuentee no overflow in the case of both arguments being 0x80000000, we need to shift by 32 not 31 } template<int ret, int E1, int E2> Fixed<ret> Div(Fixed<E1> a, Fixed<E2> b) { const int shift = 32 + E1 - E2 - ret; STATICASSERT(shift >= 0 && shift < 32); s64 dividend = (s64)a.val << 32; s64 divisor = (s64)b.val; s64 quotient = dividend / divisor + (1LL << shift)/2; //Add half an ulp to round to nearest instead of negative inf s64 newv = quotient >> shift; if ((dividend != divisor * (newv << shift))) { std::cout << "Rounding error in Div<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n"; s64 error = dividend - divisor * (newv << shift); std::cout << "Error " << error << '\t' << (error/divisor) << "\t" << ((double) error)/(divisor << shift) << '\n'; } if ((s32)newv != newv) { std::cout << "\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"; std::cout << "Overflow in Div<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n"; } return Fixed<ret>((s32) newv); } 

#pragma once #include "fixed.h" template<int E> struct Vec{ Fixed<E> x,y; static int Prec() {return E;} Vec(){} //Vec(s32 x, s32 y): x(x), y(y) {} Vec(double x, double y): x(x), y(y) {} Vec(Fixed<E> x, Fixed<E> y): x(x), y(y) {} }; template<int E> Vec<E> MakeVec(const Fixed<E>& a, const Fixed<E>& b) {return Vec<E>(a, b);} //Addition and subtration operators don't do any scaling, so they may overflow. template<int E> Vec<E>& operator+= (Vec<E>& a, const Vec<E>& b) {a.x += b.x; a.y += b.y; return a;} template<int E> Vec<E> operator+ (Vec<E> a, const Vec<E>& b) {return a += b;} template<int E> Vec<E>& operator-= (Vec<E>& a, const Vec<E>& b) {a.x -= b.x; a.y -= b.y; return a;} template<int E> Vec<E> operator- (Vec<E> a, const Vec<E>& b) {return a -= b;} template<int E> Vec<E> Rot90(const Vec<E>& a) {return Vec<E>(a.y, -a.x);} template<int ret, int E1, int E2> Fixed<ret> Dot(const Vec<E1>& a, const Vec<E2>& b) { const int shift = E1 + E2 - ret; STATICASSERT(shift >= 0 && shift < 64); s64 p = (s64)a.x.val * (s64)b.x.val; p += (s64)a.y.val * (s64)b.y.val; s64 newv = p >> shift; if ((p != (newv << shift))) { std::cout << "Rounding error in Dot<" << ret << ',' << E1 << ',' << E2 << ">\n"; } if ((s32)newv != newv) { std::cout << "\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"; std::cout << "Overflow in Dot<" << ret << ',' << E1 << ',' << E2 << ">\n"; } return Fixed<ret>((s32) newv); } 

##### Share on other sites
Your definition for operator* seems pretty weird. Multiply two numbers with 16 fractional places and you'll get one with 1 fractional place. Multiply two numbers with 8 fractional places and you'll get one with... negative 15 fractional places? I think you'll want operator* to have one template argument, just like operator+ and operator-.

##### Share on other sites
I made operator * choose the largest precision that can't overflow (ignoring the case of signed minimum).

If you have two numbers with 16 fractional bits each, then they can each be up to 2^15, meaning that the product can be up to 2^30. Therefore, there's only one bit left over for the fractional part.

##### Share on other sites
I don't know what exactly you are trying to achieve with this. Perhaps you could keep the number of fractional bits you are using together with the value. If only we could get hardware support for it so things would execute fast... Oh, wait! That's what floating-point types do!

Now, really. I understand the desire to express something like a position with a type that doesn't have different resolution at different parts of the level, and a fixed-point type makes sense for that. But I can't imagine a scenario where this type of arithmetic you are proposing is what you want.

What problems of using floating-point types for everything are you trying to fix?

##### Share on other sites
Well for one thing, the GBA doesn't even have floating point support.

##### Share on other sites

I made operator * choose the largest precision that can't overflow (ignoring the case of signed minimum).

Losing precision can be just as bad a problem. For example if you multiply 3/8 by 1 with 16 fractional places, you'll get 1/2. Overflow is generally more avoidable than precision loss, and it's also more debuggable because your errors are more noticeable. It's more convenient to optimize the default case for when you're multiplying by numbers close to 1, or when you expect the answer to be close to 1. (Where "close" means anywhere between 1/50 and 50.) You'll want to be able to do trivial multiplications, like multiplying by 2, for instance, without having to write out Mult<...>.

##### Share on other sites
Im also doing a fixed point class...

Pretty much addition and subtraction will be just normal integer addition and subtraction.
Multiplication is basically converting A to 64 bit, then multiplying by B, and then shifting right by the larger accuracy of the 2 and then making back to 32 bit.
Division is making A 64 bit, shifting left by the larger accuracy, and then dividing and making it back to 32 bit.

Then i have some comparision operators and functions to get rounded value and raw shifted value and raw value shifted to desired accuracy.

its like
template <class Type,class TypeX2,const int Fractions>
class fixed
{
const int Fracs;
fixed():Fracs(Fractions) {}
blabla
};

Type is the value container type (how many bits)
TypeX2 is a value container twice the size of Type (used in division and multiplication... i dont want to use a 64 bit integer to multiply 2 chars -.-)
Frac is a const where i put Fractions (So i can access it more easily, it doesnt add to the size of the thing... at least in the tests i made >_>)

Oh, anyone know any way to get TypeX2 from Type as i dont like having to write them both in... Of course, you could optimize your code by using 32 bits for typex2 with 32bit value, if youre not going to multiply big values...

i also made a ton of definitions like

#define fixed16_16 fixed<templatestuffblah>

so you have a few predefined setups... I think i should change that to typedef... for some reason...

The precision of the values wont change when doing operations, unless you do something like
fixed32_0value=fixed16_16value
in wich case itll convert it... Im trying to make it as fast as possible, im not going to make it more secure than normal integers, though i could add 1 more value to the template so you can use overflow checking.
I do check for not dividing by 0, and im propably going to make it set the value to the highest possible value...

##### Share on other sites

Losing precision can be just as bad a problem. For example if you multiply 3/8 by 1 with 16 fractional places, you'll get 1/2. Overflow is generally more avoidable than precision loss, and it's also more debuggable because your errors are more noticeable. It's more convenient to optimize the default case for when you're multiplying by numbers close to 1, or when you expect the answer to be close to 1. (Where "close" means anywhere between 1/50 and 50.) You'll want to be able to do trivial multiplications, like multiplying by 2, for instance, without having to write out Mult<...>.

Well currently, it prints out messages warning for both overflow and rounding errors. Also, I consider values close to the maximum to be the default case. However I'm not actually using operator * anyway. Maybe I should just take it out.

##### Share on other sites
The big problem with your fixed-point code is that it's not really fixed-point anymore - it is, for all intents and purposes, floating-point, with the radix point encoded in the data type rather than as a bit pattern in the data (with the added drawback that, since templates are resolved at compile-time, it's even less flexible with regards to operands with mismatched radix points). At that point, you might as well go whole-hog and implement a real floating-point type, because you're already incurring the performance of one, and losing run-time flexibility as well. You're sacrificing all the advantages of fixed-point over floating-point, and you're not even getting all the advantages of floating-point in return...

The point of fixed-point, after all, is that it allows you to have fractional precision while still being able to use fast integer math rather than slow floating-point math (i.e. it's not particularly useful on platforms that have fast floating-point) - something which your system doesn't allow because the radix point isn't truly fixed.

For example, with real fixed point, comparing numbers is exactly the same as comparing integers - your system instead has to shift the numbers to align their radix points before comparing, which is exactly what floating-point code does (the same is true for adding and subtracting), and since your design pretty much mandates 64-bit intermediates, it's going to be about an order of magnitude slower than conventional fixed-point. Fixed point multiplication is integer multiplication followed by a right-shift - your code first decides on a new radix point for the result, then multiplies the numbers and shifts the result, while floating-point multiplication is actually a bit easier: multiply the mantissas then add the exponents together (it's slightly more complex for IEEE 754 floating-point math, because of the bias on the exponent and the implied "1" in front of the stored mantissa).

Also, why worry about trying to guarantee that multiplication won't overflow, when no arithmetic operation on built-in types can make that same guarantee except for division? (Which is, of course, a side-effect of the fact that C and C++ are NOT as close to the hardware as some people like to think. Addition and subtraction can never overflow by more than one bit, which is stored in the CPU's carry or overflow flag, and is thus inaccessible in C, and multiplication can never overflow by more than the operand size, which is why nearly every CPU with a multiply instruction stores the result in a PAIR of registers, the most-significant-word of which is inaccessible without "tricking" the compiler.)

As an aside, I sure hope your compiler is smart enough to realize what this line:
s64 p = (s64)a.val * (s64)b.val;

actually does, and just emits a single UMULL instruction, rather than actually sign-extending the variables and performing a slow multiple-precision multiplication...

##### Share on other sites

The big problem with your fixed-point code is that it's not really fixed-point anymore - it is, for all intents and purposes, floating-point, with the radix point encoded in the data type rather than as a bit pattern in the data (with the added drawback that, since templates are resolved at compile-time, it's even less flexible with regards to operands with mismatched radix points). At that point, you might as well go whole-hog and implement a real floating-point type, because you're already incurring the performance of one, and losing run-time flexibility as well. You're sacrificing all the advantages of fixed-point over floating-point, and you're not even getting all the advantages of floating-point in return...

Since the radix is fixed at compile time, it's still much faster than software floating points and it has no space overhead either, apart from a possible increase in code size. In fact, it seems to me that the generated code will be identical to traditional fixed point except that it uses different shifts.

Also, I don't see how it's possible to do things any differently without sacrificing an unacceptable amount of accuracy.

For example, with real fixed point, comparing numbers is exactly the same as comparing integers - your system instead has to shift the numbers to align their radix points before comparing, which is exactly what floating-point code does (the same is true for adding and subtracting), and since your design pretty much mandates 64-bit intermediates, it's going to be about an order of magnitude slower than conventional fixed-point. Fixed point multiplication is integer multiplication followed by a right-shift - your code first decides on a new radix point for the result, then multiplies the numbers and shifts the result, while floating-point multiplication is actually a bit easier: multiply the mantissas then add the exponents together (it's slightly more complex for IEEE 754 floating-point math, because of the bias on the exponent and the implied "1" in front of the stored mantissa).

If you don't use 64bit intermediates, you're severely limited in the amount of operations you can do without errors or overflow. Heck, in a 16.16 format, even 1*1 overflows.

• 36
• 15
• 9
• 23
• 10