# how's my fixed point code

## Recommended Posts

Storyyeller    215
I'm planning on making a game that uses fixed point math for everything. Currently, I'm working on making the basic math features.
One thing I've noticed is that for some reason, existing fp libraries stick to a single precision for all values. The problem is that this leads to lots of overflow and rounding errors when doing multiplication. My idea was to use templates to track the precision of each variable so that math could be done more accurately without the runtime overhead of actual floating point emulation.

However, I'm new to template programming and fixed point math as well, so I thought I would post what I have so far to get comments on my coding style and whether I am doing things properly.

Another worry I had was that I use 64 bit variables as an intermediate in order to get more accurate results. If 64 bit variables aren't available on the target platform (GBA), will the compiler just emulate it? Is this something I need to be concerned about?

Note: the floating point conversions are just for convenience when testing. When finished, the actual game will be pure fixed point.
[code]#include <cstdint>
#include <cmath>
#include <iostream> //for the debug error reporting below
#include "macros.h" //for static assertions

typedef int32_t s32;
typedef int64_t s64;
//typedef uint32_t u32;
//typedef uint64_t u64;

template <int a, int b> struct Max{enum {val = (b<a)?a:b};};
template <int a, int b> struct Min{enum {val = (b>a)?a:b};};

template<int E>
struct Fixed{
s32 val;

Fixed(): val(0) {}
explicit Fixed(s32 x): val(x) {}
explicit Fixed(double x);
static int Prec() {return E;}

Fixed<E> operator-() const {return Fixed<E>(-val);}
};

template<int E> //Since float->int rounds to 0 instead of -inf, the extra 0.5 must have the same sign as the argument to ensure desired rounding
Fixed<E>::Fixed(double x): val(x * pow(2, E) + copysign(0.5, x)) {}

template<int E>
double Float(Fixed<E> x) {return (double)x.val * pow(2, -E);}

// Operations ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

template<int ret, int E1, int E2>
Fixed<ret> Add(Fixed<E1> a, Fixed<E2> b)
{
//Choose an intermediate width
const int Eint = Min<E1, E2>::val + 31;

const int shiftA = Eint - E1; STATICASSERT(shiftA >= 0 && shiftA < 64);
const int shiftB = Eint - E2; STATICASSERT(shiftB >= 0 && shiftB < 64);

s64 sum = ((s64)a.val << shiftA) + ((s64)b.val << shiftB);
const int shift = Eint - ret; STATICASSERT(shift >= 0 && shift < 64);

s64 sum_adjusted = sum + (1LL << shift)/2; //Add half an ulp to round to nearest instead of negative inf
s64 newv = sum_adjusted >> shift;

if ((sum != (newv << shift)))
{
std::cout << "Possible error in Add<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n";
}
if ((s32)newv != newv)
{
std::cout << "\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n";
std::cout << "Overflow in Add<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n";
}

return Fixed<ret>((s32) newv);
}

//Addition and subtraction operators don't do any scaling, so they may overflow.
template<int E> Fixed<E>& operator+= (Fixed<E>& a, const Fixed<E>& b) {a.val += b.val; return a;}
template<int E> Fixed<E> operator+ (Fixed<E> a, const Fixed<E>& b) {return a += b;}
template<int E> Fixed<E>& operator-= (Fixed<E>& a, const Fixed<E>& b) {a.val -= b.val; return a;}
template<int E> Fixed<E> operator- (Fixed<E> a, const Fixed<E>& b) {return a -= b;}

template<int ret, int E1, int E2>
Fixed<ret> Mult(Fixed<E1> a, Fixed<E2> b)
{
const int shift = E1 + E2 - ret; STATICASSERT(shift >= 0 && shift < 64);

s64 p = (s64)a.val * (s64)b.val;
s64 p_adjusted = p + (1LL << shift)/2; //Add half an ulp to round to nearest instead of negative inf
s64 newv = p_adjusted >> shift;

if ((p != (newv << shift)))
{
std::cout << "Possible error in Mult<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n";
}
if ((s32)newv != newv)
{
std::cout << "\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n";
std::cout << "Overflow in Mult<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n";
}

return Fixed<ret>((s32) newv);
}

template<int E1, int E2>
Fixed<E1 + E2 - 31> operator* (Fixed<E1> a, Fixed<E2> b)
{
return Mult<E1 + E2 - 31>(a,b); //Technically, to guarantee no overflow when both arguments are 0x80000000, we would need to shift by 32, not 31
}

template<int ret, int E1, int E2>
Fixed<ret> Div(Fixed<E1> a, Fixed<E2> b)
{
const int shift = 32 + E1 - E2 - ret; STATICASSERT(shift >= 0 && shift < 32);

s64 dividend = (s64)a.val << 32;
s64 divisor = (s64)b.val;
s64 quotient = dividend / divisor + (1LL << shift)/2; //Add half an ulp to round to nearest instead of negative inf
s64 newv = quotient >> shift;

if ((dividend != divisor * (newv << shift)))
{
std::cout << "Rounding error in Div<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n";
s64 error = dividend - divisor * (newv << shift);
std::cout << "Error " << error << '\t' << (error/divisor) << "\t" << ((double) error)/(divisor << shift) << '\n';
}
if ((s32)newv != newv)
{
std::cout << "\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n";
std::cout << "Overflow in Div<" << ret << ',' << E1 << ',' << E2 << "> with a = " << a.val << " and b = " << b.val << "\n";
}

return Fixed<ret>((s32) newv);
}

[/code]

[code]#pragma once
#include "fixed.h"

template<int E>
struct Vec{
Fixed<E> x,y;

static int Prec() {return E;}

Vec(){}
//Vec(s32 x, s32 y): x(x), y(y) {}
Vec(double x, double y): x(x), y(y) {}
Vec(Fixed<E> x, Fixed<E> y): x(x), y(y) {}
};

template<int E>
Vec<E> MakeVec(const Fixed<E>& a, const Fixed<E>& b) {return Vec<E>(a, b);}

//Addition and subtraction operators don't do any scaling, so they may overflow.
template<int E> Vec<E>& operator+= (Vec<E>& a, const Vec<E>& b) {a.x += b.x; a.y += b.y; return a;}
template<int E> Vec<E> operator+ (Vec<E> a, const Vec<E>& b) {return a += b;}
template<int E> Vec<E>& operator-= (Vec<E>& a, const Vec<E>& b) {a.x -= b.x; a.y -= b.y; return a;}
template<int E> Vec<E> operator- (Vec<E> a, const Vec<E>& b) {return a -= b;}

template<int E>
Vec<E> Rot90(const Vec<E>& a) {return Vec<E>(a.y, -a.x);}

template<int ret, int E1, int E2>
Fixed<ret> Dot(const Vec<E1>& a, const Vec<E2>& b)
{
const int shift = E1 + E2 - ret; STATICASSERT(shift >= 0 && shift < 64);

s64 p = (s64)a.x.val * (s64)b.x.val;
p += (s64)a.y.val * (s64)b.y.val;
s64 newv = p >> shift;

if ((p != (newv << shift)))
{
std::cout << "Rounding error in Dot<" << ret << ',' << E1 << ',' << E2 << ">\n";
}
if ((s32)newv != newv)
{
std::cout << "\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n";
std::cout << "Overflow in Dot<" << ret << ',' << E1 << ',' << E2 << ">\n";
}

return Fixed<ret>((s32) newv);
}
[/code]

sam_hughes    132
Your definition for operator* seems pretty weird. Multiply two numbers with 16 fractional places and you'll get one with 1 fractional place. Multiply two numbers with 8 fractional places and you'll get one with... negative 15 fractional places? I think you'll want operator* to have one template argument, just like operator+ and operator-.

Storyyeller    215
I made operator * choose the largest precision that can't overflow (ignoring the case of signed minimum).

If you have two numbers with 16 fractional bits each, then they can each be up to 2^15, meaning that the product can be up to 2^30. Therefore, there's only one bit left over for the fractional part.

alvaro    21246
I don't know what exactly you are trying to achieve with this. Perhaps you could keep the number of fractional bits you are using together with the value. If only we could get hardware support for it so things would execute fast... Oh, wait! That's what floating-point types do!

Now, really. I understand the desire to express something like a position with a type that doesn't have different resolution at different parts of the level, and a fixed-point type makes sense for that. But I can't imagine a scenario where this type of arithmetic you are proposing is what you want.

What problems of using floating-point types for everything are you trying to fix?

Storyyeller    215
Well for one thing, the GBA doesn't even have floating point support.

sam_hughes    132
[quote name='Storyyeller' timestamp='1306729267' post='4817364']
I made operator * choose the largest precision that can't overflow (ignoring the case of signed minimum).
[/quote]

Losing precision can be just as bad a problem. For example if you multiply 3/8 by 1 with 16 fractional places, you'll get 1/2. Overflow is generally more avoidable than precision loss, and it's also more debuggable because your errors are more noticeable. It's more convenient to optimize the default case for when you're multiplying by numbers close to 1, or when you expect the answer to be close to 1. (Where "close" means anywhere between 1/50 and 50.) You'll want to be able to do trivial multiplications, like multiplying by 2, for instance, without having to write out Mult<...>.

Waterlimon    4398
I'm also doing a fixed point class...

Addition and subtraction are just normal integer addition and subtraction.
Multiplication converts A to 64 bits, multiplies by B, shifts right by the larger accuracy of the two, then narrows back to 32 bits.
Division makes A 64 bits, shifts it left by the larger accuracy, then divides and narrows back to 32 bits.

Then I have some comparison operators, plus functions to get the rounded value, the raw shifted value, and the raw value shifted to a desired accuracy.

It's like:
[code]template <class Type, class TypeX2, const int Fractions>
class fixed
{
    const int Fracs;
    fixed(): Fracs(Fractions) {}
    //blabla
};[/code]

Type is the value container type (how many bits).
TypeX2 is a value container twice the size of Type, used in division and multiplication (I don't want to use a 64 bit integer just to multiply two chars).
Fracs is a const copy of Fractions so I can access it more easily; it doesn't add to the size of the object, at least in the tests I made.

Oh, does anyone know a way to derive TypeX2 from Type? I don't like having to write both in. Of course, you could optimize your code by using 32 bits for TypeX2 with a 32 bit value, if you're not going to multiply big values...

I also made a ton of definitions like

#define fixed16_16 fixed<templatestuffblah>

so you have a few predefined setups... I think I should change those to typedefs.

The precision of the values won't change when doing operations, unless you do something like
fixed32_0value = fixed16_16value
in which case it'll convert. I'm trying to make it as fast as possible; I'm not going to make it any safer than normal integers, though I could add one more template parameter to enable overflow checking.
I do check for division by 0, and I'm probably going to make that case set the value to the highest possible value...

Storyyeller    215
[quote name='sam_hughes' timestamp='1306735839' post='4817401']
Losing precision can be just as bad a problem. For example if you multiply 3/8 by 1 with 16 fractional places, you'll get 1/2. Overflow is generally more avoidable than precision loss, and it's also more debuggable because your errors are more noticeable. It's more convenient to optimize the default case for when you're multiplying by numbers close to 1, or when you expect the answer to be close to 1. (Where "close" means anywhere between 1/50 and 50.) You'll want to be able to do trivial multiplications, like multiplying by 2, for instance, without having to write out Mult<...>.
[/quote]

Well currently, it prints out warning messages for both overflow and rounding errors. Also, I consider values close to the maximum to be the default case. However, I'm not actually using operator* anyway. Maybe I should just take it out.

Anthony Serrano    3285
The big problem with your fixed-point code is that it's not really fixed-point anymore - it is, for all intents and purposes, floating-point, with the radix point encoded in the data type rather than as a bit pattern in the data (with the added drawback that, since templates are resolved at compile-time, it's even less flexible with regards to operands with mismatched radix points). At that point, you might as well go whole-hog and implement a real floating-point type, because you're already incurring the performance of one, and losing run-time flexibility as well. You're sacrificing all the advantages of fixed-point over floating-point, and you're not even getting all the advantages of floating-point in return...

The point of fixed-point, after all, is that it allows you to have fractional precision while still being able to use fast integer math rather than slow floating-point math (i.e. it's not particularly useful on platforms that have fast floating-point) - something which your system doesn't allow because the radix point isn't truly fixed.

For example, with real fixed point, comparing numbers is exactly the same as comparing integers - your system instead has to shift the numbers to align their radix points before comparing, which is exactly what floating-point code does (the same is true for adding and subtracting), and since your design pretty much mandates 64-bit intermediates, it's going to be about an order of magnitude slower than conventional fixed-point. Fixed point multiplication is integer multiplication followed by a right-shift - your code first decides on a new radix point for the result, then multiplies the numbers and shifts the result, while floating-point multiplication is actually a bit easier: multiply the mantissas then add the exponents together (it's slightly more complex for IEEE 754 floating-point math, because of the bias on the exponent and the implied "1" in front of the stored mantissa).

Also, why worry about trying to guarantee that multiplication won't overflow, when no arithmetic operation on built-in types can make that same guarantee except for division? (Which is, of course, a side-effect of the fact that C and C++ are NOT as close to the hardware as some people like to think. Addition and subtraction can never overflow by more than one bit, which is stored in the CPU's carry or overflow flag, and is thus inaccessible in C, and multiplication can never overflow by more than the operand size, which is why nearly every CPU with a multiply instruction stores the result in a PAIR of registers, the most-significant-word of which is inaccessible without "tricking" the compiler.)

As an aside, I sure hope your compiler is smart enough to realize what this line:
s64 p = (s64)a.val * (s64)b.val;

actually does, and just emits a single UMULL instruction, rather than actually sign-extending the variables and performing a slow multiple-precision multiplication...

Storyyeller    215
[quote name='Anthony Serrano' timestamp='1306806137' post='4817720']
The big problem with your fixed-point code is that it's not really fixed-point anymore - it is, for all intents and purposes, floating-point, with the radix point encoded in the data type rather than as a bit pattern in the data (with the added drawback that, since templates are resolved at compile-time, it's even less flexible with regards to operands with mismatched radix points). At that point, you might as well go whole-hog and implement a real floating-point type, because you're already incurring the performance of one, and losing run-time flexibility as well. You're sacrificing all the advantages of fixed-point over floating-point, and you're not even getting all the advantages of floating-point in return...
[/quote]

Since the radix is fixed at compile time, it's still much faster than software floating point, and it has no space overhead either, apart from a possible increase in code size. In fact, it seems to me that the generated code will be identical to traditional fixed point, except that it uses different shifts.

Also, I don't see how it's possible to do things any differently without sacrificing an unacceptable amount of accuracy.

[quote name='Anthony Serrano' timestamp='1306806137' post='4817720']
For example, with real fixed point, comparing numbers is exactly the same as comparing integers - your system instead has to shift the numbers to align their radix points before comparing, which is exactly what floating-point code does (the same is true for adding and subtracting), and since your design pretty much mandates 64-bit intermediates, it's going to be about an order of magnitude slower than conventional fixed-point. Fixed point multiplication is integer multiplication followed by a right-shift - your code first decides on a new radix point for the result, then multiplies the numbers and shifts the result, while floating-point multiplication is actually a bit easier: multiply the mantissas then add the exponents together (it's slightly more complex for IEEE 754 floating-point math, because of the bias on the exponent and the implied "1" in front of the stored mantissa).[/quote]

If you don't use 64 bit intermediates, you're severely limited in the number of operations you can do without errors or overflow. Heck, in a 16.16 format, even 1*1 overflows.

alvaro    21246
[quote name='Anthony Serrano' timestamp='1306806137' post='4817720']
Also, why worry about trying to guarantee that multiplication won't overflow, when no arithmetic operation on built-in types can make that same guarantee except for division?
[/quote]

Just to nitpick on that, signed integer division *can* result in an overflow. This was a strange question I got when I interviewed for my current job.

fastcall22    10838
[quote name='alvaro' timestamp='1306813205' post='4817749']
Just to nitpick on that, signed integer division *can* result in an overflow. This was a strange question I got when I interviewed for my current job.
[/quote]

[spoiler]
std::numeric_limits<int>::min() / (-1)
[/spoiler]

Right?

Storyyeller    215
Yeah, signed min is always a pain. I think I'm just going to ensure that none of my variables ever ends up with that value.

Anthony Serrano    3285
[quote name='Storyyeller' timestamp='1306806746' post='4817723']
[quote name='Anthony Serrano' timestamp='1306806137' post='4817720']
For example, with real fixed point, comparing numbers is exactly the same as comparing integers - your system instead has to shift the numbers to align their radix points before comparing, which is exactly what floating-point code does (the same is true for adding and subtracting), and since your design pretty much mandates 64-bit intermediates, it's going to be about an order of magnitude slower than conventional fixed-point. Fixed point multiplication is integer multiplication followed by a right-shift - your code first decides on a new radix point for the result, then multiplies the numbers and shifts the result, while floating-point multiplication is actually a bit easier: multiply the mantissas then add the exponents together (it's slightly more complex for IEEE 754 floating-point math, because of the bias on the exponent and the implied "1" in front of the stored mantissa).[/quote]

If you don't use 64bit intermediates, you're severely limited in the amount of operations you can do without errors or overflow. Heck, in a 16.16 format, even 1*1 overflows.
[/quote]

Well I was referring specifically to the intermediates in the addition/subtraction code. With a single, predefined radix point, fixed-point addition, for example, is just a single integer add instruction (with silent overflow at the C language level - of course, built-in integer types already overflow silently), whereas your code mandates at least two double-word shifts (2 instructions each) and a double-word add (another 2 instructions), for a total of 6 instructions not including error-reporting.

As far as multiplication, the CPU can automatically generate the 64-bit intermediate (the SMULL instruction performs a signed 32-bit*32-bit=64-bit multiply), but you have to "trick" the compiler to actually access all of it by casting the operands to 64-bit before multiplying - which, if the compiler is smart, will just be reduced to a single SMULL instruction (followed by the requisite shifting, which takes two instructions), but if the compiler is dumb, it could compile to two sign-extensions (3 instructions each), a double-word by double-word multiply (four multiplies and four additions total), and then the final shift (two instructions), for a total of 16 instructions compared to the optimal solution's 3. (Though this is somewhat moot if you're not performing many fixed-point multiplications.)

What I would recommend is, rather than having each operand type and the return type all be template parameters for the arithmetic functions, just have a single template parameter that dictates the type of both operands and the return value, with additional templates to explicitly convert from one radix point to another - this will have the double benefit of increasing the performance of fixed-point math as well as avoiding what amounts to expensive implicit type conversions.

Storyyeller    215
I've already got the addition and subtraction operators for traditional fixed point addition behavior. The templated functions are intended for more complex calculations where accuracy matters. And obviously, the error reporting code is just for testing purposes. I'm going to take it out when I'm done.

Storyyeller    215
By the way, does anyone have recommendations for testing? The problem is that it seems like I'll have to program the rest of the game before I can test any of the physics code.

What would be nice is some sort of lightweight, easy to use library that can visualize shapes, so that I can test the physics without implementing game logic.
