float vs double

Most of the tutorials I've read use floats. Why don't they use doubles instead? Is the increase in memory from using doubles significant? Do most modern commercial games use floats or doubles? I was thinking of using doubles to do the calculations and saving the final result as a float. Is this a good idea? Or should I not bother and just use either floats or doubles? Thanks.
Most of the time people don't need the extra precision afforded by doubles. With floats you can already store quite a wide range of numbers. Also, I don't have any evidence to back it up, but I believe I've heard floats are a bit faster.
Floats are likely faster since they are 32 bits and thus the same size as the memory bus on 32-bit systems, plus fewer calculations need to be done by the FPU. Also, using doubles and then storing the results in a float is sort of silly, as you'll lose most of the extra precision during the conversion.
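Neither of us has posted actual numbers, so take the speed claim with a grain of salt. If you want to measure it yourself, here's a rough sketch of a timing test (my own, purely illustrative; the array size and loop body are arbitrary):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Sum a big array once with float and once with double and report the time.
template <typename T>
double sumSeconds(const std::vector<T>& data, T& result)
{
    auto start = std::chrono::steady_clock::now();
    T total = 0;
    for (T v : data)
        total += v * T(1.0001);       // a little arithmetic per element
    auto stop = std::chrono::steady_clock::now();
    result = total;                   // keep the sum live so it isn't optimized away
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    const std::size_t n = 10000000;
    std::vector<float>  f(n, 1.0f);
    std::vector<double> d(n, 1.0);

    float  ftotal = 0;
    double dtotal = 0;
    double fs = sumSeconds(f, ftotal);
    double ds = sumSeconds(d, dtotal);
    std::printf("float : %f s (sum %f)\n", fs, ftotal);
    std::printf("double: %f s (sum %f)\n", ds, dtotal);
}

Whatever this prints on one machine won't necessarily match another, so measure with your real workload and your real compiler flags.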
Mmm, I see. OK, thanks. I'll just stick to floats then.
I'm not an expert on these matters, but I don't believe you need to commit to one or the other. One solution is a typedef:

typedef float Scalar;

If you want to compare performance or switch to some other type, just replace float with the desired type. A more flexible method is to make the type a template parameter:

template <class Scalar = float>
class Vector3
{
public:
    Scalar x, y, z;
};

typedef Vector3<> Vector3f;
typedef Vector3<double> Vector3d;

Float and double are the most obvious choices for type, but the above method leaves the door open for other options, such as a custom rational number class.
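For what it's worth, here's a minimal sketch of how those typedefs might be used; the dot() helper is just something I made up for illustration:

#include <iostream>

template <class Scalar = float>
class Vector3
{
public:
    Scalar x, y, z;
};

typedef Vector3<>       Vector3f;
typedef Vector3<double> Vector3d;

// Hypothetical helper, just to show that code written against Scalar
// works for either instantiation.
template <class Scalar>
Scalar dot(const Vector3<Scalar>& a, const Vector3<Scalar>& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

int main()
{
    Vector3f f = { 1.0f, 2.0f, 3.0f };
    Vector3d d = { 1.0,  2.0,  3.0  };
    std::cout << dot(f, f) << " " << dot(d, d) << std::endl;   // prints "14 14"
}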
As far as I know, the float/double choice matters most in rendering, where floats consume less precious memory and less main memory <-> video memory bandwidth. You should DEFINITELY use double in mathematical calculations, as float will give you unexpected results where double will be OK.
Using doubles for the calculation and returning a float can make sense. Just because you have eight digits doesn't mean you have eight correct digits; that's basically the difference between precision and significance. Most often, though, the FPU is already doing that for you and there is no reason for the programmer to do it. Generally, just choose one format and use it. My knowledge of CPUs is limited, but I think you might take a performance hit switching formats.
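As a concrete illustration of "calculate in double, hand back a float" (my own made-up example, not something the FPU does for you automatically): summing many small values in a float accumulator drifts, while accumulating in double and converting at the end keeps the digits a float can actually hold.

#include <cstdio>

int main()
{
    // Sum 10 million copies of 0.1. The exact answer is 1,000,000.
    const int n = 10000000;

    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < n; ++i)
    {
        fsum += 0.1f;    // error builds up in the float accumulator
        dsum += 0.1;     // double keeps enough spare digits
    }

    float fromDouble = (float)dsum;   // final result stored as a float

    std::printf("float accumulator : %f\n", fsum);
    std::printf("double -> float   : %f\n", fromDouble);
}

The float accumulator ends up noticeably off, while the double-then-float result is as close to 1,000,000 as a float can represent.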
Keys to success: Ability, ambition and opportunity.
The performance hit with doubles is mostly a 32-bit platform problem, although 64-bit platforms use only 48 bits for addresses, since that addressable space is already large enough.

But I don't know about the memory bus.

Floats are defined by the IEEE 754 standard:
1 bit sign
8 bit exponent
23 bit mantissa


Usually floats are stored in normalized form, which means:

1.mantissa * 2^exponent

The 8 exponent bits are stored with a bias:

Exponent = Exp8Bit - 127

so you can shift the bits in front of or behind the binary point to the left or right (2^-n or 2^+n).
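If you want to poke at those fields yourself, here's a small sketch of my own (not from the spec) that pulls the sign, biased exponent, and mantissa bits out of a float:

#include <cstdio>
#include <cstring>

int main()
{
    float f = 4096.0f;                     // 2^12, the example used further down

    unsigned int bits;
    std::memcpy(&bits, &f, sizeof bits);   // reinterpret the 32 bits safely

    unsigned int sign     = bits >> 31;            // 1 bit
    unsigned int exp8bit  = (bits >> 23) & 0xFF;   // 8 bits, biased by 127
    unsigned int mantissa = bits & 0x7FFFFF;       // 23 bits, the leading "1." is implicit

    std::printf("sign     = %u\n", sign);
    std::printf("exponent = %u (unbiased: %d)\n", exp8bit, (int)exp8bit - 127);
    std::printf("mantissa = 0x%06X\n", mantissa);
}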

When adding two floats, you align the exponents and add the 1.mantissa parts (optionally forming a two's complement for subtractions).


Now an example:

4096 = 2^12 = 1_0000_0000_0000
so the float representation is
1.0000_0000_0000_XXXX_XXXX_XXX * 2^12

As you see, you shift 12 bits, which means 12 bits of the mantissa are used up by the value in front of the binary point,
so you have 11 bits left, which gives a minimum precision of 0.00048828125f,
calculated as follows:
1/2^11 == 2^-11

You get the fractional value of the mantissa by dividing the stored mantissa bits by 2^23.

As you see, you lose quite a bit of precision with larger values:
4096*4096*4096 means shifting 36 bits, but the mantissa only has 23 bits, so you even lose precision in front of the binary point.
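Here's that loss in action with the same 4096 example (a tiny sketch of my own):

#include <cfloat>
#include <cstdio>

int main()
{
    float big   = 4096.0f;    // 2^12: only 23 - 12 = 11 mantissa bits remain for the fraction
    float small = 0.0001f;    // well below the ~0.000488 (2^-11) step size at this magnitude

    float sum = big + small;  // rounds straight back to 4096.0f

    std::printf("4096 + 0.0001 as float = %.7f\n", sum);
    std::printf("spacing at 4096        = %.11f\n", 4096.0f * FLT_EPSILON);  // 2^-11 = 0.00048828125
}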


But I think, as already stated above, the FPU probably uses higher-precision floats internally, which in fact isn't that expensive, since you only need to add the 24 significant bits of a 32-bit float, the exponent stays the same, and at the end you shift left or right to renormalize the result to the 1.mantissa form.


Hope that helps

P.S.: A good way to increase precision is to reorder the operations so that you don't get values that are too large. So instead of

4096.0345345^3 / 2048 you could do (4096.0345345/2048) * 4096.0345345^2,
although this is usually not possible at runtime.
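The same reordering idea is easiest to see with sums (a made-up sketch of mine, not the exact power example above): adding small values directly onto a large one loses them, while summing the small values among themselves first keeps them.

#include <cstdio>

int main()
{
    float big = 1.0e8f;      // float spacing at 1e8 is 8.0, so a lone +1.0f is rounded away

    // Large value first: every +1.0f disappears against the big number.
    float a = big;
    for (int i = 0; i < 1000; ++i)
        a += 1.0f;

    // Reordered: sum the small values first, add the big one last.
    float small = 0.0f;
    for (int i = 0; i < 1000; ++i)
        small += 1.0f;
    float b = big + small;

    std::printf("big first   : %.1f\n", a);   // 100000000.0
    std::printf("small first : %.1f\n", b);   // 100001000.0
}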
http://www.8ung.at/basiror/theironcross.html
I just looked something up: SSE2 supports 128-bit registers that can perform two double-precision operations in one step,

so you can stick with simple floats as long as you compile with the latest processor packs for VC++.

For gcc, read the man pages.

For VC++:

/G7 optimizes code for Intel and AMD CPUs.
/arch:SSE or /arch:SSE2 makes use of SSE and SSE2. SSE3 might work similarly, but the compilers I'm using at the moment don't support SSE3, so I can't say.
http://www.8ung.at/basiror/theironcross.html
Shouldn't you try to implement your own floating-point class?

Floats and doubles can't represent every possible number exactly - i.e. the value stored in memory is only an approximation of the intended number - which leads to cases where:

#include <iostream>
using namespace std;

int main()
{
    float a = 2.501f;
    a *= 1.5134f;

    if (a == 3.7850134)
        cout << "Expected value" << endl;
    else
        cout << "Unexpected value" << endl;
}


would print "Unexpected value".
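Rather than writing a whole number class, the usual workaround is to compare with a tolerance. A minimal sketch, where nearlyEqual() and its epsilon are arbitrary choices of mine:

#include <cmath>
#include <cstdio>

// Hypothetical helper: treat two floats as equal if they differ by less
// than a small tolerance. Picking a good tolerance is problem-specific.
bool nearlyEqual(float a, float b, float epsilon = 1e-5f)
{
    return std::fabs(a - b) < epsilon;
}

int main()
{
    float a = 2.501f;
    a *= 1.5134f;

    if (nearlyEqual(a, 3.7850134f))
        std::printf("Expected value\n");   // the tolerant comparison succeeds
    else
        std::printf("Unexpected value\n");
}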

Mark Ingram http://www.mark-ingram.com
