# float vs double


## Recommended Posts

Most of the tutorials I've read use floats. Why don't they use doubles instead? Is the increase in memory from using doubles significant? Do most modern commercial games use floats or doubles? I was thinking of using doubles for the calculations and saving the final result as a float. Is that a good idea? Or should I not bother and just use one or the other? Thanks.

##### Share on other sites
Most of the time people don't need the extra precision afforded by doubles; a float can already store quite a wide range of numbers. Also, I don't have any evidence to back it up, but I believe I've heard floats are a bit faster.

##### Share on other sites
Floats are likely faster since they are 32 bits, the same size as the memory bus on 32-bit systems, and the FPU has less work to do on them. Also, using doubles for the calculation and then storing the result in a float is somewhat pointless, as you'll lose most of the extra precision in the conversion.

##### Share on other sites
Mmm, I see. OK, thanks. I'll just stick to floats then.

##### Share on other sites
I'm not an expert on these matters, but I don't believe you need to commit to one or the other. One solution is a typedef:

```cpp
typedef float Scalar;
```

If you want to compare performance or switch to some other type, just replace float with the desired type. A more flexible method is to make the type a template parameter:

```cpp
template <class Scalar = float>
class Vector3
{
public:
    Scalar x, y, z;
};

typedef Vector3<> Vector3f;
typedef Vector3<double> Vector3d;
```

Float and double are the most obvious choices for type, but the above method leaves the door open for other options, such as a custom rational number class.

##### Share on other sites
As far as I know, the float/double choice matters mainly in rendering, where floats consume less precious video memory and less main memory <-> video memory bandwidth. For mathematical calculations you should DEFINITELY use double, as float will give you unexpected results where double will be OK.

##### Share on other sites
Using doubles for the calculation and returning a float can make sense. Just because you have eight digits doesn't mean all eight digits are correct; that's basically the difference between precision and significance. Most often, though, the FPU is already doing that for you and there is no reason for the programmer to do it. Generally, just choose one format and stick with it. My knowledge of CPUs is limited, but I think you might take a performance hit switching between formats.

##### Share on other sites
The performance hit with doubles is mostly a 32-bit platform problem (64-bit platforms handle 64-bit values natively; they also use only 48 bits for addresses, since that address space is large enough), but I don't know about the memory bus.

Floats are defined by IEEE 754:

1 bit sign
8 bit exponent
23 bit mantissa

Usually floats are stored normalized, meaning

1.mantissa * 2^exponent

The 8 exponent bits are biased: exponent = Exp8Bit - 127, so the binary point can be shifted either left or right (2^-# or 2^#). (On subtraction, a base-2 complement of the mantissa is formed.)

Now an example:

4096 = 2^12 = 1_0000_0000_0000

so the float representation is

1.0000_0000_0000_XXXX_XXXX_XXX * 2^12

As you see, 12 of the mantissa's 23 bits are taken up by the integer part, which leaves 11 bits for the fraction, giving a minimum precision of 0.00048828125, calculated as follows:

1/2^11 == 2^-11

(you get the fractional value by dividing the stored mantissa bits by 2^23).

So you lose quite a bit of precision with larger values: 4096*4096*4096 = 2^36 would need 36 bits in front of the binary point, but the mantissa has only 23, so you can even lose precision in the integer part.

But I think, as already stated above, the FPU probably uses higher-precision floats internally. That in fact isn't expensive at all: you only need to add the 24 significant bits of each 32-bit float, the exponent stays the same, and at the end you shift left or right to renormalize to 1.mantissa.

Hope that helps.

P.S.: a good way to preserve precision is to reorder the operations so you don't produce overly large intermediate values. Instead of

4096.0345345^3 / 2048 you could do (4096/2048) * 4096^2

although this is usually not possible at runtime.

##### Share on other sites
I just looked something up: SSE2 supports 128-bit registers and can perform 2 double-precision operations in one step.

So you can stick with simple floats as long as you compile with the latest processor packs for VC++.

For gcc, read the man pages.

For VC++:

/G7 - optimize code for Intel and AMD CPUs
/arch:SSE or /arch:SSE2 - make use of SSE/SSE2. SSE3 might work similarly, but the compilers I'm using at the moment don't support it, so I can't say.

##### Share on other sites
Should you not try to implement your own floating point class?

Floats and doubles can't represent every possible number; the value stored in memory is only an approximation of the intended number. That leads to cases like this:

```cpp
#include <iostream>

int main()
{
    float a = 2.501f;
    a *= 1.5134f;
    if (a == 3.7850134)
        std::cout << "Expected value" << std::endl;
    else
        std::cout << "Unexpected value" << std::endl;
}
```

which would print "Unexpected value".

##### Share on other sites
On an IA32 platform under normal conditions, the only difference between float and double is the amount of memory they use. By this I mean that internally both float and double are done at even higher precision and the result truncated/rounded to fit the desired memory size. Using doubles requires double the memory so if you're operating on lots of values then the increase in required bandwidth could have undesirable effects, i.e. slow it down.

This does lead to interesting results as a result of compiler optimisations. In debug, the intermediate values in a sequence of computations will most likely be written to memory, thus losing some precision. In release, the intermediate values are stored on the FPU stack so the precision isn't lost and thus you get different results.

It is also possible to reduce the level of precision the FPU works at, although if memory serves me right this only affects the transcendental functions.

Skizz

##### Share on other sites
I think if you're just starting programming, and not into 3D graphics and worrying about memory constraints, use double. It will give you more precise answers.

Mike

##### Share on other sites
Usually it's a good idea to use a typedef instead of the built-in types directly; that way you can simply switch the typedef if you need more precision.

##### Share on other sites
The reason we use floats ("we" as in we of GameDev) is because the GPUs of today are highly optimized for working with 32 bit floating point values. That's what works well. To a lesser extent, SSE also works best with 32 bit floats.

Those are basically the only real reasons for it. x86 does all FPU ops internally at 80 bits, but expansion from 32 bit to 80 bit generally carries no performance hit at all (it's done during the flop; I think this applies to P4 as well but I'm not sure). You do spend more memory, but that's usually not important, and if it is, you will be conscious of it (hopefully).

##### Share on other sites
In my opinion, on modern processors (like the P4 with its 128-bit SIMD floating point arithmetic) using floats gives no speed benefit.
Conversely, mixing float and double can slow processing down due to the casts.
Use double by default; use float only if you really need it.

##### Share on other sites
Doubles use twice as much memory as floats. Using 50% less memory can be a big difference in situations where a lot of memory is being accessed.

##### Share on other sites
Floats are not always faster than doubles. For example on the platform I'm working on now, doubles are actually faster, as all floating point operations are native to doubles so floats get converted to doubles and back anyway. The extra memory is also unlikely to be an issue.

Know your target platform and code to it. (Where speed is critical, of course. In probably 90%+ situations, it just doesn't matter which you use, unless you particularly need greater precision, which in most games is unlikely).

##### Share on other sites
Quote:
 Original post by Skute
 Should you not try to implement your own floating point class? As floats / doubles cant contain every single number possible - i.e. the number they store in memory is actually a calculation for the final number. Which leads to certain cases where: *** Source Snippet Removed *** would print "Unexpected value".

NEVER EVER use == on a float.

[You don't know how things are rounded, and with different compilers or different platforms the problem only gets worse as calculations accumulate. Also, not every number that is exact in base 10 is exact in binary: 1/5 is exactly 0.2 in decimal, but in binary it is the repeating fraction 0.0011_0011_0011...]

And no, don't implement your own floating point class: how would you represent 1/3, or pi? Also, it would be much, much slower and give you no reasonable benefit.

Quote:
 Original post by BittermanAndy
 Floats are not always faster than doubles. For example on the platform I'm working on now, doubles are actually faster, as all floating point operations are native to doubles so floats get converted to doubles and back anyway. The extra memory is also unlikely to be an issue.

The PlayStation 2 doesn't have double-precision support.

##### Share on other sites
Hmmm, seems like there's a mix of opinions about which one to use. I'll use a typedef and give both a try, then see if my program can handle the increase in memory. Thanks guys!

##### Share on other sites
Quote:
 Original post by blizzard999
 In my opinion on modern processors (like P4 and its 128 bit SIMD floating point arithmetic) using floats give no speed benefits.

There are 128 bit instructions where those 128 bits hold 4 floats.
There are 128 bit instructions where those 128 bits hold 2 doubles.

Using floats with 128-bit instructions can give a 100% speed-up over doubles.

##### Share on other sites
MS recently said at one of their Xbox 360 conferences that double is native to the platform and therefore faster. I second the recommendation of using typedefs for your floating point types, to facilitate simple switching between the resulting types. This could cause compatibility problems in file IO and other such things, though, so a bit of care must be taken.

##### Share on other sites
Quote:
Original post by Nitage
Quote:
 Original post by blizzard999
 In my opinion on modern processors (like P4 and its 128 bit SIMD floating point arithmetic) using floats give no speed benefits.

There are 128 bit instructions where those 128 bits hold 4 floats.
There are 128 bit instructions where those 128 bits hold 2 doubles.

Using floats with 128bit instructions can give a 100% speed up over doubles

Example ?

##### Share on other sites
Quote:
Original post by blizzard999
Quote:
Original post by Nitage
Quote:
 Original post by blizzard999
 In my opinion on modern processors (like P4 and its 128 bit SIMD floating point arithmetic) using floats give no speed benefits.

There are 128 bit instructions where those 128 bits hold 4 floats.
There are 128 bit instructions where those 128 bits hold 2 doubles.

Using floats with 128bit instructions can give a 100% speed up over doubles

Example ?

Because you can do twice as many calculations with floats as with doubles; 4 is 100% more than 2.

##### Share on other sites
What example do you want? xmm0 can hold either 4 floats or 2 doubles, so you calculate twice the data per instruction. Plus, with floats you don't need SSE2, which Athlon XPs (still very popular) don't support.

##### Share on other sites
Quote:
Original post by blizzard999
Quote:
Original post by Nitage
Quote:
 Original post by blizzard999
 In my opinion on modern processors (like P4 and its 128 bit SIMD floating point arithmetic) using floats give no speed benefits.

There are 128 bit instructions where those 128 bits hold 4 floats.
There are 128 bit instructions where those 128 bits hold 2 doubles.

Using floats with 128bit instructions can give a 100% speed up over doubles

Example ?

SSE3 instructions:

ADDSUBPD (two doubles):

Input: { A0, A1 }, { B0, B1 }
Output: { A0 - B0, A1 + B1 }

ADDSUBPS (four floats):

Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
Output: { A0 - B0, A1 + B1, A2 - B2, A3 + B3 }

Twice as much gets done in the same time using floats. Therefore floats can be 100% faster.