# float vs double

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

49 replies to this topic

### #1HalcyonX  Members   -  Reputation: 130

Posted 04 September 2005 - 03:19 PM

Most of the tutorials I've read use floats. Why don't they use doubles instead? Is the increase in memory from using doubles significant? Do most modern commercial games use floats or doubles? I was thinking of using doubles for the calculations and saving the final result as a float. Is this a good idea? Or should I not bother and just use either floats or doubles? Thanks.

### #2load_bitmap_file  Members   -  Reputation: 826

Posted 04 September 2005 - 04:46 PM

Most of the time people don't need the extra precision afforded by doubles. With float you can already store quite a wide range of numbers. Also, I don't have any evidence to back it up but I believe I've heard floats are a bit faster.

### #3Scet  Members   -  Reputation: 960

Posted 04 September 2005 - 04:52 PM

Floats are likely faster since they are 32 bits wide, the same size as the memory bus on 32-bit systems, and the FPU has fewer calculations to do. Also, using doubles and then storing the results in a float is sort of silly, as you'll lose most of the extra precision during the conversion.

### #4HalcyonX  Members   -  Reputation: 130

Posted 04 September 2005 - 04:58 PM

Mmm, I see. OK, thanks. I'll just stick to floats then.

### #5scgames  Members   -  Reputation: 2073

Posted 04 September 2005 - 05:18 PM

I'm not an expert on these matters, but I don't believe you need to commit to one or the other. One solution is a typedef:

typedef float Scalar;

If you want to compare performance or switch to some other type, just replace float with the desired type. A more flexible method is to make the type a template parameter:

template <class Scalar = float>
class Vector3
{
public:
    Scalar x, y, z;
};

typedef Vector3<> Vector3f;
typedef Vector3<double> Vector3d;

Float and double are the most obvious choices for type, but the above method leaves the door open for other options, such as a custom rational number class.
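For what it's worth, here is a minimal sketch of how the template version might be used so that the same code runs at either precision. The `length` and `normalized` helpers are hypothetical additions for illustration and are not part of the snippet above.

```cpp
#include <cassert>
#include <cmath>

template <class Scalar = float>
class Vector3
{
public:
    Scalar x, y, z;
};

typedef Vector3<> Vector3f;
typedef Vector3<double> Vector3d;

// Hypothetical helper: Euclidean length, computed in whatever Scalar is.
template <class Scalar>
Scalar length(const Vector3<Scalar>& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Hypothetical helper: unit-length copy of v.
template <class Scalar>
Vector3<Scalar> normalized(const Vector3<Scalar>& v)
{
    const Scalar len = length(v);
    assert(len > Scalar(0)); // a zero vector has no direction
    return Vector3<Scalar>{v.x / len, v.y / len, v.z / len};
}
```

Calling `length` on a `Vector3f` does the arithmetic in float, and on a `Vector3d` in double, with no change to the function itself. That makes it fairly cheap to try both and measure.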

### #6Generic Guest  Members   -  Reputation: 110

Posted 04 September 2005 - 11:13 PM

As far as I know, the place to care about float versus double is rendering, where floats consume less precious memory and less main memory <-> video memory bandwidth. You should DEFINITELY use double in mathematical calculations, as float will give you nothing but unexpected results where double will be OK.

### #7LilBudyWizer  Members   -  Reputation: 491

Posted 05 September 2005 - 12:03 AM

Using doubles for the calculation and returning a float can make sense. Just because you have eight digits doesn't mean you have eight correct digits; that is basically the difference between precision and significance. Most often, though, the FPU is already doing that for you and there is no reason for the programmer to do it. Generally, just choose one format and use it. My knowledge of CPUs is limited, but I think you might take a performance hit switching formats.
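A rough sketch of the "calculate in double, return a float" idea, assuming a build where float arithmetic is really done in 32 bits (for example SSE code on x86-64; on x87 builds that keep intermediates in extended-precision registers the difference may be smaller). The function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: add up `count` floats using a float accumulator.
float sumInFloat(const float* values, std::size_t count)
{
    assert(values != nullptr || count == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i]; // rounded to float after every addition
    }
    return total;
}

// Hypothetical helper: same sum, accumulated in double and
// converted to float only once, at the very end.
float sumViaDouble(const float* values, std::size_t count)
{
    assert(values != nullptr || count == 0);
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i]; // each float is widened to double exactly
    }
    return static_cast<float>(total);
}
```

Summing a million copies of `0.1f` is a handy test case: the float accumulator drifts noticeably away from 100000 because each addition is rounded once the running total gets large, while the double accumulator stays very close to it.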

### #8Basiror  Members   -  Reputation: 241

Posted 05 September 2005 - 12:43 AM

The performance hit with doubles is a 32-bit platform problem, although the 64-bit platforms use only 48 bits for addresses since the addressable space is large enough.

But I don't know about the memory bus.

Floats are defined by IEEE 754:
1 bit sign
8 bit exponent
23 bit mantissa

Usually floats are stored normalized, which means

1.mantissa * 2^Exponent

The 8 exponent bits are used as follows:

Exp8Bit - 127 = Exponent, so you can shift the bits in front of or behind the binary point to the left or right (2^-# or 2^#)

1.mantissabits *optionally create a two's complement on subtractions*
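The bit layout described above can be pulled apart in code. This is only a sketch, assuming `float` is a 32-bit IEEE 754 single (true on mainstream desktop compilers, but not guaranteed by the C++ standard); `FloatParts` and `decompose` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical struct holding the three fields of a single-precision float.
struct FloatParts
{
    unsigned sign;          // 1 bit
    int exponent;           // 8 stored bits, with the bias of 127 removed
    std::uint32_t mantissa; // 23 stored bits (the leading 1 is implicit)
};

FloatParts decompose(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    assert(std::isnormal(value)); // zero, denormals, inf and NaN are encoded differently

    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits); // well-defined way to read the bit pattern

    FloatParts parts;
    parts.sign = bits >> 31;
    parts.exponent = static_cast<int>((bits >> 23) & 0xFFu) - 127;
    parts.mantissa = bits & 0x7FFFFFu;
    return parts;
}
```

For 4096.0f this gives sign 0, exponent 12 and an all-zero mantissa, which matches 1.0 * 2^12.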

Now an example:

4096 = 2^12 = 1_0000_0000_0000
so the float representation is
1.0000_0000_0000_XXXX_XXXX_XXX * 2^12

As you see, you shift 12 bits to the right, which means you lose 12 bits of the mantissa to the value in front of the point.
So you have 11 bits left, which gives a minimum precision of 0.00048828125f,
calculated as follows:
1/2^11 == 2^-11

You get the fractional part by dividing the value of the mantissa by 2^23.

As you see, you lose quite a bit of precision with larger values.
4096*4096*4096 means shifting 36 bits to the right; the mantissa has only 23 bits, so in theory you might even lose some precision in front of the point.
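The spacing figures above can be checked with `std::nextafter`, which returns the next representable value in a given direction. A small sketch, again assuming IEEE 754 single precision; `spacingAbove` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: gap between x and the next larger float.
float spacingAbove(float x)
{
    assert(std::isfinite(x));
    const float next = std::nextafter(x, std::numeric_limits<float>::infinity());
    return next - x; // exact, because both values are adjacent floats
}
```

Under that assumption `spacingAbove(4096.0f)` is 2^-11 = 0.00048828125, and for 4096*4096*4096 = 2^36 the gap is 2^13 = 8192, so at that magnitude whole integers can no longer all be represented.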

But I think, as already stated above, the FPU probably uses higher-precision floats internally, which in fact isn't that expensive at all, since you only need to add the 24 bits of a 32-bit float; the exponent stays the same, and in the end you shift left or right to normalize the float to 1.mantissa.

Hope that helps.

P.S.: A good way to increase precision is to reorder the operations so that you don't get overly large values, so instead of

4096.0345345^3/2048 you could do (4096.0345345/2048)*4096.0345345^2,
although this is usually not possible at runtime.

### #9Basiror  Members   -  Reputation: 241

Posted 05 September 2005 - 01:10 AM

I just looked something up: SSE2 is supposed to support 128-bit registers that perform 2 double-precision operations in one step,

so you can stick with simple floats as long as you compile with the latest processor packs for VC++.

For gcc, read the man pages.

For VC++:

/G7 optimizes code for Intel and AMD CPUs
/arch:SSE or /arch:SSE2 makes use of SSE and SSE2. SSE3 might work similarly, but the compilers I am using at the moment don't support SSE3, so I can't say.

### #10Skute  Members   -  Reputation: 134

Posted 05 September 2005 - 01:42 AM

Should you not try to implement your own floating point class?

Floats / doubles can't contain every single possible number, i.e. the number they store in memory is actually an approximation of the final number. That leads to certain cases where:

#include <iostream>
using namespace std;

int main()
{
    float a = 2.501f;
    a *= 1.5134f;
    // a is a float, the literal is a double: an exact match is very unlikely
    if (a == 3.7850134) cout << "Expected value" << endl;
    else cout << "Unexpected value" << endl;
}

would print "Unexpected value".

### #11 Skizz   Banned   -  Reputation: 794

Posted 05 September 2005 - 02:02 AM

On an IA32 platform under normal conditions, the only difference between float and double is the amount of memory they use. By this I mean that internally both float and double are done at even higher precision and the result truncated/rounded to fit the desired memory size. Using doubles requires double the memory so if you're operating on lots of values then the increase in required bandwidth could have undesirable effects, i.e. slow it down.

This does lead to interesting results as a result of compiler optimisations. In debug, the intermediate values in a sequence of computations will most likely be written to memory, thus losing some precision. In release, the intermediate values are stored on the FPU stack so the precision isn't lost and thus you get different results.

It is also possible to reduce the level of precision the FPU works at, although if memory serves me right this only affects the transcendental functions.

Skizz

### #12 Anonymous Poster   Guests   -  Reputation:

Posted 05 September 2005 - 02:11 AM

I think if you're just starting programming, and not into 3D graphics and worrying about memory constraints, use double. It will give you more precise answers.

Mike

### #13Basiror  Members   -  Reputation: 241

Posted 05 September 2005 - 03:56 AM

Usually it's a good idea to use a typedef instead of built-in types; this way you can simply switch the typedef if you need more precision.

### #14Promit  Moderators   -  Reputation: 11533

Posted 05 September 2005 - 04:03 AM

The reason we use floats ("we" as in we of GameDev) is because the GPUs of today are highly optimized for working with 32 bit floating point values. That's what works well. To a lesser extent, SSE also works best with 32 bit floats.

Those are basically the only real reasons for it. x86 does all FPU ops internally at 80 bits, but expansion from 32 bit to 80 bit generally carries no performance hit at all (it's done during the flop; I think this applies to P4 as well but I'm not sure). You do spend more memory, but that's usually not important, and if it is, you will be conscious of it (hopefully).

### #15blizzard999  Members   -  Reputation: 268

Posted 05 September 2005 - 04:33 AM

In my opinion, on modern processors (like the P4 with its 128-bit SIMD floating-point arithmetic) using floats gives no speed benefit.
Conversely, mixing float and double can slow processing down because of the casts.
Use double by default; use floats only if you really need them.

### #16 Anonymous Poster   Guests   -  Reputation:

Posted 05 September 2005 - 05:38 AM

Doubles use twice as much memory as floats. Using 50% less memory can be a big difference in situations where a lot of memory is being accessed.

### #17BittermanAndy  Members   -  Reputation: 108

Posted 05 September 2005 - 05:42 AM

Floats are not always faster than doubles. For example on the platform I'm working on now, doubles are actually faster, as all floating point operations are native to doubles so floats get converted to doubles and back anyway. The extra memory is also unlikely to be an issue.

Know your target platform and code to it. (Where speed is critical, of course. In probably 90%+ situations, it just doesn't matter which you use, unless you particularly need greater precision, which in most games is unlikely).

### #18sit  Members   -  Reputation: 174

Posted 05 September 2005 - 06:07 AM

Quote:
 Original post by Skute
 Should you not try to implement your own floating point class?
 As floats / doubles cant contain every single number possible - i.e. the number they store in memory is actually a calculation for the final number. Which leads to certain cases where:
 *** Source Snippet Removed ***
 would print "Unexpected value".

NEVER EVER use == on a float

[You don't know how things are rounded, and with different compilers or different platforms the problem gets worse the more calculations you do. Also, not all numbers that make sense in base 10 work in binary: 1/5 = 0.2 in decimal, but is about 0.001100110011001100110011001100110011001100110011001100110011001101... in binary.]
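A common alternative to `==` is to compare within a tolerance. The sketch below is one of several possible schemes, not a universal answer: `nearlyEqual` is a hypothetical helper, and the default tolerance of 1e-5 is an arbitrary value picked for illustration that should be tuned to the calculation at hand.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true if a and b differ by no more than relTol,
// measured relative to the larger magnitude. The scale is clamped to at
// least 1, so near zero the check behaves like an absolute tolerance.
bool nearlyEqual(float a, float b, float relTol = 1e-5f)
{
    assert(relTol >= 0.0f);
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= relTol * std::max(scale, 1.0f);
}
```

With this, the product of 2.501f and 1.5134f compares as nearly equal to 3.7850134f even though an exact `==` against the double literal fails.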

As for writing your own floating point class: no, because how would you represent 1/3? Or pi? Also, it would be much, much slower and give you no reasonable benefit.

Quote:
 Original post by BittermanAndy
 Floats are not always faster than doubles. For example on the platform I'm working on now, doubles are actually faster, as all floating point operations are native to doubles so floats get converted to doubles and back anyway. The extra memory is also unlikely to be an issue.

The PlayStation 2 doesn't have double precision support.

### #19HalcyonX  Members   -  Reputation: 130

Posted 05 September 2005 - 08:37 PM

Hmmm... seems like there's a mix of opinions about which one to use. I'll use a typedef and give both a try then, and see if my program can handle the increase in memory. Thanks guys!

### #20Nitage  Members   -  Reputation: 1045

Posted 06 September 2005 - 01:40 AM

Quote:
 Original post by blizzard999
 In my opinion on modern processors (like P4 and its 128 bit SIMD floating point arithmetic) using floats give no speed benefits.

There are 128 bit instructions where those 128 bits hold 4 floats.
There are 128 bit instructions where those 128 bits hold 2 doubles.

Using floats with 128-bit instructions can give a 100% speed-up over doubles, since each instruction processes twice as many values.
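The arithmetic behind the 4-versus-2 figure is just register width divided by element size. A trivial sketch, assuming 4-byte floats and 8-byte doubles; `lanesPerRegister` is a hypothetical helper, and real SIMD code would use the compiler's intrinsics rather than anything like this.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many values of type T fit in one SIMD register.
// The default of 16 bytes corresponds to a 128-bit SSE register.
template <class T>
std::size_t lanesPerRegister(std::size_t registerBytes = 16)
{
    assert(registerBytes % sizeof(T) == 0); // register must hold whole values
    return registerBytes / sizeof(T);
}
```

On such a platform `lanesPerRegister<float>()` is 4 and `lanesPerRegister<double>()` is 2. That doubled throughput is a best case; whether it shows up in practice depends on how much of the code actually vectorizes.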
