
# IEEE Floats Suck

IEEE standard floating point numbers, at any bit depth, should not be used to define color spaces. Specifically, one should not use floats or doubles (henceforth collectively "floats") as color components when rendering High Dynamic Range Imagery (HDRI). Due to the nature of the representation of floats, they cannot be used to define uniform color spaces.

The IEEE 754 standard defines a 32 bit float as a 1 bit sign, an 8 bit biased exponent, and a 23 bit fraction. The exponent is "biased" in that the stored field is an unsigned integer from which 127 is subtracted. The all-zeros and all-ones exponent fields are reserved, so ordinary (normalized) numbers have exponents from -126 to +127. The fraction holds the digits after the binary point of the number written in "floating point binary notation"; the leading "1." is implied and not stored.

Some bit patterns are reserved for special values. Six examples:

0 00000000 00000000000000000000000 = 0

1 00000000 00000000000000000000000 = -0

0 11111111 00000000000000000000000 = Infinity

1 11111111 00000000000000000000000 = -Infinity

0 11111111 00000100000000000000000 = NaN

1 11111111 00100010001001010101010 = NaN

Any value whose exponent field is all ones is not an ordinary number in the IEEE representation. With this exponent and a zero fraction, it is interpreted as signed infinity. With a non-zero fraction, it is interpreted as Not a Number (NaN). Because of this interpretation, a full 2**24 bit patterns (2 signs times 2**23 fractions) are unusable when floats serve as color components.
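
As a rough check of the reserved patterns, here is a minimal Python sketch. It assumes the standard `struct` module's `">f"` format follows IEEE 754 single precision; the helper name `bits_to_float` is made up for this illustration.

```python
import math
import struct

def bits_to_float(bits: str) -> float:
    """Interpret a string of 32 zeros and ones (spaces allowed) as a 32 bit float."""
    raw = int(bits.replace(" ", ""), 2)
    # Pack the integer as 4 big-endian bytes, then reread those bytes as a float.
    return struct.unpack(">f", struct.pack(">I", raw))[0]

pos_inf = bits_to_float("0 11111111 00000000000000000000000")
neg_inf = bits_to_float("1 11111111 00000000000000000000000")
a_nan = bits_to_float("0 11111111 00000100000000000000000")

# Every pattern with an all-ones exponent is reserved:
# 2 sign values times 2**23 fraction values.
reserved = 2 * 2**23

print(pos_inf, neg_inf, a_nan)
print(math.isinf(pos_inf), math.isinf(neg_inf), math.isnan(a_nan))
print(reserved == 2**24)
```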

For example, the number -32.5625 would be converted as

S = 1, as the number is negative

temp = 100000.1001, "floating point binary representation"

= 1.000001001 x 2**5

E = 5 + 127 = 132

= 10000100

F = 000001001, drop the "1." from representation

= 00000100100000000000000, padded with zeros

Full number:

1 10000100 00000100100000000000000
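
The conversion above can be checked mechanically. This is a small sketch, again assuming Python's `struct` module packs `">f"` as an IEEE 754 single; `float_fields` is a hypothetical helper name, not part of any library.

```python
import struct

def float_fields(value: float) -> tuple[str, str, str]:
    """Return the sign, exponent, and fraction bits of value as a 32 bit float."""
    # Pack as a big-endian single, then reread the same 4 bytes as an unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]

sign, exponent, fraction = float_fields(-32.5625)
print(sign, exponent, fraction)
# 1 10000100 00000100100000000000000
```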

The problem with color spaces lies in the difference between two "consecutive" floats, i.e. floats that differ only in their least significant bit. Color depends as much on the relationship between two very similar colors as it does on the absolute value of a single color. At first glance, one might assume that this difference is always very small, as floats are capable of representing very small numbers. However, once we calculate the exact decimal values at the boundary cases of the float representation, we see that the difference between "consecutive" floats depends on the magnitude of the numbers themselves.

As an example, I've constructed a "4 bit IEEE-like floating point number." It has no sign bit; a positive, 2 bit, unbiased exponent (so the exponent needs no translation, unlike 32 or 64 bit floats); and a 2 bit fraction. It also has no reserved values. Otherwise, calculating the value follows the same rules as regular IEEE floats. The purpose of using such a limited example is to allow us to see the full range of values that can occur for a specific representation.

| bin | sci not | fp bin | dec | delta |
|------|----------|--------|------|-------|
| 0000 | 1.00 x 1 | 1.00 | 1 | - |
| 0001 | 1.01 x 1 | 1.01 | 1.25 | 0.25 |
| 0010 | 1.10 x 1 | 1.10 | 1.5 | 0.25 |
| 0011 | 1.11 x 1 | 1.11 | 1.75 | 0.25 |
| 0100 | 1.00 x 2 | 10.0 | 2 | 0.25 |
| 0101 | 1.01 x 2 | 10.1 | 2.5 | 0.5 |
| 0110 | 1.10 x 2 | 11.0 | 3 | 0.5 |
| 0111 | 1.11 x 2 | 11.1 | 3.5 | 0.5 |
| 1000 | 1.00 x 4 | 100 | 4 | 0.5 |
| 1001 | 1.01 x 4 | 101 | 5 | 1 |
| 1010 | 1.10 x 4 | 110 | 6 | 1 |
| 1011 | 1.11 x 4 | 111 | 7 | 1 |
| 1100 | 1.00 x 8 | 1000 | 8 | 1 |
| 1101 | 1.01 x 8 | 1010 | 10 | 2 |
| 1110 | 1.10 x 8 | 1100 | 12 | 2 |
| 1111 | 1.11 x 8 | 1110 | 14 | 2 |

What this shows us is that the difference between consecutive values in the range of floating point numbers is not fixed: it doubles every time the exponent increases by one.
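
The 4 bit example can be reproduced with a short Python sketch. The function name `toy_float` and the use of `Fraction` for exact arithmetic are choices made for this illustration only.

```python
from fractions import Fraction

def toy_float(bits: int) -> Fraction:
    """Decode the 4 bit toy format: 2 bit unbiased exponent, 2 bit fraction."""
    exponent = bits >> 2                      # top two bits
    fraction = bits & 0b11                    # bottom two bits
    significand = 1 + Fraction(fraction, 4)   # implied leading "1."
    return significand * 2**exponent

values = [toy_float(b) for b in range(16)]
# Gap between each value and the one before it.
deltas = [hi - lo for lo, hi in zip(values, values[1:])]

for b, v in enumerate(values):
    delta = "-" if b == 0 else float(deltas[b - 1])
    print(format(b, "04b"), float(v), delta)
```

The printed gaps take only four distinct values (0.25, 0.5, 1, and 2), one per exponent, which is the non-uniformity the example is meant to expose.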

Next, we will calculate the exact decimal value that two "consecutive" floats represent. These numbers are the 2nd-largest and largest possible, positive floats.

2nd largest 32 bit float

Binary: 01111111011111111111111111111110

Sign: 0

Exponent: binary 11111110 - 01111111 = 254 - 127 = 127

Fraction: 11111111111111111111110 = 2**23 - 2 (with the implied leading 1, the significand is 2**24 - 2)

Binary scientific notation: 1.11111111111111111111110 x 2**127

Decimal: 3.4028232635611925616003375953727e+38

Equivalent: 2**128 - 2**105

largest 32 bit float

Binary: 01111111011111111111111111111111

Sign: 0

Exponent: binary 11111110 - 01111111 = 254 - 127 = 127

Fraction: 11111111111111111111111 = 2**23 - 1 (with the implied leading 1, the significand is 2**24 - 1)

Binary scientific notation: 1.11111111111111111111111 x 2**127

Decimal: 3.4028234663852885981170418348452e+38

Equivalent: 2**128 - 2**104

Difference: (2**128 - 2**104) - (2**128 - 2**105) = 2**105 - 2**104 = 2**104 = 20282409603651670423947251286016

or roughly

2.0282 x 10**31
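
The gap can be confirmed with exact integer arithmetic. This sketch assumes Python's `struct` module treats `">f"` as an IEEE 754 single, and it relies on Python floats being doubles, which hold every 32 bit float exactly. The name `float_from_raw` is hypothetical.

```python
import struct

def float_from_raw(raw: int) -> float:
    """Reinterpret a 32 bit unsigned integer as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", raw))[0]

second_largest = float_from_raw(0x7F7FFFFE)
largest = float_from_raw(0x7F7FFFFF)

# Both values are whole numbers, so int() converts them without rounding.
gap = int(largest) - int(second_largest)

print(int(second_largest) == 2**128 - 2**105)
print(int(largest) == 2**128 - 2**104)
print(gap == 2**104, gap)
```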

We can do the same calculation for the two smallest positive floats (denormalized values, whose exponent field is all zeros) and come up with a difference of 2**-149, or roughly 1.4 x 10**-45: an extremely small number.

Using floats to define a color space will result in a non-uniform color space. There will be a lot of subtlety in the middle of the range, near zero, and giant leaps between values at the ends of the range. In the end, there are really only 2**32 - 2**24 unique and usable 32 bit floats (as was previously demonstrated), because of the bit patterns that the IEEE float specification reserves for special values such as plus or minus infinity and NaN. Just these special values alone limit the expressive power of floats in defining a color space.

Unsigned integers work much better for defining color spaces, as there are a full 2**32 unique and usable values, and the difference between the largest two integers is the same as the difference between the smallest two integers: only 1.
