**Edited by RoundPotato, 23 August 2014 - 05:39 PM.**


# potatoe

###
#2
Crossbones+ - Reputation: **10518**

Posted 02 August 2014 - 12:22 PM

POPULAR

1. Any usage of floats is normally much more expensive than int operations, right?

http://stackoverflow.com/questions/2550281/floating-point-vs-integer-calculations-on-modern-hardware

2. If 32 bit floats can hold data from -3.14 * 10 ^ -38 to 3.14 * 10 ^ 38 without precision loss

Float32s cannot perfectly represent all values within their range. They cannot even represent every single possible Int32, which may surprise some people.

For an experiment, try this:

- Loop over all integers.

- Cast the integer to a float.

- Cast back to an integer.

- See if the 'before' and 'after' integers are the same.

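    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        // Positive numbers (negative numbers work similarly).
        // 64-bit variables are used because with a plain int, ++i past
        // 0x7FFFFFFF overflows and (int)f is undefined once f rounds up to 2^31.
        long long maxDifference = 0;
        for (long long i = 0; i <= 0x7FFFFFFF; ++i)
        {
            float f = (float)i;
            long long i2 = (long long)f;
            long long diff = llabs(i2 - i);
            if (diff > maxDifference)
            {
                printf("%lld -> %f -> %lld (off by %lld)\n", i, f, i2, diff);
                maxDifference = diff;
            }
        }
        return 0;
    }

The output looks like this: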

    16777217 -> 16777216.000000 -> 16777216 (off by 1)
    33554434 -> 33554432.000000 -> 33554432 (off by 2)
    67108867 -> 67108864.000000 -> 67108864 (off by 3)
    67108868 -> 67108864.000000 -> 67108864 (off by 4)
    134217733 -> 134217728.000000 -> 134217728 (off by 5)
    134217734 -> 134217728.000000 -> 134217728 (off by 6)
    134217735 -> 134217728.000000 -> 134217728 (off by 7)
    134217736 -> 134217728.000000 -> 134217728 (off by 8)
    268435465 -> 268435456.000000 -> 268435456 (off by 9)
    268435466 -> 268435456.000000 -> 268435456 (off by 10)
    268435467 -> 268435456.000000 -> 268435456 (off by 11)
    268435468 -> 268435456.000000 -> 268435456 (off by 12)
    268435469 -> 268435456.000000 -> 268435456 (off by 13)
    268435470 -> 268435456.000000 -> 268435456 (off by 14)
    268435471 -> 268435456.000000 -> 268435456 (off by 15)
    268435472 -> 268435456.000000 -> 268435456 (off by 16)
    536870929 -> 536870912.000000 -> 536870912 (off by 17)
    536870930 -> 536870912.000000 -> 536870912 (off by 18)
    536870931 -> 536870912.000000 -> 536870912 (off by 19)
    536870932 -> 536870912.000000 -> 536870912 (off by 20)
    536870933 -> 536870912.000000 -> 536870912 (off by 21)
    536870934 -> 536870912.000000 -> 536870912 (off by 22)
    536870935 -> 536870912.000000 -> 536870912 (off by 23)
    536870936 -> 536870912.000000 -> 536870912 (off by 24)
    536870937 -> 536870912.000000 -> 536870912 (off by 25)
    536870938 -> 536870912.000000 -> 536870912 (off by 26)
    536870939 -> 536870912.000000 -> 536870912 (off by 27)
    536870940 -> 536870912.000000 -> 536870912 (off by 28)
    536870941 -> 536870912.000000 -> 536870912 (off by 29)
    536870942 -> 536870912.000000 -> 536870912 (off by 30)
    536870943 -> 536870912.000000 -> 536870912 (off by 31)
    536870944 -> 536870912.000000 -> 536870912 (off by 32)
    1073741857 -> 1073741824.000000 -> 1073741824 (off by 33)
    1073741858 -> 1073741824.000000 -> 1073741824 (off by 34)
    1073741859 -> 1073741824.000000 -> 1073741824 (off by 35)
    1073741860 -> 1073741824.000000 -> 1073741824 (off by 36)
    1073741861 -> 1073741824.000000 -> 1073741824 (off by 37)
    1073741862 -> 1073741824.000000 -> 1073741824 (off by 38)
    1073741863 -> 1073741824.000000 -> 1073741824 (off by 39)
    1073741864 -> 1073741824.000000 -> 1073741824 (off by 40)
    1073741865 -> 1073741824.000000 -> 1073741824 (off by 41)
    1073741866 -> 1073741824.000000 -> 1073741824 (off by 42)
    1073741867 -> 1073741824.000000 -> 1073741824 (off by 43)
    1073741868 -> 1073741824.000000 -> 1073741824 (off by 44)
    1073741869 -> 1073741824.000000 -> 1073741824 (off by 45)
    1073741870 -> 1073741824.000000 -> 1073741824 (off by 46)
    1073741871 -> 1073741824.000000 -> 1073741824 (off by 47)
    1073741872 -> 1073741824.000000 -> 1073741824 (off by 48)
    1073741873 -> 1073741824.000000 -> 1073741824 (off by 49)
    1073741874 -> 1073741824.000000 -> 1073741824 (off by 50)
    1073741875 -> 1073741824.000000 -> 1073741824 (off by 51)
    1073741876 -> 1073741824.000000 -> 1073741824 (off by 52)
    1073741877 -> 1073741824.000000 -> 1073741824 (off by 53)
    1073741878 -> 1073741824.000000 -> 1073741824 (off by 54)
    1073741879 -> 1073741824.000000 -> 1073741824 (off by 55)
    1073741880 -> 1073741824.000000 -> 1073741824 (off by 56)
    1073741881 -> 1073741824.000000 -> 1073741824 (off by 57)
    1073741882 -> 1073741824.000000 -> 1073741824 (off by 58)
    1073741883 -> 1073741824.000000 -> 1073741824 (off by 59)
    1073741884 -> 1073741824.000000 -> 1073741824 (off by 60)
    1073741885 -> 1073741824.000000 -> 1073741824 (off by 61)
    1073741886 -> 1073741824.000000 -> 1073741824 (off by 62)
    1073741887 -> 1073741824.000000 -> 1073741824 (off by 63)
    1073741888 -> 1073741824.000000 -> 1073741824 (off by 64)

As you can see, as the values get further from zero, the floats start to lose their ability to represent each integer.

Observe the pattern: For a while, integers are perfectly represented. But at 16777217, the floating point representation loses the ability to track the lowest bit of the integer. At 33554434, it loses the ability to track the lowest two bits. Each time the integer reaches the next power of two, the float loses another bit. What's happening?

http://en.wikipedia.org/wiki/Single-precision_floating-point_format

Notice the floating point bits are divided into three major sections: sign, exponent, mantissa (aka fraction).

The mantissa portion only has 23 actual bits, but 24 effective bits since an implicit leading 1 is used. Now, if we find the maximum integer that we can represent with 24 bits (0xFFFFFF), it's 16777215. That looks close to the point in the above experiment where the float can no longer store the value. The next two numbers are 0x1000000 (16777216) and 0x1000001 (16777217). Floating point can still represent 0x1000000 correctly because all of the low bits are zero. But it can't represent 0x1000001, because it can no longer hold the lowest bit, which is a 1. When an integer is converted to a float, it is rounded to the nearest value the float can hold, so on the way back to an integer the bits the float couldn't keep come out as zeroes, which leads to the pattern you see above.

(EDIT) Changed "error" to "difference" since I was using the term improperly.

**Edited by Nypyren, 02 August 2014 - 07:22 PM.**

###
#3
Members - Reputation: **380**

Posted 02 August 2014 - 12:22 PM

1. Yes float usage is more expensive. May not be extremely obvious in small programs, but using the correct variable type in large ones is critical.

2. You will generally use ints when you're absolutely sure the variable should only be a whole number. Floats do provide larger and more precise storage, but at a cost in performance. So always use integers when possible (plus, some operations are easier to do with ints than with floats).

**Edited by Penanito, 02 August 2014 - 12:23 PM.**

###
#4
Members - Reputation: **3018**

Posted 02 August 2014 - 12:34 PM

POPULAR

1. Yes float usage is more expensive. May not be extremely obvious in small programs, but using the correct variable type in large ones is critical.

You can't say that as a fact as it is entirely dependent on the hardware in question. GPUs for example are often better at floating point.

###
#5
Crossbones+ - Reputation: **19670**

Posted 02 August 2014 - 08:34 PM

POPULAR

Float32s cannot perfectly represent all values within their range. They cannot even represent every single possible Int32, which may surprise some people.

How can that possibly surprise anybody? There are 2^32 different Int32 values, and floats are represented using 32 bits. So for every non-integer number that can be represented as a Float32, there is an integer that cannot be represented as a Float32.

###
#6
Moderators - Reputation: **48918**

Posted 02 August 2014 - 11:09 PM

POPULAR

Then, mid 90's, every desktop CPU started to add actual hardware support for float operations, which made them cost about the same as int operations.

One of the most common performance statistics is FLOPS - floating-point operations per second - because float ops are about one of the simplest things a CPU can do these days!

As mentioned above, GPUs have taken an opposite path, where initially, they only worked with floats, and integer operations had to be emulated! Recently, GPUs have added real hardware support for int operations, but it may still be slower.

###
#7
Senior Moderators - Reputation: **7666**

Posted 03 August 2014 - 04:17 AM

In the 90's, float calculations were performed by software routines, so they were much, much slower than ints.

Then, mid 90's, every desktop CPU started to add actual hardware support for float operations, which made them cost about the same as int operations.

Intel has had a floating point co-processor available since 1980 (the 8087 was an FPU co-processor for the 8086). The Intel 80486DX (1989) had a full floating point implementation on board. The SX variety did as well, but it was disabled due to fab issues. The 80487 was actually a full 80486DX with a bit of circuitry on board to require the 80486SX to operate. It would disable the main processor and take over ALL OPERATIONS when it was installed. The circuitry that detected the presence of the master CPU was known to be somewhat... flaky, and so many people were able to build systems with just 80487 chips in them without the additional cost of an 80486 processor.

Most floating point software would actually detect whether an FPU was present on the hardware and defer operations to it when available. Since this was often provided via source-based libraries, it made floating point no more costly than most other reasonably complex mathematical operations (when an FPU was present). However, FPU instructions were still quite slow relative to integer ones, even with an FPU. It took the rapid divergence between memory fetch times and modern CPU cycle speeds, along with pipelining and clock subdivision for executing subinstruction operations, before the cost was reduced enough to make them essentially identical operations.

Sometimes when building my systems I miss seeing those dual sockets both populated by the most powerful silicon available to the general public of the time...

**Edited by Washu, 03 August 2014 - 04:22 AM.**

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

ScapeCode - Blog | SlimDX

###
#8
Crossbones+ - Reputation: **8706**

Posted 03 August 2014 - 06:56 AM

You can still put a bunch of Xeons with 30 threads each on a single board Washu

Sometimes when building my systems I miss seeing those dual sockets both populated by the most powerful silicon available to the general public of the time...

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and *Making a Terrain Generator*

###
#9
Members - Reputation: **2048**

Posted 03 August 2014 - 09:21 AM

2. If 32 bit floats can hold data from -3.14 * 10 ^ -38 to 3.14 * 10 ^ 38 without precision loss then why would anyone use Ints if they can only store from -2 * 10 ^ 9 to 2 * 10 ^ 9 ?

Actually, in Lua all numbers are FP32. And it's a pain in the ass, because all of a sudden you lose the ability to store a 32-bit hash or a 32-bit Unicode codepoint as a regular number. Indices for an array can not only be negative, but can also be fractions or NaNs.

###
#10
Members - Reputation: **2459**

Posted 03 August 2014 - 01:49 PM

Actually, in Lua all numbers are FP32. And it's a pain in the ass, because all of a sudden you lose the ability to store a 32-bit hash or a 32-bit Unicode codepoint as a regular number. Indices for an array can not only be negative, but can also be fractions or NaNs.

I don't know the history or what version of Lua you're talking about, but in the version of Lua that I downloaded source for around a year ago, the base lua_Number is a double, and it is configurable.

###
#13
Members - Reputation: **2048**

Posted 03 August 2014 - 02:39 PM

I don't know the history or what version of Lua you're talking about, but in the version of Lua that I downloaded source for around a year ago, the base lua_Number is a double, and it is configurable.

You are probably right, I was writing from memory, and somehow thought it was single precision. I did not know that it is configurable though, thanks for pointing that out.

If so, is there an easy way to tell how float operation speed differs from int operation speed?

Rule(s) of thumb: If you are on the CPU, float is slightly slower than int. If you are on the GPU (especially NVIDIA), float is faster than int. If you have a lot of branches, which nuke your pipeline, it doesn't matter. If you are memory bandwidth bound, it doesn't matter. If you have a lot of cache misses, it doesn't matter. If you are chasing pointers, it doesn't matter. If you have low ILP, it probably also doesn't matter.

###
#14
Members - Reputation: **2742**

Posted 03 August 2014 - 05:44 PM

3. So practically the actual maximum safe range without losing precision is only 2^24, right? Similarly it says that the minimum range without losing precision is 1.175494351e-38, which I believe is also false, right? If so, then what is the minimum safe range?

The two things are not actually similar, beyond the fact that they both involve precision (the number of digits that can be accurately represented).

The maximum safe range is the limit of integral precision - i.e. the point beyond which a float is incapable of representing all integral bits of the number.

**It's not really about loss of precision on floats, but is instead the point at which integer precision exceeds that of a float, which is to say the point beyond which precision will be lost when converting from int to float.** This is because the precision of a float is constant (with one exception, but we'll get into that) and based on the number of bits allocated to the mantissa, while the precision of an int varies depending upon the magnitude of the number. (For example, an int can represent a number between 8388608 and 16777215 with 24 bits of precision, but a number between 64 and 127 with only 7 bits of precision.)

Or to put it another way, (assuming 32-bit floats) any number with a magnitude of 2^24 or greater will lose precision when converted from int to float, and conversely any number with a magnitude less than 2^23 will lose precision when converted from float to int.

The minimum range without loss of precision (which IS 1.175494351e-38 for a standard 32-bit float) is due to the existence of denormalized numbers, and represents an actual loss of precision within the float format itself. As has been mentioned, the mantissa of a float has an implied most significant bit of 1. However, for a denormalized number, the implied most significant bit of the mantissa is instead 0. Denormalized numbers are used only for extremely small magnitudes - they allow numbers closer to zero to be represented with increasing accuracy but reduced precision. Since the implied msb is 0, the precision is determined by the highest set bit in the mantissa (much as with ints).

Note that if there were no such thing as denormalized numbers, there would be no such thing as "minimum range without loss of precision" - floats would have a constant precision.

###
#16
Members - Reputation: **2742**

Posted 04 August 2014 - 03:31 PM

How? A random number 45 is below 2^23...

45.0f -> 45

where '->' is conversion to int. Where is the precision loss?

45.0f has 24 bits of precision, while 45 (as an integer) has only 6. Precision is lost when converting to int because the int has fewer significant figures.

As a float, 45.0 is distinct from 45.000004. As an int, it is not.

To put it another way, 45.00000000 is more precise than 45.0, even though all the extra digits are 0s.

###
#18
Senior Moderators - Reputation: **7666**

Posted 04 August 2014 - 05:50 PM

If so does this also apply to what I asked earlier

RoundPotato, on 04 Aug 2014 - 09:34 AM, said:

I think you meant 1.175494351e-38 is stated as having no precision loss because such a number can be defined with the same 'precision' as a float, that is 24 bits (where precision is defined as the number of bits in the mantissa). Is that what you were getting at?

then?

1.175494351e-38 is the minimum normalized value a 32 bit floating point number can represent (it is not the minimum value an IEEE 32 bit float can hold accurately).

    0 00000001 00000000000000000000000 ==> 1.17549435E-38
    0 00000000 00000000000000000000001 ==> 1.4E-45 (note, this is a denormalized float as the exponent is 0)

**Edited by Washu, 04 August 2014 - 05:53 PM.**


###
#20
Senior Moderators - Reputation: **7666**

Posted 04 August 2014 - 06:25 PM

1.175494351e-38 is the minimum normalized value a 32 bit floating point number can represent (it is not the minimum value an IEEE 32 bit float can hold accurately).

So 1.17549435E-38 is the minimum representable number without precision loss because it uses 24 bits (a la 24-bit precision) and 1.4E-45 is the very minimum number that can be represented but at the loss of precision (1 bit that is the MSB because it is denormalized now), that it?

With IEEE floats you have an invisible leading 1 whenever the exponent field is neither all zeros nor all ones (all ones is reserved for infinity and NaN). In other words, it's something like (-1) ^ sign * 2 ^ exponent * 1.mantissa. This is normalized form, as the most significant bit is represented by the value of the exponent, giving you 1 + 23 bits of precision.

When you use denormalized floats the exponent is 0, and thus there is no leading 1 bit. So you do lose a bit of precision.

**Edited by Washu, 05 August 2014 - 03:58 PM.**
