Performance : Ints vs. Floats?

What is the performance difference between ints and floats on modern processors? Many years ago (486 assembler is the last reference book I have on it <g>), there was a huge difference. Is the difference now the same? Twice as fast? Something else? What about conversions to/from ints? Just how slow are they? Feel free to point me to another resource if one's out there.

Here's a previous thread on the topic: http://www.gamedev.net/community/forums/topic.asp?topic_id=183624

Thanks,
Ralph
Floating point and fixed point are the same speed on modern processors.

You should pay more attention to the natural word size of the machine instead. For example, how many clocks does it take to mov or multiply a 32-bit value on that system? For 32-bit systems, 32-bit values are optimal (i.e., use floats instead of doubles).

[edit]

I noticed you posted on the graphics programming board, so it's worth mentioning that in graphics programming we usually use floats: they provide sufficient precision, and at 32 bits they are optimal on machines where that is the natural word size.
Quote:Floating point and fixed point are the same speed on modern processors.


That is most definitely not true. A lot of people say this, but I've done the timings to back up what I say. Currently, fixed-point addition and subtraction are barely faster than floating point, if at all (it depends on the CPU). However, fixed-point multiplication is at worst about 4x faster than floating point, and division is at worst about 11x faster. Of course, this varies depending on whether or not you're using SSE. Also, because of SSE's vertical design, it's rarely possible to use it in the most efficient way. Due to the pipelining of modern processors, as well as their out-of-order execution capabilities, it's generally better to interleave SSE and integer code when possible (this assumes that the integer math and the floating-point math don't have any dependencies on each other... otherwise converting from floating point to integer and vice versa will kill your perf [literally]).
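
(For illustration, here's a minimal 16.16 fixed-point sketch in C++ showing the kind of add/multiply/divide being timed above. The Fixed alias and helper names are made up for this example, not anyone's actual library.)

    #include <cstdint>

    // Illustrative 16.16 fixed point: integer part in the upper 16 bits,
    // fraction in the lower 16. Add/subtract are plain integer ops;
    // multiply and divide need a 64-bit intermediate so the fraction
    // bits can be shifted back into place.
    using Fixed = int32_t;
    constexpr int FRAC_BITS = 16;

    constexpr Fixed to_fixed(float f) { return static_cast<Fixed>(f * (1 << FRAC_BITS)); }
    constexpr float to_float(Fixed x) { return static_cast<float>(x) / (1 << FRAC_BITS); }

    constexpr Fixed fx_add(Fixed a, Fixed b) { return a + b; }
    constexpr Fixed fx_mul(Fixed a, Fixed b)
    {
        return static_cast<Fixed>((static_cast<int64_t>(a) * b) >> FRAC_BITS);
    }
    constexpr Fixed fx_div(Fixed a, Fixed b)
    {
        return static_cast<Fixed>((static_cast<int64_t>(a) << FRAC_BITS) / b);
    }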

Quote:You should pay more attention to the natural word size of the machine instead. For example, how many clocks does it take to mov or multiply a 32-bit value on that system? For 32-bit systems, 32-bit values are optimal (i.e., use floats instead of doubles).


I'm not sure that this is the most important thing in the world... compilers will pad for you to avoid this problem. However, it's definitely good advice. The biggest thing you can do is align data to 16-byte and 32-byte boundaries. The reason for this is SSE and caches: cache line sizes vary from 32 to 128 bytes, and SSE floating-point instructions work with 16-byte data, so that data should be aligned to 16 bytes when possible. This will maximize the compiler's effectiveness at optimizing your code, and it gives you more flexibility in the future for unintrusive optimizations. Also, you should store related data close together in memory. The smallest amount of memory that any CPU will actually read from physical RAM (the data that goes across the FSB on your motherboard) is the size of its cache line... if multiple pieces of data happen to land in a single cache line, you basically get the extra data for free, maximizing the effectiveness of the cache =).
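
(A minimal sketch of that alignment advice, using C++11 alignas and SSE intrinsics; the posts above predate C++11, so treat this as a modern illustration of the same idea rather than anyone's actual code.)

    #include <xmmintrin.h>  // SSE intrinsics

    // alignas(16) keeps the array on a 16-byte boundary, so the aligned
    // load/store intrinsics below are legal; an unaligned address would
    // fault (or force slower unaligned accesses).
    alignas(16) static float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

    void scale_by(float s)
    {
        __m128 v = _mm_load_ps(data);       // aligned 16-byte load
        v = _mm_mul_ps(v, _mm_set1_ps(s));  // multiply all four lanes by s
        _mm_store_ps(data, v);              // aligned 16-byte store
    }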

Okay, I'm sorry about the whole cache and memory ramblings... I got carried away and it's late at night (my excuse!). The general point here is that there is still only a somewhat limited need for fixed-point math. Usually, if the end result needs to be in floating point, all the clocks you save by doing things in fixed point simply get dashed by converting from integer back to float (unless there are a whole lot of divisions and multiplies :)). With performance, there is no simple answer... there are good "rules of thumb", but the best thing you can do is read up on modern processor architecture as well as do performance tests yourself. Also, a good profiler will help you a lot in finding actual bottlenecks (never optimize prematurely!). Hope this helps!

Kevin B

Floats do have numerical robustness issues that fixed-point types do not. With fixed-point types the precision is consistent across the whole range; with floats the absolute precision varies with magnitude because of how values are quantized.
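
(A quick way to see that quantization difference for yourself; this little test program is just an illustration, assuming a 16.16 fixed-point format for comparison.)

    #include <cmath>
    #include <cstdio>

    int main()
    {
        // The gap between adjacent float values grows with magnitude...
        printf("float step near 1.0:       %g\n",
               std::nextafter(1.0f, 2.0f) - 1.0f);                    // ~1.19e-7
        printf("float step near 1000000.0: %g\n",
               std::nextafter(1000000.0f, 2000000.0f) - 1000000.0f);  // ~0.0625
        // ...while a 16.16 fixed-point value always steps by exactly
        // 1/65536, no matter how large the integer part gets.
        printf("16.16 fixed step anywhere: %g\n", 1.0 / 65536.0);
    }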
Programming since 1995.
Moving this over to General.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
As a side note, on the latest generation of GPUs, all floating-point operations except division are as fast as integer addition and subtraction. Integer multiplication, floating-point inverse, and inverse square root are four times slower. Floating-point division is around eight times slower, in the same range as square root, log2, ex2, sin and cos. Integer division and remainder are massively slower. Note that SIMD optimization on the GPU works differently than it does on the CPU.

Awesome, thanks guys. I'm doing mostly additions, so a worst case of 4x for * and 11x for / is something I can live with. I'm probably losing more than that on average by having to do * 100 / 100 around the calculations. Right now, everything is integral, so I doubt I'll hit the worst end of the multiplication at least. I'm not as sure about the divisions, but they're pretty simple. I'll definitely evaluate them again, though, to see if I can store the data in a way that makes them multiplications instead. I think I can in a lot of cases, because the number is often a fraction between 1 and 100.
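
(Here's a hypothetical sketch of the "store the data to make them multiplications" idea: if the divisor is known when the data is set up, keep its reciprocal alongside it and multiply in the hot path. The Modifier struct and names are made up for this example.)

    // One divide when the value is created...
    struct Modifier
    {
        float value;
        float inv;   // precomputed 1.0f / value
    };

    Modifier make_modifier(float v)
    {
        return { v, 1.0f / v };
    }

    // ...then only cheap multiplies where the value is actually used.
    float apply(float base, const Modifier& m)
    {
        return base * m.inv;
    }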

I'll be programming in C++ or C# on a PC, so alignment shouldn't be an issue. I don't expect to be doing any GPU programming in the near future, so I'll take a look at them when I need to.

Thanks,
Ralph
Quote:
in the same range as square root, log2, ex2, sin and cos


Some years ago I read that nVidia (or ATI) had implemented a single-clock-cycle cos and sin. Do I recall that correctly, or are things different?
Quote:Original post by cignox1
Some years ago I read that nVidia (or ATI) had implemented a single-clock-cycle cos and sin. Do I recall that correctly, or are things different?


On the GeForce 8800, at the very least, sin and cos take 32 clock cycles, which is eight times as much as the fastest operation (4 cycles).


Quote:Original post by cignox1
Quote:
in the same range as square root, log2, ex2, sin and cos

Some years ago I read that nVidia (or ATI) had implemented a single-clock-cycle cos and sin. Do I recall that correctly, or are things different?

Apparently, ATI has that. I don't know about nVidia.

But don't think that being able to do cos and sin in a single clock cycle is a good measure of general floating-point speed.

For one thing, they probably use a look-up table. A look-up table reduces any operation to constant time: even a 486 could do sin/cos in a single clock cycle if you had a big enough look-up table. If the operation works in 16-bit precision, it'd only need a 128KB look-up table (65,536 entries of two bytes each). Using a look-up table essentially means that the sin/cos operation doesn't actually involve any math, so its speed is entirely divorced from the speed of the non-table-based floating-point operations.
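
(To make that concrete, here's a sketch of the sort of table being described: 65,536 entries indexed by a 16-bit "binary angle". Storing the entries as 16-bit values instead of floats would halve the table to the 128KB figure above. This is purely illustrative; it's not how any particular GPU actually implements sin/cos.)

    #include <cmath>
    #include <cstdint>

    constexpr int TABLE_SIZE = 1 << 16;   // one full period in 2^16 steps
    static float sin_table[TABLE_SIZE];   // 256KB as floats (128KB with 16-bit entries)

    void init_sin_table()
    {
        for (int i = 0; i < TABLE_SIZE; ++i)
            sin_table[i] = static_cast<float>(
                std::sin(i * (2.0 * 3.14159265358979 / TABLE_SIZE)));
    }

    // angle 0..65535 maps to 0..2*pi; the "computation" is just a load.
    inline float fast_sin(uint16_t angle)
    {
        return sin_table[angle];
    }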

Apart from that, clock-cycle counts haven't been anything more than a rule-of-thumb measure of performance for some time now, due to pipelining. An operation that takes one clock cycle might be faster in isolation than an operation that takes five, but if the former stalls an eight-stage pipeline, it'll be much slower in most real code.

Even if we ignore the effects of the above issues, on "embarrassingly parallel" architectures like GPUs there may be technical limits on which types of instructions can be executed in parallel. For example, the R600 can, in principle, perform 320 multiplication, addition or division operations in parallel. In practice, no useful shader will fully utilize that potential, but even in principle it can only perform 64 "other" operations (including sin and cos) in parallel.

