Ralph Trickey

Unity Performance : Ints vs. Floats?


Recommended Posts

What is the performance difference between ints and floats on modern processors? Many years ago (a 486 assembler book is the last reference I have on the subject <g>), there was a huge difference. Is the difference still the same? Twice as fast? Something else? What about conversions to/from ints? Just how slow are they? Feel free to point me to another resource if one's out there. Here's a previous thread on the topic: http://www.gamedev.net/community/forums/topic.asp?topic_id=183624 Thanks, Ralph

Floating point and fixed point are the same speed on modern processors.

You should pay more attention to the natural word size of the machine instead. For example, how many clocks does it take to mov or multiply a 32-bit value on that system? On 32-bit systems, 32-bit values are optimal (i.e., use float instead of double).

[edit]

I noticed you posted in the graphics programming board, so it's worth mentioning that in graphics programming we usually use floats: they provide sufficient precision, and being 32 bits wide makes them optimal on machines where that is the natural word size.

Quote:
Floating point and fixed point are the same speed on modern processors.


That is most definitely not true. A lot of people say this, but I've done the timings to back it up. Currently, fixed-point addition and subtraction are barely faster than floating point, if at all (it depends on the CPU). However, fixed-point multiplication is at worst about 4x faster than floating point, and division is at worst about 11x faster. Of course, this varies depending on whether or not you're using SSE. Also, because of SSE's vertical (SIMD) design, it's rarely possible to use it in the most efficient way. And due to the pipelining and out-of-order execution capabilities of modern processors, it's generally better to interleave SSE and integer code when possible (this assumes the integer math and the floating-point math have no dependencies on each other... otherwise converting from floating point to integer and vice versa will kill your perf [literally]).

Quote:
You should pay more attention to the natural word size of the machine instead. For example, how many clocks does it take to mov or multiply a 32-bit value on that system? On 32-bit systems, 32-bit values are optimal (i.e., use float instead of double).


I'm not sure that this is the most important thing in the world... compilers will pad your data for you to avoid this problem. Still, it's good advice. The biggest thing you can do is align data to 16-byte and 32-byte boundaries, because of SSE and caches: cache line sizes vary from 32 to 128 bytes, and SSE floating point works with 16-byte data, so it should be 16-byte aligned when possible. This maximizes a compiler's effectiveness at optimizing your code, and it gives you more flexibility later for unintrusive optimizations. You should also store related data close together in memory. The smallest amount of memory any CPU will actually read from physical RAM (the data that goes across the FSB on your motherboard) is the size of its cache line... if multiple pieces of data happen to sit in a single cache line, you basically get at the extra data for cheap, maximizing the effectiveness of the cache =).

Okay, I'm sorry about the whole cache and memory ramblings... I got carried away and it's late at night (my excuse!). The general point is that there is still only a limited need for fixed-point math. Usually, if the end result needs to be in floating point, all of the clocks you save by doing things in fixed point simply get dashed by converting from integer to float (unless there are a whole lot of divisions and multiplies :)). With performance there is no simple answer... there are good "rules of thumb", but the best thing you can do is read up on modern processor architecture and run performance tests yourself. A good profiler will also help you a lot in finding actual bottlenecks (never optimize prematurely!). Hope this helps!

Kevin B

Floats do have numerical robustness issues that fixed-point types do not. With fixed-point types the precision is consistent across the whole range; with floats it varies with magnitude, because the spacing between representable values grows with the exponent.

As a side note, on last-generation GPUs, all floating-point operations except division are as fast as integer addition and subtraction. Integer multiplication, floating-point reciprocal, and reciprocal square root are four times slower. Floating-point division is around eight times slower, in the same range as square root, log2, ex2, sin and cos. Integer division and remainder are massively slower. Note that SIMD optimization on the GPU works differently from that on the CPU.

Awesome, thanks guys. I'm doing mostly additions, so a worst case of a factor of 4 for * and 11 for / is something I can live with. I'm probably losing more than that on average by having to do * 100 / 100 around the calculations. Right now everything is integral, so I doubt I'll hit the worst end of the multiplication at least. I'm not as sure about the divisions, but they're pretty simple. I'll definitely evaluate them again, though, to see if I can store the data so that they become multiplications instead. I think I can in a lot of cases, because the number is often a fraction between 1 and 100.

I'll be programming in C++ or C# on a PC, so alignment shouldn't be an issue. I don't expect to be doing any GPU programming in the near future, so I'll take a look at them when I need to.

Thanks,
Ralph

Quote:

in the same range as square root, log2, ex2, sin and cos


Some years ago I read that nVidia (or ATI) implemented single-clock-cycle cos and sin. Do I recall that correctly, or are things different?

Quote:
Original post by cignox1
Some years ago I read that nVidia (or ATI) implemented single-clock-cycle cos and sin. Do I recall that correctly, or are things different?


On the GeForce 8800, at the very least, sin and cos take 32 clock cycles, which is eight times as much as the fastest operation (4 cycles).


Quote:
Original post by cignox1
Quote:

in the same range as square root, log2, ex2, sin and cos

Some years ago I read that nVidia (or ATI) implemented single-clock-cycle cos and sin. Do I recall that correctly, or are things different?

Apparently, ATI has that. I don't know about nVidia.

But don't think that being able to do cos and sin in a single clock cycle is a good measure of general floating-point speed.

For one thing, they probably use a look-up table. A look-up table reduces any operation to constant time: even a 486 could do sin/cos in a single clock cycle if you had a big enough table. If the operation works in 16-bit precision, it would only need a 128 KB table. Using a look-up table essentially means that the sin/cos operation doesn't actually involve any math, so its speed is entirely divorced from the speed of the non-table-based floating-point operations.

Apart from that, clock cycles have for some time been nothing more than a rule-of-thumb measure of performance, due to pipelining. An operation that takes one clock cycle might be faster in isolation than an operation that takes five, but if the former stalls an eight-stage pipeline, it'll be much slower in most real code.

Even if we can ignore the effects of the above issues, on "embarrassingly parallel" architectures like GPUs, there may be technical limits on the types of instructions that can be executed in parallel. For example, the r600 can, in principle, perform 320 multiplication, addition or division operations in parallel. In practice, no useful shaders will fully utilize that potential, but even in principle it can only perform 64 "other" operations (including sin and cos) in parallel.
