Relative Speed of Various Math Operations and Comparisons

Started by
4 comments, last by Tachikoma 14 years, 3 months ago
Hello, I'm interested in the relative speed of various math operations and comparison tests on a 32-bit CPU, for both integers and floats. By math operations, I mean things like: -addition -subtraction -multiplication -division -sqrt() -sin(), cos(), tan() -asin(), acos, atan() -exp() -pow() -modulus (%) -abs() -etc. And by comparsions, I mean comparing integer and floating point values (==, >, <, etc) I am interested in commonly used modern PC CPUs, from the past 5 years (Intel and AMD). Now I know that each CPU is unique and that plenty of other things affect an operation's speed, like how things are stored and moved around in memory, cacheing, and potential for parallel operations. Also, for each specific processor, there may be very detailed, very technical documentation (and I also have the option of trying to write test programs for any processor I have in my possession to generate some raw data). But I don't need something so specific - neither specific numbers, nor for a specific, certain processor. So what I'm looking for is general, relative factors. For instance, everyone says sqrt() is expensive, or that floating-point multiplication is faster than floating-point division. But how much faster are these operations compared to one another? An answer like "sqrt() is roughly ten times slower than division, which itself is roughly 3-4 times slower than multiplication" is certainly sufficient for my purposes - I just want a good feel for these kinds of things. Anything like that will help - an article, a table, etc. Does anyone know what resource might have this information, in relatively non-technical, non-specific terms? Thank you. EDIT: This probably shouldn't affect anything, as I'm interested in raw CPU capabilities, but the code I write and read about is C++.
Advertisement
There's really nothing out there as non-specific as you want. Every compiler, every OS, every CPU will all implement such operations in different ways. Availability of SIMD intrinsics affects things; automatic vectorization capabilities affect things; manually written assembly affects things; and who knows what compiler optimizations might affect.

You might as well ask how much faster a car is than a car; the question is so open-ended and vague that it really can't be answered [wink]

Context is everything. In general, it's not wise to make any assumptions about performance, especially on modern superscalar CPUs. Just write code to be clear and concise. If performance becomes an issue, then run a profiler and tune up any areas that need it. Always consider algorithm-level performance prior to worrying about single instruction performance. In fact, if you're doing your job right, you shouldn't need to care about single instruction performance at all, unless you're doing something extremely hardcore, in which case you probably already know the clock counts of all those instructions on given hardware [wink]

The moral of the story is that modern programming is an exercise in utilizing abstractions. We do the vast majority of our work far away from the raw CPU, and this is a Good Thing.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

While exact performance may vary, I still see value in a rough list. For example, it's good to know that pow(x, 2) is more expensive than x*x, and this is unlikely to ever change. There are also identities which allow trading off various operations.
Paul Hsieh's excellent (but unfortunately no longer updated) optimization page touches upon this issue.

Estimates of the relative cost of your list items follow:
1 addition, subtraction, comparison
2 fabs
3 abs
4 multiplication
10 division, modulus
10 fsqrtf (approximation)
50 exp
60 sin, cos, tan
80 asin, acos, atan
100 pow

Surprisingly enough, approximated [reciprocal] sqrts are fairly cheap on x86.

Quote:The moral of the story is that modern programming is an exercise in utilizing abstractions. We do the vast majority of our work far away from the raw CPU, and this is a Good Thing.

.. until you get programmers who have absolutely no clue as to what is really going on inside the machine, and implement every computation in terms of pow() because it's convenient. Even if only parts of the code are time-critical and you're armed with a profiler, it is more economical to
know up-front that writing x*x is preferable.
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
I generally agree with Jan, but recently I discovered that g++ is sometimes amazingly smart: `y=pow(x,2)' generates the same code than `y=x*x', and `y=pow(x,257)' results in the same code than
x2=x*x;x4=x2*x2;x8=x4*x4;x16=x8*x8;x32=x16*x16;x64=x32*x32;x128=x64*x64;x256=x128*x128;y=x256*x;

In general, you can roughly assume something like this:
page fault >>>>>> cache miss >> branch mistaken >> dependency chain >> non-sse-sqrt, division, and modulus > sse-sqrt etc. > everything else

with ">>" being much greater than (as in 3-20 times), not right shift

Quote:Original post by ApochPiQ
it's not wise to make any assumptions about performance, especially on modern superscalar CPUs
Quoted for truth, and translated to English:
Even though for example division is usually 8-15 times slower than most other operations, do not count on that. If there is no dependency on the result, the CPU can execute other instructions before it ends, thus making a 40-60 cycle instruction run effectively in maybe 1-3 cycles.
On the other hand, several "fast" instructions of which each uses the result of the previous one may totally stall and run slower. Add to that the fact that the compiler may replace divisions by constants with multiplication or shifts, or use multiply-add style instructions, which makes things really unpredictable.

Ideally, when there are no dependencies, no cache issues, no mispredictions, you can expect each instruction to take a single cycle (or, statistically, half a cycle, because many CPUs can start two instructions in a cycle).
With some dependencies, most "simple" instructions (that is, not division, etc) typically take 2-5 cycles, and usually one extra cycle if a memory access is involved. The "more complex" instructions typically appear to take 5-40 cycles, depending on the dependency chain and on how good the compiler can hide the latency.
A cache or TLB miss can add to this anything from 20 to 2500 cycles. A page fault will add a million or so cycles.
Quote:Original post by samoth
In general, you can roughly assume something like this:
page fault >>>>>> cache miss >> branch mistaken >> dependency chain >> non-sse-sqrt, division, and modulus > sse-sqrt etc. > everything else


This is the best advice given.

In most performance sensitive environments it is not about finding fastest operation, but about squeezing as many instructions as possible into idle time caused by above memory bottlenecks.

In practice this may mean that doing four multiplications instead of one will take exactly the same time, but get more work done.

Quote:I am interested in commonly used modern PC CPUs, from the past 5 years (Intel and AMD)

Memory is the bottleneck in those.

Here is a recent presentation (pdf). The lessons are universal, even though the presentation is about PS3.

Without considering those factors, your CPU will perform the sqrt in 1 cycle, then wait 200 cycles for next value to come from memory. That is - if performance is really important, most of the time it really doesn't matter.

The reason why data organization is more important is simply due to design implications. Slow sqrt() can be replaced with fast_sqrt() later on, but it won't help with underlying data organization problems which are much costlier to fix, nor will it allow for SIMD (or similar) alternative.
I also agree with most of the comments where, but I would generally avoid putting the more expensive functions inside tight loops, particularly for things like image processing. Cache misses are still going to be drag though; this is where some of the prefetch instructions will help a great deal.
Latest project: Sideways Racing on the iPad

This topic is closed to new replies.

Advertisement