Optimizing floating point computations (simulating electric motors)
#1 Members - Reputation: 432
Posted 29 April 2012 - 06:31 AM
I have a C++ module consisting of a bunch of functions containing quite complicated floating point math computations that get run over and over again in a tight loop. Profiling has shown that the program spends a substantial amount of CPU time running these floating point computations. Floating point multiplications and std::sin() / cos() / tan(), etc. seem to be a lot more expensive than I thought. We're already compiling with -xO4 (which is the second highest optimization level on our compiler).
In the past few days I've been trying to optimize these computations as much as I can. I've actually had some success as some of our simulation tests scenarios have gone down from roughly 60 seconds to about 50 seconds. Not an astronomical improvement but nothing to be sneezed at either.
The way I've been going about this is to mainly break up complex computations that reuse the same expression over and over. So instead of evaluating the same expression many times over, I just evaluate it once and then feed the result into the computations that depended on the original expression. The downside to this is that readability of the calculations has suffered badly. Also, it's pretty tenuous work and I have to take great care not to accidentally mess up any of the rather complicated calculations.
So I was wondering if there are any fancy mechanisms or tools that can be fed a series of complicated math expressions and that will try to automatically identify and remove any redundant calculations to optimize the computations as much as possible. Does such a thing exist or am I stuck with handcrafted optimizations?
#2 Members - Reputation: 1397
Posted 29 April 2012 - 07:28 AM
You can also simplify equations with software assistance.
Other options:
- Make use of SIMD where possible. What's the target processor?
- Make sure that you aren't doing computations on NaNs, Infinites or denormals. They can be much much slower than standard floating point values.
- It can help that cos(x) == sqrt(1.0 - (sin(x) * sin(x))) when you're calculating the sin and cos of the same value.
- You can trade more accuracy for speed by for example using a look up table for sin(), cos() and tan(). Make the number of entries in the table a power of two.
- Make use of all your processor cores by splitting the work across multiple threads.
Edited by Adam_42, 29 April 2012 - 07:42 AM.
#3 Members - Reputation: 2040
Posted 29 April 2012 - 07:29 AM
- investigate your compiler's options for floating point. You can get a lot of speed up simply by adding a couple of flags. Take care though, as they usually place additional constraints/requirements on your code e.g. they assume you'll never have any NaNs, you won't be reading errno to check for errors in transcendentals, etc.
- SIMD. There are usefully-accurate polynomial approximations to sin/cos across pi-sized ranges which can be used for e.g. SSE implementations.
- Newer versions of GCC have auto-vectorization support (provided you specify the right flags). VS11 also has this feature, apparently.
- tan(x) = sin(x)/cos(x), so if you're calculating sin, cos and tan for the same calculation, you can save some work
- threads (careful if you have to retrofit them!)
- and then there's OpenCL and CUDA
Edited by edd², 29 April 2012 - 07:29 AM.
#4 Members - Reputation: 2369
Posted 29 April 2012 - 01:47 PM
from roughly 60 seconds to about 50 seconds. Not an astronomical improvement but nothing to be sneezed at either.
Is this the worst case? How many of such test cases will be run? How many hours/days/years of computational work are we talking here?
For comparison: 1 month of compute time of Amazon costs $70. For lowest end software engineer salaries, one can get 10-20 instances running full-time, giving you 2000% improvement, compared to 17% improvement you've made so far.
Another option is buying more machines. Weigh the cost of development vs. cost of an i7 (~$250).
What are the targets here?
Accuracy? SIMD and GPU may be fast, but have accuracy problems. GPUs can suffer from overheating and corrupt computations undetectably. It's no big deal for graphics, but number crunching can be a problem. Fast math is likely a no-go for this reason as well, if results need to be accurate, then it'll sooner be :precise or whatever the most robust version is.
Since 60s computation time is being mentioned, it doesn't seem it has to be interactive. In that case, range of solutions becomes very large, but as far as cost factors go, optimizing the algorithm is not one of them - hardware is simply too cheap.
it's pretty tenuous work and I have to take great care not to accidentally mess up any of the rather complicated calculations.
You do run full test suite after each change, right? You do have unoptimzed reference implementation against which the values are compared?
If not, then instead of optimization write that.
- One is a table of hand-calculated or text-book results
- Second is unoptimized reference implementation. Slow, compiled for maximum reliability, perhaps in debug mode
- Then come incremental improvements. If results differ, RMS or similar statistic is measured against first two to determine deviation
Having that, optimizaton can proceed incrementally with only a few seconds to test validity. For complex computations and simulations, any mistake will quickly cause large divergence.
Edited by Antheus, 29 April 2012 - 01:52 PM.
#5 Moderators - Reputation: 7535
Posted 29 April 2012 - 01:56 PM
#6 Members - Reputation: 2369
Posted 29 April 2012 - 02:32 PM
And if difference is several orders of magnitude, then there's something very wrong with current implementation, so instead identify that bottleneck.
Edited by Antheus, 29 April 2012 - 02:32 PM.
#7 Members - Reputation: 3789
Posted 29 April 2012 - 05:31 PM
Lots of the same computations inside a tight loop sound like ideal candidates for a GPU-based solution to me though. I wouldn't worry overly much about accuracy - any floating-point ops have inherent inaccuracy so what's more important is that you define your tolerances here and decide if the operation is within them. The biggest downside to using the GPU is latency in getting the results back - not due to bandwidth but due to pipeline stalls - so what it's more suited for is where this latency is a small fraction of the time taken to do the computation.
It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.
#8 Members - Reputation: 100
Posted 30 April 2012 - 07:49 AM
I haven't read all the posts but anyway maybe my idea could help.
First of all, it would be much better if you provide us with some detailed information like how many times approximately do you call these trigonometric functions in the loop?
How important is precision?
If precision is not so important or it is acceptable that absolute error is about 0.000006 (at most, not for all the cases) you could pre-calculate sin (and cos if you need) for about a million values (eg. PRECISION = 1000000, for each x = k * 2*Pi / PRECISION, k = 0, PRECISION) you can put in an array these values.
When you need to find value for sin(x), you will find k such that k = x*PRECISION / (2*Pi) (integer value) and your value will be sinus[k] or sinus[k + 1], depends which of the numbers k and (k+1) is closer to x*PRECISION / (2*Pi) (if sinus is the name of the array you used to pre-calculate values).
You can also increase the precision by increasing PRECISION value.
Time complexity will be O(PRECISION) to pre-calculate, and O(1) to calculate the value of sin(x).
Memory complexity will be O(PRECISION) too and it will be around 8MBs for PRECISION = 1000000 and the array is of type double.
There are probably more ways to solve that problem. If this solution I have written is not of any help I could come up with a new one, but I need some more feedback about.
#9 Members - Reputation: 2369
Posted 30 April 2012 - 09:49 AM
If accuracy really isn't an issue, then on just about any modern PC using simplified Fourier expansion of first n terms will be faster. These also translate nicely to SIMD.
A Google search finds this random article (pdf). For cos(x) = c1 +c2*x^2 + c3*x^4 yields error < 0.001. Sin approximation gives error < 0.0001.
it will be around 8MBs for
This is huge and unless one is guaranteed locality, it will trash L1, L2 and L3 cache.
---
Looking through instruction tables (pdf) for nehalem, things aren't quite that easy to figure out, especially for newer architectures there aren't directly useful numbers.
But fsin/fcos show latency of 40-100 cycles. It's possible that these pipeline well, so reordering for batch operations might help.
Compared to that, fmul has negligible execution time and 5 cycle latency, with multiply pipelining well. Given above formula, we get some 4 * 5 + 2 * 3 latency and 6 uops for approximation. Definitely a good candidate for SIMD (with latency dominating) or carefully reordered computation to minimize stalls. Compiler might be able to do this. But it's really hard to judge numbers without testing, these CPUs are so complex just about anything can happen.
i7 L1 cache miss is ~10 cycles (varies a lot), so unless table is carefully optimized, it will rarely, if ever, beat properly pipelined raw calculation.
Edited by Antheus, 30 April 2012 - 09:52 AM.
#10 Members - Reputation: 100
Posted 30 April 2012 - 11:28 AM
Quote
it will be around 8MBs for
This is huge and unless one is guaranteed locality, it will trash L1, L2 and L3 cache.
He doesn't have to put it all in cache. If that's the first thing to pre-calculate, cache will keep only the most used values from the sinus array.
Edit: sin (x) (cos too) can be calculated using Taylor series (Maclaurin series actually here) to calculate sinus with an absolute error of 0.000006 in about 12 steps (in the worst case) and that would be about 1s of time if you call 10 million times sin(x) function.
Edited by FreezingCool, 30 April 2012 - 11:35 AM.
#11 Moderator* - Reputation: 5335
Posted 30 April 2012 - 01:02 PM
That's the problem. Unless the memory is in the cache, accessing memory is about the slowest thing you can do, taking hundreds of cycles. Like Antheus said, unless there's some locality in the way he's accessing the look-up table, it will cause thrashing as portions of the look-up table are continuously loaded and unloaded from the cache, which will kill performance. Look-up tables only offer a speed increase if you access them in specific ways, and Antheus's warning is to make sure that the look-up table will be accessed in a sensible pattern before bothering making a look-up table.
Quote
it will be around 8MBs for
This is huge and unless one is guaranteed locality, it will trash L1, L2 and L3 cache.
He doesn't have to put it all in cache. If that's the first thing to pre-calculate, cache will keep only the most used values from the sinus array.
Edited by Cornstalks, 30 April 2012 - 01:03 PM.






