Optimizing floating point computations (simulating electric motors)

Hi,

I have a C++ module consisting of a bunch of functions containing quite complicated floating point math computations that get run over and over again in a tight loop. Profiling has shown that the program spends a substantial amount of CPU time running these floating point computations. Floating point multiplications and std::sin() / cos() / tan(), etc. seem to be a lot more expensive than I thought. We're already compiling with -xO4 (which is the second highest optimization level on our compiler).

In the past few days I've been trying to optimize these computations as much as I can. I've actually had some success: some of our simulation test scenarios have gone down from roughly 60 seconds to about 50 seconds. Not an astronomical improvement, but nothing to be sneezed at either.

The way I've been going about this is mainly to break up complex computations that reuse the same expression over and over. Instead of evaluating the same expression many times, I evaluate it once and feed the result into the computations that depend on it, something like the sketch below. The downside is that the readability of the calculations has suffered badly. It's also pretty tedious work, and I have to take great care not to accidentally mess up any of the rather complicated calculations.
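For illustration, the kind of transformation I mean looks roughly like this (a made-up d-q style expression, not my actual motor equations):

#include <cmath>

// Before: sin(theta) and cos(theta) are each evaluated twice.
void transform_naive(double theta, double a, double b, double& d, double& q)
{
    d =  a * std::cos(theta) + b * std::sin(theta);
    q = -a * std::sin(theta) + b * std::cos(theta);
}

// After: evaluate the shared subexpressions once and reuse the results.
void transform_hoisted(double theta, double a, double b, double& d, double& q)
{
    const double s = std::sin(theta);
    const double c = std::cos(theta);
    d =  a * c + b * s;
    q = -a * s + b * c;
}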

So I was wondering if there are any fancy mechanisms or tools that can be fed a series of complicated math expressions and that will try to automatically identify and remove any redundant calculations to optimize the computations as much as possible. Does such a thing exist or am I stuck with handcrafted optimizations?
Depending on your compiler it may have a 'fast math' setting which will make it optimize floating point expressions more aggressively than normal. The downside is that you may lose some precision / correctness, because it lets the compiler make assumptions like (a+b)+c == a+(b+c), which isn't always true in floating point maths. If my googling is right and you're using the Sun Studio compiler, you want -fsimple=2 for that. You may also want -fns, which means flush denormals to zero.
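As a tiny illustration of why reassociation changes results (made-up numbers, purely to show the effect):

#include <cstdio>

int main()
{
    float a = 1e20f, b = -1e20f, c = 1.0f;
    std::printf("(a + b) + c = %g\n", (a + b) + c); // prints 1
    std::printf("a + (b + c) = %g\n", a + (b + c)); // prints 0 -- c is absorbed by b
    return 0;
}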

You can also simplify equations with software assistance.

Other options:
- Make use of SIMD where possible. What's the target processor?
- Make sure that you aren't doing computations on NaNs, infinities or denormals. They can be much, much slower than standard floating point values.
- It can help that cos(x) == sqrt(1.0 - (sin(x) * sin(x))) when you're calculating the sin and cos of the same value. Note the square root only gives |cos(x)|, so you have to fix up the sign yourself; see the sketch after this list.
- You can trade more accuracy for speed by for example using a look up table for sin(), cos() and tan(). Make the number of entries in the table a power of two.
- Make use of all your processor cores by splitting the work across multiple threads.
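A rough sketch of that sin/cos trick, assuming the angle has already been reduced to [-pi, pi]; whether the sqrt actually beats a second cos() call depends on the hardware, so measure it:

#include <cmath>

// Compute sin and cos together, paying for only one transcendental call.
// On [-pi, pi], cos(x) is negative exactly when |x| > pi/2.
void sin_and_cos(double x, double& s, double& c)
{
    const double kHalfPi = 1.57079632679489661923;
    s = std::sin(x);
    c = std::sqrt(1.0 - s * s);   // this is |cos(x)|
    if (std::fabs(x) > kHalfPi)   // restore the sign
        c = -c;
}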
Some things to try if you aren't already:

  • investigate your compiler's options for floating point. You can get a lot of speed up simply by adding a couple of flags. Take care though, as they usually place additional constraints/requirements on your code e.g. they assume you'll never have any NaNs, you won't be reading errno to check for errors in transcendentals, etc.
  • SIMD. There are usefully-accurate polynomial approximations to sin/cos across pi-sized ranges which can be used for e.g. SSE implementations (see the sketch after this list).
  • Newer versions of GCC have auto-vectorization support (provided you specify the right flags). VS11 also has this feature, apparently.
  • tan(x) = sin(x)/cos(x), so if you're calculating sin, cos and tan for the same calculation, you can save some work
  • threads (careful if you have to retrofit them!)
  • and then there's OpenCL and CUDA
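To give a feel for the polynomial approach, here's a scalar sketch using plain Taylor coefficients on [-pi/2, pi/2]; a proper minimax fit buys more accuracy for the same degree, and the same structure maps directly onto SSE intrinsics:

// Odd polynomial approximation of sin(x) for |x| <= pi/2, evaluated with
// Horner's scheme. These are the Taylor coefficients up to x^9/9!; the
// alternating-series bound keeps the error below roughly 4e-6 on this range.
// Inputs outside the range must be reduced first (e.g. sin(pi - x) = sin(x)).
inline double sin_poly(double x)
{
    const double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
              + x2 * ( 1.0 / 120.0
              + x2 * (-1.0 / 5040.0
              + x2 * ( 1.0 / 362880.0)))));
}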

Ninja'd :(
Quote
from roughly 60 seconds to about 50 seconds. Not an astronomical improvement but nothing to be sneezed at either.

Is this the worst case? How many such test cases will be run? How many hours/days/years of computational work are we talking about here?

For comparison: one month of compute time on Amazon costs about $70. For even the lowest software engineer salary, one can keep 10-20 instances running full-time, giving you a 2000% improvement, compared to the 17% improvement you've made so far.

Another option is buying more machines. Weigh the cost of development vs. cost of an i7 (~$250).

What are the targets here?

Accuracy? SIMD and GPU may be fast, but they have accuracy problems. GPUs can overheat and corrupt computations undetectably. That's no big deal for graphics, but for number crunching it can be a problem. Fast math is likely a no-go for the same reason; if results need to be accurate, it'll sooner be /fp:precise or whatever the most robust setting is.


Since a 60 s computation time is being mentioned, it doesn't seem it has to be interactive. In that case the range of solutions becomes very large, but as far as cost factors go, optimizing the algorithm is not one of them - hardware is simply too cheap.

Quote
it's pretty tedious work and I have to take great care not to accidentally mess up any of the rather complicated calculations.

You do run the full test suite after each change, right? And you do have an unoptimized reference implementation against which the values are compared?

If not, then write that before optimizing.

- The first is a table of hand-calculated or textbook results
- The second is an unoptimized reference implementation: slow, compiled for maximum reliability, perhaps in debug mode
- Then come incremental improvements. If results differ, an RMS or similar statistic is measured against the first two to determine the deviation

Having that, optimization can proceed incrementally, with only a few seconds needed to test validity. For complex computations and simulations, any mistake will quickly cause a large divergence.
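A minimal sketch of such a harness; simulate_reference and simulate_optimized are placeholders for whatever the module's real entry points are:

#include <cmath>
#include <cstdio>
#include <vector>

// Placeholder declarations -- substitute the real entry points of the module.
double simulate_reference(double input);   // slow, trusted implementation
double simulate_optimized(double input);   // the version being optimized

// Returns the RMS deviation of the optimized results from the reference.
double rms_deviation(const std::vector<double>& inputs)
{
    double sumSq = 0.0;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        const double diff = simulate_optimized(inputs[i]) - simulate_reference(inputs[i]);
        sumSq += diff * diff;
    }
    return std::sqrt(sumSq / inputs.size());
}

// Usage: fail the test if the deviation exceeds an agreed tolerance, e.g.
// if (rms_deviation(testInputs) > 1e-9) std::printf("optimization broke something\n");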
Another option may be multi-core, multiprocessor, and multi-node parallelism. Threading was mentioned, but more powerful multiprocessing techniques exist if your problem is big enough. It takes a bit of time to restructure an algorithm to run in parallel across several machines, but if the work load is large enough the wall clock time savings can easily be worth the effort.
Another thing: try to count the number of operations performed. Compare that to the FLOPS rating of the CPU. It will give you a rough estimate of how far from the theoretical limit you are. If the difference is a factor of 10 or less, optimize the algorithm instead; micro-optimizations won't improve anything. Even converting it to multiple threads won't gain much on 2 or 4 core machines.

And if the difference is several orders of magnitude, then there's something very wrong with the current implementation, so identify that bottleneck first.
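A back-of-envelope version of that comparison, with made-up numbers standing in for real counts:

// Rough feasibility check with made-up numbers -- plug in your own counts.
const double flopCount  = 2.0e9;   // estimated FP operations per test scenario
const double peakFlops  = 10.0e9;  // rough single-core peak of the CPU
const double idealSecs  = flopCount / peakFlops;   // ~0.2 s at the limit
const double actualSecs = 50.0;                    // measured runtime
const double gap        = actualSecs / idealSecs;  // ~250x from the limit
// A gap of several hundred suggests stalls, memory traffic or expensive
// library calls dominate, rather than raw floating point throughput.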
Lookup tables may help with the sin/cos/tan operations - especially if precision is not of huge importance for these. Alternatively, if you're regularly computing both the sin and cos of the same angle (a common enough scenario) you may be able to add some inline asm to use a sincos instruction instead (although your compiler may already be making this optimization, so it might gain you nothing).
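For what it's worth, glibc exposes a sincos() extension (and some other libm implementations have an equivalent), which avoids the inline asm entirely; whether it exists or is any faster on your toolchain is something to check:

#define _GNU_SOURCE   // needed before any includes for glibc's sincos()
#include <math.h>

// Rotate a 2D vector by 'angle' -- one libm call instead of sin() + cos().
void rotate(double angle, double v[2])
{
    double s, c;
    sincos(angle, &s, &c);
    const double x = v[0], y = v[1];
    v[0] = c * x - s * y;
    v[1] = s * x + c * y;
}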

Lots of the same computations inside a tight loop sound like ideal candidates for a GPU-based solution to me though. I wouldn't worry overly much about accuracy - any floating-point ops have inherent inaccuracy so what's more important is that you define your tolerances here and decide if the operation is within them. The biggest downside to using the GPU is latency in getting the results back - not due to bandwidth but due to pipeline stalls - so what it's more suited for is where this latency is a small fraction of the time taken to do the computation.

Hello,

I haven't read all the posts but anyway maybe my idea could help.

First of all, it would be much better if you provided us with some detailed information, like approximately how many times you call these trigonometric functions in the loop.
How important is precision?

If precision is not so important, or an absolute error of about 0.000006 (at most, not in all cases) is acceptable, you could pre-calculate sin (and cos if you need it) for about a million values (e.g. PRECISION = 1000000; for each x = k * 2*Pi / PRECISION, k = 0 .. PRECISION) and store these values in an array.
When you need the value of sin(x), find k = x * PRECISION / (2*Pi) (as an integer); your value will be sinus[k] or sinus[k + 1], depending on which of k and (k+1) is closer to x * PRECISION / (2*Pi) (where sinus is the name of the array holding the pre-calculated values).

You can also increase the precision by increasing PRECISION value.
Time complexity will be O(PRECISION) to pre-calculate, and O(1) to calculate the value of sin(x).
Memory complexity will be O(PRECISION) too, and it will be around 8MBs for PRECISION = 1000000 when the array is of type double.
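A minimal sketch of that table (sinus, PRECISION and the rounding step as described above; x is assumed to already be wrapped into [0, 2*Pi)):

#include <cmath>
#include <vector>

const int    PRECISION = 1000000;
const double TWO_PI    = 6.28318530717958647692;

// Pre-calculate sin(k * 2*Pi / PRECISION) for k = 0 .. PRECISION.
std::vector<double> buildSinTable()
{
    std::vector<double> sinus(PRECISION + 1);
    for (int k = 0; k <= PRECISION; ++k)
        sinus[k] = std::sin(k * TWO_PI / PRECISION);
    return sinus;
}

// Look up sin(x) for x in [0, 2*Pi), rounding to the nearest table entry.
double tableSin(const std::vector<double>& sinus, double x)
{
    const int k = static_cast<int>(x * PRECISION / TWO_PI + 0.5);
    return sinus[k];
}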

There are probably more ways to solve the problem. If this solution is not of any help I could come up with a new one, but I'd need some more feedback first.
Lookup tables, especially those larger than a single cache line, will simply kill performance due to cache thrashing.

If accuracy really isn't an issue, then on just about any modern PC a simplified Fourier expansion using the first n terms will be faster. These also translate nicely to SIMD.

A Google search finds this random article (pdf). A fit of the form cos(x) = c1 + c2*x^2 + c3*x^4 yields an error < 0.001; the sin approximation gives an error < 0.0001.

Quote
it will be around 8MBs for

This is huge and unless one is guaranteed locality, it will trash L1, L2 and L3 cache.

---
Looking through the instruction tables (pdf) for Nehalem, things aren't quite that easy to figure out; especially for newer architectures there aren't directly useful numbers.

But fsin/fcos show latency of 40-100 cycles. It's possible that these pipeline well, so reordering for batch operations might help.

Compared to that, fmul has negligible execution time and 5 cycle latency, with multiplies pipelining well. Given the above formula, we get some 4*5 + 2*3 cycles of latency and 6 uops for the approximation. Definitely a good candidate for SIMD (with latency dominating) or carefully reordered computation to minimize stalls. The compiler might be able to do this. But it's really hard to judge numbers without testing; these CPUs are so complex just about anything can happen.


An i7 L1 cache miss is ~10 cycles (it varies a lot), so unless the table is carefully optimized, it will rarely, if ever, beat a properly pipelined raw calculation.

Quote
it will be around 8MBs for

This is huge and unless one is guaranteed locality, it will trash L1, L2 and L3 cache.


He doesn't have to put it all in cache. If that's the first thing to pre-calculate, the cache will keep only the most-used values from the sinus array.

Edit: sin(x) (and cos too) can be calculated using a Taylor series (a Maclaurin series, actually) with an absolute error of 0.000006 in about 12 steps (in the worst case), and that would be about 1 s of time if you call sin(x) 10 million times.
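A sketch of that series evaluation, stopping as soon as a term drops below the target error (assuming the argument has already been reduced to roughly [-Pi, Pi]):

#include <cmath>

// Maclaurin series for sin(x): x - x^3/3! + x^5/5! - ...
// Once the terms are decreasing, the error of an alternating series is
// bounded by the first omitted term, so stopping when a term falls below
// maxError keeps the result within that tolerance.
double sinSeries(double x, double maxError = 0.000006)
{
    double term = x;
    double sum  = x;
    const double x2 = x * x;
    for (int n = 1; std::fabs(term) > maxError; ++n) {
        term *= -x2 / ((2.0 * n) * (2.0 * n + 1.0));  // next term of the series
        sum  += term;
    }
    return sum;
}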
