C++ Compiler Optimization

Started by
14 comments, last by Slavik81 12 years, 5 months ago
So I know code like this:
float denominator = 20 / 50 * 91 + 3;
float numeratorThing1 = 70 * 90;

float val1 = (56 * 90 + numeratorThing1) / denominator;
float val2 = (57 * 90 + numeratorThing1) / denominator;
float val3 = (58 * 90 + numeratorThing1) / denominator;


Is just as optimal as code like this:
float val1 = (56 * 90 + 70 * 90) / (20 / 50 * 91 + 3);
float val2 = (57 * 90 + 70 * 90) / (20 / 50 * 91 + 3);
float val3 = (58 * 90 + 70 * 90) / (20 / 50 * 91 + 3);


But does the compiler know to optimize out function calls that would return the same value each time?
for(vector<Widgets>::iterator iter = m_widgets.begin(); iter != m_widgets.end(); iter++) {
float val1 = (56 * 90 + iter->getWidth()) / getApplicationContext()->getScreenDimensions().getViewAspect();
float val2 = (57 * 90 + iter->getWidth()) / getApplicationContext()->getScreenDimensions().getViewAspect();
float val3 = (58 * 90 + iter->getWidth()) / getApplicationContext()->getScreenDimensions().getViewAspect();
}


Into something that would be more optimal like this?
vector<Widgets>::iterator widgetsEnd = m_widgets.end();

for(vector<Widgets>::iterator iter = m_widgets.begin(); iter != widgetsEnd ; iter++) {
float width = iter->getWidth();
float viewAspect = getApplicationContext()->getScreenDimensions().getViewAspect();

float val1 = (56 * 90 + width) / viewAspect;
float val2 = (57 * 90 + width) / viewAspect;
float val3 = (58 * 90 + width) / viewAspect;
}
Advertisement

But does the compiler know to optimize out function calls that would return the same value each time?

In general, no. If the functions are inline then it might.
I'd write the latter code anyway, because it is clearer to me. I'd probably even have put the view aspect outside the loop too, given that it is independent of the widgets.

Microsoft's Link Time Code Generation can inline functions across modules, and apparently GCC can too.

The real answer is to enable the appropriate optimisations in Release mode and examine the resulting disassembly. Don't forget to profile before and after, and ensure you're having a positive impact on performance. Of course, best would be to profile the whole program first, and concentrate your efforts on your actual bottlenecks, rather than chasing potential problems.
If you use gcc, you can use the attribute `pure' to tell the compiler that the function doesn't have side effects and only depends on its arguments and global variables. This would allow for such an optimization to happen.

However, I agree with rip-off that writing the code with the intermediate variable is preferable because it's more readable, particularly if you give the intermediate variable an informative name.
Most of such optimizations are not valid due to aliasing rules. While rare in practice, they do occur and compilers prefer conservative route to avoid causing obscure bugs.

Accessing members of a struct should not be inlined, even though it often is. Floating point calculations are also dependent on hardware, so compiler might choose to not optimize them at compile time. Some compilers allow extra switches to force such optimizations.

If performance matters, then guessing what compiler might or might do isn't productive or useful. Either profile and examine assembly, or write code that isn't susceptible to such issues.

// avoid reading m_widgets (requires this dereference)
// avoid using accessor, vector is linear
Widget * curr = &m_widgets.front();
Widget * end = curr + m_widgets.size();

// multiply is always preferred to division
float viewAspectInverse = 1.0 / getApplicationContext()->getScreenDimensions().getViewAspect();


for ( ; curr < end; curr++) { // save one register by using start and end pointers
float width = curr->width; // no getter, plain variable access
float val1 = (56 * 90 + width) * viewAspectInverse;
float val2 = (57 * 90 + width) * viewAspectInverse;
float val3 = (58 * 90 + width) * viewAspectInverse;
}


Of course, the code above is NOP, hopefully the real one does something more useful.

The loss of "safe" programming here is a trade-off. Writing generic <algorithm>-like code has its advantages, but carries a lot of hidden penalties. So if performance does matter, it makes sense to choose more suitable approach.

Once code is converted to use plain arrays, then one can go one step further and mark variables with __restrict (technically a C not C++ feature). But if using raw pointers and PODs, code might as well be compiled as plain C.void foo(WidgetPOD * p, size_t n);

void foo_cpp(vector<Widget> & w) {
foo(&w.first(), w.size());
}Or something similar. std::for_each would also work.

And by being explicit about what code does, compiler also has more optimization opportunities.

As far as basics go:
- avoid any kind of 'this' dereferencing, or evaluate it once. If accessing class members, there can be considerable hidden cost
- avoid dereferences in general, not because pointer chasing would necessarily be slow, but because compilers need to be conservative about what they access
- prefer PODs or at least expose raw members
- despite compilers being capable of doing it, perform invariant hoisting yourself - mostly to avoid hidden penalties mentioned above

Another way to give compiler explicit instructions is through template meta programming, but that code gets ugly. I did however manage to get compiler to reliable emit more efficient integer operations, including more compile-time evaluation, but with some really gnarly code, not suitable for human consumption.
Hmm yeah. I've been doing a lot of coding lately where I think, nah whatever, compiler will optimize that. Especially constantly dereferencing STL iterators in for loops.

I might want to go back and rewrite some of my code heh. It runs fast now but then again I'm running it on a fast computer. I might even get an FPS boost in my game. I'm definitely going to profile it a bit first to make sure this is worth my time.

One thing about this code though is it puts curr and end outside of the scope of the for loop which is a bit annoying.
And I can have element pointers for Widget in this loop since it's a vector, but that won't work on things like map iterators or list iterators. I use those sometimes if vector isn't good enough.

Also a lot of times I have inline functions for things like getWidth so I think those get optimized, right? There's no need to worry about non direct member access?

// avoid reading m_widgets (requires this dereference)
// avoid using accessor, vector is linear
Widget * curr = &m_widgets.front();
Widget * end = curr + m_widgets.size();

// multiply is always preferred to division
float viewAspectInverse = 1.0 / getApplicationContext()->getScreenDimensions().getViewAspect();


for ( ; curr < end; curr++) { // save one register by using start and end pointers
float width = curr->width; // no getter, plain variable access
float val1 = (56 * 90 + width) * viewAspectInverse;
float val2 = (57 * 90 + width) * viewAspectInverse;
float val3 = (58 * 90 + width) * viewAspectInverse;
}
I think you could just put curr, end in the for statement..


for (Widget *curr = &m_widgets.front(), *end = curr + m_widgets.size(); curr < end; ++curr)
{

}
The answer is: tune where your optimizer tells you you have a problem spot and focus on clarity everywhere else.

It's almost never worth worrying about microperformance, and when it is you'll definitely know it (it has dollar signs attached).

Stephen M. Webb
Professional Free Software Developer

Most of such optimizations are not valid due to aliasing rules. While rare in practice, they do occur...
I'd argue that aliasing is extremely common in almost every C++ program. For the unaware: any time you write a function that uses multiple pointers, you've potentially added aliasing. N.B. [font="Courier New"]this[/font] is a pointer like any other.void Widget::SumTotal( int& outputTotal ) {
outputTotal = 0;
for( int i=0; i<m_count; ++i )//must read m_count every iteration...
outputTotal += m_data;//...because...
}

Widget w; w.SumTotal( w.m_count );//...this is a valid use of that function

void Widget::SumTotal( int& outputTotal ) {//but did you actually mean to write this:
int localTotal = 0;
int* data = m_data;
for( int i=0, end=m_count; i<end; ++i )
localTotal += data;
outputTotal = localTotal;
}
[edit]N.B. the "local variable" rewrite works just as well as using the [font="Lucida Console"]restrict[/font] keyword (or [font="Courier New"]__restrict__[/font] / [font="Courier New"]__declspec(restrict)[/font]) here, by making the number of pointer reads/writes explicit. I'd recommend solving these problems with sensible use of local variables first (read data into locals early, write from locals to memory late), and only resorting to [font="Lucida Console"]restrict[/font] when necessary.

[quote name='Antheus' timestamp='1321552432' post='4885032']Most of such optimizations are not valid due to aliasing rules. While rare in practice, they do occur...
I'd argue that aliasing is extremely common in almost every C++ program. Any time you write a function that uses multiple pointers, you've potentially added aliasing. N.B. [font="Courier New"]this[/font] is a pointer like any other.[/quote]

And that's why we need the `restrict' keyword, like in C99.

This topic is closed to new replies.

Advertisement