Could you elaborate on the example you posted, why the first version is bad?
That by 'm_sum'/'m_vec' being acessed through pointer ('this'?), during each iteration the loop has to load the members from memory? Why can't it just load the initial value of 'm_sum' into a CPU register, add to it during the loop and then store the value?
That the end statement is recalculated every iteration?
Why can't the compiler just assume the member variables won't be altered outside the function? And if has to be, force the programmer to make explicit use of the volatile keyword?
I would have thought that something as trivial as size() would have gotten inlined out?
It will inline the function, but assuming size returns a member variable of type size_t (and given the example loop), a valid compiler MUST issue an instruction to re-load the value of that size member once per iteration rather than caching it.
These are the aliasing rules of the language that GCC assumes we all know :wink:
Let's pretend we have a compiler that generates this intermediate representation from my "bad" example, using intrinsics to access memory:
write<size_t>(&m_sum, 0);
size_t i=0;
repeat:
size_t temp = read<size_t>(&m_vec.m_size);
if( i == temp )
goto end;
write<size_t>(&m_sum, read<size_t>(&m_sum) + read<size_t>(m_vec.m_begin + i));
++i;
goto repeat;
end:
The memory accesses are now painfully obvious :)
The compiler will now try to optimize away as many as it can, but this code is fighting against the optimizer.
A dumb compiler can conclude that any read instruction could be affected by any write instruction, meaning it can only remove repeated sequences of reads if no writes occur in the middle. In this example, the repeated reads of the loop condition (m_vec.m_size) are always preceded by a write to m_sum, so they can't be optimize away (and replaced with a single read that caches the value in a local variable).
A smart compiler (with strict aliasing) can conclude that any read instruction of type T could be affected by any write instruction of type T, meaning it can only remove repeated sequences of reads if no writes
of the same type occur in the middle. In this example, I deliberately chose a vector of size_t's and assumed the vector's size is also stored in a size_t, which defeats this little bit of smartness in the compiler -- it's still unable to optimize.
In general, you will often mix a lot of different types in your loops, so GCC's strict aliasing support does help out the optimizer.
...However, instead of feeding the compiler garbage and hoping that it figures out your intentions, it's much better to just tell the compiler what you know insteaf of asking it to guess.
As shown in this simple example, sometimes even the smartest compiler is unable to rearrange your code to what you want, because the language spec forbids it.
In general:
* prefer caching values into local variables yourself, especially for loop conditions.
* avoid repeated writes via a pointer if it's easily replaced with repeated writes to a local (amd one final write via the pointer).
* assume that pointer writes will cause pointer reads to be uncachable (and do the caching yourself).
* don't make assumption about the implementation of even simple getters -- vec.size() likely is implemented as "return end-begin;", but as shown in this example, if it happened to be "return m_size;", then it could cause aliasing issues for the optimizer. Better to just assume that the optimizer won't be able to remove repeated calls to the getter, and do the caching explicitly like in my 2nd example.