MSVC generating much slower code compared to GCC

Started by
22 comments, last by ApochPiQ 8 years, 2 months ago

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.

It was inlined out, but calling size() every iteration prevented compiler from vectorizing the loop (i.e. using SIMD commands). Precalculating size lets optimizer to figure out it can vectorize the loop. Using local variable to store temporary results also improves chances.


It would be nice if there was a compiler warning that could be turned on for critical blocks of code that would tell you this kind of thing up front, obviating the need to look at the generated assembly.
Advertisement


It would be nice if there was a compiler warning that could be turned on for critical blocks of code that would tell you this kind of thing up front, obviating the need to look at the generated assembly.
Some static analysis tools will point them out when they find them.

The problem is one of threshold. What deserves a warning, and what doesn't? For nontrivial programs if the compiler were to list all optimizations it didn't take, there would be millions, perhaps billions, of lines of output.

One fundamental principle of the C family, including C++, is that the language assumes the programmer is always right, the programmer always knows what they are doing and why they are doing it. It stems back to the era it was made, when human time was far cheaper than computer time. Programmers were experts (like surgeons or lawyers) who carefully reviewed every line of code and fully understood the effects and side effects.

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.

I've seen this come up in profiles before (and have corrected it). Visual C++, at least, has a lot of difficulty with this situation. It treats m_sum practically as a global variable for purposes of optimization in this case, so every single operation done on it in that loop incurs a load-hit-store. In the latter, the compiler knows very well that sum is local to just the function and thus generally just performs the operations on a register and stores it at the end.

This can be hugely faster in some cases, depending on what you're doing. While this is bad on x86, on the PPC consoles (like the 360 or the PS3) where LHS were death, this caused major slowdowns.

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.

It's not trivial at all. Consider the following:


m_sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
{
  someFunc();
  m_sum += m_vec[i];
}

Unless the full body of someFunc is available at the compilation stage (and even if it does, someFunc must not do something that is not visible to the compiler), the compiler literally can't know if someFunc() will alter m_sum or if it will push or remove elements from m_vec; hence m_vect.size() must be fetched from memory every single loop, so does m_sum.

However if it were changed to:


size_t tmpSum = 0;
const size_t vecSize = m_vec.size();
for( size_t i=0; i != vecSize; ++i )
{
  someFunc();
  tmpSum += m_vec[i];
}

m_sum = tmpSum;

Now the compiler knows for sure neither tmpSum nor vecSize will change regardless of what happens inside someFunc (unless someFunc corrupts memory, of course) and can keep their values in a register instead of refetching on every single loop iteration.

It's far from trivial, in fact most of the time the word is "impossible" to optimize. This is where the arguments of "let the compiler do its job. Don't worry about optimization, they're really good" falls short. Yes, compilers have improved tremendously in the last 15 years, but they aren't fortune tellers. They can't optimize away stuff that might change the expressed behavior (I say expressed because what's intended may have more relaxed requirements than what's being expressed in code)

I actually just meant that the function invocation would be inlined, not the actual size computation/access. Probably I misinterpreted the use of "inlined" in the post I quoted.

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.

It's not trivial at all. Consider the following:
m_sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
{
  someFunc();
  m_sum += m_vec[i];
}
Unless the full body of someFunc is available at the compilation stage (and even if it does, someFunc must not do something that is not visible to the compiler), the compiler literally can't know if someFunc() will alter m_sum or if it will push or remove elements from m_vec; hence m_vect.size() must be fetched from memory every single loop, so does m_sum.


m_vec implies that this is all happening in a class method. What if someFunc is a const method? Though, I suppose the compiler would have to assume that const_cast might be used within the function if it's defined in another translation unit.

Could you elaborate on the example you posted, why the first version is bad?

That by 'm_sum'/'m_vec' being acessed through pointer ('this'?), during each iteration the loop has to load the members from memory? Why can't it just load the initial value of 'm_sum' into a CPU register, add to it during the loop and then store the value?
That the end statement is recalculated every iteration?

Why can't the compiler just assume the member variables won't be altered outside the function? And if has to be, force the programmer to make explicit use of the volatile keyword?

I would have thought that something as trivial as size() would have gotten inlined out?

It will inline the function, but assuming size returns a member variable of type size_t (and given the example loop), a valid compiler MUST issue an instruction to re-load the value of that size member once per iteration rather than caching it.

These are the aliasing rules of the language that GCC assumes we all know :wink:

Let's pretend we have a compiler that generates this intermediate representation from my "bad" example, using intrinsics to access memory:
write<size_t>(&m_sum, 0);
size_t i=0; 
repeat:
  size_t temp = read<size_t>(&m_vec.m_size);
  if( i == temp )
    goto end;
  write<size_t>(&m_sum, read<size_t>(&m_sum) + read<size_t>(m_vec.m_begin + i));
  ++i;
  goto repeat;
end:
The memory accesses are now painfully obvious :)
The compiler will now try to optimize away as many as it can, but this code is fighting against the optimizer.

A dumb compiler can conclude that any read instruction could be affected by any write instruction, meaning it can only remove repeated sequences of reads if no writes occur in the middle. In this example, the repeated reads of the loop condition (m_vec.m_size) are always preceded by a write to m_sum, so they can't be optimize away (and replaced with a single read that caches the value in a local variable).

A smart compiler (with strict aliasing) can conclude that any read instruction of type T could be affected by any write instruction of type T, meaning it can only remove repeated sequences of reads if no writes of the same type occur in the middle. In this example, I deliberately chose a vector of size_t's and assumed the vector's size is also stored in a size_t, which defeats this little bit of smartness in the compiler -- it's still unable to optimize.

In general, you will often mix a lot of different types in your loops, so GCC's strict aliasing support does help out the optimizer.
...However, instead of feeding the compiler garbage and hoping that it figures out your intentions, it's much better to just tell the compiler what you know insteaf of asking it to guess.
As shown in this simple example, sometimes even the smartest compiler is unable to rearrange your code to what you want, because the language spec forbids it.

In general:
* prefer caching values into local variables yourself, especially for loop conditions.
* avoid repeated writes via a pointer if it's easily replaced with repeated writes to a local (amd one final write via the pointer).
* assume that pointer writes will cause pointer reads to be uncachable (and do the caching yourself).
* don't make assumption about the implementation of even simple getters -- vec.size() likely is implemented as "return end-begin;", but as shown in this example, if it happened to be "return m_size;", then it could cause aliasing issues for the optimizer. Better to just assume that the optimizer won't be able to remove repeated calls to the getter, and do the caching explicitly like in my 2nd example.

What if someFunc is a const method? Though, I suppose the compiler would have to assume that const_cast might be used within the function if it's defined in another translation unit.

Yep, const has no impact on the optimizer, it's not a promise that a value can't change. Even without const_cast, just because your pointer is const, it doesn't mean that someone else doesn't have a non-const pointer to the same data.
What you actually want is restrict, not const:
//I promise that no one else has a pointer to this data:
__restrict size_t* sum = &m_sum;
*sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
{
  someFunc();
  *sum += m_vec[i];
}
Now it will be optimized, but if someFunc does happen to modify m_sum, you've got undefined behaviour.
Better to just manually cache it in a local variable yourself, as that doesn't involve rare keywords and always has the same behaviour (if m_sum is modified by someFunc, those changes will be lost).

__restrict cannot be used on references though, for whatever reason. And even though I read an explanation on those forums that references are already thought of to not be able to be rebound, testing it with assembler showed significant worse generated code than with pointers and __restrict (or whatever equivalent).


vec.size() likely is implemented as "return end-begin;", but as shown in this example, if it happened to be "return m_size;", then it could cause aliasing issues for the optimizer.

Even if it was implemented as end-begin, whats the difference? It will need to access member variables of the class regardless (this->end, this->begin), so aliasling is bound to happen unless you tell the compiler otherwise.

Especially with more compilcated loops, explicitely caching values like the result of size() is well worth, since

- the compiler cannot really assume you are not using a pointer to the vector used in the loop => especially since MSVC is very lenient about pointer casting, pretty much any pointer you use can be filled with any memory address you want, so the compiler always has to asume that every pointer can manipulate the value of everything else (probably one of the reasons why MSVC performs worse than most other compilers).

- Even if it could, you are telling the compiler "I want to compare to the size of the vector at every iteration". Who's guaranteeing that the value of size() is staying constant? In fact, nobody. Specially if you are using the vector inside the loop, it is very likely that some call might just change the size of the vector.

Even if it was implemented as end-begin, whats the difference? It will need to access member variables of the class regardless (this->end, this->begin), so aliasling is bound to happen unless you tell the compiler otherwise.

Depends on how smart the compiler is.
Dumb compilers conform with the spec by implementing the rule "any write can modify the value of any upcoming read".
Smart compilers conform with the spec by implementing the rule "any write of type T can modify the value of any upcoming read of type T".
In the example loop, the writes are of type size_t, so if the getter reads a size_t member, then it's aliasing on every compiler. If the getter reads two Blah* members, they won't alias with the loop's size_t writes, at least on smart compilers.
But yes, in general you should assume your compiler is dumb and tell it everything that you know smile.png

references are already thought of to not be able to be rebound, testing it with assembler showed significant worse generated code than with pointers and __restrict

A reference is basically equivalent to a const-pointer (where the pointer itself is const, not the thing that it points to).
A const-pointer is treated just like any other pointer, except that if it exists only in a local scope, the compiler can maybe assume that it will only ever point to one memory address.
A restrict-pointer lets the compiler assume that the value at that memory address won't be changed by any writes, except the writes that occur through this particular pointer variable.

m_vec implies that this is all happening in a class method. What if someFunc is a const method? Though, I suppose the compiler would have to assume that const_cast might be used within the function if it's defined in another translation unit.

The following code modifies m_sum setting it to 7 on every loop when someFunc gets called, no const_cast involved:


size_t *globalVar = nullptr;

void Foo::doIt()
{
    globalVar = &m_sum;
    m_sum = 0;
    for( size_t i=0; i != m_vec.size(); ++i )
    {
      someFunc();
      m_sum += m_vec[i];
    }
}

void Foo::someFunc() const
{
    *globalVar = 7;
}

You could argue using a global variable is a bad practice, but:

  1. It doesn't matter. Something like this can happen, therefore the compiler can't assume it can't happen.
  2. I can give you more complicated examples where "proper" practices are used and m_sum gets modified in the middle. Would take many lines of code to show.

The correct way to do it is, as Hodgman said, via temporaries, or via __restrict. The latter is basically saying "I promise" and give permission to the compiler to screw you if you break the promise.

This topic is closed to new replies.

Advertisement