MSVC generating much slower code compared to GCC

Hello everyone!

 

We just noticed that the same C++ program performs way better when run with GCC on Linux than it does on Windows with Visual Studio 2015.

 

The test program calculates some integrals over and over again and was used in one of my programming assignments, where I work under Linux on my laptop, which doesn't have a very fast CPU. The program finishes in about 0.9 seconds there, while it takes a whopping 8-9 seconds on my much faster Windows machine!

 

Another problem is that our game project runs much faster on my friend's Linux machine than it does on mine. He still gets ~140 fps during some heavy physics scenes, while I'm down to ~20.

 

I'm very confused by this, and I can't seem to find any more compiler flags that would improve the situation.

The flags I used on Windows are (these come directly from a compiler benchmark, since I was desperate):

/arch:SSE2 /Ox /Ob2 /Oi /Ot /Oy /fp:fast /GF /FD /MT /GS- /openmp

While the GCC build uses:

-O3 -ffast-math -fopenmp -funroll-loops -march=native

The same phenomenon could be observed on another friend's Windows machine, which has an even better CPU than mine.

It's really weird that my tiny laptop can outperform these computers so easily. I mean, I should be able to get the same code to roughly the same speed on both operating systems, right?

 

Are there any more magic compiler flags to set, or other pitfalls I should look out for?

 

Thanks in advance!

When I'm not sure what's going on with code generation, I look at the disassembly. Maybe the Linux version discovered that it can optimize the intermediate steps of a loop away because those steps aren't used?

Can we look at your code?

You can help out MSVC by making use of __restrict, to give manual promises/hints about where aliasing situations cannot arise (this will also help out GCC as long as you use a macro, so that it changes to __restrict__ on GCC).

That isn't even necessary; __restrict works mighty fine on GCC too.

 

What doesn't work is using restrict (without underscores) as per C99. I considered that somewhat unlucky for a long time, because GCC allows a lot of C99 stuff as GNU extensions in C++ that isn't very useful, yet this one, which would be quite nice, isn't supported. Then again, you can use the exact same spelling on either compiler with __restrict, which is actually preferable to having a no-underscore version (if you ever intend to make code portable between MSVC and GCC). So I don't consider it "unlucky" any more; it's actually a good decision.

Edited by samoth


e.g. this is terrible code that should fail a code review:

//members: size_t m_sum; vector<size_t> m_vec;
m_sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
  m_sum += m_vec[i];

A good compiler is forced to generate terrible asm given ^that^ code. Even GCC with its strict aliasing can't fix the mistakes in it.

This is the fixed version that is giving the correct hints to the compiler so that it can produce good code:

size_t sum = 0;
for( size_t i=0, end=m_vec.size(); i != end; ++i )
  sum += m_vec[i];
m_sum = sum;

 

Could you elaborate on the example you posted? Why is the first version bad?

 

Is it that, because 'm_sum'/'m_vec' are accessed through a pointer ('this'?), the loop has to load the members from memory during each iteration? Why can't it just load the initial value of 'm_sum' into a CPU register, add to it during the loop, and then store the value at the end?

Or is it that the end condition is recalculated every iteration?

 

Why can't the compiler just assume the member variables won't be altered from outside the function? And if they can be altered, why not force the programmer to make explicit use of the volatile keyword?


The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.
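
For reference, a rough sketch of what a typical size() reduces to after inlining (the member names here are illustrative; real implementations differ):

// rough sketch: inside std::vector, size() is commonly just a pointer subtraction
// (the "end - begin" form mentioned above)
std::size_t size() const { return static_cast<std::size_t>(m_finish - m_start); }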


 

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.

 

It was inlined out, but calling size() every iteration prevented the compiler from vectorizing the loop (i.e. using SIMD instructions). Precalculating the size lets the optimizer figure out that it can vectorize the loop. Using a local variable to store temporary results also improves the chances.
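
As an aside, a minimal sketch: a range-based for gives the same guarantee, because it evaluates begin() and end() exactly once before the loop body runs:

// equivalent to the hand-hoisted version: begin()/end() are evaluated once up front
size_t sum = 0;
for( size_t v : m_vec )
  sum += v;
m_sum = sum;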


The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.

It was inlined out, but calling size() every iteration prevented the compiler from vectorizing the loop (i.e. using SIMD instructions). Precalculating the size lets the optimizer figure out that it can vectorize the loop. Using a local variable to store temporary results also improves the chances.


It would be nice if there was a compiler warning that could be turned on for critical blocks of code that would tell you this kind of thing up front, obviating the need to look at the generated assembly.



It would be nice if there was a compiler warning that could be turned on for critical blocks of code that would tell you this kind of thing up front, obviating the need to look at the generated assembly.
Some static analysis tools will point them out when they find them. 

 

The problem is one of threshold. What deserves a warning, and what doesn't? For nontrivial programs, if the compiler were to list every optimization it didn't take, there would be millions, perhaps billions, of lines of output.

 

One fundamental principle of the C family, including C++, is that the language assumes the programmer is always right: the programmer always knows what they are doing and why they are doing it. It stems back to the era the language was made, when human time was far cheaper than computer time. Programmers were experts (like surgeons or lawyers) who carefully reviewed every line of code and fully understood its effects and side effects.


The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.

 

I've seen this come up in profiles before (and have corrected it). Visual C++, at least, has a lot of difficulty with this situation. It treats m_sum practically as a global variable for the purposes of optimization in this case, so every single operation done on it in that loop incurs a load-hit-store. In the latter version, the compiler knows very well that sum is local to the function and thus generally just performs the operations in a register and stores the result at the end.

This can be hugely faster in some cases, depending on what you're doing. While this pattern is bad on x86, on the PPC consoles (like the 360 or the PS3), where a load-hit-store was death, it caused major slowdowns.

Edited by Ameise

I actually just meant that the function invocation would be inlined, not the actual size computation/access. Probably I misinterpreted the use of "inlined" in the post I quoted.
 

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.

It's not trivial at all. Consider the following:
m_sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
{
  someFunc();
  m_sum += m_vec[i];
}
Unless the full body of someFunc is available at compile time (and even if it is, someFunc must not do anything that is invisible to the compiler), the compiler literally can't know whether someFunc() will alter m_sum, or whether it will push or remove elements from m_vec; hence m_vec.size() must be fetched from memory every single iteration, and so must m_sum.


m_vec implies that this is all happening in a class method. What if someFunc is a const method? Though, I suppose the compiler would have to assume that const_cast might be used within the function if it's defined in another translation unit.

Edited by Oberon_Command


What if someFunc is a const method? Though, I suppose the compiler would have to assume that const_cast might be used within the function if it's defined in another translation unit.

Yep, const has no impact on the optimizer; it's not a promise that a value can't change. Even without const_cast, just because your pointer is const doesn't mean that someone else doesn't have a non-const pointer to the same data.
What you actually want is restrict, not const:
//I promise that no one else has a pointer to this data:
size_t* __restrict sum = &m_sum;
*sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
{
  someFunc();
  *sum += m_vec[i];
}
Now it will be optimized, but if someFunc does happen to modify m_sum, you've got undefined behaviour.
Better to just manually cache it in a local variable yourself, as that doesn't involve rare keywords and always has the same behaviour (if m_sum is modified by someFunc, those changes will be lost).
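
A minimal sketch of that manual-caching approach applied to the same loop:

// accumulate in a local; a single store to the member at the end
size_t sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
{
  someFunc();
  sum += m_vec[i];
}
m_sum = sum;  // any change someFunc made to m_sum is overwritten here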


__restrict cannot be used on references though, for whatever reason. And even though I read an explanation on these forums that references already cannot be rebound, looking at the generated assembly showed significantly worse code than with pointers and __restrict (or the equivalent).

 


vec.size() likely is implemented as "return end-begin;", but as shown in this example, if it happened to be "return m_size;", then it could cause aliasing issues for the optimizer.

 

Even if it were implemented as end-begin, what's the difference? It will need to access member variables of the class regardless (this->end, this->begin), so aliasing is bound to happen unless you tell the compiler otherwise.

 

Especially with more complicated loops, explicitly caching values like the result of size() is well worth it, since:

- The compiler cannot really assume you are not using a pointer to the vector inside the loop. Especially since MSVC is very lenient about pointer casting, pretty much any pointer you use can hold any memory address you want, so the compiler always has to assume that every pointer can manipulate the value of everything else (probably one of the reasons why MSVC performs worse than most other compilers).

- Even if it could, you are telling the compiler "I want to compare against the size of the vector at every iteration". Who guarantees that the value of size() stays constant? Nobody. Especially if you are using the vector inside the loop, it is very likely that some call might just change the size of the vector.
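
To make that difference concrete, a quick sketch (process() is a hypothetical call that may append to the vector):

// comparing against size() every iteration also visits elements appended by process():
for( size_t i=0; i != m_vec.size(); ++i )
  process( m_vec[i] );

// caching the size only visits the elements that existed when the loop started:
for( size_t i=0, end=m_vec.size(); i != end; ++i )
  process( m_vec[i] );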


Even if it were implemented as end-begin, what's the difference? It will need to access member variables of the class regardless (this->end, this->begin), so aliasing is bound to happen unless you tell the compiler otherwise.

Depends on how smart the compiler is.
Dumb compilers conform with the spec by implementing the rule "any write can modify the value of any upcoming read".
Smart compilers conform with the spec by implementing the rule "any write of type T can modify the value of any upcoming read of type T".
In the example loop, the writes are of type size_t, so if the getter reads a size_t member, then it's aliasing on every compiler. If the getter reads two Blah* members, they won't alias with the loop's size_t writes, at least on smart compilers.
But yes, in general you should assume your compiler is dumb and tell it everything that you know.
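
A minimal sketch of that type-based rule, with hypothetical getters:

struct Container {
    size_t m_size;
    int*   m_begin;
    int*   m_end;

    // reads a size_t member: under the type-based rule above, a size_t store elsewhere may alias it
    size_t sizeFromMember() const { return m_size; }

    // reads two pointer members: under the type-based rule, size_t stores can't alias them
    size_t sizeFromPointers() const { return size_t(m_end - m_begin); }
};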

references already cannot be rebound, looking at the generated assembly showed significantly worse code than with pointers and __restrict

A reference is basically equivalent to a const-pointer (where the pointer itself is const, not the thing that it points to).
A const-pointer is treated just like any other pointer, except that if it exists only in a local scope, the compiler can maybe assume that it will only ever point to one memory address.
A restrict-pointer lets the compiler assume that the value at that memory address won't be changed by any writes, except the writes that occur through this particular pointer variable.
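
A minimal sketch of the three flavours side by side:

size_t  value = 0;
size_t& ref = value;               // reference: can't be rebound, but other pointers may still alias value
size_t* const cptr = &value;       // const pointer: the same idea spelled as a pointer
size_t* __restrict rptr = &value;  // restrict pointer: promise that only rptr-based accesses touch value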


m_vec implies that this is all happening in a class method. What if someFunc is a const method? Though, I suppose the compiler would have to assume that const_cast might be used within the function if it's defined in another translation unit.

The following code modifies m_sum, setting it to 7 on every iteration when someFunc gets called, with no const_cast involved:
 

size_t *globalVar = nullptr;

void Foo::doIt()
{
    globalVar = &m_sum;
    m_sum = 0;
    for( size_t i=0; i != m_vec.size(); ++i )
    {
      someFunc();
      m_sum += m_vec[i];
    }
}

void Foo::someFunc() const
{
    *globalVar = 7;
}

You could argue using a global variable is a bad practice, but:

  1. It doesn't matter. Something like this can happen, therefore the compiler can't assume it can't happen.
  2. I can give you more complicated examples where "proper" practices are used and m_sum still gets modified in the middle; it would just take many lines of code to show.

The correct way to do it is, as Hodgman said, via temporaries, or via __restrict. The latter is basically saying "I promise", and gives the compiler permission to screw you if you break the promise.

In the above example, a smart compiler (like Clang) would determine whether someFunc can be proven not to have side effects. In a nutshell, it goes through functions recursively, depth first, and marks as side-effect free any function that doesn't write to memory other than the stack and only calls other functions already marked as side-effect free.

Of course, if the function is defined in a different translation unit, the compiler has to assume that it potentially has side effects (perhaps it would be nice if this information were exported in object files), but IMO that's more of a case for link-time code generation than for micro-optimizing with additional local variables.


__restrict cannot be used on references though, for whatever reason. And even though I read an explanation on these forums that references already cannot be rebound, looking at the generated assembly showed significantly worse code than with pointers and __restrict (or the equivalent).

 

__restrict can be used on references as of Visual C++ 2015. It was one of the things I was glad they added. You can also specify member functions as __restrict in 2015 (so that this is treated as a restrict pointer).
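
A sketch of the reference form, assuming it mirrors the pointer syntax (the identifiers here are illustrative):

#include <vector>

// __restrict on a reference parameter (VC++ 2015 per the above; GCC/Clang accept the same spelling)
void addAll( size_t& __restrict sum, const std::vector<size_t>& v )
{
    for( size_t x : v )
        sum += x;   // sum is promised not to alias v's storage
}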


There's one thing Visual Studio can do to trip up performance measurements if you're not aware of it, which has nothing to do with the compiler.
 
If you run a program by pressing F5 then you will get the Windows debug heap enabled, which is much much slower than the non-debug one.
 
The simple workaround is to launch it without the debugger attached by using Control+F5 if you're doing performance testing.


The simpler workaround is to use optimised release builds in a profiler when doing performance testing.


The simpler workaround is to use optimised release builds in a profiler when doing performance testing.



As near as I can tell, debug/release and code optimizations have no effect on the behavior Adam_42 is talking about, though. If you run a release build by pressing F5 and profile it, it'll be slower in allocation logic than if you attach after launching the process.

