MSVC generating much slower code compared to GCC


Hello everyone!

We just noticed that the same C++ program performs way better when run with GCC on Linux than it does on Windows using Visual Studio 2015.

The test program calculates some integrals over and over again and was used in one of my programming assignments, which I work on under Linux on my laptop, which doesn't have a very fast CPU. The program finishes in about 0.9 seconds there, while it takes a whopping 8-9 seconds on my much stronger Windows machine!

Another problem is that our game project runs much faster on my friend's Linux machine than it does on mine. He still gets ~140 fps during some heavy physics scenes, while I'm down to ~20.

I'm very confused by this, and I can't seem to find any more compiler flags that would improve the situation.

The flags I used on Windows are (these come directly from a compiler benchmark, since I was desperate):


/arch:SSE2 /Ox /Ob2 /Oi /Ot /Oy /fp:fast /GF /FD /MT /GS- /openmp

While the gcc-build uses:


-O3 -ffast-math -fopenmp -funroll-loops -march=native

The same phenomenon could be observed on another friend's Windows machine, which has an even better CPU than mine.

It's really weird that my tiny laptop can outperform these computers with no effort. I mean, I should be able to get the same code up to roughly the same speed on both operating systems, right?

Are there any more magic compiler flags to set, or other pitfalls I should look out for?

Thanks in advance!

When I'm not sure what's going on with code generation, I look at the disassembly. Maybe the Linux version discovered that it can optimize intermediate steps of a loop out because those steps aren't used?

Can we look at your code?
Are you using the standard library at all? MSVC by default turns on a bunch of security and correctness checking that has substantial performance overhead.
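
If you want to rule that out quickly, here's a minimal sketch (assuming the VS2015 standard library; _ITERATOR_DEBUG_LEVEL has to be set consistently for every translation unit and before any standard header is included):


// 0 = no checks, 1 = basic checks, 2 = full debug checks (the Debug-build default).
// Usually set as a project-wide preprocessor definition rather than in source.
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>
#include <cstddef>

int main()
{
    std::vector<std::size_t> v(1000000, 1);
    std::size_t sum = 0;
    for (std::size_t i = 0, end = v.size(); i != end; ++i)
        sum += v[i];        // operator[] is not range-checked at level 0
    return static_cast<int>(sum & 0xff);
}

Keep in mind that a Release build already defaults to level 0, so if you're timing a Release build this particular setting isn't your 10x.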

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Visual C++ tends to err on the side of code safety, and has a habit of assuming everything is volatile. It can be persuaded to optimise your code (and can do a very good job of it), but it can be very picky as to when it applies the optimisations. As ApochPiQ mentions, make sure you disable the security checks in the compiler settings, but without seeing any of the code, it's kinda hard to say much more than that.

Maybe the Linux version discovered that it can optimize intermediate steps of a loop out because those steps aren't used?


It's unlikely. VC++ is pretty good at stripping unused code paths. VC++ tends to assume that variables are volatile, so accessing values through references (or this) can prevent vectorisation in tight loops (and a few other fun situations like that). Clang and GCC are a little bit more aggressive in this respect (occasionally too aggressive!). Fire a profiler over the code, look at the disassembly, and tackle those hot spots one at a time. If you see lots of instructions ending in 'ss' or 'sd' (instead of 'ps' or 'pd') then the vectorisation has failed.
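
If you don't have a profiler handy, you can also just make MSVC dump the assembly it generated. A rough sketch (the file and function names are made up for illustration; /FAs writes a .asm listing next to the object file):


// Compile with:  cl /O2 /fp:fast /FAs sum.cpp
// then look in sum.asm: addps/mulps means packed (vectorised),
// addss/mulss means scalar (vectorisation failed).
#include <cstddef>

float sum_array(const float* data, std::size_t n)
{
    float total = 0.0f;
    for (std::size_t i = 0; i != n; ++i)
        total += data[i];   // candidate for vectorisation under /fp:fast
    return total;
}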

VC++ tends to assume that variables are volatile

Volatile isn't quite the right word. MSVC tends not to be too aggressive when making assumptions about aliasing. GCC defaults to a very strict interpretation of the aliasing rules (which also relies on programmers being aware of the aliasing rules and using a lot of care during dodgy type reinterpretation tricks... whereas MSVC lets you write incorrect code and maybe have it work ok sometimes).

You can help out MSVC by making use of __restrict, to give manual promises/hints about where aliasing situations cannot arise (this will also help out GCC as long as you use a macro, so that it changes to __restrict__ on GCC).
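
Something along these lines, as a sketch (the macro and function names are made up; both spellings are vendor extensions, not standard C++):


#include <cstddef>

#if defined(_MSC_VER)
    #define MY_RESTRICT __restrict
#else
    #define MY_RESTRICT __restrict__
#endif

// Promises the compiler that dst and src never overlap, so it can keep
// values in registers and vectorise the loop instead of reloading memory.
void scale(float* MY_RESTRICT dst, const float* MY_RESTRICT src, std::size_t n, float k)
{
    for (std::size_t i = 0; i != n; ++i)
        dst[i] = src[i] * k;
}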

Alternatively, you can simply avoid accessing values indirectly (i.e. via pointers, which includes member variables and their implicit this pointer) within loops. That should always be the first step before reaching for restrict.
e.g. this is terrible code that should fail a code review:


//members: size_t m_sum; vector<size_t> m_vec;
m_sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
  m_sum += m_vec[i];

A good compiler is forced to generate terrible asm given ^that^ code. Even GCC with its strict aliasing can't fix the mistakes in it.

This is the fixed version that is giving the correct hints to the compiler so that it can produce good code:


size_t sum = 0;
for( size_t i=0, end=m_vec.size(); i != end; ++i )
  sum += m_vec[i];
m_sum = sum;
You can help out MSVC by making use of __restrict, to give manual promises/hints about where aliasing situations cannot arise (this will also help out GCC as long as you use a macro, so that it changes to __restrict__ on GCC).

That isn't even necessary; __restrict works just fine on GCC.

What doesn't work is using restrict (without underscores) as per C99. For a long time I considered that somewhat unlucky, because GCC allows a lot of C99 stuff as GNU extensions in C++ that isn't very useful, yet this one, which would be quite nice, isn't supported. Then again, you can use the exact same spelling on either compiler with __restrict, which is actually preferable to having a no-underscore version (if you ever intend to make code portable between MSVC and GCC). So I don't consider it "unlucky" any more; it's actually a good decision.


e.g. this is terrible code that should fail a code review:


//members: size_t m_sum; vector<size_t> m_vec;
m_sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
  m_sum += m_vec[i];

A good compiler is forced to generate terrible asm given ^that^ code. Even GCC with its strict aliasing can't fix the mistakes in it.

This is the fixed version that is giving the correct hints to the compiler so that it can produce good code:


size_t sum = 0;
for( size_t i=0, end=m_vec.size(); i != end; ++i )
  sum += m_vec[i];
m_sum = sum;

Could you elaborate on the example you posted? Why is the first version bad?

Is it that, because 'm_sum'/'m_vec' are accessed through a pointer ('this'?), the loop has to load the members from memory on each iteration? Why can't it just load the initial value of 'm_sum' into a CPU register, add to it during the loop, and then store the value?

Is it that the end condition is recalculated every iteration?

Why can't the compiler just assume the member variables won't be altered from outside the function? And if they have to be, force the programmer to make explicit use of the volatile keyword?

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.
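
The member access is the other half of it. As a sketch of the underlying aliasing issue (standalone functions, made up for illustration): when the accumulator lives in memory that the compiler can't prove is separate from the data, every store may invalidate what was just read, so it has to reload and re-store on each iteration; a local accumulator stays in a register until the final store.


#include <cstddef>

// The compiler must assume *out might overlap data, so it re-stores *out
// (and may re-read data) on every iteration - no register accumulation.
void sum_aliased(std::size_t* out, const std::size_t* data, std::size_t n)
{
    *out = 0;
    for (std::size_t i = 0; i != n; ++i)
        *out += data[i];
}

// The local accumulator can live in a register; there is a single store at the end.
void sum_local(std::size_t* out, const std::size_t* data, std::size_t n)
{
    std::size_t sum = 0;
    for (std::size_t i = 0; i != n; ++i)
        sum += data[i];
    *out = sum;
}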

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.

The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.


I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.

It was inlined, but calling size() every iteration prevented the compiler from vectorizing the loop (i.e. using SIMD instructions). Precalculating the size lets the optimizer figure out that it can vectorize the loop. Using a local variable to store intermediate results also improves the chances.
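
You can also ask the compilers to report this directly. A rough sketch (flags as I remember them, so double-check against your compiler version; the file and function names are made up):


// Hypothetical check - compile with:
//   cl /O2 /fp:fast /Qvec-report:2 integrate.cpp       (MSVC: lists vectorised loops
//                                                        and reason codes for failures)
//   g++ -O3 -ffast-math -fopt-info-vec integrate.cpp   (GCC 4.9+: reports vectorised loops)
#include <cstddef>

double integrate(const double* f, std::size_t n, double dx)
{
    double acc = 0.0;                 // local accumulator, nothing to alias
    for (std::size_t i = 0; i != n; ++i)
        acc += f[i] * dx;             // should show up as vectorised in the reports
    return acc;
}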

