Splinter of Chaos

GCC v VC++ and C++ v inline assembly.

I've been taking an assembly course for a couple of months now, but worthless as it is, using assembly isn't a priority, just learning how to. So, I finally got some code out and tested my assembly against the assembly GCC generated from my C++ code. No matter how many times I made the assembly version run, I could never time it because it was too fast. The GCC version, on the other hand, took ten seconds to run an insane number of iterations. Even after looking at GCC's assembly, I couldn't figure out why it was at all slower. I tried this on M$' compiler. Even though they had the home field advantage it took them twenty-some seconds, and the inline assembly took 13-some. Here's the source:
#include <iostream>
#include <ctime>
#include <string>
#include <cstdlib> // for system()
using namespace std;

#define USING_C

#if defined(USING_C)
    string codeLang = "C++";
#else
    string codeLang = "ASM";
#endif

#if defined(_MSC_VER)
    string compiler = "M$";
#else
    string compiler = "GCC";
#endif

unsigned int fact( unsigned int x )
{
    #if defined( USING_C )
        int sum = 1;
        for(; x; --x )
        {
            sum *= x;
        }

        return sum;
    #else // if using ASM
        #if defined(__GNUC__)
            asm
            (
                 // Exit quick if input is 2 or less (x is returned unchanged).
                "   cmp  $2, %%ebx     \n"
                "   jle   2f           \n"

                "   movl $1,    %%eax  \n" // eax is the sum.
                "1: imul %%ebx, %%eax  \n"
                "   decl %%ebx         \n"
                "   jnz  1b            \n"
                "   movl %%eax, %%ebx  \n"
                "2:                    \n"
                : "=b" ( x )
                : "b"  ( x )
                : "%eax"
            );
        #elif defined(_MSC_VER)
            __asm
            {
                mov EBX, x

                // Exit quick if input is zero.
                xor EAX, EAX
                cmp EAX, EBX
                je  A_End

                // The factorial loop:
                mov  EAX, 1 // EAX is the sum.
        A_Loop: imul EBX
                dec  EBX
                jnz  A_Loop

                mov x, EAX

         A_End:
            }
        #endif // M$ compiler
    #endif // ASM code

    return x;
}

int main()
{
    #if defined(_MSC_VER)
        clock_t timer = clock();
    #endif

    const int MAX = 99999;

    unsigned int accum = MAX;
    while( --accum )
    {
        fact( accum );
    }

    cout << "Time on " << compiler << " using " << codeLang << ": ";

    #if defined(_MSC_VER)
        // My non-M$ compiler would normally do this for me.
        // M$V$ won't, or at least isn't.
        cout << float(clock() - timer) * 0.001f << " secs" << endl;
        system("pause");
    #endif

    /*  |_____|___M$___|__GCC__|
     *  |C++__|_22.157_|_8.406_|
     *  |ASM__|_13.900_|_XXXXX_| X = Too fast to time.
     */
}
Can anyone explain to me, or help me figure out, why my asm code is so fast in GCC, or why GCC can't produce similar code? And why is M$ so slow, even when I give it similar asm? (Or maybe I could optimize better?) I think my example might be too simple, but that only alarms me more! Looking at M$' disassembly, it wasn't smart enough to realize that neither x nor sum needs to be on the stack and that both could be register variables. It seems GCC was smart enough to see this, but I don't see why it's any slower than my code. The MS prologue and epilogue also seem quite large compared to GCC's: it pushes the source and destination index registers, which this function does not use or modify, and it also pushes ebx. It then has to pop them all from the stack at the end. Is this the culprit? Thanks in advance.

When profiling/timing these speeds, are you compiling/running in Release? If not, please do so as profiling in debug is useless since the compiler does not optimize and in some cases may add bloat to your executable. Interesting stuff though dude. :)

What on earth is an "M$"? Monopoly money?

EDIT: As for the actual question, why don't you look at the disassembly of the C++ code and see what's being generated?

Also did you look at the generated code for the gcc assembly version? It's possible the optimizer optimized the whole shebang out altogether.

I've found in tests like these, it helps immensely to have something like
cout << "Final: " << accum;
or something, so that the compiler can't be like "oh, well we calculate it but never use it so *poof*". Just a thought. Other thoughts include differing calling conventions etc.: since you didn't inline "fact", Microsoft's compiler is most likely calling it with its default convention (cdecl, unless you changed it) and thus will always push the arguments onto the stack... and also what shwasasin said about making sure they are both release builds.

Cheers
-Scott

I just ran your code in VC++ 2008:

Time on M$ using C++: 0 secs
Press any key to continue . . .


Time on M$ using ASM: 12.656 secs
Press any key to continue . . .


And on GCC:

Time on GCC using C++: 0 secs


Time on GCC using ASM: 0 secs


Have you ever considered perhaps enabling optimizations?

The "problem" is that you perform a bunch of computations which you don't actually use for anything. So any intelligent compiler will say "oh, that's a waste of time, let's skip those computations".

That's what VC++ does with C/C++ code (which it understands well, and can analyze), but apparently isn't able to do with the __asm block, which is opaque to its optimizer.

GCC, on the other hand, doesn't need to understand the assembly at all: the extended asm statement declares its inputs and outputs, and since the output is never used and the statement isn't marked volatile, GCC is free to drop the whole thing. So it ends up performing the same optimization there.

If you want *real* results, you need to 1) use the result of the computation for something, so that the compiler doesn't skip it, and 2) enable optimizations in your compiler.
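For what it's worth, here is a minimal sketch of point 1). The helper names checksum_of_factorials and seconds_taken are made up for illustration and aren't from the original post, and the timing uses std::chrono::steady_clock instead of clock(). It assumes the caller does something observable with the returned checksum (printing it, say) and builds with optimizations on.

```cpp
#include <chrono>

// Same algorithm as the C++ branch of fact() in the original post.
unsigned int fact(unsigned int x)
{
    unsigned int product = 1;
    for (; x; --x)
        product *= x; // unsigned overflow wraps around, which is well defined
    return product;
}

// Hypothetical helper: folds fact(1) .. fact(max - 1) into one value, so the
// result of every call feeds into something the caller can use.
unsigned int checksum_of_factorials(unsigned int max)
{
    unsigned int sum = 0;
    for (unsigned int n = 1; n < max; ++n)
        sum += fact(n);
    return sum;
}

// Hypothetical helper: runs work() once and returns the elapsed wall time
// in seconds, measured with a monotonic clock.
template <typename F>
double seconds_taken(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A caller would wrap the work in a lambda that stores checksum_of_factorials(99999) into a local, pass it to seconds_taken, and then print both the checksum and the time. If the checksum is never printed or otherwise used, the optimizer is again free to throw the loop away.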

Benchmarking and interpretation of results is a difficult topic. That is why the internet is littered with misinformation.

The main reason for the difference is flawed methodology, and almost certainly an invalid approach to testing.

Using corrected benchmarking code to prevent dead-code elimination:
    const int MAX = 99999;

    int foo = 0; // <--
    unsigned int accum = MAX;
    while( --accum )
    {
        foo += fact( accum ); // <--
    }

    std::cout << foo; // <--
    cout << "Time on " << compiler << " using " << codeLang << ": ";


MSVC 2008,
Quote:
/Ox /Ob2 /Oi /GL /I "XXXX\src" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_SECURE_SCL=0" /D "_CRT_SECURE_NO_WARNINGS" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /GS- /Gy /arch:SSE2 /Fx /Fo"XXXX\\" /Fd"XXXX\vc90.pdb" /W4 /nologo /c /Zi /TP /wd4355 /errorReport:prompt


#define USING_C
Quote:
-125961703Time on M$ using C++: 7.359 secs
Press any key to continue . . .


	int foo = 0;
00401014 xor esi,esi
unsigned int accum = MAX;
while( --accum )
00401016 mov edx,1869Eh
0040101B jmp main+20h (401020h)
0040101D lea ecx,[ecx]
{
foo += fact( accum );
00401020 mov ecx,edx
00401022 mov eax,1
00401027 test edx,edx
00401029 je main+38h (401038h)
0040102B jmp main+30h (401030h)
0040102D lea ecx,[ecx]
00401030 imul eax,ecx
00401033 sub ecx,1
00401036 jne main+30h (401030h)
00401038 add esi,eax
0040103A sub edx,1
0040103D jne main+20h (401020h)
}




Without #define USING_C
Quote:
-125961703Time on M$ using ASM: 7.453 secs
Press any key to continue . . .



	int foo = 0;
00401014 xor ecx,ecx
unsigned int accum = MAX;
while( --accum )
00401016 mov esi,1869Eh
0040101B jmp main+20h (401020h)
0040101D lea ecx,[ecx]
{
foo += fact( accum );
00401020 mov dword ptr [ebp-4],esi
00401023 mov ebx,dword ptr [ebp-4]
00401026 xor eax,eax
00401028 cmp eax,ebx
0040102A je main+39h (401039h)
0040102C mov eax,1
00401031 imul ebx
00401033 dec ebx
00401034 jne main+31h (401031h)
00401036 mov dword ptr [ebp-4],eax
00401039 add ecx,dword ptr [ebp-4]
0040103C sub esi,1
0040103F jne main+20h (401020h)






Using these results I can also conclude that your gcc benchmark is invalid, but I'm not doing the gcc benchmarks as well.

Quote:
Original post by Splinter of Chaos
I tried this on M$' compiler.
Stop doing that. This is a place for intelligent discussion, not bullshit trolling.

(Just to clarify, it's awfully tempting to delete any further posts that do that.)

Quote:
Original post by Antheus
Using these results I can also conclude that your gcc benchmark is invalid, but I'm not doing the gcc benchmarks as well.
I'm not too sure. In all likelihood that's the problem, but I wouldn't discount the possibility of GCC simply being clever.
There are more efficient algorithms for doing this for large numbers of iterations after all, e.g. like this:
unsigned exp(unsigned k, unsigned n) {
unsigned v;
for(v = 1; n; n >>= 1) {
if(n & 1) v *= k;
k *= k;
}
return v;
}
I'm not suggesting that any compiler is anywhere near clever enough to do this (unless they've included a pattern precisely for this case in order to speed up benchmarks), but any half-decent compiler will split common factors out of an expression like "k * k * k * k" in its sleep (though to be honest a great many compilers are less than half-decent). Now add 4x loop unrolling to the mix and that's basically what you've got in the inner loop.

Granted, I haven't actually been able to make GCC do this, but my version is getting a bit old. Or perhaps the whole thing is simply being evaluated at compile time, though I would've expected the compiler to hit some sort of limit long before reaching the end of this loop.
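To make the comparison concrete, here is a small sketch that puts the square-and-multiply routine above next to the naive loop it replaces. The names pow_naive and pow_by_squaring are mine (the function above is called exp, which would clash with std::exp as soon as <cmath> is involved), and nothing here claims that any particular compiler performs this rewrite on its own.

```cpp
// Naive integer power: n multiplications.
unsigned int pow_naive(unsigned int k, unsigned int n)
{
    unsigned int v = 1;
    while (n--)
        v *= k;
    return v;
}

// Square-and-multiply: roughly log2(n) iterations. Same logic as the routine
// quoted above; only the name differs.
unsigned int pow_by_squaring(unsigned int k, unsigned int n)
{
    unsigned int v = 1;
    for (; n; n >>= 1) {
        if (n & 1)
            v *= k; // this bit of the exponent is set, fold k^(2^i) in
        k *= k;     // advance to the next power of two
    }
    return v;
}
```

Both versions agree for every input, including the cases where the result wraps around, because unsigned multiplication is arithmetic modulo 2^32 either way (assuming a 32-bit unsigned int).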

Quote:
Original post by Promit
Quote:
Original post by Splinter of Chaos
I tried this on M$' compiler.
Stop doing that. This is a place for intelligent discussion, not bullshit trolling.


Sorry, I just got in the habit through interacting with other internet communities. Different communities have different conventions. Even if you find it annoying, some communities feel the opposite way. And, without having said anything bad about MS (I had to think about it), I don't see how trolling applies here.

Quote:
Original post by Spoonbender
Have you ever considered perhaps enabling optimizations?


I generally work with all optimizations on.

Quote:

The "problem" is that you perform a bunch of computations which you don't actually use for anything. So any intelligent compiler will say "oh, that's a waste of time, let's skip those computations".


That was it. I changed my code to accumulate a variable based on the outcome. This also let me verify that each separate build computes the same result.

I found out that eight seconds is about how much time you can expect this function to take when called 99999 times. It's just what it does. I wonder if I can find a better test to pit compilers against each other and C++ vs assembly.

Thanks for all the replies and helping me out with this!

EDIT:
Quote:
Original post by Evil Steve
why don't you look at the disassembly of the C++ code and see what's being generated?


I did. It's not all that easy for me to read yet, as I've only just begun. It seemed to me what I saw was the same, but now I know about the code skipping, which I never saw in the disassembly. This is why I didn't know why the functions took different amounts of time to execute.

Quote:
Original post by Splinter of Chaos
Quote:
Original post by Promit
Quote:
Original post by Splinter of Chaos
I tried this on M$' compiler.
Stop doing that. This is a place for intelligent discussion, not bullshit trolling.


Sorry, I just got in the habit through interacting with other internet communities. Different communities have different conventions. Even if you find it annoying, some communities feel the opposite way. And, without having said anything bad about MS (I had to think about it), I don't see how trolling applies here.


"M$" is fairly synonymous -- at least in terms of how it's read around these parts - with "lol I made a joke about an evil monopoly lol money lolololol". That's how low our tolerance for the term has gotten, thanks to the users of that term (as a generalization). If nothing remotely resembling that was running through your mind when you made your posting, then great!

But you'll have to forgive Promit for (probably) hating some of those other communities you came from and... ah... encouraging you to break less savory habits picked up from there. You can forgive me too if you want, for egging him on in IRC -- but I'm just an asshole, so that one's entirely up to you [lol].

Quote:
Original post by Splinter of Chaos
I generally work with all optimizations on.

Hmm, apparently not. In my test above, I compiled a standard release build under Visual Studio, and the GCC version was compiled with -O3 and nothing else. In both cases, it gave me 0 seconds. If you didn't get that value, it would seem like you compiled without optimizations. (Or you used an ancient version of both compilers)

Quote:
Original post by Hodgman
Quote:
Original post by Promit
(Just to clarify, it's awfully tempting to delete any further posts that do that.)
While you're at it, can we delete all of those annoying threads where people pronounce it Lin-ux instead of Li-nux?


I pronounce those two exactly the same. What's the difference?

Oberon_Command, I believe he was pointing out that spelling something differently does not warrant post deletion, by showing how silly the issue is.

Spoonbender, as we're learning today, if you get zero seconds, it's because your compiler skipped it altogether. Try making the function add to a sum the whole way. I didn't find printing the sum made a hell of a difference, but I do it, anyway.

I wonder why the compilers say "Hey, this line does nothing! I should disrespect the programmer by skipping it." I'd rather have a warning. I put inefficiencies into code for a reason, and I expect optimizing compilers to optimize what is obviously not intentional. (And this is so obviously intentional.)

Quote:
Original post by Splinter of Chaos
Spoonbender, as we're learning today, if you get zero seconds, it's because your compiler skipped it altogether. Try making the function add to a sum the whole way. I didn't find printing the sum made a hell of a difference, but I do it, anyway.
I believe he was pointing out that your code can't possibly have been run with optimisations on.

Quote:
Original post by Splinter of Chaos
I wonder why the compilers say "Hey, this line does nothing! I should disrespect the programmer by skipping it." I'd rather have a warning. I put inefficiencies into code for a reason, and I expect optimizing compilers to optimize what is obviously not intentional. (And this is so obviously intentional.)
The compiler's job is to catch things like that without you having to tell the compiler it's OK. When you turn optimisations on, you're saying to the compiler "I want this code to run as fast as possible". The fastest code is the code that never executes, which is why the compiler will obliterate code that doesn't do anything. It's not disrespecting you any more than re-ordering lines of code to get better register and cache usage (which most compilers will also do).
Throwing up a warning would be one option I suppose, but the compiler is just doing its job; telling you about it would quickly become annoying.

Quote:

I wonder why the compilers say "Hey, this line does nothing! I should disrespect the programmer by skipping it." I'd rather have a warning. I put inefficiencies into code for a reason, and I expect optimizing compilers to optimize what is obviously not intentional. (And this is so obviously intentional.)


What is so obviously intentional about it?
If you want to delay the program, you call Sleep() or similar. Performing millions of arithmetic operations and then discarding the result does not signal an intent to delay the program.
It might just as well indicate 1) that the programmer left in some extra code to aid readability, 2) that the programmer just copy/pasted the code from a larger sample which *does* use the result, 3) that the programmer hadn't thought the code properly through or 4), it could be the output from some code generator program which simply glued together different snippets of code, and relied on the compiler to make it go fast. In all these cases, the correct course of action for the compiler is to eliminate the code. We don't need it.

All the compiler sees here is that you perform a lot of work which isn't strictly necessary.

Anyway, it's explicitly allowed in the language standard.
It specifically mentions the "as if" rule. The compiler just has to generate something that behaves *as if* the language spec had been followed. Performing a bunch of computations that you never use yields exactly the same results as if you'd never performed these computations at all, so it's legal to remove them.
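A tiny sketch of what that rule means in practice. Both function names are invented for the example, and whether a given compiler actually removes the loop depends on the compiler and its settings; the point is only that it is allowed to.

```cpp
// Does a pile of arithmetic and then ignores it. Nothing outside this
// function can ever observe the value of scratch.
unsigned int answer_with_dead_work()
{
    unsigned int scratch = 1;
    for (unsigned int i = 1; i < 1000; ++i)
        scratch *= i; // computed, but never read afterwards
    return 4;
}

// What an optimizer is permitted to turn the function above into.
unsigned int answer_without_dead_work()
{
    return 4;
}
```

No caller can tell the two apart by looking at return values or output. The only difference is how long they take to run, and running time is not part of the observable behaviour the standard talks about.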

*Every* optimization the compiler can do involves changing the program somehow. Should it warn about everything? Reordering computations, inlining functions, constant propagation? A modern C++ compiler performs hundreds of program transformations to optimize the program. Every single one of them will speed the program up by deviating from the exact code you wrote. Every single one of them will skew the results of your clock() calls, because less code will be executed before it. After all, that's the entire point of an optimization. It makes things go faster than they would otherwise. By enabling optimizations, you say "I know this may speed up my program, and I'm ok with that".
And that is precisely what the compiler did. Why should the compiler warn you that it does exactly what you asked it to do?

Quote:
Original post by Splinter of Chaos
I wonder why the compilers say "Hey, this line does nothing! I should disrespect the programmer by skipping it." I'd rather have a warning. I put inefficiencies into code for a reason, and I expect optimizing compilers to optimize what is obviously not intentional. (And this is so obviously intentional.)
If you put them in your code on purpose, then drop in flags to tell the compiler not to optimize. In a general case though, there is no such thing as a compiler optimization pass that does not change something. Also GCC has a -O0 for people who want what you [for some reason] want. So perhaps there is a bigger question here, which is why do you want it?

Also, define 'obviously intentional' in an absolute way that us compiler writers can use in our work. Compilers cannot read minds... And if you want a warning every time a compiler is going to change something, even primitive compilers would bury you in warnings on compilation of all but the simplest programs.

Also, compilers cannot disrespect you either...
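If the goal is to keep one specific computation alive without turning the optimizer off for the whole program, one common trick is to store the result into a volatile object. The sketch below assumes that approach; g_sink and run_and_keep are hypothetical names, not anything from the thread. Note the limits: the compiler has to perform every volatile store, but it remains free to compute the stored values however it likes.

```cpp
// Writes to a volatile object count as observable behaviour, so the
// compiler may not drop them.
volatile unsigned int g_sink = 0;

// Same algorithm as the C++ branch of fact() in the original post.
unsigned int fact(unsigned int x)
{
    unsigned int product = 1;
    for (; x; --x)
        product *= x;
    return product;
}

// Hypothetical helper: calls fact() for 1 .. max - 1 and stores each result,
// so none of the calls can be discarded as unused.
void run_and_keep(unsigned int max)
{
    for (unsigned int n = 1; n < max; ++n)
        g_sink = fact(n); // volatile store, must actually happen
}
```

This keeps the rest of the program fully optimized, which is usually closer to what you want to measure than a -O0 build.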

Quote:
Original post by Splinter of Chaos
I wonder why the compilers say "Hey, this line does nothing! I should disrespect the programmer by skipping it." I'd rather have a warning. I put inefficiencies into code for a reason, and I expect optimizing compilers to optimize what is obviously not intentional. (And this is so obviously intentional.)


It's not even remotely that easy.

The source code as you see it gets transformed into either straight assembly or some intermediate form (be it AST, bytecode, native assembly, or something else). That form loses all syntactic sugar and is just a matter of performing trivial operations on set of registers and memory locations.

An optimizing compiler then looks at that and searches for patterns. One of them is idempotence.

The unique thing about C++ compiler is the degree to which it may perform such optimizations. The easiest way to see this is to examine boost. Between all the templates and specialization, any trivial use of boost classes results in potentially millions of conceptual classes and functions. Yet, after basic optimization pass, compiler is capable of trivially collapsing that into a few constant expressions.

And this type of transformation doesn't even include operation re-ordering, compile-time evaluation, (N)RVO and similar.


But pray tell: why benchmark performance of assembly vs. a compiler, then complain optimizing compiler is capable of optimization?

This is why the claim of compilers getting better than humans has taken hold. Too many people look at code line by line and optimize it, when they should be looking at big picture.

Ironically, the task ill-suited for machines seems to often cause more problems to humans than a machine.

Side note: the 'M$' in this whole context is incredibly ironic.

Quote:
Original post by Splinter of Chaos
I wonder why the compilers say "Hey, this line does nothing! I should disrespect the programmer by skipping it." I'd rather have a warning. I put inefficiencies into code for a reason, and I expect optimizing compilers to optimize what is obviously not intentional. (And this is so obviously intentional.)


Compilers have no concept of "obvious". They don't have intuition. The closest they can get to intuition is this fact: In normal programming, you never intentionally add inefficiencies (except as a necessary cost of other benefits, and even then, it's viewed as just that -- a cost, not a boon). In other words, your compiler thinks it's obviously unintentional -- and with good reason (especially given that you can turn optimizers off).

Even when profiling in normal programming, you're largely interested in where you can optimize what the compiler is unable to -- so you want the compiler to still be optimizing as much as it can.

Besides, benchmarks are largely useless. Given how wildly the performance of code can vary, depending on the context in which it's run, the only way to get a real, accurate representation of the performance of some code is to profile it in the context of a program using it.

If you're going to do benchmarks regardless, well, you're learning one of those things you just have to keep in mind: Without an optimizer your results are worthless (unrealistically slow), and with an optimizer you must take special care to, again, avoid your results being worthless (unrealistically fast).

Profile instead ;-)

*sheesh!* You make one statement on the internet and everyone jumps on you.

Quote:
Original post by Antheus
But pray tell: why benchmark performance of assembly vs. a compiler, then complain optimizing compiler is capable of optimization?


Heh, good point. My perspective was more about how technology is intuitive and works as expected. That is, optimizing everything that doesn't change the meaning of the code. But, working this low, it's difficult to know what the meaning was and we don't have such nice luxury.

Like those rare aliasing issues that happen, especially on -O3.

Quote:
Original post by Splinter of Chaos
*sheesh!* You make one statement on the internet and everyone jumps on you.

Quote:
Original post by Antheus
But pray tell: why benchmark performance of assembly vs. a compiler, then complain optimizing compiler is capable of optimization?


Heh, good point. My perspective was more about how technology is intuitive and works as expected. That is, optimizing everything that doesn't change the meaning of the code. But, working this low, it's difficult to know what the meaning was and we don't have such nice luxury.

Like those rare aliasing issues that happen, especially on -O3.


I don't follow. Take this:

struct E
{
int i, j, k, l, m;
};

E f()
{
E e;
e.i = 4;
return e;
}

int main()
{
return f().i;
}



Guess what MSVC compiles this into with optimisations?


mov eax, 4
ret 0



Yes, that's the whole program*. It hasn't changed the "meaning of the code" - your program returns 4. Indeed, any optimiser that changes the "meaning of the code" has a bug :P


*: y'know without a lot of the preamble/postamble

Meaning has context. The meaning of my program was to benchmark the efficiency of inline assembly vs how my compilers did it. Of course, my compilers had no way of knowing that was the meaning.

It did not, however, change the outcome.

But, I suppose what the actual meaning of my program was is a matter of opinion. I bet there is more than one person who will argue there's no difference between outcome and meaning, or that I'm confusing the two. I won't argue if anyone decides to say that. But, I believe meaning has context and if you claim it doesn't, you're using the wrong word. (Computers, of course, do not think in context.)

Quote:
Original post by Splinter of Chaos
But, I suppose what the actual meaning of my program was is a matter of opinion. I bet there is more than one person who will argue there's no difference between outcome and meaning, or that I'm confusing the two. I won't argue if anyone decides to say that. But, I believe meaning has context and if you claim it doesn't, you're using the wrong word. (Computers, of course, do not think in context.)
Yeah, you're using the word "meaning" as part of the English language, but others are using it as a technical jargon word. This is where the confusion comes from.

In this technical sense, the "meaning" of a program is well defined; it's basically the visible outputs that the C++ standard guarantees your code will produce.

So optimisations changed your personal idea of what the meaning was, but C++'s idea of the meaning was unchanged.
