Splinter of Chaos

GCC v VC++ and C++ v inline assembly.


I've been taking an assembly course for a couple of months now, but, worthless as it is, using assembly isn't the priority there, just learning how to. So I finally got some code written and tested my assembly against the assembly GCC generates from my C++ code. No matter how many iterations I made the assembly version run, I could never time it because it was too fast. The GCC version, on the other hand, took about ten seconds to run an insane number of iterations. Even after looking at GCC's assembly, I couldn't figure out why it was any slower. I then tried this on M$' compiler. Even though they had the home field advantage, the C++ took twenty-some seconds and the inline assembly took 13-some. Here's the source:
#include <iostream>
#include <cstdlib> // for system()
#include <ctime>
#include <string>
using namespace std;

#define USING_C

#if defined(USING_C)
    string codeLang = "C++";
#else
    string codeLang = "ASM";
#endif

#if defined(_MSC_VER)
    string compiler = "M$";
#else
    string compiler = "GCC";
#endif

unsigned int fact( unsigned int x )
{
    #if defined( USING_C )
        unsigned int sum = 1; // unsigned like the return type, so overflow wraps instead of being undefined
        for(; x; --x )
        {
            sum *= x;
        }

        return sum;
    #else // if using ASM
        #if defined(__GNUC__)
            asm
            (
                 // Exit quick if input is 2 or less; x is returned unchanged.
                "   cmp  $2, %%ebx     \n"
                "   jle   2f           \n"

                "   movl $1,    %%eax  \n" // eax is the sum.
                "1: imul %%ebx, %%eax  \n"
                "   decl %%ebx         \n"
                "   jnz  1b            \n"
                "   movl %%eax, %%ebx  \n"
                "2:                    \n"
                : "=b" ( x )
                : "b"  ( x )
                : "%eax"
            );
        #elif defined(_MSC_VER)
            __asm
            {
                mov EBX, x

                // Exit quick if input is zero.
                xor EAX, EAX
                cmp EAX, EBX
                je  A_End

                // The factorial loop:
                mov  EAX, 1 // EAX is the sum.
        A_Loop: imul EBX
                dec  EBX
                jnz  A_Loop

                mov x, EAX

         A_End:
            }
        #endif // M$ compiler
    #endif // ASM code

    return x;
}

int main()
{
    #if defined(_MSC_VER)
        clock_t timer = clock();
    #endif

    const int MAX = 99999;

    unsigned int accum = MAX;
    while( --accum )
    {
        fact( accum );
    }

    cout << "Time on " << compiler << " using " << codeLang << ": ";

    #if defined(_MSC_VER)
        // My non-M$ compiler would normally do this for me.
        // M$V$ won't, or at least isn't.
        cout << float(clock() - timer) / CLOCKS_PER_SEC << " secs" << endl;
        system("pause");
    #endif

    /*  |_____|___M$___|__GCC__|
     *  |C++__|_22.157_|_8.406_|
     *  |ASM__|_13.900_|_XXXXX_| X = Too fast to time.
     */
}
Can anyone explain to me, or help me figure out, why my asm code is so fast in GCC, or why GCC can't produce similar code? How about why M$ is so slow, even when I give it similar asm? (Or maybe I could optimize better?) I think my example might be too simple, but that only alarms me more! Looking at M$' disassembly, it wasn't smart enough to know that neither x nor sum needs to be on the stack and that both could be register variables. It seems GCC was smart enough to see this, but I don't see why its code is any slower than mine. The MS prologue and epilogue also seem quite large compared to GCC's: it pushes the source and destination index registers, which this function does not use or modify, and it pushes ebx as well. It then has to pop them all from the stack at the end. Is this the culprit? Thanks in advance.

When profiling/timing these speeds, are you compiling/running in Release? If not, please do so as profiling in debug is useless since the compiler does not optimize and in some cases may add bloat to your executable. Interesting stuff though dude. :)

What on earth is an "M$"? Monopoly money?

EDIT: As for the actual question, why don't you look at the disassembly of the C++ code and see what's being generated?

Also did you look at the generated code for the gcc assembly version? It's possible the optimizer optimized the whole shebang out altogether.

I've found in tests like these, it helps immensely to have something like
cout << "Final: " << result; // result being a value accumulated from what fact() returns
or something, so that the compiler can't be like "oh, well we calculate it but never use it so *poof*". Just a thought. Other thoughts include differing calling conventions etc.: since you didn't inline "fact", Microsoft's compiler is most likely calling it with a stack-based convention (cdecl by default, or stdcall) and thus will always push the arguments onto the stack. And also what shwasasin said about making sure they are both release builds.

Cheers
-Scott

I just ran your code in VC++ 2008:

Time on M$ using C++: 0 secs
Press any key to continue . . .


Time on M$ using ASM: 12.656 secs
Press any key to continue . . .


And on GCC:

Time on GCC using C++: 0 secs


Time on GCC using ASM: 0 secs


Have you ever considered perhaps enabling optimizations?

The "problem" is that you perform a bunch of computations which you don't actually use for anything. So any intelligent compiler will say "oh, that's a waste of time, let's skip those computations".

That's what VC++ does with C/C++ code (which it understands well, and can analyze), but apparently isn't able to do with ASM (which it doesn't understand well, and which is much harder for the compiler to reason about)

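Apart from that, we see that GCC is apparently able to skip the work in the ASM version too. It probably isn't optimizing the assembly itself: an asm statement that has outputs and isn't marked volatile can be dropped entirely when its outputs are never used.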

If you want *real* results, you need to 1) use the result of the computation for something, so that the compiler doesn't skip it, and 2) enable optimizations in your compiler.
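The first point can be sketched as follows. This is only a minimal illustration, not the original poster's code: `run_benchmark` and `checksum` are hypothetical names, and `std::chrono::steady_clock` is swapped in for `clock()` as an assumption so the timing works the same way on both compilers.

```cpp
#include <cassert>
#include <chrono>

// Same loop as the C++ path of fact() in the original post. Unsigned
// arithmetic wraps on overflow, so large inputs do not give real factorials.
unsigned int fact(unsigned int x)
{
    unsigned int product = 1;
    for (; x; --x)
        product *= x;
    return product;
}

// Hypothetical helper: calls fact() for max-1 down to 1, as the original
// main() does, but adds every result to a checksum that is returned.
// Because the caller can observe the checksum, the calls are not dead code.
unsigned int run_benchmark(unsigned int max, double& seconds)
{
    assert(max >= 1); // --accum on 0 would wrap around to a huge count

    using bench_clock = std::chrono::steady_clock;
    const auto start = bench_clock::now();

    unsigned int checksum = 0;
    for (unsigned int accum = max; --accum; )
        checksum += fact(accum);

    seconds = std::chrono::duration<double>(bench_clock::now() - start).count();
    return checksum;
}
```

A caller would print both the returned checksum and `seconds`. Printing the checksum is what keeps the compiler honest, and it doubles as a check that the C++ and ASM builds compute the same thing.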

Benchmarking and interpretation of results is a difficult topic. That is why the internet is littered with misinformation.

The main reason for the difference is flawed methodology, as well as an almost certainly invalid approach to testing.

Using corrected benchmarking code, so that the unused computation is not eliminated as dead code:
const int MAX = 99999;

int foo = 0; // <--
unsigned int accum = MAX;
while( --accum )
{
    foo += fact( accum ); // <--
}

std::cout << foo; // <--
cout << "Time on " << compiler << " using " << codeLang << ": ";


MSVC 2008,
Quote:
/Ox /Ob2 /Oi /GL /I "XXXX\src" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_SECURE_SCL=0" /D "_CRT_SECURE_NO_WARNINGS" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /GS- /Gy /arch:SSE2 /Fx /Fo"XXXX\\" /Fd"XXXX\vc90.pdb" /W4 /nologo /c /Zi /TP /wd4355 /errorReport:prompt


#define USING_C
Quote:
-125961703Time on M$ using C++: 7.359 secs
Press any key to continue . . .


	int foo = 0;
00401014 xor esi,esi
unsigned int accum = MAX;
while( --accum )
00401016 mov edx,1869Eh
0040101B jmp main+20h (401020h)
0040101D lea ecx,[ecx]
{
foo += fact( accum );
00401020 mov ecx,edx
00401022 mov eax,1
00401027 test edx,edx
00401029 je main+38h (401038h)
0040102B jmp main+30h (401030h)
0040102D lea ecx,[ecx]
00401030 imul eax,ecx
00401033 sub ecx,1
00401036 jne main+30h (401030h)
00401038 add esi,eax
0040103A sub edx,1
0040103D jne main+20h (401020h)
}




Without #define USING_C
Quote:
-125961703Time on M$ using ASM: 7.453 secs
Press any key to continue . . .



	int foo = 0;
00401014 xor ecx,ecx
unsigned int accum = MAX;
while( --accum )
00401016 mov esi,1869Eh
0040101B jmp main+20h (401020h)
0040101D lea ecx,[ecx]
{
foo += fact( accum );
00401020 mov dword ptr [ebp-4],esi
00401023 mov ebx,dword ptr [ebp-4]
00401026 xor eax,eax
00401028 cmp eax,ebx
0040102A je main+39h (401039h)
0040102C mov eax,1
00401031 imul ebx
00401033 dec ebx
00401034 jne main+31h (401031h)
00401036 mov dword ptr [ebp-4],eax
00401039 add ecx,dword ptr [ebp-4]
0040103C sub esi,1
0040103F jne main+20h (401020h)






Using these results I can also conclude that your gcc benchmark is invalid, but I'm not doing the gcc benchmarks as well.

Quote:
Original post by Splinter of Chaos
I tried this on M$' compiler.
Stop doing that. This is a place for intelligent discussion, not bullshit trolling.

(Just to clarify, it's awfully tempting to delete any further posts that do that.)

Quote:
Original post by Antheus
Using these results I can also conclude that your gcc benchmark is invalid, but I'm not doing the gcc benchmarks as well.
I'm not too sure. In all likelihood that's the problem, but I wouldn't discount the possibility of GCC simply being clever.
There are more efficient algorithms for doing this for large numbers of iterations after all, e.g. like this:
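// Exponentiation by squaring: computes k^n in O(log n) multiplications.
unsigned exp(unsigned k, unsigned n) {
    unsigned v;
    for(v = 1; n; n >>= 1) {
        if(n & 1) v *= k; // fold in the current power when this bit of n is set
        k *= k;           // k becomes k^2, k^4, k^8, ...
    }
    return v;
}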
I'm not suggesting that any compiler is anywhere near clever enough to do this (unless they've included a pattern precisely for this case in order to speed up benchmarks), but any half-decent compiler will split common factors out of an expression like "k * k * k * k" in its sleep (though to be honest a great many compilers are less than half-decent). Now add 4x loop unrolling to the mix and that's basically what you've got in the inner loop.

Granted, I haven't actually been able to make GCC do this, but my version is getting a bit old. Or perhaps the whole thing is simply being evaluated at compile-time, though I would've expected the compiler to hit some sort of limit long before reaching the end of this loop.
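To make the comparison concrete, here is a small sketch that puts the squaring version next to a plain repeated-multiplication loop. Note that both compute a power k^n, not a factorial, so this only illustrates the kind of rewrite being described. The names `pow_naive`, `pow_by_squaring` and `agree_up_to` are hypothetical; the squaring function is the one from the post, renamed here to avoid any clash with `std::exp`.

```cpp
#include <cassert>

// Plain version: n multiplications.
unsigned pow_naive(unsigned k, unsigned n)
{
    unsigned v = 1;
    for (; n; --n)
        v *= k;
    return v;
}

// Squaring version: one loop pass per bit of n.
unsigned pow_by_squaring(unsigned k, unsigned n)
{
    unsigned v = 1;
    for (; n; n >>= 1) {
        if (n & 1)
            v *= k; // this bit of n is set, so fold in the current power
        k *= k;     // advance to the next power: k^2, k^4, k^8, ...
    }
    return v;
}

// Hypothetical check: both versions agree for every exponent in [0, max_n].
// Unsigned overflow wraps identically in both, so equality holds even
// once the true result no longer fits.
bool agree_up_to(unsigned k, unsigned max_n)
{
    for (unsigned n = 0; n <= max_n; ++n)
        if (pow_naive(k, n) != pow_by_squaring(k, n))
            return false;
    return true;
}
```

For an exponent like 1000, the plain loop multiplies 1000 times while the squaring loop makes 10 passes, which is why an optimizer doing something similar could look "too fast to time".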

Quote:
Original post by Promit
Quote:
Original post by Splinter of Chaos
I tried this on M$' compiler.
Stop doing that. This is a place for intelligent discussion, not bullshit trolling.


Sorry, I just got in the habit through interacting with other internet communities. Different communities have different conventions. Even if you find it annoying, some communities feel the opposite way. And, without having said anything bad about MS (I had to think about it), I don't see how trolling applies here.

Quote:
Original post by Spoonbender
Have you ever considered perhaps enabling optimizations?


I generally work with all optimizations on.

Quote:

The "problem" is that you perform a bunch of computations which you don't actually use for anything. So any intelligent compiler will say "oh, that's a waste of time, let's skip those computations".


That was it. I changed my code to accumulate a variable based on the outcome. This also let me verify that each separate build computes the same result.

I found out that eight seconds is about how long you can expect this function to take when it is called 99999 times with these inputs. It's just what it does. I wonder if I can find a better test to pit compilers against each other and C++ against assembly.
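As a rough sanity check on that figure, the work involved can be counted directly: fact(a) runs its loop a times, and the benchmark calls it for every a from 99998 down to 1. The sketch below does the arithmetic; `total_multiplies` and `ns_per_multiply` are hypothetical helper names, and the 7.359 second figure is the MSVC C++ timing reported earlier in the thread, not a new measurement.

```cpp
#include <cassert>
#include <cstdint>

// Total inner-loop multiplications when fact(a) is called for
// a = max-1, max-2, ..., 1 and fact(a) loops a times:
// 1 + 2 + ... + (max - 1) = (max - 1) * max / 2
std::uint64_t total_multiplies(std::uint64_t max)
{
    assert(max >= 1);
    return (max - 1) * max / 2;
}

// Average time per multiplication, in nanoseconds, for a measured run.
double ns_per_multiply(double seconds, std::uint64_t max)
{
    return seconds * 1e9 / static_cast<double>(total_multiplies(max));
}
```

With MAX = 99999 this comes to 4,999,850,001 multiplications, so a run of 7.359 seconds works out to roughly 1.47 ns per loop iteration. That is in the range of a few clock cycles per dependent multiply, which suggests the loop itself, not call overhead, accounts for most of the time.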

Thanks for all the replies and helping me out with this!

EDIT:
Quote:
Original post by Evil Steve
why don't you look at the disassembly of the C++ code and see what's being generated?


I did. It's not all that easy for me to read yet, as I've only just begun. It seemed to me what I saw was the same, but now I know about the code skipping, which I never saw in the disassembly. This is why I didn't know why the functions took different time to execute.

Quote:
Original post by Splinter of Chaos
Quote:
Original post by Promit
Quote:
Original post by Splinter of Chaos
I tried this on M$' compiler.
Stop doing that. This is a place for intelligent discussion, not bullshit trolling.


Sorry, I just got in the habit through interacting with other internet communities. Different communities have different conventions. Even if you find it annoying, some communities feel the opposite way. And, without having said anything bad about MS (I had to think about it), I don't see how trolling applies here.


"M$" is fairly synonymous -- at least in terms of how it's read around these parts - with "lol I made a joke about an evil monopoly lol money lolololol". That's how low our tolerance for the term has gotten, thanks to the users of that term (as a generalization). If nothing remotely resembling that was running through your mind when you made your posting, then great!

But you'll have to forgive Promit for (probably) hating some of those other communities you came from and... ah... encouraging you to break less savory habits picked up from there. You can forgive me too if you want, for egging him on in IRC -- but I'm just an asshole, so that one's entirely up to you [lol].
