The VC++6 compiler left quite a bit to be desired as far as optimizations go. I have to use VC6 at work for some projects, and I usually work on them in VC2003, then just make sure they still work in VC6 after changes are made.
I've found that often, the VS2003 debug executables are about as fast as (and smaller than) the VC6 release (w/ max optimizations turned on) executables for some of the projects.
Slowdown!
1. Project->Settings
2. C/C++ Tab
3. Listing Files under category
4. Assembly-Only Listing/Assembly with Source Code for the Listing File Type
Then you can use the [source="asm"] code here [/source] tags to paste it in [wink]
Umm, just a note. Debug builds will usually avoid inlining functions. My entire engine is plagued with this problem. I get about 100fps in debug mode, and about 1500 in release. That's with zero rendering. It's because I contain almost every little action into methods. With optimizations on, you should notice zero difference between those two versions of code.
dear god.. ok, I think this is the assembly for the GetLength function in the bad loop.
[source = "asm"]
?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z PROC NEAR ; CGame::MoofGetLength, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.h
; Line 65
    sub esp, 8
; Line 69
    mov eax, DWORD PTR _Vector$[esp+4]
    mov ecx, DWORD PTR [eax+8]
    mov edx, DWORD PTR [eax+4]
    mov eax, DWORD PTR [eax]
    mov DWORD PTR -8+[esp+8], edx
    mov DWORD PTR 8+[esp+4], eax
    mov DWORD PTR -4+[esp+8], ecx
    fld DWORD PTR 8+[esp+4]
    fmul DWORD PTR 8+[esp+4]
    fld DWORD PTR -8+[esp+8]
    fmul DWORD PTR -8+[esp+8]
    faddp ST(1), ST(0)
    fld DWORD PTR -4+[esp+8]
    fmul DWORD PTR -4+[esp+8]
    faddp ST(1), ST(0)
    fld QWORD PTR __real@8@3ffe8000000000000000
    call __CIpow
; Line 73
    add esp, 8
    ret 4
?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z ENDP ; CGame::MoofGetLength
[/source]
and the function that the bad loop is in
[source = "asm"]
?Update@CGame@@QAEXXZ PROC NEAR ; CGame::Update, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.cpp
; Line 156
    sub esp, 12 ; 0000000cH
    push ebx
    push esi
    push edi
    mov esi, ecx
; Line 158
    call ?GetElapsedTime@CGame@@QAEMXZ ; CGame::GetElapsedTime
    fstp DWORD PTR [esi+8]
; Line 172
    mov DWORD PTR _LightPos$[esp+24], 1131020288 ; 436a0000H
; Line 173
    mov DWORD PTR _LightPos$[esp+28], 1131020288 ; 436a0000H
; Line 174
    mov DWORD PTR _LightPos$[esp+32], 1131020288 ; 436a0000H
    mov ebx, 13976 ; 00003698H
$L59093:
; Line 192
    mov edi, 3
$L59097:
; Line 265
    lea eax, DWORD PTR _LightPos$[esp+24]
    mov ecx, esi
    push eax
    call ?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z ; CGame::MoofGetLength
    dec edi
    fstp ST(0)
    jne SHORT $L59097
; Line 182
    dec ebx
    jne SHORT $L59093
; Line 397
    fld DWORD PTR [esi+181232]
    fadd DWORD PTR __real@4@3fff8000000000000000
    fst DWORD PTR [esi+181232]
; Line 398
    fld DWORD PTR [esi+181228]
    fadd DWORD PTR [esi+8]
    fst DWORD PTR [esi+181228]
; Line 399
    fcomp DWORD PTR __real@4@3fff8000000000000000
    fnstsw ax
    test ah, 1
    jne SHORT $L60409
; Line 405
    sub esp, 8
    mov ecx, OFFSET FLAT:?file@@3Vofstream@@A
    mov DWORD PTR ?file@@3Vofstream@@A+4, 1
    fstp QWORD PTR [esp]
    call ??6ostream@@QAEAAV0@N@Z ; ostream::operator<<
; Line 406
    push OFFSET FLAT:??_C@_01BJG@?6?$AA@ ; `string'
    mov ecx, OFFSET FLAT:?file@@3Vofstream@@A
    call ??6ostream@@QAEAAV0@PBD@Z ; ostream::operator<<
; Line 407
    xor eax, eax
    pop edi
    mov DWORD PTR [esi+181232], eax
; Line 408
    mov DWORD PTR [esi+181228], eax
    pop esi
    pop ebx
; Line 417
    add esp, 12 ; 0000000cH
    ret 0
$L60409:
    pop edi
    pop esi
; Line 408
    fstp ST(0)
    pop ebx
; Line 417
    add esp, 12 ; 0000000cH
    ret 0
?Update@CGame@@QAEXXZ ENDP ; CGame::Update
[/source]
and the function with the good loop
[source = "asm"]
?Update@CGame@@QAEXXZ PROC NEAR ; CGame::Update, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.cpp
; Line 156
    push ecx
    push esi
    mov esi, ecx
; Line 158
    call ?GetElapsedTime@CGame@@QAEMXZ ; CGame::GetElapsedTime
    fst DWORD PTR [esi+8]
; Line 397
    fld DWORD PTR [esi+181232]
    fadd DWORD PTR __real@4@3fff8000000000000000
    fst DWORD PTR -4+[esp+8]
    fstp DWORD PTR [esi+181232]
; Line 398
    fadd DWORD PTR [esi+181228]
    fst DWORD PTR [esi+181228]
; Line 399
    fcomp DWORD PTR __real@4@3fff8000000000000000
    fnstsw ax
    test ah, 1
    jne SHORT $L59101
; Line 405
    fld DWORD PTR -4+[esp+8]
    sub esp, 8
    mov ecx, OFFSET FLAT:?file@@3Vofstream@@A
    mov DWORD PTR ?file@@3Vofstream@@A+4, 1
    fstp QWORD PTR [esp]
    call ??6ostream@@QAEAAV0@N@Z ; ostream::operator<<
; Line 406
    push OFFSET FLAT:??_C@_01BJG@?6?$AA@ ; `string'
    mov ecx, OFFSET FLAT:?file@@3Vofstream@@A
    call ??6ostream@@QAEAAV0@PBD@Z ; ostream::operator<<
; Line 407
    xor eax, eax
    mov DWORD PTR [esi+181232], eax
; Line 408
    mov DWORD PTR [esi+181228], eax
$L59101:
    pop esi
; Line 417
    pop ecx
    ret 0
?Update@CGame@@QAEXXZ ENDP ; CGame::Update
[/source]
The reason is that the compiler optimized away all the code in the first case. It didn't optimize away nearly as much in the second case. I don't know why the second case wasn't optimized as well as the first, but that's irrelevant since the test is flawed.
This kind of thing comes up time and time again. Do not write code just to test the speed of something unless you are well experienced at doing so. There are a number of rules to go by, like:
There must be a result (other than the timing results) output somewhere.
It must not be possible to remove a single line of code from anywhere and still get the same result. i.e. any value you assign to a variable must be subsequently used.
All tests must be with optimisations on.
There's more to it than that, but that's a bare minimum.
Also in this specific case sqrt is likely to be faster than pow(x, 0.5) I believe.
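To make those rules a bit more concrete, here is a minimal sketch of a timing loop that follows them. It assumes a C++11 compiler (so not VC6, which has no `<chrono>`), and the names `TimedResult` and `time_length_loop` are made up for illustration, not taken from the original project. The loop input depends on the loop counter, and the accumulated checksum is returned so the caller can print it next to the timing.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical result type: the checksum is the "result other than the
// timing" that has to be output somewhere by the caller.
struct TimedResult {
    float checksum;   // sum of all lengths computed in the loop
    double seconds;   // elapsed wall-clock time for the loop
};

// Hypothetical harness: computes one vector length per iteration.
TimedResult time_length_loop(std::size_t iterations) {
    const auto start = std::chrono::steady_clock::now();

    float checksum = 0.0f;
    for (std::size_t i = 0; i < iterations; ++i) {
        // x changes with i, so the expression is not loop-invariant.
        const float x = static_cast<float>(i % 16);
        const float y = 2.0f;
        const float z = 3.0f;
        // Every computed value feeds the checksum, so no line can be
        // removed without changing the returned result.
        checksum += std::sqrt(x * x + y * y + z * z);
    }

    const auto stop = std::chrono::steady_clock::now();
    return {checksum, std::chrono::duration<double>(stop - start).count()};
}
```

Call it from a release build and write both `checksum` and `seconds` to the log or console. If the checksum is never used, the optimizer is free to throw the whole loop away again.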
If the loop is actually calling the function instead of inlining it, then that is your primary problem. The function call/return mechanism can add a lot of overhead in small loops like that. On top of that, the compiler can optimize the inlined version to cache LightPos.x, LightPos.y, and LightPos.z in registers because it knows they're not being modified in that loop. It's not safe to make the same assumption for parameters passed to a function. Since it doesn't have to touch memory at all, the performance of that loop should be lightning fast. Make sure you're not compiling in debug (which disables inlining), and that optimizations and inlining are turned on.
Another thing to point out is that you should be using powf() instead of pow(). Well, actually in this case I think you should be using sqrtf(). In the Microsoft runtime library, most of the math functions have double and single precision float versions, and while in some cases the single-precision version may just be recasting the double-precision version, in many cases it uses a faster algorithm. I've validated that with tests like the one you're running. The sqrtf() should be faster than powf() because it is more specific, which makes it easier to use hacks to optimize it. (A hack that applies to sqrt() may not apply to all cases of pow()).
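For reference, the two variants being compared would look roughly like this. It's only a sketch: `Point3D`, `length_sqrtf` and `length_powf` are hypothetical names standing in for whatever the original `MoofGetLength` and its vector type look like. Both should return the same length to within float rounding, so any speed difference comes from the library call, not from the math.

```cpp
#include <math.h>

// Hypothetical stand-in for the vector type used in the thread.
struct Point3D {
    float x, y, z;
};

// Length via the single-precision square root.
inline float length_sqrtf(const Point3D& v) {
    return sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Same length via the general single-precision power function.
inline float length_powf(const Point3D& v) {
    return powf(v.x * v.x + v.y * v.y + v.z * v.z, 0.5f);
}
```

Which one is actually faster on a given compiler and runtime is something to measure in a release build, not something this sketch proves.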
EDIT: I just realized that the compiler should actually be caching the result of (LightPos.x * LightPos.x) + (LightPos.y * LightPos.y) + (LightPos.z * LightPos.z) instead of the individual values. Reading assembler makes my head hurt, so I'm just making an educated guess. To be honest, it's not a very good test loop because it is too optimizable to be a good real-world scenario. Try the same thing with an array of POINT3D objects, using a different one each time through the loop. I think you'll find it doesn't run nearly as fast, but it should still be faster than the function call.
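Something along these lines is what I mean by the array version. Again just a sketch with made-up names (`Point3D`, `make_points`, `sum_lengths`), and it uses `std::vector` where the original code may well use a plain array. Each iteration reads a different point, so the compiler can't hoist the whole computation out of the loop, and the summed result has to be used by the caller.

```cpp
#include <math.h>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the thread's POINT3D type.
struct Point3D {
    float x, y, z;
};

inline float length(const Point3D& v) {
    return sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Builds test data where point i is (3i, 4i, 0), i.e. has length 5i.
std::vector<Point3D> make_points(std::size_t count) {
    std::vector<Point3D> points;
    points.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const float f = static_cast<float>(i);
        points.push_back(Point3D{3.0f * f, 4.0f * f, 0.0f});
    }
    return points;
}

// Visits a different point on every iteration and accumulates the lengths.
float sum_lengths(const std::vector<Point3D>& points) {
    float total = 0.0f;
    for (std::size_t i = 0; i < points.size(); ++i) {
        total += length(points[i]);
    }
    return total;
}
```

Time only the `sum_lengths` call, and print the returned total along with the timing so the loop can't be optimized away.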