Drythe

Slowdown!

I'm having some trouble with slowdown and tried to pinpoint what's going wrong by testing it with the following code. Here are 2 different versions of a loop that do the same thing. With the first one I'm getting about 130,000 program cycles/sec, and with the second one, 18 :P As you can see, the only difference is that in the first I'm writing the math out manually, and in the second I'm using a function that does the same thing.

Good:

POINT3D LightPos = POINT3D(34, 546, 56);
float u;
for (int i = 0; i < 13976; i++)
{
    for (int k = 0; k < 3; k++)
    {
        // this is pasted 16 times in the real code
        u = pow((LightPos.x * LightPos.x) + (LightPos.y * LightPos.y) + (LightPos.z * LightPos.z), 0.5);
    }
}

Bad:

POINT3D LightPos = POINT3D(34, 546, 56);
float u;
for (int i = 0; i < 13976; i++)
{
    for (int k = 0; k < 3; k++)
    {
        // this is pasted 16 times in the real code
        u = GetLength(LightPos);
    }
}

Here are the definitions used:

////////////////////////////////////////////////////////////////////////////
inline const float GetLength(const POINT3D& Vector)
{
    return pow((Vector.x * Vector.x) + (Vector.y * Vector.y) + (Vector.z * Vector.z), 0.5);
}

////////////////////////////////////////////////////////////////////////////
typedef struct POINT3D
{
public:
    POINT3D() {};
    POINT3D(float xf, float yf, float zf) {x = xf; y = yf; z = zf;}
    float x, y, z;
} POINT3D;

A couple of notes: the inline seems to be working, I guess, because when I remove it the 18 program cycles/sec goes down to 9. Also, when I move the 'float u;' into the header file, the 18 goes to 32. I don't know if this is a hint of any kind. Also, this data isn't being used anywhere in the program. In fact, this is about all the program is, since I commented all of the main stuff out. Everything's still being compiled though...

P.S. Sorry about the formatting... how do you do those little code boxes on this forum?

well, I do use the double fors in the final code, but I commented just about everything out to test the slowdown problem.

yes

////////////////////////////////////////////////////////////////////////////
inline const float GetLength(const POINT3D& Vector)
{

return pow((Vector.x * Vector.x) +
(Vector.y * Vector.y) +
(Vector.z * Vector.z), 0.5);

}

I mean.. it does the same thing, but there must be a key difference on the compiler level or something. Possibly something to do with memory? i dunno:)

Can you take a look at the assembly code the compiler generates in both cases?
You can post it here.

The VC++6 compiler left quite a bit to be desired as far as optimizations go. I have to use VC6 at work for some projects, and I usually work on them in VC2003, then just make sure everything still works in VC6 after changes are made.
I've found that often, the VS2003 debug executables are about as fast as (and smaller than) the VC6 release (w/ max optimizations turned on) executables for some of the projects.

1. Project->Settings
2. C/C++ Tab
3. Listing Files under category
4. Assembly-Only Listing/Assembly with Source Code for the Listing File Type

Then you can use the [source="asm"] code here [/source] tags to paste it in [wink]

Umm, just a note: debug builds will usually avoid inlining functions. My entire engine is plagued with this problem. I get about 100fps in debug mode and about 1500 in release, and that's with zero rendering. It's because I wrap almost every little action in a method. With optimizations on, you should notice zero difference between those two versions of code.

dear god.. ok, I think this is the assembly for the GetLength function in the bad loop.

[source="asm"]
?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z PROC NEAR	; CGame::MoofGetLength, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.h
; Line 65
sub esp, 8
; Line 69
mov eax, DWORD PTR _Vector$[esp+4]
mov ecx, DWORD PTR [eax+8]
mov edx, DWORD PTR [eax+4]
mov eax, DWORD PTR [eax]
mov DWORD PTR -8+[esp+8], edx
mov DWORD PTR 8+[esp+4], eax
mov DWORD PTR -4+[esp+8], ecx
fld DWORD PTR 8+[esp+4]
fmul DWORD PTR 8+[esp+4]
fld DWORD PTR -8+[esp+8]
fmul DWORD PTR -8+[esp+8]
faddp ST(1), ST(0)
fld DWORD PTR -4+[esp+8]
fmul DWORD PTR -4+[esp+8]
faddp ST(1), ST(0)
fld QWORD PTR __real@8@3ffe8000000000000000
call __CIpow
; Line 73
add esp, 8
ret 4
?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z ENDP ; CGame::MoofGetLength
[/source]



and the function that the bad loop is in

[source="asm"]
?Update@CGame@@QAEXXZ PROC NEAR				; CGame::Update, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.cpp
; Line 156
sub esp, 12 ; 0000000cH
push ebx
push esi
push edi
mov esi, ecx
; Line 158
call ?GetElapsedTime@CGame@@QAEMXZ ; CGame::GetElapsedTime
fstp DWORD PTR [esi+8]
; Line 172
mov DWORD PTR _LightPos$[esp+24], 1131020288 ; 436a0000H
; Line 173
mov DWORD PTR _LightPos$[esp+28], 1131020288 ; 436a0000H
; Line 174
mov DWORD PTR _LightPos$[esp+32], 1131020288 ; 436a0000H
mov ebx, 13976 ; 00003698H
$L59093:
; Line 192
mov edi, 3
$L59097:
; Line 265
lea eax, DWORD PTR _LightPos$[esp+24]
mov ecx, esi
push eax
call ?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z ; CGame::MoofGetLength
dec edi
fstp ST(0)
jne SHORT $L59097
; Line 182
dec ebx
jne SHORT $L59093
; Line 397
fld DWORD PTR [esi+181232]
fadd DWORD PTR __real@4@3fff8000000000000000
fst DWORD PTR [esi+181232]
; Line 398
fld DWORD PTR [esi+181228]
fadd DWORD PTR [esi+8]
fst DWORD PTR [esi+181228]
; Line 399
fcomp DWORD PTR __real@4@3fff8000000000000000
fnstsw ax
test ah, 1
jne SHORT $L60409
; Line 405
sub esp, 8
mov ecx, OFFSET FLAT:?file@@3Vofstream@@A
mov DWORD PTR ?file@@3Vofstream@@A+4, 1
fstp QWORD PTR [esp]
call ??6ostream@@QAEAAV0@N@Z ; ostream::operator<<
; Line 406
push OFFSET FLAT:??_C@_01BJG@?6?$AA@ ; `string'
mov ecx, OFFSET FLAT:?file@@3Vofstream@@A
call ??6ostream@@QAEAAV0@PBD@Z ; ostream::operator<<
; Line 407
xor eax, eax
pop edi
mov DWORD PTR [esi+181232], eax
; Line 408
mov DWORD PTR [esi+181228], eax
pop esi
pop ebx
; Line 417
add esp, 12 ; 0000000cH
ret 0
$L60409:
pop edi
pop esi
; Line 408
fstp ST(0)
pop ebx
; Line 417
add esp, 12 ; 0000000cH
ret 0
?Update@CGame@@QAEXXZ ENDP ; CGame::Update
[/source]

and the function with the good loop

[source="asm"]
?Update@CGame@@QAEXXZ PROC NEAR				; CGame::Update, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.cpp
; Line 156
push ecx
push esi
mov esi, ecx
; Line 158
call ?GetElapsedTime@CGame@@QAEMXZ ; CGame::GetElapsedTime
fst DWORD PTR [esi+8]
; Line 397
fld DWORD PTR [esi+181232]
fadd DWORD PTR __real@4@3fff8000000000000000
fst DWORD PTR -4+[esp+8]
fstp DWORD PTR [esi+181232]
; Line 398
fadd DWORD PTR [esi+181228]
fst DWORD PTR [esi+181228]
; Line 399
fcomp DWORD PTR __real@4@3fff8000000000000000
fnstsw ax
test ah, 1
jne SHORT $L59101
; Line 405
fld DWORD PTR -4+[esp+8]
sub esp, 8
mov ecx, OFFSET FLAT:?file@@3Vofstream@@A
mov DWORD PTR ?file@@3Vofstream@@A+4, 1
fstp QWORD PTR [esp]
call ??6ostream@@QAEAAV0@N@Z ; ostream::operator<<
; Line 406
push OFFSET FLAT:??_C@_01BJG@?6?$AA@ ; `string'
mov ecx, OFFSET FLAT:?file@@3Vofstream@@A
call ??6ostream@@QAEAAV0@PBD@Z ; ostream::operator<<
; Line 407
xor eax, eax
mov DWORD PTR [esi+181232], eax
; Line 408
mov DWORD PTR [esi+181228], eax
$L59101:
pop esi
; Line 417
pop ecx
ret 0
?Update@CGame@@QAEXXZ ENDP ; CGame::Update
[/source]

The reason is that the compiler optimized away all the code in the first case. It didn't optimize away nearly as much in the second case. I don't know why the second case wasn't optimized as well as the first, but that's irrelevant since the test is flawed.

This kind of thing comes up time and time again. Do not write code just to test the speed of something unless you are well experienced at doing so. There are a number of rules to go by, like:
There must be a result (other than the timing results) output somewhere.
It must not be possible to remove a single line of code from anywhere and still get the same result. i.e. any value you assign to a variable must be subsequently used.
All tests must be with optimisations on.
There's more to it than that, but that's a bare minimum.
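
A minimal sketch of a test loop that follows the rules above, assuming a C++11 compiler for std::chrono. TimeLengths and ReportLengths are hypothetical names, not from the original code, and POINT3D and GetLength are simplified stand-ins for the poster's definitions (with sqrt swapped in for pow). The input changes on every iteration and the accumulated sum is returned and printed, so the optimizer is not free to throw the work away.

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// Simplified stand-in for the POINT3D struct from the first post.
struct POINT3D {
    float x, y, z;
};

// Stand-in for GetLength; uses the float overload of std::sqrt.
inline float GetLength(const POINT3D& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Hypothetical helper: runs `iterations` length computations, stores the
// elapsed time in *seconds, and returns the accumulated sum so the work
// has a result other than the timing itself.
double TimeLengths(int iterations, double* seconds) {
    const auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (int i = 0; i < iterations; ++i) {
        // The input depends on i, so the computation is not loop-invariant.
        const POINT3D p = { static_cast<float>(i), 546.0f, 56.0f };
        sum += GetLength(p);
    }
    const auto stop = std::chrono::steady_clock::now();
    *seconds = std::chrono::duration<double>(stop - start).count();
    return sum;
}

// Hypothetical reporting helper: outputs the checksum along with the timing.
void ReportLengths(int iterations) {
    double seconds = 0.0;
    const double sum = TimeLengths(iterations, &seconds);
    std::printf("checksum = %f, elapsed = %f s\n", sum, seconds);
}
```

Calling ReportLengths(13976 * 3) from an optimized build would roughly mirror the loop counts in the first post. This is still only a sketch; a serious benchmark would repeat the measurement several times.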

Also, in this specific case I believe sqrt(x) is likely to be faster than pow(x, 0.5).

If the loop is actually calling the function instead of inlining it, then that is your primary problem. The function call/return mechanism can add a lot of overhead in small loops like that. On top of that, the compiler can optimize the inlined version to cache LightPos.x, LightPos.y, and LightPos.z in registers because it knows they're not being modified in that loop. It's not safe to make the same assumption for parameters passed to a function. Since it doesn't have to touch memory at all, the performance of that loop should be lightning fast. Make sure you're not compiling in debug (which disables inlining), and that optimizations and inlining are turned on.

Another thing to point out is that you should be using powf() instead of pow(). Well, actually in this case I think you should be using sqrtf(). In the Microsoft runtime library, most of the math functions have double and single precision float versions, and while in some cases the single-precision version may just be recasting the double-precision version, in many cases it uses a faster algorithm. I've validated that with tests like the one you're running. The sqrtf() should be faster than powf() because it is more specific, which makes it easier to use hacks to optimize it. (A hack that applies to sqrt() may not apply to all cases of pow()).
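
For illustration, here are the two variants side by side, written against the standard C++ float overloads in <cmath> rather than the Microsoft-specific powf()/sqrtf() entry points. GetLengthPow and GetLengthSqrt are hypothetical names and POINT3D is a simplified stand-in for the poster's struct. Which one is actually faster depends on the compiler and runtime library, so treat this as something to measure, not a guarantee.

```cpp
#include <cmath>

// Simplified stand-in for the POINT3D struct from the first post.
struct POINT3D {
    float x, y, z;
};

// Variant from the thread: general-purpose power function with exponent 0.5.
// With two float arguments, std::pow selects the single-precision overload.
inline float GetLengthPow(const POINT3D& v) {
    return std::pow(v.x * v.x + v.y * v.y + v.z * v.z, 0.5f);
}

// Suggested variant: dedicated square root, single-precision overload.
inline float GetLengthSqrt(const POINT3D& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}
```

Both functions compute the same length up to rounding; the square-root version just states the intent more directly.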

EDIT: I just realized that the compiler should actually be caching the result of (LightPos.x * LightPos.x) + (LightPos.y * LightPos.y) + (LightPos.z * LightPos.z) instead of the individual values. Reading assembler makes my head hurt, so I'm just making an educated guess. To be honest, it's not a very good test loop because it is too optimizable to be a good real-world scenario. Try the same thing with an array of POINT3D objects, using a different one each time through the loop. I think you'll find it doesn't run nearly as fast, but it should still be faster than the function call.
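
A rough sketch of the array idea, assuming std::vector is acceptable in the test program. MakePoints and SumLengths are hypothetical helpers, and the point values are arbitrary (chosen so each length is a whole number). Every iteration reads a different POINT3D, so the compiler cannot compute one length up front and reuse it.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified stand-in for the POINT3D struct from the first post.
struct POINT3D {
    float x, y, z;
};

inline float GetLength(const POINT3D& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Hypothetical helper: builds `count` distinct points (i, 2i, 2i),
// whose lengths are exactly 3i.
std::vector<POINT3D> MakePoints(std::size_t count) {
    std::vector<POINT3D> points(count);
    for (std::size_t i = 0; i < count; ++i) {
        const float f = static_cast<float>(i);
        points[i].x = f;
        points[i].y = 2.0f * f;
        points[i].z = 2.0f * f;
    }
    return points;
}

// Hypothetical helper: sums the lengths so every call's result is used.
float SumLengths(const std::vector<POINT3D>& points) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < points.size(); ++i) {
        sum += GetLength(points[i]);
    }
    return sum;
}
```

Timing SumLengths on a 13976-element array, once with the function call and once with the math written out by hand, should give a fairer comparison than the constant LightPos loop.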

yeah, i'm compiling in release. Just to be sure, which options specifically should I have set in the compiler to optimize the code? I don't think I've ever touched them.

the sqrtf did make it a bit faster, btw

Open the Project Settings. First make sure the correct project and build configuration are selected. Then select the C/C++ tab and the Optimizations category. You should have "Maximize Speed" selected, and make sure the inline function expansion is not disabled. You may want to try changing it from "Only __inline" to "Any Suitable". Visual Studio 6.0 also has some pragmas that affect function inlining, but I'm not sure if they help at all. They are:


#pragma auto_inline( [{on | off}] )
#pragma inline_depth( [0... 255] )
#pragma inline_recursion( [{on | off}] )


Aside from that, try using an array of POINT3D objects like I suggested earlier. This will let you know if the compiler is making unfair optimizations based on the fact that the data isn't changing.

Oh, and click on the tiny "faq" link near the top-right corner of the page to find out how to create those source code boxes.

Something is fishy with the code you posted. The code shows GetLength as a free top-level function. The assembly on the other hand shows it as CGame::MoofGetLength. Which is it?

Assuming that CGame::MoofGetLength is correct, I don't think VC6 will inline member functions unless the inline code is in the class definition directly. It definitely won't do it if the called function is in a different file from the caller.

i.e. if you have this:


class CGame
{
inline const float MoofGetLength(const POINT3D& Vector);
};

inline const float CGame::MoofGetLength(const POINT3D& Vector)
{
// yadda yadda
}


Change it to:


class CGame
{
inline const float MoofGetLength(const POINT3D& Vector)
{
// yadda yadda
}
};



If that doesn't do it you can also try __forceinline instead of just inline. It's MS specific though and should be used only if you are absolutely sure you need it.
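
If the code ever has to build outside Visual C++, one common way to keep __forceinline out of the rest of the source is a small wrapper macro. The sketch below is an assumption, not something from the posted code: FORCE_INLINE is a hypothetical macro name, the GCC/Clang branch uses their always_inline attribute, and the body uses sqrt in place of the "yadda yadda".

```cpp
#include <cmath>

// Hypothetical portability macro; the name FORCE_INLINE is made up here.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// Simplified stand-in for the POINT3D struct from the first post.
struct POINT3D {
    float x, y, z;
};

class CGame {
public:
    // Defined inside the class body, so the definition is visible to
    // every caller that includes this header.
    FORCE_INLINE float MoofGetLength(const POINT3D& v) const {
        return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    }
};
```

Even with a forced-inline request the compiler can still decline in some situations, so checking the assembly listing remains the way to confirm what actually happened.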
