# Slowdown!


Drythe    122
I'm having some trouble with slowdown and tried to pinpoint what's going wrong by testing it with the following code. Here are 2 different versions of a loop statement that do the same thing. With the first one I'm getting about 130,000 program cycles/sec, and with the second one, 18 :P As you can see, the only difference is that in the first I'm writing the math out manually, and in the second I'm using a function that does the same thing.

Good:

    POINT3D LightPos = POINT3D(34, 546, 56);
    float u;

    for (int i = 0; i < 13976; i++)
    {
        for (int k = 0; k < 3; k++)
        {
            //this is pasted 16 times in the real code
            u = pow((LightPos.x * LightPos.x) +
                    (LightPos.y * LightPos.y) +
                    (LightPos.z * LightPos.z), 0.5);
        }
    }

Bad:

    POINT3D LightPos = POINT3D(34, 546, 56);
    float u;

    for (int i = 0; i < 13976; i++)
    {
        for (int k = 0; k < 3; k++)
        {
            //this is pasted 16 times in the real code
            u = GetLength( LightPos);
        }
    }

Here are the definitions used:

    ////////////////////////////////////////////////////////////////////////////
    inline const float GetLength(const POINT3D& Vector)
    {
        return pow((Vector.x * Vector.x) +
                   (Vector.y * Vector.y) +
                   (Vector.z * Vector.z), 0.5);
    }

    ////////////////////////////////////////////////////////////////////////////
    typedef struct POINT3D
    {
    public:
        POINT3D() {};
        POINT3D(float xf, float yf, float zf) {x = xf; y = yf; z = zf;}

        float x, y, z;
    }POINT3D;

*A couple of notes: the inline seems to be working, I guess, because when I remove it, the 18 program cycles/sec goes down to 9. Also, when I move the 'float u;' into the header file, the 18 goes to 32. I don't know if this is a hint of any kind. Also, this data isn't being used anywhere in the program. In fact, this is about all the program is, since I commented all of the main stuff out. Everything's still being compiled though...

P.S. Sorry about the formatting... how do you do those little code boxes on this forum?

##### Share on other sites
Why do you have two fors() if you don't use the value for k or i?

##### Share on other sites
Drythe    122
well, I do use the double fors in the final code, but I commented just about everything out to test the slowdown problem.

##### Share on other sites
GetLength( LightPos);
Does it have the exact same code as the line where you do the calculations?

##### Share on other sites
Drythe    122
yes

    ////////////////////////////////////////////////////////////////////////////
    inline const float GetLength(const POINT3D& Vector)
    {
        return pow((Vector.x * Vector.x) +
                   (Vector.y * Vector.y) +
                   (Vector.z * Vector.z), 0.5);
    }

##### Share on other sites
Drythe    122
I mean.. it does the same thing, but there must be a key difference on the compiler level or something. Possibly something to do with memory? i dunno:)

##### Share on other sites
I have no idea...
What compiler are you using?

Drythe    122
VC++6

##### Share on other sites
Can you take a look at the assembly code the compiler generates in both cases?
You can post it here.

##### Share on other sites
choffstein    1090
My 2 cents: This conundrum is whack.
I think the compiler output is necessary to help solve this one

##### Share on other sites
Extrarius    1412
The VC++6 compiler left quite a bit to be desired as far as optimizations go. I have to use VC6 at work for some projects, and I usually work on them in VC2003, then just make sure they still work in VC6 after changes are made.
I've found that often, the VS2003 debug executables are about as fast as (and smaller than) the VC6 release (w/ max optimizations turned on) executables for some of the projects.

##### Share on other sites
Drythe    122
sorry, how do I get to the assembly code?

##### Share on other sites
Drew_Benton    1861
1. Project->Settings
2. C/C++ Tab
3. Listing Files under category
4. Assembly-Only Listing/Assembly with Source Code for the Listing File Type

Then you can use the [source="asm"] code here [/source] tags to paste it in [wink]

##### Share on other sites
Jiia    592
Umm, just a note. Debug builds will usually avoid inlining functions. My entire engine is plagued with this problem. I get about 100fps in debug mode, and about 1500 in release. That's with zero rendering. It's because I wrap almost every little action in a method. With optimizations on, you should notice zero difference between those two versions of code.

##### Share on other sites
Drythe    122
dear god.. ok, I think this is the assembly for the GetLength function in the bad loop.

[source = "asm"]
?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z PROC NEAR	; CGame::MoofGetLength, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.h
; Line 65
	sub	esp, 8
; Line 69
	mov	eax, DWORD PTR _Vector$[esp+4]
	mov	ecx, DWORD PTR [eax+8]
	mov	edx, DWORD PTR [eax+4]
	mov	eax, DWORD PTR [eax]
	mov	DWORD PTR -8+[esp+8], edx
	mov	DWORD PTR 8+[esp+4], eax
	mov	DWORD PTR -4+[esp+8], ecx
	fld	DWORD PTR 8+[esp+4]
	fmul	DWORD PTR 8+[esp+4]
	fld	DWORD PTR -8+[esp+8]
	fmul	DWORD PTR -8+[esp+8]
	faddp	ST(1), ST(0)
	fld	DWORD PTR -4+[esp+8]
	fmul	DWORD PTR -4+[esp+8]
	faddp	ST(1), ST(0)
	fld	QWORD PTR __real@8@3ffe8000000000000000
	call	__CIpow
; Line 73
	add	esp, 8
	ret	4
?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z ENDP	; CGame::MoofGetLength
[/source]

and the function that the bad loop is in

[source = "asm"]
?Update@CGame@@QAEXXZ PROC NEAR	; CGame::Update, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.cpp
; Line 156
	sub	esp, 12	; 0000000cH
	push	ebx
	push	esi
	push	edi
	mov	esi, ecx
; Line 158
	call	?GetElapsedTime@CGame@@QAEMXZ	; CGame::GetElapsedTime
	fstp	DWORD PTR [esi+8]
; Line 172
	mov	DWORD PTR _LightPos$[esp+24], 1131020288	; 436a0000H
; Line 173
	mov	DWORD PTR _LightPos$[esp+28], 1131020288	; 436a0000H
; Line 174
	mov	DWORD PTR _LightPos$[esp+32], 1131020288	; 436a0000H
	mov	ebx, 13976	; 00003698H
$L59093:
; Line 192
	mov	edi, 3
$L59097:
; Line 265
	lea	eax, DWORD PTR _LightPos$[esp+24]
	mov	ecx, esi
	push	eax
	call	?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z	; CGame::MoofGetLength
	dec	edi
	fstp	ST(0)
	jne	SHORT $L59097
; Line 182
	dec	ebx
	jne	SHORT $L59093
; Line 397
	fld	DWORD PTR [esi+181232]
	fadd	DWORD PTR __real@4@3fff8000000000000000
	fst	DWORD PTR [esi+181232]
; Line 398
	fld	DWORD PTR [esi+181228]
	fadd	DWORD PTR [esi+8]
	fst	DWORD PTR [esi+181228]
; Line 399
	fcomp	DWORD PTR __real@4@3fff8000000000000000
	fnstsw	ax
	test	ah, 1
	jne	SHORT $L60409
; Line 405
	sub	esp, 8
	mov	ecx, OFFSET FLAT:?file@@3Vofstream@@A
	mov	DWORD PTR ?file@@3Vofstream@@A+4, 1
	fstp	QWORD PTR [esp]
	call	??6ostream@@QAEAAV0@N@Z	; ostream::operator<<
; Line 406
	push	OFFSET FLAT:??_C@_01BJG@?6?$AA@	; string'
	mov	ecx, OFFSET FLAT:?file@@3Vofstream@@A
	call	??6ostream@@QAEAAV0@PBD@Z	; ostream::operator<<
; Line 407
	xor	eax, eax
	pop	edi
	mov	DWORD PTR [esi+181232], eax
; Line 408
	mov	DWORD PTR [esi+181228], eax
	pop	esi
	pop	ebx
; Line 417
	add	esp, 12	; 0000000cH
	ret	0
$L60409:
	pop	edi
	pop	esi
; Line 408
	fstp	ST(0)
	pop	ebx
; Line 417
	add	esp, 12	; 0000000cH
	ret	0
?Update@CGame@@QAEXXZ ENDP	; CGame::Update
[/source]

##### Share on other sites
Drythe    122
and the function with the good loop

[source = "asm"]
?Update@CGame@@QAEXXZ PROC NEAR	; CGame::Update, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.cpp
; Line 156
	push	ecx
	push	esi
	mov	esi, ecx
; Line 158
	call	?GetElapsedTime@CGame@@QAEMXZ	; CGame::GetElapsedTime
	fst	DWORD PTR [esi+8]
; Line 397
	fld	DWORD PTR [esi+181232]
	fadd	DWORD PTR __real@4@3fff8000000000000000
	fst	DWORD PTR -4+[esp+8]
	fstp	DWORD PTR [esi+181232]
; Line 398
	fadd	DWORD PTR [esi+181228]
	fst	DWORD PTR [esi+181228]
; Line 399
	fcomp	DWORD PTR __real@4@3fff8000000000000000
	fnstsw	ax
	test	ah, 1
	jne	SHORT $L59101
; Line 405
	fld	DWORD PTR -4+[esp+8]
	sub	esp, 8
	mov	ecx, OFFSET FLAT:?file@@3Vofstream@@A
	mov	DWORD PTR ?file@@3Vofstream@@A+4, 1
	fstp	QWORD PTR [esp]
	call	??6ostream@@QAEAAV0@N@Z	; ostream::operator<<
; Line 406
	push	OFFSET FLAT:??_C@_01BJG@?6?$AA@	; string'
	mov	ecx, OFFSET FLAT:?file@@3Vofstream@@A
	call	??6ostream@@QAEAAV0@PBD@Z	; ostream::operator<<
; Line 407
	xor	eax, eax
	mov	DWORD PTR [esi+181232], eax
; Line 408
	mov	DWORD PTR [esi+181228], eax
$L59101:
	pop	esi
; Line 417
	pop	ecx
	ret	0
?Update@CGame@@QAEXXZ ENDP	; CGame::Update
[/source]

##### Share on other sites
JohnBolton    1372
The reason is that the compiler optimized away all the code in the first case. It didn't optimize away nearly as much in the second case. I don't know why the second case wasn't optimized as well as the first, but that's irrelevant since the test is flawed.

##### Share on other sites
iMalc    2466
This kind of thing comes up time and time again. Do not write code just to test the speed of something unless you are well experienced at doing so. There are a number of rules to go by, like:

1. There must be a result (other than the timing results) output somewhere.
2. It must not be possible to remove a single line of code from anywhere and still get the same result, i.e. any value you assign to a variable must be subsequently used.
3. All tests must be with optimisations on.

There's more to it than that, but that's a bare minimum.
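
A minimal sketch of a test loop that follows those rules, assuming a C++11 compiler (so not VC6, which has no `<chrono>`). The helper names `SumRepeatedLengths` and `TimeRepeatedLengths` are hypothetical, and `POINT3D` is reduced to a plain struct for brevity. Every computed length is added to a sum, and the sum is printed along with the timing, so the optimizer has no dead code to throw away.

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

struct POINT3D { float x, y, z; };

inline float GetLength(const POINT3D& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Hypothetical helper: every computed length feeds into the returned sum,
// so no assignment inside the loop is dead code.
float SumRepeatedLengths(const POINT3D& p, int count)
{
    float sum = 0.0f;
    for (int i = 0; i < count; ++i)
        sum += GetLength(p);
    return sum;
}

// Hypothetical helper: times one run and prints the checksum next to the
// timing, so a result other than the timing is output somewhere.
float TimeRepeatedLengths(int count)
{
    const POINT3D LightPos = {34.0f, 546.0f, 56.0f};

    const auto t0 = std::chrono::steady_clock::now();
    const float sum = SumRepeatedLengths(LightPos, count);
    const auto t1 = std::chrono::steady_clock::now();

    const double ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("checksum = %f, elapsed = %f ms\n", sum, ms);
    return sum;
}
```

Calling `TimeRepeatedLengths(13976 * 3)` from a release build would mirror the iteration count in the original post. An optimizer may still simplify a loop over constant input, so treat the timing as a rough indication only.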

Also in this specific case sqrt is likely to be faster than pow(x, 0.5) I believe.

##### Share on other sites
s_p_oneil    443
If the loop is actually calling the function instead of inlining it, then that is your primary problem. The function call/return mechanism can add a lot of overhead in small loops like that. On top of that, the compiler can optimize the inlined version to cache LightPos.x, LightPos.y, and LightPos.z in registers because it knows they're not being modified in that loop. It's not safe to make the same assumption for parameters passed to a function. Since it doesn't have to touch memory at all, the performance of that loop should be lightning fast. Make sure you're not compiling in debug (which disables inlining), and that optimizations and inlining are turned on.

Another thing to point out is that you should be using powf() instead of pow(). Well, actually in this case I think you should be using sqrtf(). In the Microsoft runtime library, most of the math functions have double and single precision float versions, and while in some cases the single-precision version may just be recasting the double-precision version, in many cases it uses a faster algorithm. I've validated that with tests like the one you're running. The sqrtf() should be faster than powf() because it is more specific, which makes it easier to use hacks to optimize it. (A hack that applies to sqrt() may not apply to all cases of pow()).
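
As a rough illustration of that suggestion, here is the length function written both ways. This is a sketch rather than the poster's actual code: it uses the `float` overloads of `std::sqrt` and `std::pow` from `<cmath>` (the standard C++ counterparts of `sqrtf()` and `powf()`), and `GetLengthPow` is a hypothetical name kept only for comparison.

```cpp
#include <cmath>

struct POINT3D { float x, y, z; };

// Square-root version: with float arguments, std::sqrt resolves to the
// single-precision overload.
inline float GetLength(const POINT3D& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// pow-based version for comparison: same formula, but through the more
// general power function.
inline float GetLengthPow(const POINT3D& v)
{
    return std::pow(v.x * v.x + v.y * v.y + v.z * v.z, 0.5f);
}
```

Both return the same length up to rounding; which one is faster depends on the compiler and runtime library, so it is worth measuring rather than assuming.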

EDIT: I just realized that the compiler should actually be caching the result of (LightPos.x * LightPos.x) + (LightPos.y * LightPos.y) + (LightPos.z * LightPos.z) instead of the individual values. Reading assembler makes my head hurt, so I'm just making an educated guess. To be honest, it's not a very good test loop because it is too optimizable to be a good real-world scenario. Try the same thing with an array of POINT3D objects, using a different one each time through the loop. I think you'll find it doesn't run nearly as fast, but it should still be faster than the function call.
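
The array idea could be sketched like this, assuming a C++11 compiler. `MakePoints` and `SumArrayLengths` are hypothetical helpers, and the point values are arbitrary. Because each iteration reads a different point, the compiler cannot compute one length up front and reuse it.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct POINT3D { float x, y, z; };

inline float GetLength(const POINT3D& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Hypothetical helper: fills an array with distinct points (i, 2i, 2i),
// so every element has a different length (3 * i).
std::vector<POINT3D> MakePoints(std::size_t count)
{
    std::vector<POINT3D> points(count);
    for (std::size_t i = 0; i < count; ++i)
    {
        const float f = static_cast<float>(i);
        points[i] = POINT3D{f, 2.0f * f, 2.0f * f};
    }
    return points;
}

// Hypothetical helper: uses a different point on each pass and keeps the
// results alive by summing them.
float SumArrayLengths(const std::vector<POINT3D>& points)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < points.size(); ++i)
        sum += GetLength(points[i]);
    return sum;
}
```

Timing `SumArrayLengths(MakePoints(13976))` once with the inlined math and once with the function call should give a fairer comparison than the constant-input loop.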

##### Share on other sites
Drythe    122
yeah, i'm compiling in release. Just to be sure, which options specifically should I have set in the compiler to optimize the code? I don't think I've ever touched them.

the sqrtf did make it a bit faster, btw

##### Share on other sites
s_p_oneil    443
Open the Project Settings. First make sure the correct project and build configuration are selected. Then select the C/C++ tab and the Optimizations category. You should have "Maximize Speed" selected, and make sure the inline function expansion is not disabled. You may want to try changing it from "Only __inline" to "Any Suitable". Visual Studio 6.0 also has some pragmas that affect function inlining, but I'm not sure if they help at all. They are:

    #pragma auto_inline( [{on | off}] )
    #pragma inline_depth( [0... 255] )
    #pragma inline_recursion( [{on | off}] )

Aside from that, try using an array of POINT3D objects like I suggested earlier. This will let you know if the compiler is making unfair optimizations based on the fact that the data isn't changing.

Oh, and click on the tiny "faq" link near the top-right corner of the page to find out how to create those source code boxes.

##### Share on other sites
Anon Mike    1098
Something is fishy with the code you posted. The code shows GetLength as a free top-level function. The assembly on the other hand shows it as CGame::MoofGetLength. Which is it?

Assuming that CGame::MoofGetLength is correct, I don't think VC6 will inline member functions unless the inline code is in the class definition directly. It definitely won't do it if the called function is defined in a different file from the caller.

i.e. if you have this:

    class CGame
    {
        inline const float MoofGetLength(const POINT3D& Vector);
    };

    inline const float CGame::MoofGetLength(const POINT3D& Vector)
    {
    // yadda yadda
    }

Change it to:

    class CGame
    {
        inline const float MoofGetLength(const POINT3D& Vector)
        {
        // yadda yadda
        }
    };

If that doesn't do it you can also try __forceinline instead of just inline. It's MS specific though and should be used only if you are absolutely sure you need it.
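
A hedged sketch of both suggestions together: the member function is defined inside the class body and marked with a force-inline hint. The `FORCE_INLINE` macro is hypothetical; it maps to `__forceinline` on MSVC, and the GCC/Clang branch is an added assumption so the example also builds elsewhere. The function body reuses the length formula from the thread, with `sqrt` in place of `pow`.

```cpp
#include <cmath>

// Hypothetical portability macro: __forceinline is MSVC-specific; the
// always_inline attribute is the closest GCC/Clang counterpart.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

struct POINT3D { float x, y, z; };

class CGame
{
public:
    // Defined inside the class body, so the full definition is visible to
    // every file that includes this header.
    FORCE_INLINE float MoofGetLength(const POINT3D& Vector) const
    {
        return std::sqrt((Vector.x * Vector.x) +
                         (Vector.y * Vector.y) +
                         (Vector.z * Vector.z));
    }
};
```

Even with the hint, the compiler may decline to inline in some situations (debug builds, for example), so checking the assembly listing is still the way to confirm it.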

##### Share on other sites
Drythe    122
yeah, the POINT3D array seems to be slowing down the good loop now