Slowdown!

Started by
21 comments, last by Drythe 18 years, 10 months ago
The VC++6 compiler left quite a bit to be desired as far as optimizations go. I have to use VC6 at work for some projects, and I usually work on them in VC2003, then just make sure everything still works in VC6 after changes are made.
I've found that often, the VS2003 debug executables are about as fast as (and smaller than) the VC6 release (w/ max optimizations turned on) executables for some of the projects.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
sorry, how do I get to the assembly code?
1. Project->Settings
2. C/C++ Tab
3. Listing Files under category
4. Assembly-Only Listing/Assembly with Source Code for the Listing File Type

Then you can use the [source="asm"] code here [/source] tags to paste it in [wink]
Umm, just a note. Debug builds will usually avoid inlining functions. My entire engine is plagued with this problem. I get about 100fps in debug mode, and about 1500 in release. That's with zero rendering. It's because I wrap almost every little action in its own method. With optimizations on, you should notice zero difference between those two versions of code.
dear god.. ok, I think this is the assembly for the GetLength function in the bad loop.

[source = "asm"]
?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z PROC NEAR	; CGame::MoofGetLength, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.h
; Line 65
	sub	esp, 8
; Line 69
	mov	eax, DWORD PTR _Vector$[esp+4]
	mov	ecx, DWORD PTR [eax+8]
	mov	edx, DWORD PTR [eax+4]
	mov	eax, DWORD PTR [eax]
	mov	DWORD PTR -8+[esp+8], edx
	mov	DWORD PTR 8+[esp+4], eax
	mov	DWORD PTR -4+[esp+8], ecx
	fld	DWORD PTR 8+[esp+4]
	fmul	DWORD PTR 8+[esp+4]
	fld	DWORD PTR -8+[esp+8]
	fmul	DWORD PTR -8+[esp+8]
	faddp	ST(1), ST(0)
	fld	DWORD PTR -4+[esp+8]
	fmul	DWORD PTR -4+[esp+8]
	faddp	ST(1), ST(0)
	fld	QWORD PTR __real@8@3ffe8000000000000000
	call	__CIpow
; Line 73
	add	esp, 8
	ret	4
?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z ENDP		; CGame::MoofGetLength
[/source]


and the function that the bad loop is in

[source = "asm"]
?Update@CGame@@QAEXXZ PROC NEAR				; CGame::Update, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.cpp
; Line 156
	sub	esp, 12					; 0000000cH
	push	ebx
	push	esi
	push	edi
	mov	esi, ecx
; Line 158
	call	?GetElapsedTime@CGame@@QAEMXZ		; CGame::GetElapsedTime
	fstp	DWORD PTR [esi+8]
; Line 172
	mov	DWORD PTR _LightPos$[esp+24], 1131020288 ; 436a0000H
; Line 173
	mov	DWORD PTR _LightPos$[esp+28], 1131020288 ; 436a0000H
; Line 174
	mov	DWORD PTR _LightPos$[esp+32], 1131020288 ; 436a0000H
	mov	ebx, 13976				; 00003698H
$L59093:
; Line 192
	mov	edi, 3
$L59097:
; Line 265
	lea	eax, DWORD PTR _LightPos$[esp+24]
	mov	ecx, esi
	push	eax
	call	?MoofGetLength@CGame@@QAE?BMABUP3D@1@@Z	; CGame::MoofGetLength
	dec	edi
	fstp	ST(0)
	jne	SHORT $L59097
; Line 182
	dec	ebx
	jne	SHORT $L59093
; Line 397
	fld	DWORD PTR [esi+181232]
	fadd	DWORD PTR __real@4@3fff8000000000000000
	fst	DWORD PTR [esi+181232]
; Line 398
	fld	DWORD PTR [esi+181228]
	fadd	DWORD PTR [esi+8]
	fst	DWORD PTR [esi+181228]
; Line 399
	fcomp	DWORD PTR __real@4@3fff8000000000000000
	fnstsw	ax
	test	ah, 1
	jne	SHORT $L60409
; Line 405
	sub	esp, 8
	mov	ecx, OFFSET FLAT:?file@@3Vofstream@@A
	mov	DWORD PTR ?file@@3Vofstream@@A+4, 1
	fstp	QWORD PTR [esp]
	call	??6ostream@@QAEAAV0@N@Z			; ostream::operator<<
; Line 406
	push	OFFSET FLAT:??_C@_01BJG@?6?$AA@		; `string'
	mov	ecx, OFFSET FLAT:?file@@3Vofstream@@A
	call	??6ostream@@QAEAAV0@PBD@Z		; ostream::operator<<
; Line 407
	xor	eax, eax
	pop	edi
	mov	DWORD PTR [esi+181232], eax
; Line 408
	mov	DWORD PTR [esi+181228], eax
	pop	esi
	pop	ebx
; Line 417
	add	esp, 12					; 0000000cH
	ret	0
$L60409:
	pop	edi
	pop	esi
; Line 408
	fstp	ST(0)
	pop	ebx
; Line 417
	add	esp, 12					; 0000000cH
	ret	0
?Update@CGame@@QAEXXZ ENDP				; CGame::Update
[/source]
and the function with the good loop

[source = "asm"]
?Update@CGame@@QAEXXZ PROC NEAR				; CGame::Update, COMDAT
; File C:\WINDOWS\Desktop\Myth\Game.cpp
; Line 156
	push	ecx
	push	esi
	mov	esi, ecx
; Line 158
	call	?GetElapsedTime@CGame@@QAEMXZ		; CGame::GetElapsedTime
	fst	DWORD PTR [esi+8]
; Line 397
	fld	DWORD PTR [esi+181232]
	fadd	DWORD PTR __real@4@3fff8000000000000000
	fst	DWORD PTR -4+[esp+8]
	fstp	DWORD PTR [esi+181232]
; Line 398
	fadd	DWORD PTR [esi+181228]
	fst	DWORD PTR [esi+181228]
; Line 399
	fcomp	DWORD PTR __real@4@3fff8000000000000000
	fnstsw	ax
	test	ah, 1
	jne	SHORT $L59101
; Line 405
	fld	DWORD PTR -4+[esp+8]
	sub	esp, 8
	mov	ecx, OFFSET FLAT:?file@@3Vofstream@@A
	mov	DWORD PTR ?file@@3Vofstream@@A+4, 1
	fstp	QWORD PTR [esp]
	call	??6ostream@@QAEAAV0@N@Z			; ostream::operator<<
; Line 406
	push	OFFSET FLAT:??_C@_01BJG@?6?$AA@		; `string'
	mov	ecx, OFFSET FLAT:?file@@3Vofstream@@A
	call	??6ostream@@QAEAAV0@PBD@Z		; ostream::operator<<
; Line 407
	xor	eax, eax
	mov	DWORD PTR [esi+181232], eax
; Line 408
	mov	DWORD PTR [esi+181228], eax
$L59101:
	pop	esi
; Line 417
	pop	ecx
	ret	0
?Update@CGame@@QAEXXZ ENDP				; CGame::Update
[/source]
The reason is that the compiler optimized away all the code in the first case. It didn't optimize away nearly as much in the second case. I don't know why the second case wasn't optimized as well as the first, but that's irrelevant since the test is flawed.
John Bolton, Locomotive Games (THQ). Current Project: Destroy All Humans (Wii). IN STORES NOW!
This kind of thing comes up time and time again. Do not write code just to test the speed of something unless you are well experienced at doing so. There are a number of rules to go by, like:
There must be a result (other than the timing results) output somewhere.
It must not be possible to remove a single line of code from anywhere and still get the same result. i.e. any value you assign to a variable must be subsequently used.
All tests must be with optimisations on.
There's more to it than that, but that's a bare minimum.
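To make those rules concrete, here is a rough sketch of a timing harness, not the original poster's code. The names `Vec3`, `Length`, `BenchResult` and `TimeLength` are made up for illustration, and it uses `<chrono>` from C++11, so it assumes a newer compiler than VC6. Every call feeds a checksum that the caller is expected to print, which makes it much harder for the optimizer to throw the work away.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical 3-component vector; the thread's own type is not shown in full.
struct Vec3 { float x, y, z; };

inline float Length(const Vec3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

struct BenchResult {
    double seconds;   // elapsed wall-clock time for all passes
    double checksum;  // sum of every computed length; the caller should output it
};

// Times Length() over the given inputs. Each result is added to the
// checksum, so no call can be removed without changing the returned value.
BenchResult TimeLength(const std::vector<Vec3>& input, int passes) {
    assert(passes > 0);
    double sum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        for (const Vec3& v : input) {
            sum += Length(v);
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(stop - start).count(), sum };
}
```

Build it with optimizations on, and write both `seconds` and `checksum` to your log so there is a result other than the timing.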

Also, in this specific case, I believe sqrt is likely to be faster than pow(x, 0.5).
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
If the loop is actually calling the function instead of inlining it, then that is your primary problem. The function call/return mechanism can add a lot of overhead in small loops like that. On top of that, the compiler can optimize the inlined version to cache LightPos.x, LightPos.y, and LightPos.z in registers because it knows they're not being modified in that loop. It's not safe to make the same assumption for parameters passed to a function. Since it doesn't have to touch memory at all, the performance of that loop should be lightning fast. Make sure you're not compiling in debug (which disables inlining), and that optimizations and inlining are turned on.

Another thing to point out is that you should be using powf() instead of pow(). Well, actually in this case I think you should be using sqrtf(). In the Microsoft runtime library, most of the math functions have double and single precision float versions, and while in some cases the single-precision version may just be recasting the double-precision version, in many cases it uses a faster algorithm. I've validated that with tests like the one you're running. The sqrtf() should be faster than powf() because it is more specific, which makes it easier to use hacks to optimize it. (A hack that applies to sqrt() may not apply to all cases of pow()).
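For comparison, here is a small sketch of the two ways to write the length function. `Vec3`, `LengthPow` and `LengthSqrt` are hypothetical names, since the original GetLength source was not posted; the listing above only shows that it ends in a call to `__CIpow` with an exponent of 0.5. Both versions should return the same length to within rounding, and which one is faster is something to measure on your own compiler rather than take from me.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the thread's vector type.
struct Vec3 { float x, y, z; };

// Length via the general power function with an exponent of 0.5.
inline float LengthPow(const Vec3& v) {
    return std::pow(v.x * v.x + v.y * v.y + v.z * v.z, 0.5f);
}

// Length via the single-precision square root; std::sqrt(float)
// selects the float overload, equivalent to calling sqrtf().
inline float LengthSqrt(const Vec3& v) {
    assert(v.x == v.x && v.y == v.y && v.z == v.z);  // reject NaN inputs in debug builds
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}
```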

EDIT: I just realized that the compiler should actually be caching the result of (LightPos.x * LightPos.x) + (LightPos.y * LightPos.y) + (LightPos.z * LightPos.z) instead of the individual values. Reading assembler makes my head hurt, so I'm just making an educated guess. To be honest, it's not a very good test loop because it is too optimizable to be a good real-world scenario. Try the same thing with an array of POINT3D objects, using a different one each time through the loop. I think you'll find it doesn't run nearly as fast, but it should still be faster than the function call.
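Something along these lines is what I mean by the array test. This is only a sketch: `Point3D`, `MakePoints` and `SumLengths` are names I invented here, standing in for whatever POINT3D looks like in your project. Because each iteration reads a different element, the compiler cannot hoist one cached length out of the loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the POINT3D type mentioned above.
struct Point3D { float x, y, z; };

// Builds n points with distinct coordinates so no two iterations
// of the test loop operate on the same values.
std::vector<Point3D> MakePoints(std::size_t n) {
    std::vector<Point3D> pts(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float f = static_cast<float>(i);
        pts[i] = { f, f + 1.0f, f + 2.0f };
    }
    return pts;
}

// Sums the length of every point. The running total depends on each
// element, so the loop body has to execute once per point.
float SumLengths(const std::vector<Point3D>& pts) {
    float total = 0.0f;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const Point3D& p = pts[i];
        total += std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
    }
    assert(total >= 0.0f);  // lengths are never negative
    return total;
}
```

Time `SumLengths` on a large array in a release build and output the total along with the elapsed time.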
yeah, i'm compiling in release. Just to be sure, which options specifically should I have set in the compiler to optimize the code? I don't think I've ever touched them.

the sqrtf did make it a bit faster, btw

This topic is closed to new replies.
