CodeAnalyst & interpreting the results

Started by
4 comments, last by Kitt3n 18 years, 2 months ago
Hello, I've profiled my application with AMD codeAnalyst, and it seems that lots cpu-time goes into kernel and driver calls (over 64% of cpu time!) What I'd really like to know is where I'm wasting cpu-time. I hope someone can help me interpret the results below, and give me a hint where to look for reducing the #calls into the kernel which I'm apparently making? For example from where & why is KiReleaseSpinLock called in my code (I guess it has to do with mutex or sth, but I only have a few mutex- locks, I'd be really suprised if let's say 5 locks per frame (30 fps) would take 5.32% of the cpu time... And why is Direct3DShaderValidator taking so much cpu-time? I've checked in the control panel and I'm using the release version of dx9 - is each shader checked each time before it's executed?! Regards, Roger --- ntoskrnl.exe 18.57% - kiReleaseSpinLock 5.32% - ExAcquireResourceExclusive 3.28% - KeSaveFloatingPointState 2.48% - KeRestoreFloatingPointState 2.46% - ExReleaseResourceLite 2.13% - ZwYieldExecution 0.87% - KeDelayExecutionThread 0.51% nv4_disp.dll 17.29% - <no debug symbols> d3d9.dll 15.78% - Direct3DShaderValidatorCreate9 6.12% - DebugSetLevel 4.54% - PSGPError 5.07% win32k.sys 12.13% - EngReleaseSemaphoreX - XLATEOBJ_iXLate - EngAquireSemaphore - EngGetCurrentThreadID mygame.exe 6.80% hal.dll 1.60%
visit my website at www.kalmiya.com
Advertisement
...bump
visit my website at www.kalmiya.com
Well, first thing is you don't make any mention how long you ran the profile for. The longer you run it, the less significant initialisation related calls (such as Direct3DShaderValidatorCreate9) will be called. Also, double check that you aren't calling Direct3DShaderValidatorCreate9 more often than you need to.

As far as kiReleaseSpinLock, I believe this can be found in any sort of synchronisation, mutexes, semaphores, etc. You can be assured this is called by any part of the kernel or drivers that need synchronisation between threads, so you want to be looking at reducing any calls made into the kernel.

Lastly, if you're spending an unreasonable amount of time in the system, you're probably making too many calls into it [smile] A big killer here is the number of draw calls and state changes you make, try reducing this by batching as many polygons together as you can and reducing the number of state changes.
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
Thanks for replying and your suggestions :)

I set it up to start profiling 15 seconds after the app started (like
you pointed out, I don't want to profile the initialisation itself).

The profiling time itself was 10 seconds, in release build

Each profile sessions was identical in that I start the app, and
then just stand there and don't do anything.


As far as I know, I'm not calling Direct3DShaderValidatorCreate9 (
at least directly) - and I'm not creating/destroying shaders on a
per-frame basis..
so I guess it's being called indirectly from inside some directX
call (but from where).

Same with the spinlock - I'm not directly calling it (as far as I know),
so where are the calls coming from... Maybe the gfx-driver, but the
entire renderer runs in it's own thread - independant of other threads
(like sound, network)
And why would the gfx-driver have to do multithreading-sync stuff (or
are the drivers doing threading internally?)...

>Lastly, if you're spending an unreasonable amount of time in the system,
>you're probably making too many calls into it
That's what I also assumed... but the question is which calls into
the system taking so much time...

>A big killer here is the number of draw calls and state changes you make,
>try reducing this by batching as many polygons together as you can and
>reducing the number of state changes.
Yeah, I know what you mean.

But somehow I think I'm doing something fundamentally wrong... In one
test where I draw scene of around 20.000 polys with normal mapping - no
shadows, I get around 150 fps - and it really feels to "too slow" for
what it's doing (my estimation here would be 500+ fps), comparing with
nvidia samples which are more or less comparable in complexity.
We do have some _really_ complex shaders though...

Probably my best bet is to "tear apart" the app at the top until I only
have a dumb message loop (which really should run at >1000 fps) and then
slowly add in the subsystems (sound, networking, terrain, 3d models, ...)
to get an idea where the cpu time is lost.
visit my website at www.kalmiya.com
Quote:Original post by Kitt3n
As far as I know, I'm not calling Direct3DShaderValidatorCreate9 (
at least directly) - and I'm not creating/destroying shaders on a
per-frame basis..
so I guess it's being called indirectly from inside some directX
call (but from where).

Just checked in the D3D docs and yep, it's an internal method. I'm far from being a D3D pro so I'd really only be making guesses on this one.

Quote:Same with the spinlock - I'm not directly calling it (as far as I know),
so where are the calls coming from... Maybe the gfx-driver, but the
entire renderer runs in it's own thread - independant of other threads
(like sound, network)
And why would the gfx-driver have to do multithreading-sync stuff (or
are the drivers doing threading internally?)...

No idea if the drivers start their own threads internally, but keep in mind that any process could perform an (indirect) call to the graphics driver (or any driver for that matter) at any time so the spin locks are required to prevent multiple threads from changing the drivers internal state at the same time. kiSpinLockRelease is a *very* low level call, so really it could be called from just about anywhere. My guess at the reason it's showing up in your profile is the driver is waiting for the graphics card to complete an operation, for example if something requires a pipeline flush or if the pipeline is full (the later of which sounds unlikely in your situation).

Quote:Yeah, I know what you mean.

But somehow I think I'm doing something fundamentally wrong... In one
test where I draw scene of around 20.000 polys with normal mapping - no
shadows, I get around 150 fps - and it really feels to "too slow" for
what it's doing (my estimation here would be 500+ fps), comparing with
nvidia samples which are more or less comparable in complexity.
We do have some _really_ complex shaders though...


Depending on your graphics card, that may not be unreasonable. You didn't specify whether most of the complexity was in the vertex or pixel shaders, but complex pixel shaders can *very* quickly bring all but the most powerful of gfx cards to their knees. The fact that the spin lock is making such a strong showing in your profile would suggest that this could well be the case. When you call Present() the CPU will block until the rendering of the frame is complete so that it can be swapped to the front buffer. If you're doing much CPU bound work (eg. number crunching) try place this after you finish rendering but before you call Present(). This gives the gfx card a chance to catch up before being forced to finish.

Quote:Probably my best bet is to "tear apart" the app at the top until I only
have a dumb message loop (which really should run at >1000 fps) and then
slowly add in the subsystems (sound, networking, terrain, 3d models, ...)
to get an idea where the cpu time is lost.


Only as an absolute last resort. I'd be surprised if your performance problems were anywhere but the rendering code though. Try putting in a simple vertex/pixel shader and see if that makes a difference, do the usual tests of changing resolution, poly count, etc. to see if that makes any differences that could suggest what the problem is (I remember seeing a powerpoint slide about this from either nVidia or ATI, googling may find it).
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
For the record,

GfxCard = Gf6800, 256Mb on AMD64 3400+

According to NVPerf we are drawing 70k polys in 500 calls

There was a Sleep (0) in my messageloop, which cause the app to always
take 100% cpu time - unless there's another process of equal priority
which wants to do sth.

Apparently our app is heavily using the gpu, up to a point where the
cpu is just being bored, especially in this part:

void DX::flipSync(void)
{
// Wait for the video card to finish rendering - using a dx9 eventQuery
pEventQuery->Issue(D3DISSUE_END);
while(pEventQuery->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
{
// avoid being bored while waiting for the GPU,
// so let another thread do something usefull
Sleep(0);
}
}


Apparently it ran numerous times through the loop, doing nothing and
triggering sleeps, eating 100% cpu time but running at 30fps

I've changed it to the following - to force it to sleep at least 1 ms.
Framerate is slightly lower (2fps), but it's using much less cpu-time
(40% instead of 100%)

void DX::flipSync(void)
{
U16 sleepMs = 1;
pEventQuery->Issue(D3DISSUE_END);
while(pEventQuery->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
{
Sleep(sleepMs);
sleepMs = 0;
}
if (sleepMs)
Sleep (1);
}

Since we've got lots of threads for various things, that should
give them some extra time between endScene and the Present-call.

Profiling gives:
HAL 30% of which 26% in halProcessorIdle
NVidia 27%
D3D 8%
ntoskrnl 7%
App 6%

Still enough things to optimize, but it looks better.
visit my website at www.kalmiya.com

This topic is closed to new replies.

Advertisement