Hyunkel

Identifying bottlenecks (profiling)



I've made a few posts recently about trying to implement MSAA in a deferred renderer.
The performance drop seemed excessive, and there were other things that seemed to indicate a problem in my application.

Finally, after days of trying to get NVIDIA PerfHUD to run, I managed to find an old driver that seems to work (at least for now).

Here are the GPU load graphs with and without AA:

Without MSAA
With CSAAx16 (4 samples)

My main concern is the high driver and sleep times.
I should investigate and fix those before looking further into how to optimize my MSAA.

I should probably mention that I have never done any "real" profiling; I've mostly used PerfHUD to detect errors in my application.
So technically I'm doing this for the first time, and I need some guidance.

What exactly is driver time? Does it mean that the GPU is waiting for data from the CPU so it can continue?
I've read that it can be caused by issuing many draw calls (for example when rendering a mesh many times), which can be solved with hardware instancing.
However, I'm not really making that many draw calls.
I have one draw call for each of my light volumes, and there are fewer than 20 meshes on screen, which works out to 64 lights + ~20 meshes + ~2 fullscreen passes. That isn't really that many draw calls, right?

What should my next steps be?
How do I identify what's causing the high driver times?

Cheers,
Hyu

As GPU Idle is at 0, you are totally limited by the GPU and not the CPU. This causes high sleep times, and therefore high driver times, because the CPU sleeps inside the driver while waiting for the GPU to finish its work.

The next step would be to use the profiler function of PerfHUD. It will run some tests and tell you how many milliseconds your different draw calls need. Then you can check whether you can optimize the most time-consuming ones.

Thanks for your reply! :)

Even though I'm GPU bound, high driver and sleep times still mean that the GPU isn't doing anything most of the time, or am I misunderstanding this?

Here's the result of the performance test:
Performance test

Would you be so kind as to go over my thoughts below, to check whether I'm interpreting the information provided by PerfHUD correctly?


I know I use 64 dynamic point lights, so the operation with 64 draw calls must be my lighting pass (you can actually see the bounding volume on the RT).
I'm shading a total of ~9.8 million pixels, which, since I'm running at 1600x900, equals roughly ~6.8 fullscreen passes; not bad for 64 lights.
It's the most time-consuming operation, taking 11.43 ms.
48,640 primitives are a lot, though. I should use LOD on my bounding spheres, or replace them with boxes when they only occupy a small portion of the screen.
I don't see this improving performance much, though.

The next one, with 30 draw calls, has to be my geometry pass, where I render my G-Buffer.
1.95 Mpixels indicates a small amount of overdraw, which is okay (I'm not drawing front to back).
I'm not really sure whether 8.2 ms is reasonable for this; I'm doing normal mapping on all surfaces in this pass, with over 52k triangles, so it could be alright?
The ones with 3 and 5 calls are part of the geometry pass as well; I have no idea why they are split from the other 30 calls.

The one with 2 draw calls is my combination pass, where I combine the light accumulation buffer with the diffuse component of my G-Buffer.

The last calls are the debug information I'm printing in the top-left part of the screen.



So, I'm wondering: does all of this indicate that I've reached the limit of my GPU?
Or does this (and the high driver times) indicate a bottleneck?

Cheers,
Hyu

Driver sleep time means that the driver is sleeping while waiting for the GPU to finish. This means that your GPU is spending more time per frame than the CPU, so the CPU has to wait to let the GPU catch up. If it were the other way around, your GPU idle time wouldn't be 0.

The driver time is spent on the CPU. But as long as your GPU doesn't idle, you don't need to care about it.

Your analysis of the results is correct, and you are GPU limited.

Based on the values, the vertex shader consumes only a small amount of processing power for your lights. Therefore you should check whether you can simplify the per-pixel calculations or reduce the number of pixels shaded.

Oh, alright, makes sense :)

As for the vertex shaders, they only transform two coordinates, into view and screen space respectively, which is rather cheap.
I don't think I can reduce the pixel calculations much further; I'm already taking a lot of steps (bounding volumes, depth rejection) to ensure that I only shade pixels that are actually affected by a light.
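
For illustration, the light-volume vertex shader does little more than this (a simplified sketch with placeholder names, not my exact code):

```hlsl
cbuffer PerLight
{
    float4x4 WorldView;      // light volume -> view space
    float4x4 WorldViewProj;  // light volume -> clip space
};

struct VSOut
{
    float4 PosCS : SV_Position; // clip-space position for rasterization
    float3 PosVS : TEXCOORD0;   // view-space position, used later in the pixel shader
};

VSOut LightVolumeVS(float3 posOS : POSITION)
{
    VSOut o;
    o.PosVS = mul(float4(posOS, 1.0f), WorldView).xyz;
    o.PosCS = mul(float4(posOS, 1.0f), WorldViewProj);
    return o;
}
```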

I suppose my next step is to investigate why using MSAA is so much slower, and to test a few methods (such as marking edges with the stencil buffer) to improve it.

Unfortunately, PerfHUD died on me during my last performance test (driver reset), and now it only displays garbage.
Hopefully a reboot will fix this.

One trick to speed up deferred lighting is to make use of the stencil buffer to reject lights that are floating in mid-air. The process goes something like this (a rough sketch of the state setup follows the notes below):

1. Render the back faces of the light sphere with a reversed depth test into the stencil buffer (with colour writes disabled). That is, you want the stencil buffer to be 1 where the back faces of the light are buried inside the geometry. If the back faces don't hit geometry, then the light is floating in the air.

2. Swap over the depth test, enable the stencil test, and render the front faces of the light. The depth test will reject pixels where the front faces are buried behind geometry.

This all assumes the camera isn't inside the light. If it is, you never need to render the front faces of the light, which means you don't need the stencil buffer, just the z test.

To render backfaces instead of front faces, swap the cull mode.

Note that you either need to clear the stencil buffer between lights, or use a different stencil value for each light.

For very small lights (in screen space) all that messing about with render states and the stencil buffer may not be worthwhile.
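
As a rough sketch, the two state setups might look like this in D3D10-style effect (.fx) state blocks; all names here are made up for illustration, and the same thing can of course be built with plain depth-stencil and rasterizer state objects from application code:

```hlsl
// Pass 1: back faces, reversed depth test, mark the stencil.
// (Colour writes are disabled via the blend state's write mask, omitted here.)
RasterizerState CullFrontFaces { CullMode = FRONT; };   // leaves only the back faces
DepthStencilState MarkLitPixels
{
    DepthEnable    = TRUE;
    DepthWriteMask = ZERO;        // never write depth from the light volumes
    DepthFunc      = GREATER;     // reversed test: passes where the back face is buried in geometry
    StencilEnable  = TRUE;
    FrontFaceStencilFunc = ALWAYS;
    FrontFaceStencilPass = REPLACE;   // write the reference value where the depth test passes
    BackFaceStencilFunc  = ALWAYS;
    BackFaceStencilPass  = REPLACE;
};

// Pass 2: front faces, normal depth test, only where pass 1 marked the stencil.
RasterizerState CullBackFaces { CullMode = BACK; };
DepthStencilState TestLitPixels
{
    DepthEnable    = TRUE;
    DepthWriteMask = ZERO;
    DepthFunc      = LESS_EQUAL;      // normal test: rejects pixels buried behind geometry
    StencilEnable  = TRUE;
    FrontFaceStencilFunc = EQUAL;     // shade only where the stencil holds the reference value
    BackFaceStencilFunc  = EQUAL;
};

// The stencil reference value (e.g. 1, or a per-light value so you don't have to
// clear between lights) is supplied when the state is bound, e.g.
// SetDepthStencilState(MarkLitPixels, 1) inside the technique pass.
```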

Yes, I was playing around with this technique some time ago, but quickly realized that in my application it is quite rare for light volumes to be occluded by other geometry.
In fact, the few situations where this happens can usually be identified on the CPU (for the bigger volumes at least), and I can avoid rendering the light altogether.

This is why I decided to only do the backface test to reject pixels behind the bounding volumes.

Another useful technique is to move operations from the pixel shader back to the vertex shader. Anything that can be interpolated linearly without error is a good candidate, and moving it can potentially give you significant speed increases.
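
A generic illustration, not tied to this particular renderer: the unnormalized surface-to-light vector is an affine function of the vertex position, so it interpolates correctly across the triangle and can be computed once per vertex; only the normalization has to stay in the pixel shader.

```hlsl
// Hypothetical example: moving an affine computation from the pixel shader to the vertex shader.
cbuffer PerObject
{
    float4x4 World;
    float4x4 WorldViewProj;
    float3   LightPosWS;     // world-space light position
};

struct VSOut
{
    float4 PosCS   : SV_Position;
    float3 ToLight : TEXCOORD0;  // computed once per vertex, interpolated by the rasterizer
};

VSOut VS(float3 posOS : POSITION)
{
    VSOut o;
    float3 posWS = mul(float4(posOS, 1.0f), World).xyz;
    o.PosCS   = mul(float4(posOS, 1.0f), WorldViewProj);
    o.ToLight = LightPosWS - posWS;   // affine in position, so safe to interpolate
    return o;
}

float4 PS(VSOut i) : SV_Target
{
    // Only the non-linear part (the normalization) remains per pixel.
    float3 L = normalize(i.ToLight);
    // ... use L in the lighting calculation ...
    return float4(L * 0.5f + 0.5f, 1.0f);
}
```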

I've investigated further why I suffer such an immense performance drop when using MSAA in my deferred rendering application.

Without MSAA
With 4xMSAA Textures

Note that I only changed the texture types of my G-Buffer to use 4xMSAA.
At this point I do not use multiple samples! I only take a single sample from each texture, so the image is still the same as without MSAA.
It turns out that merely changing the texture types causes this immense performance drop. Taking multiple samples (and running the lighting calculations 4 times for each pixel) is far less expensive than I expected; it only adds ~2 ms to the lighting pass.

I tried to use edge detection to reduce the lighting cost by taking multiple samples only for edge pixels, but obviously this cannot reduce my frame time by more than those 2 ms (and that is before counting the fullscreen pass that feeds the stencil buffer and running the lighting pass twice).


Why does it take so much longer when using textures with 4 samples?
I understand that the geometry pass will take longer: since it's writing multiple samples, it writes more data to the G-Buffer, which makes it slower.
This is reflected in the geometry pass shading twice as many pixels as it did without MSAA.
So I'm guessing this is to be expected?

As for the lighting pass, I'm a bit at a loss.
I perform exactly the same calculations as I do without MSAA.
The only difference is that instead of using .Sample() I'm using .Load(), and only on sample 0.
Why is this so much slower? It takes 8.1 ms longer.
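
For reference, the change in the lighting shader amounts to roughly this (illustrative names, not my actual G-Buffer layout):

```hlsl
// Non-MSAA G-Buffer: a regular Texture2D that can be sampled.
Texture2D<float4> NormalBuffer;
SamplerState      PointSampler;

float4 FetchNormal(float2 uv)
{
    return NormalBuffer.Sample(PointSampler, uv);
}

// 4xMSAA G-Buffer: the resource becomes a Texture2DMS, which cannot be sampled,
// only loaded with an explicit pixel coordinate and sample index.
Texture2DMS<float4, 4> NormalBufferMS;

float4 FetchNormalMS(int2 pixel)
{
    return NormalBufferMS.Load(pixel, 0);   // sample 0 only, exactly as described above
}
```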

Any ideas?
Cheers,
Hyu

