Performance issues with IDXGISwapChain::Present
I implemented a deferred renderer a while ago and have now started to profile it a little.
Right now I am kind of disappointed, for the following reason:
My test scene looks like this: some geometry with 100 point lights moving in circles.
The values for rendering the buffers and lights and for updating are OK by me.
It throws me off that the Present call consumes that much time. I'm calling it this way: Present(0, 0), so it should be rendered as fast as possible.
Here are the profiling results (all times in ms, except the FPS row):
[table][tr][th][/th]
[th]close[/th]
[th]far[/th]
[th][/th][/tr]
[tr]
[td]FPS[/td]
[td]14.5[/td]
[td]27[/td]
[td][/td]
[/tr]
[tr]
[td]Update/Render[/td]
[td]8.1[/td]
[td]4.7[/td]
[td](Time updating the scene and rendering combined just for doublechecking)[/td]
[/tr]
[tr]
[td]Present[/td]
[td]57.5[/td]
[td]30[/td]
[td](The SwapChain::Present call)[/td]
[/tr]
[tr]
[td]UpdateScene [/td]
[td]3.7[/td]
[td]2.2[/td]
[td](Updating scene only)[/td]
[/tr]
[tr]
[td]UpdateLights [/td]
[td]0.5[/td]
[td]0.3[/td]
[td](Updating light buffers)[/td]
[/tr]
[tr]
[td]G-Buffer[/td]
[td]2[/td]
[td]1.2[/td]
[td](Rendering G-Buffer)[/td]
[/tr]
[tr]
[td]SSAO [/td]
[td]1.5[/td]
[td]1.0[/td]
[td](Rendering SSAO + Blur)[/td]
[/tr]
[tr]
[td]RenderLights [/td]
[td]1[/td]
[td]0.5[/td]
[td](Rendering Lights)[/td]
[/tr]
[/table]
[quote]It throws me off that the Present call consumes that much time. I'm calling it this way: Present(0, 0), so it should be rendered as fast as possible.[/quote]
Present will almost always take up the most time, since the application has to wait for the card to finish rendering entirely before it can present, even without VSync. You need to use a specialized GPU debugging tool (PIX, Nsight, or the one integrated in VS2012) in order to see where the work is really happening.
Besides, 100 spotlights is (AFAIK) a huge value, even for a deferred renderer. How many passes does that take? Do you draw multiple lights in one pass, or does each light take its own rendering pass?
When the GPU is taking longer than the CPU, Present will start to block once the CPU is a few frames ahead (the default number is 3, and it can be adjusted with IDXGIDevice1::SetMaximumFrameLatency). The driver has to do this, otherwise the CPU could get infinitely far ahead of the GPU and the driver would have to buffer a gigantic number of commands for the GPU to consume. It's not like you'd actually want your game simulation to be that far ahead anyway.
Like Juliean mentioned, you need to profile the GPU here and not the CPU to know where your performance issues are.
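To make the blocking concrete, here is a minimal sketch in plain C++ (no D3D calls) of the behaviour described above. It is a toy model, not what the driver actually does: `simulatePresentWait` and its parameters are made-up names, the latency of 3 is the default mentioned above, and the 65.6 ms GPU frame time used in the usage note is an assumption obtained by adding Update/Render and Present from the "close" column of the table.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model, not the real driver. The CPU needs cpuMs to record one frame,
// the GPU needs gpuMs to execute it, and Present blocks so that at most
// maxLatency frames are queued. Returns the time each Present call blocked.
std::vector<double> simulatePresentWait(double cpuMs, double gpuMs,
                                        int maxLatency, int frames) {
    assert(maxLatency >= 1 && frames >= 0);
    std::vector<double> gpuDone(frames, 0.0);  // time the GPU finishes frame i
    std::vector<double> wait(frames, 0.0);     // time Present blocked in frame i
    double now = 0.0;                          // CPU clock in ms
    for (int i = 0; i < frames; ++i) {
        now += cpuMs;  // CPU records the commands for frame i
        // The GPU starts frame i once it is submitted and frame i-1 is done.
        const double gpuStart = (i == 0) ? now : std::max(now, gpuDone[i - 1]);
        gpuDone[i] = gpuStart + gpuMs;
        // Present: wait until the frame submitted maxLatency frames ago is done.
        if (i >= maxLatency && gpuDone[i - maxLatency] > now) {
            wait[i] = gpuDone[i - maxLatency] - now;
            now = gpuDone[i - maxLatency];
        }
    }
    return wait;
}
```

Under those assumptions, `simulatePresentWait(8.1, 65.6, 3, 10)` gives zero wait for the first three frames and then settles at roughly 65.6 - 8.1 = 57.5 ms per frame, which matches the Present time measured in the "close" column. If the CPU were the slower side (say 10 ms CPU, 5 ms GPU), the modelled wait would stay at zero.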
[quote]Besides, 100 spotlights is (AFAIK) a huge value, even for a deferred renderer. How many passes does that take? Do you draw multiple lights in one pass, or does each light take its own rendering pass?[/quote]
I render them all in one pass.
I tried PIX and nvision for profiling but didn't get them working (some kind of runtime check error #0, where the stack pointer supposedly gets corrupted when I make an ID3D11Texture2D::GetDesc() call o.O I don't know what's going on there; it only happens when I try to launch through the profiling tools...)
But thanks anyway.
[quote]When the GPU is taking longer than the CPU, Present will start to block once the CPU is a few frames ahead[/quote]
Is there an easy way to find out if this is really the case?
I appreciate your help!
edit:
I'm kind of confused right now, so just for clarification:
When I make a Draw() call on an immediate context, it is executed immediately...
//...
context->Draw(...);
// Marker
So when execution reaches the "Marker", the drawing should be finished and any results should be ready in the buffers.
If that is so, the actual render time of my scene is about 10 ms.
The question now is what is left to do in the Present() call that takes so long. "Presents a rendered image to the user.", which is what can be found all over the documentation, is kind of vague to me.
[quote]
When I make a Draw() call on an immediate context, it is executed immediately...
//...
context->Draw(...);
// Marker
So when execution reaches the "Marker", the drawing should be finished and any results should be ready in the buffers.
[/quote]
No, the other two people explained how it works. I believe "immediate context" means it starts the commands immediately, as opposed to adding them to a list of commands for you to execute later.
The huge difference between your near and far times should be all the proof you need to know it's waiting on drawing. Why else would distance affect the time?
[quote]I believe "immediate context" means it starts the commands immediately, as opposed to adding them to a list of commands for you to execute later.[/quote]
Hmm... you are right. But I can't shake the feeling that I read somewhere that the GPU blocks CPU execution until it has finished. Could that be Dispatch()? That would make sense to me, because the CPU will most likely need the results.
Reducing the number of lights also helps performance. Looks like I have to implement clustering :)
The only functions that can block the CPU are Map, Present, and Flush. All other functions cause commands to be generated in a command buffer that is consumed later by the GPU, which means they are asynchronous from the point of view of the CPU. Map will block when you're trying to read data on the CPU that was generated by the GPU, so that the GPU can catch up and the data will actually be in the buffer that you're mapping. Present will block when the CPU gets too far ahead of the GPU, as I explained already. The reason this happens in Present is because it's generally considered to be the "finishing point" of a frame, since you're taking everything the GPU rendered and showing it on screen. Flush will stall and wait for all pending commands to be executed by the GPU.
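The asynchronous behaviour described above can be sketched with a toy command queue in plain C++. `ToyContext`, `draw`, `mapForRead` and `drawFinishedAtMarker` are hypothetical names for illustration, not D3D11 types or methods, and a real GPU runs in parallel with the CPU instead of waiting for a drain; the sketch only shows that recording a command and executing it are separate steps.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Toy stand-in for a device context plus its command buffer (hypothetical).
class ToyContext {
public:
    // Like Draw(): records the work and returns immediately; nothing runs yet.
    void draw(int value) {
        pending_.push([this, value] { target_.push_back(value); });
    }

    // Like Map() for a CPU read: the data must exist before it can be read,
    // so every pending command is executed before this returns.
    const std::vector<int>& mapForRead() {
        while (!pending_.empty()) {
            pending_.front()();  // execute the oldest recorded command
            pending_.pop();
        }
        assert(pending_.empty());
        return target_;
    }

    std::size_t pendingCommands() const { return pending_.size(); }
    std::size_t resultsReady() const { return target_.size(); }

private:
    std::queue<std::function<void()>> pending_;  // recorded, not yet executed
    std::vector<int> target_;                    // stands in for a render target
};

// Mirrors the "Marker" snippet from the thread.
bool drawFinishedAtMarker() {
    ToyContext context;
    context.draw(42);
    // Marker: the call has returned, but the work is still only queued.
    return context.resultsReady() == 1;
}
```

In this model `drawFinishedAtMarker()` returns false: when the draw call returns, the work is only queued. Calling `mapForRead()` is the point where the model has to catch up, which is the analogue of Map stalling on GPU-generated data.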
If you're looking to do GPU performance analysis, I would suggest using Nsight for Nvidia hardware and GPU PerfStudio for AMD hardware. Those programs can give you low-level timing and counter information for a particular Draw or Dispatch call.