Help making sense of PresentMon, aiming for full screen minimal latency DX11


Started by troy2000 September 11, 2018 06:08 AM

5 comments, last by troy2000 5 years, 8 months ago

troy2000

122

Author

September 11, 2018 06:08 AM

This is pretty much a question for @SoldierOfLight, probably..

I've read a ton of information about the different flip modes and the various ways of configuring the swap chain. Would really like to get down to near 0ms latency at 60fps.

My GPU is somewhat old - NVidia GTX 430 - but my software is up to date. Latest NVidia drivers, latest Windows 10 (April 2018 version 1803)

PresentMon indicates dwm.exe is "Hardware: Legacy Flip" (not sure if this is important but thought I'd include it since 'Legacy' sounds bad)

If I run windowed, PresentMon indicates "Composed: Flip" with a latency around 48ms

If I run fullscreen with SetFullscreenState(true), PresentMon indicates "Hardware Composed: Independent Flip: Plane 0" with a latency around 46ms

If I run fullscreen as just a borderless window covering the whole screen, PresentMon indicates "Hardware Composed: Independent Flip: Plane 0" and around 32ms latency

In windowed mode, DXGI_SWAP_CHAIN_DESC1 setup is:

swapChainDesc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;
swapChainDesc.Flags = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT;

SetMaximumFrameLatency is 1 frame

(DISCARD seems to have the same latency as SEQUENTIAL)

In fullscreen mode with SetFullscreenState, I find I have to remove the WAITABLE_OBJECT flag - if I don't, DX gives an error when SetFullscreenState is called. Running in DX debug mode, it logs a message saying that the WAITABLE_OBJECT flag can't be combined with fullscreen (although I've seen other posts claiming that this restriction was lifted at some point?? not on my machine hehe)

When I call Present, I'm just calling swapChain->Present(1,0)

Questions:

1) Why can't I combine WAITABLE_OBJECT with SetFullscreenState?

2) Do I need to use SetFullscreenState anyway? Currently the lowest latency is just borderless window covering the screen, with 32ms latency. But why is it not 16ms?

3) Why is SetFullscreenState slower, at 46ms latency? It's worth mentioning that in this case I am also creating a borderless window that covers the screen, and then calling SetFullscreenState on that window.. maybe that's confusing the system (?)

4) Is "Hardware Composed: Independent Flip: Plane 0" the best I can hope for or is there some other flip mode that is optimal? If so, what changes do I need to make to the code to get there?

-------

More information after further testing:

With the borderless fullscreen window (not using SetFullscreenState), loop looks like this:

1) WaitForSingleObject(WAITABLE_OBJECT)

2) Spin loop for 15ms (almost the entire duration of the frame) <-- added after writing the original post

3) read controller/user inputs

4) Draw the next frame of the game

5) Present

With the above, PresentMon indicates around 17ms latency with "Hardware Composed: Independent Flip: Plane 0"

Is this as good as I can do or can I somehow get the latency reported by PresentMon even lower?

I am measuring controller-to-display latency with a 240hz camera and a gamepad with an LED wired into the start button. I am seeing as low as 5 240hz frames (just over 16ms latency) between the LED lighting up on the controller and visible results appearing on screen. But, sometimes I see up to 14 240hz frames. The average is probably around 8/9 frames. Have I minimized the latency from the perspective of the application? For some reason I feel like I should be able to achieve very close to 0ms latency. Conceptually if I wait until the very end of a vertical refresh cycle.. then sample the user input, draw the game, call Present() *right* before the gpu is ready to display the next frame.. then it would get my back buffer and swap it to front only 0-2 ms after I call Present. How do I get to a solution like this?

If you're curious, I'm using the Dell U2414H monitor, which is reviewed to have 4ms latency, and other tests I've done with dedicated hardware more or less confirm this (http://www.tftcentral.co.uk/reviews/dell_u2414h.htm#lag)

Thanks!

jbadams

26,386

September 11, 2018 06:20 AM

Moderation note: Removed some unnecessary wording from title and added tags for clarity.

- Jason Astle-Adams

SoldierOfLight

2,378

September 11, 2018 03:37 PM

So direct response to your questions:

1) Depending on your driver version, SetFullscreenState may do one of two things: it may simply maximize your window and try to get independent flip, or it may actually take exclusive ownership of the display. The semaphore used to implement the waitable object is only present in the former case, not the latter. Note that DX12 always implements fullscreen as the former, and therefore the restriction of this API is not present there.

2) No, you don't. The performance characteristics of an app which is in independent flip vs exclusive fullscreen should be identical. And as you noticed, on newer drivers, SetFullscreenState may not even do anything extra compared to a borderless window with FLIP_DISCARD swap effect.

3) Not sure, I'd need to see a trace of that scenario to tell you why it's higher latency.

4) Independent flip (or legacy fullscreen flip, i.e. exclusive fullscreen) is the best you can hope for.

Regarding your later tests, that should be able to get you down below 16ms. Have you measured the amount of GPU work that you're submitting? If you sleep for 15ms, but take longer than 1ms from the point of starting your rendering on the CPU to completing it on the GPU, you'll miss your VBlank and you'll end up with your frame queued for nearly another whole VBlank. You'd also end up at ~30fps, because your wait for the waitable object would block until the flip actually happened.

Alternatively, are you waiting for the waitable object before the first frame? If you're not, you'll always have an extra frame of latency inherent in the system.
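
The missed-VBlank case described above can be illustrated with a small timing model. This is only a sketch under simplifying assumptions: the waitable object is taken to signal exactly at a VBlank, the flip is taken to happen at the first VBlank at or after the GPU finishes, and the function name `inputToFlipMs` is hypothetical. It is not how PresentMon computes its latency column.

```cpp
#include <cmath>

// Simplified model (an assumption, not PresentMon's definition of latency):
// the waitable object signals at a VBlank (t = 0), the app spins for spinMs,
// samples input, and the frame finishes on the GPU renderMs later.
// The flip is assumed to happen at the first VBlank at or after completion.
// Returns the time from input sampling to the flip, in milliseconds.
inline double inputToFlipMs(double spinMs, double renderMs, double refreshHz) {
    const double periodMs = 1000.0 / refreshHz;          // ~16.67 ms at 60 Hz
    const double doneMs = spinMs + renderMs;             // CPU + GPU work complete
    const double flipMs = std::ceil(doneMs / periodMs) * periodMs;
    return flipMs - spinMs;
}
```

Under these assumptions, `inputToFlipMs(15, 1, 60)` is about 1.7 ms, while `inputToFlipMs(15, 2, 60)` is about 18.3 ms: one extra millisecond of work pushes the frame to the following VBlank.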

troy2000

122

Author

September 11, 2018 06:59 PM

Thanks for the quick reply!

I was maintaining 60fps so I wasn't doing too much CPU/GPU work.

However I was not waiting for the waitable object before the first frame - that seems counter intuitive but speaks to my lack of a deeper understanding about what the waitable object represents. I switched the code around so it does wait before the first frame and as you predicted the latency numbers reported in PresentMon are lower, around 3ms. I had to back off on the duration of my spin-wait to 13ms (the purpose of this is to wait as long as possible before sampling the controller input, minimizing the time between controller input and its representation on screen).

With the 240fps camera I now measure as low as 3 240hz frames between the light appearing on the LED and the change appearing on-screen, which is ~12 ms from controller to screen. But in a trial of 8 controller presses, I still measured as high as 12 240hz frames (~48ms), and probably averaged around 6 (~24ms).. this variability is disappointing, but I suspect this may be coming from the controller/controller driver rather than the rendering/display code.
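
For reference, the camera-frame counts above convert to milliseconds as follows. This is a tiny sketch; the helper name `cameraFramesToMs` is made up for illustration.

```cpp
// Convert a count of high-speed camera frames to elapsed milliseconds.
// cameraHz defaults to the 240 Hz camera used in these measurements.
constexpr double cameraFramesToMs(int frames, double cameraHz = 240.0) {
    return frames * 1000.0 / cameraHz;
}
```

At 240 Hz, 3 frames is 12.5 ms, 6 frames is 25 ms and 12 frames is 50 ms. Each camera frame spans about 4.2 ms, so any single measurement is only accurate to roughly that resolution.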

So the formula for this is:

1) WaitForSingleObject (taking around 2ms for the call to complete)

2) Software Spin Wait (set to around 13ms)

3) Sample Controller, Render Frame (not measuring how long this takes but seems like it's 1ms or less)

4) Present (taking 0ms)

So, more questions:

1) Just to confirm, the string displayed in PresentMon "Hardware Composed: Independent Flip: Plane 0" means that I am running in the best mode possible, correct?

2) Are there any other strings I should look for that would indicate a similar optimal path or is this the only one?

3) Is there any significance to the fact that WaitForSingleObject is taking 2ms? What is happening during this 2ms?

4) I'm using a dual monitor setup and seeing a lot of situations where the fullscreen 60fps realtime rendered monitor drops out of the optimal "Hardware Composed: Independent Flip: Plane 0" mode, and falls back to Composed: Flip (according to PresentMon). I expect this kind of thing to happen if I drag some other window over the fullscreen window or if some other popup appears over it, but I am finding it falls back to Composed: Flip in some scenarios where it seems like it shouldn't.

- My application is a multi-window application so I can have many windows separate from the realtime rendering window (I can have several rendering windows too but I'm just testing 1 for now). These are normal GDI windows, not being rendered in realtime. If I click over to one of these windows, PresentMon reports Composed: Flip. Is it maybe reporting the window that I activated, rather than the rendering window?

- If I switch to a different application running fullscreen on the alternate monitor (Chrome, Notepad, doesn't seem to matter so long as it's fullscreen), PresentMon reports Composed: Flip. If I switch to PresentMon itself from my fullscreen window, or switch to any other non-fullscreen window that is *not* one of my application windows, it maintains Independent Flip.

- If I click back on the 60fps fullscreen realtime rendered window it goes back to Independent Flip. So, it's not a huge issue, but I'm curious if you have any insight into why these actions would lose the optimal display path.

Thanks again!

SoldierOfLight

2,378

September 11, 2018 09:02 PM

Regarding the initial wait, the documentation for the waitable object does clearly call it out:

Step 4: Wait before rendering each frame

Your rendering loop should wait for the swap chain to signal via the waitable object before it begins rendering every frame. This includes the first frame rendered with the swap chain.

As for your questions:

1) Correct, "Hardware [Composed]: Independent Flip" means that the application buffer is directly scanned out.

2) The only alternative is the legacy fullscreen flip. Both of these indicate no copies, with your buffer going straight to screen.

3) This is probably the amount of time after you called the Present API, while your GPU work is getting processed by the driver and hardware, and then waiting for the VBlank to actually flip it, and then the graphics scheduler acknowledging that the flip occurred.

4) PresentMon definitely monitors swapchains, not windows. So it's not that it suddenly switched to monitoring a standard GDI window. Unfortunately, independent flip state is something that's managed by the desktop compositor, so I don't really have any additional insight into why you'd exit it.

troy2000

122

Author

September 11, 2018 09:30 PM

Thank you, now where do I send my check?
