How to get only 1 frame of latency with capped framerate and vsync?

Started by Dr_Asik May 09, 2017 03:03 AM

Dr_Asik (Author)

May 09, 2017 03:03 AM

So I was experimenting with PresentMon and different modes in Direct3D 11, using a basic game loop (SharpDX.Desktop.Windows.RenderLoop) and just rendering a rectangle at the cursor's location so I can get a good feel for latency just by moving the mouse. This is what PresentMon reports, and it definitely feels that way:


Fullscreen  SyncInterval  PresentFlags  SwapEffect  Latency (frames)  Tearing
true        0             None          Discard     0                 yes
true        1             None          Discard     2                 no
false       0             None          Discard     1                 no
false       1             None          Discard     3                 no

I have many questions, sorry:

The only way I can get less than 2 frames of latency is by uncapping the framerate. Is that something I could address with other settings? Maybe a waitable swap chain? Ideally I'd want windowed mode + capped framerate + 1 frame of latency.

Why does going from SyncInterval 0 to 1 add 2 frames of latency in fullscreen? I can understand 1, but why 2?

Changing MaximumFrameLatency doesn't affect these numbers at all. I suppose this is because my game loop does practically nothing, but I don't really understand how the mechanics work.

galop1n

May 09, 2017 04:59 AM

You can use GPUView to observe queuing in the driver.

SetMaximumFrameLatency only works with a waitable swap chain; if you had enabled the debug layer, it would have warned you.

Usually, fullscreen is better than windowed for flipping.

Sometimes rendering is not the problem; keep an eye on when you update inputs, for example.

SoldierOfLight

May 09, 2017 01:52 PM

With the current architecture of Windows, it is not possible to get less than 1 frame of latency in windowed mode, unless we're talking about a fullscreen borderless window or running on systems with dedicated scanout composition hardware. Additionally, these can only be achieved with the FLIP_SEQUENTIAL or FLIP_DISCARD swap effects in windowed mode.

When your app renders and presents, it enqueues a token in the OS behind the rendering work. When that token is ready for "execution," the compositor is notified about the frame and picks up the contents the next time it composes the desktop. This happens once per VSync, early in the VSync period, meaning your frame must be done more than 1 VSync before it reaches the screen.

When the compositor is able to scan out your application's buffer directly and you enter the Independent Flip optimization, your latency can drop by 1, because the token that is enqueued is transformed into a scanout token instead of a compositor notification token. This gives you performance characteristics that are very similar to fullscreen exclusive mode.

So now we know the minimum latency is >1 (typically closer to 2 because you typically snap input at the beginning of a VSync boundary, and DWM will compose it on the beginning of the next) for a normal window, with some exceptions where it can be less (typically closer to 1). Now how do you hit minimum latency?

This can be very tricky. In fullscreen exclusive, the driver will frequently batch up to 1 frame in usermode before submitting it to the OS scheduler, which manages fullscreen frame queueing limits (i.e. SetMaximumFrameLatency). In windowed mode, I'd recommend using the waitable swapchain, however I'm aware of an issue where a requested frame latency of 1 actually translates to 2, even when using the independent flip optimizations.

While working with someone else trying to achieve this, the only approach that I was able to give them was using IDXGISwapChain::GetFrameStatistics to understand when a frame was being displayed, and only then to render the next frame. Beware that using this approach to get minimum latency for windowed mode will cause your throughput to drop to half refresh (i.e. 30fps on a 60hz display) due to the 2 frame minimum latency if you fall out of the optimized path.

galop1n

May 09, 2017 03:15 PM

Hello Jesse,

I have a small question for you regarding the waitable swapchain. It used to be windowed-only, but at some point a Windows 10 update made it work in fullscreen too. Could you find out exactly which version made the change?

galop1n

May 09, 2017 03:36 PM

As for GPUView, this is a capture of my sandbox with a waitable swap chain (triple buffered) and maximum latency set to 1. You can see that the frame starts on the CPU right after a vblank and is fully processed before the next one, then just sits there waiting for the next vblank to be displayed:

[GPUView capture: waitable swap chain, triple buffered, maximum frame latency 1]

And this is with maximum latency set to 2; you can see the extra frame of delay:

[GPUView capture: waitable swap chain, maximum frame latency 2]

For completeness, here is a GPUView capture without the waitable object. You can see that many frames are now queued, and if you look closely at the highlight, it now takes many frames (4) before a queued frame reaches execution and presentation:

[GPUView capture: non-waitable swap chain]

SoldierOfLight

May 09, 2017 05:37 PM

For D3D12 specifically, the waitable swapchain can transition to "fullscreen" because D3D12 doesn't support an exclusive fullscreen mode, only a DXGI-managed borderless window. Since the swapchain is always windowed, there's no reason to block waitable swapchains from doing this transition. This relaxation was done in the Anniversary Update (1607).

Note in the screenshots you posted that the GPU work finishes very quickly, and when it completes, a light blue packet is generated in the flip queue. That packet turns dark blue and hashed on the next VSync. Light blue means queued in software, dark hashed means queued in the driver, and when it's gone, the frame is currently being scanned out. That's 2 frames of latency. A proper 1 frame of latency would never have two packets stacked in the flip queue. When you bump the waitable object to 2, you end up with 3 frames total: your GPU work takes an extra frame to complete because you only have 2 buffers. If you had 3 buffers, you'd see 3 packets stacked in the flip queue instead.

galop1n

May 09, 2017 06:05 PM

Yes, I was unclear about the amount of frame latency in my message; the actual flip happens when the dashed bar disappears. You also said I have only 2 buffers, but I create my swapchain with 3, and it is visible (A0, A1, A2, which I believe refer to each buffer).

But more importantly, what you said also raises a concern. We are very particular about latency for the game I work on at my company, and this extra frame of latency in D3D12 without a true exclusive fullscreen mode will be an issue. Is this something that is planned to be improved in the next 18 months?

SoldierOfLight

May 09, 2017 06:14 PM

Quote: "You also said I have only 2 buffers, but I create my swapchain with 3, and it is visible (A0, A1, A2, which I believe refer to each buffer)."

Yeah, you're totally right, my bad.

Quote: "But more importantly, what you said also raises a concern. We are very particular about latency for the game I work on at my company, and this extra frame of latency in D3D12 without a true exclusive fullscreen mode will be an issue. Is this something that is planned to be improved in the next 18 months?"

Yes, we're investigating with the compositor what can be done to improve this.

Dr_Asik (Author)

May 10, 2017 03:16 AM

Quote: "This can be very tricky. In fullscreen exclusive, the driver will frequently batch up to 1 frame in usermode before submitting it to the OS scheduler, which manages fullscreen frame queueing limits (i.e. SetMaximumFrameLatency). In windowed mode, I'd recommend using the waitable swapchain, however I'm aware of an issue where a requested frame latency of 1 actually translates to 2, even when using the independent flip optimizations."

Is that issue just for D3D12 though? I'm using D3D11.

SoldierOfLight

May 10, 2017 03:47 AM

The performance characteristics of all DXGI APIs that are applicable to both D3D11 and D3D12 are the same. The only difference between the two, aside from the ability to send a waitable swapchain "fullscreen," is which set of APIs is available.

This topic is closed to new replies.
