Windows 10: DX12 low latency tearing free rendering

8 comments, last by LowLatencyGuy 7 years, 1 month ago
Hi everyone,
we are investigating how to port our application from Windows 7 to Windows 10.
The requirements are: 60 fps, low latency and no tearing on Windows 10 using DX12.
Currently we use a DX9ex full screen application on Windows 7 with DWM disabled.
We have 2 GPUs which have 2 heads each but for this discussion I want to stick to a single GPU using one head.
Configuration:
- Windows 10 version 10.0.14393
- NVIDIA driver 376.63.
- NVIDIA K600 graphics card
Our current measurements (using a light sensor and a scope) show that we have
one vsync of additional latency on Windows 10 / DX12 compared to our DX9 full screen solution.
We measure around 50 ms (Windows 7: 32 ms), so one additional frame.
The main question is of course what settings should we use to get the best result (full screen or windows, etc.)
Our initial feeling, based on the info we have, was: a full screen waitable swap chain where we render each time
when the waitable object is signaled (so a buffer is free).
Our measurements do not show the expected results. Why?
We have been watching the video from Jesse Natalie about flipping modes but still have some questions:
Q1: In the video Jesse talks about windowed mode and full screen.
At @13:20 into video he states: the best option for low latency is a full screen swap chain OR a waitable swap chain.
Does this imply that a waitable swap chain cannot be a full screen swap chain?
Or, in other words, that one cannot use a waitable object on a full screen swap chain to check if a buffer is free
(so that the next present call does not block)?
Q2: The video suggests (to my understanding) that there are two queues in a swap chain.
a) the present queue (present blocks when this queue is full).
The size of this queue is determined by the buffer count of the swap chain.
How does this relate to the SetMaximumFrameLatency setting?
b) the number of frames completed on the GPU that need to be displayed.
How can one control this?
I am not sure how this works. Can anybody explain this in more detail?
Q3: Some of the flip modes are not exposed by the API and the system itself
switches between flip, d-flip and i-flip (@26:00).
It is important for us to be in control. We do not want the latency to change
by some mechanism in the OS. Is it therefore better to use the full screen APIs?
Q4: The video discusses all kinds of flip modes. @35:22 into the video it is mentioned that the system
switches to Windowed immediate iflip when using a DX12 swap chain in full screen.
So is there still a difference, latency-wise, between a borderless window covering the whole screen
and a swap chain set to full screen?
Any feedback is welcome
Regards,
TF
An exclusive fullscreen mode is your best chance no matter what. It is even more important in multi-gpu configuration.

The waitable swapchain used to be windowed only, but it works in fullscreen too as of build 14943 (I do not have another build to track when support started). Be sure to call the SetMaximumFrameLatency function too.

As for measuring latency, gpuview can help too.

I'd expect to be able to get <16ms of latency using the waitable object, with a maximum frame latency of 1. If you ensure that your window covers the screen (or use SetFullscreenState) and call ResizeBuffers, you should engage independent flip, and your frames should make it to the screen on the next VSync. In practice, it looks like we may have an off-by-one here, as I'm only able to get ~32ms, but it seems like that should be sufficient for you guys.

Are you sure you're waiting before every frame (including the first one)? If not, you could end up with an extra frame of latency getting added into the waitable object.

Like Galop1n said, GPUView and PresentMon tools are helpful for determining why the latency is there.

hi,

thanks for the quick response. About the SetFrameLatency API, MSDN states:

"Sets the number of frames that the system is allowed to queue for rendering. ....

.......

The maximum number of back buffer frames that a driver can queue."

If I have two buffers (a front and back buffer in a full screen swap chain), buffer count on the swap chain set to 2.

How does this SetFrameLatency queue relate to these buffers?

I am trying to understand where in the present chain we have queues, as queues add latency :-)

regards,

TF

Hi Jesse,

I have some output from "presentmon"; however, my own measurements are done using a light sensor (taped to the screen).

The last column (MsUntilDisplayed) states 32 ms. Is that what to expect, or should it be on the order of 16 ms?

I have not (yet) worked with gpuview. Is presentmon up to the job or should I invest in gpuview?

Application,ProcessID,SwapChainAddress,Runtime,SyncInterval,AllowsTearing,PresentFlags,PresentMode,Dropped,TimeInSeconds,MsBetweenPresents,MsBetweenDisplayChange,MsInPresentAPI,MsUntilRenderComplete,MsUntilDisplayed

Dx12LatencyTest1.exe,312,0x0000000006FB2D40,DXGI,1,0,0,Hardware: Independent Flip,0,0.025946,16.646,16.666,0.143,16.589,32.635
Dx12LatencyTest1.exe,312,0x0000000006FB2D40,DXGI,1,0,0,Hardware: Independent Flip,0,0.042615,16.670,16.666,0.140,16.566,32.631
Dx12LatencyTest1.exe,312,0x0000000006FB2D40,DXGI,1,0,0,Hardware: Independent Flip,0,0.059262,16.647,16.666,0.139,16.614,32.651

So seems like your monitor has a ~18ms latency built in, if you're measuring 50 but PresentMon is saying 32. Unfortunately it looks like the waitable object may not be working properly - are you able to use the same present stats technique that you used in D3D9Ex? That should give you similar results.
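
As a rough cross-check of that estimate, the three PresentMon rows quoted above can be averaged and subtracted from the light-sensor figure. This is only a sketch: the column names come from the PresentMon header in the post, the 50 ms value is the approximate scope reading mentioned earlier, and the helper name `mean_column` is made up for illustration. The remainder also contains any sensor or measurement error, so it is an estimate rather than a measured display latency.

```python
import csv
import io

# Subset of the PresentMon output quoted in the thread (header trimmed to the
# columns used here; values copied from the three rows in the post).
PRESENTMON_CSV = """PresentMode,MsBetweenDisplayChange,MsUntilRenderComplete,MsUntilDisplayed
Hardware: Independent Flip,16.666,16.589,32.635
Hardware: Independent Flip,16.666,16.566,32.631
Hardware: Independent Flip,16.666,16.614,32.651
"""

# Approximate end-to-end latency seen with the light sensor (assumption: ~50 ms).
SENSOR_LATENCY_MS = 50.0


def mean_column(csv_text, column):
    """Average one numeric column of a CSV given as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sum(float(row[column]) for row in rows) / len(rows)


until_displayed = mean_column(PRESENTMON_CSV, "MsUntilDisplayed")
refresh_interval = mean_column(PRESENTMON_CSV, "MsBetweenDisplayChange")

# Present-to-display latency expressed in refresh intervals.
frames_in_flight = until_displayed / refresh_interval

# Whatever the sensor sees beyond PresentMon's figure is attributed to the display.
display_estimate = SENSOR_LATENCY_MS - until_displayed

print(f"mean MsUntilDisplayed: {until_displayed:.3f} ms")
print(f"refresh intervals:     {frames_in_flight:.2f}")
print(f"display estimate:      {display_estimate:.1f} ms")
```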

I tried to use GetFrameStatistics; however, the struct returned contains zeroes.

So the trick with the present stats we use on DX9 does not work on DX12.

I tried DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL and also DXGI_SWAP_EFFECT_FLIP_DISCARD.

I am using the "CreateSwapChain" API; maybe I should use "CreateSwapChainForHwnd"?

I created a version of the program which has v-sync disabled. It gives tearing, but I would like to see what the latency will be when measuring with the light sensor. I will pick this up tomorrow when I am back at the office.

PS: what do you mean by "Unfortunately it looks like the waitable object may not be working properly"?

A bug in my program or in the OS/Driver??

"In practice, it looks like we may have an off-by-one here, as I'm only able to get ~32ms, but it seems like that should be sufficient for you guys"

The 32 ms I was referring to on Windows 7 (full screen / DWM disabled) includes the latency added by the display. So it seems we get one frame additional latency on Windows 10 compared to Windows 7.
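
To make the comparison concrete, the two light-sensor readings can be converted into refresh intervals after removing the display's share. This is a back-of-the-envelope sketch, assuming a 60 Hz refresh and the rough ~18 ms display latency estimated earlier in the thread; the function name is hypothetical.

```python
REFRESH_MS = 1000.0 / 60.0  # one refresh interval at 60 Hz (assumption)
DISPLAY_MS = 18.0           # rough display latency estimated earlier (assumption)

# End-to-end latencies measured with the light sensor, as reported in the thread.
measured_ms = {
    "Windows 7 / DX9Ex": 32.0,
    "Windows 10 / DX12": 50.0,
}


def pipeline_frames(total_ms, display_ms=DISPLAY_MS, refresh_ms=REFRESH_MS):
    """Latency left after removing the display's share, in refresh intervals."""
    return (total_ms - display_ms) / refresh_ms


frames = {name: pipeline_frames(ms) for name, ms in measured_ms.items()}
extra_frames = frames["Windows 10 / DX12"] - frames["Windows 7 / DX9Ex"]

for name, value in frames.items():
    print(f"{name}: {value:.2f} refresh intervals")
print(f"extra on Windows 10: {extra_frames:.2f} refresh intervals")
```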

Has anybody succeeded in getting down to one frame of latency using a waitable swap chain on Windows 10?

Circling back to this, it does look like there's an off-by-one in the frame latency waitable object. A requested frame latency value of 1 means "give me the minimum frame latency possible from any present mode, not necessarily the current one." So a composed swapchain will get you 2 frames of latency, just like a fullscreen / independent flip swapchain will.

Regarding the present stats workaround, I've confirmed we've got a bug there which is causing the zeroes. The workaround is to avoid using the SetFullscreenState API and just adjust your windows to cover the screens manually. If you do this (and call ResizeBuffers afterwards), then present stats should work correctly and you should be able to use that to get down to 1 frame of latency when your swapchain qualifies for independent flip.

Note that there are scenarios where composition will still be used (e.g. the volume indicator pops up), and the minimum latency does become 2. If you go the route of using frame statistics, and wait for a frame to be on-screen before rendering another, this will cause your application's framerate to drop to 30hz or worse. I have, however, confirmed that this approach does allow a 16ms latency measured by PresentMon. This approach will work in D3D11 or D3D12 as long as you use one of the FLIP swap effects (mandatory in D3D12).
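
The frame rate drop can be illustrated with a toy model. This is not how DXGI schedules frames; it simply assumes that rendering always fits within one refresh interval and that a frame started at vsync k becomes visible at vsync k + latency. The function name and parameters are made up for illustration.

```python
def simulated_fps(latency_frames, wait_for_display, n_frames=120, refresh_hz=60.0):
    """Achieved frame rate in a toy vsync-stepped model.

    latency_frames:   vsyncs between starting a frame and it reaching the screen
    wait_for_display: if True, the next frame starts only once the previous
                      one is on-screen; otherwise a frame starts every vsync
    """
    start = 0  # vsync index at which the current frame starts rendering
    for _ in range(n_frames):
        visible = start + latency_frames
        start = visible if wait_for_display else start + 1
    # 'start' is now the number of vsyncs that elapsed for n_frames frames.
    return n_frames * refresh_hz / start


# Independent flip with 1 frame of latency: waiting still sustains 60 Hz.
print(simulated_fps(latency_frames=1, wait_for_display=True))
# Composition kicks in (2 frames of latency): waiting halves the rate.
print(simulated_fps(latency_frames=2, wait_for_display=True))
# Without waiting, the rate stays at the refresh rate but latency stays at 2.
print(simulated_fps(latency_frames=2, wait_for_display=False))
```

In this model, waiting for each frame to be on-screen trades throughput for latency: the rate falls to the refresh rate divided by the latency in frames.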

Bottom line: we had good support from Microsoft over the last weeks but eventually we gave up on the DX12 waitable swap chain approach because it gives one additional frame of latency.

See the comment in Jesse's previous post:

"it does look like there's an off-by-one in the frame latency waitable object"

For our multi-GPU / multi head application we have started testing on DX11.

The first results look good.

