DX12 Windows 10: DX12 low latency, tearing-free rendering


Recommended Posts

Hi everyone,
we are investigating how to port our application from Windows 7 to Windows 10.
The requirements are: 60 fps, low latency, and no tearing on Windows 10 using DX12.
Currently we use a DX9Ex full screen application on Windows 7 with DWM disabled.
We have 2 GPUs with 2 heads each, but for this discussion I want to stick to a single GPU using one head.
- Windows 10 version 10.0.14393
- NVIDIA driver 376.63
- NVIDIA K600 graphics card
Our current measurements (using a light sensor and a scope) show that we have
one vsync of additional latency on Windows 10 / DX12 compared to our DX9 full screen solution.
We measure around 50 ms (Windows 7: 32 ms), so one additional frame.
The main question is of course: what settings should we use to get the best result (full screen or windowed, etc.)?
Our initial plan, based on the info we have, was a full screen waitable swap chain where we render each time
the waitable object is signaled (i.e. a buffer is free).
Our measurements do not show the expected results. Why?
We have been watching the video from Jesse Natalie about flip modes but still have some questions:
Q1: In the video Jesse talks about windowed mode and full screen.
At 13:20 into the video he states: the best option for low latency is a full screen swap chain OR a waitable swap chain.
Does this imply that a waitable swap chain cannot be a full screen swap chain?
Or, in other words, that one cannot use a waitable object on a full screen swap chain to check whether a buffer is free
(so that the next Present call does not block)?
Q2: The video suggests (to my understanding) that there are two queues in a swap chain:
a) The present queue (Present blocks when this queue is full).
The size of this queue is determined by the buffer count of the swap chain.
How does this relate to the SetMaximumFrameLatency setting?
b) The number of frames completed on the GPU that still need to be displayed.
How can one control this?
I am not sure how this works; can anybody explain it in more detail?
Q3: Some of the flip modes are not exposed by the API, and the system itself
switches between flip, d-flip and i-flip (26:00 into the video).
It is important for us to be in control. We do not want the latency to change
by some mechanism in the OS. Is it therefore better to use the full screen APIs?
Q4: The video discusses all kinds of flip modes. At 35:22 it is mentioned that the system
switches to windowed immediate iflip when using a DX12 swap chain in full screen.
Is there still a difference, latency-wise, between a borderless window covering the whole screen
and a swap chain set to full screen?
Any feedback is welcome.
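The setup described above (a waitable swap chain where rendering starts when the waitable object is signaled) could be created roughly as follows. This is a hedged sketch, not the poster's actual code: `factory`, `commandQueue`, and `hwnd` are assumed to exist already, and the buffer count and format are illustrative values.

```cpp
// Sketch: a flip-model D3D12 swap chain with a frame-latency waitable object.
// Device, command queue, and window creation are omitted.
DXGI_SWAP_CHAIN_DESC1 desc = {};
desc.Width = 0;                                   // 0 = take size from the window
desc.Height = 0;
desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
desc.SampleDesc.Count = 1;
desc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
desc.BufferCount = 2;
desc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;  // flip model is mandatory in D3D12
desc.Flags = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT;

Microsoft::WRL::ComPtr<IDXGISwapChain1> swapChain1;
factory->CreateSwapChainForHwnd(commandQueue.Get(), hwnd, &desc,
                                nullptr, nullptr, &swapChain1);

Microsoft::WRL::ComPtr<IDXGISwapChain2> swapChain;
swapChain1.As(&swapChain);
swapChain->SetMaximumFrameLatency(1);             // request minimum queued frames
HANDLE frameLatencyWaitable = swapChain->GetFrameLatencyWaitableObject();
```

The waitable object returned at the end is what the render loop waits on before starting each frame.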

An exclusive fullscreen mode is your best chance no matter what. It is even more important in a multi-GPU configuration.

The waitable swapchain used to be windowed only, but it works in fullscreen too as of build 14943 (I do not have other builds to track when the support started). Be sure to call the SetMaximumFrameLatency function too.

As for measuring latency, GPUView can help too.


I'd expect to be able to get <16ms of latency using the waitable object, with a maximum frame latency of 1. If you ensure that your window covers the screen (or use SetFullscreenState) and call ResizeBuffers, you should engage independent flip, and your frames should make it to the screen on the next VSync. In practice, it looks like we may have an off-by-one here, as I'm only able to get ~32ms, but it seems like that should be sufficient for you guys.

Are you sure you're waiting before every frame (including the first one)? If not, you could end up with an extra frame of latency getting added into the waitable object.

Like Galop1n said, the GPUView and PresentMon tools are helpful for determining why the latency is there.
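The point about waiting before every frame, including the first, would look roughly like this in a render loop. A minimal sketch, assuming the swap chain was created with DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT; `RenderFrame` is a hypothetical app-defined function.

```cpp
// Sketch: wait on the frame-latency waitable object BEFORE rendering each
// frame, including the very first one. Waiting only after Present (or not
// at all on frame 0) can silently add one extra frame of latency.
void RenderLoop(IDXGISwapChain2* swapChain, HANDLE frameLatencyWaitable)
{
    for (;;)
    {
        // Block until DXGI signals that a back buffer is available. Doing
        // this first means even frame 0 starts in sync with the queue.
        WaitForSingleObjectEx(frameLatencyWaitable, 1000, TRUE);

        RenderFrame();             // record + execute command lists (app-defined)
        swapChain->Present(1, 0);  // sync interval 1: vsynced, no tearing
    }
}
```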




Thanks for the quick response. About the SetMaximumFrameLatency API, MSDN states:

"Sets the number of frames that the system is allowed to queue for rendering. ... The maximum number of back buffer frames that a driver can queue."

If I have two buffers (a front and a back buffer in a full screen swap chain), the buffer count on the swap chain is set to 2. How does this SetMaximumFrameLatency queue relate to these buffers?

I am trying to understand where in the present chain we have queues, as queues add latency :-)





Hi Jesse,


I have some output from PresentMon; however, my own measurements are done using a light sensor (taped to the screen).

The last column states 32 ms (MsUntilDisplayed). Is that what to expect, or should it be in the order of 16 ms?

I have not (yet) worked with GPUView. Is PresentMon up to the job, or should I invest in GPUView?

Dx12LatencyTest1.exe,312,0x0000000006FB2D40,DXGI,1,0,0,Hardware: Independent Flip,0,0.025946,16.646,16.666,0.143,16.589,32.635
Dx12LatencyTest1.exe,312,0x0000000006FB2D40,DXGI,1,0,0,Hardware: Independent Flip,0,0.042615,16.670,16.666,0.140,16.566,32.631
Dx12LatencyTest1.exe,312,0x0000000006FB2D40,DXGI,1,0,0,Hardware: Independent Flip,0,0.059262,16.647,16.666,0.139,16.614,32.651


So it seems like your monitor has ~18ms of latency built in, if you're measuring 50 but PresentMon is saying 32. Unfortunately it looks like the waitable object may not be working properly. Are you able to use the same present stats technique that you used in D3D9Ex? That should give you similar results.


I tried to use GetFrameStatistics; however, the struct returned contains zeroes. So the trick with the present stats that we use on DX9 does not work on DX12.

I am using the CreateSwapChain API; maybe I should use CreateSwapChainForHwnd?

I created a version of the program which has v-sync disabled. It gives tearing, but I would like to see what the latency will be when measuring with the light sensor. I will pick this up tomorrow when I am back at the office.

PS: what do you mean by "Unfortunately it looks like the waitable object may not be working properly"? A bug in my program, or in the OS/driver?


"In practice, it looks like we may have an off-by-one here, as I'm only able to get ~32ms, but it seems like that should be sufficient for you guys"

The 32 ms I was referring to on Windows 7 (full screen / DWM disabled) includes the latency added by the display. So it seems we get one frame of additional latency on Windows 10 compared to Windows 7.

Has anybody succeeded in getting down to one frame of latency using a waitable swap chain on Windows 10?


Circling back to this, it does look like there's an off-by-one in the frame latency waitable object. A requested frame latency value of 1 means "give me the minimum frame latency possible from any present mode, not necessarily the current one." So a composed swapchain will get you 2 frames of latency, just like a fullscreen / independent flip swapchain will.


Regarding the present stats workaround, I've confirmed we've got a bug there which is causing the zeroes. The workaround is to avoid using the SetFullscreenState API and just adjust your windows to cover the screens manually. If you do this (and call ResizeBuffers afterwards), then present stats should work correctly and you should be able to use that to get down to 1 frame of latency when your swapchain qualifies for independent flip.


Note that there are scenarios where composition will still be used (e.g. the volume indicator pops up), and the minimum latency does become 2. If you go the route of using frame statistics and wait for a frame to be on-screen before rendering another, this will cause your application's framerate to drop to 30Hz or worse. I have, however, confirmed that this approach does allow a 16ms latency as measured by PresentMon. This approach will work in D3D11 or D3D12 as long as you use one of the FLIP swap effects (mandatory in D3D12).
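The workaround described here (skip SetFullscreenState, size the window to cover the screen manually, then call ResizeBuffers) might be sketched as follows; the window-style values and the monitor query are illustrative, not taken from the thread.

```cpp
// Sketch: borderless "fullscreen" window instead of SetFullscreenState.
// After resizing the window, call ResizeBuffers so the swap chain matches
// it and can qualify for independent flip.
void EnterBorderlessFullscreen(HWND hwnd, IDXGISwapChain2* swapChain)
{
    MONITORINFO mi = { sizeof(mi) };
    GetMonitorInfo(MonitorFromWindow(hwnd, MONITOR_DEFAULTTONEAREST), &mi);

    SetWindowLongPtr(hwnd, GWL_STYLE, WS_POPUP | WS_VISIBLE); // drop borders
    SetWindowPos(hwnd, HWND_TOP,
                 mi.rcMonitor.left, mi.rcMonitor.top,
                 mi.rcMonitor.right - mi.rcMonitor.left,
                 mi.rcMonitor.bottom - mi.rcMonitor.top,
                 SWP_FRAMECHANGED);

    // 0/0/0/DXGI_FORMAT_UNKNOWN preserves buffer count and format while
    // adopting the new window size; keep the waitable-object flag the
    // swap chain was created with.
    swapChain->ResizeBuffers(0, 0, 0, DXGI_FORMAT_UNKNOWN,
                             DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT);
}
```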

Edited by Jesse Natalie


Bottom line: we had good support from Microsoft over the last weeks, but eventually we gave up on the DX12 waitable swap chain approach because it adds one extra frame of latency.

See Jesse's comment in the previous post:

"it does look like there's an off-by-one in the frame latency waitable object"

For our multi-GPU / multi-head application we have started testing on DX11. The first results look good.
