How come changing DXGI_SWAP_CHAIN_DESC.BufferCount has no effect?

So I've been trying to implement triple-buffering in my application by changing the BufferCount* parameter of DXGI_SWAP_CHAIN_DESC, but regardless of what I set it to, there is no detectable change in the performance or latency of my application. Let me elaborate...

I would expect that increasing the number of swap chain buffers would lead to an increase in latency. So I started experimenting: First, I added a 50ms sleep to every frame so as to artificially limit the FPS to about 20. Then I tried setting BufferCount to 1, 2, 4, 8, and 16 (the highest it would go without crashing) and tested latency by moving my game's camera. With a BufferCount of 1 and an FPS of ~19, my game was choppy but otherwise had low latency. Now, with a BufferCount of 16 I would expect 16 frames of latency, which at ~19 FPS is almost a whole second of lag. Certainly this should be noticeable just moving the game camera, but there was no more latency than there was with a BufferCount of 1. (And none of the other values I tried had any effect either.)

Another possibly-related thing that's confusing me: I read that with vsync on (and no triple-buffering), the FPS should be locked to an integer divisor of your monitor's refresh rate (e.g., 60, 30, 20, 15, ...) since any frame that takes longer than a vertical blank needs to wait until the next one before being presented. And indeed, when I give Present a SyncInterval of 1, my FPS is capped at 60. But my FPS does *not* drop to 30 once a frame takes longer than 1/60 of a second as I would expect; if I get about 48 FPS with vsync off then I still get about 48 FPS with vsync on. (And no, this isn't a result of averaging frame times. I'm recording individual frame times and they're all very stable at around 1/48 second. I've also checked my GPU settings for any kind of adaptive vsync but couldn't find any.)

More details:

I'm testing this in (I think exclusive) fullscreen, though I've tested in windowed mode as well. (I've fixed all the DXGI runtime warnings about fullscreen performance issues, so I'm pretty sure I have my swap chain configured correctly.)

If it matters, I'm using DXGI_SWAP_EFFECT_DISCARD (but have tested SEQUENTIAL, FLIP_SEQUENTIAL, and FLIP_DISCARD with no apparent effect).

I've tried calling Present with a SyncInterval of both 0 (no vsync) and 1 (vsync every vertical blank). Using 1 adds small but noticeable latency as one would expect, but increasing BufferCount doesn't add to it.

I've tested on three computers: one with a GTX 970, one with a mobile Radeon R9 M370X, and one virtual machine running on VirtualBox. All exhibit this same behavior (or lack thereof).
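
In case the exact setup matters, here's roughly how the swap chain is being created (a trimmed-down sketch rather than my actual code; the window handle, resolution, and format below are placeholders):

    #include <d3d11.h>

    // Simplified swap chain creation (hwnd, resolution, and format are placeholders).
    DXGI_SWAP_CHAIN_DESC scd = {};
    scd.BufferCount        = 2;                           // the value I've been varying (1..16)
    scd.BufferDesc.Width   = 1920;
    scd.BufferDesc.Height  = 1080;
    scd.BufferDesc.Format  = DXGI_FORMAT_R8G8B8A8_UNORM;
    scd.BufferDesc.RefreshRate.Numerator   = 60;
    scd.BufferDesc.RefreshRate.Denominator = 1;
    scd.BufferUsage        = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    scd.OutputWindow       = hwnd;                        // my game window handle (assumed to exist)
    scd.SampleDesc.Count   = 1;
    scd.SwapEffect         = DXGI_SWAP_EFFECT_DISCARD;
    scd.Windowed           = FALSE;                       // exclusive fullscreen

    IDXGISwapChain*      swapChain = nullptr;
    ID3D11Device*        device    = nullptr;
    ID3D11DeviceContext* context   = nullptr;
    D3D11CreateDeviceAndSwapChain(
        nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
        nullptr, 0, D3D11_SDK_VERSION,
        &scd, &swapChain, &device, nullptr, &context);

    // Per frame:
    // swapChain->Present(1, 0);   // SyncInterval 1 = vsync on, 0 = vsync off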

So can anyone explain why I'm not seeing any change in latency or locking to 60/30/20/... FPS with vsync on? Am I doing something wrong? Am I not understanding how swap chains work? Is the graphics driver being too clever?

Thanks for your help!

*(As an aside, does anyone know for sure what I *should* be setting BufferCount to for double- and triple-buffering? In some places I've read that it should be set to 1 and 2 respectively for double and triple buffering, but in other places they say to set it to 2 and 3.)

I could be wrong here, as I'm far from an expert, but depending on your scene complexity the GPU may well have enough time to render those buffers and get them to the display. I would try limiting the FPS through scene complexity first with a standard double-buffered swap chain, then try it with higher buffer counts. As far as buffer counts go, for double buffering you should be setting DXGI_SWAP_CHAIN_DESC::BufferCount = 2.

Have you looked at playing with SetMaximumFrameLatency (the IDXGIDevice1 one, or the swap chain's equivalent)?

By default you're only allowed to queue up 3 frames' worth of data before being throttled, so maybe you're not building up the latency you're expecting?
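
If you want to experiment with it, it's only a couple of lines; a rough sketch (assuming "device" is your existing ID3D11Device, error handling omitted):

    // Raise (or lower) the default 3-frame render-ahead limit.
    IDXGIDevice1* dxgiDevice = nullptr;
    if (SUCCEEDED(device->QueryInterface(__uuidof(IDXGIDevice1),
                                         reinterpret_cast<void**>(&dxgiDevice))))
    {
        dxgiDevice->SetMaximumFrameLatency(16);   // default is 3
        dxgiDevice->Release();
    }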

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Allow me to explain.

A typical deeply-queued game will be bounded by one of 3 things:

  1. The CPU.
  2. The GPU.
  3. VSync.

There's a fourth which is the CPU waiting on the GPU (e.g. calling Map() without DISCARD or DO_NOT_WAIT, or spinning on queries), but in that case the app is controlling its own latency.

The trivial app that you constructed falls into category 1. In category 1, each frame has rendering commands recorded, submitted, and completed all before the next frame is finished recording. When the rendering commands for a frame have completed on the GPU, then the Present() is processed, and will complete as well, potentially with a delay for VSync. There's no artificial requirement that the frame has to be held on to for a certain amount of time to generate specific latency.

In category 2, the app records and submits commands, but they take a long time to complete on the GPU. This is always the case in sync interval 0, but can also be sync interval 1 where the GPU takes longer than 16ms. In this case, the Present() from a given frame completes before the rendering work from the next frame completes. Again, no requirement to hold off the Present() any longer than necessary - at most just until the next VBlank.

In category 3, the app records, submits, and completes commands very quickly, but has requested each frame to be displayed on a VSync. In this scenario, you can build up multiple completed frames before any Present() ops have completed.

There are two integer values that are relevant for frame latency:

  1. MaxFrameLatency (defaults to 3).
  2. BufferCount.

The MaxFrameLatency indicates how far ahead the CPU can get from the GPU. It kicks in to throttle the CPU in scenarios 2 and 3 above, preventing you from having too many frames in flight which could take a long time to complete (either because the GPU has a lot of work to do, or because a lot of VSyncs need to pass). If you Sleep() on the CPU, this value never kicks in because the CPU never gets very far ahead of the GPU.

The BufferCount is only relevant in scenario 3. It indicates how far ahead the GPU can get from the Present() queue. If you have a frame that takes a lot longer than normal to complete, a high buffer count can smooth your framerate because the GPU can start working on it sooner.

To address your point about framerate quantization when double-buffered instead of triple-buffered:

There's an important factor to recognize which affects this: not every game hits this quantization to the same degree. The key is when the swap chain's back buffer is touched during the frame. DX prevents writing to a buffer while it is on screen (or being composed from). If the very first thing you do is clear the back buffer and your frame takes ~16ms to render, then your entire frame is blocked until the buffer is off-screen, which might take until the next VBlank if you only have two buffers (i.e. 30hz). If you have 3, your buffer is probably already offscreen (i.e. 60hz). But if you wait until the very end of the frame to touch your back buffer (e.g. deferred rendering), then you can start rendering your frame ahead of time and only synchronize with the display at the end of the frame. This avoids/minimizes the quantization.
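
To make that concrete, here's a rough sketch of the "touch the back buffer late" pattern (the names are placeholders; sceneRTV/sceneTexture stand for an offscreen render target the app owns, matching the back buffer's size and format):

    // Start of frame: render everything into the offscreen target; no back buffer access yet.
    context->OMSetRenderTargets(1, &sceneRTV, sceneDSV);
    const float clearColor[4] = { 0, 0, 0, 1 };
    context->ClearRenderTargetView(sceneRTV, clearColor);
    // ... draw the entire scene into sceneTexture ...

    // End of frame: the first (and only) touch of the swap chain's back buffer.
    ID3D11Texture2D* backBuffer = nullptr;
    swapChain->GetBuffer(0, __uuidof(ID3D11Texture2D),
                         reinterpret_cast<void**>(&backBuffer));
    context->CopyResource(backBuffer, sceneTexture);   // sizes/formats must match
    backBuffer->Release();

    swapChain->Present(1, 0);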

To address your question about desired buffer count:

It really depends on your swapchain configuration. Also keep in mind that if you're reading anything about D3D9, the value there was "back buffer count", whereas DXGI specifies "buffer count". In other words, D3D9 typically adds 1 more for your "front buffer." Allow me to enumerate some common ones:

  • Exclusive fullscreen, VSync on: You should use 2 or 3, depending on when you touch the back buffer. If you can spare the memory, go for 3 to be safe.
  • Exclusive fullscreen, VSync off: You should use 2.
  • Never use 1 buffer in fullscreen, please. Also note that the swap effect doesn't really affect exclusive fullscreen; they're all pretty much identical there.
  • Windowed blt model (SEQUENTIAL/DISCARD): Use 1. These models do not allow scenario 3 to occur, so there's no point in having more buffers. Note: DISCARD automatically overrides your buffer count to 1 in windowed mode.
  • Windowed flip model (FLIP_SEQUENTIAL/FLIP_DISCARD): Follow the same advice as exclusive fullscreen with VSync on (there's a rough setup sketch after this list).
    • There's a lot more I could say about windowed flip model because it's still evolving to a place where you can get proper VSync off behavior, and it has different performance characteristics depending on whether certain optimizations engage... but in general you should assume that it always has VSync and plan your buffer count accordingly. For more details you can watch my video on this topic (D3D12 requires FLIP_SEQUENTIAL/FLIP_DISCARD but I don't think anything here is 12-specific).
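
For reference, a windowed flip-model setup along those lines might look roughly like this (a sketch only; "factory" is an IDXGIFactory2, "device" and "hwnd" are assumed to exist, and error handling is omitted):

    DXGI_SWAP_CHAIN_DESC1 desc = {};
    desc.Width            = 0;                              // 0 = take the size from the window
    desc.Height           = 0;
    desc.Format           = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.BufferUsage      = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    desc.BufferCount      = 3;                              // per the advice above
    desc.SwapEffect       = DXGI_SWAP_EFFECT_FLIP_DISCARD;  // or FLIP_SEQUENTIAL (FLIP_DISCARD needs Windows 10)

    IDXGISwapChain1* swapChain = nullptr;
    factory->CreateSwapChainForHwnd(device, hwnd, &desc,
                                    nullptr,    // no fullscreen desc = windowed
                                    nullptr,    // no output restriction
                                    &swapChain);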

As a last question... how are you measuring latency? Have you tried out PresentMon (my shameless plug)? The real-time display includes an average measure of latency, and the CSV includes a column with the latency for each frame. This is measured up until the GPU is instructed to flip, and doesn't account for things like built-in monitor latency.

Edit: One more thing I forgot to mention. The reason for having a BufferCount higher than 3 or 4 would be to render a dozen frames, have the CPU and GPU active for a little while, then submit them all to be displayed on VSync and let the CPU and GPU go idle. This race-to-idle behavior is usually very power-efficient and is ideal for video playback. If you have a low buffer count, the GPU will need to wake up as each frame is retired to render another one into the newly available buffer, and you end up using a lot more power. Outside of this race-to-idle case, or trying to get high framerates while using frame coalescing (a FLIP_ swap effect with sync interval 0 and no tearing), there's not really a reason to use higher buffer counts.

Jesse, this is by far the clearest and most informative explanation I've read on the internet or in a book on how BufferCount and MaxFrameLatency work. (And the video was useful too.) Thank you so much! If I could upvote you a thousand times, I would! I hope lots of other people find your explanation as useful as I have.

There are still a few things that I'm puzzled by:

1. As a test, I removed the trivial Sleep(50) call every frame and instead looped part of my rendering code 30 times (a part that uses very little CPU but draws lots of pixels), which brings my FPS down to about 20. (90% of my frame time is now spent in Present, so I'm pretty sure I'm now GPU-limited.) Setting BufferCount to 16 had no noticeable effect, which now makes sense given your explanation (since this is GPU-limited and not vsync-limited). I also tried setting MaxFrameLatency to 16, which if I understand correctly should introduce 16 frames of latency since my CPU can execute so much faster than my GPU? But again, I'm seeing no latency, which should be quite obvious at ~20 FPS, correct? Am I misunderstanding something? (I also tried PresentMon, which is reporting ~130ms of latency regardless of how I set MaxFrameLatency.)

2. I've been using a BufferCount of 1 in full-screen with no obvious ill-effect. Will the driver automatically increase it to 2 if I specify 1 in full-screen mode? Or maybe I'm not actually running in exclusive full-screen? (Is there any way to check that? PresentMon's CSV says "Hardware: Legacy Flip" if that's at all relevant.)

3. Now that I have my game GPU-limited, I am seeing my FPS locked to 60/30/20/15/etc when vsync is on. Why don't I see the same behavior when my game is CPU-limited? (And yeah, I've set MaxFrameLatency to 1.)

Thanks again!

1. What swap effect are you using? Are you windowed or fullscreen? I would expect SetMaximumFrameLatency(16) to allow 16 frames of latency, though it's possible that the driver might be intervening. You should double-check your driver's control panel to make sure that relevant settings are indeed controlled by the app. You might also try using D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS on your device. I've confirmed that with a simple app that renders at 60fps with VSync on, I can hit 300ms of latency (according to PresentMon).
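
For reference, that flag goes in at device-creation time, roughly like this (a sketch; error handling and feature-level selection omitted):

    ID3D11Device*        device  = nullptr;
    ID3D11DeviceContext* context = nullptr;
    D3D11CreateDevice(
        nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr,
        D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS,
        nullptr, 0, D3D11_SDK_VERSION,
        &device, nullptr, &context);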

2. I think the DISCARD swap effect might get auto-upgraded to 2 in fullscreen. The 1-buffer fullscreen SEQUENTIAL swapchain ends up using a copy for presentation, and shows up in PresentMon as "Hardware: Legacy Copy to front buffer."

3. I think I need to explain this one by example. Let's start by assuming that the CPU queue is always full and you're GPU-bound.

  • If your frames take 15ms on the GPU, you can produce one frame per VSync. You render a frame, it gets queued, and flips on the next VSync. You never build up presents to become VSync-bound. For two frames, you get 15ms of work, 1ms idle, 15ms work, 1ms idle.
  • If your frames take 17ms on the GPU, it takes a little longer than a VSync to produce a frame. You render a frame, and it gets queued to flip on the next VSync. Meanwhile, the GPU sits there idle, waiting for that flip to happen, because the next frame needs to write to the texture that's still on-screen. So for two frames you have 17ms of work, 15ms idle, 17ms of work, 15ms idle.

If you add up the work/idle time per frame, the first scenario is 16ms (60hz), the second is 33ms (30hz).

But now, what if you can do the first 10ms of work without touching the back buffer? Those two scenarios become:

  • Your frames take 15ms on the GPU. No change.
  • Your frames take 17ms on the GPU. Your times for a few frames are now: 17ms work frame 0, 10ms work frame 1, 5ms idle, 7ms work frame 1, 10ms work frame 2, 10ms idle, 7ms work frame 2, 10ms work frame 3...

Now scenario 2 ends up averaging 27ms per frame, or 37hz.

So the quantization all comes down to how long the GPU is idle waiting for a VSync, and what that does to your overall frame time. When you're CPU-bound, any time the GPU spends waiting for a buffer to come off-screen isn't affecting your overall framerate (if it was, you'd end up GPU-bound).

Again, a very clear explanation that really helps me understand what's going on. Thanks!

1. I'm using DXGI_SWAP_EFFECT_DISCARD in full-screen with vsync on and have experimented with both 2 and 3 buffers. No significant difference in average latency between SetMaximumFrameLatency(1) and SetMaximumFrameLatency(16) according to PresentMon. D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS had no effect either. So maybe my driver is overriding this setting? There aren't many options to tweak in my AMD control panel. I can try on my other computer (a GTX 970) later.

FYI, I have confirmed that AMD inserts a wait in their driver to enforce a maximum frame latency of 3. This is unexpected, and I'll be following up with them to see what's going on. You can experiment with getting a deeper queue using another GPU or WARP. It might be easier with WARP since the "GPU" is so much slower.
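
If it helps, switching to WARP is just a different driver type at device creation; a sketch (same variables as the snippet above):

    // Same D3D11CreateDevice call as before, but on the WARP (software) driver.
    D3D11CreateDevice(
        nullptr, D3D_DRIVER_TYPE_WARP, nullptr, 0,
        nullptr, 0, D3D11_SDK_VERSION,
        &device, nullptr, &context);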

I finally got around to testing on my GTX 970, and I can confirm that SetMaximumFrameLatency(16) does indeed create the expected latency, unlike on my laptop's M370X. I'm certainly curious why AMD limits it to 3.

Thanks again for your help!
