Why doesn't DirectX implement triple buffering?


Thank you for such an in-depth response. How long is the additional latency with triple buffering? Is it always one frame, or is a one-frame lag the worst-case scenario?

What happens in his case:
Front buffer: frame 1
Back buffer 1: frame 2
Back buffer 2: gpu processing frame 3

--monitor refresh-- (circular flip)

Front buffer: frame 2
Back buffer 1: gpu processing frame 3
Back buffer 2: empty?

is second state correct?

Also, wouldn't a detached input, logic & physics loop that is running faster than the graphics part give a similar result as frame skipping?
Input, logic & physics: 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
Graphics: 1, 2, 4, 5, 7, 8, ...

So effectively graphics is skipping frames 3 & 6. How can this form of frame skipping be better than Anand's frame skipping? I know I am asking a lot of questions, but please bear with me. I understand it can reduce input lag, but wouldn't skipping input, logic & physics frames produce the same stuttering effect as Anand's frame skipping?
Also now we are talking about synchronizing three loops:
1) Input, logic & physics
2) Graphics
3) Monitor refresh rate

the CPU may be issuing 4 frames, but the GPU is still presenting the front buffer, the 1st backbuffer is waiting to be presented, and the 2nd backbuffer is also waiting; and the CPU wants a 4th one.
So, instead of using quadruple buffering or waiting, the Anandtech article suggests dropping the contents of the 2nd backbuffer and replacing it with the contents of that 4th frame.

One mistake you made here, I think. I think Anandtech suggests the method of dropping the contents of the 1st backbuffer (because it has the oldest frame) and starting to calculate/write the contents of that 4th frame into it. So the monitor will end up showing frames 1, 3, 4. And it is the GPU that must be ready for the 4th frame for that to happen, not just the CPU, as we are focusing on synchronizing graphics and the monitor's refresh rate. By dropping one frame we have given the GPU more time to process the 4th frame, right?
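To make the drop-the-oldest idea above concrete, here is a purely conceptual sketch (not real driver code; the names, indices and callbacks are invented for illustration) of how that style of triple buffering behaves: the GPU never waits, and at each vsync the newest completed frame wins.

    // Conceptual sketch only: "render ahead, drop the older pending frame".
    #include <array>
    #include <utility>

    struct Buffer { int frameId = -1; };   // -1 means "no completed frame yet"

    std::array<Buffer, 3> buffers;
    int frontIdx  = 0;   // currently scanned out by the monitor
    int newestIdx = 1;   // most recently completed back buffer
    int renderIdx = 2;   // buffer the GPU is rendering into

    void OnFrameCompleted(int frameId)
    {
        buffers[renderIdx].frameId = frameId;
        // The just-finished buffer becomes the newest pending frame. Rendering
        // continues into the previous "newest" buffer, overwriting (dropping)
        // the older pending frame instead of waiting for a vsync.
        std::swap(newestIdx, renderIdx);
    }

    void OnVSync()
    {
        if (buffers[newestIdx].frameId > buffers[frontIdx].frameId)
            std::swap(frontIdx, newestIdx);   // show the newest completed frame
        // If nothing newer is ready, the current front buffer is shown again.
    }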


Thank you for such an in-depth response. How long is the additional latency with triple buffering? Is it always one frame, or is a one-frame lag the worst-case scenario?

What happens in his case:
Front buffer: frame 1
Back buffer 1: frame 2
Back buffer 2: gpu processing frame 3

--monitor refresh-- (circular flip)

Front buffer: frame 2
Back buffer 2: gpu processing frame 3
Back buffer 1: empty?

is second state correct?

I've corrected the numbers in bold for you. (After the circular flip, back buffer #1 becomes #2 and #2 becomes #1, assuming they're using the swap technique)
And yes, that's correct. However, "empty" is inaccurate; a better term is "undefined". Most likely it will contain the contents it had before swapping, but this isn't guaranteed IIRC. The data could've been corrupted by now, or the driver could for some reason use an unused chunk of VRAM from some other region and discard the old one. This is especially sensitive when dealing with rarer setups (i.e. SLI, CrossFire).

If the driver is using copy instead of flip/swap; the best guess is that buffers #1 & #2 now hold the same data (because #2 was copied into #1).
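For anyone trying to picture the flip vs. copy difference being discussed, here is a rough sketch (hypothetical types, not the actual driver or D3D behaviour) of the two presentation styles with one front buffer and two back buffers:

    // Hypothetical sketch: pointer rotation ("flip") versus blitting ("copy").
    #include <vector>

    struct Surface { std::vector<unsigned char> pixels; };

    struct Buffers
    {
        Surface* front;   // scanned out by the monitor
        Surface* back1;   // frame the application just presented
        Surface* back2;   // frame the GPU is rendering into
    };

    void PresentByFlip(Buffers& b)
    {
        // Only the pointers rotate; no pixel data moves. The old front buffer
        // becomes a back buffer whose previous contents are stale ("undefined"
        // as far as the application is concerned).
        Surface* oldFront = b.front;
        b.front = b.back1;
        b.back1 = b.back2;
        b.back2 = oldFront;
    }

    void PresentByCopy(Buffers& b)
    {
        // Pixel data is blitted into the front buffer; the back buffer keeps
        // its contents, so two buffers now hold the same image.
        b.front->pixels = b.back1->pixels;
    }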

Also, wouldn't a detached input, logic & physics loop that is running faster than the graphics part give a similar result as frame skipping?

Short answer: mostly yes.
Longer answer: You're the one in control of what graphic/logic state you will be rendering, so there is finer granularity. But the actual reason is that rendering tends to take a lot of time (it is very rare to spend just 3.33ms; few games manage 16.67ms/60hz while most need 33.33ms/30hz!), so even if the CPU can run at 120hz, you will be limited to just 30hz, regardless of triple buffering or VSync.
By detaching, you can process at 120hz or more while still rendering at 60 or 30hz. This is covered in the excellent article "Fix Your Timestep". You're looking at 30hz visual updates, but the game "feels" responsive (because complex key inputs are processed almost immediately).
It's like running towards a cliff to get to the other side with your eyes closed. Just because your eyes are closed doesn't mean you have to wait until they're open again to jump. If you've calculated the distance well enough, press the space bar to jump at the right time.
With graphics locked to input & logic, your "space bar" jump could be processed either too early or too late.
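A minimal sketch of what "detaching" looks like in code, in the spirit of the "Fix Your Timestep" article (the callbacks GameIsRunning, ProcessInput, UpdateLogicAndPhysics and RenderFrame are hypothetical placeholders):

    #include <chrono>

    // Hypothetical game callbacks, assumed to exist elsewhere.
    bool GameIsRunning();
    void ProcessInput();
    void UpdateLogicAndPhysics(double dt);
    void RenderFrame();

    void RunGameLoop()
    {
        using clock = std::chrono::steady_clock;
        const std::chrono::duration<double> step(1.0 / 120.0); // fixed 120hz logic step

        auto previous = clock::now();
        std::chrono::duration<double> accumulator(0.0);

        while (GameIsRunning())
        {
            auto now = clock::now();
            accumulator += now - previous;   // real time elapsed since last iteration
            previous = now;

            // Run as many fixed-size input/logic/physics steps as real time
            // demands, independently of how long rendering takes.
            while (accumulator >= step)
            {
                ProcessInput();
                UpdateLogicAndPhysics(step.count());
                accumulator -= step;
            }

            RenderFrame();                   // may block on VSync / Present
        }
    }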

In other words, the reason is the same as in Anandtech's article (reduce latency); it's just that the article makes a very optimistic assumption about how long it takes a GPU to render a frame (unless you're playing a very old game that current GPUs can handle easily) and how stable that framerate is.
Aaaand that's why G-Sync is a cool gadget by the way.


One mistake you made here, I think. I think Anandtech suggests the method of dropping the contents of the 1st backbuffer (because it has the oldest frame) and starting to calculate/write the contents of that 4th frame into it. So the monitor will end up showing frames 1, 3, 4. And it is the GPU that must be ready for the 4th frame for that to happen, not just the CPU, as we are focusing on synchronizing graphics and the monitor's refresh rate. By dropping one frame we have given the GPU more time to process the 4th frame, right?

Yes, my mistake.

I think you forgot about the front buffer in the circular flip example. The front buffer becomes back buffer 2, back buffer 1 becomes the front buffer, and back buffer 2 becomes back buffer 1. That is the case for the circular flip technique, where only the pointers (or names/purposes of the buffers) change, but the actual content stays in the same spot in memory.

I think I understand now.

  • Triple buffering is meant to be used with V-Sync, and the reason to use it is to prevent the frame rate drop you get with double buffering when V-Sync is enabled.
  • Triple buffering can increase lag by up to one frame (one frame of the monitor's refresh rate, right? ... so 120hz monitors can be an improvement when triple buffering is used?)
  • Anand's technique would only be an improvement (reduce lag) over the circular flip technique in old games where the GPU renders quicker than the monitor's refresh rate. With a slow GPU and low FPS, there would not be any notable difference between the techniques (for both there will be lag of up to one frame of the monitor's refresh rate). A much better technique to reduce lag is to separate the input, logic & physics loop from the graphics loop and move them to separate threads, so the game's input stays responsive even when the frame rate drops below the monitor's refresh rate (useful even when not using triple buffering).
  • The title of this thread is wrong hehe

I thought a minimum of two buffers was needed: one front buffer and one back buffer. How would only one buffer work?

I know that, at least in blitting/copy mode, there is a minimum of only one back-buffer. There's no front buffer, or the front buffer is whatever video memory D3D maps internally to the screen.

One thing is for sure: when you create the swap chain for a D3D device, you only give it the count of back-buffers to create, and you can always set that to 1, which is valid, and this is what I was referring to when I said you need only one buffer. But I also think that maybe there are restrictions on using a swap chain with only one back buffer with the "flip" presentation mode - maybe it requires the back buffer count of the swap chain to be at least 2, or maybe it creates a front buffer internally and uses it for flipping with the one back buffer you asked it to create for the swap chain - I don't know, but I'm sure it's documented somewhere how it works. I always thought of the "front buffer" as an internal buffer that only the D3D API knows and cares about, or maybe it's even only internal to the video driver (I think it's the same thing as that elusive part of video memory that linux/OpenGL users like to call a "framebuffer").

There's also a comment about this at the bottom of the msdn article: http://msdn.microsoft.com/en-us/library/windows/desktop/bb173075(v=vs.85).aspx

Great! Now I'm confused too. jk :) Anyway, I'm happy to know that my swap chains work with just one buffer (BackBufferCount=1) in copy presentation mode (but D3D11 complains when I give it a BackBufferCount of 0 or less)... Don't know what it accepts for the flipping presentation mode.

In fact, you're lucky, because the D3D9 documentation seems to be more clear on what the BackBufferCount can be: http://msdn.microsoft.com/en-us/library/windows/desktop/bb172588%28v=vs.85%29.aspx
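For reference, this is roughly what the single-back-buffer, copy-mode setup described above looks like in D3D9 (a sketch with error handling omitted; hWnd is assumed to be an existing window):

    #include <d3d9.h>
    #pragma comment(lib, "d3d9.lib")

    IDirect3DDevice9* CreateSingleBackBufferDevice(HWND hWnd)
    {
        IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);

        D3DPRESENT_PARAMETERS pp = {};
        pp.Windowed             = TRUE;
        pp.BackBufferCount      = 1;                       // a single back buffer
        pp.BackBufferFormat     = D3DFMT_UNKNOWN;          // use the desktop format in windowed mode
        pp.SwapEffect           = D3DSWAPEFFECT_COPY;      // copy mode requires exactly one back buffer
        pp.hDeviceWindow        = hWnd;
        pp.PresentationInterval = D3DPRESENT_INTERVAL_ONE; // VSync on

        IDirect3DDevice9* device = nullptr;
        d3d->CreateDevice(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd,
                          D3DCREATE_HARDWARE_VERTEXPROCESSING, &pp, &device);
        return device;
    }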

The DXGI_SWAP_CHAIN_DESC documentation makes it implicit that - in windowed modes at least - D3D certainly does support rotating back-buffers.

in windowed mode, the desktop is the front buffer

If you think about it, it's obvious. Your front buffer (i.e. the desktop itself) is just not going to be the same size as your back buffer(s), so there's absolutely no way that the front buffer (desktop) can be swapped with a back buffer.

You'll see the same in D3D9 if you use IDirect3DDevice9::GetFrontBufferData - the surface data you retrieve will be sized to the desktop, not to your back buffer size. Again, this is documented behaviour:

the size of the destination surface should be the size of the desktop


In case anyone else is interested here are some good articles I found while researching this subject.

Two buffers is the minimum: one front buffer and one back buffer. The front buffer is the on-screen buffer, to which we cannot write. Back buffers are off-screen surfaces to which we draw. When creating a swap chain we only specify the back buffer count (the front buffer always exists in one form or another; it may not be the same size as the back buffers if your game is in windowed mode, and you do not have direct control over the front buffer in D3D9).

Front buffer: A rectangle of memory that is translated by the graphics adapter and displayed on the monitor or other output device.

...

The front buffer is not directly exposed in Direct3D 9. As a result, applications cannot lock or render to the front buffer.

http://msdn.microsoft.com/en-us/library/windows/desktop/bb174607%28v=vs.85%29.aspx

Note that any surface other than the front buffer is called an off-screen surface because it is never directly viewed by the monitor. By using a back buffer, an application has the freedom to render a scene whenever the system is idle ... without having to consider the monitor's refresh rate. Back buffering brings an additional complication of how and when to move the back buffer to the front buffer.

http://msdn.microsoft.com/en-us/library/windows/desktop/bb153350%28v=vs.85%29.aspx

An MSDN article on OpenGL (may be old but still relevant) that has a clear explanation of the relationship between the terms framebuffer, front buffer, and back buffer:

The framebuffer consists of a set of logical buffers: color, depth, accumulation, and stencil buffers. The color buffer itself consists of a set of logical buffers; this set can include a front-left, a front-right, a back-left, a back-right, and some number of auxiliary buffers.

...
By default, drawing commands are directed to the back buffer (the off-screen buffer), while the front buffer is displayed on the screen.

http://msdn.microsoft.com/en-us/library/windows/desktop/dd318339%28v=vs.85%29.aspx

A more in-depth article that describes the differences between windowed and full-screen mode (DX10):

When DXGI makes the transition to full-screen mode, it attempts to exploit a flip operation in order to reduce bandwidth and gain vertical-sync synchronization. The following conditions can prevent the use of a flip operation:

  • The application did not re-allocate its back buffers in a way that they match the primary surface.

  • The driver specified that it will not scan-out the back buffer (for example, because the back buffer is rotated or is MSAA).

  • The application specified that it cannot accept the Direct3D runtime discarding of the back buffer's contents and requested only one buffer (total) in the chain. (In this case, DXGI allocates a back surface and a primary surface; however, DXGI uses the driver's PresentDXGI function with the Blt flag set.)

http://msdn.microsoft.com/en-us/library/windows/hardware/ff557525%28v=vs.85%29.aspx
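As a rough illustration of those conditions, a D3D11/DXGI swap chain that avoids them might be set up like this (a sketch only; the sizes, window handle and format are placeholders, and whether DXGI actually promotes the presentation to a flip in full screen is up to the runtime and driver):

    #include <d3d11.h>
    #pragma comment(lib, "d3d11.lib")

    HRESULT CreateFlipFriendlySwapChain(HWND hWnd, UINT width, UINT height,
                                        IDXGISwapChain** outSwapChain,
                                        ID3D11Device** outDevice,
                                        ID3D11DeviceContext** outContext)
    {
        DXGI_SWAP_CHAIN_DESC desc = {};
        desc.BufferDesc.Width  = width;
        desc.BufferDesc.Height = height;
        desc.BufferDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM; // a scan-out friendly format
        desc.SampleDesc.Count  = 1;                          // no MSAA on the back buffer
        desc.BufferUsage       = DXGI_USAGE_RENDER_TARGET_OUTPUT;
        desc.BufferCount       = 2;                          // more than one buffer in the chain
        desc.OutputWindow      = hWnd;
        desc.Windowed          = TRUE;                       // go full screen later via SetFullscreenState
        desc.SwapEffect        = DXGI_SWAP_EFFECT_DISCARD;
        desc.Flags             = DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH;

        return D3D11CreateDeviceAndSwapChain(
            nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
            nullptr, 0, D3D11_SDK_VERSION,
            &desc, outSwapChain, outDevice, nullptr, outContext);
    }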

Another interesting thing is that the driver can, it seems, create additional buffers on its own.

However, drivers are notorious for adding more buffering of their own. This is an unfortunate side effect of benchmark tools such as 3DMark. The more aggressively you buffer, the more you can smooth out perf variations, and the more parallelism can be guaranteed, so you achieve better overall throughput and visual smoothness. But this obviously messes up input response time, so it's not a good optimization for anything more interactive than a movie player! Unfortunately, though, there is no way for automated benchmarks like 3DMark to test input response times, so drivers tend to over-index on maximizing their benchmark scores at the cost of more realistic user scenarios.

This was particularly bad in the DX8 era, where drivers would sometimes buffer up 5, 6 or more frames, and games resorted to crazy tricks like drawing to a 1x1 rendertarget every frame, then calling GetData on it at the start of the next frame, in an attempt to force the driver to flush its buffers. I haven't seen any drivers with that kind of pathological behavior for a long time, though.

http://xboxforums.create.msdn.com/forums/p/58428/358113.aspx#358113
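The flush trick mentioned in that quote can be approximated in D3D9 with an event query; here is a hedged sketch of that idea (not the exact 1x1 rendertarget variant, and real engines usually wait on a fence from one or two frames back rather than the current one, to keep some CPU/GPU parallelism):

    #include <d3d9.h>

    void PresentWithLimitedQueue(IDirect3DDevice9* device)
    {
        static IDirect3DQuery9* frameQuery = nullptr;
        if (!frameQuery)
            device->CreateQuery(D3DQUERYTYPE_EVENT, &frameQuery);

        // Fence the end of the commands recorded for this frame.
        frameQuery->Issue(D3DISSUE_END);

        device->Present(nullptr, nullptr, nullptr, nullptr);

        // Before building the next frame, wait until the GPU has reached the
        // fence. This stops the driver from buffering many frames ahead (and
        // piling up input latency), at the cost of some parallelism.
        while (frameQuery->GetData(nullptr, 0, D3DGETDATA_FLUSH) == S_FALSE)
        {
            // Busy-wait; a real implementation might yield or sleep here.
        }
    }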

You da man!


Unfortunately, though, there is no way for automated benchmarks like 3DMark to test input response times

I have to disagree with this: input response time can be calculated as the current time minus the input-event timestamp that is reported by functions like GetMessageTime. And if it is used after a call to timeBeginPeriod(1), the response time will be accurate down to 1 millisecond.
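A small Win32 sketch of that measurement (window setup and WndProc registration are assumed to exist already; note this only measures the queue-to-processing part of the latency, not the time until the result actually appears on screen):

    #include <windows.h>
    #include <mmsystem.h>                 // timeBeginPeriod / timeGetTime
    #pragma comment(lib, "winmm.lib")

    LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        if (msg == WM_KEYDOWN)
        {
            // GetMessageTime() is the tick at which the key message was posted;
            // timeGetTime() is "now". After timeBeginPeriod(1) both are read
            // with roughly 1 ms granularity.
            DWORD latencyMs = timeGetTime() - (DWORD)GetMessageTime();
            (void)latencyMs;              // ...log it somewhere...
            return 0;
        }
        return DefWindowProc(hWnd, msg, wParam, lParam);
    }

    // At startup:  timeBeginPeriod(1);   // raise the timer resolution to 1 ms
    // At shutdown: timeEndPeriod(1);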

Well, after re-thinking about it, it does make sense: there is no way for 3DMark to simulate physical input events.

There are ways to measure this. Just have a camera and then count the frames between you hitting the button and something appearing on screen:

http://cowboyprogramming.com/2008/05/27/programming-responsiveness/

http://cowboyprogramming.com/2008/05/30/measuring-responsiveness-in-video-games/

http://cowboyprogramming.com/2008/12/03/custom-responsiveness-measuring-device/

What I wondered about when reading the posts above (sorry I'm late, as I don't regularly read this subforum), especially the postings from Matias Goldberg, is the following:

Couldn't it be possible to have an intermediate method, where the GPU does not always wait when the back buffers are in use, but also does not throw away the half-rendered frame 3 to get a free buffer?

It could wait until the earlier of two things:

- until frame 3 is rendered completely while frame 2 is still queued for the next vsync, and then cycle only the back buffers to throw away frame 2 and queue the ready and newer frame 3 for the next vsync, or

- until the vsync frees a buffer when it switches from frame 1 to frame 2 by cycling the front and back buffers at once.

Depending on circumstances, that would reduce latency for frame 3 by not showing a stale frame 2 first, and could allow the CPU and GPU to be unblocked so rendering of frame 4 starts earlier, which maybe lets frame 4 or 5 show up another vsync earlier.

That would trade more complicated driver logic and higher CPU/GPU use (when rendering is already going fast) for lower latency with triple buffering and vsync, I think.
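If it helps, the proposed policy could be sketched like this (purely conceptual; an application cannot do this through D3D, it would have to live in the driver or runtime, and StartRenderingNextFrame/FlipToFront are hypothetical helpers):

    // Conceptual sketch of the "wait for whichever comes first" policy.
    void StartRenderingNextFrame();   // hypothetical
    void FlipToFront(int frameId);    // hypothetical

    int queuedForVSync = 2;   // frame 2 is waiting to be shown at the next vsync
    int rendering      = 3;   // frame 3 is currently being rendered

    void OnFrameFinishedFirst()
    {
        // Frame 3 finished before the vsync: drop the stale queued frame 2,
        // queue frame 3 in its place, and free a back buffer so frame 4 can start.
        queuedForVSync = rendering;
        rendering      = rendering + 1;
        StartRenderingNextFrame();
    }

    void OnVSyncFirst()
    {
        // The vsync came first: show the queued frame 2 normally (which also
        // frees a buffer) and keep rendering frame 3.
        FlipToFront(queuedForVSync);
    }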

