D3D12: Copy Queue and ResourceBarrier

19 comments, last by iedoc 8 years, 2 months ago

Okay, so if I create the Upload Heap resource in the D3D12_RESOURCE_STATE_GENERIC_READ state and the Default Heap resource in D3D12_RESOURCE_STATE_COPY_DEST, I can then transition the resource to D3D12_RESOURCE_STATE_COMMON or D3D12_RESOURCE_STATE_COPY_DEST without it failing.

But when I want to use it on the Direct Queue, don't I need to first transition it to D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER? So I have to use a fence on the Copy Queue, check that it's done, and then transition the resource on the Direct Queue?


1. create upload heap (D3D12_RESOURCE_STATE_GENERIC_READ state)

2. create default heap (D3D12_RESOURCE_STATE_COPY_DEST state)

3. fill upload heap with data

4. copy the data from the upload heap to the default heap (UpdateSubresources() records the copy command into the command list, which stores it in its command allocator)

5. transition default heap from copy dest state to vertex and constant buffer state (next command in command list after UpdateSubresources)

6. execute command list

7. update fence value

8. call Signal() on the command queue; this creates a "command" that updates the fence value on the GPU with the new fence value

9. check the fence value on the GPU to make sure the signal command was executed; at that point you know the command list above has finished executing.

You have to check the fence value, otherwise you may end up trying to access the data in the default heap before it's finished copying.
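The steps above can be sketched roughly like this. This is a hedged sketch, not a complete program: error handling is omitted, `device`, `cmdList`, `cmdQueue`, `fence`, `fenceValue`, `fenceEvent`, `vertices`, and `vbSizeInBytes` are assumed to exist already, and `UpdateSubresources()` plus the `CD3DX12_*` helpers come from the d3dx12.h helper header.

```cpp
// 1-2. Create the upload heap (GENERIC_READ) and default heap (COPY_DEST).
ComPtr<ID3D12Resource> uploadHeap, defaultHeap;
CD3DX12_HEAP_PROPERTIES uploadProps(D3D12_HEAP_TYPE_UPLOAD);
CD3DX12_HEAP_PROPERTIES defaultProps(D3D12_HEAP_TYPE_DEFAULT);
CD3DX12_RESOURCE_DESC bufDesc = CD3DX12_RESOURCE_DESC::Buffer(vbSizeInBytes);

device->CreateCommittedResource(&uploadProps, D3D12_HEAP_FLAG_NONE, &bufDesc,
    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr, IID_PPV_ARGS(&uploadHeap));
device->CreateCommittedResource(&defaultProps, D3D12_HEAP_FLAG_NONE, &bufDesc,
    D3D12_RESOURCE_STATE_COPY_DEST, nullptr, IID_PPV_ARGS(&defaultHeap));

// 3-4. Describe the CPU-side data and record the copy into the command list.
D3D12_SUBRESOURCE_DATA vertexData = {};
vertexData.pData      = vertices;
vertexData.RowPitch   = vbSizeInBytes;
vertexData.SlicePitch = vbSizeInBytes;
UpdateSubresources(cmdList.Get(), defaultHeap.Get(), uploadHeap.Get(),
                   0, 0, 1, &vertexData);

// 5. Transition COPY_DEST -> VERTEX_AND_CONSTANT_BUFFER
//    (this step is only legal on a direct/graphics command list).
CD3DX12_RESOURCE_BARRIER barrier = CD3DX12_RESOURCE_BARRIER::Transition(
    defaultHeap.Get(),
    D3D12_RESOURCE_STATE_COPY_DEST,
    D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);
cmdList->ResourceBarrier(1, &barrier);

// 6-8. Execute the command list and signal the fence on the queue.
cmdList->Close();
ID3D12CommandList* lists[] = { cmdList.Get() };
cmdQueue->ExecuteCommandLists(1, lists);
fenceValue++;
cmdQueue->Signal(fence.Get(), fenceValue);

// 9. Block the CPU until the GPU reaches the signal.
if (fence->GetCompletedValue() < fenceValue) {
    fence->SetEventOnCompletion(fenceValue, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
}
```

Note that this version records everything on one direct queue, which sidesteps the copy-queue restriction discussed in this thread.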

iedoc, step 5 is the problem. You can't do that transition on a Copy Queue. So my question is really: do I have to do the fence on the Copy Queue, and then transition to D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER on the Direct Queue? Which it looks like I do.


That makes sense. A copy queue is an abstraction over a DMA controller, which really can't do anything useful besides memcpy... however, resource transitions are abstractions over cache invalidations, cache flushes, and data format/packing/pitch/swizzling transformations -- which a DMA unit might not be able to perform. So it makes sense that your "graphics" queue would wait for a signal that the DMA task has completed, and then perform these "transition" tasks itself.
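This pattern (copy queue signals, direct queue waits on the GPU timeline, then the direct queue owns the transition) might be sketched like this. All the names (`copyQueue`, `directQueue`, `copyFence`, `copyList`, `directList`, `defaultHeap`) are placeholders, and error handling is omitted:

```cpp
// Submit the copy work on the copy queue and signal a fence there.
copyList->Close();
ID3D12CommandList* copyLists[] = { copyList.Get() };
copyQueue->ExecuteCommandLists(1, copyLists);
copyQueue->Signal(copyFence.Get(), ++copyFenceValue);

// GPU-side wait: the direct queue stalls until the fence reaches the value.
// No CPU round trip is needed for this; ID3D12CommandQueue::Wait queues the
// wait on the GPU timeline.
directQueue->Wait(copyFence.Get(), copyFenceValue);

// The transition is recorded on a direct command list, where it is legal.
CD3DX12_RESOURCE_BARRIER barrier = CD3DX12_RESOURCE_BARRIER::Transition(
    defaultHeap.Get(),
    D3D12_RESOURCE_STATE_COPY_DEST,
    D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);
directList->ResourceBarrier(1, &barrier);
directList->Close();
ID3D12CommandList* directLists[] = { directList.Get() };
directQueue->ExecuteCommandLists(1, directLists);
```

A CPU-side fence wait would also work, but the queue-to-queue `Wait()` keeps the CPU out of the loop.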

Honestly I use a direct queue for everything, including moving data to and from the GPU. The Present API only presents at multiples of the screen refresh rate (I haven't had luck getting unlocked FPS yet), and I get 120 FPS whether I use a direct queue or a copy queue. Unless you are moving a lot of data to and from the GPU, I personally feel the copy queue just makes things more complex than they need to be for the performance gain you might get from it.

Anyway, if you are going to use a copy queue, you would still definitely have to use a fence like Hodgman said, because again you need to make sure the data has finished copying before using or modifying it (which includes changing its state; from what I understand the GPU could actually physically move the data when changing states, but someone please correct me if I'm wrong about that). Not that this bit is that helpful, but you can still use the data as vertex/index resources if you leave the default heap in the copy dest state. Of course the GPU can optimize certain things depending on the state a resource is in, so you'll want to make sure it's in the proper state before you do anything with it.

A copy queue shouldn't have much to do with your FPS, should it? That's got more to do with your swap chain texture count. You could churn out frames to a render target and copy them to your swap chain textures as they finish, but you'd be doing useless work -- which is kind of what happens anyway with a DX11-style Present(0, 0).

Edit: see https://developer.nvidia.com/dx12-dos-and-donts

In my experience, waiting on 3 or 4 swap chain textures creates a very muddy experience, but YMMV.

  • If not in fullscreen state (true immediate independent flip mode) do control your latency and buffer count in your swap-chain carefully for the desired FPS and latency
    • Use IDXGISwapChain2::SetMaximumFrameLatency(MaxLatency) to set the desired latency
      • For this to work you need to create your swap-chain with the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag set.
    • A sync interval of 0 indicates that "the buffer I am presenting now is the newest buffer available next time composition happens" and discards all previous presents. However, the present does not go through until composition happens, which currently is only at VSync.
    • DXGI will start to block in Present() after you have presented MaxLatency-1 times
      • At the default latency of 3 this means that your FPS can’t go higher than 2 * RefreshRate. So for a 60Hz monitor the FPS can’t go above 120 FPS.
    • Try using about 1-2 more swap-chain buffers than you are intending to queue frames (in terms of command allocators and dynamic data and the associated frame fences) and set the "max frame latency" to this number of swap-chain buffers.
  • If not in fullscreen state (true immediate independent flip mode) consider using a waitable object swap-chain along with WaitForSingleObjectEx() to generate higher FPS
    • Please note that this will lead to some frames never being even partially visible, but it may be a good solution for benchmarking
    • Using the waitable object swapchain and GetFrameLatencyWaitableObject(), one can test if a buffer is available before rendering to it or presenting it – the following options are available:
    1. Use an additional off-screen surface
      • Render to the off-screen surface. Test the waitable object with timeout 0 to check if a buffer is available. If so copy to the swap-chain back buffer and Present(). If no buffer is available start the frame over again.
      • At the beginning of the frame, test the waitable object. If it succeeds, render to the available swapchain buffer. If it fails, render to the offscreen surface.
    2. Use a 3 or 4 buffer swapchain
      • Render directly to a back buffer. Before calling Present(), test the waitable object. If it succeeds, call Present(), if not, start over.
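A minimal sketch of the waitable-object swap-chain setup described in the quote might look like this. Factory, device, and window creation are omitted; `factory`, `directQueue`, and `hwnd` are assumed to exist, and the buffer count and latency values are just illustrative:

```cpp
// Create the swap chain with the waitable-object flag set.
DXGI_SWAP_CHAIN_DESC1 scDesc = {};
scDesc.BufferCount      = 3;
scDesc.Format           = DXGI_FORMAT_R8G8B8A8_UNORM;
scDesc.SwapEffect       = DXGI_SWAP_EFFECT_FLIP_DISCARD;
scDesc.BufferUsage      = DXGI_USAGE_RENDER_TARGET_OUTPUT;
scDesc.SampleDesc.Count = 1;
scDesc.Flags            = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT;

ComPtr<IDXGISwapChain1> swapChain1;
factory->CreateSwapChainForHwnd(directQueue.Get(), hwnd, &scDesc,
                                nullptr, nullptr, &swapChain1);

// Query IDXGISwapChain2 to get SetMaximumFrameLatency and the waitable object.
ComPtr<IDXGISwapChain2> swapChain;
swapChain1.As(&swapChain);
swapChain->SetMaximumFrameLatency(2);
HANDLE frameLatencyWaitable = swapChain->GetFrameLatencyWaitableObject();

// Per frame: a timeout of 0 just tests whether a back buffer is free.
DWORD result = WaitForSingleObjectEx(frameLatencyWaitable, 0, FALSE);
if (result == WAIT_OBJECT_0) {
    // A buffer is available: render to the back buffer and Present().
} else {
    // No buffer yet: render to an off-screen surface, or start the frame over.
}
```

This corresponds to option 1 in the quoted list; option 2 is the same test performed just before `Present()` on a 3- or 4-buffer swap chain.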

Yeah, that's a good point dingleberry. What I meant, though, was that my applications have not had to wait noticeably longer for the GPU to finish copying data from upload heaps to default heaps when using only a direct command queue for everything, rather than a direct queue alongside a copy queue, so I personally have not seen or needed the benefit you might get from utilizing a copy queue. I mentioned the FPS because the FPS cap is preventing me from seeing the actual difference, if any, between using a copy queue or not.

Oh yeah, I think you can use the Visual Studio graphics debugger to see how long various tasks take to execute on the various engines, kind of as if you put a timestamp query around every call. The suggestions in the quote would work too -- just remember your "actual" frame rate would be how fast you're rendering to the off-screen target, not how often you're presenting.

I've only recently begun working with the beautiful VS graphics debugger (god I love how convenient that thing is!). I didn't realize you could track timings in it, so thanks for pointing that out!

Honestly I use a direct queue for everything, including moving data to and from the GPU. The Present API only presents at multiples of the screen refresh rate (I haven't had luck getting unlocked FPS yet), and I get 120 FPS whether I use a direct queue or a copy queue. Unless you are moving a lot of data to and from the GPU, I personally feel the copy queue just makes things more complex than they need to be for the performance gain you might get from it.


It definitely depends on how much data you're moving around, and how long it might take the GPU to copy that data. The problem with using the direct queue for everything is that it's probably going to serialize with your "real" graphics work. So if you submit 15ms worth of graphics work for a frame and you also submit 1ms worth of resource copying on the direct queue, then your entire frame will probably take 16ms on the GPU. Using a copy queue could potentially allow the GPU to execute the copy while also concurrently executing graphics work, reducing your total frame time.
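The overlap described above might be sketched like this (all names are placeholders; the point is only the ordering of the submissions and the single GPU-side wait):

```cpp
// Kick off the upload for frame N+1 on the copy queue.
copyQueue->ExecuteCommandLists(1, uploadLists);
copyQueue->Signal(copyFence.Get(), ++copyFenceValue);

// Frame N's graphics work runs concurrently on the direct queue;
// it does not touch the resources being uploaded, so no wait is needed here.
directQueue->ExecuteCommandLists(1, frameNLists);

// Only frame N+1, which actually uses the uploaded data, waits for the copy.
directQueue->Wait(copyFence.Get(), copyFenceValue);
directQueue->ExecuteCommandLists(1, frameN1Lists);
```

If everything were submitted to the direct queue instead, the 1ms of copying would simply be appended to the 15ms of graphics work, which is MJP's point about serialization.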

This topic is closed to new replies.
