
D3D12: Copy Queue and ResourceBarrier

This topic is 671 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


I'm looking at creating queues for each of the following command list types:

 

  • D3D12_COMMAND_LIST_TYPE_DIRECT
  • D3D12_COMMAND_LIST_TYPE_COMPUTE
  • D3D12_COMMAND_LIST_TYPE_COPY

I've created a setup that uses a DIRECT queue and primarily just renders geometry.  Now I want to set up a queue for copying data from Upload Heaps to Default Heaps.  This works fine as long as I only make copy calls with the queue (CopyBufferRegion or CopyTextureRegion).  But if I try to call ResourceBarrier on it, the program crashes.  This seems to jibe with the sparse documentation on COPY queues: "COPY queues and command lists accept only copy commands."

 

So... say I'm copying vertex data.  Should I transition the resource from D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER to D3D12_RESOURCE_STATE_COPY_DEST on the DIRECT queue, then use the COPY queue to do the copy, and then transition back on the DIRECT queue?  Is the COPY queue even optimized for copies anyway?  I'm guessing that it is, since it has its own type, but the documentation on it is a bit thin.  And having to transition the resource on 2 separate queues seems cumbersome...

The copy queue 'could' be optimised; it is a bit hardware specific.

For example, on Intel I doubt you'd get any payback. AMD, however, has dedicated DMA hardware on its GPUs, so copying can be handled separately from other operations. (The same goes for compute queues: GCN has up to 8 hardware queues, each servicing up to 8 software queues, although, if memory serves, you can currently only create one unique queue per type with D3D12.)

I have the exact same problem as you!

The documentation is super sparse with regard to this. It says that to execute copy commands on the copy queue, resources need to be in different states than when the copy is submitted through the 3D/compute queues: resources used on the copy queue need to start in D3D12_RESOURCE_STATE_COMMON. Here's a link to the documentation: https://msdn.microsoft.com/en-us/library/windows/desktop/dn899217(v=vs.85).aspx (have a look at "Multi-Queue resource access").

Even so, I'm doing what that documentation says, but I'm still getting a crash on the resource barrier command too...

And yes, the copy queue is supposed to be optimised for copying, at least with regard to CPU overhead, except on Intel integrated graphics chips.


That is a much more informative page than anything I'd found, thanks!  Interestingly, I don't get a crash on ResourceBarrier.  I get an error when I call Close on the command list I tried to use for the copy, with it returning E_INVALIDARG.  This only happens if I put a ResourceBarrier in the commands though.



The copy queue doesn't seem to support all resource states. I found it is safe to use the COMMON, COPY_SOURCE and COPY_DEST states on the copy queue. Also, if I understand the "promotable flags" part of the MSDN article correctly, a resource barrier isn't necessary on the copy queue if resources are in the COMMON state.
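A minimal C++ model of the rule described above, for illustration only. The ResourceState enum and CopyListModel type are hypothetical stand-ins, not D3D12 types, and the real runtime and debug layer check far more than this. It encodes two things reported in this thread: a transition barrier recorded on a copy queue may only name COMMON, COPY_SOURCE or COPY_DEST, and the failure only shows up when the command list is closed (E_INVALIDARG from Close()).

```cpp
#include <cassert>

// Hypothetical subset of D3D12_RESOURCE_STATES (the real enum uses bit flags).
enum class ResourceState {
    Common,
    CopySource,
    CopyDest,
    VertexAndConstantBuffer,
    UnorderedAccess,
};

// The state set reported as safe in copy-queue barriers.
bool allowedOnCopyQueue(ResourceState s) {
    switch (s) {
        case ResourceState::Common:
        case ResourceState::CopySource:
        case ResourceState::CopyDest:
            return true;
        default:
            return false;
    }
}

// Models the observed behaviour: an invalid barrier is accepted while
// recording, and the error is reported later, when the list is closed.
struct CopyListModel {
    bool valid = true;

    void resourceBarrier(ResourceState before, ResourceState after) {
        if (!allowedOnCopyQueue(before) || !allowedOnCopyQueue(after)) {
            valid = false;
        }
    }

    // true stands for S_OK, false for the E_INVALIDARG seen in this thread.
    bool close() const { return valid; }
};
```

With this model, recording COPY_DEST -> VERTEX_AND_CONSTANT_BUFFER makes close() return false, while COMMON -> COPY_DEST closes cleanly.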


I still can't get it to work.  I'm attempting to use it to upload vertex data.  I created a committed resource, in a DEFAULT heap, initialized as a COPY_DEST.

 

Then I created a second resource, in an UPLOAD heap, initialized as a COPY_SOURCE (or as GENERIC_READ, or COMMON, it didn't make a difference).

 

I made the call to CopyBufferRegion, but as soon as I add a resource barrier to transition from COPY_DEST to VERTEX_AND_CONSTANT_BUFFER, I get the error when I close the list.  I tried transitioning to COMMON (so that I'm only using COPY_SOURCE, COPY_DEST, and COMMON), and I get the same error.

Edited by Funkymunky



"I found it is safe to use COMMON, COPY_SOURCE and COPY_DEST states in copy queue."

 

Are you sure? In the MSDN article it states that, "The COPY flags (COPY_DEST and COPY_SOURCE) used as initial states represent states in the 3D/Compute type class. To use a resource initially on a Copy queue it should start in the COMMON state". So I assume both resources that are involved in a copy must start in D3D12_RESOURCE_STATE_COMMON.

 

I never really read that part on promotion of resource state that thoroughly, but having a look at it now, does it mean the resource barrier command is not necessary at all (as per your post)? When I try to access this resource on the 3D/copy queue, it'll implicitly be promoted to the required state (if it's supported)?


 



Sloppy phrasing on my side. I meant these states are safe in resource barriers on the copy queue.

I just compiled and tested my program without any resource barriers on the copy queue, and it still works fine without warnings from the debug layer. Then I tried to transition a texture to UNORDERED_ACCESS (just to check that this portion of code gets executed), and it failed with E_INVALIDARG, as expected. So it seems promotion works as advertised.

 

EDIT: I tested the program on GTX 980, HD 4600, R7 360, Microsoft Basic Render Driver.
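A simplified C++ model of why no barrier is needed, based on a reading of the implicit state transition rules ("common state promotion" and "state decay to common") in Microsoft's D3D12 resource barrier documentation. Treat it as an assumption-laden sketch, not the runtime's logic; all names below are hypothetical. Buffers in COMMON promote to any state on first use; textures without simultaneous access promote only to the two shader-resource states and the two copy states; anything used on a copy queue, and any buffer, decays back to COMMON when its ExecuteCommandLists batch completes. Simultaneous-access textures and read-only promotions also decay in the real rules but are not modelled here.

```cpp
#include <cassert>

// Hypothetical subset of D3D12_RESOURCE_STATES (the real values are bit flags).
enum class State {
    Common,
    CopySource,
    CopyDest,
    VertexAndConstantBuffer,
    PixelShaderResource,
    NonPixelShaderResource,
    UnorderedAccess,
};

enum class QueueType { Direct, Compute, Copy };

struct ResourceModel {
    explicit ResourceModel(bool buffer) : isBuffer(buffer) {}
    bool isBuffer;                // buffers follow the more permissive rules
    State state = State::Common;  // created in COMMON in this sketch
};

// Promotion is only considered out of COMMON here.
bool canPromote(const ResourceModel& r, State wanted) {
    if (r.state != State::Common) return false;
    if (r.isBuffer) return true;
    switch (wanted) {
        case State::NonPixelShaderResource:
        case State::PixelShaderResource:
        case State::CopySource:
        case State::CopyDest:
            return true;
        default:
            return false;
    }
}

// First use in a batch with no barrier recorded: the resource must already
// be in the wanted state or be implicitly promotable to it.
bool accessWithoutBarrier(ResourceModel& r, State wanted) {
    if (r.state == wanted) return true;
    if (!canPromote(r, wanted)) return false;  // an explicit barrier is needed
    r.state = wanted;
    return true;
}

// Decay once the ExecuteCommandLists batch has finished on the GPU.
void onBatchComplete(ResourceModel& r, QueueType queue) {
    if (queue == QueueType::Copy || r.isBuffer) r.state = State::Common;
}
```

Under these assumptions a vertex buffer written on the copy queue ends up back in COMMON and can be promoted to VERTEX_AND_CONSTANT_BUFFER on the direct queue without an explicit barrier. That is consistent with the result above, but worth confirming with the debug layer.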

Edited by red75prime


Thanks, red75prime, I was able to get my copy queue working!

 

Funkymunky, try the resource barrier, but transition from D3D12_RESOURCE_STATE_COMMON to D3D12_RESOURCE_STATE_COPY_DEST (you need to create the default heap resource in the D3D12_RESOURCE_STATE_COMMON state).


Okay, so if I create the Upload Heap resource with D3D12_RESOURCE_STATE_GENERIC_READ, the Default Heap resource with D3D12_RESOURCE_STATE_COPY_DEST, I can then transition the resource to D3D12_RESOURCE_STATE_COMMON or D3D12_RESOURCE_STATE_COPY_DEST without it failing.

 

But when I want to use it in the Direct Queue, don't I need to first transition it to a D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER?  So I have to use a fence on the Copy Queue, check that it's done, and then transition the resource on the Direct queue?


1. create upload heap (D3D12_RESOURCE_STATE_GENERIC_READ state)

2. create default heap (D3D12_RESOURCE_STATE_COPY_DEST state)

3. fill upload heap with data

4. copy data from upload heap to default heap (UpdateSubresources() records the copy command into the command list you pass it; the command memory lives in that list's command allocator)

5. transition default heap from copy dest state to vertex and constant buffer state (next command in command list after UpdateSubresources)

6. execute command list

7. update fence value

8. queue a Signal command on the command queue, which sets the fence to the new value on the GPU once everything submitted before it has executed

9. check the fence's completed value to make sure the Signal command was executed; at that point you know the command list above has finished executing.

 

You have to update the fence value; otherwise you may end up trying to access the data in the default heap before it's finished copying.
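A single-threaded C++ sketch of steps 6 to 9, showing why the fence check matters: work submitted to a queue has not run yet when the CPU call returns. QueueModel, waitForGpu and uploadAndWait are hypothetical names. Real code would read ID3D12Fence::GetCompletedValue and, if it is too low, block with SetEventOnCompletion plus WaitForSingleObject, rather than stepping a fake GPU from the CPU as this toy does.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a command queue plus its fence: submitted work only
// runs when the "GPU" is stepped.
class QueueModel {
public:
    void execute(std::function<void()> work) { pending_.push_back(std::move(work)); }

    // Like a queued Signal: the fence changes when the GPU reaches this point.
    void signal(std::uint64_t value) {
        pending_.push_back([this, value] { completed_ = value; });
    }

    std::uint64_t completedValue() const { return completed_; }

    bool gpuStep() {
        if (pending_.empty()) return false;
        auto work = std::move(pending_.front());
        pending_.pop_front();
        work();
        return true;
    }

private:
    std::deque<std::function<void()>> pending_;
    std::uint64_t completed_ = 0;
};

// Steps 7-9: bump the fence value, queue a Signal, wait until it is reached.
void waitForGpu(QueueModel& queue, std::uint64_t& fenceValue) {
    ++fenceValue;
    queue.signal(fenceValue);
    while (queue.completedValue() < fenceValue && queue.gpuStep()) {}
}

// Steps 3-9 end to end, with vectors standing in for the two heaps.
std::vector<int> uploadAndWait(QueueModel& queue, std::uint64_t& fenceValue,
                               const std::vector<int>& upload) {
    std::vector<int> defaultHeap(upload.size(), 0);
    queue.execute([&] { defaultHeap = upload; });  // copy submitted, not yet run
    waitForGpu(queue, fenceValue);                 // only now is it safe to read
    return defaultHeap;
}
```

Reading the destination right after execute() but before waitForGpu() returns the stale zeros, which is exactly the hazard described above.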

Edited by iedoc


iedoc, step 5 is the problem.  You can't do that transition on a Copy Queue.  So my question is really, do I have to do the fence on the Copy Queue, and then transition to D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER on the Direct Queue.  Which it looks like I do.

Edited by Funkymunky



That makes sense. A copy queue is an abstraction over a DMA controller, which really can't do anything useful besides memcpy... however, resource transitions are abstractions over cache invalidations, cache flushes, and data format/packing/pitch/swizzling transformations -- which a DMA unit might not be able to perform. So it makes sense that your "graphics" queue would wait for a signal that the DMA task has completed, and then perform these "transition" tasks itself.
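A toy C++ scheduler to illustrate that ordering. In real D3D12 code the two fence operations would be ID3D12CommandQueue::Signal on the copy queue and ID3D12CommandQueue::Wait on the direct queue, both given the same ID3D12Fence and value. Everything below (Step, the helper functions, runQueues and its round-robin stepping) is hypothetical and models only queue ordering, not real GPU execution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical queue step: real work, a fence Signal, or a fence Wait.
struct Step {
    enum Kind { Work, Signal, Wait };
    Kind kind;
    std::string name;     // label for Work steps
    std::uint64_t value;  // fence value for Signal/Wait
};

Step workStep(std::string name) { return Step{Step::Work, std::move(name), 0}; }
Step signalStep(std::uint64_t v) { return Step{Step::Signal, "", v}; }
Step waitStep(std::uint64_t v) { return Step{Step::Wait, "", v}; }

// Steps every queue round-robin, one step per queue per pass, to mimic
// independent engines sharing one fence. A Wait stalls only its own queue
// until the fence reaches the value. Work steps are logged in order.
std::vector<std::string> runQueues(const std::vector<std::vector<Step>>& queues) {
    std::uint64_t fence = 0;
    std::vector<std::size_t> next(queues.size(), 0);
    std::vector<std::string> log;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t q = 0; q < queues.size(); ++q) {
            if (next[q] >= queues[q].size()) continue;
            const Step& s = queues[q][next[q]];
            if (s.kind == Step::Wait && fence < s.value) continue;  // stalled
            if (s.kind == Step::Work) log.push_back(s.name);
            if (s.kind == Step::Signal && s.value > fence) fence = s.value;
            ++next[q];
            progressed = true;
        }
    }
    return log;  // a queue still stalled at this point would be a deadlock
}
```

With the direct queue listed first and starting with waitStep(1), the copy always lands before the VERTEX_AND_CONSTANT_BUFFER barrier; drop the wait and the barrier runs first, before the copy has finished.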


Honestly, I use a direct queue for everything, including moving data to and from the GPU. The Present API only presents at multiples of the screen refresh rate (I haven't had luck getting unlocked FPS yet), and I get 120 FPS whether I use a direct queue or a copy queue. Unless you are moving a lot of data to and from the GPU, I personally feel the copy queue makes things more complex than they need to be for the performance gain you might get from it.

 

Anyway, if you are going to use a copy queue, you would still definitely have to use a fence like Hodgman said, because you need to make sure that the data has finished copying before using or modifying it (which includes changing its state; from what I understand, the GPU could actually physically move the data when changing states, but someone please correct me if I'm wrong about that). Not that this bit is that helpful, but you can still use the data as vertex/index resources if you leave the default heap in a copy dest state. Of course, the GPU can optimize certain things depending on the state the resource is in, so you'll want to make sure it's in the proper state before you do anything with it.

Edited by iedoc


A copy queue shouldn't have much to do with your FPS, should it? That has more to do with your swap chain buffer count. You could churn out frames to a render target and copy them to your swap chain buffers as they finish, but you'd be doing useless work, which is roughly what happens anyway with a DX11-style Present(0, 0).

Edit: see https://developer.nvidia.com/dx12-dos-and-donts

In my experience, waiting on 3 or 4 swap chain buffers creates a very muddy experience, but YMMV.

 

 

  • If not in fullscreen state (true immediate independent flip mode) do control your latency and buffer count in your swap-chain carefully for the desired FPS and latency
    • Use IDXGISwapChain2::SetMaximumFrameLatency(MaxLatency) to set the desired latency
      • For this to work you need to create your swap-chain with the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag set.
    • A sync interval of 0 indicates that "the buffer I am presenting now is the newest buffer available next time composition happens" and discards all previous presents. However, the present does not go through until composition happens, which currently is only at VSync.
    • DXGI will start to block in Present() after you have presented MaxLatency-1 times
      • At the default latency of 3 this means that your FPS can’t go higher than 2 * RefreshRate. So for a 60Hz monitor the FPS can’t go above 120 FPS.
    • Try using about 1-2 more swap-chain buffers than you are intending to queue frames (in terms of command allocators and dynamic data and the associated frame fences) and set the "max frame latency" to this number of swap-chain buffers.
  • If not in fullscreen state (true immediate independent flip mode) consider using a waitable object swap-chain along with WaitForSingleObjectEx() to generate higher FPS
    • Please note that this will lead to some frame never being even partially visible, but may be a good solution for benchmarking
    • Using the waitable object swapchain and GetFrameLatencyWaitableObject(), one can test if a buffer is available before rendering to it or presenting it – the following options are available:
    1. Use an additional off-screen surface
      • Render to the off-screen surface. Test the waitable object with timeout 0 to check if a buffer is available. If so copy to the swap-chain back buffer and Present(). If no buffer is available start the frame over again.
      • At the beginning of the frame, test the waitable object. If it succeeds, render to the available swapchain buffer. If it fails, render to the offscreen surface.
    2. Use a 3 or 4 buffer swapchain
      • Render directly to a back buffer. Before calling Present(), test the waitable object. If it succeeds, call Present(), if not, start over.

 

Edited by Dingleberry


Yeah, that's a good point, Dingleberry. What I meant, though, was that my applications have not had to wait noticeably longer for the GPU to finish copying data from upload heaps to default heaps when using only a direct command queue for everything, rather than a direct queue alongside a copy queue, so I personally have not seen or needed the benefit you might get from a copy queue. I mentioned the FPS because the FPS cap is preventing me from seeing the actual difference, if any, between using a copy queue or not.
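The 120 FPS figure matches the NVIDIA note quoted above if the monitor runs at 60Hz with the default frame latency of 3. A tiny C++ helper for that arithmetic; windowedFpsCap is a hypothetical name, and the general formula (MaxLatency - 1) * RefreshRate is an extrapolation from the single case the note states, so treat it as an assumption.

```cpp
#include <cassert>

// Upper bound on presents per second when not in true fullscreen.
// Assumption: generalised from the NVIDIA note, which only states the
// MaxLatency == 3 case (FPS <= 2 * RefreshRate).
int windowedFpsCap(int maxFrameLatency, int refreshHz) {
    return (maxFrameLatency - 1) * refreshHz;
}
```

For example, windowedFpsCap(3, 60) gives the 120 FPS ceiling seen above.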

Share this post


Link to post
Share on other sites

Oh yeah, I think you can use the Visual Studio graphics debugger to see how long various tasks take to execute on the different engines, kind of as if you had put a timestamp query around every call. The suggestions in the quote would work too; just remember that your "actual" frame rate would be how fast you're rendering to an off-screen target, not how often you're presenting.

Edited by Dingleberry


I've only recently begun working with the beautiful VS graphics debugger (god, I love how convenient that thing is!). I didn't realize you could track times in it, so thanks for pointing that out!


"Unless you are moving a lot of data to and from the GPU, I personally feel the copy queue just makes things more complex than they really need to be for how much performance gain you might get with it."


It definitely depends on how much data you're moving around, and how long it might take the GPU to copy that data. The problem with using the direct queue for everything is that it's probably going to serialize with your "real" graphics work. So if you submit 15ms worth of graphics work for a frame and you also submit 1ms worth of resource copying on the direct queue, then your entire frame will probably take 16ms on the GPU. Using a copy queue could potentially allow the GPU to execute the copy while also concurrently executing graphics work, reducing your total frame time.
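The arithmetic in that example as a tiny C++ sketch (hypothetical function names). The overlapped figure is a best case: it assumes the copy engine runs fully in parallel and does not compete with the graphics work for memory bandwidth, which real hardware won't always manage.

```cpp
#include <algorithm>
#include <cassert>

// GPU frame time in ms when the copy is submitted on the direct queue:
// it serialises with the graphics work.
double frameTimeSameQueue(double graphicsMs, double copyMs) {
    return graphicsMs + copyMs;
}

// Idealised frame time with a separate copy queue: full overlap assumed.
double frameTimeWithCopyQueue(double graphicsMs, double copyMs) {
    return std::max(graphicsMs, copyMs);
}
```

With 15ms of graphics work and 1ms of copying this gives 16ms versus 15ms, the saving described above.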
