[D3D12] About CommandList, CommandQueue and CommandAllocator

8 comments, last by nbertoa 8 years ago

I read D3D12 Documentation and I am reading Frank Luna's book about DirectX12.

One thing I do not understand is the following:

When you record commands in an ID3D12CommandList, you are really writing commands into memory owned by an ID3D12CommandAllocator. This memory should live in CPU memory (some place in heap memory (RAM)), for performance reasons.

When you execute an ID3D12CommandList, you actually send the ID3D12CommandAllocator's memory from CPU to GPU memory (CommandQueue memory, I think) so the GPU can consume these commands once it reaches them in the queue.

The question is: if the ID3D12CommandList writes commands into an ID3D12CommandAllocator that lives in CPU memory before we execute that command list, why can't we record new commands into the same ID3D12CommandList, using the same ID3D12CommandAllocator, given that we are only writing to CPU memory?

Are any of my assumptions incorrect? Am I missing some architectural decision?

Thanks in advance!


You can definitely re-use command lists, as well as command allocators, albeit at different paces.

You can think of a command list as providing the interface to write instructions to the command allocator. The command allocator is basically just a chunk of memory that lives somewhere. It doesn't really matter exactly where - that's for the driver implementer to decide. When you submit a command list to the command queue, you're basically saying "this is the next chunk of work for the GPU to complete and here's the chunk of memory that contains all the instructions". Even if the GPU immediately takes that work on (it probably won't), you still need that chunk of memory containing the instructions to be available the entire time they're being processed and executed.

So for command lists, you can reset them as soon as you have submitted them to the command queue and immediately begin to reuse them, because all you're really resetting is the command allocator they're using. However, you must not reset the command allocator that's just been submitted until you are completely sure that all GPU work has completed. Once you're sure the GPU has finished working on the tasks submitted to it, then you're free to cycle back and reset/reuse that allocator.

I'd recommend searching this site for ID3D12Fence. I'm sure I've seen a few posts over the past few months that have touched on this topic and show how to synchronize command list and allocator usage across multiple frames. To get you started, though, you'll need to think in terms of having enough allocators to cover however many frames of latency you'll be working with.

Thanks for the detailed explanation. Yes, I know you can reuse the same cmdList, but you need to be sure about the cmdAllocator first, or switch to another one. I think I cannot assume any particular behavior (CPU or GPU memory) because that will depend on the GPU architecture and driver implementation. In my case, I try to avoid fences when possible.

There is no implied copy from CPU to GPU memory when you submit a command list. GPUs are perfectly capable of reading from CPU memory across PCIe, and on some systems the CPU and GPU may even share the memory.

Hi MJP,

If CPU/GPU memory is not unified, there is a latency the GPU must pay to read data from the CPU over PCI Express. That latency means the GPU could sit idle waiting for the new data, which is not desirable.


Let's say it looks like:

CPU<-->"CPU RAM"<-->PCIe<-->GPU Command Processor<-->GPU Compute Units

Let's also say that:

* it takes the command processor 1 µs to fetch a draw command and schedule that work for the GPU compute units.

* the compute cores take 2 µs to actually execute the draw command.

* these two parts of the GPU operate in parallel.

^^^ In this situation, the PCIe latency is not an issue whatsoever; the GPU will not idle. While the GPU compute cores are busy executing Draw#1, the command processor easily has enough time to fetch and prepare Draw#2.

n.b. if your draw-calls take 2 µs each on average, that means you can be performing over 8000 draw-calls in a 60Hz frame. That's a lot; many games won't use anywhere near that many draw commands per frame. So a lot of the time it may be more like 1 µs to fetch the next command, in parallel with 20 µs of execution of the previous command!

PCIe latency only becomes a problem if your draw-calls are so small that they take a fraction of a microsecond each, and you're trying to perform 100k of these tiny draw-calls per frame. E.g. if the CP takes 1 µs to fetch/prepare a draw command, but the CUs execute it in 0.5 µs, then the GPU will begin to idle due to the slow command fetching. The solution here is to make each draw-call contain more work, and use fewer of them :wink:

Nikko_Bertoa: I'm thinking about getting Frank Luna's DirectX 12 book. Would you recommend it? :).

@spinningcubes | Blog: Spinningcubes.com | Gamedev notes: GameDev Pensieve | Spinningcubes on Youtube

@Hodgman:

But I think you are assuming you have the whole bus to yourself; in that scenario, latency might not be a problem. In practice, though, your app (or another app running simultaneously) could also be sending resources like buffers, textures, etc. over that bus. This should be handled by the GPU scheduler, so your bandwidth/latency will not be the same as if the bus were all yours.

After reading all your answers, I think I found the answer to my question. If CPU/GPU memory is unified (like on consoles), then the CmdAllocator memory is the same memory that the GPU references, so you cannot write to it until you are sure (through a fence, I think) that all the commands in your allocator have been processed by the GPU. That is why the API has that restriction.

@Spinningcubes:

I am in Chapter 6, and reading it alongside the documentation, GDC presentations, and forums. So far, I think the book is useful and understandable.

GPUs can also pre-fetch command buffer memory in order to hide any latency. Pre-fetching is easy in this case because the front-end will typically just march forward, since jumps are not common (unless you're talking about the PS3 :D)

Thanks MJP, it's good to know that too.

This topic is closed to new replies.
