Management of CommandQueue/CommandAllocator/CommandList

16 comments, last by MJP 4 years, 6 months ago

Hello!

I need some guidance on CommandQueue/CommandAllocator/CommandList management. In my current project I have a few "systems" that need to execute graphical commands, such as rendering terrain, rendering water, rendering particles etc. Right now my project is very simple so I'm not even using command lists during initialization. However, that's starting to be required.

Currently I'm just using a single command queue with a ring buffer of 2 command allocators that a single command list records into. Each time I render the scene, a command allocator and the command list are reset and then recorded. After all commands have been recorded, the list is executed and the swap chain is flipped. Here's some pseudo-code:


void Initialize()
{
  [...]
  
  device->CreateCommandQueue(...);
  device->CreateCommandAllocator(...); // commandAllocator[0]
  device->CreateCommandAllocator(...); // commandAllocator[1]
  device->CreateCommandList(...);
  
  commandList->Close();
  
  [...]
}

void Render()
{
  WaitForPreviousFrame();
  
  commandAllocator[i]->Reset(); // i = swapChain->GetCurrentBackBufferIndex()
  commandList->Reset(...);
  
  RecordAllCommands();
  
  commandList->Close();
  commandQueue->ExecuteCommandLists(...);
  
  Signal(...);
  
  swapChain->Present(...);
}

The issue with this is that I cannot record commands during initialization, and with this design it's also quite cumbersome to execute command lists multiple times during one frame since the command allocator ring buffer is tied together with the swap chain buffer index. So I started to think about how I should redesign this, preferably also with future support for threading. And I've thought about it for quite some time now and can't come up with a good solution.

One idea is that each system should have its own command list with a ring buffer of 2 command allocators, record into it, and then use a global command queue to execute the list. This works well from a parallel point of view, but the issue is that each system now needs to check individually whether the GPU is done with the commands before resetting its command allocator. This feels like a huge waste of CPU time.

Another idea is that there is only one global command list, which is already available during the initialization of other systems. After initialization this command list gets executed, before entering the game loop. During the game loop, the global command list gets executed once per frame as I do it now. However, there are 2 issues with this. First of all, some systems might want to execute their commands earlier than at the end of each frame. Secondly, if multiple threads record into the same command list, then we might get a situation like this:


commandList->SetPipelineState(pipelineState1); // Thread 1 wants pipelineState1.
commandList->SetPipelineState(pipelineState2); // Thread 2 wants pipelineState2.
[...]
commandList->DrawInstanced(...); // Thread 1 expects pipelineState1 to be set...

I'm out of ideas of how to implement this in a simple and elegant way. Or maybe I'm doing this entirely wrong. Basically what I need is:

  • Systems should be able to record commands already during initialization.
  • At least during initialization, it should be possible to execute commands in multiple steps and even wait for the GPU to complete them.
  • When rendering the scene, it would be nice if multiple threads could record commands in parallel.

Do any of you have a good solution to this problem? What is the AAA game engine way of dealing with this?


First of all, make sure you signal your fence *after* you Present your swap chain. The Present causes a tiny bit of GPU work to get scheduled on the queue, and if you really want to make sure that your GPU has gone idle (for instance so you can delete everything during shutdown), then you want your fence to only signal after that bit of work for the swap chain has completed.

Some other suggestions/notes:

  • If you want multiple threads recording commands, then you need as many command lists and command allocators (x2 for double buffering) as you have threads running simultaneously. There's no way to have multiple threads write to a single command list simultaneously.
  • A simple thing you can do is to split your Render() function into RenderBegin() and RenderEnd(), and have these denote the start and end points of your frame where recording to command lists (and touching any GPU-accessible memory!) is allowed to happen. RenderBegin() would wait for the previous frame and get the command buffers ready, and RenderEnd() would submit the command buffers, present, and signal the fence. Then you can do things like RenderBegin() -> Kick off tasks for subsystems that record commands -> RenderEnd() without the subsystems needing to handle waiting on fences to know that they can use a command buffer.
  • I'm not sure I understand the problems you're having with issuing commands during initialization. A simple approach is to just have at least 1 command list ready that's already attached to a command allocator, and once initialization is done you can either submit that command list or just keep adding onto it when you render the frame (note that this may bloat the command allocator size if you do record a ton of commands during initialization). What sorts of things exactly are you trying to do during initialization that require recording commands? If it's for initializing GPU resource memory, then I would suggest creating a separate system for that. You generally need to handle that in a special-case way, and you'll also want to submit on the COPY queue when running on dedicated video cards.
  • One option you can consider is to wait on the previous frame immediately after submitting the current frame. This can make it harder to absorb transient GPU spikes since your wait is earlier, but on the upside you know for the entire next frame that your command buffers and GPU-accessible memory are ready to be written to.
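
For illustration, here is a minimal sketch of the RenderBegin()/RenderEnd() split suggested in the list above. It is not D3D12 code: the fence and queue are simulated with plain integers so that it compiles anywhere, and every name in it (FakeGpu, FrameContext, Renderer, kFramesInFlight) is hypothetical. The point it tries to show is that the per-frame slot is chosen by its own frame index and guarded by its own fence value, rather than being tied to the swap chain's back buffer index.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for a command queue plus a fence (hypothetical, not a D3D12 API).
struct FakeGpu {
    uint64_t completedValue = 0;        // what a real fence would report as completed
    uint64_t lastSignaled = 0;          // last value queued for signaling
    std::vector<std::string> executed;  // everything submitted so far

    void Execute(const std::vector<std::string>& cmds) {
        executed.insert(executed.end(), cmds.begin(), cmds.end());
    }
    uint64_t Signal() { return ++lastSignaled; }
    // A real implementation would block on a fence event; here the "GPU" simply catches up.
    void WaitFor(uint64_t value) {
        if (completedValue < value) completedValue = value;
    }
};

// One slot of per-frame recording state; stands in for allocator(s) + list(s).
struct FrameContext {
    uint64_t fenceValue = 0;            // value signaled when this slot was last submitted
    std::vector<std::string> commands;  // stands in for the recorded command lists
};

class Renderer {
public:
    static constexpr std::size_t kFramesInFlight = 2;

    // Waits until the GPU is done with the slot about to be reused, then resets it.
    FrameContext& RenderBegin() {
        FrameContext& frame = frames_[frameIndex_];
        gpu_.WaitFor(frame.fenceValue);
        frame.commands.clear();  // analogous to resetting the allocator and the list
        return frame;
    }

    // Submits everything recorded this frame, then signals the fence.
    void RenderEnd() {
        FrameContext& frame = frames_[frameIndex_];
        gpu_.Execute(frame.commands);
        // Present would be called here, before the signal.
        frame.fenceValue = gpu_.Signal();
        frameIndex_ = (frameIndex_ + 1) % kFramesInFlight;
    }

    FakeGpu& gpu() { return gpu_; }

private:
    FakeGpu gpu_;
    std::array<FrameContext, kFramesInFlight> frames_;
    std::size_t frameIndex_ = 0;
};
```

Subsystems would record between RenderBegin() and RenderEnd() without touching the fence themselves. In a real renderer each FrameContext would presumably own one allocator per recording thread.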

Thank you very much for your reply @MJP.

8 hours ago, MJP said:

First of all, make sure you signal your fence *after* you Present your swap chain.

I understand, will change this directly, thank you for letting me know.

8 hours ago, MJP said:

If you want multiple threads recording commands, then you need as many command lists and command allocators (x2 for double buffering) as you have threads running simultaneously. There's no way to have multiple threads write to a single command list simultaneously.

Multiple command lists can use the same command allocator. Is that a good idea and how would this work with threading?

Also, what is considered better practice, to update multiple subsystems in parallel on their own thread, or, to update the subsystems sequentially but using multiple threads to split up the internal work?

9 hours ago, MJP said:

I'm not sure I understand the problems you're having with issuing commands during initialization.

The main problem is that I can't make up my mind when it comes to the design. I like the approach of giving each subsystem their own command list and command allocators, but it bothers me that then each subsystem needs to check whether the GPU is done with the allocator or not. If I would do it like you suggested with a RenderBegin() and RenderEnd(), then I wouldn't need to check this, but at the same time, I would be limited to only calling execute once per subsystem per frame, right? Could this limitation lead to problems later?

8 hours ago, MJP said:

What sorts of things exactly are you trying to do during initialization that require recording commands?

I'm creating resources (constant buffers, textures, etc) and then uploading data through upload buffers. The main issue I'm having is that I can't release the upload buffers until the GPU is done with the copying, and in order to do that, I need to execute the command list, wait until the GPU has finished with it, and only then can I release it. From what I've understood, I cannot record commands that tell the GPU to release resources itself, right?

So this issue I'm having with command lists is quite related to the management of upload buffers, which I struggle with as well.

8 hours ago, MJP said:

If it's for initializing GPU resource memory, then I would suggest creating a separate system for that. You generally need to handle that in a special-case way, and you'll also want to submit on the COPY queue when running on dedicated video cards.

I'll check this out, thank you for the advice.

1 hour ago, fighting_falcon93 said:

The main problem is that I can't make up my mind when it comes to the design. I like the approach of giving each subsystem their own command list and command allocators, but it bothers me that then each subsystem needs to check whether the GPU is done with the allocator or not. If I would do it like you suggested with a RenderBegin() and RenderEnd(), then I wouldn't need to check this, but at the same time, I would be limited to only calling execute once per subsystem per frame, right? Could this limitation lead to problems later? 

Simply don't synchronise your worker CPU threads with the GPU at all! Only the one 'main' thread should do that. Instead, as mentioned by MJP, double-(or triple-)buffer your command lists (and allocators). It isn't such a terrible amount of memory that you'd have to "ration" it. Your subsystems will assume that the resources they are handed are safe to write from the CPU and that the GPU isn't touching them.

Also, do NOT use the same command allocator (or any allocator for that matter) on two different threads. Recording commands is a pretty 'intense' operation and if several threads do it at once, they compete for the mutex (or at least an atomic) that's making the allocator thread-safe. Just don't do it. Hand each thread its own allocator (from a pool or double-buffer at least).

[OT]If you really need one allocator to serve multiple threads, then for your own hand-written allocators, don't enter the critical section every time for every tiny allocation - give threads only bigger chunks which they'll be using for some time and only occasionally contend for the 'mutex' when their chunk is depleted. Heh, now that I read what I wrote, you still end up with a kinda allocator per-thread anyway :D ID3D12CommandAllocator isn't your hand-written allocator though, it will have a mutex inside. I reckon it's thread-unfriendly. [/OT]
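
A small sketch of the "one allocator and list per thread" idea, using only the standard library so it can be compiled without D3D12. RecordingContext and RecordFrameInParallel are made-up names; a RecordingContext stands in for one thread's command list and allocator pair. Each worker writes only to its own context, so no locking is needed during recording, and the main thread submits everything afterwards in a fixed order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Stand-in for one thread's command list + allocator pair (hypothetical type).
struct RecordingContext {
    std::vector<std::string> commands;
};

// Records `workerCount` contexts in parallel, one per thread, and returns the
// commands in submission order (context 0 first, then 1, and so on).
std::vector<std::string> RecordFrameInParallel(std::size_t workerCount) {
    std::vector<RecordingContext> contexts(workerCount);
    std::vector<std::thread> workers;
    workers.reserve(workerCount);

    for (std::size_t i = 0; i < workerCount; ++i) {
        // Each thread touches only contexts[i], so recording needs no lock.
        workers.emplace_back([&contexts, i] {
            RecordingContext& ctx = contexts[i];
            ctx.commands.push_back("SetPipelineState(pso" + std::to_string(i) + ")");
            ctx.commands.push_back("DrawInstanced(system" + std::to_string(i) + ")");
        });
    }
    for (std::thread& t : workers) t.join();

    // Single-threaded submission, analogous to one ExecuteCommandLists call
    // that receives the closed lists in a fixed order.
    std::vector<std::string> submitted;
    for (const RecordingContext& ctx : contexts) {
        submitted.insert(submitted.end(), ctx.commands.begin(), ctx.commands.end());
    }
    return submitted;
}
```

Because the state set by one thread lives in that thread's own list, the SetPipelineState interleaving problem from the first post cannot occur here.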

Thank you very much for your reply @pcmaster.

I'm not sure that I understand what you mean in the first part. If I understand you correctly, do you mean that I should do something like this:


void System::Render()
{
  index = swapChain->GetCurrentBackBufferIndex();
  
  WaitForPreviousFrame();
  
  commandAllocators[index][0]->Reset();
  commandAllocators[index][1]->Reset();
  commandAllocators[index][2]->Reset();
  
  commandLists[0]->Reset(commandAllocators[index][0], ...);
  commandLists[1]->Reset(commandAllocators[index][1], ...);
  commandLists[2]->Reset(commandAllocators[index][2], ...);
  
  SubSystem1::Render(commandLists[0]);
  SubSystem2::Render(commandLists[1]);
  SubSystem3::Render(commandLists[2]);
  
  commandLists[0]->Close();
  commandLists[1]->Close();
  commandLists[2]->Close();
  
  commandQueue->ExecuteCommandLists(3, commandLists);
  
  swapChain->Present(...);
  
  Signal(...);
}

void SubSystem1::Render(ID3D12GraphicsCommandList* commandList)
{
  commandList->[...];
  commandList->[...];
  commandList->[...];
}

void SubSystem2::Render(ID3D12GraphicsCommandList* commandList)
{
  commandList->[...];
  commandList->[...];
  commandList->[...];
}

void SubSystem3::Render(ID3D12GraphicsCommandList* commandList)
{
  commandList->[...];
  commandList->[...];
  commandList->[...];
}

My concern with this solution is that there will be problems if a subsystem wants to split its command recording across multiple threads, as that would require more than one command list. Another problem I can think of is a subsystem that wants to execute some commands before proceeding with the rest of its commands.

I'm also thinking about the initialization, because there I would need to use a different approach, since during initialization one subsystem might want to execute a command list, wait until it's done, and then record again, for example:


void SubSystem1::Initialize()
{
  CreateResource(...);
  CreateUploadBuffer(...);
  ExecuteCopyCommand(...);
  WaitForGPU(...);
  ReleaseUploadBuffer(...);
  DoSomethingWithResource(...);
}

I understand the part about command allocators and threads though, and I will make sure that each thread has its own command list and pair of command allocators so that they don't have to compete for the mutex.

Why would a SubSystem want to wait for a command list? I smell you're going to read something back immediately? :)

I'm thinking mostly of upload buffers. In that case, the subsystem needs to know when it's safe to release the upload buffer.

Does the upload buffer need to be released ASAP? Short on memory? Why not in 1-2 frames with everything else from the past frame(s)? Less sync, better life, do yourself the favour :)

No, memory is not a problem, but if I'm going to release it 1-2 frames later, that means that I need to store the pointers to the upload buffers somewhere, and on top of that, each frame I'd need to check if there's something to remove, which feels like a waste of performance when there's only cleanup to do after initialization. Basically I'd check an empty list every time I call the update function of a subsystem. Unless there's some better way of cleaning it up?

Think about it - nicely looping over a few hundred little pointers every frame on the CPU... that's NOTHING compared to synchronising with the GPU.

Once the fence signaled at the end of a frame has passed, each and every resource the GPU touched during that frame can be recycled safely. Obviously. In the meantime, record commands into an alternate set of buffers and with different resources (ping-pong, double-buffer, round-robin, ...) as discussed; the pseudo-code you posted is fine.
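
As a rough sketch of releasing upload buffers "1-2 frames later", the queue below keeps each retired buffer together with the fence value that must pass before it can be freed. UploadBuffer and DeferredReleaseQueue are hypothetical names, and UploadBuffer is only a counter-backed dummy; in a real renderer it would wrap the actual upload resource. Since fence values only increase, the per-frame check looks at the front of the queue and costs almost nothing when the queue is empty.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <memory>
#include <utility>

// Dummy upload buffer (hypothetical); counts live instances so releases are observable.
struct UploadBuffer {
    explicit UploadBuffer(int* liveCounter) : live(liveCounter) { ++*live; }
    ~UploadBuffer() { --*live; }
    UploadBuffer(const UploadBuffer&) = delete;
    UploadBuffer& operator=(const UploadBuffer&) = delete;
    int* live;
};

class DeferredReleaseQueue {
public:
    // Call after recording the copy. fenceValue is the value that will be
    // signaled once the GPU has finished the work that reads this buffer.
    void Retire(std::unique_ptr<UploadBuffer> buffer, uint64_t fenceValue) {
        // Buffers must be retired in non-decreasing fence order.
        assert(pending_.empty() || pending_.back().first <= fenceValue);
        pending_.emplace_back(fenceValue, std::move(buffer));
    }

    // Call once per frame with the fence's completed value. Only the front
    // of the queue is inspected, so an empty queue is a single comparison.
    void Collect(uint64_t completedFenceValue) {
        while (!pending_.empty() && pending_.front().first <= completedFenceValue) {
            pending_.pop_front();  // the unique_ptr destructor releases the buffer
        }
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    std::deque<std::pair<uint64_t, std::unique_ptr<UploadBuffer>>> pending_;
};
```

A subsystem's Initialize() could then record its copy, call Retire() and carry on, with no WaitForGPU() in between.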

You could have a pool of upload buffers. The thread asks the pool and gets a fresh and safe resource to work with. It gets 'marked' with the current frame counter when handed to the caller. Once its frame's fence signal has been seen some time in the future, all resources with that frame's counter (or older) are good to be reused. You don't even have to "return" them back to the pool. The only time the pool would BLOCK the caller is if it ran out of resources - but that should never happen if you inflate it enough. You won't need to iterate anything if you think about it. The pool is more a ring-buffer, really.
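
The pool described above could look roughly like the following. This is only a sketch under assumptions: UploadBufferRing and TryAcquire are invented names, slots are plain indices rather than real buffers, frame counters start at 1 (0 means "never used"), and instead of blocking when the ring is exhausted it simply reports failure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class UploadBufferRing {
public:
    explicit UploadBufferRing(std::size_t slotCount) : lastUsedFrame_(slotCount, 0) {
        assert(slotCount > 0);
    }

    // Hands out the next slot in ring order if the GPU has finished the frame
    // that last used it. Returns false when that slot is still in flight; a
    // real pool would block or grow at this point.
    bool TryAcquire(uint64_t currentFrame, uint64_t completedFrame, std::size_t& slot) {
        if (lastUsedFrame_[next_] > completedFrame) {
            return false;
        }
        slot = next_;
        lastUsedFrame_[slot] = currentFrame;  // 'mark' the slot with the current frame counter
        next_ = (next_ + 1) % lastUsedFrame_.size();
        return true;
    }

private:
    std::vector<uint64_t> lastUsedFrame_;  // frame in which each slot was last handed out
    std::size_t next_ = 0;                 // next slot to try, in ring order
};
```

Nothing is ever returned to the pool and nothing is iterated: a slot becomes reusable purely because the completed frame counter has moved past the value it was marked with.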

The same mechanism can be used for constant buffers if you assemble them somehow dynamically.
