DX12 Multithreaded Rendering Architecture


I have two questions.

As I understand it, you generally want to keep your GPU a frame or two behind your CPU. While your GPU is rendering a frame, the CPU is generating the draw calls for the next frame, so that there isn't a bottleneck between the two.

1) How does this work in practical terms? Is the CPU side, "generating the draw calls", just building the command lists with calls to DrawIndexedInstanced and the like? And then to actually perform the rendering, the GPU side, you call ExecuteCommandLists?

2) In terms of multi-threaded rendering, is that a misnomer? Are the other threads just generating draw calls, with the main rendering thread being the only thing that actually calls ExecuteCommandLists? Or can you simultaneously render to various textures, and then your main rendering thread uses them to generate a frame for the screen?

1) Basically, yes. You'll want to build everything up front (command lists, CPU-side buffer updates, copy operations) before pushing it to the GPU to execute, letting the GPU chew on that data while you set up the next things for it to render. You can overlap the two, of course; it doesn't have to be a strict [generate everything][push everything to GPU] sequence, and you can execute command lists as things finish up. So you could generate, say, all the shadow map command lists, push those to the GPU, and then generate the colour passes. (In fact you could dedicate one 'task' to pushing while at the same time starting to generate the colour pass lists.)
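A minimal sketch of that shadow-then-colour overlap, assuming one command list (recorded from its own allocator) per worker; RecordShadowList and RecordColourList are hypothetical stand-ins for real pass recording:

```cpp
#include <d3d12.h>
#include <thread>
#include <vector>

// Hypothetical per-pass recording functions. Each worker records into its own
// command list, since ID3D12GraphicsCommandList and ID3D12CommandAllocator
// are not thread safe.
void RecordShadowList(ID3D12GraphicsCommandList* cl) {
    // ... SetPipelineState / OMSetRenderTargets / DrawIndexedInstanced ...
    cl->Close();  // a list must be closed before ExecuteCommandLists
}
void RecordColourList(ID3D12GraphicsCommandList* cl) {
    // ... colour pass recording ...
    cl->Close();
}

void RenderFrame(ID3D12CommandQueue* queue,
                 std::vector<ID3D12GraphicsCommandList*>& shadowLists,
                 std::vector<ID3D12GraphicsCommandList*>& colourLists)
{
    // Record all the shadow map lists in parallel.
    std::vector<std::thread> workers;
    for (auto* cl : shadowLists)
        workers.emplace_back(RecordShadowList, cl);
    for (auto& t : workers)
        t.join();

    // Push the shadow work to the GPU straight away...
    std::vector<ID3D12CommandList*> batch(shadowLists.begin(), shadowLists.end());
    queue->ExecuteCommandLists(static_cast<UINT>(batch.size()), batch.data());

    // ...while the CPU carries on recording the colour passes.
    batch.clear();
    for (auto* cl : colourLists) {
        RecordColourList(cl);
        batch.push_back(cl);
    }
    queue->ExecuteCommandLists(static_cast<UINT>(batch.size()), batch.data());
}
```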

2) Yes and no.
Generally it's accepted to mean the first part: draw calls are generated across multiple threads, then queued as work by a single thread (or task) to ensure correct ordering.
That said, if you can keep your dependencies in order, there is nothing stopping you queuing work from multiple threads, although I'd have to check the thread safety of the various command queues to see what locks/protection you might need.

However, your 'render to various textures' question brings up a second part: the GPU is itself highly threaded, so even if you have one thread pushing execute commands, the GPU can have multiple commands in flight at once (dependencies allowing). So regardless of how you queue work to the device, it can be doing multiple things at the same time.

1) How does this work in practical terms?  Is the CPU side, "generating the draw calls", just building the command lists with calls to DrawIndexedInstanced and the like?  And then to actually perform the rendering, the GPU side, you call ExecuteCommandLists?

That's not exactly how it works. Calling ExecuteCommandLists() is closer to the equivalent of calling draw() on an immediate context in the old model. It's like a very efficient draw(), because the hard work has supposedly been done already.

What happens next is that the driver/OS queues those calls (draw() then, ExecuteCommandLists() now), and the GPU processes them in the order they were received. That's the queue you're concerned about.

So you don't really have to do anything to make the GPU run behind the CPU. It is already behind, and the more GPU-bound you are, the further behind the CPU it will be. (If you are totally CPU-bound, the GPU is zero steps behind the CPU.)

What you can do is limit how far ahead the CPU gets, by waiting on the CPU side for the GPU to pass a certain point before submitting commands again (you can use fences for this). There are two reasons to limit how far ahead the CPU runs. The first is to save memory: you have to keep resources alive for as long as the GPU might use them, so the more commands in flight, the bigger the buffers have to be. We're not talking about vsync and double buffering here (that sync is between the GPU and the screen) but about constant buffer updates, dynamic vertex data, render states and so on (that sync is between the CPU and the GPU). The second is to limit latency: if you record commands too early, the player sees the result of their actions a long time after making them.
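A hedged sketch of that fence-based throttling; kMaxFramesInFlight, EndFrame and the fence/event objects are illustrative assumptions, not anything prescribed by the thread:

```cpp
#include <d3d12.h>

// Allow the CPU to run at most this many frames ahead of the GPU.
static const UINT64 kMaxFramesInFlight = 2;

static UINT64 g_fenceValue = 0;  // last value signalled on the queue

void EndFrame(ID3D12CommandQueue* queue, ID3D12Fence* fence, HANDLE fenceEvent)
{
    // Signal the fence after this frame's ExecuteCommandLists calls,
    // so its completed value tells us which frames the GPU has finished.
    ++g_fenceValue;
    queue->Signal(fence, g_fenceValue);

    // If the GPU has fallen more than kMaxFramesInFlight frames behind,
    // block the CPU here. This also bounds how long per-frame resources
    // (constant buffers, dynamic vertex data, ...) must stay alive.
    if (g_fenceValue >= kMaxFramesInFlight) {
        const UINT64 mustComplete = g_fenceValue - (kMaxFramesInFlight - 1);
        if (fence->GetCompletedValue() < mustComplete) {
            fence->SetEventOnCompletion(mustComplete, fenceEvent);
            WaitForSingleObject(fenceEvent, INFINITE);
        }
    }
}
```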

TL;DR: what I wanted to say is that ExecuteCommandLists() is not a signal for the GPU to go ahead immediately. There are more queues involved, and that's what is meant by letting the GPU run behind the CPU.

2) In terms of multi-threaded rendering, is that a misnomer?  Are the other threads just generating draw calls, with the main rendering thread being the only thing that actually calls ExecuteCommandLists?  Or can you simultaneously render to various textures, and then your main rendering thread uses them to generate a frame for the screen?

I don't think anybody who knows how things work actually thinks that. Your GPU mostly accepts work in a serial manner (while processing it in a massively parallel way). Note: in D3D12 you can also submit work to separate engines, which begin and end work on separate queues but target the same processing units.

The word "multithreaded" of course refers to the CPU-side building of commands. Building those commands is expensive; submitting them is less so. If you factor out the building, you can reduce the serial cost to mostly "submitting" and do the building in parallel (the multithreaded part). In older APIs like D3D9 and D3D10 (discounting multithreaded D3D11, which was supposed to work more like D3D12 does today but didn't, for a variety of reasons), building and submitting were basically one and the same, and because submitting had to happen in a serialized fashion, you couldn't get much advantage out of using multiple CPU cores to build commands (you could get some, but let's not go there).

The way I am handling that in my code base is that I have one thread for graphics submission, one thread for compute submission, and N threads for command list build-up. I have a thread-safe CommandQueueList class (one instance per queue), so command lists can be added and removed from multiple threads without clashing. The graphics/compute threads just spin and check whether their command queue list has any command lists to execute. Doing it like that, you can keep filling up new command lists without having to wait on the GPU to execute them.
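A rough guess at what that setup looks like, built from a mutex-guarded queue and a spinning submission thread; the names mirror the description above, and the real implementation may well differ:

```cpp
#include <d3d12.h>
#include <atomic>
#include <mutex>
#include <queue>
#include <thread>

// Thread-safe holder for closed command lists, filled by N builder threads.
class CommandQueueList {
public:
    void Push(ID3D12CommandList* cl) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_lists.push(cl);
    }
    ID3D12CommandList* TryPop() {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_lists.empty()) return nullptr;
        ID3D12CommandList* cl = m_lists.front();
        m_lists.pop();
        return cl;
    }
private:
    std::mutex m_mutex;
    std::queue<ID3D12CommandList*> m_lists;
};

// One of these runs per queue: one for the direct (graphics) queue and one
// for the compute queue. It spins, draining whatever the builder threads
// have finished, so the builders never wait on the GPU.
void SubmissionThread(ID3D12CommandQueue* queue, CommandQueueList* pending,
                      std::atomic<bool>* running)
{
    while (running->load()) {
        if (ID3D12CommandList* cl = pending->TryPop())
            queue->ExecuteCommandLists(1, &cl);
        else
            std::this_thread::yield();
    }
}
```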

