[D3D12] Front-to-back rendering and multi-threaded rendering


I've been studying multithreaded command submission in D3D12 and realized I want to render my scene front to back. So I need to synchronize the command list submissions, and I was considering using fences, but I remember reading that fences are expensive and that you shouldn't use too many. So I'm looking for the best way on the CPU side to order my command list submission... any ideas?


What do you mean you need to synchronize your command list submission?

If you have 4 threads each recording a command list with a subset of 1000 models (0-249 on Thread 0, 250-499 on Thread 1 etc) and they're in front to back order then you simply need to submit those command lists to ExecuteCommandLists in the order you want them executed.
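As a rough illustration of that point, here is a minimal sketch that does not touch the real D3D12 API. `MockCommandList`, `RecordInParallel` and `SubmitInOrder` are hypothetical stand-ins for a recorded `ID3D12GraphicsCommandList` and a single `ExecuteCommandLists` call; the only thing it tries to show is that per-thread recording needs no ordering logic as long as one thread submits the finished lists in slot order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list: just the model indices
// whose draws were recorded, in recording order.
using MockCommandList = std::vector<int>;

// Each worker records one contiguous slice of the already-sorted model list
// into its own slot. Slot t is only touched by worker t, so no locking is
// needed while recording.
std::vector<MockCommandList> RecordInParallel(const std::vector<int>& sortedModels,
                                              std::size_t threadCount)
{
    assert(threadCount > 0);
    std::vector<MockCommandList> lists(threadCount);
    std::vector<std::thread> workers;
    const std::size_t chunk = (sortedModels.size() + threadCount - 1) / threadCount;

    for (std::size_t t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, sortedModels.size());
            const std::size_t end = std::min(begin + chunk, sortedModels.size());
            for (std::size_t i = begin; i < end; ++i)
                lists[t].push_back(sortedModels[i]);  // "record a draw"
        });
    }
    for (auto& w : workers) w.join();  // every list is complete past this point
    return lists;
}

// Stand-in for one submission to a single queue: the lists are consumed in
// array order, so array order is execution order.
std::vector<int> SubmitInOrder(const std::vector<MockCommandList>& lists)
{
    std::vector<int> executed;
    for (const auto& list : lists)
        executed.insert(executed.end(), list.begin(), list.end());
    return executed;
}
```

With 1000 models and 4 threads, slot 1 holds models 250 to 499, and the submitted order is identical to the sorted input order.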


A bubble sort is effective on arrays that are already mostly sorted, so as long as your observer does not change position in the world drastically from frame to frame, it should serve you well in terms of both cache coherency and computation.

It is actually so trivially fast that moving the sort to another thread can do more harm than good in most scenarios (mind the cost of the data traveling through the shared cache to reach the other thread, unless you are socketing, which is not cheap either).

Profile.
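A minimal sketch of the idea above, assuming each model is reduced to a single view-space depth value (the function name and the use of plain `float` keys are illustrative assumptions, not anything from the poster). The early exit is what makes bubble sort cheap on nearly sorted input: a pass with no swaps ends the sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts depths ascending (front to back) and returns the number of passes
// made. On input that is already almost sorted the loop stops after very
// few passes.
std::size_t BubbleSortFrontToBack(std::vector<float>& depths)
{
    std::size_t passes = 0;
    std::size_t unsortedEnd = depths.size();
    bool swapped = true;

    while (swapped && unsortedEnd > 1) {
        swapped = false;
        ++passes;
        for (std::size_t i = 1; i < unsortedEnd; ++i) {
            if (depths[i - 1] > depths[i]) {
                std::swap(depths[i - 1], depths[i]);
                swapped = true;
            }
        }
        --unsortedEnd;  // the largest remaining depth is now in place
    }
    assert(std::is_sorted(depths.begin(), depths.end()));
    return passes;
}
```

An already sorted array costs one pass, and a single adjacent swap costs two. Whether this beats `std::sort` on real frame data is something to measure rather than assume.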


I've been studying multithreaded command submission in D3D12 and realized I want to render my scene front to back.

Sorting a set is a problem that sends multithreading off to grab some popcorn, though.

Process the parts of your scene computation that need a single processing unit first, then offload the parallel problems (or the other way round, or both at once), and afterwards use the central list once everything has finished.


Fences are for CPU<->GPU synchronization, or synchronization between multiple GPU queues.

I know, but unless I'm mistaken I could order submission using them if I wanted to (although it would cause bubbles in both CPU and GPU execution).

You have one GPU queue with multiple command lists being submitted to it, by one CPU thread. There's no synchronization problem there at all.
All you need to do is ensure that those many-lists have actually been generated (by your many threads) before one thread submits them.

Are there any other options?

That's a traditional multithreading problem with no link to D3D

Yeah I was going to post this in general but I figured there would be more context here.


Most of the benefits of going properly multithreaded are going to be from using multiple threads to generate multiple command lists and submitting them to a single command queue. I assume you're proposing creating multiple DIRECT command queues (one per thread) while also trying to synchronise work across these queues so it executes in the order you want?

Do you have a particular scenario in mind where serialising command list submission to a single DIRECT command queue is not ideal?


To get rendering in order you have to submit your set of commands to the queue in order.

For one command list this just means

• sort the command set
• submit each command to the command list in order
• submit the command list to the graphics queue

Extending this to multiple command lists you need to

• sort the command set
• split the command set into subsets
• submit each subset to a command list in order
• submit the command lists to the graphics queue in order

In pseudocode:

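void Renderer::RenderFrame()
{
    RenderModelRange opaque, translucent;
    vector<CommandList> commandLists;

    // Split renderModels into an opaque range and a translucent range.
    tie( opaque, translucent ) = parallel_partition( renderModels, IsOpaque() );
    parallel_sort( opaque, FrontToBack() );       // opaque: nearest first
    parallel_sort( translucent, BackToFront() );  // translucent: farthest first
    // Build the command lists in parallel, aggregated in range order.
    parallel_reduce( renderModels, BuildCommandLists(commandLists) );
    // Submit them to the queue in that same order.
    for_each( commandLists, SubmitCommandList(renderQueue) );
}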


In this pseudocode BuildCommandLists is where the magic happens (see the parallel_reduce Body concept from TBB, for example). It:

• splits the command set into subsets
• builds a command list for each subset
• aggregates the resulting lists into a vector (in order, command list building is associative but not commutative)

Of course this is pseudocode, so there are a lot of details missing.

For example you need to limit number of command lists (number of subsets) to a reasonable number.

This pseudo code also waits for all command lists to be built before submitting any to the queue.

Ideally you want to submit the first list as soon as it's built.

And submit the Nth command list as soon as it's built and the (N-1)th command list has been submitted.
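One way to get that "submit list N once it is built and list N-1 has gone out" behaviour is sketched below, with `std::async` standing in for a job system. `MockCommandList`, `BuildList` and `BuildAndSubmitPipelined` are hypothetical names, and the "queue" is just a vector, so treat this as an outline of the ordering logic rather than working D3D12 code.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

using MockCommandList = std::vector<int>;  // hypothetical stand-in

// Hypothetical recording job: "records" draws for models [begin, end).
MockCommandList BuildList(int begin, int end)
{
    MockCommandList list;
    for (int i = begin; i < end; ++i) list.push_back(i);
    return list;
}

// Launches one recording task per subset, then submits in subset order.
// get() on future N blocks only until list N is built, and by that point
// list N-1 has already been submitted, so later lists keep building while
// earlier ones are handed to the (mock) queue.
std::vector<int> BuildAndSubmitPipelined(int modelCount, int listCount)
{
    assert(listCount > 0);
    std::vector<std::future<MockCommandList>> pending;
    const int chunk = (modelCount + listCount - 1) / listCount;
    for (int n = 0; n < listCount; ++n) {
        const int begin = std::min(n * chunk, modelCount);
        const int end = std::min(begin + chunk, modelCount);
        pending.push_back(std::async(std::launch::async, BuildList, begin, end));
    }

    std::vector<int> queue;  // stand-in for the single graphics queue
    for (auto& f : pending) {
        MockCommandList list = f.get();                       // wait for list N only
        queue.insert(queue.end(), list.begin(), list.end());  // "submit" it
    }
    return queue;
}
```

The submitting thread never waits for the whole batch, only for the next list in order. The trade-off is one submission per list instead of one batched call.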


I assume you're proposing creating multiple DIRECT command queues (one per thread) while also trying to synchronise work across these queues so it executes in the order you want?

No, one queue multiple submission threads.  But right now I am imagining something along the lines of round-robin submission...cmdlist from thread 1 then cmdlist from thread 2... then back at one again.


Extending this to multiple command lists you need to

• sort the command set
• split the command set into subsets
• submit each subset to a command list in order
• submit the command lists to the graphics queue in order

What I was kicking around in my mind was using a spatial subdivision for a coarse-grained sort, then dividing the work up among multiple threads (this is a little complex since there will be overlap), sorting and building cmdlists on each thread, and then using some sort of synchronization to submit in order.

But I will think about what you proposed.
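A simplified sketch of that coarse-then-fine idea follows. It makes an assumption the post does not: instead of a real spatial subdivision it buckets models by view-space depth into equal-width slabs, which sidesteps the overlap problem because every model lands in exactly one slab. `CoarseThenFineSort` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Coarse pass: drop each depth into one of bucketCount equal-width slabs
// between 0 and maxDepth. Fine pass: sort every slab on its own task.
// Slabs are concatenated nearest-first, which yields a full front-to-back
// order because the slab index never decreases as depth increases.
std::vector<float> CoarseThenFineSort(const std::vector<float>& depths,
                                      float maxDepth, std::size_t bucketCount)
{
    assert(bucketCount > 0 && maxDepth > 0.0f);
    std::vector<std::vector<float>> buckets(bucketCount);
    for (float d : depths) {
        const float clamped = std::min(std::max(d, 0.0f), maxDepth);
        const auto slab = static_cast<std::size_t>(
            clamped / maxDepth * static_cast<float>(bucketCount));
        buckets[std::min(slab, bucketCount - 1)].push_back(d);
    }

    // Each task owns exactly one bucket, so the sorts need no locking.
    std::vector<std::future<void>> jobs;
    for (auto& bucket : buckets)
        jobs.push_back(std::async(std::launch::async, [&bucket] {
            std::sort(bucket.begin(), bucket.end());
        }));
    for (auto& j : jobs) j.get();

    std::vector<float> ordered;
    for (const auto& bucket : buckets)
        ordered.insert(ordered.end(), bucket.begin(), bucket.end());
    return ordered;
}
```

In a real renderer each sorted bucket would feed its own command list, and the lists would then be submitted in bucket order.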


Are there any other options?
Sure, you can use many GPU queues, and then use fences on the GPU side to synchronize them, but that's a lot of extra overhead.

You could also pass ownership of your single GPU queue from thread to thread, so that the thread processing the first range of your list performs its own submission, then the thread that owns the 2nd range performs its submission, etc... IMHO that would be much more complex and require more synch work than just having N write-command-list jobs, followed by a single submit job with a dependency between them. The latter should easily fit into any modern engine's job system.
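For comparison, the "pass ownership of the queue from thread to thread" variant might look roughly like the sketch below. `TurnBasedQueue` and `RecordAndSelfSubmit` are hypothetical, and the queue is again a plain vector. It shows the extra synchronization that variant needs: every worker has to block on a condition variable until its turn comes up.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a single queue whose ownership is handed from
// worker to worker. A worker may submit only when nextTurn_ equals its index.
class TurnBasedQueue {
public:
    void SubmitInTurn(std::size_t turn, const std::vector<int>& list)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        turnChanged_.wait(lock, [&] { return nextTurn_ == turn; });
        executed_.insert(executed_.end(), list.begin(), list.end());
        ++nextTurn_;
        turnChanged_.notify_all();  // wake whichever worker is next
    }

    std::vector<int> Executed() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return executed_;
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable turnChanged_;
    std::size_t nextTurn_ = 0;
    std::vector<int> executed_;
};

// Every worker records its own slice and then submits it itself, in turn.
std::vector<int> RecordAndSelfSubmit(int modelCount, int threadCount)
{
    assert(threadCount > 0);
    TurnBasedQueue queue;
    std::vector<std::thread> workers;
    const int chunk = (modelCount + threadCount - 1) / threadCount;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&queue, t, chunk, modelCount] {
            std::vector<int> list;
            const int begin = std::min(t * chunk, modelCount);
            const int end = std::min(begin + chunk, modelCount);
            for (int i = begin; i < end; ++i) list.push_back(i);
            queue.SubmitInTurn(static_cast<std::size_t>(t), list);
        });
    }
    for (auto& w : workers) w.join();
    return queue.Executed();
}
```

The result is the same order as the single-submitter approach, but with a mutex, a condition variable and a turn counter that the "N record jobs, then one submit job" design does not need.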


IMHO that would be much more complex and require more synch work than just having N write-command-list jobs, followed by a single submit job with a dependency between them.

http://www.gamedev.net/topic/673336-dx12-m-commandallocator-reset-is-necessary-every-frame/

In this thread Sergio J. de los Santos seems convinced that in D3D12 commandlist building is cheap but submission is expensive.  I haven't confirmed this yet but if true wouldn't overloading one thread with submission be suboptimal?


In this thread Sergio J. de los Santos seems convinced that in D3D12 commandlist building is cheap but submission is expensive.  I haven't confirmed this yet but if true wouldn't overloading one thread with submission be suboptimal?
Putting commands into a cmd list is cheap, but you do it tens of thousands of times per frame. Submission is "expensive" for a single function call (similar to typical function calls in older D3D versions :P) but you do it a handful of times per frame.


Sure, you can use many GPU queues, and then use fences on the GPU side to synchronize them, but that's a lot of extra overhead.

Just to clarify, the DX12 multithreaded sample generates command lists on multiple threads, so all the GPU commands are still executed in order (by list and then by command), while the multiadapter sample uses different GPU queues and executes two sets of commands concurrently, right? Is it ever useful to have multiple GPU queues on a single adapter? I thought that on Nvidia cards it isn't even very useful to have a separate GPU/compute queue on the same adapter. It's weird though, because I'm pretty sure CUDA allows concurrent execution of different kernels.

Edited by Dingleberry


Putting commands into a cmd list is cheap, but you do it tens of thousands of times per frame. Submission is "expensive" for a single function call (similar to typical function calls in older D3D versions ) but you do it a handful of times per frame.

But will submitting all cmdlists through a single processor/thread reduce the number of draw calls from their maximum?  Just curious.


Submitting the command lists is the quick bit, you can even submit multiple command lists in a single call to ExecuteCommandLists. It's unlikely to have any material effect on how many draw calls you can make per second.

Unless you're putting one draw call in a command list and queuing a million lists I think it's kind of unnecessary to worry about submission performance. It's presumably dwarfed by the time it takes to create the lists.