[DX11] Command Lists on a Single Threaded Renderer

11 comments, last by _the_phantom_ 12 years, 4 months ago

The problem with using a deferred context in a single-threaded system is that you are doing more work per core: you have to prepare the CL, which costs extra CPU time because the driver has work to do, and then you have to access it again to send it to the card. Spread across multiple threads, the cost per setup drops significantly and, if you batch them, the submission side will benefit greatly from code cache reuse (and, depending on how the CL is stored, maybe some data cache reuse too).

<snip>

(Also, as a side note, I do recall reading that 'create, store and reuse' isn't an optimal pattern for command lists. The runtime isn't really set up for this case and assumes you'll be remaking them each frame. That is a fair assumption: you can't chain them together to adjust each other's state, and most command lists will change each frame in a 'real world' situation, so it is best to test against this.)


Thanks Phantom, but in my case I was thinking of creating the command list once and then executing it for each frame.

As I'm sure you know, a command list that uses a constant buffer only contains references (or pointers) to the constant buffer and not the actual data contained in the buffer itself, so an app can still change the data in the constant buffer from frame to frame without having to create a new command list.

So for example I was thinking:
1. At startup create a command list (DrawTankCL) that draws a Tank at a position defined in a Constant Buffer ("TankCB")
2. Update TankCB.position on the CPU based on user input, physics etc
3. ExecuteCommandList(DrawTankCL)
4. Repeat from step 2.

As you can see, the command list is created once and executed over and over, and yet the tank's position is still dynamic.
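The steps above can be sketched as a toy model in portable C++. This is not the D3D11 API: `CommandList`, `TankCB`, `buildDrawTankCL` and `runFrames` are hypothetical stand-ins, and the only thing the sketch tries to show is the "record a reference once, change the data each frame" idea. It says nothing about how a real driver stores or replays a command list.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a constant buffer: just the data a draw would read.
struct TankCB {
    float position[3];
};

// Toy stand-in for a command list: recorded commands that run later.
class CommandList {
public:
    void record(std::function<void()> cmd) { commands_.push_back(std::move(cmd)); }
    void execute() const {
        for (const auto& cmd : commands_) cmd();
    }
private:
    std::vector<std::function<void()>> commands_;
};

// Step 1: record once. The command captures a pointer to the buffer,
// not a copy of its contents. 'drawn' collects the x position each "draw" sees.
CommandList buildDrawTankCL(const TankCB* cb, std::vector<float>* drawn) {
    assert(cb != nullptr && drawn != nullptr);
    CommandList cl;
    cl.record([cb, drawn] { drawn->push_back(cb->position[0]); });
    return cl;
}

// Steps 2 to 4: update the buffer, execute the same list, repeat.
std::vector<float> runFrames(int frames) {
    TankCB tankCB{{0.0f, 0.0f, 0.0f}};
    std::vector<float> drawn;
    CommandList drawTankCL = buildDrawTankCL(&tankCB, &drawn);
    for (int frame = 0; frame < frames; ++frame) {
        tankCB.position[0] = static_cast<float>(frame);  // step 2
        drawTankCL.execute();                            // step 3
    }
    return drawn;
}
```

Calling `runFrames(3)` records the list once and each "draw" sees the position written just before it, which is the behaviour described in the post.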

It's a shame that this "create, store, reuse" pattern is not optimised in the drivers.

Anyway, at least now I know the answer, so I can code my game accordingly.

Thanks for your help
Ben
There are two problems with your idea.

Firstly, you are being too fine-grained with your CL for it to really be useful. There is a good PDF from GDC2011 which covers some of this (google: Jon Jansen DX11 Performance Gems, that should get you it). The main thing is that a CL has overhead, apparently a few dozen API calls' worth, so doing too little work in one is going to be a problem as it will just get swamped by that overhead. Depending on your setup, scenes or material groups are better fits for CL building and execution.
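To put rough numbers on the "swamped with overhead" point, here is a back-of-the-envelope sketch. The function `overheadFraction` and its parameters are made up for illustration, and the inputs in the usage note (36 calls of overhead, 4 calls per draw) are assumptions standing in for "a few dozen", not measured figures.

```cpp
#include <cassert>

// Fraction of the API-call cost of one command list that is fixed overhead
// rather than useful work.
//   overheadCalls: assumed fixed cost per command list, in API-call equivalents
//   callsPerDraw:  assumed API calls needed to set up and issue one draw
//   drawsPerList:  number of draws recorded into the list
double overheadFraction(int overheadCalls, int callsPerDraw, int drawsPerList) {
    assert(overheadCalls >= 0 && callsPerDraw > 0 && drawsPerList > 0);
    const double useful = static_cast<double>(callsPerDraw) * drawsPerList;
    return overheadCalls / (overheadCalls + useful);
}
```

Under those assumed inputs, a list holding a single draw spends 36 of its 40 call-equivalents on overhead (`overheadFraction(36, 4, 1)`), while a list holding 100 draws spends 36 of 436 (`overheadFraction(36, 4, 100)`). That is the argument for building CLs per scene or material group rather than per object.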

Secondly, you run the risk of suffering a stall at step 2. The driver buffers commands and the GPU should be working at the same time as you execute other work, so there is a chance that when you come to update in step 2 you could be waiting a 'significant' amount of time for the GPU to be done with your buffer and release it so that you can update it again. A discard/lock or some other update method might avoid the problem (I've not tried it myself), but it still presents an issue.

Indeed, if it is natively supported by the driver, it can be optimized. A coworker also found a performance boost on NVIDIA when they introduced support for command lists, though on AMD it is already fine without support from the driver... probably the command buffer on AMD is already laid out the same way as the DirectX 11 command buffer...


NV is a strange beast; before they had 'proper' support they kinda emulated it by spinning up a 'server' thread and serialising the CL creation via that. Amusingly if any of your active threads ended up on the same core as the server thread it tended to murder performance but by staying clear you could get a small improvement. Once the drivers came out which did the work correctly this problem went away.

In our test, NV with proper support soundly beat AMD without it; this was a 470GTX vs 5870 on otherwise basically identical hardware (i7 CPUs; the NV one had a few hundred MHz over the AMD one, but not enough to explain the performance delta seen). AMD's performance was more in line with the single-threaded version. However, our test was a very heavily CPU-bound one: 15,000 draw calls spread over 6 cores, each one drawing a single flat-shaded cube. Basically an API's worst nightmare ;)
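For readers who have not seen the "spread over N cores" pattern, here is a toy sketch of its shape in portable C++ using `std::thread`. It is not the test code from this post and it does not touch D3D11: `Command`, `CommandList` and `recordAndExecute` are hypothetical names, each "command" is just a cube index, and "executing" a list means appending it to an output vector. The assumption being illustrated is only that each worker records into its own list without locking, and a single thread then submits the lists in order.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Toy stand-in for a recorded draw: the index of the cube it would draw.
using Command = int;
using CommandList = std::vector<Command>;

// Records 'totalDraws' commands across 'workers' threads, one list per thread,
// then "executes" the lists in order on the calling thread.
// Returns the commands in the order they were executed.
std::vector<Command> recordAndExecute(int totalDraws, int workers) {
    assert(totalDraws >= 0 && workers > 0);
    std::vector<CommandList> lists(static_cast<std::size_t>(workers));
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(workers));

    for (int w = 0; w < workers; ++w) {
        // Each worker owns a contiguous slice of the draws and its own list,
        // so no synchronisation is needed while recording.
        const int begin = static_cast<int>(static_cast<long long>(totalDraws) * w / workers);
        const int end = static_cast<int>(static_cast<long long>(totalDraws) * (w + 1) / workers);
        threads.emplace_back([&lists, w, begin, end] {
            CommandList& cl = lists[static_cast<std::size_t>(w)];
            cl.reserve(static_cast<std::size_t>(end - begin));
            for (int i = begin; i < end; ++i) cl.push_back(i);
        });
    }
    for (auto& t : threads) t.join();

    // Submission happens on one thread, list by list.
    std::vector<Command> executed;
    executed.reserve(static_cast<std::size_t>(totalDraws));
    for (const auto& cl : lists) executed.insert(executed.end(), cl.begin(), cl.end());
    return executed;
}
```

`recordAndExecute(15000, 6)` mirrors the numbers in the post: six workers each record 2,500 commands, and the calling thread submits all 15,000 in their original order.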

(Amusing side note; the same test/code on an X360 @ 720p could render at a solid 60fps with a solid 16.6ms frame time. That's command lists being generated each frame over 6 cores; shows just how much CPU overhead/performance loss you take when running on Windows :( )

