[DX11] Command Lists on a Single Threaded Renderer

11 comments, last by _the_phantom_ 12 years, 5 months ago
One of the major new features of DirectX 11 is its support for multithreaded rendering using Immediate and Deferred Contexts. However, it seems to me that the ability to create a Command List could also be beneficial for a single-threaded renderer. Is this correct?

Basically, a Command List is a more efficient way of submitting a number of state and draw commands than calling each API separately, so even if you do all your rendering on one thread it would still seem more efficient to use Command Lists to perform repeated lists of actions.

If this is correct, then why isn't this mentioned more? Why are Deferred Contexts pretty much exclusively documented as a multi threading feature?

Thanks
ben
I'm pretty sure I remembered reading somewhere that the runtime/drivers weren't optimized for this case. However I still think it would be worth trying out, especially as the drivers get better support for deferred command list generation.

You are right, I just ran a test on 100,000 cubes, single threaded:
  • Immediate: 200 fps
  • Deferred: 150 fps
So Deferred is slower for a single-threaded application (at least on my machine). Note that querying command list threading support on my graphics card (AMD 6970M) returns false, so I assume it is not supported natively by the driver but is "emulated" by DX11...
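Frame rates are easier to compare as time per frame. A minimal sketch of that arithmetic for the two figures above (the function names are hypothetical, introduced only for this illustration):

```cpp
#include <cassert>

// Convert a frame rate in frames per second to milliseconds per frame.
double frame_time_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra time per frame when the frame rate drops from fps_before to fps_after.
double added_frame_cost_ms(double fps_before, double fps_after) {
    return frame_time_ms(fps_after) - frame_time_ms(fps_before);
}
```

With the numbers reported here, 200 fps is 5 ms per frame and 150 fps is about 6.67 ms per frame, so the deferred path costs roughly 1.67 ms more per frame in this particular test.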
I'm not sure if AMD GPUs support it yet, but my GTX 470 on the latest drivers says that it does. I'll have to check on my HD 6970 when I get home.

If your test is easily packageable, I would be happy to try it out on my machine to see how it performs.

[quote name='MJP' timestamp='1321297420' post='4883879']
I'm pretty sure I remembered reading somewhere that the runtime/drivers weren't optimized for this case. However I still think it would be worth trying out, especially as the drivers get better support for deferred command list generation.

You are right, I just ran a test on 100,000 cubes, single threaded:
  • Immediate: 200 fps
  • Deferred: 150 fps
So Deferred is slower for a single-threaded application (at least on my machine). Note that querying command list threading support on my graphics card (AMD 6970M) returns false, so I assume it is not supported natively by the driver but is "emulated" by DX11...
[/quote]
I would caution against making blanket statements about the performance of immediate vs. deferred rendering. It depends entirely on what your renderer does when it submits work to the API. For example, if your engine does lots of work in between the API calls that it makes, then it would likely be beneficial to use multiple threads, which could reduce the total time needed to process a rendering path. On the other hand, if your submission routines are very bare bones and only submit API calls, there could be some benefit to compiling a long list of commands into a command list and then reusing it from frame to frame. This will depend on the hardware, the driver, your engine, and the application that is using your engine, so you need to profile and see if it is worth it in a given context. You could even dynamically test it out on the first startup of your application and then choose the appropriate rendering method.
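The idea of testing at first startup and then picking a path could be sketched roughly as follows. This is only an illustration: `SubmitPath`, `average_frame_ms` and `choose_submit_path` are hypothetical names, and the two callables stand in for "render one frame via the immediate context" and "render one frame via a deferred context". It times CPU-side submission only; a real test would also need to account for the GPU finishing its work.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical tag for the two submission strategies being compared.
enum class SubmitPath { Immediate, Deferred };

// Run renderFrame `frames` times and return the average wall-clock time per frame in ms.
double average_frame_ms(const std::function<void()>& renderFrame, int frames) {
    assert(frames > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < frames; ++i) {
        renderFrame();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / frames;
}

// Time both strategies and keep the faster one; ties go to the immediate path.
SubmitPath choose_submit_path(const std::function<void()>& renderImmediate,
                              const std::function<void()>& renderDeferred,
                              int frames) {
    const double immediateMs = average_frame_ms(renderImmediate, frames);
    const double deferredMs = average_frame_ms(renderDeferred, frames);
    return deferredMs < immediateMs ? SubmitPath::Deferred : SubmitPath::Immediate;
}
```

The result could be cached per machine so the measurement only runs once, though driver updates may change the answer.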

To that end - I would suggest setting up your rendering code to not know if it is using a deferred or immediate context so that you can delay the decision as long as possible as to which method you will use. That is just my own suggestion though - I have found it to be useful, but your mileage may vary!
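In Direct3D 11 both context types are exposed through the same ID3D11DeviceContext interface, which is what makes this kind of agnostic rendering code possible. The sketch below shows the shape of the idea using mock classes rather than the real API: `FakeGpu`, `Context`, `ImmediateContext`, `DeferredContext` and `RenderCubes` are all hypothetical stand-ins, and the command list is just a vector of strings.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Mock GPU queue: records every command that actually gets executed.
struct FakeGpu {
    std::vector<std::string> executed;
};

using CommandList = std::vector<std::string>;

// Mock interface shared by both context types (stands in for a common context interface).
class Context {
public:
    virtual ~Context() = default;
    virtual void SetState(const std::string& state) = 0;
    virtual void Draw(int vertexCount) = 0;
};

// Mock immediate context: every call goes straight to the GPU queue.
class ImmediateContext : public Context {
public:
    explicit ImmediateContext(FakeGpu& gpu) : gpu_(gpu) {}
    void SetState(const std::string& state) override { gpu_.executed.push_back("state:" + state); }
    void Draw(int vertexCount) override { gpu_.executed.push_back("draw:" + std::to_string(vertexCount)); }
    // Replay a previously recorded list in order.
    void ExecuteCommandList(const CommandList& list) {
        gpu_.executed.insert(gpu_.executed.end(), list.begin(), list.end());
    }
private:
    FakeGpu& gpu_;
};

// Mock deferred context: calls are only recorded until the list is finished.
class DeferredContext : public Context {
public:
    void SetState(const std::string& state) override { recorded_.push_back("state:" + state); }
    void Draw(int vertexCount) override { recorded_.push_back("draw:" + std::to_string(vertexCount)); }
    // Hand back what was recorded and reset for the next recording.
    CommandList FinishCommandList() {
        CommandList out = std::move(recorded_);
        recorded_.clear();
        return out;
    }
private:
    CommandList recorded_;
};

// Rendering code that does not know which kind of context it was given.
void RenderCubes(Context& ctx, int cubeCount) {
    assert(cubeCount >= 0);
    ctx.SetState("cube_shader");
    for (int i = 0; i < cubeCount; ++i) {
        ctx.Draw(36);  // 36 vertices per cube (12 triangles)
    }
}
```

Calling `RenderCubes` with an `ImmediateContext`, or with a `DeferredContext` followed by `FinishCommandList` and `ExecuteCommandList`, ends with the same commands on the mock GPU, so the choice between the two can be made outside the rendering code.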

Thanks all

xoofx, in your test do you re-create the command list every frame, or do you create the command list once at startup and then just re-execute it every frame?

I'm amazed that AMD cards don't support multithreading at the driver level yet!

I'd be interested in seeing the results on an NVIDIA card too.

Thanks
Ben
I have released the test executable along with some analysis of the results here.

Of course, I agree with Jason.Z's statements about treating this kind of result with care, and about the fact that a renderer can easily be built to switch transparently between a deferred context and an immediate context.

To respond to your initial question, BenS1: it seems that hardware support for command lists doesn't change much (when preparing a command list once and running it on an immediate context) compared to using the default Direct3D11 behavior.
The problem with using a deferred context in a single threaded system is that you are doing more work per core in that situation: you have to prepare the CL, which takes some extra CPU overhead as the driver needs to do things, and then you have to access it again to send it to the card properly. Spread across multiple threads, the cost per setup drops significantly and, if you batch them, your send arch will benefit greatly from code cache reuse (and, depending on how it's stored, maybe some data cache too).

Any speed gain also depends very much on what you are doing. In a test case at work which was set up to be heavily CPU bound, switching on multi-threaded CL support when NV's drivers were updated to fix it did give us a speed boost, however it wasn't that much. I then spent some time playing with the test case and discovered that when we got over a certain threshold for data per CL we started spending more and more time in the buffer swapping function than anywhere else in the submission, due to the driver having to do more. (I can't recall the specifics, but from what I do recall drivers are limited memory-wise or something like that... basically we blew a buffer right out).

However up until that point the MT CL rendering WAS making a significant difference with our CPU time usage and we had near perfect scaling for the test case.

The key point from all this: MT CL, if implemented by the drivers, will help, but ONLY your CPU time.
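The multithreaded pattern described above (each thread prepares its own CL, then one thread sends them in a fixed order) can be sketched without any graphics API at all. This is only a mock under stated assumptions: `RecordRange` and `BuildFrame` are hypothetical names, a command list is just a vector of strings, and std::thread stands in for whatever job system an engine would really use.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

using CommandList = std::vector<std::string>;

// Record draw commands for objects [first, last) into a list private to the caller.
CommandList RecordRange(int first, int last) {
    CommandList list;
    for (int i = first; i < last; ++i) {
        list.push_back("draw:" + std::to_string(i));
    }
    return list;
}

// Split the objects across workers, record in parallel, then submit in worker order.
std::vector<std::string> BuildFrame(int objectCount, int workerCount) {
    assert(objectCount >= 0 && workerCount > 0);
    std::vector<CommandList> lists(workerCount);
    std::vector<std::thread> workers;
    const int perWorker = (objectCount + workerCount - 1) / workerCount;

    for (int w = 0; w < workerCount; ++w) {
        const int first = std::min(objectCount, w * perWorker);
        const int last = std::min(objectCount, first + perWorker);
        // Each worker writes only to its own slot, so no locking is needed.
        workers.emplace_back([&lists, w, first, last] { lists[w] = RecordRange(first, last); });
    }
    for (auto& worker : workers) {
        worker.join();
    }

    // Single-threaded submission: the order is fixed regardless of which worker finished first.
    std::vector<std::string> submitted;
    for (const auto& list : lists) {
        submitted.insert(submitted.end(), list.begin(), list.end());
    }
    return submitted;
}
```

The recording cost is what gets spread across cores here; the final submission loop is still serial, which fits the point that any gain is in CPU time only.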

I make a point of saying this because there is no 'hardware support' for CL; Command Lists are purely a CPU side thing. The difference is between letting the DX11 runtime cache the commands or letting the driver cache them and optimise them. (AMD still lacks support for this, although it is apparently 'coming soon')

(Also, as a side note, I do recall reading that 'create, store and reuse' isn't an optimal pattern for command lists. The runtime isn't really set up for this case and assumes you'll be remaking them each frame. That is a fair assumption: you can't chain them together to adjust each other's state, and most command lists will change each frame in a 'real world' situation, so it is best to test against this.)

[quote]
I then spent some time playing with the test case and discovered that when we got over a certain threshold for data per CL we started spending more and more time in the buffer swapping function than anywhere else in the submission due to the driver having to do more. (I can't recall the specifics but from what I do recall drivers are limited memory wise or something like that... basically we blew a buffer right out).
[/quote]

This could come from Map/Unmap on command buffers. With an immediate context, Map gives you something like direct DMA access to GPU memory, but with a deferred context it has to copy to a temporary buffer (which is probably in RAM; I'm not sure whether it is in memory shared with the GPU)...


[quote]
I make a point of saying this because there is no 'hardware support' for CL; Command Lists are purely a CPU side thing, the difference is between letting the DX11 runtime cache the commands or letting the driver cache them and optimise them. (AMD still lacks support for this, although it is apparently 'coming soon')
[/quote]

Indeed, if it is natively supported by the driver, it can be optimized. A coworker also found a performance boost on NVIDIA when they introduced support for command lists, though on AMD it is already fine without support from the driver... probably the command buffer on AMD is already laid out the same way as the DirectX 11 command buffer...


Wow, great article! Thanks.

It's a shame that the results show that command lists aren't really a faster way of repeating the same drawing commands over and over for a single threaded renderer.

I suspect they have the potential to be faster if the driver developers had sufficient motivation to optimise this area of their code, especially if they optimised the command list when you call FinishCommandList. I guess the problem is that the driver has no idea if you're only going to use the command list once and throw it away (in which case the act of optimising the command list may cost more than the potential gains), or if you're going to create the command list once and execute it many times (in which case optimising the list may be beneficial).

I guess we'd need a tweak to the API so that you can either pass in a boolean to FinishCommandList to tell the driver whether the command list should be optimised or not, or maybe there could be a separate explicit OptimizeCommandList method.

Thanks again for your detailed analysis.

Thanks
Ben

