[DX11] Command Lists on a Single Threaded Renderer
#1 Members - Reputation: 100
Posted 14 November 2011 - 10:22 AM
Basically a Command List is a more efficient way of submitting a number of state and draw commands than calling each API separately, so even if you do all your rendering on one thread it would still seem more efficient to use Command Lists to perform repeated lists of actions.
If this is correct, then why isn't this mentioned more? Why are Deferred Contexts pretty much exclusively documented as a multithreading feature?
Thanks
ben
#3 Members - Reputation: 170
Posted 15 November 2011 - 09:20 AM
MJP, on 14 November 2011 - 01:03 PM, said:
- Immediate: 200 fps
- Deferred: 150 fps
#4 Moderators - Reputation: 2118
Posted 15 November 2011 - 12:17 PM
If your test is easy to package, I would be happy to try it out on my machine to see how it performs.
#5 GDNet+ - Reputation: 1136
Posted 15 November 2011 - 12:36 PM
xoofx, on 15 November 2011 - 09:20 AM, said:
MJP, on 14 November 2011 - 01:03 PM, said:
- Immediate: 200 fps
- Deferred: 150 fps
To that end - I would suggest setting up your rendering code to not know if it is using a deferred or immediate context so that you can delay the decision as long as possible as to which method you will use. That is just my own suggestion though - I have found it to be useful, but your mileage may vary!
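A minimal sketch of that suggestion, written as a toy model in plain C++ rather than real D3D11 code. The `Context`, `ImmediateContext`, `DeferredContext` and `renderScene` names are all hypothetical. In D3D11 itself no wrapper is needed for this, because immediate and deferred contexts are both exposed as `ID3D11DeviceContext`; the model only shows the shape of rendering code that does not care which one it gets.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for a context interface (hypothetical, not the D3D11 API).
struct Context {
    virtual ~Context() = default;
    virtual void draw(const std::string& mesh) = 0;
};

// Executes each command as soon as it is issued.
struct ImmediateContext : Context {
    std::vector<std::string> log;  // stands in for work handed to the driver
    void draw(const std::string& mesh) override { log.push_back(mesh); }
};

// Records commands so they can be played back later on another context.
struct DeferredContext : Context {
    std::vector<std::function<void(Context&)>> recorded;

    void draw(const std::string& mesh) override {
        recorded.push_back([mesh](Context& target) { target.draw(mesh); });
    }

    // Replays the recorded commands on `target`, then clears the recording.
    void playback(Context& target) {
        assert(&target != this && "play back into a different context");
        for (auto& cmd : recorded) cmd(target);
        recorded.clear();
    }
};

// The rendering code sees only the base interface, so the choice between
// immediate and deferred submission can be made by the caller.
void renderScene(Context& ctx) {
    ctx.draw("terrain");
    ctx.draw("tank");
}
```

Calling `renderScene` on an `ImmediateContext` directly, or on a `DeferredContext` followed by `playback`, should produce the same sequence of draws, which makes it cheap to profile both paths before committing to one.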
#6 Members - Reputation: 100
Posted 15 November 2011 - 03:19 PM
xoofx, on 15 November 2011 - 09:20 AM, said:
MJP, on 14 November 2011 - 01:03 PM, said:
- Immediate: 200 fps
- Deferred: 150 fps
Thanks all
xoofx, in your test do you re-create the command list every frame, or do you create the command list once at startup and then just re-execute it every frame?
I'm amazed that AMD cards don't support multithreading at the driver level yet!
I'd be interested in seeing the results on an NVidia card too.
Thanks
Ben
#7 Members - Reputation: 170
Posted 20 November 2011 - 01:27 AM
Of course, I agree with Jason.Z's statements about treating these kinds of results carefully, and about the fact that a renderer can easily be built to switch transparently between a deferred context and an immediate context.
To respond to your initial question, BenS1: it seems that hardware support for command lists doesn't change much (when preparing a command list once and then running it on an immediate context) compared to using the default Direct3D11 behavior.
#8 Moderators - Reputation: 2180
Posted 20 November 2011 - 04:46 AM
Any speed gain also depends very much on what you are doing. In a test case at work which was set up to be heavily CPU bound, switching on multi-threaded CL support once NV's drivers were updated to fix it did give us a speed boost, but not a large one. I then spent some time playing with the test case and discovered that once we went over a certain threshold of data per CL, we started spending more and more time in the buffer swapping function than anywhere else in the submission, because the driver had more to do. (I can't recall the specifics, but from what I do recall drivers are limited memory-wise or something like that... basically we blew a buffer right out.)
However, up until that point the MT CL rendering WAS making a significant difference to our CPU time usage, and we had near perfect scaling for the test case.
The key point from all this: MT CL, if implemented by the drivers, will help, but ONLY your CPU time.
I make a point of saying this because there is no 'hardware support' for CL; Command Lists are purely a CPU-side thing. The difference is between letting the DX11 runtime cache the commands and letting the driver cache and optimise them. (AMD still lacks support for this, although it is apparently 'coming soon'.)
(Also, as a side note, I do recall reading that 'create, store and reuse' isn't an optimal pattern for command lists. The runtime isn't really set up for this case and assumes you'll be remaking them each frame. That is a fair assumption: you can't chain them together to adjust each other's state, and most command lists will change each frame in a 'real world' situation, so it is best to test against this.)
#9 Members - Reputation: 170
Posted 20 November 2011 - 06:00 AM
phantom, on 20 November 2011 - 04:46 AM, said:
#10 Members - Reputation: 100
Posted 20 November 2011 - 08:11 AM
xoofx, on 20 November 2011 - 01:27 AM, said:
Of course, I agree with Jason.Z's statements about treating these kinds of results carefully, and about the fact that a renderer can easily be built to switch transparently between a deferred context and an immediate context.
To respond to your initial question, BenS1: it seems that hardware support for command lists doesn't change much (when preparing a command list once and then running it on an immediate context) compared to using the default Direct3D11 behavior.
Wow, great article! Thanks.
It's a shame that the results show that command lists aren't really a faster way of repeating the same drawing commands over and over for a single threaded renderer.
I suspect they have the potential to be faster if the driver developers had sufficient motivation to optimise this area of their code, especially if they optimised the command list when you call FinishCommandList. I guess the problem is that the driver has no idea whether you're only going to use the command list once and throw it away (in which case the act of optimising the command list may cost more than the potential gains), or whether you're going to create the command list once and execute it many times (in which case optimising the list may be beneficial).
I guess we'd need a tweak to the API so that you can either pass a boolean to FinishCommandList to tell the driver whether the command list should be optimised, or maybe there could be a separate explicit OptimizeCommandList method.
Thanks again for your detailed analysis.
Thanks
Ben
#11 Members - Reputation: 100
Posted 20 November 2011 - 08:22 AM
phantom, on 20 November 2011 - 04:46 AM, said:
<snip>
(Also, as a side note, I do recall reading that 'create, store and reuse' isn't an optimal pattern for command lists. The runtime isn't really set up for this case and assumes you'll be remaking them each frame. That is a fair assumption: you can't chain them together to adjust each other's state, and most command lists will change each frame in a 'real world' situation, so it is best to test against this.)
Thanks Phantom, but in my case I was thinking of creating the command list once and then executing it for each frame.
As I'm sure you know, a command list that uses a constant buffer only contains references (or pointers) to the constant buffer and not the actual data contained in the buffer itself, so an app can still change the data in the constant buffer from frame to frame without having to create a new command list.
So for example I was thinking:
1. At startup create a command list (DrawTankCL) that draws a Tank at a position defined in a Constant Buffer ("TankCB")
2. Update TankCB.position on the CPU based on user input, physics etc
3. ExecuteCommandList(DrawTankCL)
4. Repeat from step 2.
As you can see the command list is created once and executed over and over, and yet the tank's position is still dynamic.
It's a shame that this "create, store, reuse" pattern is not optimised in the drivers.
Anyway, at least now I know the answer, so I can code my game accordingly.
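The steps above can be modelled in a few lines of plain C++. This is only a toy illustration of "the list holds a reference to the buffer, not its data"; it is not D3D11 code, and `TankCB`, `Frame`, `recordDrawTank` and `executeCommandList` are hypothetical names. A real implementation would record on a deferred context, call FinishCommandList once, update the buffer on the immediate context each frame, and would also have to deal with CPU/GPU synchronisation, which this sketch ignores.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical constant buffer contents: just the tank's x position.
struct TankCB { float x = 0.0f; };

// Stands in for the frame's output: where the tank ended up being drawn.
struct Frame { std::vector<float> drawnAt; };

using CommandList = std::vector<std::function<void(Frame&)>>;

// Step 1: record once. The command captures a pointer to the buffer,
// so it reads whatever the buffer holds at execution time.
CommandList recordDrawTank(const TankCB* cb) {
    assert(cb != nullptr);
    CommandList cl;
    cl.push_back([cb](Frame& f) { f.drawnAt.push_back(cb->x); });
    return cl;
}

// Step 3: replay the same list every frame.
void executeCommandList(const CommandList& cl, Frame& f) {
    for (const auto& cmd : cl) cmd(f);
}
```

Updating `cb.x` between two calls to `executeCommandList` (step 2) changes where the tank is drawn without re-recording the list.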
Thanks for your help
Ben
#12 Moderators - Reputation: 2180
Posted 20 November 2011 - 09:02 AM
Firstly, you are being too fine-grained with your CL for it to really be useful. There is a good PDF from GDC2011 which covers some of this (google: Jon Jansen DX11 Performance Gems, that should get you it). The main thing is that a CL has overhead, apparently equivalent to a few dozen API calls, so doing too little work in one is going to be a problem as it will just get swamped by that overhead. Depending on your setup, scenes or material groups are better fits for CL building and execution.
Secondly, you run the risk of suffering a stall at step 2. The driver buffers commands and the GPU should be working at the same time as you execute other work, so there is a chance that when you come to update in step 2 you could be waiting a 'significant' amount of time for the GPU to be done with your buffer and release it so that you can update it again. A discard lock or some other update method might avoid the problem (I've not tried it myself), but it still presents an issue.
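To put rough numbers on the overhead argument, here is a back-of-the-envelope sketch. It assumes the fixed cost of a command list can be expressed as a number of "equivalent API calls"; the figure of 36 used below is purely an assumption standing in for "a few dozen", not a measurement, and `overheadFraction` is a hypothetical helper.

```cpp
#include <cassert>

// Fraction of a command list's CPU cost that is fixed overhead, with both
// the overhead and the recorded work measured in equivalent API calls.
double overheadFraction(double overheadCalls, double recordedCalls) {
    assert(overheadCalls >= 0.0 && recordedCalls > 0.0);
    return overheadCalls / (overheadCalls + recordedCalls);
}
```

Under that assumption, a list that records 4 calls (roughly one tank) spends 36 / 40 = 90% of its cost on overhead, while a list that records 1000 calls (a whole scene or material group) spends about 3.5%.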
#13 Moderators - Reputation: 2180
Posted 20 November 2011 - 09:15 AM
xoofx, on 20 November 2011 - 06:00 AM, said:
NV is a strange beast; before they had 'proper' support they kinda emulated it by spinning up a 'server' thread and serialising the CL creation via that. Amusingly if any of your active threads ended up on the same core as the server thread it tended to murder performance but by staying clear you could get a small improvement. Once the drivers came out which did the work correctly this problem went away.
In our test NV with proper support soundly beat AMD without it; this was a 470GTX vs a 5870 on otherwise basically identical hardware (i7 CPUs; the NV one had a few hundred MHz over the AMD one, but not enough to explain the performance delta seen). AMD's performance was more in line with the single-threaded version. However, our test was a very heavily CPU bound one: 15,000 draw calls spread over 6 cores, each one drawing a single flat shaded cube. Basically an API's worst nightmare ;)
(Amusing side note: the same test/code on an X360 @ 720p could render at a solid 60fps with a solid 16.6ms frame time. That's command lists being generated each frame over 6 cores; it shows just how much CPU overhead/performance loss you take when running on Windows.)
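The shape of that kind of test can be sketched in portable C++, with the caveat that this is a toy model and not the actual test code: each `int` stands in for one recorded draw call, and `recordInParallel` and `submit` are hypothetical names. In D3D11 terms, each worker would own a deferred context and the final loop would be the immediate context executing the finished command lists in order.

```cpp
#include <cassert>
#include <thread>
#include <vector>

using CommandList = std::vector<int>;  // each int stands in for one draw call

// Record totalDraws draw calls across workerCount threads, one list per thread.
std::vector<CommandList> recordInParallel(int totalDraws, int workerCount) {
    assert(totalDraws >= 0 && workerCount > 0);
    std::vector<CommandList> lists(workerCount);
    std::vector<std::thread> workers;
    for (int w = 0; w < workerCount; ++w) {
        workers.emplace_back([&lists, w, totalDraws, workerCount] {
            // Each worker takes a contiguous slice and writes only to its
            // own list, so no locking is needed while recording.
            int begin = totalDraws * w / workerCount;
            int end = totalDraws * (w + 1) / workerCount;
            for (int i = begin; i < end; ++i) lists[w].push_back(i);
        });
    }
    for (auto& t : workers) t.join();
    return lists;
}

// Submission stays on one thread and happens in a fixed order.
std::vector<int> submit(const std::vector<CommandList>& lists) {
    std::vector<int> submitted;
    for (const auto& cl : lists)
        submitted.insert(submitted.end(), cl.begin(), cl.end());
    return submitted;
}
```

With `recordInParallel(15000, 6)` each worker records 2,500 draws, and `submit` replays all 15,000 in their original order. Only the recording is spread across cores, which is why this approach can only ever reduce CPU time.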