[DX11] Fastest way to update a constant buffer per draw call
#1 Members - Reputation: 133
Posted 05 December 2011 - 07:30 PM
A) Create a single constant buffer and call Map/Unmap on that same constant buffer before each draw call.
B) Create 500 constant buffers, one for each draw call, and call Map/Unmap on the draw call's own constant buffer.
C) Or, another idea?
I know that for (A) the driver will rename the buffer each time I Map it, discarding the previous contents, which is fine. But is it OK to expect that the driver can handle hundreds or even thousands of renames per frame? And I assume the rename process consumes some time, too.
On the other hand, (B) avoids the renaming and any associated overhead at the expense of possibly more video memory being consumed (500 constant buffers, even if fewer draw calls are actually used) and more code complexity.
#2 Moderators - Reputation: 13615
Posted 05 December 2011 - 09:08 PM
For these kinds of GPUs, I'd theorise that option (A) would be the most efficient, as there really is no cbuffer management going on behind the scenes.
On newer GPUs, it's possible for cbuffers to be stored in VRAM, and then moved into registers when required. On these cards, when you put data into a cbuffer, it can actually perform a VRAM transfer (and possibly issue a cache-invalidation command to the command-buffer). When you bind a cbuffer, you're writing a command into the command buffer that instructs the GPU to fetch some register values from VRAM.
On these cards, using option (B) would allow you to perform all of the VRAM transfers well in advance of any draw-calls that use that data, which reduces the amount of data flowing through the command-buffer. However, as you're still moving the same amount of data to the GPU every frame anyway (as you're regenerating the cbuffers each frame), there isn't really a bandwidth saving here... though it still might be more efficient...
You'd probably have to test it (on multiple GPUs) to find out.
On really old GPUs, there's no such thing as cbuffers AND there's no such thing as shader registers! On these cards, when you set a cbuffer, the driver actually takes the compiled shader code and inserts new instructions into it that contain your shader values (now as hard-coded numbers, not variables). On this class of GPUs, no matter what you do, setting shader variables is going to be bad for performance, as every change-of-variables actually produces a whole new shader program ;)
#3 Members - Reputation: 133
Posted 05 December 2011 - 11:41 PM
#4 Moderators - Reputation: 3974
Posted 06 December 2011 - 07:45 AM
So; [generate all data] -> [bind] -> [draw objects as required with indexing]
Generating data at render time seems like Bad Voodoo to me anyway; render time should just be rendering, sort your data out before hand.
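The [generate all data] -> [bind] -> [draw with indexing] flow can be sketched on the CPU side roughly as below. This is only a model of the bookkeeping, not anyone's actual engine code. `PerObjectData`, `DrawSlot`, `FrameData` and `BuildFrameData` are hypothetical names, and the split size assumes the Direct3D 11 limit of 4096 16-byte constants per constant buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Direct3D 11 caps a constant buffer at 4096 constants of 16 bytes each (64 KiB).
constexpr std::size_t kMaxCBufferBytes = 4096 * 16;

// Hypothetical per-object payload: one 4x4 world matrix (64 bytes).
struct PerObjectData {
    float world[16];
};
static_assert(sizeof(PerObjectData) % 16 == 0,
              "constant buffer data must be a multiple of 16 bytes");

// Where one draw's data lives: which buffer, and which array slot inside it.
struct DrawSlot {
    std::uint32_t buffer;
    std::uint32_t index;
};

// How many objects fit into one constant buffer.
constexpr std::size_t ObjectsPerBuffer() {
    return kMaxCBufferBytes / sizeof(PerObjectData);
}

struct FrameData {
    std::vector<std::vector<PerObjectData>> buffers; // one upload per entry
    std::vector<DrawSlot> slots;                     // one entry per draw call
};

// Generate everything before rendering: pack objects into as few buffers as
// possible and remember the (buffer, index) pair each draw should use.
FrameData BuildFrameData(const std::vector<PerObjectData>& objects) {
    FrameData frame;
    const std::size_t perBuffer = ObjectsPerBuffer();
    for (std::size_t i = 0; i < objects.size(); ++i) {
        if (i % perBuffer == 0) {
            frame.buffers.emplace_back(); // start a new buffer when the last is full
        }
        frame.buffers.back().push_back(objects[i]);
        frame.slots.push_back({static_cast<std::uint32_t>(i / perBuffer),
                               static_cast<std::uint32_t>(i % perBuffer)});
    }
    return frame;
}
```

At render time each entry of `buffers` would be uploaded once, and each draw would bind the buffer named in its slot and pass `index` to the shader (for example as an instance ID). With 64-byte objects, 500 draws fit in a single buffer.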
#5 Members - Reputation: 516
Posted 06 December 2011 - 10:33 AM
Let's say I have 500 draw calls per frame and all 500 draw calls use the same shader and that shader uses one constant buffer. Let's also assume that the data in the constant buffer needs to be built dynamically for each draw call. What would be the most desirable way to update the constant buffers, in terms of efficiency?
A) Create a single constant buffer and call Map/Unmap on that same constant buffer before each draw call.
B) Create 500 constant buffers, one for each draw call, and call Map/Unmap on the draw call's own constant buffer.
C) Or, another idea?
I know that for (A) the driver will rename the buffer each time I Map it, discarding the previous contents, which is fine. But is it OK to expect that the driver can handle hundreds or even thousands of renames per frame? And I assume the rename process consumes some time, too.
On the other hand, (B) avoids the renaming and any associated overhead at the expense of possibly more video memory being consumed (500 constant buffers, even if fewer draw calls are actually used) and more code complexity.
To A) I believe that you should use UpdateSubresource instead.
I think I read in the SDK that it's faster for constant buffers.
Map/Unmap is for vertex buffers and textures, I think.
NOTE: not 100% sure.
#6 Members - Reputation: 133
Posted 06 December 2011 - 01:22 PM
Is there any reason you can't generate the data up front, before issuing draw calls, then build one large cbuffer and index in the shader based on an instance ID? Maybe split this up over a few buffers depending on cbuffer size so you aren't updating a massive chunk of data in one go.
So; [generate all data] -> [bind] -> [draw objects as required with indexing]
Generating data at render time seems like Bad Voodoo to me anyway; render time should just be rendering, sort your data out before hand.
I'm trying to implement what DICE has done for Battlefield 3, in terms of using buffers to store per-instance matrices to reduce draw calls. The constant buffer will hold data such as the number of matrices in the bone matrix palette, and that number (as well as additional data being stored in it) can be different for each type of object and so needs to be updated for each draw call. Here's a link to the DICE presentation. The instancing section is the first section in the Performance section, about half way through the doc.
http://publications.dice.se/attachments/GDC11_DX11inBF3_Public.pdf
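As a rough illustration of the kind of small per-draw constant buffer described in this post, here is a minimal sketch. The layout is an assumption made for the example and is not taken from the DICE slides. `PerDrawConstants`, `firstMatrix` and `MatrixIndex` are hypothetical names, and the indexing assumes that all instances in one draw store their bone matrices contiguously in a shared matrix buffer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-draw constants. Padded to 16 bytes, because a Direct3D 11
// constant buffer's size must be a multiple of 16.
struct PerDrawConstants {
    std::uint32_t boneCount;   // matrices in this object type's bone palette
    std::uint32_t firstMatrix; // assumed offset of this draw's matrices in the shared buffer
    std::uint32_t pad[2];      // unused, keeps the struct at 16 bytes
};
static_assert(sizeof(PerDrawConstants) == 16,
              "per-draw constants should occupy exactly one 16-byte constant");

// Index of matrix `bone` for instance `instance`, mirroring what a vertex
// shader would compute from the instance ID and the per-draw constants.
inline std::uint32_t MatrixIndex(const PerDrawConstants& c,
                                 std::uint32_t instance,
                                 std::uint32_t bone) {
    assert(bone < c.boneCount); // a bone index outside the palette is a bug
    return c.firstMatrix + instance * c.boneCount + bone;
}
```

Only this 16-byte struct changes between draw calls. The matrices themselves stay in the larger buffer, which is why the per-draw update remains small.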
#7 Moderators - Reputation: 3974
Posted 06 December 2011 - 04:29 PM
I suspect if you are moving around small enough buffers either option would be fine; we had a pure CPU-limited rendering test at work which was drawing 50,000 cubes and, for each draw call, was doing a map/unmap for a cbuffer on multiple contexts (6 iirc, there might have been one per context but don't quote me on that, it's been a while since I played with that bit of the code). With that test we were good up until around 15,000 draw calls before the driver started to get into trouble internally with memory issues.
Do whatever makes organisational sense, I guess...
#8 Members - Reputation: 133
Posted 06 December 2011 - 06:25 PM
#9 Moderators - Reputation: 3974
Posted 06 December 2011 - 06:49 PM
In short I'd probably go for a discard-map + a cbuffer per object type but make it easy to go with multiples if it proves to be a bottleneck.






