[DX11] Fastest way to update a constant buffer per draw call

Let's say I have 500 draw calls per frame and all 500 draw calls use the same shader and that shader uses one constant buffer. Let's also assume that the data in the constant buffer needs to be built dynamically for each draw call. What would be the most desirable way to update the constant buffers, in terms of efficiency?

A) Create a single constant buffer and call Map/Unmap on that same constant buffer before each draw call.

B) Create 500 constant buffers, one for each draw call, and call Map/Unmap on the draw call's own constant buffer.

C) Or, another idea?

I know that for (A) the driver will rename the buffer each time I Map it, discarding the previous contents, which is fine. But is it OK to expect that the driver can handle hundreds or even thousands of renames per frame? And I assume the rename process consumes some time, too.

On the other hand, (B) avoids the renaming and any associated overhead at the expense of possibly more video memory being consumed (500 constant buffers, even if fewer draw calls are actually used) and more code complexity.
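For concreteness, option (A) in code looks roughly like the sketch below: one dynamic cbuffer, discard-mapped before every draw. This is a minimal sketch only; the PerDrawData struct, its contents, and the register slot are placeholders made up for illustration, not anything fixed by the question.

```cpp
// Minimal sketch of option (A): one dynamic cbuffer, discard-mapped before each draw.
// PerDrawData and the register slot (0) are made-up placeholders for illustration.
#include <d3d11.h>
#include <cstring>

struct PerDrawData
{
    float worldMatrix[16]; // example payload; total size must stay a multiple of 16 bytes
};

ID3D11Buffer* CreatePerDrawCBuffer(ID3D11Device* device)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = sizeof(PerDrawData);
    desc.Usage          = D3D11_USAGE_DYNAMIC;            // CPU writes every draw
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

void DrawWithOptionA(ID3D11DeviceContext* ctx, ID3D11Buffer* cbuffer,
                     const PerDrawData& data, UINT indexCount)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    // WRITE_DISCARD lets the driver hand back a fresh region (a "rename")
    // instead of stalling until the GPU has finished with the old contents.
    if (SUCCEEDED(ctx->Map(cbuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, &data, sizeof(data));
        ctx->Unmap(cbuffer, 0);
    }

    ctx->VSSetConstantBuffers(0, 1, &cbuffer);
    ctx->DrawIndexed(indexCount, 0, 0);
}
```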
On older GPUs, there's no such thing as a cbuffer; there's just one global set of shader registers. On these cards, when you ask to set a cbuffer, it copies the register-id/value pairs out of the cbuffer and into the command-buffer. The GPU consumes the command-buffer in order, reading out the register values before reading the draw-call.
For these kinds of GPUs, I'd theorise that option (A) would be the most efficient, as there really is no cbuffer management going on behind the scenes.

On newer GPUs, it's possible for cbuffers to be stored in VRAM, and then moved into registers when required. On these cards, when you put data into a cbuffer, it can actually perform a VRAM transfer (and possibly issue a cache-invalidation command to the command-buffer). When you bind a cbuffer, you're writing a command into the command buffer that instructs the GPU to fetch some register values from VRAM.
On these cards, using option (B) would allow you to perform all of the VRAM transfers well in advance of any draw-calls that use that data, which reduces the amount of data flowing through the command-buffer. However, as you're still moving the same amount of data to the GPU every frame anyway (as you're regenerating the cbuffers each frame), there isn't really a bandwidth saving here... though it still might be more efficient...
You'd probably have to test it (on multiple GPUs) to find out.


On really old GPUs, there's no such thing as cbuffers AND there's no such thing as shader registers! On these cards, when you set a cbuffer, the driver actually takes the compiled shader code and inserts new instructions into it that contain your shader values (now as hard-coded numbers, not variables). On this class of GPUs, no matter what you do, setting shader variables is going to be bad for performance, as every change-of-variables actually produces a whole new shader program ;)
This is for a DX11-compliant card, an NVIDIA GeForce GTX 460, for example. The cbuffers are indeed in VRAM on this type of graphics card. I suppose my question boils down to this: is it OK to assume that the driver for this class of modern graphics card can handle hundreds or even thousands of buffer renames each frame without breaking a sweat? Or is the buffer-renaming mechanism really there only to handle a few rare cases of multiple Map/Unmaps to the same buffer?



Is there any reason you can't generate the data up front, before issuing draw calls, then build one large cbuffer and index in the shader based on an instance ID? Maybe split this up over a few buffers depending on cbuffer size so you aren't updating a massive chunk of data in one go.

So; [generate all data] -> [bind] -> [draw objects as required with indexing]

Generating data at render time seems like Bad Voodoo to me anyway; render time should just be rendering, so sort your data out beforehand.
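A rough sketch of how that could look on the C++ side, assuming one dynamic cbuffer holding an array of per-instance structs that the vertex shader indexes with SV_InstanceID; the InstanceData layout, the array size, and the register slots are illustrative assumptions, not anything the posters specified:

```cpp
// Sketch of the suggestion above: fill one large buffer up front, then let each
// instanced draw index into it. Names and sizes here are assumptions.
//
// Matching HLSL (for reference only):
//   cbuffer PerInstanceData : register(b1)
//   {
//       float4x4 g_World[1024];
//   };
//   // in the vertex shader: g_World[instanceID], where instanceID is SV_InstanceID
#include <d3d11.h>
#include <cstring>
#include <vector>

struct InstanceData
{
    float world[16]; // one float4x4 per instance
};

// Keep the whole array inside the 64 KB cbuffer limit (4096 float4 registers).
constexpr UINT kMaxInstances = 1024; // 1024 * 64 bytes = 64 KB

ID3D11Buffer* CreateInstanceCBuffer(ID3D11Device* device)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = kMaxInstances * sizeof(InstanceData);
    desc.Usage          = D3D11_USAGE_DYNAMIC;
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

void RenderFrame(ID3D11DeviceContext* ctx, ID3D11Buffer* cbuffer,
                 const std::vector<InstanceData>& instances, UINT indexCountPerObject)
{
    // [generate all data] -> one map per frame instead of one per draw.
    // Assumes instances.size() <= kMaxInstances.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(cbuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, instances.data(),
                    instances.size() * sizeof(InstanceData));
        ctx->Unmap(cbuffer, 0);
    }

    // [bind] -> [draw objects as required with indexing]
    ctx->VSSetConstantBuffers(1, 1, &cbuffer);
    ctx->DrawIndexedInstanced(indexCountPerObject,
                              static_cast<UINT>(instances.size()), 0, 0, 0);
}
```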

To (A): I believe you should use UpdateSubresource instead.
I think I read in the SDK that it's faster for constant buffers.

Map/Unmap is for vertex buffers and textures, I think.
Note: not 100% sure.
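For what it's worth, a minimal sketch of the UpdateSubresource path might look like this; the PerDrawData struct is a made-up placeholder, and whether this actually beats a discard-map for cbuffers is exactly the open question:

```cpp
// Sketch of the UpdateSubresource path for a cbuffer. With this approach the
// buffer is created with DEFAULT usage and no CPU access; the driver handles
// getting the data to the GPU (possibly via an internal staging copy).
#include <d3d11.h>

struct PerDrawData
{
    float worldMatrix[16]; // placeholder payload
};

ID3D11Buffer* CreateDefaultCBuffer(ID3D11Device* device)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(PerDrawData);
    desc.Usage     = D3D11_USAGE_DEFAULT;          // updated via UpdateSubresource
    desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;   // no CPUAccessFlags needed

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

void UpdateAndBind(ID3D11DeviceContext* ctx, ID3D11Buffer* cbuffer,
                   const PerDrawData& data)
{
    // For constant buffers the row/depth pitch arguments are ignored, and the
    // whole buffer must be updated (pDstBox must be null in D3D11.0).
    ctx->UpdateSubresource(cbuffer, 0, nullptr, &data, 0, 0);
    ctx->VSSetConstantBuffers(0, 1, &cbuffer);
}
```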

"There will be major features. none to be thought of yet"

I'm trying to implement what DICE has done for Battlefield 3, in terms of using buffers to store per-instance matrices to reduce draw calls. The constant buffer will hold data such as the number of matrices in the bone matrix palette, and that number (as well as additional data being stored in it) can be different for each type of object, and so needs to be updated for each draw call. Here's a link to the DICE presentation. The instancing section is the first section under Performance, about halfway through the doc.
http://publications.dice.se/attachments/GDC11_DX11inBF3_Public.pdf
Right, I see... well, what I said above still stands for most of your data. If you look at slides 30/31 you'll see they have a very small cbuffer for the per-draw-call data, so you might want to consider how much you place in it.

I suspect that if you are moving around small enough buffers, either option would be fine; we had a purely CPU-limited rendering test at work which was drawing 50,000 cubes and, for each draw call, was doing a map/unmap of a cbuffer on multiple contexts (6 IIRC; there might have been one per context, but don't quote me on that, it's been a while since I played with that bit of the code). With that test we were good up until around 15,000 draw calls before the driver started to get into trouble internally with memory issues.

Do whatever makes organisational sense, I guess...
My cbuffer consists of eight 32-bit integers, so only 2 vector registers. Pretty darn small. We won't have anywhere near 15,000 draw calls. Probably under 1,000, but we need to maintain 60 FPS at all times. Also, since DICE is using this method, the hardware vendors may target it for optimization in their drivers. Though, there's no telling whether DICE is using a single cbuffer and relying on renaming by the driver, or using a bunch of cbuffers. Or, using UpdateSubresource instead of Map/Unmap, as Tordin mentioned earlier.
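Purely as an illustration, a 32-byte cbuffer like that might be laid out as below; only the "eight 32-bit values in two vector registers" part comes from the post above, and the field names are invented:

```cpp
// Illustrative layout for a 32-byte cbuffer: eight 32-bit values packed into
// two 16-byte constant registers. Field names are hypothetical.
#include <cstdint>

struct PerDrawConstants
{
    uint32_t boneMatrixCount;   // e.g. matrices in the bone palette
    uint32_t instanceOffset;    // hypothetical: where this draw's data starts
    uint32_t materialIndex;     // hypothetical per-draw index
    uint32_t padding0;
    uint32_t userData[4];       // remaining four 32-bit slots
};

static_assert(sizeof(PerDrawConstants) == 32,
              "size must stay a multiple of 16 bytes to be a legal cbuffer size");

// Matching HLSL would be along the lines of:
//   cbuffer PerDrawConstants : register(b0)
//   {
//       uint  g_BoneMatrixCount;
//       uint  g_InstanceOffset;
//       uint  g_MaterialIndex;
//       uint  g_Padding0;
//       uint4 g_UserData;
//   };
```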



I've not seen anything which says to prefer UpdateSubresource over Map/Unmap; a quick look at the SDK docs would suggest that in the best case UpdateSubresource puts the data straight into "destination memory", and in the worst case it creates an extra buffer, copies the data there first, and then copies it again into destination memory when the command buffer is flushed. A discard-map would likely do much the same, but probably quicker, as it doesn't have to worry about checking for resource contention; it can just throw away the reference and grab a new chunk (or reuse a chunk) of memory.

In short, I'd probably go for a discard-map plus a cbuffer per object type, but make it easy to switch to multiples if that proves to be a bottleneck.
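A sketch of how that "discard-map plus a cbuffer per object type" organisation might look; the ObjectType enum and the constant layout are hypothetical, and it would be straightforward to move to multiple buffers per type later if that becomes the bottleneck:

```cpp
// Sketch of "discard-map + a cbuffer per object type": each object type owns
// one small dynamic cbuffer that gets discard-mapped before its draws.
// ObjectType and the PerTypeConstants layout are purely illustrative.
#include <d3d11.h>
#include <cstring>

enum ObjectType { kStaticMesh, kSkinnedMesh, kParticles, kObjectTypeCount };

struct PerTypeConstants
{
    unsigned int values[8]; // eight 32-bit values -> 32 bytes, a legal cbuffer size
};

class PerTypeCBuffers
{
public:
    bool Init(ID3D11Device* device)
    {
        D3D11_BUFFER_DESC desc = {};
        desc.ByteWidth      = sizeof(PerTypeConstants);
        desc.Usage          = D3D11_USAGE_DYNAMIC;
        desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
        desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

        for (int i = 0; i < kObjectTypeCount; ++i)
        {
            if (FAILED(device->CreateBuffer(&desc, nullptr, &m_buffers[i])))
                return false;
        }
        return true;
    }

    // Discard-map the type's buffer, write the new constants, and bind it.
    void Update(ID3D11DeviceContext* ctx, ObjectType type, const PerTypeConstants& data)
    {
        ID3D11Buffer* buffer = m_buffers[type];
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(ctx->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        {
            std::memcpy(mapped.pData, &data, sizeof(data));
            ctx->Unmap(buffer, 0);
        }
        ctx->VSSetConstantBuffers(0, 1, &buffer);
    }

private:
    ID3D11Buffer* m_buffers[kObjectTypeCount] = {};
};
```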
Sounds like a plan. Thanks again for all of your help!

