[DX11] Fastest way to update a constant buffer per draw call

Let's say I have 500 draw calls per frame and all 500 draw calls use the same shader and that shader uses one constant buffer. Let's also assume that the data in the constant buffer needs to be built dynamically for each draw call. What would be the most desirable way to update the constant buffers, in terms of efficiency?

A) Create a single constant buffer and call Map/Unmap on that same constant buffer before each draw call.

B) Create 500 constant buffers, one for each draw call, and call Map/Unmap on the draw call's own constant buffer.

C) Or, another idea?

I know that for (A) the driver will rename the buffer each time I Map it, discarding the previous contents, which is fine. But is it OK to expect that the driver can handle hundreds or even thousands of renames per frame? And I assume the rename process consumes some time, too.

On the other hand, (B) avoids the renaming and any associated overhead at the expense of possibly more video memory being consumed (500 constant buffers, even if fewer draw calls are actually used) and more code complexity.
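For concreteness, option (A) in code looks roughly like the sketch below: one dynamic cbuffer, discard-mapped before every draw. This is a minimal sketch only; the PerDrawData struct, its contents, and the register slot are placeholders made up for illustration, not anything fixed by the question.

```cpp
// Minimal sketch of option (A): one dynamic cbuffer, discard-mapped before each draw.
// PerDrawData and the register slot (0) are made-up placeholders for illustration.
#include <d3d11.h>
#include <cstring>

struct PerDrawData
{
    float worldMatrix[16]; // example payload; total size must stay a multiple of 16 bytes
};

ID3D11Buffer* CreatePerDrawCBuffer(ID3D11Device* device)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = sizeof(PerDrawData);
    desc.Usage          = D3D11_USAGE_DYNAMIC;            // CPU writes every draw
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

void DrawWithOptionA(ID3D11DeviceContext* ctx, ID3D11Buffer* cbuffer,
                     const PerDrawData& data, UINT indexCount)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    // WRITE_DISCARD lets the driver hand back a fresh region (a "rename")
    // instead of stalling until the GPU has finished with the old contents.
    if (SUCCEEDED(ctx->Map(cbuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, &data, sizeof(data));
        ctx->Unmap(cbuffer, 0);
    }

    ctx->VSSetConstantBuffers(0, 1, &cbuffer);
    ctx->DrawIndexed(indexCount, 0, 0);
}
```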
On older GPUs, there's no such thing as a cbuffer; there's just one global set of shader registers. On these cards, when you ask to set a cbuffer, it copies the register-id/value pairs out of the cbuffer and into the command-buffer. The GPU consumes the command-buffer in order, reading out the register values before reading the draw-call.
For these kinds of GPUs, I'd theorise that option (A) would be the most efficient, as there really is no cbuffer management going on behind the scenes.

On newer GPUs, it's possible for cbuffers to be stored in VRAM, and then moved into registers when required. On these cards, when you put data into a cbuffer, it can actually perform a VRAM transfer (and possibly issue a cache-invalidation command to the command-buffer). When you bind a cbuffer, you're writing a command into the command buffer that instructs the GPU to fetch some register values from VRAM.
On these cards, using option (B) would allow you to perform all of the VRAM transfers well in advance of any draw-calls that use that data, which reduces the amount of data flowing through the command-buffer. However, as you're still moving the same amount of data to the GPU every frame anyway (as you're regenerating the cbuffers each frame), there isn't really a bandwidth saving here... though it still might be more efficient...
You'd probably have to test it (on multiple GPUs) to find out.


On really old GPUs, there's no such thing as cbuffers AND there's no such thing as shader registers! On these cards, when you set a cbuffer, the driver actually takes the compiled shader code and inserts new instructions into it that contain your shader values (now as hard-coded numbers, not variables). On this class of GPUs, no matter what you do, setting shader variables is going to be bad for performance, as every change-of-variables actually produces a whole new shader program ;)
This is for a DX11-compliant card, an NVIDIA GeForce GTX 460, for example. The cbuffers are indeed in VRAM on this type of graphics card. I suppose my question boils down to this: is it OK to assume that the driver for this class of modern graphics card can handle hundreds or even thousands of buffer renames each frame without breaking a sweat? Or is the buffer-renaming mechanism really there only to handle a few rare cases of multiple Map/Unmaps to the same buffer?



Is there any reason you can't generate the data up front, before issuing draw calls, then build one large cbuffer and index in the shader based on an instance ID? Maybe split this up over a few buffers depending on cbuffer size so you aren't updating a massive chunk of data in one go.

So; [generate all data] -> [bind] -> [draw objects as required with indexing]

Generating data at render time seems like Bad Voodoo to me anyway; render time should just be rendering, so sort your data out beforehand.
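A rough sketch of how that could look on the C++ side, assuming one dynamic cbuffer holding an array of per-instance structs that the vertex shader indexes with SV_InstanceID; the InstanceData layout, the array size, and the register slots are illustrative assumptions, not anything the posters specified:

```cpp
// Sketch of the suggestion above: fill one large buffer up front, then let each
// instanced draw index into it. Names and sizes here are assumptions.
//
// Matching HLSL (for reference only):
//   cbuffer PerInstanceData : register(b1)
//   {
//       float4x4 g_World[1024];
//   };
//   // in the vertex shader: g_World[instanceID], where instanceID is SV_InstanceID
#include <d3d11.h>
#include <cstring>
#include <vector>

struct InstanceData
{
    float world[16]; // one float4x4 per instance
};

// Keep the whole array inside the 64 KB cbuffer limit (4096 float4 registers).
constexpr UINT kMaxInstances = 1024; // 1024 * 64 bytes = 64 KB

ID3D11Buffer* CreateInstanceCBuffer(ID3D11Device* device)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = kMaxInstances * sizeof(InstanceData);
    desc.Usage          = D3D11_USAGE_DYNAMIC;
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

void RenderFrame(ID3D11DeviceContext* ctx, ID3D11Buffer* cbuffer,
                 const std::vector<InstanceData>& instances, UINT indexCountPerObject)
{
    // [generate all data] -> one map per frame instead of one per draw.
    // Assumes instances.size() <= kMaxInstances.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(cbuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, instances.data(),
                    instances.size() * sizeof(InstanceData));
        ctx->Unmap(cbuffer, 0);
    }

    // [bind] -> [draw objects as required with indexing]
    ctx->VSSetConstantBuffers(1, 1, &cbuffer);
    ctx->DrawIndexedInstanced(indexCountPerObject,
                              static_cast<UINT>(instances.size()), 0, 0, 0);
}
```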

To (A): I believe you should use UpdateSubresource instead.
I think I read in the SDK that it's faster for constant buffers.

Map/Unmap is for vertex buffers and textures, I think.
Note: not 100% sure.
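For what it's worth, a minimal sketch of the UpdateSubresource path might look like this; the PerDrawData struct is a made-up placeholder, and whether this actually beats a discard-map for cbuffers is exactly the open question:

```cpp
// Sketch of the UpdateSubresource path for a cbuffer. With this approach the
// buffer is created with DEFAULT usage and no CPU access; the driver handles
// getting the data to the GPU (possibly via an internal staging copy).
#include <d3d11.h>

struct PerDrawData
{
    float worldMatrix[16]; // placeholder payload
};

ID3D11Buffer* CreateDefaultCBuffer(ID3D11Device* device)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(PerDrawData);
    desc.Usage     = D3D11_USAGE_DEFAULT;          // updated via UpdateSubresource
    desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;   // no CPUAccessFlags needed

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

void UpdateAndBind(ID3D11DeviceContext* ctx, ID3D11Buffer* cbuffer,
                   const PerDrawData& data)
{
    // For constant buffers the row/depth pitch arguments are ignored, and the
    // whole buffer must be updated (pDstBox must be null in D3D11.0).
    ctx->UpdateSubresource(cbuffer, 0, nullptr, &data, 0, 0);
    ctx->VSSetConstantBuffers(0, 1, &cbuffer);
}
```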

"There will be major features. none to be thought of yet"

I'm trying to implement what DICE has done for Battlefield 3, in terms of using buffers to store per-instance matrices to reduce draw calls. The constant buffer will hold data such as the number of matrices in the bone matrix palette, and that number (as well as additional data being stored in it) can be different for each type of object, and so needs to be updated for each draw call. Here's a link to the DICE presentation. The instancing section is the first section under Performance, about halfway through the doc.
http://publications.dice.se/attachments/GDC11_DX11inBF3_Public.pdf
Right, I see... well, what I said above still stands for most of your data. If you look at slides 30/31 you'll see they have a very small cbuffer for the per-draw-call data, so you might want to consider how much you place in it.

I suspect that if you are moving around small enough buffers, either option would be fine; we had a purely CPU-limited rendering test at work which was drawing 50,000 cubes and, for each draw call, was doing a map/unmap of a cbuffer on multiple contexts (6 IIRC; there might have been one per context, but don't quote me on that, it's been a while since I played with that bit of the code). With that test we were good up until around 15,000 draw calls before the driver started to get into trouble internally with memory issues.

Do whatever makes organisational sense, I guess...
My cbuffer consists of eight 32-bit integers, so only 2 vector registers. Pretty darn small. We won't have anywhere near 15,000 draw calls. Probably under 1,000, but we need to maintain 60 FPS at all times. Also, since DICE is using this method, the hardware vendors may target it for optimization in their drivers. Though, there's no telling whether DICE is using a single cbuffer and relying on renaming by the driver, or using a bunch of cbuffers. Or, using UpdateSubresource instead of Map/Unmap, as Tordin mentioned earlier.
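Purely as an illustration, a 32-byte cbuffer like that might be laid out as below; only the "eight 32-bit values in two vector registers" part comes from the post above, and the field names are invented:

```cpp
// Illustrative layout for a 32-byte cbuffer: eight 32-bit values packed into
// two 16-byte constant registers. Field names are hypothetical.
#include <cstdint>

struct PerDrawConstants
{
    uint32_t boneMatrixCount;   // e.g. matrices in the bone palette
    uint32_t instanceOffset;    // hypothetical: where this draw's data starts
    uint32_t materialIndex;     // hypothetical per-draw index
    uint32_t padding0;
    uint32_t userData[4];       // remaining four 32-bit slots
};

static_assert(sizeof(PerDrawConstants) == 32,
              "size must stay a multiple of 16 bytes to be a legal cbuffer size");

// Matching HLSL would be along the lines of:
//   cbuffer PerDrawConstants : register(b0)
//   {
//       uint  g_BoneMatrixCount;
//       uint  g_InstanceOffset;
//       uint  g_MaterialIndex;
//       uint  g_Padding0;
//       uint4 g_UserData;
//   };
```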



I've not seen anything which says to prefer UpdateSubresource over Map/Unmap; a quick look at the SDK docs would suggest that in the best case UpdateSubresource puts the data straight into "destination memory", and in the worst case it creates an extra buffer, copies the data there first, and then copies it again into destination memory when the command buffer is flushed. A discard-map would likely do much the same, but probably quicker, as it doesn't have to worry about checking for resource contention; it can just throw away the reference and grab a new chunk (or reuse a chunk) of memory.

In short, I'd probably go for a discard-map plus a cbuffer per object type, but make it easy to switch to multiples if that proves to be a bottleneck.
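A sketch of how that "discard-map plus a cbuffer per object type" organisation might look; the ObjectType enum and the constant layout are hypothetical, and it would be straightforward to move to multiple buffers per type later if that becomes the bottleneck:

```cpp
// Sketch of "discard-map + a cbuffer per object type": each object type owns
// one small dynamic cbuffer that gets discard-mapped before its draws.
// ObjectType and the PerTypeConstants layout are purely illustrative.
#include <d3d11.h>
#include <cstring>

enum ObjectType { kStaticMesh, kSkinnedMesh, kParticles, kObjectTypeCount };

struct PerTypeConstants
{
    unsigned int values[8]; // eight 32-bit values -> 32 bytes, a legal cbuffer size
};

class PerTypeCBuffers
{
public:
    bool Init(ID3D11Device* device)
    {
        D3D11_BUFFER_DESC desc = {};
        desc.ByteWidth      = sizeof(PerTypeConstants);
        desc.Usage          = D3D11_USAGE_DYNAMIC;
        desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
        desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

        for (int i = 0; i < kObjectTypeCount; ++i)
        {
            if (FAILED(device->CreateBuffer(&desc, nullptr, &m_buffers[i])))
                return false;
        }
        return true;
    }

    // Discard-map the type's buffer, write the new constants, and bind it.
    void Update(ID3D11DeviceContext* ctx, ObjectType type, const PerTypeConstants& data)
    {
        ID3D11Buffer* buffer = m_buffers[type];
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(ctx->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        {
            std::memcpy(mapped.pData, &data, sizeof(data));
            ctx->Unmap(buffer, 0);
        }
        ctx->VSSetConstantBuffers(0, 1, &buffer);
    }

private:
    ID3D11Buffer* m_buffers[kObjectTypeCount] = {};
};
```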
Sounds like a plan. Thanks again for all of your help!

