Dynamic Updating of StructuredBuffer in DX11

9 comments, last by 360GAMZ 12 years, 5 months ago
Hello,

We have a lot of vehicles on the screen in our game and I want to draw them with as few draw calls as possible. Each vehicle has its own bone matrix palette, which is just an array of matrices that is indexed by an integer stored as part of the vertex data for the vehicle. The matrices control the animation of various parts of the vehicle such as the suspension.

In our old DX9 renderer, I would upload the matrix palette into the constant buffer, render one vehicle, load the next vehicle’s matrix palette into the constant buffer, render that vehicle, etc. One draw call per vehicle, even if the vehicles were the same model.

For our new DX11 renderer, I’d like to use hardware instancing to draw N vehicles in the same draw call. To do this, the shader would need access to each instance’s matrix palette. If I used the constant buffer, I would only have room to store several matrix palettes worst case (256 bones each). So, I’m looking for an alternative solution that would give me as much room as I needed, within reason.
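To put a rough number on "several": a D3D11 constant buffer holds at most 4096 float4 constants, so the worst case works out as below. This is only a back-of-the-envelope sketch. The names are made up for illustration, and it assumes the constant buffer holds nothing but bone matrices, stored either as full 4x4 matrices (four float4 rows) or as 4x3 matrices (three float4 rows).

```cpp
#include <cassert>
#include <cstddef>

// D3D11 limit on the size of one constant buffer, in float4 constants.
constexpr std::size_t kMaxConstantsPerCBuffer = 4096;
// Worst-case palette size mentioned in the post.
constexpr std::size_t kBonesPerPalette = 256;

// How many whole palettes fit if each matrix takes `float4RowsPerMatrix`
// float4 constants (4 for a 4x4 matrix, 3 for a 4x3 matrix).
constexpr std::size_t palettesPerCBuffer(std::size_t float4RowsPerMatrix)
{
    assert(float4RowsPerMatrix > 0);
    return kMaxConstantsPerCBuffer / (kBonesPerPalette * float4RowsPerMatrix);
}
```

With 4x4 matrices that is 4 palettes per constant buffer, and with 4x3 matrices it is 5, which is far short of what instancing dozens of vehicles would need.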

One idea is to use a single StructuredBuffer to store the matrix palette for all instances in the draw call. Since the size of the matrix palette is the same for all instances of the draw call, a single constant buffer integer could be used to store the size of the matrix palette. Then, the vertex shader would simply multiply that size by the instance # and then add that to the bone matrix index from the vertex to arrive at the final index needed to fetch from the StructuredBuffer.
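The index arithmetic described above can be sketched on the CPU side like this. It is a minimal illustration rather than shader code: the function name `paletteFetchIndex` and its parameter names are hypothetical, and in an actual vertex shader the instance number would presumably come from the `SV_InstanceID` system value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the arithmetic the vertex shader would do.
// bonesPerPalette: palette size shared by every instance in the draw call
//                  (the single integer stored in the constant buffer).
// instanceId:      which instance is being processed.
// boneIndex:       per-vertex bone index, local to that instance's palette.
inline std::uint32_t paletteFetchIndex(std::uint32_t bonesPerPalette,
                                       std::uint32_t instanceId,
                                       std::uint32_t boneIndex)
{
    assert(boneIndex < bonesPerPalette);
    return instanceId * bonesPerPalette + boneIndex;
}
```

For example, with 256 bones per palette, bone 10 of instance 3 would be fetched from element 3 * 256 + 10 = 778 of the StructuredBuffer.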

My first question is whether this sounds like a good, efficient approach?

If so, then my next question is how to best update the StructuredBuffer across draw calls without stalling the GPU:

For example, suppose I want to draw 20 instances of car mesh A, followed by 15 instances of car mesh B. Using the above method, I would write 20 matrix palettes into the StructuredBuffer and do a DrawIndexedInstanced() to draw all instances of car A. Then, I would need to write 15 matrix palettes into the StructuredBuffer and draw all instances of car B.

Since DX11 doesn’t support D3D11_MAP_WRITE_NO_OVERWRITE for StructuredBuffers, I can’t append B’s matrix palette to follow A’s palette. So, what’s best here? Should I use D3D11_MAP_WRITE_DISCARD when writing out both A’s and B’s palettes, knowing that there will likely be hundreds of Map() calls doing this each frame? Or do I need separate StructuredBuffers for A and B, and hundreds more for the other draw calls that happen during the frame? What’s most efficient for the hardware here?

Or… is there another approach that’s all around faster?

Thanks for any help!
1. Yes, that should definitely be feasible. DICE is using similar techniques pretty aggressively in the PC version of BF3.

2. If you use D3D11_MAP_WRITE_DISCARD on a buffer created with D3D11_USAGE_DYNAMIC, then the runtime will automatically handle resource updates so that you don't stall the CPU. Usually this is done by cycling through a driver-side ring buffer or collection of buffers on each subsequent call to Map, which allows you to keep pushing data while the GPU is still reading from an older buffer. I'm not sure if having more dynamic buffers will help, since the driver is going to do its own hidden resource management behind the scenes. You'll probably have to profile on different hardware to find out for sure.
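As a rough mental model of that cycling, here is a toy simulation in plain C++. It is not how any particular driver is implemented: the class `RenamedBuffer` and all of its methods are hypothetical, and growing the pool when every copy is busy is just one assumed policy (a real driver might wait instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a dynamic buffer backed by several hidden copies.
class RenamedBuffer {
public:
    RenamedBuffer(std::size_t bytes, std::size_t initialCopies)
        : bytes_(bytes),
          copies_(initialCopies, std::vector<std::uint8_t>(bytes)),
          inFlight_(initialCopies, false),
          current_(0)
    {
        assert(initialCopies > 0);
    }

    // Models Map with a discard flag: pick a copy the GPU is not reading.
    // If every copy is in flight, add a new one instead of waiting.
    std::size_t mapDiscard()
    {
        for (std::size_t i = 0; i < copies_.size(); ++i) {
            if (!inFlight_[i]) { current_ = i; return i; }
        }
        copies_.emplace_back(bytes_);
        inFlight_.push_back(false);
        current_ = copies_.size() - 1;
        return current_;
    }

    // Memory the CPU writes into after mapDiscard().
    std::vector<std::uint8_t>& storage() { return copies_[current_]; }

    // Models a draw call that will read the most recently mapped copy.
    void submitDraw() { inFlight_[current_] = true; }

    // Models the GPU finishing with a copy, which makes it reusable.
    void gpuFinished(std::size_t copy) { inFlight_[copy] = false; }

    std::size_t copyCount() const { return copies_.size(); }

private:
    std::size_t bytes_;
    std::vector<std::vector<std::uint8_t>> copies_;
    std::vector<bool> inFlight_;
    std::size_t current_;
};
```

In this model, hundreds of discards per frame cost either memory (more copies) or time (waiting for a copy to free up), which is why profiling on real hardware is the only way to know how a given driver behaves.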
Wouldn't a normal "Buffer<float4>" be more efficient here if it's just matrices? The indexing would be pretty simple. Or is the extra work introduced by using a StructuredBuffer negligible?
You could always load all of the data into one buffer for all mesh types, then use a constant buffer to indicate when you are instancing mesh A vs mesh B. In the same constant buffer, you could simply provide offsets into your structured (or standard) buffer to begin indexing for your matrix data.

That would allow for zero stalling in between draw calls, plus still allow you to have different sized meshes for each vehicle type (i.e. each draw instanced call could be unique).
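A CPU-side sketch of that layout might look like the following. Everything here is hypothetical naming for illustration (`PaletteStaging`, `BatchRange`, `fetchIndex`); the assumption is that all palettes for the frame are staged in one array, copied into the GPU buffer once, and each draw call only receives a small record of offsets through its constant buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Matrix4x4 { float m[16]; };  // stand-in for a bone matrix

// Hypothetical per-draw record that would go into the constant buffer.
struct BatchRange {
    std::uint32_t firstMatrix;      // where this batch starts, in matrices
    std::uint32_t bonesPerPalette;  // palette size for this mesh type
    std::uint32_t instanceCount;    // instances drawn by this call
};

// Collects every palette for the frame into one contiguous array.
class PaletteStaging {
public:
    // Appends back-to-back palettes for one mesh type and returns the
    // range describing where they landed.
    BatchRange append(const std::vector<Matrix4x4>& palettes,
                      std::uint32_t bonesPerPalette)
    {
        assert(bonesPerPalette > 0);
        assert(palettes.size() % bonesPerPalette == 0);
        BatchRange range{
            static_cast<std::uint32_t>(matrices_.size()),
            bonesPerPalette,
            static_cast<std::uint32_t>(palettes.size() / bonesPerPalette)};
        matrices_.insert(matrices_.end(), palettes.begin(), palettes.end());
        return range;
    }

    // The data that would be copied into the GPU buffer once per frame.
    const std::vector<Matrix4x4>& data() const { return matrices_; }

private:
    std::vector<Matrix4x4> matrices_;
};

// Index the vertex shader would compute for a given batch.
inline std::uint32_t fetchIndex(const BatchRange& range,
                                std::uint32_t instanceId,
                                std::uint32_t boneIndex)
{
    assert(boneIndex < range.bonesPerPalette);
    return range.firstMatrix + instanceId * range.bonesPerPalette + boneIndex;
}
```

Because each batch carries its own `bonesPerPalette`, mesh A and mesh B can use different palette sizes while sharing the same buffer.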
Thanks for the great responses! I can see that I've come to the right place :)

@MJP - If I'm thinking of the same thing as you are, I believe DICE is using this approach for rendering many dynamic lights in a deferred renderer using a single (or very few) draw call. So DISCARD would probably work fine for that since it's probably just one per frame. But I'm concerned that calling DISCARD hundreds of times on the same Buffer in the same frame will cause stalls because the hardware can probably provide only a relatively small number of alternate buffers.

@maya18222 - Yes, that's a really good idea, thanks.

@Jason Z - I had the same thought this morning. Doing that could dramatically reduce the number of DISCARD Map calls, which would be a good thing. Here's what I'm wondering, though. Let's say I do this and have one really big Buffer and start stacking lots and lots of matrix palettes into it. I set up the constant buffer with the offset into the Buffer of where A's series of matrix palettes start and I call DrawIndexedInstanced() to draw all of the A's. Then, I set up the constant buffer with the offset into the Buffer of where B's series of matrix palettes start and I call DrawIndexedInstanced() to draw all of the B's. Etc. What I'm wondering is whether the graphics hardware will re-upload the entire Buffer with each of those DrawIndexedInstanced() calls, as opposed to uploading it only once with the first DrawIndexedInstanced() call, and just leaving it there for the others to use as well.

Wouldn't a normal "Buffer<float4>" be more efficient here if it's just matrices? The indexing would be pretty simple. Or is the extra work introduced by using a StructuredBuffer negligible?


Not really. In both cases the hardware reads exactly the same amount of memory, and likely with the same number of instructions. So you would just be making things more difficult for yourself by needing extra code to calculate the proper index.
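To make the "extra code" concrete, here is a small sketch of the two index calculations, written as plain C++ rather than HLSL. The function names are hypothetical, and it assumes each matrix is stored in the Buffer<float4> as four consecutive float4 rows.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: full 4x4 matrices, one float4 per row.
constexpr std::uint32_t kRowsPerMatrix = 4;

// StructuredBuffer of matrices: one element per matrix, so the matrix
// index is the element index.
inline std::uint32_t structuredIndex(std::uint32_t matrixIndex)
{
    return matrixIndex;
}

// Buffer<float4>: one element per row, so each matrix needs one fetch per
// row, each with its own computed index.
inline std::uint32_t float4Index(std::uint32_t matrixIndex, std::uint32_t row)
{
    assert(row < kRowsPerMatrix);
    return matrixIndex * kRowsPerMatrix + row;
}
```

Either way the same bytes are read; the float4 version just moves the row arithmetic into your own shader code.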

@MJP - If I'm thinking of the same thing as you are, I believe DICE is using this approach for rendering many dynamic lights in a deferred renderer using a single (or very few) draw call. So DISCARD would probably work fine for that since it's probably just one per frame. But I'm concerned that calling DISCARD hundreds of times on the same Buffer in the same frame will cause stalls because the hardware can probably provide only a relatively small number of alternate buffers.


No, they mentioned using it for level geometry and skinned meshes.

Like I said before it's going to depend on how the driver manages it, but if you can allocate large buffers yourself then there's no reason the driver can't do it either. It might be able to do it more efficiently if you allocate larger buffers or larger numbers of buffers, but you'd have to profile to find out for sure.
@Jason Z - I had the same thought this morning. Doing that could dramatically reduce the number of DISCARD Map calls, which would be a good thing. Here's what I'm wondering, though. Let's say I do this and have one really big Buffer and start stacking lots and lots of matrix palettes into it. I set up the constant buffer with the offset into the Buffer of where A's series of matrix palettes start and I call DrawIndexedInstanced() to draw all of the A's. Then, I set up the constant buffer with the offset into the Buffer of where B's series of matrix palettes start and I call DrawIndexedInstanced() to draw all of the B's. Etc. What I'm wondering is whether the graphics hardware will re-upload the entire Buffer with each of those DrawIndexedInstanced() calls, as opposed to uploading it only once with the first DrawIndexedInstanced() call, and just leaving it there for the others to use as well.

If there is nothing going on in between the calls, then I don't see any reason that the driver would evict the buffer. Especially if it remains bound to the pipeline, there shouldn't be an issue with having to re-upload the data more than once per frame. I do think that MJP's advice is the best though - you should really build a quick prototype and try it out to make sure it suits your needs and is actually faster than just individual calls like your D3D9 renderer.
What about using a tbuffer vs. a Buffer resource? tbuffers can be huge (unlike cbuffers), and according to an NVIDIA presentation I found (ftp://download.nvidia.com/developer/cuda/seminar/TDCI_DX10perf_DX11preview.pdf) tbuffers are optimal for random access. Whereas, according to the Skinning10 sample that comes with the DX11 SDK, Buffer resources are optimal for linear access.

@Jason Z - Yes, I will definitely prototype all of this out. Just trying to wrap my head around all of the options first.
I haven't done a lot of profiling myself, but I'd imagine that if there is any performance delta between tbuffers, textures, and buffers, it will be rather small. Modern hardware tends to be more generic with regard to sampling resources from memory.

This topic is closed to new replies.
