[D3D12] Descriptor Heap Strategies

Started by Funkymunky. 9 comments, last by mark_braga 4 years ago
Funkymunky
Author
1,420
February 15, 2017 03:15 AM

A lot of DX12 articles talk about implementing the descriptor heap entries as a ring buffer (notably the nVidia Do's and Don'ts). I've also read in these forums that some people prefer a stack-allocated scheme. I don't see why these methods would be the preferred way of solving this problem. A ring buffer of descriptors is great if you're always adding new descriptors while deleting the oldest ones. But what happens when you want to remove a descriptor from the middle of the active set? And as for a stack-allocated scheme, wouldn't that involve copying in the descriptors every frame? Why wouldn't something like a free-list or buddy allocator be preferable to either of these setups?

Hodgman
52,704
February 15, 2017 03:23 AM

A ring buffer of descriptors is great if you're always adding new descriptors while deleting the oldest ones. But what happens when you want to remove a descriptor from the middle of the active set? And as for a stack-allocated scheme, wouldn't that involve copying in the descriptors every frame?
Both the stack and ring strategies are designed for the descriptors to be added every frame. You don't delete anything from the middle of them -- everything has a transient lifetime. With the stack strategy, you'd divide your heap up into several stacks, and use a different one per frame. When the GPU has finished processing that frame, you reset that stack and reuse it (so it's basically a ring of stacks, or a double-buffer of stacks).
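The "ring of stacks" idea above can be sketched with plain index arithmetic. This is a minimal model (class and method names are made up for illustration, not D3D12 API): the heap is split into one region per in-flight frame, each region is a bump/stack allocator, and a region is reset wholesale once the GPU fence for its frame has signaled.

```cpp
#include <cstdint>
#include <stdexcept>

// Sketch of the "ring of stacks" scheme: one descriptor heap split into
// N per-frame regions; each region is a stack/bump allocator that is
// reset wholesale when the GPU has finished that frame.
class FrameDescriptorStacks {
public:
    FrameDescriptorStacks(uint32_t heapSize, uint32_t frameCount)
        : regionSize_(heapSize / frameCount), frameCount_(frameCount),
          frame_(0), top_(0) {}

    // Allocate `count` contiguous descriptor slots for the current frame.
    // Returns the offset of the first slot within the whole heap.
    uint32_t Allocate(uint32_t count) {
        if (top_ + count > regionSize_)
            throw std::runtime_error("per-frame descriptor region exhausted");
        uint32_t offset = frame_ * regionSize_ + top_;
        top_ += count;
        return offset;
    }

    // Call once the GPU fence for the region being reused has signaled:
    // advance to the next region and reset its stack top.
    void BeginFrame() {
        frame_ = (frame_ + 1) % frameCount_;
        top_ = 0;  // everything in this region is transient; just reset
    }

private:
    uint32_t regionSize_, frameCount_, frame_, top_;
};
```

Note that individual allocations are never freed; the transient-lifetime assumption is what lets the whole region be reset in O(1).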

Yes, if you have long-lived descriptor tables that can be re-used every frame, a more typical heap allocator (e.g. a free-list) is appropriate. However, you probably also have a lot of transient/per-frame descriptor tables too! So, you can make a large descriptor heap and then split it up between a few allocators: give a large portion of your descriptor heap to a ring allocator, and a different portion to a freelist-based allocator.

Funkymunky
Author
1,420
February 15, 2017 04:05 AM

I guess what I don't understand is why there would be a lot of objects with transient lifetimes. It seems like most textures and constant buffers are going to stick around for a while. In fact, it seems like adding new descriptors/removing old ones would happen pretty infrequently. Can you describe a use case where a majority of objects would require new descriptors every frame? And also, are you saying to call CreateShaderResourceView/CreateConstantBufferView every frame?

SoldierOfLight
February 15, 2017 04:12 AM

The problem isn't the individual textures having transient lifetimes, but rather the sets/tables of textures having transient lifetimes, as they're arbitrarily combined by the engine. I've seen both sides of this, one where every descriptor table was pre-allocated at the initialization of the engine, and others where everything was dynamic. In the static case, the unit of allocation was descriptor tables of fixed size, used in a heap allocator scheme. In the dynamic case, the unit of allocation was the descriptor or view.

For the dynamic case, a common pattern is to use a set of "offline" descriptor heaps which exist on the CPU timeline to stage the descriptors, and CopyDescriptors on a per-frame basis to gather them into "online" descriptor heaps, into tables for binding. The Create*View APIs only need to be called on these "offline" descriptor heaps.
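The offline-stage/online-gather pattern can be modeled without any graphics API at all. In this sketch, descriptors are plain `uint64_t` payloads; the copy loop stands in for a single `CopyDescriptors` call from non-shader-visible heaps into the shader-visible one (the struct and method names here are made up):

```cpp
#include <cstdint>
#include <vector>

// Model of the "offline staging -> online gather" pattern. Descriptors
// are modeled as plain uint64_t payloads.
struct OfflineHeap {
    std::vector<uint64_t> slots;  // CPU-only heap; Create*View writes here
};

struct OnlineHeap {
    std::vector<uint64_t> slots;  // shader-visible heap, filled per frame
    uint32_t cursor = 0;          // next free slot (simple linear allocator)

    // Gather arbitrary offline descriptors into one contiguous table and
    // return the table's base offset (what you'd bind via the root signature).
    uint32_t GatherTable(const OfflineHeap& src,
                         const std::vector<uint32_t>& indices) {
        uint32_t base = cursor;
        for (uint32_t i : indices)
            slots[cursor++] = src.slots[i];  // stand-in for CopyDescriptors
        return base;
    }
};
```

The key property is that the offline descriptors can live anywhere, in any order; contiguity is only manufactured at gather time, per table.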

Hodgman
52,704
February 15, 2017 04:14 AM

The reason you end up with transient data is that while textures (and their associated descriptors) are long-lived, you want to be binding descriptor tables to shaders. If a shader uses three textures A, B, and C, then you need the descriptors for A/B/C to be located in a shader-visible descriptor heap as a contiguous array.

I don't know what best practice is (I'd love to hear how other people are using d3d12/vulkan too!!!), but my personal strategy at the moment is to pre-create descriptors for textures/buffers/cbuffers (with the Create*View functions) in a non-shader-visible heap that is managed by a pool/free-list allocator. These are long-lived, but not used directly by shaders.

When preparing a draw-call, I figure out which descriptors and tables the current shader requires, allocate those tables from a ring-buffer within a shader-visible heap, and then use CopyDescriptors to copy my pre-created descriptors into that contiguous table, and then bind the table to the root.

i.e. each texture has a long-lived descriptor, which is copied into transient descriptor-tables.

My engine is designed around the concept of preparing a draw-item once, and then re-using it every frame. So I really should be creating these tables once when preparing a draw-item, and then keeping the table around until the draw-item is deleted... however I haven't gotten around to that optimization yet, so I'm recreating these tables with per-frame lifetimes every time a draw-item is submitted for drawing.

Funkymunky
Author
1,420
February 15, 2017 04:43 AM

I feel like a lightbulb has just turned on over my head. That makes a lot more sense. Thanks for the insights!

MJP
20,253
February 15, 2017 08:41 AM

One of the problems you may run into with pre-allocating static descriptor tables is dealing with "dynamic" resources. Say for instance you have a structured buffer full of bone weights that's updated every frame from the CPU. In this case you're likely to have multiple SRV descriptors for the same logical "buffer", since you'll probably need to perform some form of versioning (double-buffering, or pulling from a pool of temp memory) in order to make sure that the CPU isn't writing memory that the GPU is still reading. This serves as another reason why a lot of people start off with the "copy descriptors into GPU-visible descriptors every frame" approach, since it more closely matches the way you can bind resources in D3D11 (where you didn't have to care about dynamic resources, since the driver juggled the descriptors for you behind your back).

On a certain other platform that also lets you bind resources at the descriptor level, I split up descriptors into separate tables based on whether they were static or dynamic and how they were used. So for instance the static material textures were all in one static descriptor table, render targets and other "engine" textures/buffers were in another, and dynamic per-draw data (like the currently-selected reflection probe cubemap or buffers of vertex bone weights) were in another. This should be doable in D3D12 as well, but at the moment we're still doing lots of copying until I get some more time to work on this. Alternatively, I've been considering using the copy queue to upload new contents to dynamic buffers. This would allow the descriptor to stay static, and could also improve performance for some cases. However it might not work out so great if the copy queue is already in-use, which would be happening while we're streaming in new level data.

I've also been thinking about ditching the idea of separate descriptor tables and just going full bindless. Basically just declare a big unbounded array of textures (and buffers) in the shader, and then pass the descriptor index through a constant buffer to index into the array. With this approach you wouldn't have to care about descriptors being contiguous (unless it turns out that having contiguous descriptors is faster for some hardware), and you could use 1 shared root signature for most draws. It would also make it much easier to do GPU-side "binding", by writing descriptor indices into buffers or even render targets.
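The bindless idea above amounts to: descriptors live in one big array, and draws pass plain indices through a constant buffer instead of binding contiguous tables. A toy CPU-side model (all names here are hypothetical; the HLSL side is shown only in comments):

```cpp
#include <cstdint>
#include <vector>

// What you'd put in a per-draw constant buffer: indices, not bindings.
struct DrawConstants {
    uint32_t albedoIndex;
    uint32_t normalIndex;
};

// Shader-side equivalent of the lookup below would be something like:
//   Texture2D textures[] : register(t0);   // unbounded array
//   float4 c = textures[draw.albedoIndex].Sample(samp, uv);
uint64_t FetchDescriptor(const std::vector<uint64_t>& allDescriptors,
                         uint32_t index) {
    return allDescriptors[index];  // no contiguity requirement at all
}
```

Since the indices are just data, they can come from anywhere, including buffers written by the GPU, which is what enables the GPU-side "binding" mentioned above.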

Funkymunky
Author
1,420
February 15, 2017 03:33 PM

Until now I've just been creating two descriptors for any per-frame resources (mostly buffers that are constantly updated), but I've also been calling SetGraphicsRootDescriptorTable for every bound resource, rather than batching things into contiguous regions and minimizing those calls. This has worked fine for the relatively small shaders I've tested my scenes with, but it's clear now that this strategy could quickly hit a wall.

It's pretty much a classic allocation problem, except there's no reason not to apply extra memory/processing power to making the allocations/deallocations as fast as possible. I am trying to dream up a faster scheme, but so far it seems like the ring buffer / stack allocator strategy is the way to go.

That bindless strategy is intriguing. That approach would use a common root signature with access to every descriptor, right? It might be tricky getting root constants to work with that, but the tradeoff for not having to manage the descriptor heap is enticing...

Funkymunky
Author
1,420
February 16, 2017 07:02 PM

After some deliberation, I think I'm going to adopt the following scheme:

I'll create one or more "offline" heaps to create descriptors in, as this will let me create resources in separate threads. For the "online" heap, I'll use a freelist allocator to give me descriptor ranges. I'll track three lists for maintaining this. The first will be a list of available allocations, sorted by size (a normal freelist). The second will be a list of allocated and deallocated ranges, sorted by their offset from the start of the heap. The third will be a list of just the deallocated ranges, also sorted by their offset from the start of the heap. (The second and third lists will use the same structures, with each structure having pointers to its neighbors).

Every frame I will run a basic defragmentation pass. It will look at the first entry in that third list (deallocations). If the neighbor to the right of that entry is also a deallocation, then I will coalesce the two into a single deallocation. If the neighbor is instead already-allocated, then I will shuffle that allocated range to the left, essentially bubbling the deallocation toward the end of the heap.

In practice, I'll probably split the "online" heap into multiple regions (one for each frame). This way I can shuffle descriptors after a fence without disrupting something that's being used. I think as long as I don't hammer the heap with constant allocations and fragmenting deallocations, this should keep me relatively well managed. And even if I do, I can always increase the number of defragmentation passes to keep things in check.
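The core of the scheme above, a free-range allocator whose offset-sorted free list coalesces adjacent deallocations, can be sketched as follows. This is a minimal model (the size-sorted list and the incremental defrag/shuffle pass are omitted for brevity, and the names are made up):

```cpp
#include <cstdint>
#include <iterator>
#include <map>

// Free-range allocator with neighbor coalescing over [0, size).
class RangeAllocator {
public:
    explicit RangeAllocator(uint32_t size) { free_[0] = size; }

    static constexpr uint32_t kInvalid = UINT32_MAX;

    // First-fit allocation; returns an offset, or kInvalid on failure.
    uint32_t Allocate(uint32_t count) {
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second >= count) {
                uint32_t offset = it->first;
                uint32_t remaining = it->second - count;
                free_.erase(it);
                if (remaining) free_[offset + count] = remaining;
                return offset;
            }
        }
        return kInvalid;
    }

    // Return a range, merging with free neighbors on both sides.
    void Free(uint32_t offset, uint32_t count) {
        auto next = free_.lower_bound(offset);
        if (next != free_.end() && offset + count == next->first) {
            count += next->second;           // coalesce with right neighbor
            next = free_.erase(next);
        }
        if (next != free_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == offset) {
                prev->second += count;       // coalesce with left neighbor
                return;
            }
        }
        free_[offset] = count;
    }

private:
    std::map<uint32_t, uint32_t> free_;  // offset -> length, sorted by offset
};
```

Coalescing on free keeps fragmentation from accumulating between defrag passes; the defrag pass would additionally migrate live allocations leftward, which requires the fence-guarded per-frame regions described above.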

mark_braga
August 27, 2017 04:53 AM

My approach to this was that every command list gets a portion of the heap, so there is no need for any kind of fence synchronization. This works well with Vulkan too, where every command list has its own descriptor pool. The user has the option to specify the size of the sub-allocation along with some other customizations. This approach guarantees lock-free descriptor allocation, except for the one time the command list needs to sub-allocate its "local" descriptor heap from the global one.
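A sketch of that scheme (all names here are made up): the only synchronized step is the one-time chunk reservation, expressed as a single atomic add on a shared cursor; everything inside a chunk is then a plain bump allocation with no locking, since no other thread touches it.

```cpp
#include <atomic>
#include <cstdint>
#include <stdexcept>

// Global heap: each command list reserves a chunk once, via one atomic add.
class GlobalDescriptorHeap {
public:
    explicit GlobalDescriptorHeap(uint32_t size) : size_(size), cursor_(0) {}

    // Thread-safe; called once per command list.
    uint32_t ReserveChunk(uint32_t chunkSize) {
        uint32_t base = cursor_.fetch_add(chunkSize);
        if (base + chunkSize > size_)
            throw std::runtime_error("global descriptor heap exhausted");
        return base;
    }

private:
    uint32_t size_;
    std::atomic<uint32_t> cursor_;
};

// Per-command-list view: plain bump allocation, no synchronization needed
// because the chunk is owned exclusively by this command list.
struct CommandListDescriptorChunk {
    uint32_t base, size, top = 0;
    uint32_t Allocate(uint32_t count) {
        uint32_t offset = base + top;  // overflow check against `size` omitted
        top += count;
        return offset;
    }
};
```

The Vulkan analogue would be one `VkDescriptorPool` per command list, which gives the same exclusive-ownership property for free.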


This topic is closed to new replies.
