Question about per-frame resources in Vulkan/DX12


Hi,

In older APIs (OpenGL, DX11, etc.) you have been able to access textures, buffers, render targets and so on pretty much like you access CPU resources: you bind a buffer, draw with it, update the buffer with new data, draw that, and it has all just worked.

In new low-level APIs such as Vulkan or DX12 you no longer have this luxury, but instead you have to take into account the fact that the GPU will be using the buffer long after you have called "draw" followed by "submit to queue".

Most Vulkan texts I have read suggest creating resources for three frames in a ring buffer, i.e. you have three sets of command pools, command buffers and framebuffers, plus any semaphores and/or fences you need to sync the graphics and present queues. AFAIK it works the same in DX12. With this system you can continue rendering the next frame immediately and only have to wait if the GPU cannot keep up and the CPU is already three frames ahead.

My question is: since there are obviously many more resources you need to keep around "per frame", how do you structure your code? Do you simply allocate three of everything that might get written to during a frame and then pass around the current frame index whenever you need to access one of those resources? Is there a more elegant way to handle these resources? Also, where do you draw the line on what you need several of? E.g. a vertex buffer that gets updated every frame obviously needs its own instance per frame. But what about a texture that is read from a file into device-local memory? Sounds to me like you only need one of those, but is there some case where you need several?
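To make the question concrete, the naive structure I have in mind looks roughly like this (just a rough sketch, all the names are my own):

#include <vulkan/vulkan.h>
#include <cstdint>

struct FrameResources
{
    VkCommandPool   commandPool;
    VkCommandBuffer commandBuffer;
    VkFramebuffer   framebuffer;
    VkSemaphore     imageAcquired;
    VkSemaphore     renderFinished;
    VkFence         frameFence;           // created signaled so the first frame doesn't wait forever
    VkBuffer        dynamicVertexBuffer;  // one copy per frame?
};

FrameResources gFrames[3];
uint32_t       gFrameIndex = 0;           // passed around to everything?

void BeginFrame(VkDevice device)
{
    gFrameIndex = (gFrameIndex + 1) % 3;
    // Wait until the GPU has finished with this slot before reusing its resources.
    vkWaitForFences(device, 1, &gFrames[gFrameIndex].frameFence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &gFrames[gFrameIndex].frameFence);
}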

Is there some grand design I am missing?

Thanks!


Also, where do you draw the line on what you need several of? E.g. a vertex buffer that gets updated every frame obviously needs its own instance per frame. But what about a texture that is read from a file into device-local memory? Sounds to me like you only need one of those, but is there some case where you need several?
In older APIs you've still made these decisions via the flags/hints you pass to the API - e.g. old GL had GL_STATIC_DRAW, GL_STREAM_DRAW and GL_DYNAMIC_DRAW, and D3D11 has D3D11_USAGE_IMMUTABLE and D3D11_USAGE_DYNAMIC.

IMHO you should force your users to declare at creation time how they will be using the resource. Will it be immutable? Will they be updating it from the CPU once per frame? Will they be updating it from the CPU many times per frame? Will they be updating it from the CPU once per many frames? Will they be reading data back to the CPU from the resource?
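Something along these lines at the API level (just a sketch, the names here are made up):

#include <cstddef>

// Sketch only: usage declared by the user at creation time, so the backend can
// decide how many copies / how much ring space the resource needs.
enum class BufferUsage
{
    Immutable,       // written once at creation, lives in GPU-only memory
    DynamicPerFrame, // rewritten from the CPU once per frame  -> ~3x size ring
    Streaming,       // rewritten many times per frame         -> much bigger ring
    Readback         // written by the GPU, read back by the CPU later
};

struct BufferDesc
{
    std::size_t size;
    BufferUsage usage;  // forces the caller to declare their intent up front
};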

3x isn't always the limit. If you write to a constant buffer 100 times per frame, then you need 300x its size in storage capacity!

Also, for the vertex streaming case -- a buffer that gets updated every frame. On old APIs you can just update it every frame and let the driver work things out for you... but that doesn't mean that you should. It's common on older APIs (even D3D9) to implement vertex streaming via a buffer that is 3x bigger than the per-frame capacity and to stream data into it (e.g. with MAP_NOOVERWRITE in D3D). If the game is already doing stuff like this for dynamic resources, then it will port to Vulkan/D3D12 just fine :D
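e.g. in D3D11 terms, roughly like this (just a sketch; error handling omitted):

#include <d3d11.h>
#include <cstdint>
#include <cstring>

// Sketch of a D3D11-style streaming ring: the buffer is created with
// D3D11_USAGE_DYNAMIC / D3D11_CPU_ACCESS_WRITE at 3x the per-frame capacity,
// each frame appends with WRITE_NO_OVERWRITE, and wraps with DISCARD.
struct StreamRing
{
    ID3D11Buffer* buffer;     // 3 * per-frame budget in bytes
    UINT          capacity;   // total size in bytes
    UINT          head = 0;   // current write offset

    // Copies 'bytes' of data into the ring and returns the offset to draw from.
    UINT Append(ID3D11DeviceContext* ctx, const void* data, UINT bytes)
    {
        D3D11_MAP mapType = D3D11_MAP_WRITE_NO_OVERWRITE;
        if (head + bytes > capacity)           // out of room: wrap to the start
        {
            head = 0;
            mapType = D3D11_MAP_WRITE_DISCARD; // orphan the old contents
        }

        D3D11_MAPPED_SUBRESOURCE mapped;
        ctx->Map(buffer, 0, mapType, 0, &mapped);
        std::memcpy(static_cast<uint8_t*>(mapped.pData) + head, data, bytes);
        ctx->Unmap(buffer, 0);

        UINT offset = head;
        head += bytes;
        return offset;
    }
};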

See this old talk for old APIs, which is still super relevant: http://gamedevs.org/uploads/efficient-buffer-management.pdf

I'm just finishing up a simple Vulkan wrapper library for some hobby projects, so anything I say should be taken with a grain of salt. Also I've only played with Vulkan and not DX12...

I've focused my fine-grained synchronization around VkFence. I have a VkFence wrapper that has a function ExecuteOnReset(). I can pass any function object, and when I reset the fence, all the stored functions get executed. When I have any resources that need to be released/recycled at a later time (when they are no longer in use) I simply add the cleanup function to their associated fence. At some point in the future I will have to check/wait on that fence, when the fence is signaled I then present the associated swap buffer image and reset the fence, which causes all the associated cleanup functions to execute.
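Heavily simplified, the wrapper boils down to something like this (a sketch, not the exact code):

#include <vulkan/vulkan.h>
#include <cstdint>
#include <functional>
#include <vector>

class Fence
{
public:
    explicit Fence(VkDevice device) : m_device(device)
    {
        VkFenceCreateInfo info{ VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
        vkCreateFence(m_device, &info, nullptr, &m_fence);
    }

    VkFence Handle() const { return m_fence; }

    // Register cleanup/recycle work to run the next time this fence is reset.
    void ExecuteOnReset(std::function<void()> func)
    {
        m_onReset.push_back(std::move(func));
    }

    // Wait for the GPU work guarded by this fence, then run the deferred work.
    void WaitAndReset()
    {
        vkWaitForFences(m_device, 1, &m_fence, VK_TRUE, UINT64_MAX);
        vkResetFences(m_device, 1, &m_fence);
        for (auto& f : m_onReset) f();
        m_onReset.clear();
    }

private:
    VkDevice m_device;
    VkFence  m_fence = VK_NULL_HANDLE;
    std::vector<std::function<void()>> m_onReset;
};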

It's surprisingly simple and efficient, and handles nearly 95% of all synchronization. I tried a couple of other methods, and found this was by far the easiest to both implement and use. It was really one of those 'ah-ha' moments. All the other attempts at making a full-blown, all-bells-and-whistles resource manager were either very complex, inefficient, or awkward; and I found that no matter what I did I was always passing around VkFences to synchronize on anyway. So I eventually just decided to stick it all in the fence and be done with it.

I also cache/re-use command pools. So my Device class allows manual creation/destruction of pools, but also allows you to pull/return pools from a cache so I'm not constantly re-creating them every frame. Coupled with the above Fence class, drawing is usually as simple as: request a command pool, create command buffers, fill buffers, submit buffers, pass the pool to the fence to be recycled. If I want to store/reuse the command buffers for later, that's trivial as well. I know a lot of people online are talking about creating the command buffers once, then using draw indirect. I have a hard time believing this will be a better option, but I could be wrong and have no data to go on. I'd love to see a proper benchmark comparing the two styles: dynamic/re-used command buffers vs. static command buffers with dynamic draw data manually uploaded.

The problem I find with fixing command pools or resources ahead of time is that you really don't know what/how many you'll need beforehand. If you're managing each thread 'by hand' it can probably work (i.e. I need 3 command pools for each thread to rotate on, one thread for physics, one thread for foreground objects, one for UI, etc...), but I'd rather just throw everything at a thread pool and let things work themselves out. On top of that, sometimes you want to re-use command pools and other times you want to recycle them. I found it quickly became impractical to manage. So the cache system works great. Any thread can pull from the cache, any thread can recycle to the cache. I can just toss all the rendering jobs at a thread/job pool without any pre-planning or thought; the command pools are recycled using the Fences, and the command buffers are returned via Futures from the threads. It's stupidly simple to use and implement and I like that.
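The cache itself is nothing fancy; roughly (again a simplified sketch):

#include <vulkan/vulkan.h>
#include <cstdint>
#include <mutex>
#include <vector>

// Sketch of the pool cache idea: any thread can pull a pool, any thread can return one.
class CommandPoolCache
{
public:
    CommandPoolCache(VkDevice device, uint32_t queueFamily)
        : m_device(device), m_queueFamily(queueFamily) {}

    VkCommandPool Acquire()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (!m_free.empty())
            {
                VkCommandPool pool = m_free.back();
                m_free.pop_back();
                return pool;
            }
        }
        // Cache empty: create a new pool on demand.
        VkCommandPoolCreateInfo info{ VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO };
        info.queueFamilyIndex = m_queueFamily;
        VkCommandPool pool = VK_NULL_HANDLE;
        vkCreateCommandPool(m_device, &info, nullptr, &pool);
        return pool;
    }

    // Typically called from a fence's ExecuteOnReset() once the GPU is done with the pool.
    void Recycle(VkCommandPool pool)
    {
        vkResetCommandPool(m_device, pool, 0);
        std::lock_guard<std::mutex> lock(m_mutex);
        m_free.push_back(pool);
    }

private:
    VkDevice m_device;
    uint32_t m_queueFamily;
    std::mutex m_mutex;
    std::vector<VkCommandPool> m_free;
};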

As far as updating dynamic data goes (apart from using push constants whenever possible), for the vast majority of buffer updates (matrices, shader constants, etc...) I'm using vkCmdUpdateBuffer(); this means I only need to allocate the buffer/memory once and can re-use it each frame (no buffer rotations necessary, but you do need pipeline barriers). For the rather rare cases where I actually need to dynamically upload data each frame and I can't use push constants or vkCmdUpdateBuffer(), I'm writing two dynamic memory allocators. The first is a very simple slab/queue allocator designed to handle situations where allocations/frees occur in order. The second is a buddy allocator for situations where allocations/frees happen randomly.
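i.e. each update is roughly this (a sketch; the stages/access masks shown assume a uniform buffer read by the vertex shader, adjust as needed):

#include <vulkan/vulkan.h>

// Sketch: update a small buffer in place with vkCmdUpdateBuffer, then barrier
// before the shaders that read it. vkCmdUpdateBuffer is limited to 65536 bytes
// and must be recorded outside a render pass.
void UpdateUniforms(VkCommandBuffer cmd, VkBuffer buffer, const void* data, VkDeviceSize size)
{
    vkCmdUpdateBuffer(cmd, buffer, 0, size, data);

    VkBufferMemoryBarrier barrier{ VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER };
    barrier.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_UNIFORM_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer              = buffer;
    barrier.offset              = 0;
    barrier.size                = size;

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,
                         0,
                         0, nullptr,
                         1, &barrier,
                         0, nullptr);
}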

I'm not claiming that what I've done is optimal, just thought I'd throw it up for discussion/idea purposes. I'm interested as well to see what others have done/are planning to do.

Thanks guys! I have now implemented a resource pooling system where resources are returned to the pool on fence release, as per Ryan_001's suggestion. Works great!

The modern low-level APIs are great in the sense that they make you a better programmer whether you want it or not. I ported my old GUI rendering system which I originally wrote for DX11 4 years ago to Vulkan, and I now realize how many hoops the driver has had to jump through to get my GUI on the screen.

Where possible, I try to do all my resource tracking with a single fence per frame. I tag items with the frame number on which they were last submitted to the GPU, and then use a single per-frame fence to count which frame the GPU has most recently completed. This scales really well, as you can track any number of resources with one fence.
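In rough pseudo-C++ the tracking boils down to something like this (just a sketch; the destroy callback is simplified):

#include <cstdint>
#include <deque>
#include <functional>

// Sketch: each retired resource remembers the last frame it was submitted on.
struct Retired
{
    uint64_t              lastUsedFrame;
    std::function<void()> destroy;   // whatever frees/recycles the resource
};

std::deque<Retired> gRetired;        // oldest frames at the front

// Called once per frame, after reading the per-frame fence to find out which
// frame the GPU has most recently completed.
void CollectGarbage(uint64_t gpuCompletedFrame)
{
    while (!gRetired.empty() && gRetired.front().lastUsedFrame <= gpuCompletedFrame)
    {
        gRetired.front().destroy();
        gRetired.pop_front();
    }
}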

I do use more fine-grained fences for operations where you want to be more aggressive about recovering unused memory quickly, or for things that don't complete on a per-frame timeframe, such as the upload queue.

The modern low-level APIs are great in the sense that they make you a better programmer whether you want it or not. I ported my old GUI rendering system which I originally wrote for DX11 4 years ago to Vulkan, and I now realize how many hoops the driver has had to jump through to get my GUI on the screen.
Yeah, I learned so much from having to do graphics programming on consoles, which have always had these low-level APIs, and because they're all secretive and NDA'd it has created a divide in the graphics programming community. It's great for the PC to finally have low-level APIs available to everyone so they can learn this stuff :D

My question is: since there are obviously many more resources you need to keep around "per frame", how do you structure your code? Do you simply allocate three of everything that might get written to during a frame and then pass around the current frame index whenever you need to access one of those resources? Is there a more elegant way to handle these resources? Also, where do you draw the line on what you need several of? E.g. a vertex buffer that gets updated every frame obviously needs its own instance per frame. But what about a texture that is read from a file into device-local memory? Sounds to me like you only need one of those, but is there some case where you need several?
Is there some grand design I am missing?
Thanks!

First, like Hodgman said, you don't need three of everything - only of the resources you would consider "dynamic".
Also, you want "static" resources to be GPU-only accessible, so that they always get allocated in the fastest memory (GPU device memory), while dynamic resources obviously need CPU access.

Second, you don't need 3x the number of resources and handles. Most of the things you'll be dealing with are going to be just buffers in memory.
This means all you need to do is reserve 3x the memory size and then keep a starting offset:


currentOffset = baseOffset + (currentFrame % 3) * bufferSize;

That's it. The "grand design of things" is having an extra variable to store the current offset.
There is one design issue you need to be careful about: you can only write to that buffer once per frame. However, you can violate that rule if you know what you're doing, by "regressing" currentOffset to a range you know is not in use (in GL terms this is the equivalent of mapping with GL_MAP_UNSYNCHRONIZED_BIT|GL_MAP_INVALIDATE_RANGE_BIT, and in D3D11 of doing a map with D3D11_MAP_WRITE_NO_OVERWRITE).

In design terms this means you need to delay writing to the buffers as much as possible, until you have everything you need, because "writing as you go" is a terrible approach: you may end up advancing currentOffset too early (i.e. thinking that you're done when you're not), and now you don't know how to regress currentOffset to where it was before, so you need to grab a new buffer (which is also 3x the size, so you end up wasting memory).

If you're familiar with the concept of render queues, then this should feel natural: all you need is for the render queues to collect everything, and once you're done, start rendering what's in those queues.

Last but not least, there are cases where you want to do something as an exception, in which case you may want to implement a "fullStall()" which waits for everything to finish. It's slow and it's not pretty, but it's great for debugging problems and for saving you in a pinch.

Thanks guys, this is all really good stuff. Currently I am still working on wrapping the APIs (DX11, DX12 and Vulkan) under a common interface. DX11 and Vulkan are now both rendering my GUI, and the next piece of work is to get DX12 to that point. My plan is to rewrite large parts of the high-level renderer to make better use of the GPU, but leave other parts as-is for now, e.g. the GUI and debug rendering. It would be nice to go the route of allocating larger buffers and offsetting based on the frame, but for now I am using a pool, à la Ryan_001's suggestion, where I can acquire temporary buffers and command buffers. The buffers are still as small as they used to be, there are just more of them. This is probably not the most performant way, but it gets the job done.

Regarding the "full stall", I actually had to implement something like that already for shutdown (i.e. you want to wait until all GPU work is done before destroying resources) and for swap chain recreation. In Vulkan this is easy, you can just do:


void RenderDeviceVulkan::waitUntilDeviceIdle()
{
    vkDeviceWaitIdle(mDevice);
}

However, I am a little confused about how to do that on DX12. This is what I have come up with but it has not been tested yet. What do you think?


void RenderDevice12::waitUntilDeviceIdle()
{
    // Ask the queue to signal the fence once all previously submitted work is done...
    mCommandQueue->Signal(mFullStallFence.Get(), ++mFullStallFenceValue);

    // ...and block the CPU until the fence reaches that value.
    if(mFullStallFence->GetCompletedValue() < mFullStallFenceValue)
    {
        HANDLE eventHandle = CreateEventEx(nullptr, nullptr, 0, EVENT_ALL_ACCESS);
        mFullStallFence->SetEventOnCompletion(mFullStallFenceValue, eventHandle);
        WaitForSingleObject(eventHandle, INFINITE);
        CloseHandle(eventHandle);
    }
}

That would obviously only stall the one queue, but I think that might be enough for now. Is there an easier way to wait until the GPU has finished all work on DX12?

Cheers!

That's pretty much what I do when shutting down a queue:

//TODO - look into this
u64 frameCount = m_frameCount + 1;
m_mainQueue->Signal(m_mainFence, frameCount);
YieldThreadUntil([this, frameCount](){ return (s64)m_mainFence->GetCompletedValue() >= (s64)frameCount; });

