dpj

Safe resource deallocation in Vulkan



I'm starting to think about some of the issues involved in incorporating Vulkan into an engine. If I understand correctly, it's not safe to destroy a device memory resource while it's still referenced by a command buffer. In other words, Vulkan doesn't do any internal reference counting, and so a destroy operation will take effect immediately, even if the GPU is currently using the resource (or will do so in the near future).

 

Obviously the goal of this design approach was to let engine builders roll their own special-case solutions. So I was wondering: what approaches are people taking to deal with this problem?

 

I've run across this before, when implementing a PS3 version of an engine that was designed for DX9. The engine assumed that it could destroy GPU resources safely at any time (which is true for DX9). So, for my low level PS3 code, I had to provide the same guarantees.

 

To do this, I kept a list of referenced resources along with each command buffer. Periodically, I checked for completed command buffers and released the references to their associated resources.

 

My goal was to keep the cost in the most common case (i.e. nothing being destroyed) as low as possible. This solution was OK, but a little awkward because the list of resources could get quite long. And (in that particular engine) some resources tended to be referenced many times by the same list, so I ended up sorting the list to avoid adding duplicates. That worked well, because it reproduced the kind of behaviour we expected from DX9.
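
In rough C++ terms, the bookkeeping looked something like this sketch (type and member names are made up for illustration, not the actual engine code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Resource {
    uint32_t refCount = 0;
};

// One of these lives alongside each submitted command buffer.
struct CommandBufferRefs {
    std::vector<Resource*> referenced;   // kept sorted so duplicates are skipped

    void addReference(Resource* r) {
        auto it = std::lower_bound(referenced.begin(), referenced.end(), r);
        if (it != referenced.end() && *it == r)
            return;                      // already referenced by this command buffer
        ++r->refCount;
        referenced.insert(it, r);
    }

    // Called once the GPU is known to have finished this command buffer.
    void releaseAll() {
        for (Resource* r : referenced) {
            if (--r->refCount == 0) {
                // Last reference gone: the resource can actually be freed here.
            }
        }
        referenced.clear();
    }
};
```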

 

Another option might be to group related resources together into a "box"... We could then do reference counting on the entire box, so that all contained resources get destroyed together. It would require some extra structure in the engine, but might reduce the low level overhead. That would be handy for streaming in and out character models -- where multiple resources will typically be evicted at the same time.
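
A very rough sketch of the idea, assuming std::shared_ptr is an acceptable way to express the group-level reference count (all names hypothetical):

```cpp
#include <memory>
#include <vector>
#include <vulkan/vulkan.h>

// A "box" of related resources that live and die together.
struct ResourceBox {
    VkDevice device = VK_NULL_HANDLE;
    std::vector<VkBuffer> buffers;
    std::vector<VkImage>  images;
    VkDeviceMemory memory = VK_NULL_HANDLE;

    ~ResourceBox() {
        for (VkBuffer b : buffers) vkDestroyBuffer(device, b, nullptr);
        for (VkImage  i : images)  vkDestroyImage(device, i, nullptr);
        vkFreeMemory(device, memory, nullptr);
    }
};

// Each in-flight command buffer holds one reference per box it touches,
// instead of one per resource; dropping these once the buffer's fence
// signals destroys any boxes whose last reference just went away.
struct InFlightCommandBuffer {
    std::vector<std::shared_ptr<ResourceBox>> boxesInUse;
};
```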

 

Another possibility would be to always delay any resource deallocation until all active command buffers have completed (regardless of whether the resource is actually referenced). In other words, we could assume that all resources are referenced by all command buffers... That would introduce the absolute minimum overhead in the normal case, but it would mean that deallocation never completes rapidly. It would also cause problems if even a single command buffer isn't promptly ended and submitted.

 

What approaches are people here taking for this issue?

 


First thing is to boil it down a bit. Reference counting is out of scope here - the issue at hand is what happens once you do decide to delete something.

 

Consider a simplistic rendering loop that renders, presents, and then waits for the device to be idle each frame.  After the device is idle, but before anything is rendered, you can be sure that destroying a resource will be safe, as long as you don't try to use it again.  So you can either defer all logic that would ever delete a resource until that time, or allow a deletion to be requested at any time but put it in a buffer to be "played back" in the safe window.
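
A minimal sketch of that, assuming a single device and a global list of deferred deletions (names are just for illustration):

```cpp
#include <functional>
#include <vector>
#include <vulkan/vulkan.h>

// Deletion requests made at any point during the frame end up here.
std::vector<std::function<void()>> gPendingDeletes;

void destroyLater(std::function<void()> destroyFn) {
    gPendingDeletes.push_back(std::move(destroyFn));
}

void frame(VkDevice device) {
    // ... record command buffers, submit, present ...

    vkDeviceWaitIdle(device);   // nothing is in flight once this returns

    // Safe window: play back every queued deletion before rendering anything else.
    for (auto& destroy : gPendingDeletes)
        destroy();
    gPendingDeletes.clear();
}
```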

 

Fairly trivially, this can be extended to a pipeline that is multiple frames deep. Let's go with 3, for the sake of example. This means you might have rendering commands in flight up to 2 frames "behind", and you have to take that into account whenever you want to delete a resource. The easiest thing to do in this case, IMO, is to have a growable array per frame (so 3 of them), each tracking the deletion requests issued while the CPU is processing that frame. Each frame also has a fence associated with it. Whenever a new frame is started, it waits on its fence, and then, before rendering anything, it deletes the appropriate resources and clears the vector so new requests can be added. When submitting the commands for that frame, you specify that the fence should be signaled when the submission is complete. The fence ensures that the CPU never gets too far ahead of the GPU and causes the renderer to trip over itself, so you will never try to delete a resource that's still being used by the GPU.
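
As a rough illustration of that scheme (helper names are hypothetical, error handling omitted, and the fences are assumed to be created with VK_FENCE_CREATE_SIGNALED_BIT so the very first wait returns immediately):

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <vector>
#include <vulkan/vulkan.h>

constexpr uint32_t kFrameCount = 3;

struct FrameData {
    VkFence fence = VK_NULL_HANDLE;                  // signaled when this frame's submit completes
    std::vector<std::function<void()>> deletions;    // requests issued while the CPU built this frame
};

std::array<FrameData, kFrameCount> gFrames;
uint32_t gFrameIndex = 0;

void beginFrame(VkDevice device) {
    FrameData& frame = gFrames[gFrameIndex];

    // Wait until the GPU has finished the submission made kFrameCount frames ago.
    vkWaitForFences(device, 1, &frame.fence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frame.fence);

    // Everything queued for deletion back then is now safe to destroy.
    for (auto& destroy : frame.deletions)
        destroy();
    frame.deletions.clear();
}

void endFrame(VkQueue queue, const VkSubmitInfo& submitInfo) {
    // The fence is signaled when this frame's work completes on the GPU.
    vkQueueSubmit(queue, 1, &submitInfo, gFrames[gFrameIndex].fence);
    gFrameIndex = (gFrameIndex + 1) % kFrameCount;
}
```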

 

This concept of shifting by the number of frames in your pipeline applies to destructive updates, such as rendering to a framebuffer, as well.  For N frames, you will need N copies of each attachment, and cycle through to determine which copy you are writing to.
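
For example, something along these lines, assuming three frames in flight (names hypothetical):

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

constexpr uint32_t kFramesInFlight = 3;

// One copy of the attachment per frame in flight.
struct PerFrameAttachment {
    VkImage     image[kFramesInFlight];
    VkImageView view[kFramesInFlight];
};

// Each frame renders only into its own copy, so a copy still being read by an
// earlier in-flight frame is never overwritten.
VkImageView currentView(const PerFrameAttachment& a, uint32_t frameIndex) {
    return a.view[frameIndex % kFramesInFlight];
}
```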

Edited by Boreal

Great, thanks for the answer. If I understand correctly, what you're suggesting is basically to assume that all resources are used by all command buffers.

I think that this leads to a pretty straightforward implementation... Expanding upon your description, I would probably just keep a list of objects pending destruction. When a vkEndCommandBuffer occurs, I would record the current position in this list. Then, once we know for sure that the GPU has finished with a frame's command buffer, we can just walk through the list up to that position, destroying as we go...
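
Roughly what I have in mind, with hypothetical names and the actual Vulkan destroy calls hidden behind std::function:

```cpp
#include <cstddef>
#include <deque>
#include <functional>

struct DestructionList {
    std::deque<std::function<void()>> pending;   // destruction requests, in submission order
    size_t executed = 0;                         // how many have been carried out so far

    // Called whenever a destroy is requested; it merely queues the work.
    void request(std::function<void()> destroyFn) {
        pending.push_back(std::move(destroyFn));
    }

    // Called at vkEndCommandBuffer time: remember how far the list had grown.
    size_t mark() const { return executed + pending.size(); }

    // Called once the GPU has finished the command buffer that produced `marker`.
    void flushUpTo(size_t marker) {
        while (executed < marker) {
            pending.front()();
            pending.pop_front();
            ++executed;
        }
    }
};
```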

Of course, there are some complications related to secondary command buffers (particularly those built in background threads), but they could be handled within the same basic design.

So, here we're making 2 important assumptions:
* command buffers we're creating are going to be used once (within the current frame) and then discarded
* resources being destroyed were last used in the current frame (or, at least, some recent frame)

It feels to me that these assumptions are probably reasonable most of the time, so it makes sense to use this simple design as the default...

But there are cases where we may want to break these rules. So, I'm thinking about building a "destruction queue" concept. This would sit alongside the VkQueue, and would be customisable for special cases. Each resource would end up tied to a single destruction queue. The majority of resources could follow the simple mechanism described here, but other cases could also be handled (such as recording a command buffer to be replayed many times).

I'm also thinking about streaming world cases. In this situation, we may have a finite amount of device memory that is used as a window on a very large set of streamed resources.

In this case, we will frequently want to evict one resource and then reallocate (or overwrite). Usually the evicted resource will be the least recently used one... probably something offscreen, and probably something that isn't being used by the GPU.

Often, it will be safe to overwrite immediately. So we will want to take advantage of that.

But maybe there's still a simple solution -- just record the last-used frame as part of the LRU cache. Perhaps all we really need is some special-case handling for the VkDeviceMemory objects in these cases. That probably wouldn't be too hard, and it would avoid complicating the more general case with extra machinery.
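
Something like this, assuming a monotonically increasing frame counter and a fixed maximum CPU/GPU latency that is enforced elsewhere by a fence (names are placeholders):

```cpp
#include <cstdint>

constexpr uint64_t kMaxFramesInFlight = 2;   // enforced elsewhere by the per-frame fence

struct StreamedBlock {
    uint64_t lastUsedFrame = 0;   // updated whenever this block is referenced by a command buffer
    // ... offset/size within the shared VkDeviceMemory window, LRU links, etc ...
};

// If the GPU has provably finished every frame that could have read this block,
// it can be evicted and overwritten immediately, with no extra fence.
bool canOverwriteNow(const StreamedBlock& block, uint64_t currentCpuFrame) {
    return block.lastUsedFrame + kMaxFramesInFlight <= currentCpuFrame;
}
```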


Yeah I keep this as simple as possible. Many moons ago, we inherited an engine that used fences (write to label) every time you issued a draw call that used a newly bound resource, so that you could query to find out when it was safe to deallocate/reallocate/defrag as soon as possible. This kind of fine-grained management is not really ever required IMHO.

 

Instead we simply enforced a maximum latency between CPU and GPU -- e.g. the CPU is only ever 2 frames ahead of the GPU -- at which point you can simply create a "deferred deallocation queue". When deallocating a resource, you free up the API handle right away so that the user can pretend the resource has been freed, but really you grab the current frame ID, add 2 to it, and put the allocation pointer and this future frame ID into a queue. At the start of each frame (after the current frame ID has been incremented), you pop "deallocation jobs" from this queue until it's empty or you find a job with an ID that's still in the future.
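
In sketch form, that queue might look like this (names hypothetical; the fence that actually enforces the two-frame latency is assumed to live elsewhere):

```cpp
#include <cstdint>
#include <functional>
#include <queue>

constexpr uint64_t kGpuLatency = 2;   // max frames the CPU may run ahead of the GPU

struct DeferredFree {
    uint64_t safeFrame;               // frame ID at which freeing becomes safe
    std::function<void()> freeFn;     // actually releases the allocation
};

std::queue<DeferredFree> gFreeQueue;
uint64_t gCurrentFrame = 0;           // incremented once per frame elsewhere

// Called when the user "deletes" a resource: the handle disappears immediately,
// but the underlying memory is only queued for release.
void deferFree(std::function<void()> freeFn) {
    gFreeQueue.push({gCurrentFrame + kGpuLatency, std::move(freeFn)});
}

// Called at the start of each frame, after gCurrentFrame has been incremented.
void flushDeferredFrees() {
    while (!gFreeQueue.empty() && gFreeQueue.front().safeFrame <= gCurrentFrame) {
        gFreeQueue.front().freeFn();
        gFreeQueue.pop();
    }
}
```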

This way, there's no overhead at all when binding resources, and at worst, they're freed up a frame or two later than optimally possible.

When doing memory intensive operations like switching between levels, the loading system could query the GPU device to check if this deallocation queue was empty, so that it would wait those extra two frames before starting to try and allocate all the GPU memory again -- this was important on PS3 where you've only got 256MiB, but probably not even required any more.

 

We do this on PC too, from D3D9 onwards! Even in D3D9, you can create large vertex buffers and sub-allocate smaller objects within them, updated using the D3DLOCK_NOOVERWRITE flag to promise to handle the CPU/GPU sync issues yourself, like you do in Vulkan/D3D12/consoles... D3D9 ring buffer management worked the same way -- simply track used regions of the ring buffer at whole-frame granularity, and combine that data with the knowledge of your max CPU/GPU latency to safely overwrite old data without using any fences (except the single fence that ensures your max CPU/GPU latency is enforced).
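
A simplified sketch of that whole-frame ring-buffer tracking, with wrap-around and the actual sub-allocation logic omitted (names hypothetical):

```cpp
#include <cstdint>
#include <deque>
#include <utility>

constexpr uint64_t kGpuLatency = 2;   // max frames the CPU may run ahead of the GPU

struct RingBufferTracker {
    uint64_t head = 0;   // next write offset (allocation and wrap handling not shown)
    uint64_t tail = 0;   // start of the region the GPU may still be reading
    std::deque<std::pair<uint64_t, uint64_t>> frameEnds;   // (frameId, head at end of that frame)

    // Record how far the ring had advanced when a frame's commands were submitted.
    void onFrameSubmitted(uint64_t frameId) {
        frameEnds.emplace_back(frameId, head);
    }

    // At the start of each CPU frame, retire whole frames the GPU must have finished.
    void onFrameBegin(uint64_t currentFrame) {
        while (!frameEnds.empty() && frameEnds.front().first + kGpuLatency <= currentFrame) {
            tail = frameEnds.front().second;   // everything up to here can now be overwritten
            frameEnds.pop_front();
        }
    }
};
```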
