Vertex buffer efficiency

4 comments, last by kastro 9 years, 6 months ago

In an engine I'm currently working on for self-education purposes, all resources are asynchronously streamed in and out. Up to now, I've simply kept a one-to-one relationship between index/vertex buffers and each model. I recently read the GDC 2012 presentation by John McDonald of NVIDIA (https://developer.nvidia.com/sites/default/files/akamai/gamedev/files/gdc12/Efficient_Buffer_Management_McDonald.pdf) and have been looking at how to incorporate its recommendations. Specifically, I'm working on implementing his "long-lived buffers" that are reused to hold streaming (static) geometry data. I've been unable to find much information on how best to implement it, however.

The presentation goes into detail on using "transient" buffers for UI/text, allocated similarly to a block of memory treated as a heap. Is it advisable to do something similar for the longer-lived buffers? I.e., when streaming in a resource package containing geometry, I'd step through each allocation, first preferring space in the set of existing buffers found via the heap, and if not enough space is found, create a new buffer to include in the heap (similar to how, in managing system memory, Doug Lea's malloc can be backed by virtual page allocations from the OS via calls to mmap/munmap). Or is the CPU overhead of that likely to cause hitches? The other alternative I've considered is something more static, where individual buffers are assigned all-or-none and the geometry in a resource package is pre-packed in the pipeline to fit buffers of that size. Upon loading and unloading a streaming package, entire buffers would be marked as used/unused, creating new ones as needed.
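For reference, the heap-style approach I'm describing might look something like this. This is only a minimal sketch with made-up names (`BufferHeap` etc.), tracking offsets within one large GPU buffer via a first-fit free list, and ignoring the actual D3D11 buffer objects:

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <optional>

// Hypothetical suballocator over one large vertex buffer: tracks free
// ranges (offset, size) and hands out offsets first-fit. When nothing
// fits, the caller would create a new GPU buffer with a new BufferHeap.
class BufferHeap {
public:
    explicit BufferHeap(std::size_t capacity) { free_.push_back({0, capacity}); }

    std::optional<std::size_t> allocate(std::size_t size) {
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->size >= size) {
                std::size_t offset = it->offset;
                it->offset += size;
                it->size   -= size;
                if (it->size == 0) free_.erase(it);
                return offset;
            }
        }
        return std::nullopt;  // caller falls back to creating a new buffer
    }

    void release(std::size_t offset, std::size_t size) {
        // Insert sorted by offset, then merge with adjacent free ranges.
        auto it = free_.begin();
        while (it != free_.end() && it->offset < offset) ++it;
        it = free_.insert(it, {offset, size});
        coalesce(it);
    }

private:
    struct Range { std::size_t offset, size; };

    void coalesce(std::list<Range>::iterator it) {
        if (it != free_.begin()) {
            auto prev = std::prev(it);
            if (prev->offset + prev->size == it->offset) {
                prev->size += it->size;
                free_.erase(it);
                it = prev;
            }
        }
        auto next = std::next(it);
        if (next != free_.end() && it->offset + it->size == next->offset) {
            it->size += next->size;
            free_.erase(next);
        }
    }

    std::list<Range> free_;
};
```

The question is whether running this on every streamed-in package is cheap enough to avoid hitches.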

I'm doing this as a learning experience, so I'm working on trying out each of the above anyway, but any tips from someone more knowledgeable would be greatly appreciated.


That presentation is basically l33t speech for "how to fool the driver and hit no stalls until DX12 arrives".

What they do in "Transient buffers" is an effective hack that allows you to get immediate unsynchronized write access to a buffer and use D3D11 queries as a replacement for real fences.

Specifically, I'm working on implementing his "long-lived buffers" that are reused to hold streaming (static) geometry data. I've been unable to find much information on how best to implement it, however.

Create a default buffer. Whenever you need to update it, upload the data to a staging buffer (keep a preallocated pool of staging buffers, to avoid the stall of creating one on demand), then copy the subresource from staging to default. You're done.
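As a rough sketch of the pool bookkeeping (names are mine; real code would pair each slot with an `ID3D11Buffer` created with `D3D11_USAGE_STAGING`, and copy into the default buffer with `CopySubresourceRegion`), a staging slot is reusable once the GPU has finished the frame that last used it:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Models a preallocated pool of staging buffers (CPU-side bookkeeping
// only). Each slot records the frame in which it was last used; it can
// be handed out again once the GPU has completed that frame.
class StagingPool {
public:
    explicit StagingPool(std::size_t slots) : lastUsedFrame_(slots, 0) {}

    // currentFrame: the frame being recorded.
    // completedFrame: last frame the GPU is known to have finished
    // (obtained from a per-frame query/fence).
    std::optional<std::size_t> acquire(std::uint64_t currentFrame,
                                       std::uint64_t completedFrame) {
        for (std::size_t i = 0; i < lastUsedFrame_.size(); ++i) {
            if (lastUsedFrame_[i] <= completedFrame) {
                lastUsedFrame_[i] = currentFrame;  // slot is in flight again
                return i;
            }
        }
        return std::nullopt;  // all slots in flight: grow the pool or wait
    }

private:
    std::vector<std::uint64_t> lastUsedFrame_;
};
```

If `acquire` ever returns nothing, that is the stall you were going to hit anyway; growing the pool at that point amortizes it away.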

You won't find much because there's not much more to it. Long-lived buffers assume you will rarely modify them, and as such they shouldn't be a performance bottleneck nor a concern.

Usually you also have a lot of knowledge about the size you will need for the buffer. Even if you need to calculate it, you do so infrequently enough that you can afford to compute it on the fly, or at least cache the result.

The problem is when it comes to buffers that you need to update very often (i.e. every frame).

Thanks for the explanation. While I wasn't focusing on that part, the transient buffers make more sense to me now from a synchronization standpoint. When you talk about creating a default buffer, do you mean I should try to have as much as possible of my non-dynamic streaming data stored within a single buffer, and the pool refers to the staging buffers?

I would think with level streaming, it would be risky to implement a single fixed capacity on the live buffer(s), so pool-type management for the live buffers would be useful too. For example, when streaming in a new package, the loader knows exactly how much capacity it will require, and can grab however many buffers it needs from the pool of unused buffers. Likewise, as a level is streamed out, its buffers are no longer needed and are added back into the pool of unused buffers. Maybe I'm overcomplicating things. I definitely see the benefit of staging buffers for updating dynamic data, but I guess it's not as clear for the case of loading in a large amount of streaming data (not behind a loading screen).

Thanks for the explanation. While I wasn't focusing on that part, the transient buffers make more sense to me now from a synchronization standpoint. When you talk about creating a default buffer, do you mean I should try to have as much as possible of my non-dynamic streaming data stored within a single buffer, and the pool refers to the staging buffers?

Yes. Immutable if possible.

I would think with level streaming, it would be risky to implement a single fixed capacity on the live buffer(s), so pool-type management for the live buffers would be useful too. For example, when streaming in a new package, the loader knows exactly how much capacity it will require, and can grab however many buffers it needs from the pool of unused buffers. Likewise, as a level is streamed out, its buffers are no longer needed and are added back into the pool of unused buffers. Maybe I'm overcomplicating things. I definitely see the benefit of staging buffers for updating dynamic data, but I guess it's not as clear for the case of loading in a large amount of streaming data (not behind a loading screen).

The problem is that you're trying to build a car that runs over roads, can submerge into the ocean, fly across the sky, is also capable of travelling into outer space; and even intergalactic travel (and God only knows what you'll find!).

You will want to keep everything together in one single buffer (or a couple of them) to reduce buffer switching at runtime while rendering.
From a streaming perspective, it depends on how you organize your data. E.g. some games divide a level into "sections" and force the gameplay through corridors; while you run through these corridors, the data starts streaming to the GPU (games like Tomb Raider and Castlevania: Lords of Shadow fit this use case). In this scenario, each "section" could be granted its own buffer. You already know the size required for each buffer. And if you page a buffer out, you know whether it can be permanent (i.e. can the player go back?) or whether to use some heuristic (e.g. after a certain distance from that section, schedule the buffer for deletion, but don't do it if you don't need to, i.e. you still have a lot of spare GPU RAM). You may even get away with immutable buffers in this case.
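To sketch that eviction heuristic (illustrative names and thresholds only, not a definitive implementation), a section's buffer is only scheduled for deletion when the player is far from it AND GPU memory is actually under pressure:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical paging heuristic for per-section buffers.
struct Section {
    float centerX, centerZ;     // section midpoint in world units
    std::size_t bufferBytes;    // size of its vertex/index buffers
    bool resident = true;
};

// Returns indices of sections whose buffers should be scheduled for
// deletion: distant sections, evicted only until we're back under budget.
std::vector<std::size_t> sectionsToEvict(const std::vector<Section>& sections,
                                         float playerX, float playerZ,
                                         float evictDistance,
                                         std::size_t usedBytes,
                                         std::size_t budgetBytes) {
    std::vector<std::size_t> evict;
    if (usedBytes <= budgetBytes) return evict;  // plenty of spare GPU RAM
    for (std::size_t i = 0; i < sections.size(); ++i) {
        const Section& s = sections[i];
        if (!s.resident) continue;
        float dx = s.centerX - playerX, dz = s.centerZ - playerZ;
        if (std::sqrt(dx * dx + dz * dz) > evictDistance) {
            evict.push_back(i);
            usedBytes -= s.bufferBytes;
            if (usedBytes <= budgetBytes) break;  // evicted enough
        }
    }
    return evict;
}
```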

Second, you can keep an adjustable pool of immutable/default buffers based on size and regions. Remember you're not going into the unknown depths of the ocean or the unknowns of a distant galaxy. You know the level that is going to be streamed. You know its size in megabytes, in kilometers, its number of vertices, how it's going to be used, how many materials it needs, etc. You know how the sections connect to each other (e.g. if F can only be reached from A, put it in its own buffer; the player is likely not to return to F very often once it has been visited).

You have a lot of data at your disposal.

Open world games are trickier, but it's the same concept (divide the region into chunks that have some logic behind them, e.g. spatial subdivision, and start from there). Open world games usually have a very low-poly model of the whole scene to use until the higher-quality data has been streamed in.

My advice: algorithms are supposed to solve problems, and an engine solves problems. How to design your engine will become much clearer if you approach a concrete problem instead of trying to solve one you know nothing about. Try to make a simple game. Even a walking cube moving across cardboard city (open world) or pipe-land (corridor-based loading) should be enough.

Stop thinking on how to write the solution and start thinking on how to solve the problem. After that, how to write the solution will appear obvious.


That presentation is basically l33t speech for "how to fool the driver and hit no stalls until DX12 arrives".

What they do in "Transient buffers" is an effective hack that allows you to get immediate unsynchronized write access to a buffer and use D3D11 queries as a replacement for real fences.
That's a pretty dismissive way to sum it up :)

I don't see why transient buffers should be implemented as a heap like in that presentation's "CTransientBuffer" -- it's much simpler to implement it as a ring buffer (what they call a "Discard-Free Temp Buffer").

Write-no-overwrite based ring buffers have been standard practice since D3D9 for storing transient / per-frame geometry. You map with the no-overwrite flag, the driver gives you a pointer to the actual GPU memory (uncached, write-combined pages) and lets the CPU stream geometry directly into the buffer with the contract that you won't touch any data that the GPU is yet to consume.

Even on the console engines I've worked on (where real fences are available), we typically don't fence per resource, as that creates a lot of resource tracking work per frame (which is the PC-style overhead we're trying to avoid). Instead we just fence once per frame so we know which frame the GPU is currently consuming.

Your ring buffers then just have to keep track of the GPU-read cursor for each frame, and make sure not to write any data past the read-cursor for the frame that you know the GPU is up to. We do it the same way on PC (D3D9/11/GL/Mantle/etc).
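A minimal model of that bookkeeping (no API calls; real code would `Map` with `D3D11_MAP_WRITE_NO_OVERWRITE` and use a per-frame query as the fence, and the class/member names here are my own):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>

// Models the no-overwrite ring-buffer scheme: one CPU write cursor (head)
// and a GPU read cursor (tail) advanced once per completed frame. The
// pending region [tail, head) must never be overwritten.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Returns the byte offset to write at, or nullopt if the GPU hasn't
    // freed enough space yet (the caller would then wait on the query).
    std::optional<std::size_t> allocate(std::size_t size) {
        if (size > capacity_) return std::nullopt;
        if (head_ >= tail_) {
            // Free space: [head_, capacity) then [0, tail_).
            if (capacity_ - head_ >= size) { std::size_t o = head_; head_ += size; return o; }
            if (tail_ > size) { head_ = size; return 0; }  // wrap to the start
            return std::nullopt;
        }
        // head_ < tail_: free space is [head_, tail_), kept strictly apart.
        if (tail_ - head_ > size) { std::size_t o = head_; head_ += size; return o; }
        return std::nullopt;
    }

    // Record where the write cursor stood at the end of each CPU frame.
    void endFrame(std::uint64_t frameId) { frames_.push_back({frameId, head_}); }

    // Called when the per-frame fence/query says frameId is done on the GPU.
    void onFrameCompleted(std::uint64_t frameId) {
        while (!frames_.empty() && frames_.front().first <= frameId) {
            tail_ = frames_.front().second;
            frames_.pop_front();
        }
    }

private:
    std::size_t capacity_, head_ = 0, tail_ = 0;
    std::deque<std::pair<std::uint64_t, std::size_t>> frames_;
};
```

The point is that the only synchronization cost is one fence/query per frame, not per resource.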

The other map modes are the performance-dangerous ones. Map-discard is sometimes OK, but it can incur internal driver resource-tracking overhead (especially when using deferred contexts); the read/write/read-write map modes should only ever be used on staging resources that you buffer manually yourself to avoid stalls.

Create "Forever" buffers as needed, at the right size. You pretty much have to do this anyway, because they're immutable so you can't reuse parts of them / use them as a heap.

The recommendations for "long-lived" buffers basically just reduce the driver's workload in terms of memory management (implementing malloc/free for VRAM is much more complex than a traditional malloc/free, because you shouldn't prepend allocation headers to allocations like you do in most allocation schemes). In my engine I currently ignore this advice and treat them the same as forever buffers. The exception is when you need to be able to modify a mesh -- e.g. a character whose face is built with a hundred morph targets but then doesn't change again often -- in that situation, you need DEFAULT usage / UpdateSubresource.

Thank you, both of you. Your notes are very helpful.


Stop thinking on how to write the solution and start thinking on how to solve the problem. After that, how to write the solution will appear obvious.

That's very good advice. While I have a very definite problem (I'm creating far too many vertex buffers for my streamed-in levels -- one per mesh), I think I've been looking too generically for a solution. Part of my problem was that I didn't really know my options, since my current approach is so poor and so far from a solution.


In my engine I currently ignore this advice and treat them the same as forever buffers.

I think I'm going to go with your approach of using immutable buffers, particularly for the streamed levels that are larger and/or expected to live longer. For smaller, detail-providing levels, I may try out Matias's suggestion of using one or a few default buffers, mapping the data in via staging buffers. At least this seems like a reasonable starting point based on both of your suggestions. I'm trying to stick to what Niklas Frykholm of Bitsquid said here: http://www.gamasutra.com/view/news/172006/Indepth_Read_my_lips__No_more_loading_screens.php

Thanks again.

This topic is closed to new replies.
