Low-level platform-agnostic video subsystem design: Resource management

4 comments, last by Klutzershy 9 years, 1 month ago

Something I've been stuck on for a while is the part of the low-level video subsystem that relates to creating/deleting resources and handling memory transfers between the CPU and GPU. There are lots of resources for the submission, batching and drawing itself but never anything about how those buffers, textures, etc. got there in the first place or how the API for resource management is designed. Note that I'm not talking about loading and caching resource files from disk. This is concerned with the low-level aspects of memory and resource management across processors.

One idea I had was a layered API loosely based off of D3D11, where you separate the memory itself (buffer, texture) from how it's going to be used (view). This way, for example, you can allocate a texture and use it as both a render target and a sampler target. Of course, this also brings up the issue of making sure access is "exclusive", i.e. you can't sample from it and render to it at the same time. Another issue I'm interested in is making the API thread-safe. While this is relatively trivial to do with D3D11 thanks to the Device/Context separation, with OpenGL it's more complicated since you have to deal with context sharing and concurrency.
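
Roughly the shape of API I'm picturing - everything below is just a sketch with placeholder names:

#include <cstdint>

// Placeholder handle/description types, purely to illustrate the resource/view split.
struct TextureHandle        { uint32_t id; };
struct RenderTargetHandle   { uint32_t id; };
struct ShaderResourceHandle { uint32_t id; };
enum class Format { RGBA8, D24S8 /* ... */ };
struct TextureDesc { int width; int height; int mipLevels; Format format; };

class Device
{
public:
    // Allocates the memory itself; says nothing about how it will be bound.
    TextureHandle        CreateTexture(const TextureDesc& desc);

    // Views describe one particular way of using that memory.
    RenderTargetHandle   CreateRenderTargetView(TextureHandle tex, int mipLevel);
    ShaderResourceHandle CreateShaderResourceView(TextureHandle tex);

    void DestroyTexture(TextureHandle tex); // all views of it become invalid
};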

Basically I'm just looking for some advice, ideas, open-source code to look at, or anything else that can help me see how this stuff has been taken care of before.

"So there you have it, ladies and gentlemen: the only API I’ve ever used that requires both elevated privileges and a dedicated user thread just to copy a block of structures from the kernel to the user." - Casey Muratori

boreal.aggydaggy.com


Hi.

Sound is now done on the CPU with threading, using a lock-free type of queue. You should look into how sound is done; it may solve the video playback side as well.

Key terms: lock-free queue, or producer/consumer models. You sacrifice more memory but could gain speed.

One idea I had was a layered API loosely based off of D3D11,

I've largely cloned D3D11, with the device/context split.
My main difference is that I use integer IDs to refer to all resources, don't do any reference counting at this level, and have cbuffers (aka UBOs) as their own distinct kind of resource, rather than supporting them as a generic buffer resource, as there are a lot of special-case ways to deal with cbuffers on different APIs.
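
To give the rough shape of that (just a sketch, not the real code):

#include <cstdint>

// Plain integer IDs -- no pointers or reference counting exposed at this level.
typedef uint32_t TextureId;
typedef uint32_t BufferId;
typedef uint32_t CBufferId;   // cbuffers/UBOs get their own distinct resource type

struct GpuDevice
{
    // 'header' describes dimensions/format/mips; 'pixels' is the texel data.
    TextureId CreateTexture(const void* header, const void* pixels);
    BufferId  CreateVertexBuffer(uint32_t sizeBytes, const void* initialData);
    CBufferId CreateCBuffer(uint32_t sizeBytes);   // special-cased per API internally

    void DestroyTexture(TextureId);
    void DestroyBuffer(BufferId);
    void DestroyCBuffer(CBufferId);
};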

For fast streaming textures from disc, I use a file format tweaked for each platform with two segments - a header and the pixel data.
On platforms with "GPU malloc" (where memory allocation and resource creation are not linked), I have the streaming system do a regular (CPU) malloc for the header, and a GPU-malloc (write-combine, uncached, non-coherent) for the contents, and then pass the two pointers into a "device.CreateTexture" call, which has very little work to do, seeing that all the data has already been streamed into the right place.

I treat other platforms the same way, except the "GPU malloc" is just a regular malloc, which temporarily holds the pixel data until D3D/GL copies it into an immutable resource, at which point it's freed.
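
The streaming path on the "GPU malloc" platforms ends up looking something like this - GpuMalloc and StreamingFile are stand-ins for the platform allocator and the async file layer:

#include <cstddef>
#include <cstdlib>   // malloc

void* GpuMalloc(size_t size);   // stand-in for the platform's GPU allocator
struct StreamingFile            // stand-in for the async streaming layer
{
    size_t HeaderSize();
    size_t PixelDataSize();
    void   Read(size_t fileOffset, void* dst, size_t size);
};

// GpuDevice / TextureId as in the sketch above.
void StreamTexture(GpuDevice& device, StreamingFile& file)
{
    void* header = malloc(file.HeaderSize());        // regular CPU allocation
    void* pixels = GpuMalloc(file.PixelDataSize());  // write-combined, uncached, non-coherent

    file.Read(0,                 header, file.HeaderSize());
    file.Read(file.HeaderSize(), pixels, file.PixelDataSize());

    // Very cheap call -- the pixel data has already been streamed into place.
    TextureId id = device.CreateTexture(header, pixels);
    (void)id;
}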

...separate the memory itself (buffer, texture) from how it's going to be used (view). This way, for example, you can allocate a texture and use it as both a render target and a sampler target. Of course, this also brings up the issue of making sure access is "exclusive", i.e. you can't sample from it and render to it at the same time.

I treat texture-resources, shader-resource-views, render-target-views, and depth-stencil-views as the same thing: a "texture". At creation time I'll make the resource and all the applicable views, and bundle them into the one structure.
Later if another view is needed - e.g. to render to a specific mip-level, or to alias the same memory allocation as a different format - then I support these alternate views by passing the original TextureId into a CreateTextureView function, which returns a new TextureId (which contains the same resource pointer internally, but new view pointers).
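
In D3D11 terms the bundle is roughly this (illustrative, not the exact layout):

#include <cstdint>
#include <d3d11.h>

typedef uint32_t TextureId;
struct TextureViewDesc;   // mip level, array slice, format override, etc.

// One "texture" as the rest of the engine sees it: the resource plus every
// applicable view, created together up front.
struct Texture
{
    ID3D11Resource*           resource = nullptr;
    ID3D11ShaderResourceView* srv      = nullptr;  // null if not shader-readable
    ID3D11RenderTargetView*   rtv      = nullptr;  // null if not a colour target
    ID3D11DepthStencilView*   dsv      = nullptr;  // null if not a depth target
};

// An alternate view shares 'resource' but gets its own view pointers and its own ID.
TextureId CreateTextureView(TextureId original, const TextureViewDesc& desc);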

To ensure there are no data hazards (same texture as render target and shader resource), when binding a texture to a shader slot, I loop through the currently bound render targets and assert it's not bound as one. This (costly) checking is only done in development builds - in shipping builds, it's assumed the code is correct, and this validation code is disabled. Lots of other usage errors are treated the same way - e.g. checking if a draw-call will read past the end of a vertex buffer...
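
As a sketch, the hazard check itself is nothing fancier than this (names and the shipping-build define are illustrative):

#include <cassert>
#include <cstdint>

typedef uint32_t TextureId;

struct Context
{
    TextureId boundRenderTargets[8]    = {};
    int       numBoundRenderTargets    = 0;
    TextureId boundShaderResources[16] = {};

    void BindTextureToSlot(TextureId tex, int slot)
    {
#ifndef SHIPPING_BUILD  // validation is compiled out of shipping builds
        for (int i = 0; i < numBoundRenderTargets; ++i)
            assert(boundRenderTargets[i] != tex && "texture is currently bound as a render target");
#endif
        boundShaderResources[slot] = tex;
        // ... issue the actual API binding here ...
    }
};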

Another issue I'm interested in is making the API thread-safe. While this is relatively trivial to do with D3D11 thanks to the Device/Context separation, with OpenGL it's more complicated since you have to deal with context sharing and concurrency.

As above, I copy D3D11. I have a boolean property in a GpuCapabilities struct that tells high level code if multithreaded resource creation is going to be fast or will incur a performance penalty.

As you mention, modern drivers are pretty ok with using multiple GL contexts and doing shared resource creation. On some modern GPUs, this is even recommended, as it triggers the driver's magic fast-path of transferring the data to the GPU "for free" via the GPU's underutilised and API-less asynchronous DMA controller!
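
A rough sketch of that path - CreateSharedGLContext/MakeCurrent are stand-ins for the platform-specific wgl/glX/egl calls, with the main context passed as the share context:

// plus your GL loader header (glad/glew/etc.) for the gl* calls below
struct GLContext;                                        // opaque wrapper over the platform context
GLContext* CreateSharedGLContext(GLContext* shareWith);  // hypothetical platform wrapper
void       MakeCurrent(GLContext* ctx);                  // hypothetical platform wrapper
extern GLContext* g_mainContext;

// Runs on a loader thread, with its own GL context that shares objects with the main one.
void LoaderThreadUpload(const void* pixels, int width, int height)
{
    GLContext* ctx = CreateSharedGLContext(g_mainContext);
    MakeCurrent(ctx);

    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    // Fence + flush so the main thread can tell when the upload has actually completed.
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();

    (void)fence;
    // ... hand 'tex' and 'fence' back to the main thread, e.g. via a thread-safe queue ...
}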

As a fall-back for single-threaded APIs, you can check the thread-ID inside your resource creation functions, and if it's not the "main thread" ID, then generate a resource ID in a thread-safe manner and push the function parameters into a queue. Later, on the main thread, it can pop the function parameters from the queue and actually create the resource and link it up to the ID you returned earlier. This obviously won't give you any performance boost (perhaps the opposite if you have to do an extra malloc and memcpy in the queuing process...) but it does let you use the same multithreaded resource loading code even on old D3D9 builds.
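
In sketch form (the queue, the main-thread ID capture and the "immediate" creation path are placeholders for whatever you already have):

#include <atomic>
#include <cstdint>
#include <thread>

struct TextureDesc { int width, height; /* format, mips, ... */ };
struct GpuTexture;                                     // the real API-side object
typedef uint32_t TextureId;

GpuTexture* CreateTextureImmediate(const TextureDesc&, const void* pixels); // main thread only
void        LinkIdToResource(TextureId id, GpuTexture* resource);

struct PendingCreate { TextureId id; TextureDesc desc; const void* pixels; };
template<typename T> struct ThreadSafeQueue { void push(const T&); bool try_pop(T&); };
extern ThreadSafeQueue<PendingCreate> g_pendingCreates;  // any MPSC queue works
extern std::thread::id                g_mainThreadId;    // captured at startup

std::atomic<uint32_t> g_nextTextureId{ 1 };

TextureId CreateTexture(const TextureDesc& desc, const void* pixels)
{
    TextureId id = g_nextTextureId.fetch_add(1);          // thread-safe ID generation

    if (std::this_thread::get_id() == g_mainThreadId)
        LinkIdToResource(id, CreateTextureImmediate(desc, pixels)); // create right away
    else
        g_pendingCreates.push({ id, desc, pixels });      // defer; caller keeps 'pixels' alive

    return id;
}

// Called on the main thread, e.g. once per frame:
void FlushPendingCreates()
{
    PendingCreate job;
    while (g_pendingCreates.try_pop(job))
        LinkIdToResource(job.id, CreateTextureImmediate(job.desc, job.pixels));
}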

Hi.

Sound is now done on the CPU with threading, using a lock-free type of queue. You should look into how sound is done; it may solve the video playback side as well.

Key terms: lock-free queue, or producer/consumer models. You sacrifice more memory but could gain speed.

Thanks for the reply! Apologies, I didn't mean video playback, but rather low-level generic rendering.

On platforms with "GPU malloc" (where memory allocation and resource creation are not linked), I have the streaming system do a regular (CPU) malloc for the header, and a GPU-malloc (write-combine, uncached, non-coherent) for the contents, and then pass the two pointers into a "device.CreateTexture" call, which has very little work to do, seeing that all the data has already been streamed into the right place.

I treat other platforms the same way, except the "GPU malloc" is just a regular malloc, which temporarily holds the pixel data until D3D/GL copies it into an immutable resource, at which point it's freed.

Interesting, I'm unfamiliar with the concept. Am I correct to assume it's strictly on consoles with unified memory (Xbox One, PS4)?

To ensure there are no data hazards (same texture as render target and shader resource), when binding a texture to a shader slot, I loop through the currently bound render targets and assert it's not bound as one. This (costly) checking is only done in development builds - in shipping builds, it's assumed the code is correct, and this validation code is disabled. Lots of other usage errors are treated the same way - e.g. checking if a draw-call will read past the end of a vertex buffer...

This is a good point, I think I may be trying to over-complicate things in the interest of "free" safety by leveraging the compiler.

As above, I copy D3D11. I have a boolean property in a GpuCapabilities struct that tells high level code if multithreaded resource creation is going to be fast or will incur a performance penalty.

How do you check for this? Is it a simple divide like PC vs. consoles or can it also relate to graphics card/driver version?

As you mention, modern drivers are pretty ok with using multiple GL contexts and doing shared resource creation. On some modern GPUs, this is even recommended, as it triggers the driver's magic fast-path of transferring the data to the GPU "for free" via the GPU's underutilised and API-less asynchronous DMA controller!

Neat, and all you have to do is kick off a DMA on a concurrent context?

"So there you have it, ladies and gentlemen: the only API I’ve ever used that requires both elevated privileges and a dedicated user thread just to copy a block of structures from the kernel to the user." - Casey Muratori

boreal.aggydaggy.com

On platforms with "GPU malloc" (where memory allocation and resource creation are not linked), I have the streaming system do a regular (CPU) malloc for the header, and a GPU-malloc (write-combine, uncached, non-coherent) for the contents, and then pass the two pointers into a "device.CreateTexture" call, which has very little work to do, seeing that all the data has already been streamed into the right place.
I treat other platforms the same way, except the "GPU malloc" is just a regular malloc, which temporarily holds the pixel data until D3D/GL copies it into an immutable resource, at which point it's freed.

Interesting, I'm unfamiliar with the concept. Am I correct to assume it's strictly on consoles with unified memory (Xbox One, PS4)?
Xb360, XbOne, PS4 are known to have unified memory, but that's not important here -- unified memory just means it's all physically the same RAM, as opposed to a typical PC where some is physically on the motherboard and some is physically on the GPU's PCIe card.

What matters is whether you can map "GPU memory" (whether that's physically on the GPU, or unified memory) into the CPU's address space.

Pointers in C, C++ (or even in assembly!) aren't physical addresses - they're virtual addresses, which the hardware translates into the address of some physical resource, which doesn't even have to be RAM! A pointer might refer to RAM on the motherboard, RAM on a PCIe card, a file on a HDD, a register in an IO device, etc...
The act of making a pointer (aka virtual address) correspond to a physical resource is called mapping.

When you call malloc, you're allocating some virtual address space (a contiguous range of pointer values), allocating a range of physical RAM, and then mapping that RAM to those pointers.

In the same way, some OSes let you allocate RAM that's physically on your PCIe GPU, but map it to a range of pointers so the CPU can use it.
This just magically works, except it will of course be slower than usual when you use it, because there's a bigger physical distance between the CPU and GPU-RAM than there is between the CPU and motherboard-RAM (plus the busses on this physical path are slower).

So, if you can obtain a CPU-usable pointer (aka virtual address) into GPU-usable physical RAM, then your streaming system can stream resources directly into place, with zero graphics API involvement!

Yes, this is mostly reserved for game console devs... :(
But maybe Mantle/GLNext/D3D12 will bring it to PC land.

GL4 has already kinda added support for it though! You can't completely do your own resource management (no "GPU malloc"), but you can actually map GPU-RAM into the CPU's address space.
The GL_ARB_buffer_storage extension lets you create a buffer (e.g. a pixel-unpack buffer to stage texture data) of the appropriate size, but with no initial data, and then map it using the "PERSISTENT" flag.
This maps the buffer's GPU-side allocation into CPU address space so you can write/stream into it.
You should avoid the "COHERENT" flag, as this will reduce performance dramatically by forcing the GPU to snoop the CPU's caches when reading from the buffer :(
If not specifying COHERENT, the CPU-mapped virtual addresses will be marked as uncached, write-combined pages. This means the CPU will automatically bypass its own caches, as you'll only be writing to these addresses (never reading from them), and it will queue up your writes into a "write combining buffer" and do more efficient bulk-transfers through to the GPU (even if your code is only writing one byte at a time). The only catch is you have to call a GL fence function when you've finished writing/streaming, which will ensure the write-combine buffer is flushed out completely, and any GPU-side caches are invalidated if required.
Pretty awesome!
So one of the fastest loading algorithms on GL4 may be to create the resource first, which just does the resource allocation. Then use map-persistent to get a pointer to that allocation. Then stream data into that pointer, and unmap it and fence.
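
In rough code (GL 4.4 / ARB_buffer_storage; the texture handle, sizes and formats are placeholders) that load path looks something like:

#include <cstring>   // memcpy
// plus your GL loader header (glad/glew/etc.) for the gl* calls below

void StreamTexturePersistent(GLuint texture, const void* pixelData, GLsizeiptr pixelDataSize,
                             int width, int height)
{
    // Immutable, persistently-mappable pixel-unpack buffer; deliberately NOT coherent.
    GLuint pbo = 0;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferStorage(GL_PIXEL_UNPACK_BUFFER, pixelDataSize, nullptr,
                    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT);

    // Map once; the pointer stays valid even while GL is using the buffer.
    void* ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, pixelDataSize,
                                 GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_FLUSH_EXPLICIT_BIT);

    memcpy(ptr, pixelData, pixelDataSize);   // or stream from disk straight into 'ptr'

    // Make the non-coherent writes visible to the GPU, then copy into the texture.
    glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, pixelDataSize);
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, (const void*)0);  // offset into the bound PBO

    // Fence so you know when the GPU has finished reading from the buffer.
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    (void)fence; // ... wait on / poll this before unmapping and reusing or freeing the buffer ...
}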

AFAIK, the background context DMA transfer method is probably the fastest though.

As above, I copy D3D11. I have a boolean property in a GpuCapabilities struct that tells high level code if multithreaded resource creation is going to be fast or will incur a performance penalty.

How do you check for this? Is it a simple divide like PC vs. consoles or can it also relate to graphics card/driver version?
If I'm emulating the feature myself (e.g. on D3D9), I know to simply set that boolean to false :D
On D3D11, you can ask the device if multithreaded resource creation is performed natively by the driver (fast) or emulated by the MS D3D runtime (slower).
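
That query looks like this - the GpuCapabilities field name is just illustrative, the rest is the stock D3D11 API:

#include <d3d11.h>

struct GpuCapabilities { bool fastMultithreadedResourceCreation; /* ... */ };

void QueryThreadingCaps(ID3D11Device* device, GpuCapabilities& caps)
{
    D3D11_FEATURE_DATA_THREADING threading = {};
    device->CheckFeatureSupport(D3D11_FEATURE_THREADING, &threading, sizeof(threading));

    // TRUE: the driver handles concurrent resource creation natively (fast path).
    // FALSE: the D3D runtime serialises creation calls for you (slower).
    caps.fastMultithreadedResourceCreation = (threading.DriverConcurrentCreates == TRUE);
    // threading.DriverCommandLists is the equivalent flag for deferred-context command lists.
}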

On GL you have to use some vendor-specific knowledge to decide if you'll even attempt background-context resource creation (or whether you'll emulate multithreaded resource creation yourself), and make an educated guess as to whether the vendor is going to actually optimize that code path by using DMA transfers, or whether you'll actually have fallen onto a slow path... Fun times...

As you mention, modern drivers are pretty ok with using multiple GL contexts and doing shared resource creation. On some modern GPUs, this is even recommended, as it triggers the driver's magic fast-path of transferring the data to the GPU "for free" via the GPU's underutilised and API-less asynchronous DMA controller!

Neat, and all you have to do is kick off a DMA on a concurrent context?
As with everything in GL, there are multiple ways to do it, it might silently be emulated really slowly, it might be buggy, you'll have to test on every vendor's GPUs, and each vendor probably has conflicting guidelines on how best to implement it.
Besides that, it's pretty simple :lol:



GL4 has already kinda added support for it though! You can't completely do your own resource management (no "GPU malloc"), but you can actually map GPU-RAM into the CPU's address space.
The GL_ARB_buffer_storage extension lets you create a buffer (e.g. a pixel-unpack buffer to stage texture data) of the appropriate size, but with no initial data, and then map it using the "PERSISTENT" flag.
This maps the buffer's GPU-side allocation into CPU address space so you can write/stream into it.
You should avoid the "COHERENT" flag, as this will reduce performance dramatically by forcing the GPU to snoop the CPU's caches when reading from the buffer :(
If not specifying COHERENT, the CPU-mapped virtual addresses will be marked as uncached, write-combined pages. This means the CPU will automatically bypass its own caches, as you'll only be writing to these addresses (never reading from them), and it will queue up your writes into a "write combining buffer" and do more efficient bulk-transfers through to the GPU (even if your code is only writing one byte at a time). The only catch is you have to call a GL fence function when you've finished writing/streaming, which will ensure the write-combine buffer is flushed out completely, and any GPU-side caches are invalidated if required.
Pretty awesome!
So one of the fastest loading algorithms on GL4 may be to create the resource first, which just does the resource allocation. Then use map-persistent to get a pointer to that allocation. Then stream data into that pointer, and unmap it and fence.

Okay, that makes sense. I'm familiar with persistent mapping as a concept and an optimization but I wasn't sure exactly how it worked, thanks.

I'll probably try to stay away from true multithreaded resource creation for now and instead use asynchronous transfers with persistent mapping (for buffers) and pixel buffer objects (for textures). Sounds like more hassle than it's worth if I'm using OpenGL.
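
For the texture side, what I have in mind is the classic unpack-buffer upload, roughly like this (sizes/formats are placeholders):

#include <cstring>   // memcpy
// plus your GL loader header (glad/glew/etc.) for the gl* calls below

void UploadTextureViaPBO(GLuint texture, const void* pixelData, GLsizeiptr pixelDataSize,
                         int width, int height)
{
    GLuint pbo = 0;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, pixelDataSize, nullptr, GL_STREAM_DRAW);

    void* dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, pixelDataSize,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    memcpy(dst, pixelData, pixelDataSize);   // or stream from disk straight into 'dst'
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    // The driver can now DMA from the buffer asynchronously; the last argument
    // is an offset into the bound PBO, not a client pointer.
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, (const void*)0);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    glDeleteBuffers(1, &pbo);   // deletion is deferred by GL until the transfer is done
}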

"So there you have it, ladies and gentlemen: the only API I’ve ever used that requires both elevated privileges and a dedicated user thread just to copy a block of structures from the kernel to the user." - Casey Muratori

boreal.aggydaggy.com

This topic is closed to new replies.
