Klutzershy

Low-level platform-agnostic video subsystem design: Resource management


Something I've been stuck on for a while is the part of the low-level video subsystem that deals with creating/deleting resources and handling memory transfers between the CPU and GPU.  There are lots of resources covering submission, batching, and the drawing itself, but rarely anything about how those buffers, textures, etc. got there in the first place, or how the API for resource management should be designed.  Note that I'm not talking about loading and caching resource files from disk.  This is concerned with the low-level aspects of memory and resource management across processors.

 

One idea I had was a layered API loosely based on D3D11, where you separate the memory itself (buffer, texture) from how it's going to be used (view).  This way, for example, you can allocate a texture and use it as both a render target and a sampler target.  Of course, this also brings up the issue of making sure access is "exclusive", i.e. you can't sample from it and render to it at the same time.  Another issue I'm interested in is making the API thread-safe.  While this is relatively trivial to do with D3D11 thanks to the Device/Context separation, with OpenGL it's more complicated since you have to deal with context sharing and concurrency.
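To make the idea concrete, here is a minimal sketch of what such a resource/view split could look like.  All of the type and function names below are hypothetical, invented for illustration rather than taken from any existing engine or API:

```cpp
// Hypothetical sketch of a layered resource/view API in the spirit of D3D11:
// the texture owns the memory, and views describe how it may be bound.
#include <cstdint>

struct TextureDesc { uint32_t width = 0, height = 0, mipLevels = 1; /* format, usage flags... */ };

struct TextureId { uint32_t value = 0; };   // the allocation itself
struct RtvId     { uint32_t value = 0; };   // render-target view
struct SrvId     { uint32_t value = 0; };   // shader-resource (sampling) view

class Device {
    uint32_t nextId_ = 1;                   // stand-in for real allocation/bookkeeping
public:
    TextureId CreateTexture(const TextureDesc&)                   { return { nextId_++ }; }
    RtvId     CreateRenderTargetView(TextureId, uint32_t mip = 0) { return { nextId_++ }; }
    SrvId     CreateShaderResourceView(TextureId)                 { return { nextId_++ }; }
};

int main() {
    Device dev;
    TextureId tex = dev.CreateTexture({1280, 720, 1});
    RtvId rtv = dev.CreateRenderTargetView(tex);   // render into the allocation...
    SrvId srv = dev.CreateShaderResourceView(tex); // ...and sample from it in a later pass,
    (void)rtv; (void)srv;                          // but never both bound at the same time.
}
```

With this shape, the "exclusive access" rule becomes a property of the binding layer rather than of the resources themselves.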

 

Basically I'm just looking for some advice, ideas, open-source code to look at, or anything else that can help me see how this stuff has been taken care of before.


Hi.

Sound is now done on the CPU with threading, using a lock-free queue. You should look into how sound is done; it may solve the video playback side as well.

Key terms: lock-free queue, or producer/consumer models. You sacrifice some memory but could gain speed.


One idea I had was a layered API loosely based on D3D11,

I've largely cloned D3D11, with the device/context split.
My main difference is that I use integer IDs to refer to all resources, don't do any reference counting at this level, and treat cbuffers (aka UBOs) as their own distinct kind of resource rather than as a generic buffer resource, as there are a lot of special-case ways to deal with cbuffers on different APIs.
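As an interface sketch only (hypothetical names, just to show the shape of the idea), strongly typed integer handles make that distinction explicit and carry no reference-counting semantics:

```cpp
// Illustrative only: typed integer handles, with cbuffers as their own resource kind.
#include <cstddef>
#include <cstdint>

struct VertexBufferId { uint32_t value; };
struct IndexBufferId  { uint32_t value; };
struct CBufferId      { uint32_t value; };   // constant buffer / UBO, deliberately not a generic buffer

struct GpuDevice {
    VertexBufferId CreateVertexBuffer(const void* data, size_t bytes);
    CBufferId      CreateCBuffer(size_t bytes);                       // free to pick per-API fast paths
    void           UpdateCBuffer(CBufferId id, const void* data, size_t bytes);
    void           Destroy(VertexBufferId id);                        // explicit lifetime, no refcounts
    void           Destroy(CBufferId id);
};
```

Passing the wrong kind of handle becomes a compile error, and IDs can be handed out without touching the underlying graphics API at all.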

For fast texture streaming from disc, I use a file format tweaked for each platform, with two segments - a header and the pixel data.
On platforms with "GPU malloc" (where memory allocation and resource creation are not linked), I have the streaming system do a regular (CPU) malloc for the header, and a GPU-malloc (write-combine, uncached, non-coherent) for the contents, and then pass the two pointers into a "device.CreateTexture" call, which has very little work to do, seeing that all the data has already been streamed into the right place.

I treat other platforms the same way, except the "GPU malloc" is just a regular malloc, which temporarily holds the pixel data until D3D/GL copies it into an immutable resource, at which point it's freed.
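A rough sketch of that streaming path, with GpuMalloc, TextureHeader, and CreateTexture standing in as placeholders for the real platform-specific calls (not an actual API):

```cpp
// Hypothetical: header into ordinary CPU memory, pixel data streamed straight into its
// final (GPU-visible or staging) allocation, then a cheap CreateTexture ties them together.
#include <cstdio>
#include <cstdlib>

struct TextureHeader { int width, height, mipLevels, format, pixelDataBytes; };
struct TextureId     { unsigned value; };

// On platforms with "GPU malloc" this would return write-combined, uncached, GPU-visible
// memory; elsewhere it can simply forward to malloc and act as temporary staging memory.
void* GpuMalloc(size_t bytes) { return std::malloc(bytes); }

// Placeholder for the thin device call described above.
TextureId CreateTexture(const TextureHeader* header, const void* pixels);

TextureId StreamTextureFromFile(std::FILE* file)
{
    TextureHeader* header = static_cast<TextureHeader*>(std::malloc(sizeof(TextureHeader)));
    std::fread(header, sizeof(TextureHeader), 1, file);

    void* pixels = GpuMalloc(header->pixelDataBytes);
    std::fread(pixels, 1, header->pixelDataBytes, file);   // data lands directly where it's needed

    // Very little left to do: the allocation and its contents already exist.
    return CreateTexture(header, pixels);
}
```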

...separate the memory itself (buffer, texture) from how it's going to be used (view).  This way, for example, you can allocate a texture and use it as both a render target and a sampler target.  Of course, this also brings up the issue of making sure access is "exclusive", i.e. you can't sample from it and render to it at the same time.

I treat texture resources, shader-resource-views, render-target-views, and depth-stencil-views as the same thing: a "texture". At creation time I'll make the resource and all the applicable views, and bundle them into the one structure.
Later if another view is needed - e.g. to render to a specific mip-level, or to alias the same memory allocation as a different format - then I support these alternate views by passing the original TextureId into a CreateTextureView function, which returns a new TextureId (which contains the same resource pointer internally, but new view pointers).
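In outline, the bundle might look like this (made-up names, with void* standing in for whatever the native API hands back - ID3D11 pointers, GL names, etc.):

```cpp
// Illustrative bundle: one id refers to the resource plus its default views; extra views
// alias the same resource pointer under a new id.
#include <cstdint>
#include <vector>

struct TextureId { uint32_t value = 0; };

struct Texture {
    void* resource = nullptr;   // the actual allocation, shared between views
    void* srv      = nullptr;   // shader-resource view, if the usage flags allow it
    void* rtv      = nullptr;   // render-target view, if applicable
    void* dsv      = nullptr;   // depth-stencil view, if applicable
};

class GpuDevice {
    std::vector<Texture> textures_;
public:
    TextureId CreateTexture(/* const TextureDesc& desc */);                          // resource + all applicable views
    TextureId CreateTextureView(TextureId original /*, mip/format override */);      // same resource, new views
};
```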

To ensure there are no data hazards (the same texture bound as both render target and shader resource), when binding a texture to a shader slot I loop through the currently bound render targets and assert that it isn't bound as one. This (costly) checking is only done in development builds - in shipping builds, it's assumed the code is correct, and this validation code is disabled. Lots of other usage errors are treated the same way - e.g. checking if a draw-call will read past the end of a vertex buffer...
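A minimal sketch of that development-only check (hypothetical names; in a shipping build the macro compiles to nothing):

```cpp
#include <cassert>

#ifdef ENABLE_GPU_VALIDATION
  #define GPU_VALIDATE(expr) assert(expr)
#else
  #define GPU_VALIDATE(expr) ((void)0)
#endif

struct TextureId { unsigned value; };

struct ContextState {
    static const int kMaxRenderTargets = 8;
    TextureId boundRenderTargets[kMaxRenderTargets] = {};
    int numBoundRenderTargets = 0;

    void BindShaderTexture(int slot, TextureId tex) {
        // Data-hazard check: the texture must not also be bound as a render target.
        for (int i = 0; i < numBoundRenderTargets; ++i)
            GPU_VALIDATE(boundRenderTargets[i].value != tex.value);
        (void)slot;  // ...then bind it through the native API as usual.
    }
};
```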

Another issue I'm interested in is making the API thread-safe.  While this is relatively trivial to do with D3D11 thanks to the Device/Context separation, with OpenGL it's more complicated since you have to deal with context sharing and concurrency.

As above, I copy D3D11. I have a boolean property in a GpuCapabilities struct that tells high level code if multithreaded resource creation is going to be fast or will incur a performance penalty.

As you mention, modern drivers are pretty ok with using multiple GL contexts and doing shared resource creation. On some modern GPUs, this is even recommended, as it triggers the driver's magic fast-path of transferring the data to the GPU "for free" via the GPU's underutilised and API-less asynchronous DMA controller!
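For reference, the second-context pattern looks roughly like this on Windows/WGL - a sketch only, with error handling omitted; hdc and mainContext are assumed to already exist, and core-profile setups would typically use wglCreateContextAttribsARB instead:

```cpp
#include <windows.h>
#include <GL/gl.h>

// Create a context for the loader thread that shares object names with the main context.
HGLRC CreateLoaderContext(HDC hdc, HGLRC mainContext)
{
    HGLRC loaderContext = wglCreateContext(hdc);
    wglShareLists(mainContext, loaderContext);   // textures/buffers created on either are visible to both
    return loaderContext;
}

// On the loader thread:
//   wglMakeCurrent(hdc, loaderContext);
//   ...glGenTextures / glTexImage2D / glBufferData as usual...
//   glFlush();   // make sure the upload commands are actually submitted
// The resulting GL names are then usable from the main context; on drivers with an
// asynchronous copy queue, these uploads can overlap with rendering on the main thread.
```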

As a fall-back for single-threaded APIs, you can check the thread ID inside your resource creation functions, and if it's not the "main thread" ID, generate a resource ID in a thread-safe manner and push the function parameters into a queue. Later, on the main thread, you can pop the function parameters from the queue, actually create the resource, and link it up to the ID you returned earlier. This obviously won't give you any performance boost (perhaps the opposite, if you have to do an extra malloc and memcpy in the queuing process...), but it does let you use the same multithreaded resource loading code even on old D3D9 builds.
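A sketch of that fallback, using a plain mutex-protected queue to keep the example simple (names are hypothetical; a lock-free queue could be dropped in instead):

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct TextureId { uint32_t value; };

struct CreateTextureRequest {
    TextureId id;
    uint32_t width, height;
    std::vector<uint8_t> pixels;   // the extra copy is the cost of this fallback path
};

class DeferredResourceQueue {
    std::atomic<uint32_t> nextId_{1};
    std::mutex mutex_;
    std::queue<CreateTextureRequest> pending_;
    std::thread::id mainThread_ = std::this_thread::get_id();
public:
    // Callable from any thread: the id is valid immediately, even though the native
    // resource may not exist until the next Flush() on the main thread.
    TextureId CreateTexture(uint32_t w, uint32_t h, std::vector<uint8_t> pixels) {
        TextureId id{ nextId_.fetch_add(1) };
        if (std::this_thread::get_id() == mainThread_) {
            // CreateNativeTexture(id, w, h, pixels.data());   // direct path (placeholder)
        } else {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push({ id, w, h, std::move(pixels) });
        }
        return id;
    }

    // Called once per frame on the main thread.
    void Flush() {
        std::lock_guard<std::mutex> lock(mutex_);
        while (!pending_.empty()) {
            CreateTextureRequest req = std::move(pending_.front());
            pending_.pop();
            // CreateNativeTexture(req.id, req.width, req.height, req.pixels.data());
            (void)req;
        }
    }
};
```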


Hi.

Sound is now done on the CPU with threading, using a lock-free queue. You should look into how sound is done; it may solve the video playback side as well.

Key terms: lock-free queue, or producer/consumer models. You sacrifice some memory but could gain speed.

 

Thanks for the reply!  Apologies, I didn't mean video playback, but rather low-level generic rendering.

 

 

 

On platforms with "GPU malloc" (where memory allocation and resource creation are not linked), I have the streaming system do a regular (CPU) malloc for the header, and a GPU-malloc (write-combine, uncached, non-coherent) for the contents, and then pass the two pointers into a "device.CreateTexture" call, which has very little work to do, seeing that all the data has already been streamed into the right place.

I treat other platforms the same way, except the "GPU malloc" is just a regular malloc, which temporarily holds the pixel data until D3D/GL copies it into an immutable resource, at which point it's freed.

 

Interesting - I'm unfamiliar with the concept.  Am I correct to assume it's strictly on consoles with unified memory (Xbox One, PS4)?

 

 

 

To ensure there are no data hazards (the same texture bound as both render target and shader resource), when binding a texture to a shader slot I loop through the currently bound render targets and assert that it isn't bound as one. This (costly) checking is only done in development builds - in shipping builds, it's assumed the code is correct, and this validation code is disabled. Lots of other usage errors are treated the same way - e.g. checking if a draw-call will read past the end of a vertex buffer...

 

This is a good point.  I think I may be trying to over-complicate things in the interest of "free" safety by leveraging the compiler.

 

 

 

As above, I copy D3D11. I have a boolean property in a GpuCapabilities struct that tells high level code if multithreaded resource creation is going to be fast or will incur a performance penalty.

 

How do you check for this?  Is it a simple divide like PC vs. consoles or can it also relate to graphics card/driver version?

 

 

 

As you mention, modern drivers are pretty ok with using multiple GL contexts and doing shared resource creation. On some modern GPUs, this is even recommended, as it triggers the driver's magic fast-path of transferring the data to the GPU "for free" via the GPU's underutilised and API-less asynchronous DMA controller!

 

Neat, and all you have to do is kick off a DMA on a concurrent context?




GL4 has already added partial support for this, though! You can't completely do your own resource management (there's no "GPU malloc"), but you can map GPU RAM into the CPU's address space.
The GL_ARB_buffer_storage extension lets you create a buffer of the appropriate size, with no initial data, and then map it using the "PERSISTENT" flag.
This maps the buffer's GPU-side allocation into CPU address space so you can write/stream into it (for textures, the same trick is applied via a persistently mapped pixel unpack buffer).
You should avoid the "COHERENT" flag, as it will reduce performance dramatically by forcing the GPU to snoop the CPU's caches when it reads the data.
If you don't specify COHERENT, the CPU-mapped virtual addresses will be marked as uncached, write-combined pages. This means the CPU automatically bypasses its own caches (you'll only ever write to these addresses, never read from them) and queues your writes in a write-combining buffer, performing more efficient bulk transfers through to the GPU even if your code only writes one byte at a time. The only catch is that you have to call a GL fence function when you've finished writing/streaming, which ensures the write-combine buffer is flushed completely and any GPU-side caches are invalidated if required.
Pretty awesome!
So one of the fastest loading algorithms on GL4 may be to create the resource first, which just does the resource allocation. Then use map-persistent to get a pointer to that allocation. Then stream data into that pointer, and unmap it and fence.

 

Okay, that makes sense.  I'm familiar with persistent mapping as a concept and an optimization but I wasn't sure exactly how it worked, thanks.

 

I'll probably try to stay away from true multithreaded resource creation for now and instead use asynchronous transfers with persistent mapping (for buffers) and pixel buffer objects (for textures).  Multithreaded creation sounds like more hassle than it's worth if I'm using OpenGL.
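For concreteness, here is a rough sketch of that persistent-mapping + pixel-buffer-object upload path on GL 4.4.  It assumes a context and function loader are already set up, and loadPixelsInto() is a placeholder for the actual disc-streaming code:

```cpp
// Sketch: stream pixel data through a persistently mapped (non-coherent) unpack buffer,
// then copy into an immutable texture and fence. Error handling omitted.
#include <glad/glad.h>   // or whichever GL loader the project uses (assumption)

void loadPixelsInto(void* dst, GLsizeiptr bytes);   // placeholder: your streaming code

GLuint StreamTextureGL4(GLsizei width, GLsizei height, GLsizeiptr byteSize)
{
    // Allocate the GPU-side storage up front: immutable texture + mappable staging buffer.
    GLuint tex = 0, pbo = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, width, height);

    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferStorage(GL_PIXEL_UNPACK_BUFFER, byteSize, nullptr,
                    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT);   // no COHERENT, as discussed

    // Map once; the write-combined pointer stays valid while streaming (write-only!).
    void* dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, byteSize,
                                 GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                                 GL_MAP_FLUSH_EXPLICIT_BIT);
    loadPixelsInto(dst, byteSize);
    glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, byteSize);

    // Kick the copy from the unpack buffer into the texture, then fence so we know
    // when the staging memory can be reused or unmapped.
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, (const void*)0);   // 0 = offset into bound PBO
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 5000000000ull);  // ~5 s timeout
    glDeleteSync(fence);

    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    return tex;
}
```

In a real streamer the map/fill/fence steps would run asynchronously rather than blocking in one call, but the ordering (create, persistently map, stream, flush, copy, fence) is the same.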
