Combining deferred rendering, batching, model matrices, skeletal animations, and shadow maps

9 comments, last by kanageddaamen 6 years, 1 month ago

Hello all,

I am currently working on a game engine for my game development, and I would like it to be as flexible as possible.  As such, the exact requirements for how things should work can't be nailed down to a specific implementation, and for now I am looking for a good default design for the average case.

Here is what I have implemented:

  • Deferred rendering using OpenGL
  • Arbitrary number of lights and shadow mapping
  • Each rendered object, as defined by a set of geometry, textures, animation data, and a model matrix, is rendered with its own draw call
  • Skeletal animation implemented on the GPU
  • Model matrix transformation implemented on the GPU
  • Frustum and octree culling for optimization

Here are my questions and concerns:

  • Doing the skeletal animation on the GPU, currently, requires doing the skinning for each object multiple times per frame: once for the initial geometry rendering and once for the shadow map rendering for each light for which it is not culled.  This seems very inefficient.  Is there a way to do skeletal animation on the GPU only once across these render calls?
  • Without doing the model matrix transformation on the CPU, I fail to see how I can easily batch objects with the same textures and shaders in a single draw call without passing a ton of matrix data to the GPU (an array of model matrices then an index for each vertex into that array for transformation purposes?)
  • If I do the matrix transformations on the CPU, it seems I can't really do the skinning on the GPU, as the pre-transformed vertices will wreak havoc with the calculations, so this seems not viable unless I am missing something

Overall it seems like the simplest solution is to just do all of the vertex manipulation on the CPU and pass the pre-transformed data to the GPU, using vertex shaders that do basically nothing.  This doesn't seem like the most efficient use of the graphics hardware, but it could potentially reduce the number of draw calls needed.

Really, I am looking for some advice on how to proceed with this and how something like this is typically handled.  Are the multiple draw calls and skinning calculations not a huge deal?  I would LIKE to save as much of the CPU's time per frame as possible so it can be tasked with other things, keeping CPU resources open for the rest of the engine.  However, that becomes a moot point if the GPU becomes the bottleneck.

Quote
  • Doing the skeletal animation on the GPU, currently, requires doing the skinning for each object multiple times per frame: once for the initial geometry rendering and once for the shadow map rendering for each light for which it is not culled.  This seems very inefficient.  Is there a way to do skeletal animation on the GPU only once across these render calls?

If you really want to reuse the results, you could store the resulting skinned vertices in an SSBO (or a buffer texture, or something similar) on your first pass, indexed by vertex index, and fetch them on your later passes. However, I get the feeling that the memory writes and reads will be slower than a few matrix multiplications, and that's not to mention you would need one of these buffers per instance of your animated mesh.
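For what it's worth, the host side of that idea is just a plain shader storage buffer per skinned instance; the first pass would write its skinned result at the vertex's index and the shadow passes would fetch it instead of re-skinning. A minimal sketch, with made-up sizes and binding points (not something from this thread):

#include <GL/glew.h>   // assumes a loader such as GLEW is already initialized
#include <cstddef>

// One buffer per skinned instance, large enough for a skinned position
// and normal per vertex (2 x vec4). Binding point 3 is arbitrary here.
GLuint createSkinnedOutputBuffer(std::size_t vertexCount)
{
    GLuint ssbo = 0;
    glGenBuffers(1, &ssbo);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER,
                 vertexCount * 2 * 4 * sizeof(float),
                 nullptr, GL_DYNAMIC_COPY);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
    return ssbo;
}

// Bind before the geometry pass (the vertex shader writes at its vertex index)
// and again before each shadow pass (the vertex shader reads instead of skinning).
void bindSkinnedOutput(GLuint ssbo)
{
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, ssbo);
}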

This approach also introduces a dependency between shadow map passes and your general pipeline. If you don't do this, both shaders can be executing at the same time.

Typically, the majority of objects in a scene are not undergoing skeletal animation. For your general use case, I wouldn't worry about recalculating animations. Vertices are processed pretty fast.

Quote
  • Without doing the model matrix transformation on the CPU, I fail to see how I can easily batch objects with the same textures and shaders in a single draw call without passing a ton of matrix data to the GPU

Don't worry about it. PCIe x16 transfers at around 4 GB/s even on older generations, which is roughly 67 MB per frame at a 60 FPS target. A matrix is 64 bytes, and you're passing bone transforms. If we go ham and say you have 500 bones per model (YEESH!), that's about 32 KB per skeleton, so you could still pass around 1,000 full skeletons per frame and have roughly half of your PCIe bandwidth left over.

I'm also a bit confused here. If you do the model transforms on the cpu, you have to pass not only a bunch of transformed verts, but now you have to pass every instance of a transformed mesh as a -separate mesh-, meaning you can't do instancing for that mesh now.

Quote

(an array of model matrices then an index for each vertex into that array for transformation purposes?)

Yes.
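For what it's worth, one common flavour of that is to make the model matrix a per-instance attribute, so the "index" is implicit via the instance rather than stored per vertex. A rough sketch (attribute locations and the glm usage are just assumptions for illustration):

#include <GL/glew.h>     // assumes a loader such as GLEW is already initialized
#include <glm/glm.hpp>   // assumes glm::mat4 (16 contiguous floats)
#include <vector>

// Upload one model matrix per instance and expose it as attributes 4..7
// (a mat4 occupies four consecutive vec4 attribute slots).
void setupInstanceMatrices(GLuint vao, const std::vector<glm::mat4>& models)
{
    GLuint instanceVbo = 0;
    glGenBuffers(1, &instanceVbo);
    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glBufferData(GL_ARRAY_BUFFER, models.size() * sizeof(glm::mat4),
                 models.data(), GL_DYNAMIC_DRAW);

    for (GLuint i = 0; i < 4; ++i)
    {
        const GLuint loc = 4 + i;
        glEnableVertexAttribArray(loc);
        glVertexAttribPointer(loc, 4, GL_FLOAT, GL_FALSE, sizeof(glm::mat4),
                              (void*)(sizeof(float) * 4 * i));
        glVertexAttribDivisor(loc, 1);   // advance once per instance, not per vertex
    }
    glBindVertexArray(0);
}

// All instances sharing geometry/textures/shader then go out in one call:
// glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT,
//                         nullptr, (GLsizei)models.size());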

Edit: Usually you have some float weights and integer bone indices (corresponding to the weights) per vertex. You can store these as vertex attributes (a vec4 of weights plus an ivec4 of indices), or put them in a buffer and look them up by a vertex-index attribute.
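The attribute route might look something like this sketch (the vertex struct, locations, and bone count are invented for illustration; note that integer indices need glVertexAttribIPointer, not glVertexAttribPointer):

#include <GL/glew.h>   // assumes a loader such as GLEW is already initialized
#include <cstddef>
#include <cstdint>

// Hypothetical per-vertex layout: position + 4 weights + 4 bone indices.
struct SkinnedVertex
{
    float   position[3];
    float   weights[4];
    uint8_t bones[4];
};

void setupSkinningAttributes()
{
    const GLsizei stride = sizeof(SkinnedVertex);

    glEnableVertexAttribArray(0);   // position
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride,
                          (void*)offsetof(SkinnedVertex, position));

    glEnableVertexAttribArray(1);   // weights -> vec4 in the shader
    glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, stride,
                          (void*)offsetof(SkinnedVertex, weights));

    glEnableVertexAttribArray(2);   // bone indices -> ivec4 in the shader
    glVertexAttribIPointer(2, 4, GL_UNSIGNED_BYTE, stride,
                           (void*)offsetof(SkinnedVertex, bones));
}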

I personally have used the second approach, to keep my mesh format consistent and to make attaching arbitrary vertex data less of a horror for future development, obviously at the cost of a bit of performance.

17 hours ago, Ugly said:

a bit confused here. If you do the model transforms on the cpu, you have to pass not only a bunch of transformed verts, but now you have to pass every instance of a transformed mesh as a -separate mesh-, meaning you can't do instancing for that mesh now.

Indeed, another reason I would not like to go that route.

Thanks for all the insights, they will be an immense help moving forward.  I am currently passing in the bones as a uniform (actually as dual quaternions which I then convert to matrices in the shader) and the weights and indices as vert attributes for my skeletal animations, but I will look into trying it indexed to see if it fits my needs.

One related question just cropped up as I was doing some more reading.  Something I came across mentioned not to reuse buffers for write calls (e.g. a single fixed-size VAO reused for batches) because implicit synchronization kills performance, though some of what I have seen on batching does just that.

How would you perform view-frustum culling of objects each frame if modifying the data in buffers can be a performance killer?  I can't imagine you would want to submit/maintain a bunch of data to/on the GPU that isn't needed for rendering.

I think I figured part of this out myself.  Updating buffers between frames shouldn't be an issue, since all of those draw calls will need to be completed for the frame anyway.

Should you just not reuse a VAO for batching: if you need more space than a batch can handle, create a new VAO?  It seems like the number of buffers could grow significantly, though, if you are sending a lot of data.

5 minutes ago, kanageddaamen said:

if you need more space than a batch can handle, create a new VAO?

I'm assuming you mean VBO, rather than VAO?


24 minutes ago, swiftcoder said:

I'm assuming you mean VBO, rather than VAO?

Wouldn't you need to create an entire new VAO, otherwise the other VBOs in the VAO you are rendering will be passed by the draw call, thereby increasing the batch size which you are trying to keep constant?  I must admit I am no expert on the various draw call options and their capabilities.

EDIT: I suppose you would just bind different VBOs and make some glVertexAttribPointer calls for the next batch call.

10 minutes ago, kanageddaamen said:

Wouldn't you need to create an entire new VAO, otherwise the other VBOs in the VAO you are rendering will be passed by the draw call, thereby increasing the batch size which you are trying to keep constant? 

VAOs are purely client-side state. They work exactly the same as making the individual glBindBuffer/glEnableVertexAttribArray/glVertexAttribPointer calls yourself.

As such they don't affect batching at all. You still have one batch per glDraw* call, regardless of how you bound the vertex buffers.
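In other words, a VAO just records the vertex specification so you can replay it with a single bind; a tiny sketch (locations and formats invented for illustration):

#include <GL/glew.h>   // assumes a loader such as GLEW is already initialized

// Record the vertex specification once...
GLuint makeVao(GLuint vbo, GLuint ibo)
{
    GLuint vao = 0;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);   // index buffer binding is VAO state

    glBindVertexArray(0);
    return vao;
}

// ...and each draw is still its own batch, VAO or not:
// glBindVertexArray(vao);
// glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, nullptr);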


Just now, swiftcoder said:

VAOs are purely client-side state. They work exactly the same as making the individual glBindBuffer/glEnableVertexAttribArray/glVertexAttribPointer calls yourself.

As such they don't affect batching at all. You still have one batch per glDraw* call, regardless of how you bound the vertex buffers.

Gotcha

In my engine I am doing skinning in compute shaders before rendering starts. This is very nice from a shader-management point of view, because I have a single skinning shader, and every model can use a regular vertex shader and a regular input layout when rendering, so the number of vertex shader permutations is minimized. From a performance point of view it is a trickier question and may not always have the same answer. For example, I spawned a little conversation on Twitter one day regarding the performance implications on tile-based architectures, and I wrote a small blog post on the subject as well; take a look if interested. :)
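Not turanszkij's actual code, but the host side of that idea might look roughly like this: each animated mesh is skinned into a plain vertex buffer by a compute dispatch, and that buffer is then rendered by ordinary vertex shaders in every pass. Binding points, the workgroup size, and the program/buffer names are all assumptions for the sketch:

#include <GL/glew.h>   // assumes a loader such as GLEW is already initialized

// Pre-skin one mesh instance before any render pass. 'skinProgram' is assumed
// to read rest-pose vertices + weights (binding 0) and bone matrices (binding 1),
// and to write skinned vertices (binding 2).
void dispatchSkinning(GLuint skinProgram, GLuint restPoseSsbo,
                      GLuint boneMatrixSsbo, GLuint skinnedVbo,
                      GLuint vertexCount)
{
    glUseProgram(skinProgram);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, restPoseSsbo);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, boneMatrixSsbo);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, skinnedVbo);  // also used as a VBO later

    const GLuint groupSize = 64;   // must match local_size_x in the compute shader
    glDispatchCompute((vertexCount + groupSize - 1) / groupSize, 1, 1);
}

// After all skinning dispatches, make the writes visible to vertex fetch
// before the geometry and shadow passes:
// glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);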

 

Actually, rather than spinning up a new VBO for each batch with the same state in a frame whenever one gets filled, would the following be a better approach:

Given a batch size of N MB:
   Allocate a single VBO of N MB
   For each chunk of up to N MB of data with the same state:
       Fill the VBO with the chunk using glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT
       Make the draw call

From what I have read, this should safely mitigate implicit synchronization (the invalidate flag lets the driver orphan the old storage) while still allowing a single VBO handle to be used.
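For what it's worth, that loop might look roughly like this in GL calls (the BatchChunk type and sizes are invented for the sketch, and it assumes the currently bound VAO already points its attributes at batchVbo):

#include <GL/glew.h>   // assumes a loader such as GLEW is already initialized
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical chunk of pre-built vertex data sharing one shader/texture state.
struct BatchChunk
{
    const void* data;
    GLsizeiptr  sizeBytes;     // <= the VBO's allocated size
    GLsizei     vertexCount;
};

void drawBatches(GLuint batchVbo, GLsizeiptr vboSizeBytes,
                 const std::vector<BatchChunk>& chunks)
{
    glBindBuffer(GL_ARRAY_BUFFER, batchVbo);
    for (const BatchChunk& chunk : chunks)
    {
        assert(chunk.sizeBytes <= vboSizeBytes);
        // INVALIDATE_BUFFER lets the driver orphan the old storage, so this
        // write does not have to wait for earlier draws that use the buffer.
        void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, chunk.sizeBytes,
                                     GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
        std::memcpy(dst, chunk.data, chunk.sizeBytes);
        glUnmapBuffer(GL_ARRAY_BUFFER);

        glDrawArrays(GL_TRIANGLES, 0, chunk.vertexCount);
    }
}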

