Your order of operations are wrong. Render-queues have nothing to do with culling (as in you don’t cull out of a render-queue—a render-queue can’t be filled until culling has been performed).
An implementation begins with:
- A spatial partition of the scene in which the actual renderable objects can be found.
- Empty render-queues. One for each perspective from which you will render (1 for rendering from the camera, 1 for each shadow map, etc.) For the main camera render, you actually have 2—1 for opaque and 1 for translucent. Sorting 2 smaller queues is faster than 1 big one. For now, we will assume only 1 render from the camera.
- Render-queues store items in a linear array, such as std::vector. The vector never deallocates memory, so repopulating it is virtually free.
- 1 frustum for each renderable viewpoint. For now, that means just the player’s view camera.
These are members of the scene manager. The render-queues are cleared every frame and repopulated.
And some terms:
- Model: Is an actor. Contains all the meshes that create the entire model/object. It has 1 bounding AABB and sphere which encompass all of the smaller meshes inside it.
- Mesh: Is an actor. Multiple of these make up a model. Contains 1 bounding AABB and sphere which encompass all the vertices in the mesh. A mesh is not drawn directly. Instead it holds an array of sub-meshes (if a single mesh has 2 materials on it, then it has 2 sub-meshes, one for each render needed to draw the whole mesh).
- Sub-mesh: For rendering 1 material on 1 mesh. A mesh has any number of these depending on how many renders it takes to draw the whole mesh. It contains no bounding boxes.
A sub-mesh is the smallest “unit” of a draw call, and thus we need to sort by sub-meshes. As noted above, a mesh can have some faces using different materials, textures, shaders, etc. These materials, shaders, textures, etc. are on the sub-meshes. That is why the mesh itself is not a renderable unit.
The purpose of sorting is to reduce swaps between shaders, textures, vertex buffers, etc. Sub-meshes have the data needed for a sort, and also all the data for a single draw call.
Basic implementation for a single camera:
- Clear the render-queue.
- Cull objects through the spatial partitioning system (or even just brute-force if you are just testing things) using the camera’s frustum. You can populate the render-queue directly as you do this.
- If a model is fully inside the frustum, all meshes in the model are immediately added to the render-queue without checking their bounding boxes against the frustum.
- If a model is not inside the frustum at all, it is skipped.
- If a model is partially inside the frustum, meshes are each checked against the frustum before being added to the render-queue.
- Important. Remember that I said sub-meshes get added to the render-queue. When a mesh “adds itself” to a render-queue, it actually adds all of its sub-meshes to the render-queue. Sub-meshes do not have bounding boxes. When a mesh is determined to be inside the frustum, all sub-meshes are automatically added.
- The keys for sorting are created by the information on the sub-meshes. Shader ID, texture ID, etc. Only the depth comes from the mesh, which means all sub-meshes on a mesh have the same depth as far as the sort will know.
- Sub-meshes go to either the opaque or translucent render-queue. Both are being populated as you traverse the spatial partition.
- You have now populated a render-queue. For opaque render-queues, sort based on shader ID’s, texture ID’s, vertex buffers ID’s, and depth (front-to-back). For translucent renders, sort based on depth only (back-to-front).
- Render each sub-mesh in the order specified by the render-queue.
This is the basic high-level overview of any typical render-queue system. Temporal coherence is a simple extension of this and fully implemented inside the render-queue itself.
As mentioned, the render-queues persist between frames because the scene manager keeps them around as members.
Additionally, when a render-queue sorts its contents, it doesn’t actually move any of the render-queue items. It moves indices to each item (this saves on copy bandwidth). Meaning it keeps an array of render-queue items (data generated by the sub-meshes, containing shader ID’s etc.), and also keeps an array of indices. At first, the indices are 0, 1, 2, 3, 4, 5…N. Instead of moving the render-queue items, you move the indices, so the index array may then become 5, 1, 4, 3, 2, 0…N after the sort. This is the order in which you draw all the sub-meshes.
With all of that in place, you can easily add temporal coherence.
When the render-queue is cleared, it simply does “m_ui32Total = 0;” or “m_vList.clear();” (if you use std::vector). No deallocation and no resetting of the array of indices.
By keeping the array of indices for the next frame, you are likely to have the indices in already-sorted order.
Although going through the spacial partition and culling may seem complicated, they are deterministic. They always produce the same resulting list of sub-meshes in the same order each frame if neither the meshes nor camera moves.
So if you end up with the same array of sub-meshes and the indices are already 5, 1, 4, 3, 2, 0…N, then the render-queue is already sorted and you have achieved temporal coherence.
As a result, insertion sort works best for a render-queue that takes advantage of temporal coherence this way.
There are only 3 cases to handle when preparing for a sort:
- If the new total render-queue items is larger than it was before, add the extra indices to your index array in order.
- If last frame we had 6 objects in the queue and now we have 8, add indices 6 and 7 to the end of the index array and leave the rest of the indices alone. So your sorted index array might look like 5, 1, 4, 3, 2, 0, 6, 7.
- If the new total is the same as the last total, do nothing. Leave indices as they are.
- If the new total is less than before, reset all the indices to 0, 1, 2, 3, 4, 5…
This code illustrates these 3 cases.
/**
* Prepares to sort the given number of elements, but does not sort them.
*
* \param _ui32Total The number of elements to prepare to sort.
*/
LSVOID LSE_CALL PrepareFor( LSUINT32 _ui32Total ) {
// Do we need to grow the array of incides?
if ( _ui32Total > m_ui32AllocIndices ) {
// Up-size the allocated amount.
delete [] m_pui32Indices;
m_pui32Indices = new LSUINT32[_ui32Total];
m_ui32AllocIndices = _ui32Total;
m_ui32TotalIndices = 0UL; // Causes list to be fully reset to 0, 1, 2… below.
}
// Check for cases where we need to add or adjust indices.
if ( m_ui32TotalIndices > _ui32Total ) {
for ( LSUINT32 I = 0UL; I < _ui32Total; ++I ) {
// Add indices that were not added before.
m_pui32Indices[I] = I;
}
}
else {
for ( LSUINT32 I = m_ui32TotalIndices; I < _ui32Total; ++I ) {
// Add indices that were not added before.
m_pui32Indices[I] = I;
}
}
m_ui32TotalIndices = _ui32Total;
}
The second for-loop implicitly handles case #2 (if m_ui32TotalIndices and _ui32Total are the same the loop will not be entered).
Supporting frame-to-frame temporal coherence is a very simple extension of a standard render-queue, and benchmarking shows it makes a significant difference compared to sorting every frame, even if you use a quick sort (a radix sort is not possible, nor would it be faster anyway).
An insertion sort is fastest because it simply makes a linear pass over the array of indices and is done. It is also stable, so things don’t jitter inside the sorted array.
As you move through the world, things will get resorted only seldomly. If you move towards an object, all of the depth values in the object will change, but as they change the same way they won’t cause re-sorting of any of the data from the previous frame.
In general cases, you are taking advantage of a pre-sorted list over at least 90% of the frames in your game.
L. Spiro