Temporal coherence and render queue sorting

Started by
18 comments, last by johnchapman 9 years, 6 months ago

I've seen this mentioned a lot in the past (and again more recently by L. Spiro in this thread) where you take advantage of temporal coherence between frames so that you don't have to perform a full re-sort (I'm not sure what the best terminology for this is) on the queue. The idea being that between frames, objects will be more or less in the same order as such a small time has passed.

But how is this achieved? If you use your already sorted list, how do you know which (renderable) objects in the list should be removed (e.g. because they were culled)?

My first idea would be something like this:

  1. Iterate through sorted list, check if culled. If not, keep it. If it is, remove it.
  2. Traverse through spatial tree (or however your scene is partitioned) and add objects which need rendering - have to check list for duplicates though!
  3. Then we can sort the queue.

But step 2 doesn't seem very efficient if you have to iterate the list a load of times to check if the object already exists.

As I write this, I realise you could add two extra fields to each of the objects to say in which frame it was last rendered, and when in which frame it was last removed. Then checking for existence in the list can be a simple lastRenderedFrame == currentFrame - 1 && lastRemovedFrame < currentFrame. You'd also need to update the lastRenderFrame field when you add it to the queue (step 2), and update the lastRemovedFrame when you removed from the queue (step 1).

Am I on the right sort of lines?

Advertisement
Your order of operations are wrong. Render-queues have nothing to do with culling (as in you don’t cull out of a render-queue—a render-queue can’t be filled until culling has been performed).

An implementation begins with:
  • A spatial partition of the scene in which the actual renderable objects can be found.
  • Empty render-queues. One for each perspective from which you will render (1 for rendering from the camera, 1 for each shadow map, etc.) For the main camera render, you actually have 2—1 for opaque and 1 for translucent. Sorting 2 smaller queues is faster than 1 big one. For now, we will assume only 1 render from the camera.
    • Render-queues store items in a linear array, such as std::vector. The vector never deallocates memory, so repopulating it is virtually free.
  • 1 frustum for each renderable viewpoint. For now, that means just the player’s view camera.
These are members of the scene manager. The render-queues are cleared every frame and repopulated.


And some terms:
  • Model: Is an actor. Contains all the meshes that create the entire model/object. It has 1 bounding AABB and sphere which encompass all of the smaller meshes inside it.
  • Mesh: Is an actor. Multiple of these make up a model. Contains 1 bounding AABB and sphere which encompass all the vertices in the mesh. A mesh is not drawn directly. Instead it holds an array of sub-meshes (if a single mesh has 2 materials on it, then it has 2 sub-meshes, one for each render needed to draw the whole mesh).
  • Sub-mesh: For rendering 1 material on 1 mesh. A mesh has any number of these depending on how many renders it takes to draw the whole mesh. It contains no bounding boxes.
A sub-mesh is the smallest “unit” of a draw call, and thus we need to sort by sub-meshes. As noted above, a mesh can have some faces using different materials, textures, shaders, etc. These materials, shaders, textures, etc. are on the sub-meshes. That is why the mesh itself is not a renderable unit.
The purpose of sorting is to reduce swaps between shaders, textures, vertex buffers, etc. Sub-meshes have the data needed for a sort, and also all the data for a single draw call.


Basic implementation for a single camera:
  • Clear the render-queue.
  • Cull objects through the spatial partitioning system (or even just brute-force if you are just testing things) using the camera’s frustum. You can populate the render-queue directly as you do this.
    • If a model is fully inside the frustum, all meshes in the model are immediately added to the render-queue without checking their bounding boxes against the frustum.
    • If a model is not inside the frustum at all, it is skipped.
    • If a model is partially inside the frustum, meshes are each checked against the frustum before being added to the render-queue.
      • Important. Remember that I said sub-meshes get added to the render-queue. When a mesh “adds itself” to a render-queue, it actually adds all of its sub-meshes to the render-queue. Sub-meshes do not have bounding boxes. When a mesh is determined to be inside the frustum, all sub-meshes are automatically added.
      • The keys for sorting are created by the information on the sub-meshes. Shader ID, texture ID, etc. Only the depth comes from the mesh, which means all sub-meshes on a mesh have the same depth as far as the sort will know.
      • Sub-meshes go to either the opaque or translucent render-queue. Both are being populated as you traverse the spatial partition.
  • You have now populated a render-queue. For opaque render-queues, sort based on shader ID’s, texture ID’s, vertex buffers ID’s, and depth (front-to-back). For translucent renders, sort based on depth only (back-to-front).
  • Render each sub-mesh in the order specified by the render-queue.
This is the basic high-level overview of any typical render-queue system. Temporal coherence is a simple extension of this and fully implemented inside the render-queue itself.


As mentioned, the render-queues persist between frames because the scene manager keeps them around as members.
Additionally, when a render-queue sorts its contents, it doesn’t actually move any of the render-queue items. It moves indices to each item (this saves on copy bandwidth). Meaning it keeps an array of render-queue items (data generated by the sub-meshes, containing shader ID’s etc.), and also keeps an array of indices. At first, the indices are 0, 1, 2, 3, 4, 5…N. Instead of moving the render-queue items, you move the indices, so the index array may then become 5, 1, 4, 3, 2, 0…N after the sort. This is the order in which you draw all the sub-meshes.


With all of that in place, you can easily add temporal coherence.
When the render-queue is cleared, it simply does “m_ui32Total = 0;” or “m_vList.clear();” (if you use std::vector). No deallocation and no resetting of the array of indices.

By keeping the array of indices for the next frame, you are likely to have the indices in already-sorted order.
Although going through the spacial partition and culling may seem complicated, they are deterministic. They always produce the same resulting list of sub-meshes in the same order each frame if neither the meshes nor camera moves.
So if you end up with the same array of sub-meshes and the indices are already 5, 1, 4, 3, 2, 0…N, then the render-queue is already sorted and you have achieved temporal coherence.

As a result, insertion sort works best for a render-queue that takes advantage of temporal coherence this way.



There are only 3 cases to handle when preparing for a sort:
  • If the new total render-queue items is larger than it was before, add the extra indices to your index array in order.
    • If last frame we had 6 objects in the queue and now we have 8, add indices 6 and 7 to the end of the index array and leave the rest of the indices alone. So your sorted index array might look like 5, 1, 4, 3, 2, 0, 6, 7.
  • If the new total is the same as the last total, do nothing. Leave indices as they are.
  • If the new total is less than before, reset all the indices to 0, 1, 2, 3, 4, 5…
This code illustrates these 3 cases.
		/**
		 * Prepares to sort the given number of elements, but does not sort them.
		 *
		 * \param _ui32Total The number of elements to prepare to sort.
		 */
		LSVOID LSE_CALL		PrepareFor( LSUINT32 _ui32Total ) {
			// Do we need to grow the array of incides?
			if ( _ui32Total > m_ui32AllocIndices ) {
				// Up-size the allocated amount.
				delete [] m_pui32Indices;
				m_pui32Indices = new LSUINT32[_ui32Total];
				m_ui32AllocIndices = _ui32Total;
				m_ui32TotalIndices = 0UL;	// Causes list to be fully reset to 0, 1, 2… below.
			}
			// Check for cases where we need to add or adjust indices.
			if ( m_ui32TotalIndices > _ui32Total ) {
				for ( LSUINT32 I = 0UL; I < _ui32Total; ++I ) {
					// Add indices that were not added before.
					m_pui32Indices[I] = I;
				}
			}
			else {
				for ( LSUINT32 I = m_ui32TotalIndices; I < _ui32Total; ++I ) {
					// Add indices that were not added before.
					m_pui32Indices[I] = I;
				}
			}
			m_ui32TotalIndices = _ui32Total;
		}
The second for-loop implicitly handles case #2 (if m_ui32TotalIndices and _ui32Total are the same the loop will not be entered).



Supporting frame-to-frame temporal coherence is a very simple extension of a standard render-queue, and benchmarking shows it makes a significant difference compared to sorting every frame, even if you use a quick sort (a radix sort is not possible, nor would it be faster anyway).
An insertion sort is fastest because it simply makes a linear pass over the array of indices and is done. It is also stable, so things don’t jitter inside the sorted array.


As you move through the world, things will get resorted only seldomly. If you move towards an object, all of the depth values in the object will change, but as they change the same way they won’t cause re-sorting of any of the data from the previous frame.
In general cases, you are taking advantage of a pre-sorted list over at least 90% of the frames in your game.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

How do you deal with shadowcasters outside the cameras frustrum?

Repeat the process from the light’s “camera” (view and frustum). Store results in a render-queue assigned to that light.

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

How do you deal with sorting of transparent polygons that belong to the same sub-mesh?

Also, why do you have to reset the sort order to 0, 1, 2, 3, ... N when the new render queue is shorter? Wouldn't it be faster to just remove the no-longer present sub-meshes and leave the existing sub-meshes sorted?

Thanks for the explanation L. Spiro! Makes more sense now.


Wouldn't it be faster to just remove the no-longer present sub-meshes and leave the existing sub-meshes sorted?

Try to imagine what steps need to be taken to remove existing indices or to build a new array without these indices. Now compare that to creating an array from 1, 2, ... N.

How do you deal with sorting of transparent polygons that belong to the same sub-mesh?

You don’t with this algorithm.
In practice that level of accuracy is never needed. It’s more important to draw triangles in the same order rather than in a specific order. As long as they don’t jitter the player won’t notice that the blending is slightly wrong just because they were drawn in the wrong order.

Also, why do you have to reset the sort order

See above (Mussi’s reply).
If you have 1,000 objects and then the next frame you have 1 object, the list is still sorted as long as the list is just index 0. Now, think about how to detect that. A generalized algorithm will only help in a few situations if you just happen to get lucky enough that all the removed indices are ones you didn’t need, and in every other case time spent trying to detect that is just time wasted.
Restarting the index list is a bit of a cost but it’s sorted again after only 1 frame, so it’s the best general case.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

How do you deal with sorting of transparent polygons that belong to the same sub-mesh?

Games usually just ignore this problem altogether!
Occasionally, I've seen people pre-sort their triangles for some small number of directions (e.g. 8) and generate 8 different index buffers for those orderings. When submitting the mesh, you compare the view direciton to those 8, and use the index buffer offset of the closest match.


Try to imagine what steps need to be taken to remove existing indices or to build a new array without these indices. Now compare that to creating an array from 1, 2, ... N.

You can do both in a single iteration over the array. To remove an index, you find it's position in the array by comparing each array value with the index, then in the same iteration, move all subsequent elements back by 1 for each removed element. So you just do array = array[i + number_of_removed_indexes_found_so_far] instead of array = i (number_of_removed_indexes_found_so_far will initially be 0).

Assuming that that the number of indexes removed each frame is small in comparison with the total number of indexes, this might be faster than re-ordering all indexes and breaking the coherence?

Even if you don't want to keep the remaining indexes sorted, array = i seems like a waste of CPU cycles for some reason. Wouldn't it be faster to just keep something like a "mustSortAll" flag - set it to TRUE when indexes must be removed, and the next frame, if mustSortAll is TRUE, you use the i = 0...N indexes (the iterator values) directly in the sort, instead of setting and then reading back the vector values (which are going to be the same as the iterator values anyway)? smile.png


If you have 1,000 objects and then the next frame you have 1 object, the list is still sorted as long as the list is just index 0. Now, think about how to detect that. A generalized algorithm will only help in a few situations if you just happen to get lucky enough that all the removed indices are ones you didn’t need, and in every other case time spent trying to detect that is just time wasted.
Restarting the index list is a bit of a cost but it’s sorted again after only 1 frame, so it’s the best general case.

Good point. Then how about a middle-ground algorithm: If the number of removed items is relatively small, you just keep the remaining indexes sorted. Otherwise, you re-sort everything next frame (using my flag idea above).

Now all we need is the formula for the "relatively small" threshold - it could probably be determined by testing, though I think it's going to be a logarithmic value. (I'm not good with big O calculations).

This topic is closed to new replies.

Advertisement