My last couple of commits to Hieroglyph 3
addressed a performance issue that most likely all graphics programmers that have made it beyond the basics have grappled with: Pipeline State Monitoring. This is the system used to ensure that your engine only submits the API calls that are really necessary in your rendered frame. This cuts down on any API calls that don't effectively add any value to your rendering workload, but still costs some time to execute them anyway. This is a problem that I have worked through a number of times, I am quite fond of the latest solution that I have arrived at. So let's talk about state monitoring!More Precise Problem Statement
Before we dive into the solution that I am using, I would like to more clearly identify what the problem is that I am trying to solve. We will consider the process of one rendering pass, since any additional, more complicated rendering schemes can be broken down into multiple rendering passes. I define a rendering pass as all of the operations performed between setting your render targets (or more generally your Output Merger states) and the next time you set your render targets (i.e. the start of the next rendering pass). These could include a sequence like the following:
- Set render target view / depth stencil view
- Clear render targets
- Configure the pipeline for rendering
- Configure the input assembler for receiving pipeline input
- Call a pipeline execution API (such as Draw, DrawIndexed, etc...)
- Repeat steps 3-5 for each object that has to be rendered
After a rendering pass has been completed, the contents of the render targets have been filled by rendering all the objects in a scene. If this is the main view of a scene, you would probably do a 'present' call to copy the results into a window for display to the user. Steps 3 and 4 are each composed of a number of calls to the ID3D11DeviceContext interface that are used to set some states. Each of these API calls takes some time to perform - some more than others, but they are all consuming at least some
time. Since we are involving the user application code, the D3D11 runtime, and the GPU driver, some of these calls are really time consuming.
This is a fairly straight-forward definition, but the devil (as usual) is in the details. Since step 6 has you repeating the previous three steps, you are most likely going to be repeating some of the same calls in subsequent pipeline (step 3) and input assembler (step 4) configuration steps. However, the pipeline state is an actual state machine. That means the states that you set are not reset or modified in any way when you execute the pipeline.
So our task is to take advantage of this state retention to minimize the amount of time spent in steps 3 & 4. If there are consecutive states which are setting the same values in the pipeline, we need to efficiently
detect that fact and prevent our renderer from carrying out any additional calls that don't actually update the pipeline state in a useful way. Sounds easy enough, but it isn't really that easy to have a general solution to the problem.The God Solution
One could make the argument that your application code should be able to do this on its own. A rendering pass should be analyzed before you get to calling API functions and by design we would only make the minimum API calls that are needed to push out a frame. Technically this is true, but in practice I don't think it is realistic if your engine is going to support a wide variety of different lighting models, object types, and special rendering passes.
In Hieroglyph 3, each object in the scene carries their desired state information around with them, so that would be the state granularity level - an object. Other approaches would be to collect all similar objects and render them together (I guess that would be 'group' granularity). Whatever granularity you choose to write your rendering code in, it will probably not be at the scene level, where no rendering code resides at lower levels than the scene. For this reason, I discount the 'God' solution - it isn't practical to say that we will only submit perfect sequences of API calls. It isn't possible to know the exact order that every object will be rendered in for all situations in your scene, so this isn't going to work...The Naive Solution
Another approach is to use the device context's various 'Get' methods to check the value of states before setting them. This might save some small amount of time if you save an expensive state change API from being called, but then again you are losing time for every 'Get' call that you make without preventing an un-necessary call... This one is too variable on the states being used, and can actually end up costing more time than not trying to reduce API calls at all!State Monitoring Solution
At this point, we can assume that we are going to have to keep some 'model' of the pipeline state in system memory in order to know what the current values of the pipeline are without using API calls to query its state. In Hieroglyph 3, I created a class for each pipeline stage that represents its 'state'. For the programmable shader stages, this includes the following state information:
- Shader program
- Array of constant buffers
- Array of shader resource views
- Array of samplers
For the fixed function stages, they each have their own state class. By using an object to represent state, we can then create a class to represent a pipeline stage
that holds one copy of its state
. We'll refer to that state as the 'CurrentState'. At this point, we could say that any time we try to make an API call that differs from the that currently held in the CurrentState, then we would execute the call and update CurrentState.
This is indeed an improvement, but it can actually be implemented a bit more efficiently. The issue here is that for any of the states that contain arrays, we could potentially call the 'Set' API multiple times when all of the values could be set in a single API call. In fact, from our list of steps above, we can actually collect all of the desired state changes right up until we are about to perform a draw call. If we do this, then we can optimize the number of calls down to a minimum. If we add a second state object to each pipeline stage, which we will call the 'DesiredState', then we have an easy way to collect these desired state changes.
However, the addition of a second state object means that for each draw call, we would have to compare the CurrentState and DesiredState objects. Some of these states are pretty large (with hundreds of elements in an array), so doing a full comparison before each draw call can become quite expensive, and would probably eclipse any gains from minimizing state changes...
You may have already guessed the solution - we link the two state objects and add some 'dirty' flags. Whenever a state is changed in the DesiredState object, it compares only that state with the CurrentState object. If they differ, then the flag is set. If they don't differ, we can potentially update the dirty flag to indicate that an update is no longer needed (saving an API call). Especially when working with the arrays of states, the logic for this update can be a little tricky - but it is possible. With this scheme, we can set only the needed state changes right before our draw call, effectively minimizing the number of API calls with a minimal amount of CPU work. We even have a nice object design to make it easy to manage the state monitoring.State Monitoring in Hieroglyph 3
That was a long way of arriving at my latest solution. Up to this point, I was implementing individual state arrays and monitoring each one uniquely in each of the pipeline state classes. However, this is very error prone, since there are many similar states, but not all are exactly the same, and you end up with repeated code all over the place. So I turned to my favorite solution of late - templates. I created two templates: TStateMonitor<T> and TStateArrayMonitor<T>. These allow me to encapsulate the state monitoring for single values and arrays into the templates, and then my pipeline stage state objects only need to declare an appropriate template instantiation and link the Current and Desired states together. The application code can interact directly with these template classes (of only the desired state, of course) and you only need to tell the pipeline stage when you are about to make a draw call and that it needs to flush its state changes.
In addition, since they are template classes, you can always use them regardless of what representation of individual states are used. If your engine works directly with raw pointers of API objects, that is fine. If it works with integer references to objects, that's fine too. The templates make the design more nimble and able to adapt to future changes. I have to say, I am really happy with how the whole thing turned out...
So if you have made it this far, I would be interested to hear if you use something similar, or have any comments on the design. Thanks for reading!