Jump to content

  • Log In with Google      Sign In   
  • Create Account

Chronicles of the Hieroglyph

Pipeline State Monitoring - Results

Posted by , 29 March 2013 - - - - - - · 794 views

Last time I discussed how I am currently using pipeline state monitoring to minimize the number of API calls that are submitted to the Direct3D runtime/driver. Some of you were wondering if this is a worthwhile thing to try out, and were interested in some empirical test results to see what the difference is. Hieroglyph 3 has quite a number of different rendering techniques available to it in the various sample programs, so I went to work trying out a few different samples - both with and without pipeline state monitoring.

The results were a little bit surprising to me. It turns out that for all of the samples which are GPU limited, there is a statistically insignificant difference between the two - so regardless of the number of API calls, the GPU was the one slowing things down. This makes sense, and should be another point of evidence that trying to optimize before you need to is not a good thing.

However, for the CPU limited samples there is a different story to tell. In particular, the MirrorMirror sample stands out. For those of you who aren't familiar, the sample was designed to highlight the multi-threading capabilities of D3D11 by performing many simple rendering tasks. This is accomplished by building a scene with lots of boxes floating around three reflective spheres in the middle. The spheres perform dual paraboloid environment mapping, which effectively amplifies the amount of geometry to be rendered since they have to generate their paraboloid maps every frame. Here is a screenshot of the sample to illustrate the concept:

Attached Image

This exercises the API call mechanism quite a bit, since the GPU isn't really that busy and there are many draw calls to perform (each box is drawn separately instead of using instance for this reason). It had shown a nice performance delta between single and multi-threaded rendering, but it also serves as a nice example for the pipeline state monitoring too. The results really speak for themselves. The chart below shows two traces of the frame time for running the identical scene both with and without the state monitoring being used to prevent unneeded API calls. Here, less is more since it means it takes less time to render each frame.

Attached Image

As you can see, the frame time is significantly lower for the trace using the state monitoring. So to interpret these results, we have to think about what is being done here. The sample is specifically designed to be an example of heavy CPU usage relative to the GPU usage. You can consider this the "CPU-Extreme" side of the spectrum. On the other hand, GPU bound samples show no difference in frame time - so we can call this the "GPU-Extreme" side of the spectrum.

Most rendering situations will probably fall somewhere in between these two situations. So if you are very GPU heavy, this probably doesn't make too much difference. However, once the next generation of GPUs come out, you can easily have a totally different situation and become CPU bound. I think Yogi Beara once said - "it isn't a problem until its a problem."

So overall, in my opinion it is worthwhile to spend the time and implement a state monitoring system. This also has other benefits, such as the fact that you will have a system that makes it easy to log all of your actual API calls vs. requested ones - which may become a handy thing if your favorite graphics API monitoring tools ever become unavailable... (yes, you know what I mean!). So get to it, get a copy of the Hieroglyph code base, grab the pipeline state monitoring classes and hack them up into your engine's format!

Pipeline State Monitoring

Posted by , 07 March 2013 - - - - - - · 1,508 views
D3D11, rendering
My last couple of commits to Hieroglyph 3 addressed a performance issue that most likely all graphics programmers that have made it beyond the basics have grappled with: Pipeline State Monitoring. This is the system used to ensure that your engine only submits the API calls that are really necessary in your rendered frame. This cuts down on any API calls that don't effectively add any value to your rendering workload, but still costs some time to execute them anyway. This is a problem that I have worked through a number of times, I am quite fond of the latest solution that I have arrived at. So let's talk about state monitoring!

More Precise Problem Statement
Before we dive into the solution that I am using, I would like to more clearly identify what the problem is that I am trying to solve. We will consider the process of one rendering pass, since any additional, more complicated rendering schemes can be broken down into multiple rendering passes. I define a rendering pass as all of the operations performed between setting your render targets (or more generally your Output Merger states) and the next time you set your render targets (i.e. the start of the next rendering pass). These could include a sequence like the following:
  • Set render target view / depth stencil view
  • Clear render targets
  • Configure the pipeline for rendering
  • Configure the input assembler for receiving pipeline input
  • Call a pipeline execution API (such as Draw, DrawIndexed, etc...)
  • Repeat steps 3-5 for each object that has to be rendered
After a rendering pass has been completed, the contents of the render targets have been filled by rendering all the objects in a scene. If this is the main view of a scene, you would probably do a 'present' call to copy the results into a window for display to the user. Steps 3 and 4 are each composed of a number of calls to the ID3D11DeviceContext interface that are used to set some states. Each of these API calls takes some time to perform - some more than others, but they are all consuming at least some time. Since we are involving the user application code, the D3D11 runtime, and the GPU driver, some of these calls are really time consuming.

This is a fairly straight-forward definition, but the devil (as usual) is in the details. Since step 6 has you repeating the previous three steps, you are most likely going to be repeating some of the same calls in subsequent pipeline (step 3) and input assembler (step 4) configuration steps. However, the pipeline state is an actual state machine. That means the states that you set are not reset or modified in any way when you execute the pipeline.

So our task is to take advantage of this state retention to minimize the amount of time spent in steps 3 & 4. If there are consecutive states which are setting the same values in the pipeline, we need to efficiently detect that fact and prevent our renderer from carrying out any additional calls that don't actually update the pipeline state in a useful way. Sounds easy enough, but it isn't really that easy to have a general solution to the problem.

The God Solution
One could make the argument that your application code should be able to do this on its own. A rendering pass should be analyzed before you get to calling API functions and by design we would only make the minimum API calls that are needed to push out a frame. Technically this is true, but in practice I don't think it is realistic if your engine is going to support a wide variety of different lighting models, object types, and special rendering passes.

In Hieroglyph 3, each object in the scene carries their desired state information around with them, so that would be the state granularity level - an object. Other approaches would be to collect all similar objects and render them together (I guess that would be 'group' granularity). Whatever granularity you choose to write your rendering code in, it will probably not be at the scene level, where no rendering code resides at lower levels than the scene. For this reason, I discount the 'God' solution - it isn't practical to say that we will only submit perfect sequences of API calls. It isn't possible to know the exact order that every object will be rendered in for all situations in your scene, so this isn't going to work...

The Naive Solution
Another approach is to use the device context's various 'Get' methods to check the value of states before setting them. This might save some small amount of time if you save an expensive state change API from being called, but then again you are losing time for every 'Get' call that you make without preventing an un-necessary call... This one is too variable on the states being used, and can actually end up costing more time than not trying to reduce API calls at all!

State Monitoring Solution
At this point, we can assume that we are going to have to keep some 'model' of the pipeline state in system memory in order to know what the current values of the pipeline are without using API calls to query its state. In Hieroglyph 3, I created a class for each pipeline stage that represents its 'state'. For the programmable shader stages, this includes the following state information:
  • Shader program
  • Array of constant buffers
  • Array of shader resource views
  • Array of samplers
For the fixed function stages, they each have their own state class. By using an object to represent state, we can then create a class to represent a pipeline stage that holds one copy of its state. We'll refer to that state as the 'CurrentState'. At this point, we could say that any time we try to make an API call that differs from the that currently held in the CurrentState, then we would execute the call and update CurrentState.

This is indeed an improvement, but it can actually be implemented a bit more efficiently. The issue here is that for any of the states that contain arrays, we could potentially call the 'Set' API multiple times when all of the values could be set in a single API call. In fact, from our list of steps above, we can actually collect all of the desired state changes right up until we are about to perform a draw call. If we do this, then we can optimize the number of calls down to a minimum. If we add a second state object to each pipeline stage, which we will call the 'DesiredState', then we have an easy way to collect these desired state changes.

However, the addition of a second state object means that for each draw call, we would have to compare the CurrentState and DesiredState objects. Some of these states are pretty large (with hundreds of elements in an array), so doing a full comparison before each draw call can become quite expensive, and would probably eclipse any gains from minimizing state changes...

You may have already guessed the solution - we link the two state objects and add some 'dirty' flags. Whenever a state is changed in the DesiredState object, it compares only that state with the CurrentState object. If they differ, then the flag is set. If they don't differ, we can potentially update the dirty flag to indicate that an update is no longer needed (saving an API call). Especially when working with the arrays of states, the logic for this update can be a little tricky - but it is possible. With this scheme, we can set only the needed state changes right before our draw call, effectively minimizing the number of API calls with a minimal amount of CPU work. We even have a nice object design to make it easy to manage the state monitoring.

State Monitoring in Hieroglyph 3
That was a long way of arriving at my latest solution. Up to this point, I was implementing individual state arrays and monitoring each one uniquely in each of the pipeline state classes. However, this is very error prone, since there are many similar states, but not all are exactly the same, and you end up with repeated code all over the place. So I turned to my favorite solution of late - templates. I created two templates: TStateMonitor<T> and TStateArrayMonitor<T>. These allow me to encapsulate the state monitoring for single values and arrays into the templates, and then my pipeline stage state objects only need to declare an appropriate template instantiation and link the Current and Desired states together. The application code can interact directly with these template classes (of only the desired state, of course) and you only need to tell the pipeline stage when you are about to make a draw call and that it needs to flush its state changes.

In addition, since they are template classes, you can always use them regardless of what representation of individual states are used. If your engine works directly with raw pointers of API objects, that is fine. If it works with integer references to objects, that's fine too. The templates make the design more nimble and able to adapt to future changes. I have to say, I am really happy with how the whole thing turned out...

So if you have made it this far, I would be interested to hear if you use something similar, or have any comments on the design. Thanks for reading!