The results were a little bit surprising to me. It turns out that for all of the samples which are GPU limited, there is a statistically insignificant difference between the two - so regardless of the number of API calls, the GPU was the one slowing things down. This makes sense, and should be another point of evidence that trying to optimize before you need to is not a good thing.
However, for the CPU limited samples there is a different story to tell. In particular, the MirrorMirror sample stands out. For those of you who aren't familiar, the sample was designed to highlight the multi-threading capabilities of D3D11 by performing many simple rendering tasks. This is accomplished by building a scene with lots of boxes floating around three reflective spheres in the middle. The spheres perform dual paraboloid environment mapping, which effectively amplifies the amount of geometry to be rendered since they have to generate their paraboloid maps every frame. Here is a screenshot of the sample to illustrate the concept:
This exercises the API call mechanism quite a bit, since the GPU isn't really that busy and there are many draw calls to perform (each box is drawn separately instead of using instancing for this reason). It has already shown a nice performance delta between single and multi-threaded rendering, and it serves as a nice example for the pipeline state monitoring too. The results really speak for themselves. The chart below shows two traces of the frame time for running the identical scene both with and without the state monitoring being used to prevent unneeded API calls. Here, lower is better, since it means it takes less time to render each frame.
As you can see, the frame time is significantly lower for the trace using the state monitoring. So to interpret these results, we have to think about what is being done here. The sample is specifically designed to be an example of heavy CPU usage relative to the GPU usage. You can consider this the "CPU-Extreme" side of the spectrum. On the other hand, GPU bound samples show no difference in frame time - so we can call this the "GPU-Extreme" side of the spectrum.
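To make the idea concrete, here is a minimal sketch of what "state monitoring" boils down to: remember the last value submitted for a pipeline slot and only forward the call when the value actually changes. This is not the Hieroglyph implementation. `FakeContext`, `CachedState` and `MonitoredContext` are hypothetical names invented for illustration, and the fake context simply counts calls where a real engine would forward to the D3D11 device context.

```cpp
// Hypothetical stand-in for a device context. It only counts the calls that
// reach it; a real engine would forward to the graphics API here.
struct FakeContext {
    int apiCalls = 0;
    void SetShader(int /*shaderId*/) { ++apiCalls; }
};

// Caches the last value submitted for one pipeline slot (illustrative only).
template <typename T>
class CachedState {
public:
    // Returns true when the value differs from the cached one, or when
    // nothing has been cached yet, meaning the API call is really needed.
    bool Set(const T& value) {
        if (m_valid && m_current == value) {
            return false;               // redundant request, skip the API call
        }
        m_current = value;
        m_valid = true;
        return true;
    }

    // Forget the cached value, e.g. after code outside the monitor has
    // touched the pipeline and the cache can no longer be trusted.
    void Invalidate() { m_valid = false; }

private:
    T m_current{};
    bool m_valid = false;
};

// Wraps the context and filters redundant state changes for one slot.
class MonitoredContext {
public:
    explicit MonitoredContext(FakeContext& ctx) : m_ctx(ctx) {}

    void SetShader(int shaderId) {
        if (m_shader.Set(shaderId)) {
            m_ctx.SetShader(shaderId);  // only real changes reach the API
        }
    }

    void Invalidate() { m_shader.Invalidate(); }

private:
    FakeContext& m_ctx;
    CachedState<int> m_shader;
};
```

With this in place, requesting the same shader for a hundred boxes in a row results in a single call reaching `FakeContext`, which is exactly the kind of saving that matters when the CPU side is the bottleneck. A real monitor would need one cached slot per piece of pipeline state, and the comparison cost has to stay cheaper than the API call it avoids.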
Most rendering situations will probably fall somewhere in between these two extremes. So if you are very GPU heavy, this probably doesn't make too much difference. However, once the next generation of GPUs comes out, you can easily have a totally different situation and become CPU bound. I think Yogi Berra once said - "it isn't a problem until it's a problem."
So overall, in my opinion it is worthwhile to spend the time and implement a state monitoring system. This also has other benefits, such as the fact that you will have a system that makes it easy to log all of your actual API calls vs. requested ones - which may become a handy thing if your favorite graphics API monitoring tools ever become unavailable... (yes, you know what I mean!). So get to it, get a copy of the Hieroglyph code base, grab the pipeline state monitoring classes and hack them up into your engine's format!
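As a rough illustration of the logging benefit mentioned above, the sketch below tallies how many times each call was requested versus how many times it was actually issued. `CallLog` and `CallStats` are hypothetical names made up for this example, and the call names passed to `Record` are just string labels chosen by the caller. It assumes the monitor can report whether it forwarded each request.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Per-call tally: how often a call was asked for vs. how often it was sent.
struct CallStats {
    unsigned requested = 0;
    unsigned issued = 0;
};

// Hypothetical logger that a state monitor could feed on every request.
class CallLog {
public:
    // 'issued' is true when the monitor actually forwarded the call to the API.
    void Record(const std::string& name, bool issued) {
        CallStats& stats = m_stats[name];
        ++stats.requested;
        if (issued) {
            ++stats.issued;
        }
    }

    // Returns the tally for a call name, or zeros if it was never recorded.
    CallStats Get(const std::string& name) const {
        auto it = m_stats.find(name);
        return it != m_stats.end() ? it->second : CallStats{};
    }

    // Dumps one line per call name, in alphabetical order.
    void Print() const {
        for (const auto& entry : m_stats) {
            std::printf("%s: %u requested, %u issued\n",
                        entry.first.c_str(),
                        entry.second.requested,
                        entry.second.issued);
        }
    }

private:
    std::map<std::string, CallStats> m_stats;
};
```

The monitor would call something like `log.Record("VSSetShader", changed)` each time a state change is requested, and `Print()` at the end of a frame then gives a quick picture of how many calls were filtered out.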