Smart Pointers Aren't Always So Smart
Hieroglyph 3 D3D11
The title may seem more controversial than I really intend, but I think it fits with the situation that I have recently encountered. I was going through each of the sample applications in Hieroglyph 3 after applying a user supplied patch, and I noticed that the frame rate of the MirrorMirror sample had dropped off significantly. This particular sample is designed to make multi-threaded rendering in Direct3D 11 pay off by rendering three reflective spheres which are surrounded by some large number of simple objects around them. Here is a screenshot of the sample to give you a visual:
However, when I did my test I found that the frame rate in debug mode had dropped from somewhere around 70 to ~9. After going back through my source code tree, I found that before some recent changes to how I handle the input assembler stage's state the performance was as expected. This seemed really strange, since the new state management actually should have been more or less equivalent to the old method.
To further investigate, I stepped through the drawing operation with the debugger, and immediately found out the issue. I changed to using a standalone object to represent all of the input assembler state within the engine, including all of the available vertex buffer slots. To set up the situation, there are a maximum of D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT slots in the IA, which is currently defined as 32 available slots... I typically reference resources in the engine with a smart pointer to a proxy object. The proxy object contains the indices of a resource, plus any resource views that it would need. The proxy is used by applications as a very easy way to reference several pieces of data as one, and has overall worked out very well.
To get to the point, I was declaring an input assembler state object on the stack in each draw call, which initialized the state object to have null pointers for all of those 32 vertex buffer slots. Even though I was only using a single vertex buffer, I was still initializing all the other slots to null, which amounts to 32 assignments of the smart pointer. With a little math, we can see what was going on: 32 x 4 rendering passes x 200 = 25,600 references per frame.
In the end, I simply switched the stored state to use directly the index of the vertex buffer (since the input assembler doesn't use resource views, there is no need to use the proxy object anyways). Just a few short changes popped the speed right back up to where it should have been. So the moral of the story is this - smart pointers are only as smart as the person that is using them, and sometimes (especially in my case) they end up being not so smart
Anyways, this has opened my eyes to some state management issues with Hieroglyph 3, which I am now working on. My goal is to reduce the number of API calls to as few as possible, all the while properly managing the cached pipeline state with respect to multithreaded rendering... This will be the subject of my next post!