Currently I keep track of what is bound to every resource/shader/render-target slot using arrays, and before binding a new resource I check if it's already in the array.
I used to use this method (my 'device' class would cache all states), but have ended up changing it quite a lot.
I group my state-changes into logical groups (e.g. a "material" group might set some cbuffers, some textures and a blend-mode). Each type of render-state is allocated a bit in a bit-field (states with multiple slots get multiple bits), e.g. when designing the bitfield:
StateName, NumBits
VertexStreams, 1
ShaderPrograms, 1
DepthStencil, 1
PsConstantBuffer, 14
PsTexture, 16
Each group of state-changes (aka state-group) can then have a mask indicating which states it's going to set.
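A minimal C++14 sketch of how such a bit layout could be derived from the table above. The helper names (`FirstBit`, `StateBit`, `MaterialMask`) and the choice of a 64-bit mask are assumptions made for illustration, not part of the original design; the table adds up to 33 bits, so a 32-bit mask would not be enough.

```cpp
#include <cassert>
#include <cstdint>

// 1 + 1 + 1 + 14 + 16 = 33 bits in total, so a 32-bit mask is too small.
using StateMask = std::uint64_t;

enum StateType
{
    VertexStreams, ShaderPrograms, DepthStencil, PsConstantBuffer, PsTexture,
    NumStateTypes
};

// Bits (slots) per state type, in the same order as the enum above.
constexpr int kNumBits[NumStateTypes] = { 1, 1, 1, 14, 16 };

// Index of the first bit owned by a state type: the sum of the widths before it.
constexpr int FirstBit(int type)
{
    int first = 0;
    for (int i = 0; i < type; ++i)
        first += kNumBits[i];
    return first;
}

constexpr int kNumStates = FirstBit(NumStateTypes);
static_assert(kNumStates <= 64, "state bits must fit in StateMask");

// Mask bit for one slot of a state type (hypothetical helper).
StateMask StateBit(StateType type, int slot = 0)
{
    assert(slot >= 0 && slot < kNumBits[type]);
    return StateMask(1) << (FirstBit(type) + slot);
}

// Mask for a hypothetical "material" state-group that sets
// PS constant buffer slot 0 and PS texture slots 0 and 1.
StateMask MaterialMask()
{
    return StateBit(PsConstantBuffer, 0) | StateBit(PsTexture, 0) | StateBit(PsTexture, 1);
}
```

The same `FirstBit(type) + slot` value can double as the index into a per-state array, so the mask and the array stay in step.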
Each "render-item" then contains a draw-call and a collection of state-groups. I run a sorting function over a collection of render-items to get an appropriate order (e.g. back-to-front for a transparent pass, or sort by expensive states for a regular pass, etc), but this sorting step is decoupled/optional.
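As a rough illustration of the decoupled sorting step, here is a C++ sketch with two interchangeable sort functions. The `RenderItem` fields, the `MakeStateKey` packing and the idea that the shader is the "expensive" state are all assumptions for the example; which states are actually expensive depends on the API and hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical render-item: only the fields the sorters need.
struct RenderItem
{
    float         viewDepth; // distance from the camera (assumed field)
    std::uint32_t stateKey;  // packed sort key, see MakeStateKey
    int           drawCall;  // stand-in for the real draw-call data
};

// Hypothetical key: shader id in the high 16 bits, texture id in the low 16,
// so items that share the (assumed) more expensive state end up adjacent.
std::uint32_t MakeStateKey(std::uint32_t shaderId, std::uint32_t textureId)
{
    assert(shaderId < 0x10000u && textureId < 0x10000u);
    return (shaderId << 16) | textureId;
}

// Transparent pass: farthest items first.
void SortBackToFront(std::vector<RenderItem>& items)
{
    std::stable_sort(items.begin(), items.end(),
        [](const RenderItem& a, const RenderItem& b) { return a.viewDepth > b.viewDepth; });
}

// Regular pass: group items by their state key to reduce state changes.
void SortByState(std::vector<RenderItem>& items)
{
    std::stable_sort(items.begin(), items.end(),
        [](const RenderItem& a, const RenderItem& b) { return a.stateKey < b.stateKey; });
}
```

Because submission does not depend on the order, either function (or none) can be applied to the collection before it is submitted.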
When submitting a collection of render-items, I do initialize a 'state cache' to track which states I've set as I go (e.g. State* stateCache[numStates] = {};), but it's just a local variable, not a persistent cache. I also have another array that contains 'default' values for all states, which are used if a render-item's state-group doesn't contain a value for a specific state (this is provided as a 'default' state-group for the current pass). This completely changes the abstraction of the device being a state-machine (i.e. if you don't change a state, it's got whatever value the last user left it in) and instead makes all states explicit and deterministic (i.e. it doesn't matter who used the device before you), which I think is an important feature for a rendering API.
When iterating through each render-item in the collection, I first have to iterate through each of that item's state-groups. As each state-group is processed, its bitmask is ORed into a running mask of the states already set for this render-item, and any state already present in that mask is ignored -- this allows a render-item to contain 'layers' of state-groups containing the same state, and the 'top' instance of that state's value will be used while the lower ones are ignored. Any state within a state-group that passes this test then undergoes the regular redundancy test and is passed to the device / written to a command buffer.
After iterating through the render-item's state-groups, I find any states that weren't set and aren't in their default state, again using the bitmasks:
for(...each state in each state-group of the current render-item...)
{
    if( (statesSet & state.bit)==state.bit ) //'layering' test. Earlier state-groups take precedence.
        continue;
    statesSet |= state.bit; //mark as set even if the value turns out to be redundant below
    if( stateCache[state.idx]!=&state ) //regular cache test, don't set redundant states.
    {
        stateCache[state.idx] = &state;
        Submit(state.cmd);
    }
}
needsReset = dirtyStates & ~statesSet;
dirtyStates = statesSet;//for the next render-item
Any bits that come up in the needsReset mask then have their states set back to the default values, and then the render-item's draw-call is submitted.
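The whole submission pass described above could be sketched in C++ roughly as follows. This is an illustration under stated assumptions, not the original implementation: the `State`, `StateGroup`, `RenderItem` and `CommandList` types are hypothetical, commands are plain integers, only 4 states are used to keep it small, and every state is treated as dirty at the start so that the first render-item also gets defaults for anything it does not set.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using StateMask = std::uint64_t;
constexpr int kNumStates = 4; // kept tiny for the example

struct State       { int idx; int cmd; };                       // idx selects the bit and cache slot
struct StateGroup  { std::vector<const State*> states; };
struct RenderItem  { std::vector<const StateGroup*> groups; int drawCall; };
struct CommandList { std::vector<int> cmds; void Submit(int c) { cmds.push_back(c); } };

void SubmitItems(const std::vector<RenderItem>& items,
                 const std::vector<const State*>& defaults, // one non-null default per state
                 CommandList& out)
{
    assert(defaults.size() == static_cast<std::size_t>(kNumStates));

    const State* stateCache[kNumStates] = {};  // local to this pass, not persistent
    // Assumption: start with everything dirty, so unset states get defaults.
    StateMask dirtyStates = (StateMask(1) << kNumStates) - 1;

    for (const RenderItem& item : items)
    {
        StateMask statesSet = 0;
        for (const StateGroup* group : item.groups)     // earlier groups are the 'top' layers
        {
            for (const State* state : group->states)
            {
                const StateMask bit = StateMask(1) << state->idx;
                if (statesSet & bit)
                    continue;                           // layering test
                statesSet |= bit;
                if (stateCache[state->idx] != state)    // redundancy test
                {
                    stateCache[state->idx] = state;
                    out.Submit(state->cmd);
                }
            }
        }

        // States left dirty by earlier items but not set by this one go back to default.
        const StateMask needsReset = dirtyStates & ~statesSet;
        for (int i = 0; i < kNumStates; ++i)
        {
            if (((needsReset >> i) & 1) && stateCache[i] != defaults[i])
            {
                stateCache[i] = defaults[i];
                out.Submit(defaults[i]->cmd);
            }
        }
        dirtyStates = statesSet;                        // for the next render-item

        out.Submit(item.drawCall);
    }
}
```

Because the cache and the dirty mask live only for the duration of `SubmitItems`, the resulting command stream depends only on the items and the defaults passed in, not on whatever was submitted before.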
most expensive GPU operation is binding a texture to a slot.
Is that the CPU cost of issuing the command, or the GPU impact on render times? This probably differs between APIs, GPU models, driver versions, specific applications...
Edited by Hodgman, 04 September 2012 - 11:02 AM.