[quote]Currently I keep track of what is bound to every resource/shader/render target slot using arrays and before binding a new resource I check if it's already in the array[/quote]
I used to use this method (my 'device' class would cache all states), but I've ended up changing it quite a lot.
I group my state-changes into logical groups (e.g. a "material" group might set some cbuffers, some textures and a blend-mode). Each type of render-state is allocated a bit in a bit-field (states with multiple slots get multiple bits), e.g. when designing the bit-field:
[font=courier new,courier,monospace]StateName        NumBits
VertexStreams    1
ShaderPrograms   1
DepthStencil     1
PsConstantBuffer 14
PsTexture        16[/font]
Each group of state-changes (aka state-group) can then have a mask indicating which states it's going to set.
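As a rough illustration of the bit-field and state-group idea, here's a minimal C++ sketch. All names, bit positions, and the `State` layout are illustrative assumptions, not the actual engine's types:

```cpp
#include <cstdint>
#include <vector>

// Illustrative bit assignments matching the table above: single-slot
// states get one bit, multi-slot states get one bit per slot.
enum : uint64_t {
    Bit_VertexStreams  = 1ull << 0,
    Bit_ShaderPrograms = 1ull << 1,
    Bit_DepthStencil   = 1ull << 2,
    Bit_PsCBuffer0     = 1ull << 3,  // first of 14 cbuffer-slot bits
    Bit_PsTexture0     = 1ull << 17, // first of 16 texture-slot bits
};

struct State {
    uint64_t bit; // which bit this state occupies in the bit-field
    int      idx; // index into the per-pass state cache
    // ...plus the actual device command / data to submit
};

// A state-group: the states it sets, plus a mask that is the OR of
// every contained state's bit, so "does this group touch state X?"
// becomes a single AND.
struct StateGroup {
    uint64_t           mask = 0;
    std::vector<State> states;
    void Add(const State& s) { mask |= s.bit; states.push_back(s); }
};
```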
Each "render-item" then contains a draw-call and a collection of state-groups. I run a sorting function over a collection of render-items to get an appropriate order (e.g. back-to-front for a transparent pass, or sorted by expensive states for a regular pass, etc), but this sorting step is decoupled/optional.
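The render-item plus decoupled sort might look something like this; `DrawCall` and the sort-key packing are placeholder assumptions, only the structure matters:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct StateGroup; // opaque here: a bundle of state-changes plus a mask
struct DrawCall { int vertexCount = 0; /* topology, offsets, etc. */ };

struct RenderItem {
    uint64_t                 sortKey; // per-pass key: depth, state ids, ...
    std::vector<StateGroup*> groups;  // index 0 = highest-priority layer
    DrawCall                 draw;
};

// The same items can be sorted back-to-front for a transparent pass or
// by expensive states for an opaque pass -- only the key changes.
void SortItems(std::vector<RenderItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const RenderItem& a, const RenderItem& b)
              { return a.sortKey < b.sortKey; });
}
```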
When submitting a collection of render-items, I do initialize a 'state cache' to track which states I've set as I go (e.g. [font=courier new,courier,monospace]State* stateCache[numStates] = {};[/font]), but it's just a local variable, not a persistent cache. I also have another array that contains 'default' values for all states, which are used if a render-item's state-groups don't contain a value for a specific state (this is provided as a 'default' state-group for the current pass). This completely changes the abstraction: the device is no longer a state-machine (i.e. if you don't change a state, it keeps whatever value the last user left it in); instead, all states are explicit and deterministic (i.e. it doesn't matter who used the device before you), which I think is an important feature for a rendering API.
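A deliberately tiny model of that 'explicit defaults' idea: every state slot resolves either to the item's value or to the pass default, so the result never depends on whoever used the device previously. The names and the fixed slot count here are illustrative only:

```cpp
#include <cstddef>

constexpr std::size_t kNumStates = 4; // tiny for illustration

struct StateValue { int value; };

// For each slot: the item's value wins if it set one (non-null),
// otherwise the pass's default is used -- no state leaks across items.
void ResolveStates(const StateValue* const itemStates[kNumStates],
                   const StateValue  defaults[kNumStates],
                   StateValue        out[kNumStates])
{
    for (std::size_t i = 0; i < kNumStates; ++i)
        out[i] = itemStates[i] ? *itemStates[i] : defaults[i];
}
```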
When iterating through each render-item in the collection, I first iterate through each of that item's state-groups. As each state-group is processed, its bitmask is ORed into a running mask, and any state already present in that mask is ignored -- this allows a render-item to contain 'layers' of state-groups containing the same state: the 'top' instance of that state's value is used while the lower ones are ignored. Any state within a state-group that passes this test then undergoes the regular redundancy test and is passed to the device / written to a command buffer.
After iterating through the render-item's state-groups, I find any states that weren't set and aren't in their default state, again using the bitmasks:
statesSet = 0; //reset at the start of each render-item
for(...each state in each state-group of the current render-item...)
{
    if( (statesSet & state.bit) != state.bit ) //'layering' test: earlier state-groups take precedence.
    {
        statesSet |= state.bit; //claim the state even if it's redundant, so lower layers can't override it
        if( stateCache[state.idx] != &state ) //regular cache test: don't set redundant states.
        {
            stateCache[state.idx] = &state;
            Submit(state.cmd);
        }
    }
}
needsReset = dirtyStates & ~statesSet; //previously-set states that this item didn't touch
dirtyStates = statesSet; //for the next render-item
Any bits set in the needsReset mask then have their states restored to the default values, after which the render-item's draw-call is submitted.
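The reset step boils down to walking the set bits of needsReset to find which state-cache entries must be put back to their pass-default values before the draw-call. A sketch, assuming a helper that extracts bit indices (the function name is hypothetical, and multi-slot states would need a bit-to-slot mapping rather than using the bit index directly):

```cpp
#include <cstdint>
#include <vector>

// Returns the index of every set bit in needsReset; each index names a
// state whose default command should be (re)submitted for this item.
std::vector<int> StatesToReset(uint64_t needsReset)
{
    std::vector<int> indices;
    for (int i = 0; needsReset != 0; ++i, needsReset >>= 1)
        if (needsReset & 1ull)
            indices.push_back(i);
    return indices;
}
```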
[quote]most expensive GPU operation is binding a texture to a slot.[/quote]
Is that the CPU cost of issuing the command, or the GPU impact on render times? This probably differs between APIs, GPU models, driver versions, specific applications...